(biochemistry) The establishment of statistical correlations between the potencies of a series of structurally related compounds and one or more quantitative structural parameters, such as lipophilicity, polarity, and molecular size, by using multilinear regression analysis.
Quantitative structure–activity relationship models (QSAR models) are regression models used in the chemical and biological sciences and engineering. Like other regression models, QSAR models relate measurements on a set of "predictor" variables to the behavior of the response variable. In QSAR modeling, the predictors consist of properties of chemicals; the QSAR response variable is the biological activity of the chemicals. QSAR models first summarize a supposed relationship between chemical structures and biological activity in a data set of chemicals. Second, QSAR models predict the activities of new chemicals. Related terms include quantitative structure–property relationships (QSPR).
For example, biological activity can be expressed quantitatively as the concentration of a substance required to give a certain biological response. Additionally, when physicochemical properties or structures are expressed by numbers, one can form a mathematical relationship, or quantitative structure-activity relationship, between the two. The mathematical expression can then be used to predict the biological response of other chemical structures.
A QSAR has the form of a mathematical model:

Activity = f(physicochemical properties and/or structural properties) + error
The error includes model error (bias) and observational variability, that is, the variability in observations even on a correct model.
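As a sketch of how this model form is commonly written when multilinear regression is used, the activity of compound i can be expressed in terms of p numerical descriptors. The symbols below (y, x, β, ε, p) are notation assumed here for illustration and do not come from the source.

```latex
y_i = f(x_{i1}, \dots, x_{ip}) + \varepsilon_i,
\qquad
f(x_{i1}, \dots, x_{ip}) = \beta_0 + \sum_{j=1}^{p} \beta_j \, x_{ij}
```

Here y_i stands for the measured activity, x_ij for the j-th descriptor of compound i (for example lipophilicity or molecular size), the β_j for fitted coefficients, and ε_i for the error term described above.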
The basic assumption for all molecule-based hypotheses is that similar molecules have similar activities. This principle is also called Structure–Activity Relationship (SAR). The underlying problem is therefore how to define a small difference on a molecular level, since each kind of activity, e.g. reaction ability, biotransformation ability, solubility, target activity, and so on, might depend on a different kind of difference. A good example was given in the bioisosterism review of Patani and LaVoie.[1]
In general, one is more interested in finding strong trends. Hypotheses are usually derived from a finite amount of chemical data. The induction principle should therefore be respected, to avoid overfitted hypotheses and the useless interpretations of structural or molecular data that follow from them.
The SAR paradox refers to the fact that not all similar molecules have similar activities.
The structure (and hence the activity) of a molecule could be defined as the sum of its individual atoms, but it is better defined for QSAR purposes as the sum of its chemical fragments. Analogously, the "partition coefficient" (a measurement of differential solubility and itself a component of SAR predictions) can be predicted either by atomic methods (known as "XLogP" or "ALogP") or by chemical fragment methods (known as "CLogP" and other variations). It has been shown that the logP of a compound can be determined by the sum of its fragments; fragment-based methods are generally accepted as better predictors than atomic-based methods.[2] Fragmentary logP values have been determined statistically, based on empirical data for known logP values. This method gives mixed results and is generally not trusted to have an accuracy better than ±0.1 units.[3]
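The additive idea behind fragment methods can be sketched in a few lines. The fragment names, their contribution values, and the decomposition of 1-butanol below are all hypothetical placeholders chosen for illustration; they are not taken from CLogP or any published fragment table.

```python
# Hypothetical fragment contributions to log P (illustrative numbers only,
# NOT taken from CLogP or any published fragment table).
FRAGMENT_LOGP = {
    "CH3": 0.55,
    "CH2": 0.50,
    "OH": -1.10,
    "C6H5": 1.90,
}


def estimate_logp(fragment_counts, contributions=FRAGMENT_LOGP):
    """Sum fragment contributions, weighted by how often each fragment occurs."""
    return sum(contributions[name] * count
               for name, count in fragment_counts.items())


# 1-butanol written as CH3 + 3 x CH2 + OH (an assumed fragmentation).
butanol = {"CH3": 1, "CH2": 3, "OH": 1}
print(round(estimate_logp(butanol), 2))
```

Real fragment schemes also apply correction factors for interactions between neighbouring fragments, which this sketch leaves out.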
Group- or fragment-based QSAR is also known as GQSAR.[4] GQSAR allows flexibility to study various molecular fragments of interest in relation to the variation in biological response. The molecular fragments could be substituents at various substitution sites in a congeneric set of molecules, or could be defined on the basis of pre-defined chemical rules in the case of a non-congeneric set. GQSAR also considers cross-term fragment descriptors, which could be helpful in identifying key fragment interactions that determine variation of activity.[4] Lead discovery using Fragnomics is an emerging paradigm. In this context FB-QSAR proves to be a promising strategy for fragment library design and in fragment-to-lead identification endeavours.[5]
3D-QSAR refers to the application of force field calculations requiring three-dimensional structures, e.g. based on protein crystallography or molecule superimposition. It uses computed potentials, e.g. the Lennard-Jones potential, rather than experimental constants and is concerned with the overall molecule rather than a single substituent. It examines the steric fields (shape of the molecule), the hydrophobic regions (water-repelling surfaces),[6] and the electrostatic fields.[7]
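To illustrate what a computed steric field looks like, the following minimal sketch evaluates the standard Lennard-Jones 12-6 potential between a probe and each atom of a molecule at a few grid points. The atom coordinates, the grid, and the single shared pair of parameters `epsilon` and `sigma` are assumptions made for the example; a real force field assigns parameters per atom type.

```python
import math


def lennard_jones(r, epsilon, sigma):
    """Lennard-Jones 12-6 potential at probe-atom distance r."""
    ratio6 = (sigma / r) ** 6
    return 4.0 * epsilon * (ratio6 * ratio6 - ratio6)


def steric_field(atoms, grid_points, epsilon=0.1, sigma=3.4):
    """Sum the probe-atom potential over all atoms at each grid point.

    epsilon and sigma are placeholder values shared by every atom.
    """
    field = []
    for point in grid_points:
        total = 0.0
        for atom in atoms:
            total += lennard_jones(math.dist(point, atom), epsilon, sigma)
        field.append(total)
    return field


# Two hypothetical atoms and a short row of grid points above them.
atoms = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
grid = [(x, 4.0, 0.0) for x in (-2.0, 0.0, 2.0, 4.0)]
print(steric_field(atoms, grid))
```

Each grid value becomes one descriptor, which is why the resulting data space is large.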
The resulting data space is then usually reduced by a subsequent feature extraction step (see also dimensionality reduction). The learning method that follows can be any suitable machine learning method, e.g. support vector machines.[8] An alternative approach uses multiple-instance learning by encoding molecules as sets of data instances, each of which represents a possible molecular conformation. A label or response is assigned to each set corresponding to the activity of the molecule, which is assumed to be determined by at least one instance in the set (i.e. some conformation of the molecule).[9]
On June 18, 2011, the CoMFA patent dropped any restriction on the use of GRID and PLS technologies, and the RCMD team (www.rcmd.it) opened a 3D QSAR web server (www.3d-qsar.com).
The literature often shows that chemists prefer partial least squares (PLS) methods,[citation needed] since they apply feature extraction and induction in one step.
Computer SAR models typically calculate a relatively large number of features. Because these features are hard to interpret structurally, the preprocessing steps face a feature selection problem (i.e., which structural features should be interpreted to determine the structure-activity relationship). Feature selection can be accomplished by visual inspection (qualitative selection by a human), by data mining, or by molecule mining.
A typical data-mining-based prediction uses, e.g., support vector machines, decision trees, or neural networks to induce a predictive learning model.
Molecule mining approaches, a special case of structured data mining approaches, apply a similarity-matrix-based prediction or an automatic fragmentation scheme into molecular substructures. There are also approaches that use maximum common subgraph searches or graph kernels.[10][11]
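A similarity-based prediction can be sketched as follows, assuming each molecule has already been fragmented into a set of substructure labels. The Tanimoto coefficient is a common choice of similarity measure, used here as an assumption since the text does not name one; the fragment labels and activity values are made up for illustration.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two fragment sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def predict_by_nearest_neighbour(query, training):
    """Return (activity, similarity) of the most similar training molecule."""
    fragments, activity = max(training,
                              key=lambda item: tanimoto(query, item[0]))
    return activity, tanimoto(query, fragments)


# Hypothetical training set: (fragment set, measured activity).
training = [
    ({"phenyl", "hydroxyl"}, 5.2),
    ({"phenyl", "amine", "methyl"}, 6.1),
    ({"carboxyl", "methyl"}, 3.4),
]
query = {"phenyl", "amine"}
activity, similarity = predict_by_nearest_neighbour(query, training)
print(activity, round(similarity, 3))
```

Computing `tanimoto` for every pair of molecules gives the similarity matrix; the nearest-neighbour rule is only the simplest way to turn that matrix into a prediction.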
QSAR modeling produces predictive models derived from the application of statistical tools that correlate the biological activity (including desirable therapeutic effects and undesirable side effects) of chemicals (drugs, toxicants, environmental pollutants) with descriptors representative of molecular structure and/or properties. QSARs are being applied in many disciplines, for example risk assessment, toxicity prediction, and regulatory decisions,[12] in addition to drug discovery and lead optimization.[13] Obtaining a good-quality QSAR model depends on many factors, such as the quality of the biological data and the choice of descriptors and statistical methods. Any QSAR modeling should ultimately lead to statistically robust models capable of making accurate and reliable predictions of the biological activities of new compounds.
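For validation of QSAR models, four strategies are usually adopted:[14] internal validation or cross-validation; external validation by splitting the available data set into a training set for model development and a test set for checking predictivity; blind external validation by applying the model to new external data; and data randomization (Y-scrambling) to check for chance correlation between the response and the descriptors.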
The success of any QSAR model depends on the accuracy of the input data, the selection of appropriate descriptors and statistical tools, and, most importantly, validation of the developed model. Validation is the process by which the reliability and relevance of a procedure are established for a specific purpose.[15] Leave-one-out cross-validation generally leads to an overestimation of predictive capacity, and even with external validation, no one can be sure whether the selection of training and test sets was manipulated to maximize the predictive capacity of the model being published. Aspects of QSAR model validation that need attention include the methods for selecting training set compounds,[16] the choice of training set size,[17] and the impact of variable selection[18] on the quality of prediction of training set models. Development of novel validation parameters for judging the quality of QSAR models is also important.[19]
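Leave-one-out cross-validation can be sketched for the simplest case, a one-descriptor linear model. The descriptor and activity values are invented for illustration. The cross-validated statistic is written here as Q² = 1 − PRESS / TSS, with the total sum of squares taken around the mean of all observations, which is one of several conventions in use.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def total_sum_of_squares(ys):
    mean_y = sum(ys) / len(ys)
    return sum((y - mean_y) ** 2 for y in ys)


def r_squared(xs, ys):
    """Fit on all compounds and score on the same compounds."""
    slope, intercept = fit_line(xs, ys)
    rss = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return 1.0 - rss / total_sum_of_squares(ys)


def loo_q_squared(xs, ys):
    """Refit with each compound left out and predict the omitted compound."""
    press = 0.0
    for i in range(len(xs)):
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (slope * xs[i] + intercept)) ** 2
    return 1.0 - press / total_sum_of_squares(ys)


# Hypothetical descriptor values and activities for six compounds.
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
activity = [2.1, 2.9, 4.2, 4.8, 6.3, 6.9]
print(r_squared(descriptor, activity), loo_q_squared(descriptor, activity))
```

For a least-squares fit, Q² computed this way never exceeds the fitted R², yet each omitted compound still comes from the same data set as the rest, which is one reason the statistic can look more favourable than performance on truly external compounds.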
One of the first historical QSAR applications was to predict boiling points.[20]
It is well known, for instance, that within a particular family of chemical compounds, especially in organic chemistry, there are strong correlations between structure and observed properties. A simple example is the relationship between the number of carbons in alkanes and their boiling points. There is a clear trend of increasing boiling point with an increasing number of carbons, and this serves as a means for predicting the boiling points of higher alkanes.
Other applications that remain of interest are the Hammett equation, the Taft equation, and pKa prediction methods.[21]
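The alkane trend can be turned into a small worked example. The sketch below fits a straight line to approximate literature boiling points of the unbranched alkanes butane through octane and extrapolates to nonane. The choice of a linear model and of this particular range of compounds is an assumption made for illustration.

```python
# Approximate literature boiling points (degrees Celsius) of the
# unbranched alkanes C4 to C8, keyed by number of carbon atoms.
boiling_points = {4: -0.5, 5: 36.1, 6: 68.7, 7: 98.4, 8: 125.6}


def fit_line(data):
    """Ordinary least squares for boiling point = slope * carbons + intercept."""
    xs, ys = list(data.keys()), list(data.values())
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(boiling_points)
predicted_nonane = slope * 9 + intercept
print(f"slope = {slope:.2f} degC per carbon, nonane = {predicted_nonane:.1f} degC")
```

The extrapolated value comes out near 160 °C, somewhat above the roughly 151 °C usually quoted for nonane, because the increase per added carbon shrinks as the chain grows. A straight line is therefore only a first approximation to the trend.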
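For reference, the Hammett equation in its usual form for equilibrium constants is reproduced below. The symbol names follow the common textbook convention rather than anything stated in this article.

```latex
\log_{10} \frac{K}{K_0} = \sigma \rho
```

Here K is the equilibrium constant of the reaction with a substituted benzene derivative, K_0 is the constant for the unsubstituted reference compound, σ is a constant that depends only on the substituent, and ρ is a constant that depends only on the reaction type and conditions. The same form is used with rate constants in place of equilibrium constants.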
The biological activity of molecules is usually measured in assays to establish the level of inhibition of particular signal transduction or metabolic pathways. Chemicals can also be biologically active by being toxic. Drug discovery often involves the use of QSAR to identify chemical structures that could have good inhibitory effects on specific targets and have low toxicity (non-specific activity). Of special interest is the prediction of partition coefficient log P, which is an important measure used in identifying "druglikeness" according to Lipinski's Rule of Five.
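Lipinski's Rule of Five is simple enough to express as a short check. The sketch below assumes the four descriptors have already been calculated or measured, and it uses the common reading that a compound with at most one violation is still considered drug-like. The function names and the example descriptor values (approximate figures for aspirin) are illustrative assumptions.

```python
def lipinski_violations(mol_weight, log_p, h_bond_donors, h_bond_acceptors):
    """Return the list of Rule of Five criteria that a compound violates."""
    violations = []
    if mol_weight > 500:
        violations.append("molecular weight > 500")
    if log_p > 5:
        violations.append("log P > 5")
    if h_bond_donors > 5:
        violations.append("hydrogen bond donors > 5")
    if h_bond_acceptors > 10:
        violations.append("hydrogen bond acceptors > 10")
    return violations


def is_druglike(mol_weight, log_p, h_bond_donors, h_bond_acceptors):
    """Assumed convention: at most one violation is tolerated."""
    found = lipinski_violations(mol_weight, log_p,
                                h_bond_donors, h_bond_acceptors)
    return len(found) <= 1


# Approximate descriptor values for aspirin, used only as an example input.
print(is_druglike(mol_weight=180.2, log_p=1.2,
                  h_bond_donors=1, h_bond_acceptors=4))
```

In a QSAR workflow the `log_p` argument would typically be a predicted value, which is why the accuracy of log P prediction matters for this kind of screening.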
While many quantitative structure–activity relationship analyses involve the interactions of a family of molecules with an enzyme or receptor binding site, QSAR can also be used to study the interactions between the structural domains of proteins. Protein–protein interactions can be quantitatively analyzed for structural variations resulting from site-directed mutagenesis.[22]
Part of the task of the machine learning method is to reduce the risk of a SAR paradox, especially given that only a finite amount of data is available (see also MVUE). In general, all QSAR problems can be divided into coding[23] and learning.[24]
(Q)SAR models have been used for the risk management of chemicals. QSARs are suggested by regulatory authorities; in the European Union, QSARs are suggested by the REACH regulation, where "REACH" abbreviates "Registration, Evaluation, Authorisation and Restriction of Chemicals".
The chemical descriptor space whose convex hull is generated by a particular training set of chemicals is called the training set's applicability domain. Prediction of properties of novel chemicals that are located outside the applicability domain uses extrapolation, and so is less reliable (on average) than prediction within the applicability domain. The assessment of the reliability of QSAR predictions remains a research topic.
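The convex hull definition can be made concrete for the special case of two descriptors. The sketch below builds the hull with Andrew's monotone chain algorithm and then tests whether a query compound falls inside it. Restricting to two dimensions, and the descriptor values themselves, are simplifying assumptions; real descriptor spaces have many more dimensions and need a different in-hull test.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def in_applicability_domain(hull, query):
    """True if query lies inside or on a hull with at least three vertices."""
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], query) >= 0
               for i in range(n))


# Hypothetical training compounds described by two scaled descriptors.
training = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0),
            (2.0, 2.0), (1.0, 3.0)]
hull = convex_hull(training)
print(in_applicability_domain(hull, (1.0, 1.0)))  # interpolation
print(in_applicability_domain(hull, (5.0, 1.0)))  # extrapolation
```

A query that fails the test is predicted by extrapolation, so its prediction deserves less trust than one for a compound inside the hull.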
This entry is from Wikipedia, the leading user-contributed encyclopedia. It may not have been reviewed by professional editors.