Basic Requirements for QSAR Analysis

Chapter: Medicinal Chemistry : Structure-Activity Relationship and Quantitative Structure Activity Relationship

Several basic requirements must be met to develop a reliable model:


• All analogues should belong to a congeneric series (classical QSAR studies), that is, a series of compounds with a similar basic structure but varying substituents. Noncongeneric series are widely used for higher dimensional (3D and 4D) studies.

• All compounds in the set must exert the same mechanism of action.

• Biological response should be distributed over a wide range.

• Observed biological activity should be expressed in specific units (concentration in molar units, IC50, or percentage inhibition).

A simple rule is that the number of compounds in the training set divided by the number of variables in the final model should be greater than approximately five or six.

This helps ensure that the data set is not overfitted and that the model has a better chance of retaining its predictive value.
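The compounds-per-variable rule can be checked with a few lines of Python. This is only a sketch: the function names are hypothetical, and the default threshold of 5 is an assumption taken from the lower end of the "five or six" guideline.

```python
def compounds_per_variable(n_compounds: int, n_variables: int) -> float:
    """Ratio of training-set compounds to variables in the final model."""
    if n_variables <= 0:
        raise ValueError("the model must contain at least one variable")
    return n_compounds / n_variables


def satisfies_ratio_rule(n_compounds: int, n_variables: int,
                         minimum: float = 5.0) -> bool:
    """True if the ratio reaches the assumed minimum (5 by default)."""
    return compounds_per_variable(n_compounds, n_variables) >= minimum


# Hypothetical example: 30 training compounds, 4 descriptors in the model.
print(compounds_per_variable(30, 4))   # ratio for this data set
print(satisfies_ratio_rule(30, 4))     # enough compounds for 4 variables?
print(satisfies_ratio_rule(12, 4))     # too few compounds for 4 variables?
```

Raising `minimum` to 6 gives the stricter end of the guideline.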

Steps Involved in QSAR Studies

The QSAR methodology enables the development of mathematical models that can be used to predict the biological activity of newly designed compounds. The procedure involves three steps. The first is the creation of a database, in which various physicochemical and structural parameters of a congeneric series are calculated. The second is regression analysis, which yields a model relating the biological activities to the derived physicochemical descriptors. The third is the validation of the models and the prediction of the biological activity of the designed compounds.

Statistical Methods Used in QSAR Analysis

Statistical methods are an essential component of QSAR work. They help to build models, estimate a model’s predictive ability, validate an existing model, and find relationships and correlations among the variables and the activities. Data analysis methods are used to recombine data into forms and groups and observations into hierarchies.


Regression analysis is a mathematical procedure that correlates the dependent variable (Y) with the independent variables (X). There can be different forms of regression analysis:

Simple linear regression analysis: An independent variable is correlated with a dependent variable and produces a linear one-term equation. It is useful for discovering some of the most important descriptors.

Multiple linear regression (MLR) analysis: More than one independent variable is correlated with a dependent variable and a single multiterm equation is formed. The number of variables should not exceed one-fifth of the number of molecules in a series, that is, one variable is allowed for every five molecules in the series.

Stepwise linear regression analysis: This is useful when the number of independent variables is very high. The variables are correlated with the dependent variable in a stepwise manner, producing a multiterm linear equation.
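To make the simplest of these forms concrete, the sketch below fits a one-term linear equation by ordinary least squares. The descriptor and activity values are invented for illustration, and `fit_simple_linear` is a hypothetical helper name, not part of any QSAR package.

```python
def fit_simple_linear(x, y):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Sum of cross-products and sum of squares about the means.
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    if s_xx == 0:
        raise ValueError("x values must not all be identical")
    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Hypothetical data: one descriptor value and one activity per compound.
descriptor = [1.0, 2.0, 3.0, 4.0]
activity = [2.1, 3.9, 6.1, 7.9]

slope, intercept = fit_simple_linear(descriptor, activity)
print(f"activity = {slope:.2f} * descriptor + {intercept:.2f}")
```

MLR extends the same least-squares idea to several descriptors at once, which in practice is usually delegated to a linear algebra library.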


In partial least squares (PLS) analysis, hundreds or even thousands of independent variables (X-block) can be correlated with one or several dependent variables (Y-block). PLS is used when the X data contain collinearities or when N is less than 5M, where N is the number of compounds and M is the number of independent variables. Because of the usually large number of X variables, PLS analysis often yields perfect correlations, so a cross-validation procedure must be used to select the model with the highest predictive value. In this procedure, several PLS runs are performed in which one or several objects are eliminated from the data set. PLS is the method of choice in 3D QSAR.


Genetic function approximation (GFA) provides multiple models that are created by evolving random initial models using a genetic algorithm. Models are improved by performing a crossover operation to recombine the better-scoring models. This method is used when dealing with a large number of descriptors.


Genetic partial least squares (G/PLS) combines the best of GFA and PLS. Each generation has PLS applied to it instead of MLR, so each model can have more terms in it without fear of overfitting. G/PLS retains the ease of interpretation of GFA by back-transforming the PLS components to the original variables.


Principal component analysis (PCA) is a data reduction method that uses mathematical techniques to identify patterns in a data matrix. Its main element is the construction of a small set of new orthogonal (that is, uncorrelated) variables derived from linear combinations of the original variables.

Statistical Measures Commonly Used in Regression Analysis

Correlation coefficient (r)/Square of the correlation coefficient (r2): The correlation coefficient ‘r’ and the square of the correlation coefficient (r2) are measures of the quality of the fit of the model. They are computed using the following equations:

$$r = \sqrt{1 - \frac{\sum \Delta^2}{SSY}}$$

$$r^2 = 1 - \frac{\sum \Delta^2}{SSY}$$

where

$$SSY = \sum (Y_{obs} - Y_{mean})^2$$

$$\sum \Delta^2 = \sum (Y_{obs} - Y_{cal})^2$$

Here SSY is the overall variance, $Y_{obs}$ is the observed biological activity, $Y_{mean}$ is the mean of the biological activity values, and $Y_{cal}$ is the biological activity calculated from the equation.

A high value of the correlation coefficient (r) indicates the statistical significance of the regression equation and thereby of the participating substituent constants. The squared correlation r2 is a measure of the explained variance and is most often presented as a percentage. For example, if r = 0.8, then r2 = 0.64, so 64% of the variance is accounted for by the regression parameters.
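The definitions of SSY, ΣΔ2, r2, and r translate directly into code. The sketch below assumes the observed and calculated activities are available as plain Python lists; the numbers are invented and `r_squared` is a hypothetical helper name.

```python
import math


def r_squared(y_obs, y_calc):
    """r2 = 1 - sum((Yobs - Ycal)^2) / sum((Yobs - Ymean)^2)."""
    y_mean = sum(y_obs) / len(y_obs)
    ssy = sum((yo - y_mean) ** 2 for yo in y_obs)                   # overall variance (SSY)
    sum_delta_sq = sum((yo - yc) ** 2 for yo, yc in zip(y_obs, y_calc))  # residual sum of squares
    return 1.0 - sum_delta_sq / ssy


# Hypothetical observed and calculated activities for five compounds.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
calculated = [1.1, 1.9, 3.2, 3.8, 5.0]

r2 = r_squared(observed, calculated)
r = math.sqrt(r2)  # valid here because r2 is non-negative for this fit
print(f"r2 = {r2:.3f}, r = {r:.3f}, explained variance = {100 * r2:.1f}%")
```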

Standard error of the estimate (S): This is a measure of how well the function derived by the QSAR analysis predicts the observed biological activity. Its value takes into account the number of objects n and the number of variables k. Therefore, S depends not only on the quality of the fit, but also on the number of degrees of freedom (DF). The smaller the value of S, the better the QSAR.

$$DF = n - k - 1$$

$$S = \sqrt{\frac{\sum (Y_{obs} - Y_{cal})^2}{n - k - 1}}$$

F-value: This is a measure of the statistical significance of the regression model. The number of variables included in the model influences the F-value even more strongly than it influences the standard deviation.

$$F = \frac{r^2 (n - k - 1)}{k (1 - r^2)}$$
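As an illustrative sketch, the standard error of the estimate and the F-value can be computed as follows, with n compounds and k variables. The function names and the numerical inputs are hypothetical.

```python
import math


def standard_error(y_obs, y_calc, k):
    """S = sqrt(sum((Yobs - Ycal)^2) / (n - k - 1)) for a model with k variables."""
    n = len(y_obs)
    dof = n - k - 1  # degrees of freedom
    if dof <= 0:
        raise ValueError("need more compounds than k + 1")
    residual_ss = sum((yo - yc) ** 2 for yo, yc in zip(y_obs, y_calc))
    return math.sqrt(residual_ss / dof)


def f_value(r2, n, k):
    """F = r2 * (n - k - 1) / (k * (1 - r2))."""
    return r2 * (n - k - 1) / (k * (1.0 - r2))


# Hypothetical one-variable model fitted to five compounds.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
calculated = [1.1, 1.9, 3.2, 3.8, 5.0]

print(f"S = {standard_error(observed, calculated, k=1):.4f}")
print(f"F = {f_value(0.99, n=5, k=1):.1f}")
```

Adding a variable without improving the fit lowers n - k - 1 and raises k, so both statistics penalise unnecessary terms.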

Predicted sum of squares (PRESS): This is the sum, over all compounds, of the squared differences between the actual and the predicted values of the dependent variable.

$$PRESS = \sum (Y_{obs} - Y_{pred})^2$$

Cross-validation r2 (q2): Cross-validation is an approach for assessing the predictive value of a model. The cross-validated r2 (q2) is generated during a validation procedure. It is calculated using the formula

$$q^2 = 1 - \frac{\sum (Y_{pred} - Y_{obs})^2}{\sum (Y_{obs} - Y_{mean})^2}$$

where $Y_{pred}$ is a predicted value, $Y_{obs}$ is an actual or experimental value, and $Y_{mean}$ is the best estimate of the mean of all values that might be predicted.

A cross-validated r2 is usually smaller than the overall r2 for a QSAR equation. It is used as a diagnostic tool to evaluate the predictive power of an equation. Cross-validation proceeds by omitting one or more rows of input data, re-deriving the model, and predicting the target property values of the omitted rows. The re-derivation and prediction cycle continues until all the target property values have been predicted at least once. The predictive sum of squares (PRESS), from which the root mean square error of all the target predictions can be derived, is the basis for evaluating the model.
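The omit, re-derive, and predict cycle can be sketched as leave-one-out cross-validation. The example below assumes the simplest possible model, a one-descriptor linear regression, and uses the mean of all observed activities as Ymean. The data and function names are hypothetical.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for one descriptor."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    slope = s_xy / s_xx
    return slope, y_mean - slope * x_mean


def loo_cross_validation(x, y):
    """Return (PRESS, q2) from leave-one-out cross-validation."""
    n = len(y)
    y_mean = sum(y) / n
    press = 0.0
    for i in range(n):
        # Omit compound i, re-derive the model, then predict the omitted value.
        x_train = x[:i] + x[i + 1:]
        y_train = y[:i] + y[i + 1:]
        slope, intercept = fit_line(x_train, y_train)
        y_pred = slope * x[i] + intercept
        press += (y[i] - y_pred) ** 2
    ssy = sum((yi - y_mean) ** 2 for yi in y)
    return press, 1.0 - press / ssy


# Hypothetical descriptor values and activities (plain lists).
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0]
activity = [1.2, 1.9, 3.2, 3.8, 5.1]

press, q2 = loo_cross_validation(descriptor, activity)
print(f"PRESS = {press:.4f}, q2 = {q2:.3f}")
```

Omitting several rows at a time (leave-many-out) follows the same pattern with larger held-out groups.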

Outliers: An outlier is defined as a structure with a residual greater than two times the standard deviation.
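A minimal sketch of this outlier rule follows. The text does not say which standard deviation is meant, so the code assumes the sample standard deviation of the residuals. `flag_outliers` and the data are hypothetical.

```python
import statistics


def flag_outliers(y_obs, y_calc, threshold=2.0):
    """Indices of compounds whose |residual| exceeds threshold * SD of the residuals."""
    residuals = [yo - yc for yo, yc in zip(y_obs, y_calc)]
    sd = statistics.stdev(residuals)  # assumption: sample SD of the residuals
    return [i for i, res in enumerate(residuals) if abs(res) > threshold * sd]


# Hypothetical observed and calculated activities; the last compound fits poorly.
observed = [5.1, 4.9, 6.05, 5.95, 7.1, 6.9, 8.0, 9.5]
calculated = [5.0, 5.0, 6.0, 6.0, 7.0, 7.0, 8.0, 8.0]

print(flag_outliers(observed, calculated))
```

If the standard error of the estimate S is the intended yardstick, it can be substituted for `sd` without changing the rest of the function.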

Bootstrapping: Bootstrapping is another technique for model validation. It is based on simulating a large number of data sets, each of the same size as the original, by sampling from the original data set. The same data point can be sampled more than once. The statistical analysis is performed on each of the simulated data sets. The component model with consistent results is then chosen as the final model.
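The resampling idea can be sketched with a one-descriptor linear model: each simulated data set is drawn with replacement from the original, the model is refitted, and the spread of the fitted slopes indicates how consistent the model is. The data, the number of resamples, the seed, and the function names are all assumptions for illustration.

```python
import random
import statistics


def fit_line(x, y):
    """Least-squares slope and intercept for one descriptor."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    slope = s_xy / s_xx
    return slope, y_mean - slope * x_mean


def bootstrap_slopes(x, y, n_resamples=200, seed=0):
    """Refit the line on data sets resampled with replacement; return the slopes."""
    if len(set(x)) < 2:
        raise ValueError("need at least two distinct descriptor values")
    rng = random.Random(seed)
    n = len(x)
    slopes = []
    while len(slopes) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]  # same size as the original set
        xs = [x[i] for i in idx]
        if len(set(xs)) < 2:
            continue  # skip degenerate resamples where no line can be fitted
        ys = [y[i] for i in idx]
        slopes.append(fit_line(xs, ys)[0])
    return slopes


# Hypothetical descriptor values and activities.
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
activity = [1.2, 1.9, 3.2, 3.8, 5.1, 5.9]

slopes = bootstrap_slopes(descriptor, activity)
print(f"mean slope = {statistics.mean(slopes):.3f}, SD = {statistics.stdev(slopes):.3f}")
```

A small standard deviation across resamples suggests that the fitted coefficient does not depend heavily on any single compound.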
