Basic Requirements for QSAR Analysis

Chapter: Medicinal Chemistry : Structure-Activity Relationship and Quantitative Structure Activity Relationship

Certain basic requirements must be met to develop a reliable QSAR model:

• All analogues should belong to a congeneric series (classical QSAR studies), that is, a series of compounds with a similar basic structure but varying substituents, exerting the same mechanism of action. Noncongeneric series are widely used for higher-dimensional (3D and 4D) studies.

• A common mechanism of action across the whole set of compounds is essential.

• Biological response should be distributed over a wide range.

• Observed biological activity should be expressed in specific units (for example, concentration in molar units, IC50, or percentage inhibition).

• A simple rule is that the number of compounds in the training set divided by the number of variables in the final model should be greater than approximately five or six.

This helps ensure that the data set is not overfitted (‘over predicted’) and that the model has a better chance of retaining its predictive value.
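
The ratio rule above can be checked with a few lines of Python. This is only a minimal sketch: the function names are hypothetical, and the cutoff of 5.0 is an assumed reading of "approximately five or six".

```python
def compounds_per_variable(n_compounds, n_variables):
    """Return the ratio of training-set compounds to model variables."""
    if n_variables <= 0:
        raise ValueError("the model needs at least one variable")
    return n_compounds / n_variables


def satisfies_ratio_rule(n_compounds, n_variables, minimum_ratio=5.0):
    """Check the rule of thumb; minimum_ratio=5.0 is an assumed cutoff."""
    return compounds_per_variable(n_compounds, n_variables) >= minimum_ratio


# 30 compounds and 4 variables give a ratio of 7.5; 12 and 4 give 3.0.
print(satisfies_ratio_rule(30, 4))
print(satisfies_ratio_rule(12, 4))
```

A stricter analysis could pass minimum_ratio=6.0 instead.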

Steps Involved in QSAR Studies

The QSAR methodology enables the development of mathematical models that can be used to predict the biological activity of newly designed compounds. The procedure involves three steps. The first is the creation of a database, in which various physicochemical and structural parameters of a congeneric series are calculated. The second is regression analysis, which leads to a model relating biological activity to the derived physicochemical descriptors. The third is the validation of the models and the prediction of the biological activity of the designed compounds.

Statistical Methods Used in QSAR Analysis

Statistical methods are an essential component of QSAR work. They help to build models, estimate a model’s predictive ability, validate an existing model, and find relationships and correlations among the variables and the activities. Data analysis methods are used to recombine data into forms and groups, and observations into hierarchies.

REGRESSION METHODS

Regression is a mathematical procedure that correlates the dependent variable (Y) with one or more independent variables (X). Regression analysis can take different forms:

Simple linear regression analysis: An independent variable is correlated with a dependent variable and produces a linear one-term equation. It is useful for discovering some of the most important descriptors.

MLR analysis: In multiple linear regression (MLR), more than one independent variable is correlated with a dependent variable and a single multiterm equation is formed. The number of variables should not exceed one-fifth of the number of molecules in the series, that is, one variable is allowed for every five molecules.

Stepwise linear regression analysis: This is useful when the number of independent variables is very high. The variables are correlated with the dependent variable in a stepwise manner, producing a multiterm linear equation.
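
As an illustration of how an MLR equation is obtained, the sketch below fits an intercept and one coefficient per descriptor by solving the least-squares normal equations with Gaussian elimination. The function names and the toy data are assumptions made for this example; the six compounds were generated exactly from Y = 1 + 2·X1 − 0.5·X2, so they are far too few for a real two-variable model.

```python
def solve_linear_system(a, b):
    """Solve a·x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("descriptors are collinear; system is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_mlr(descriptors, activities):
    """Return [intercept, b1, b2, ...] minimising the squared residuals."""
    rows = [[1.0] + list(row) for row in descriptors]  # leading 1 = intercept
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, activities)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Toy data (hypothetical descriptor values X1, X2 and activities Y).
toy_descriptors = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 2)]
toy_activities = [1.0, 3.0, 0.5, 2.5, 4.5, 2.0]
print(fit_mlr(toy_descriptors, toy_activities))
```

The explicit singularity check shows why plain MLR breaks down when descriptors are collinear.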

PARTIAL LEAST SQUARES (PLS)

In PLS, hundreds or even thousands of independent variables (X-block) can be correlated with one or several dependent variables (Y-block). PLS is used when the X data contain collinearities or when N is less than 5M, where N is the number of compounds and M is the number of independent variables. Because the number of X variables is usually large, PLS analysis often yields perfect correlations, so a cross-validation procedure must be used to select the model with the highest predictive value. In cross-validation, several PLS runs are performed, each with one or several objects eliminated from the data set. PLS is the method of choice in 3D QSAR.
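
To show why PLS tolerates collinear X data, the sketch below extracts only the first latent component for a single dependent variable (often called PLS1). It is a simplified illustration rather than a full PLS implementation: the function name is hypothetical, no further components are extracted, and the cross-validation step described above is not included.

```python
import math


def pls1_one_component(x_rows, y):
    """Fit a one-component PLS1 model; return (intercept, coefficients)."""
    n, m = len(x_rows), len(x_rows[0])
    col_means = [sum(row[j] for row in x_rows) / n for j in range(m)]
    y_mean = sum(y) / n
    xc = [[row[j] - col_means[j] for j in range(m)] for row in x_rows]
    yc = [value - y_mean for value in y]

    # Weight vector: direction in X space with maximal covariance with Y.
    w = [sum(xc[i][j] * yc[i] for i in range(n)) for j in range(m)]
    norm = math.sqrt(sum(v * v for v in w))
    if norm == 0.0:
        raise ValueError("X and Y are uncorrelated; no component found")
    w = [v / norm for v in w]

    # Scores of each compound on the latent variable, then regress Y on them.
    t = [sum(xc[i][j] * w[j] for j in range(m)) for i in range(n)]
    q = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)

    # Back-transform to coefficients on the original descriptors.
    coefs = [wj * q for wj in w]
    intercept = y_mean - sum(c * mu for c, mu in zip(coefs, col_means))
    return intercept, coefs


# Toy data: the second descriptor is exactly twice the first (collinear),
# and Y = 3 + 2·X1. The MLR normal equations would be singular here.
x_rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
y = [5.0, 7.0, 9.0, 11.0]
print(pls1_one_component(x_rows, y))
```

Because the two descriptors carry the same information, the single latent variable reproduces Y exactly, spreading the effect over both descriptors.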

GENETIC FUNCTION APPROXIMATION (GFA)

GFA provides multiple models that are created by evolving random initial models using a genetic algorithm. Models are improved by performing a crossover operation to recombine better-scoring models. This method is used when dealing with a large number of descriptors.

GENETIC PARTIAL LEAST SQUARES (G/PLS)

This method combines the best of GFA and PLS. Each generation has PLS applied to it instead of MLR, so each model can have more terms in it without fear of overfitting. G/PLS retains the ease of interpretation of GFA by back-transforming the PLS components to the original variables.

PRINCIPAL COMPONENT ANALYSIS (PCA)

PCA is a data reduction method, using mathematical techniques to identify the pattern in a data matrix. The main element of this approach consists of the construction of a small set of new orthogonal, that is, uncorrelated variables derived from a linear combination of the original variables.
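
The idea of a new variable built as a linear combination of the original ones can be sketched as follows. The example finds only the first principal component, using power iteration on the covariance matrix, which is one simple way to obtain it and is chosen here only to keep the code short. The function name, the starting vector, and the iteration count are assumptions.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (loadings, explained_fraction, scores) for the first PC."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centered = [[r[j] - means[j] for j in range(m)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(m)]
           for i in range(m)]

    vec = [1.0 + j for j in range(m)]  # arbitrary non-zero starting vector
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:
            raise ValueError("no variance along the starting direction")
        vec = [v / norm for v in nxt]

    # Variance along the component, as a fraction of the total variance.
    eigenvalue = sum(vec[i] * sum(cov[i][j] * vec[j] for j in range(m))
                     for i in range(m))
    total_variance = sum(cov[i][i] for i in range(m))
    scores = [sum(c[j] * vec[j] for j in range(m)) for c in centered]
    return vec, eigenvalue / total_variance, scores


# Two perfectly correlated toy descriptors collapse onto one component.
loadings, fraction, scores = first_principal_component(
    [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)])
print(loadings, fraction)
```

Further components would be obtained by removing the first component from the data and repeating the procedure, which is not shown here.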

Statistical Measures Commonly Used in Regression Analysis

Correlation coefficient (r)/Square of the correlation coefficient (r²): The correlation coefficient r and its square r² are measures of the quality of the fit of the model. They are computed using the following equations:

$$r = \sqrt{1 - \frac{\sum \Delta^2}{SSY}}$$

$$r^2 = 1 - \frac{\sum \Delta^2}{SSY}$$

where

$$SSY = \sum (Y_{obs} - Y_{mean})^2$$

$$\sum \Delta^2 = \sum (Y_{obs} - Y_{cal})^2$$

SSY is the overall variance, Y_obs is the observed biological activity, Y_mean is the mean of the biological activity values, and Y_cal is the biological activity calculated from the equation.

A high value of the correlation coefficient (r) indicates the statistical significance of the regression equation and thereby of the participating substituent constants. The squared correlation r² is a measure of the explained variance, most often presented as a percentage value. For example, if r = 0.8, then r² = 0.64, that is, 64% of the variance is accounted for by the regression parameters.
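
The definitions above translate directly into code. The sketch below computes r² and r from observed and calculated activities; the function name and the four activity values are made up for illustration.

```python
import math


def r_squared(y_obs, y_cal):
    """Return r² = 1 − ΣΔ²/SSY for observed and calculated activities."""
    y_mean = sum(y_obs) / len(y_obs)
    ssy = sum((y - y_mean) ** 2 for y in y_obs)            # overall variance
    sum_delta_sq = sum((yo - yc) ** 2 for yo, yc in zip(y_obs, y_cal))
    return 1.0 - sum_delta_sq / ssy


# Hypothetical observed and calculated activities for four compounds.
y_obs = [1.0, 2.0, 3.0, 4.0]
y_cal = [1.1, 1.9, 3.2, 3.8]

r2 = r_squared(y_obs, y_cal)   # ΣΔ² = 0.10 and SSY = 5.0, so r² is about 0.98
r = math.sqrt(r2)
print(round(r2, 4), round(r, 4))
```

With these values about 98% of the variance in the observed activities is accounted for by the calculated ones.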

Standard error of the estimate (S): This is a measure of how well the function derived by the QSAR analysis predicts the observed biological activity. Its value takes into account the number of objects n and the number of variables k. Therefore, S depends not only on the quality of fit, but also on the number of degrees of freedom (DF). The smaller the value of S, the better the QSAR.

$$DF = n - k - 1$$

$$S = \sqrt{\frac{\sum (Y_{obs} - Y_{cal})^2}{n - k - 1}}$$

F-value: This is a measure of the statistical significance of the regression model. The influence of the number of variables included in the model is even larger on the F-value than on the standard deviation.

$$F = \frac{r^2 (n - k - 1)}{k (1 - r^2)}$$
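
Both the standard error of the estimate and the F-value depend on the degrees of freedom n − k − 1, which the short sketch below makes explicit. The function names and the numerical values are assumptions for illustration; note that the F-value is undefined when r² is exactly 1.

```python
import math


def standard_error(y_obs, y_cal, n_variables):
    """Return S = sqrt(Σ(Yobs − Ycal)² / (n − k − 1))."""
    dof = len(y_obs) - n_variables - 1
    if dof <= 0:
        raise ValueError("not enough compounds for this many variables")
    residual_ss = sum((yo - yc) ** 2 for yo, yc in zip(y_obs, y_cal))
    return math.sqrt(residual_ss / dof)


def f_value(r2, n_compounds, n_variables):
    """Return F = r²(n − k − 1) / (k(1 − r²)); requires r² < 1."""
    dof = n_compounds - n_variables - 1
    return r2 * dof / (n_variables * (1.0 - r2))


# Hypothetical one-variable model (k = 1) for four compounds (n = 4).
y_obs = [1.0, 2.0, 3.0, 4.0]
y_cal = [1.1, 1.9, 3.2, 3.8]
print(standard_error(y_obs, y_cal, n_variables=1))
print(f_value(0.98, n_compounds=4, n_variables=1))
```

Adding a second variable to the same four compounds would leave only one degree of freedom, which raises S for the same residuals and lowers F for the same r².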

Predicted sum of squares (PRESS): The sum, over all compounds, of the squared differences between the actual and the predicted values of the dependent variable.

$$PRESS = \sum (Y_{obs} - Y_{pred})^2$$

Cross-validation r² (q²): Cross-validation is an approach for assessing the predictive value of a model. The cross-validated r² (q²) is generated during a validation procedure. It is calculated using the formula

$$q^2 = 1 - \frac{\sum (Y_{pred} - Y_{obs})^2}{\sum (Y_{obs} - Y_{mean})^2}$$

where Y_pred is a predicted value, Y_obs is an actual or experimental value, and Y_mean is the best estimate of the mean of all values that might be predicted.

A cross-validated r² is usually smaller than the overall r² for a QSAR equation. It is used as a diagnostic tool to evaluate the predictive power of an equation. Cross-validation proceeds by omitting one or more rows of input data, re-deriving the model, and predicting the target property values of the omitted rows. The re-derivation and prediction cycle continues until all the target property values have been predicted at least once. The predictive sum of squares (PRESS), from which the root mean square error of all the target predictions follows, is the basis for evaluating the model.
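
The omit, re-derive, and predict cycle can be sketched for the simplest case: leave-one-out cross-validation of a one-descriptor linear model. The function names and data are hypothetical, and Y_mean is taken here as the mean of all observed values, which is one common convention and an assumption of this sketch.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for a single descriptor."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("descriptor has no variance")
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def loo_press_q2(x, y):
    """Return (PRESS, q²) from leave-one-out cross-validation."""
    x, y = list(x), list(y)
    press = 0.0
    for i in range(len(x)):
        # Omit compound i, re-derive the model, predict the omitted value.
        slope, intercept = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        y_pred = slope * x[i] + intercept
        press += (y[i] - y_pred) ** 2
    y_mean = sum(y) / len(y)
    ssy = sum((yi - y_mean) ** 2 for yi in y)
    return press, 1.0 - press / ssy


# Hypothetical descriptor values and slightly noisy activities.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 2.1, 2.9, 4.2, 4.8]
press, q2 = loo_press_q2(x, y)
print(press, q2)
```

Because each prediction is made for a compound the model has not seen, PRESS is never smaller than the residual sum of squares of the full fit, which is why q² does not exceed r² under this convention.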

Outliers: An outlier is defined as a structure with a residual greater than two times the standard deviation.
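
A minimal sketch of this criterion is given below. It assumes that "the standard deviation" refers to the standard error of the estimate S of the fitted model; the text does not state this explicitly, and the function name and data are hypothetical.

```python
import math


def flag_outliers(y_obs, y_cal, n_variables, factor=2.0):
    """Return indices of compounds whose |residual| exceeds factor × S."""
    residuals = [yo - yc for yo, yc in zip(y_obs, y_cal)]
    dof = len(y_obs) - n_variables - 1
    if dof <= 0:
        raise ValueError("not enough compounds for this many variables")
    # Assumption: S (standard error of the estimate) is the yardstick.
    s = math.sqrt(sum(r * r for r in residuals) / dof)
    return [i for i, r in enumerate(residuals) if abs(r) > factor * s]


# Ten hypothetical compounds; the last one is poorly described by the model.
y_cal = [float(i) for i in range(1, 11)]
offsets = [0.1, -0.1, 0.1, -0.1, 0.1, -0.1, 0.1, -0.1, 0.1, 2.0]
y_obs = [yc + d for yc, d in zip(y_cal, offsets)]
print(flag_outliers(y_obs, y_cal, n_variables=1))
```

A flagged compound should be inspected before removal, since it may indicate a different mechanism of action or an experimental error rather than a defect of the model.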

Bootstrapping: Bootstrapping is another technique for model validation. It is based on simulating a large number of data sets, each the same size as the original, by sampling from the original data set. The same data point can be sampled more than once. The statistical analysis is performed on each of the simulated data sets. The component model with consistent results is then chosen as the final model.
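
The resampling procedure can be sketched as follows for a one-descriptor linear model. The function names, the number of resamples, and the fixed random seed are assumptions made for this example; resamples in which every drawn compound has the same descriptor value are skipped because no slope can be fitted to them.

```python
import random


def fit_slope(x, y):
    """Least-squares slope, or None if the descriptor does not vary."""
    if len(set(x)) < 2:
        return None
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    return sxy / sxx


def bootstrap_slopes(x, y, n_resamples=200, seed=0):
    """Refit the model on data sets resampled with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(x)
    slopes = []
    for _ in range(n_resamples):
        # Draw n compounds with replacement; duplicates are allowed.
        picks = [rng.randrange(n) for _ in range(n)]
        slope = fit_slope([x[i] for i in picks], [y[i] for i in picks])
        if slope is not None:
            slopes.append(slope)
    return slopes


# Hypothetical descriptor values and activities.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9]
slopes = bootstrap_slopes(x, y)
print(min(slopes), max(slopes))
```

A narrow spread of the refitted slopes suggests that the model does not depend strongly on any single compound.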
