VolSurf manual
Chapter 7. Statistical Tools
Complexity reduction and data simplification are two of the most important features of chemometric packages. VolSurf includes some valuable tools to enable a simple and straightforward chemical interpretation of the descriptor matrix; this chapter gives a brief introduction to the statistical tools present in the VolSurf package.
7.1. Principal Components Analysis (PCA)
Principal Components Analysis (PCA) is an extremely useful technique for "summarising" all the information contained in the X-matrix in a form understandable by human beings. PCA works by decomposing the X-matrix as the product of two smaller matrices, which are called the loading and score matrices.
The loading matrix (P) contains information about the variables: it is composed of a few vectors (Principal Components, PCs) which are obtained as linear combinations of the original X-variables.
The score matrix (T) contains information about the objects. Each object is described in terms of its projections onto the PCs instead of the original variables.
X = TP' + E
The information not contained in these matrices remains as "unexplained X-variance" in a residual matrix (E) which has exactly the same dimensionality as the original X-matrix.
The PCs, among many others, have two interesting properties:
They are extracted in decreasing order of importance. The first PC always contains more information than the second, the second more than the third and so on...
They are orthogonal to each other. There is absolutely no correlation between the information contained in different PCs.
In PCA, the User can decide how many PCs should be extracted (the number of significant components, i.e. the dimensionality of the model). Each new PC extracted further increases the amount of information (variance) explained by the model. However, usually the first four or five PCs explain more than 90% of the X-variance. There is no simple or unique criterion to decide how many PCs to extract, and two kinds of considerations should be taken into account. From a theoretical point of view, it is possible to use cross-validation techniques to decide the number of PCs to include. From a practical point of view, there is little point in extracting a large number of PCs if the User has no way to interpret the results.
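The decomposition X = TP' + E can be illustrated with a minimal sketch. It assumes a column-centered X-matrix and uses the iterative NIPALS scheme in plain Python; the function names (`center`, `nipals_pca`) and the small data matrix are hypothetical and are not part of VolSurf, whose internal implementation is not described here.

```python
import math

def center(X):
    """Subtract each column mean; X is a list of rows (objects)."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[v - m for v, m in zip(row, means)] for row in X]

def nipals_pca(X, n_components, n_iter=500, tol=1e-20):
    """Return (T, P, E): score vectors, unit-length loading vectors and the
    residual matrix, so that X = sum over PCs of t p' + E."""
    E = [row[:] for row in X]
    n, k = len(E), len(E[0])
    T, P = [], []
    for _ in range(n_components):
        # Start from the column of E with the largest sum of squares.
        j0 = max(range(k), key=lambda j: sum(row[j] ** 2 for row in E))
        t = [row[j0] for row in E]
        for _ in range(n_iter):
            tt = sum(v * v for v in t)
            p = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(k)]
            norm = math.sqrt(sum(v * v for v in p))
            p = [v / norm for v in p]
            t_new = [sum(E[i][j] * p[j] for j in range(k)) for i in range(n)]
            change = sum((a - b) ** 2 for a, b in zip(t_new, t))
            t = t_new
            if change < tol:
                break
        # Deflation: remove the information explained by this PC.
        E = [[E[i][j] - t[i] * p[j] for j in range(k)] for i in range(n)]
        T.append(t)
        P.append(p)
    return T, P, E

# Hypothetical 5-object, 3-variable matrix (illustrative, not VolSurf data).
X_demo = center([[2.0, 4.1, 1.0], [4.0, 7.9, 0.5], [6.0, 12.2, 1.5],
                 [8.0, 15.8, 0.8], [10.0, 20.0, 1.2]])
T, P, E = nipals_pca(X_demo, n_components=2)
total_ss = sum(v * v for row in X_demo for v in row)
# Fraction of the X-variance explained by each PC, and the part left in E.
explained = [sum(v * v for v in t) / total_ss for t in T]
residual = sum(v * v for row in E for v in row) / total_ss
```

In this sketch `explained` lists the fraction of X-variance carried by each PC in extraction order, and `residual` is the unexplained fraction left in E; the two score vectors in `T` are mutually orthogonal.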
7.1.1. Uses of PCA
PCA is extremely useful for understanding the distribution of the objects, and knowing how much and why they differ. Moreover, PCA can be used to highlight the variables which contain similar information, as well as the variables which contain completely independent information.
The best way to extract information from the PCA is graphically, by plotting the matrices obtained.
- 2D and 3D scores plot
These plots represent the relative position of the objects in the space (two-dimensional or three-dimensional) of the Principal Components. Clusters of objects and single objects with particular behaviour (outliers) may be easily identified.
Moreover, the objects' position in the plots may serve to interpret the PCs. The first PCs explain the maximum amount of variation and therefore, when there are clusters of objects, they tend to distinguish among them. In this context, a PC can be interpreted as a compendium of the distinctive features of the objects in these clusters.
- 2D and 3D loading plots
These plots represent the original variables in the space (two-dimensional or three-dimensional) of the Principal Components.
Remember that the PCs are obtained as linear combinations of the original X-variables. The loading of a single variable indicates how much this variable participates in defining the PC (the squares of the loadings indicate their percentage in the PC). Variables contributing very little to the PCs have small loading values and are plotted around the center of the plot. On the other hand, variables which contribute most are plotted around the borders of the plot.
Loading plots allow the User to test the "homogeneity" of the contributions of the X-variables to the model. When there is more than one field group of variables, it is useful to plot the loadings highlighting the variables belonging to one of them, in order to understand how each field contributes to the whole description of the objects.
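As a small worked example of the statement that the squared loadings give the percentage of each variable in a PC, the sketch below converts a loading vector into percentage shares. The function name and the loading values are hypothetical and chosen only for illustration.

```python
def loading_contributions(p):
    """Percentage share of each variable in a PC, from its squared loadings."""
    total = sum(v * v for v in p)
    return [100.0 * v * v / total for v in p]

# Hypothetical unit-length loading vector for three variables.
p1 = [0.6, 0.8, 0.0]
shares = loading_contributions(p1)  # approximately [36.0, 64.0, 0.0]
```

The third variable has a zero loading and would sit at the center of a loading plot, while the second, with the largest share, would be plotted furthest from it.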
7.2. Partial Least Squares (PLS)
In the previous section we have described PCA, a technique of multivariate analysis that deals only with the X-variables. The Partial Least Squares (PLS) analysis is a regression technique whose goal is to explain one or more dependent variables (Y's) in terms of a number of explanatory variables (predictors, X's).
Y = f(X) + E
It is possible to build many different models that fulfill the equation. Different methods produce models that "fit" the Y's more or less accurately. Among them, the best one will be able to calculate Y values that correspond to the experimental ones, even for molecules not included in building the model. These models are "predictive" and can be used to calculate reliable estimations of Y values for new molecules, prior to their availability.
It is important to notice that the Y variables, like any other experimental variable, contain error. The models will try to fit the Y's as closely as possible and, if we try to improve the fitting too much, the model will also explain the noise! This phenomenon, called overfitting, is very dangerous, because overfitted models seem to be very good but they often prove to be useless for predicting the Y's of objects not included in the available data set (training set).
The predictive ability of a model is attributed to the existence of a "true" relationship between the measured X properties (usually field interaction energies) and the Y property measurements (usually the activity of a drug). Even if the 3D-QSAR models are rough and are not able to explain to a full extent the activity values, they are extremely valuable to identify the structural features that contribute most to the activity. These can be regarded as important in the context of the ligand-receptor interaction, and therefore they can give hints about how they interact and suggest new compounds.
Even the most sophisticated PLS 3D-QSAR models are subject to the same limitations as any regression model, and the aphorism "No regression model is better than the series it was obtained from" is always applicable. A model cannot provide information about the influence on the activity of areas which did not differ enough within the series. Unfortunately, no design method for 3D-QSAR has been reported so far and a good series design in this area continues to be a challenge. Nevertheless, the User should be aware of the lack of series design and understand the limitations of the resulting models.
7.2.1. PLS modeling
In 3D-QSAR the X-matrix often contains far fewer objects (molecules) than variables. In these situations, the classical regression technique, Multiple Linear Regression (MLR), is completely useless. There are many reasons, but, among others:
MLR was developed to deal with situations in which the number of objects (N) is at least three times larger than the number of variables (K). This limitation can be overcome by using stepwise MLR, but then there is a high probability of obtaining relationships just by chance.
MLR assumes that the X-variables are "independent" and not correlated. We know, from the very beginning, that our variables don't fulfill these requirements, since they are highly correlated, because of the continuity properties of the force fields.
In fact, the only regression method that can deal with the kind of X-matrices used in 3D-QSAR is Partial Least Squares (PLS). PLS works by decomposing the X-matrix as the product of two smaller matrices, much like PCA does.
The loading matrix (P) contains information about the variables. It contains a few vectors (Latent Variables, LVs) which are linear combinations of the original X-variables. The concept of LV is quite equivalent to the PC in PCA.
The score matrix (T) contains information about the objects. Each object is described in terms of the LVs.
The main difference is that PCA obtains the PCs that best represent the structure of the X-matrix, while PLS obtains the LVs under two constraints:
They have to represent the structure of the X-matrix and Y-matrix.
They have to maximize the fitting between the X's and the Y's.
The LVs share some important properties with the PCs:
They are extracted in decreasing order of importance. The first LV always contains more information than the second, the second more than the third and so on...
They are orthogonal to each other. There is absolutely no correlation between the information contained in different LVs.
As in PCA, the User can select the number of LVs to maintain in the model, but in PLS, selecting the correct dimensionality is of critical importance. When too many LVs are included, serious overfitting will result and the model will have little or no validity. To check how many LVs to include it is strictly necessary to test the predictive ability of the model, taking into account different numbers of components.
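The extraction of LVs described above can be sketched for the simplest case of a single Y variable. This is a minimal illustration of the classical NIPALS PLS1 scheme in plain Python, assuming centered X and y; the function names and the data are hypothetical, and the sketch is not presented as the routine VolSurf itself uses.

```python
import math

def center_columns(X):
    """Subtract each column mean; X is a list of rows (objects)."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[v - m for v, m in zip(row, means)] for row in X]

def pls1(X, y, n_lv):
    """NIPALS PLS for one Y variable (X and y already centered).
    Returns weights W, loadings P, scores T (one list per LV) and the
    residual sum of squares of y after each LV."""
    E = [row[:] for row in X]
    f = list(y)
    k = len(E[0])
    W, P, T, rss = [], [], [], []
    for _ in range(n_lv):
        # Weights: direction in X-space with maximal covariance with y.
        w = [sum(row[j] * fi for row, fi in zip(E, f)) for j in range(k)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
        # Scores: projection of each object onto the weight vector.
        t = [sum(a * b for a, b in zip(row, w)) for row in E]
        tt = sum(v * v for v in t)
        # Loadings (X side) and regression coefficient of y on the scores.
        p = [sum(row[j] * ti for row, ti in zip(E, t)) / tt for j in range(k)]
        c = sum(fi * ti for fi, ti in zip(f, t)) / tt
        # Deflate X and y before extracting the next LV.
        E = [[row[j] - ti * p[j] for j in range(k)] for row, ti in zip(E, t)]
        f = [fi - c * ti for fi, ti in zip(f, t)]
        W.append(w)
        P.append(p)
        T.append(t)
        rss.append(sum(v * v for v in f))
    return W, P, T, rss

# Hypothetical data: 6 objects, 3 X-variables, one Y (illustrative only).
X_demo = center_columns([[1.0, 2.0, 0.5], [2.0, 1.0, 1.5], [3.0, 4.0, 1.0],
                         [4.0, 3.0, 2.5], [5.0, 6.0, 2.0], [6.0, 5.0, 3.0]])
y_raw = [1.2, 1.9, 3.2, 3.8, 5.1, 6.0]
y_mean = sum(y_raw) / len(y_raw)
y_demo = [v - y_mean for v in y_raw]
ss_y = sum(v * v for v in y_demo)
W, P, T, rss = pls1(X_demo, y_demo, n_lv=2)
```

Here `rss` shrinks with every LV added, which is exactly why the fit alone cannot be used to choose the dimensionality: the fitted residual always improves, whether or not the extra LV is predictive.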
7.2.2. Tests of predictive ability: Cross-validation
The evaluation of the predictive ability of the PLS models is important to:
Obtain the complexity (number of LVs) of the model to retain.
Evaluate the quality of the model.
The predictive ability of a model is usually evaluated using cross-validation (CV). It works by building reduced models (models for which some of the objects were removed) and using them to predict the Y-variables of the objects held out. The predicted Y is then compared with the experimental Y and, for each model dimensionality, the following indexes are computed:
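SDEP Standard Deviation of Errors of Prediction

$$\mathrm{SDEP} = \sqrt{\frac{\sum (Y - Y')^2}{N}}$$

Q^2 Predictive correlation coefficient

$$Q^2 = 1 - \frac{\sum (Y - Y')^2}{\sum (Y - \bar{Y})^2}$$

Y : Experimental value
Y' : Predicted value
\bar{Y} : Average value
N : Number of objects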
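As a worked example, the two indexes can be computed as follows, assuming the usual definitions: SDEP is the square root of the sum of squared prediction errors divided by N, and Q^2 is one minus the ratio between that sum and the sum of squared deviations of the experimental values from their average. The function names and the numbers are illustrative only.

```python
import math

def sdep(y_exp, y_pred):
    """Standard Deviation of Errors of Prediction."""
    press = sum((y - yp) ** 2 for y, yp in zip(y_exp, y_pred))
    return math.sqrt(press / len(y_exp))

def q2(y_exp, y_pred):
    """Predictive correlation coefficient."""
    mean = sum(y_exp) / len(y_exp)
    press = sum((y - yp) ** 2 for y, yp in zip(y_exp, y_pred))
    ss = sum((y - mean) ** 2 for y in y_exp)
    return 1.0 - press / ss

# Hypothetical experimental values and cross-validated predictions.
y_exp = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 1.5, 3.5, 3.5]
sdep_value = sdep(y_exp, y_pred)  # 0.5
q2_value = q2(y_exp, y_pred)      # 0.8
```

Each prediction is off by 0.5, so the sum of squared errors is 1.0; the experimental values have a sum of squared deviations of 5.0 around their average of 2.5, which gives Q^2 = 1 - 1/5.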
The CV technique is very valuable because it performs an "internal validation" of the model and obtains an estimation of the predictive ability without the help of external data-sets. This is particularly important in QSAR studies, where the number of objects available is usually small, and one cannot afford to remove objects from the learning data-set.
One of the main drawbacks of CV is that there is no general agreement on how to build the reduced groups or on the criterion to decide how many objects to keep out. It is clear that the objects should be deleted once and only once over the model ensemble, but apart from this there are different approaches:
- Leave One Out
Models are built keeping one object at a time out of the analysis and repeating the procedure until all the objects are kept out once.
- Leave Two Out
Models are built keeping two objects at a time out of the analysis and repeating the procedure until all the objects are kept out once in all the combinations of two.
- Fixed Groups
The objects are assigned in a fixed way to N groups, each one containing an equal (or nearly equal) number of objects. Then models are built keeping one of these groups out of the analysis until all the objects are kept out once.
- Random Groups
The objects are assigned in a random way to N groups, each one containing an equal (or nearly equal) number of objects. Then models are built keeping one of these groups out of the analysis until all the objects are kept out once. The formation of the groups and the validation is repeated M times.
In order to choose the most appropriate CV method we have to consider the peculiarities of the QSAR data-sets, in which the objects are often clustered. In this circumstance, the Leave One Out (LOO) or Leave Two Out (LTO) methods have no chance of removing the structure of the data: most of the information contained in the object or pair of objects removed is still inside the model, kept in other objects of the same cluster. This leads to overoptimistic results, although it is also true that the SDEP is obtained in a reproducible way. On the other hand, the Random Groups approach can produce a much better, though more conservative, estimation of the real predictive ability. In other words: the uncertainty of future predictions is numerically worse but much more reliable. The lower the number of groups, the more demanding the validation criterion.
In this latter approach, the number of groups should be fixed in such a way that there is a real chance that complete clusters are removed from the analysis in the reduced models. In addition, the procedure should be repeated many times, in order to obtain stable results. For this procedure the Standard Deviation of SDEP gives an estimate of the dispersion of the SDEP values obtained from different runs.
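The Random Groups procedure can be sketched as follows. To keep the example short, the model being validated is a simple least-squares line on one variable, used here purely as a stand-in for a PLS model; the function names, the seed and the data are hypothetical. Leave One Out corresponds to the special case in which the number of groups equals the number of objects.

```python
import math
import random
import statistics

def fit_line(x, y):
    """Stand-in model (not PLS): least-squares line y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    return my - b * mx, b

def random_groups_sdep(x, y, n_groups, n_runs, seed=0):
    """Repeat random-groups CV n_runs times; return mean and SD of SDEP."""
    rng = random.Random(seed)
    n = len(x)
    sdeps = []
    for _ in range(n_runs):
        idx = list(range(n))
        rng.shuffle(idx)
        # Nearly equal groups; every object is held out exactly once per run.
        groups = [idx[g::n_groups] for g in range(n_groups)]
        press = 0.0
        for held_out in groups:
            keep = [i for i in idx if i not in held_out]
            a, b = fit_line([x[i] for i in keep], [y[i] for i in keep])
            press += sum((y[i] - (a + b * x[i])) ** 2 for i in held_out)
        sdeps.append(math.sqrt(press / n))
    return statistics.mean(sdeps), statistics.stdev(sdeps)

# Hypothetical data set of 10 objects.
x_demo = [float(i) for i in range(1, 11)]
y_demo = [1.1, 2.3, 2.8, 4.2, 5.1, 5.8, 7.3, 7.9, 9.2, 9.8]
mean_sdep, sd_sdep = random_groups_sdep(x_demo, y_demo, n_groups=3, n_runs=20)
loo_mean, loo_sd = random_groups_sdep(x_demo, y_demo, n_groups=10, n_runs=20)
```

With one object per group the result does not depend on how the groups are drawn, so `loo_sd` is zero up to rounding, which reflects the reproducibility of the LOO SDEP. With three groups, `sd_sdep` measures how much the SDEP varies from one random grouping to the next.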
There are also some more details of the computation that may affect the results of CV. When some objects are removed from the data-set to build the reduced models, there are two possibilities: to recalculate the average and weights of each variable for the new, reduced data-set or to use the original averages and weights. The first approach is more time-consuming, but the calculations are more accurate. This method may introduce bias and the reduced model may require one more LV, but only if the groups are formed only once. When the groups are formed in a random way the problem is removed.
Another way to evaluate the predictive ability of the model is to use an external prediction set. In this approach the objects in the original data-set are split up into two groups from the very beginning of the analysis. The first one, the learning set, will be used to build the PLS model. The other, the prediction set, will be used to compare its experimental Y-values with the predictions made by the PLS model. There is no doubt that this technique is a more realistic test of the predictive ability. However, it can be argued that the results depend critically upon how many and which objects are assigned to each group. Also, data-sets in QSAR often contain too few objects and it is not possible to remove objects from the analysis without a loss of information.
7.2.3. Uses of PLS
As for PCA, the best way to examine the information from PLS is graphically, by plotting the matrices obtained. To fully understand and diagnose a PLS model one should look carefully at all the available plots.
- 2D T-U scores plot
This plot represents objects in the space of X-scores (T) against the Y-scores (U). From this plot the User can have a clear idea of the correlation between the X's and the Y's obtained in the model for each one of the LVs. The plot of the first LV is by far the most informative and contains the main relationship between activities and structural descriptors. This information is completely lost if one looks only at the "predicted vs experimental" plot.
The plot can be useful to identify influential objects or clusters of objects (outliers). Usually these objects don't correlate in the first component. Then, the second or third LV is devoted to fit them, and they appear in the T-U scores plot for this LV as some (few) objects completely distinct from the rest of the objects. Accordingly, the T-U score plots for non-significant components show this behaviour.
- 2D and 3D loading plots
These plots represent original variables in the space (two-dimensional or three-dimensional) of the Latent Variables (P).
Remember that the LVs are linear combinations of the original X-variables. The loading of a single variable indicates how much of this variable is included in the LV.
Variables contributing very little to the LVs have small loading values and are plotted around the center of the plot. On the other hand, the variables which contribute most are plotted at the borders of the plot.
Loading plots allow the User to test the "homogeneity" of the contributions of the X-variables to the model. When there is more than one field group of variables, it is useful to plot the loadings highlighting the variables in one of them, to understand how each field contributes to the whole PLS model.
- 2D and 3D weight plots
These plots represent original variables in the space (two-dimensional or three-dimensional) of the weights (W).
The weights (W) represent the coefficients that multiply the X's to best fit the Y's. Therefore, we can say that the loadings better represent the first constraint used to build the PLS model (the representation of the X-matrix) while the weights better represent the second constraint (the fitting of the Y's). Variables with high weights are important for the fitting of the Y's while variables with low weights (those in the center of the plot) are not so important. When there is more than one field group of variables, it is useful to plot the weights highlighting the variables in one of them, to understand how each field contributes to the whole PLS model from the point of view of the fitting.
7.3. Variable selection
The conceptual model underlying 3D-QSAR is that the large number of variables included in the X-matrix somewhat captures the dominating effects due to changes in structure over the actual molecules. However, we should be aware that a large number of variables in the X-matrix have no relationship with the activity and introduce only noise in the description of the molecules.
It should be considered that any X-variable, even if it does not contribute to explaining the Y-variables, certainly contributes to the structure of the X-matrix. As the solution provided by PLS has the constraint of explaining the structure of the X-matrix, this structure only makes it more difficult to find a solution satisfying both constraints. It has been reported that PLS is unable to obtain a good model when a single explanatory variable is hidden in the middle of many others, even if this variable is highly correlated to the Y. One of the possible solutions to this problem is to remove all the noisy variables from the data-set, but it is not so simple to define a criterion or a methodology to distinguish noise from information. The FFD procedure was developed precisely to handle this problem.
7.3.1. The FFD procedure
In short, the FFD procedure is a method for detecting the variables that increase the predictive ability of PLS models. The models obtained by using only variables selected by FFD are more predictive than the PLS models built on all variables.
The FFD procedure involves the following steps:
Obtain an initial PLS model.
Build the design matrix and evaluate the individual contribution of each variable to the predictive ability of the model.
Remove from the X-matrix the variables which do not contribute to increasing the predictive ability and obtain a new PLS model.
- First step: initial PLS model
The first step is to build a PLS model using all the variables and to select the dimensionality that produces the most predictive model.
- Second step: evaluation of the variables
In the second step FFD builds a large number of "reduced models" similar to the complete model but removing some variables. The predictive ability of each model is evaluated using CV and, from these values, FFD relates the predictive ability of the model with the presence or absence of each X-variable.
- Third step: remove undesirable variables
The last step consists in removing from the data set the variables that don't improve the predictive ability of the model. It is clear that all the variables labeled as "excluded variables" should be removed from the analysis, while those labeled as "fixed" will be kept. In the method it is possible to choose between removing or maintaining the uncertain variables, but we suggest removing them.
However, in our experience, when the elimination of variables is forced too much, overfitting might appear in the model. Also, from a theoretical point of view, forcing the variable selection too much can break the PLS formalism, by completely destroying the structure of the X-variables. The User should compare the risks of keeping in the model:
Too many variables. The models are too anchored to explaining the X-structure, the Y's fitting is not good. The models are stable.
Too few variables. The models are not stabilized because we have destroyed the X-structure. The Y's fitting is good, but their predictive ability is doubtful, even if the CV shows good results.
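The idea behind the second step, relating the predictive ability to the presence or absence of each variable, can be illustrated with a deliberately simplified sketch. It makes several assumptions that are not taken from the VolSurf implementation: a full two-level design over a handful of variables stands in for the fractional design, a one-LV PLS model validated by Leave One Out stands in for the complete model, and all names and data are hypothetical.

```python
import math
from itertools import product

def pls1_predict(X_train, y_train, x_new):
    """Fit a one-LV PLS model on the training objects and predict y for x_new."""
    n, k = len(X_train), len(X_train[0])
    mx = [sum(r[j] for r in X_train) / n for j in range(k)]
    my = sum(y_train) / n
    Xc = [[r[j] - mx[j] for j in range(k)] for r in X_train]
    yc = [v - my for v in y_train]
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(k)]
    t = [sum(a * b for a, b in zip(r, w)) for r in Xc]
    tt = sum(v * v for v in t)
    if tt == 0.0:          # no covariance with y: fall back to the mean
        return my
    c = sum(a * b for a, b in zip(yc, t)) / tt
    t_new = sum((x_new[j] - mx[j]) * w[j] for j in range(k))
    return my + c * t_new

def loo_sdep(X, y):
    """Leave One Out SDEP of the one-LV model."""
    press = 0.0
    for i in range(len(X)):
        y_hat = pls1_predict(X[:i] + X[i + 1:], y[:i] + y[i + 1:], X[i])
        press += (y[i] - y_hat) ** 2
    return math.sqrt(press / len(X))

def variable_effects(X, y):
    """Effect of each variable: mean SDEP of the reduced models that include
    it minus mean SDEP of those that exclude it (negative = helpful)."""
    k = len(X[0])
    runs = []
    for mask in product([0, 1], repeat=k):   # full design, stand-in for FFD
        if not any(mask):
            continue
        cols = [j for j in range(k) if mask[j]]
        runs.append((mask, loo_sdep([[r[j] for j in cols] for r in X], y)))
    effects = []
    for j in range(k):
        s_in = [s for m, s in runs if m[j]]
        s_out = [s for m, s in runs if not m[j]]
        effects.append(sum(s_in) / len(s_in) - sum(s_out) / len(s_out))
    return effects

# Hypothetical data: variable 0 tracks y, variables 1 and 2 are noise.
X_demo = [[1.1, 1.0, 1.0], [1.9, -1.0, 1.0], [3.2, 1.0, -1.0], [3.9, -1.0, -1.0],
          [5.1, 1.0, 1.0], [5.8, -1.0, 1.0], [7.2, 1.0, -1.0], [7.9, -1.0, -1.0]]
y_demo = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
effects = variable_effects(X_demo, y_demo)
```

In this toy case the effect of variable 0 is clearly negative, meaning that models containing it predict better on average. Since every effect is computed from CV results, the sketch also shows why the procedure inherits any weakness of the CV scheme used.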
- Additional remarks
FFD is a procedure which works by evaluating the predictive ability of the models using CV and, therefore, it is extremely sensitive to a misuse of the CV. In particular, one should make sure that the data-set does not contain outliers or dangerous clustering. Otherwise, there is the risk of selecting the variables that best predict the outliers, obtaining a model of no general validity.
The variable selection can make models more predictive, but when the initial model is only slightly predictive or not predictive at all, the FFD method cannot improve its quality and would instead derive chance correlations. Our advice is to never apply variable selection to models with low or negative Q^2. In any case, the User should be aware that under these conditions there is a high probability of obtaining spurious results, and he or she should proceed at his/her own risk.