Chapter 14. Menu Modeling (Alt-M)

The Modeling menu contains the commands for performing Principal Component Analysis (PCA), building and validating Partial Least Squares (PLS) models, and using these models to make predictions on external data sets.

14.1. Modeling->Generate PCA model...

PCA is carried out on the whole X-matrix, except for the variables and objects excluded using the commands Pretreatment->Exclude vars. and Pretreatment->Exclude objects. Data is scaled using the last scaling method defined in Pretreatment->Scale..., or autoscaled if no scaling has been defined by the User.

The item in the menu is inactive when the data file contains fewer than 2 objects.

The number of PCs calculated for the model can be selected by the User:

Press the right arrow button to increase the number of components or the left arrow button to decrease it. When the desired model dimensionality appears in the dialog window, press the OK button, or press the Cancel button to abort the operation.

Remember that, in order to be safe, the number of components should always be smaller than one third of the number of objects.

The calculation will take from a few seconds to several minutes, depending on the number of X-variables and the number of objects in the data file. VolSurf reports progress in a working dialog that shows the number of components processed.

After a while VolSurf will display in the main window the results of the PCA:

For each component the following is shown:

XVarExp: percentage of the X-matrix variance explained by this component.

XAccum: cumulative percentage of the X-matrix variance explained by the model.
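As an illustration of what the two columns mean, the following minimal sketch computes them outside VolSurf with NumPy. It assumes autoscaling (each variable centred and divided by its standard deviation) and uses made-up data; the function name pca_explained_variance is hypothetical and the code is not VolSurf's own implementation.

```python
import numpy as np

def pca_explained_variance(X, n_components):
    """Autoscale X, then return per-component and cumulative % variance."""
    X = np.asarray(X, dtype=float)
    # Autoscaling: centre each column and divide by its standard deviation.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # The squared singular values are proportional to the variance
    # captured by each principal component.
    s = np.linalg.svd(Xs, compute_uv=False)
    var_exp = 100.0 * s**2 / np.sum(s**2)
    x_var_exp = var_exp[:n_components]
    x_accum = np.cumsum(x_var_exp)
    return x_var_exp, x_accum

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 5))   # 12 objects, 5 X-variables (made-up data)
n_comp = 3                     # smaller than 12 / 3, as recommended above
x_var_exp, x_accum = pca_explained_variance(X, n_comp)
for a, (v, c) in enumerate(zip(x_var_exp, x_accum), start=1):
    print(f"PC{a}: XVarExp={v:6.2f}  XAccum={c:6.2f}")
```

The cumulative column is simply the running sum of the per-component column, so its last value is the total variance explained by the model at the chosen dimensionality.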

14.2. Modeling->Generate PLS model...

This command generates the PLS model. PLS is carried out on the whole X-matrix, except for the variables and objects excluded using the commands Pretreatment->Exclude vars. and Pretreatment->Exclude objects. Data is scaled using the last scaling method defined in Pretreatment->Scale..., or autoscaled if no scaling has been defined by the User.

The item in the menu is inactive when the data file contains fewer than 2 objects or does not contain Y-variables.

The number of PCs calculated for the model can be selected by the User:

Press the right arrow button to increase the number of components or the left arrow button to decrease it. When the desired model dimensionality appears in the dialog window, press the OK button, or press the Cancel button to abort the operation. Remember that, in order to be safe, the number of components should always be smaller than one third of the number of objects.

The calculation will take from a few seconds to several minutes, depending on the number of X and Y-variables and the number of objects in the data file. VolSurf reports progress in a working dialog window that shows the number of components processed.

After a while VolSurf will display in the main window the results of the PLS:

For each component the following is shown:

XVarExp: percentage of the X-matrix variance explained by this component.

XAccum: cumulative percentage of the X-matrix variance explained by the model.

SDEC: Standard Deviation of Error of Calculations.

r2: Squared correlation coefficient.

Y : Experimental value

Y' : Value calculated by the model

Ȳ : Average value

N : Number of objects
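The formulas for SDEC and r2 are not reproduced in this text. As a minimal sketch, assuming the usual definitions (SDEC as the square root of the mean squared difference between Y and Y', and r2 as one minus the ratio between the residual sum of squares and the sum of squares around the average value), the two statistics could be computed as follows. The function names and the data are illustrative and are not part of VolSurf.

```python
import math

def sdec(y, y_calc):
    """Standard Deviation of Error of Calculations (assumed definition)."""
    n = len(y)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_calc)) / n)

def r2(y, y_calc):
    """Squared correlation coefficient (assumed definition)."""
    y_mean = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_calc))   # sum of (Y - Y')^2
    ss_tot = sum((a - y_mean) ** 2 for a in y)              # sum of (Y - average)^2
    return 1.0 - ss_res / ss_tot

y = [1.0, 2.0, 3.0, 4.0]          # experimental values (made-up)
y_calc = [1.1, 1.9, 3.2, 3.8]     # values calculated by a model (made-up)
print(f"SDEC={sdec(y, y_calc):.3f}  r2={r2(y, y_calc):.3f}")
```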

14.3. Modeling->Validate PLS model...

This command can be accessed only after a fitted PLS model has been generated.

Max. dimensionality

selects the maximum dimensionality of the PLS model to validate. The optimal dimensionality of the model may be less than or equal to this maximum.

Validation mode

selects the cross-validation method used to validate the model. It is possible to choose between:

  • Leave One Out: models are built keeping one object at a time out of the analysis, predicting its Y value, and repeating until every object has been kept out once.

  • Leave Two Out: models are built keeping two objects at a time out of the analysis, predicting their Y values, and repeating until every possible combination of two objects has been kept out once.

  • Random Groups: the objects are assigned at random to N groups, each containing an equal (or nearly equal) number of objects. VolSurf then builds models that keep one of these groups out of the analysis at a time, until every object has been kept out once. The formation of the groups and the validation are repeated M times. The parameters N (number of groups) and M (number of times the whole procedure is repeated) are defined in this dialog window with the Number of groups control and the Num. of SDEP scale.

  • Specific Groups: the objects are assigned by the User to N groups. VolSurf then builds models that keep one of these groups out of the analysis at a time, until every object has been kept out once. The parameter N (number of groups) is defined in this dialog window with the Number of groups control.

    With this last option, immediately after pressing the OK button, the User is prompted to define the groups in the following dialog window:

    The User should proceed as follows:

    • Select a group from the Object Group list.

    • In the Objects: list, click on the names of the objects to include in this group. Each item in the list changes to show the group to which it has been assigned. Notice that, by default, all the objects are assigned to group A.

    • Repeat this procedure until all the objects have been assigned to a group. No group can be empty.

    When all the objects have been assigned to a group, press the OK button to proceed with the validation, or Cancel to abort it.
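The four validation modes differ only in how the objects are split into left-out groups. The sketch below, which is an illustration and not VolSurf's own code, generates those splits for a small series. All function names are hypothetical, and the way random groups are drawn (shuffle, then deal the objects out in turn) is an assumption.

```python
import itertools
import random

def leave_one_out(n):
    """Each object is left out once, on its own."""
    return [(i,) for i in range(n)]

def leave_two_out(n):
    """Every possible pair of objects is left out once."""
    return list(itertools.combinations(range(n), 2))

def random_groups(n, n_groups, rng):
    """Shuffle the objects and deal them into n_groups of (nearly) equal size."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [tuple(sorted(idx[g::n_groups])) for g in range(n_groups)]

def specific_groups(labels):
    """Build the groups from user-assigned labels, e.g. 'AABBC'."""
    groups = {}
    for i, label in enumerate(labels):
        groups.setdefault(label, []).append(i)
    return [tuple(members) for _, members in sorted(groups.items())]

n_objects = 10
loo = leave_one_out(n_objects)
l2o = leave_two_out(n_objects)
rng = random.Random(42)
# Random Groups: N = 3 groups, procedure repeated M = 20 times.
repeats = [random_groups(n_objects, 3, rng) for _ in range(20)]
manual = specific_groups("AABBC")
print(len(loo), len(l2o), len(repeats), manual)
```

Leave Two Out grows quadratically with the number of objects (45 pairs for 10 objects), which is one reason the group-based modes are attractive for larger series.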

Num. of SDEP

this scale is active only when the option Random Groups is selected. The number shown in the scale indicates the number of times that the whole validation procedure will be repeated, as stated above. The default is 20 times.

Number of groups

this control is active only when the option Random Groups or Specific Groups is selected. It specifies the number of groups into which the objects in the data file will be split. We suggest using 5 groups when the number of objects is 20 or larger, and fewer groups when the number of objects is smaller.

Recalculate weights

selecting yes will force VolSurf to recalculate the variable weights in each computation. The results are more reliable and stable although the computation is slightly slower.

When all the settings are correct, press the OK button to start the computation. Press the Cancel button to abort the validation, or the Defaults button to reset all the settings in this dialog window to their default values. Remember that, when the validation uses Specific Groups, a new dialog window will appear to define the groups.

The calculation will take from a few seconds to some minutes, depending on the number of X and Y-variables, the number of objects and, mainly, the validation procedure chosen. Random Groups is the most time-consuming procedure, and its duration also depends on the Num. of SDEP defined. VolSurf reports the progress of the validation in a working dialog that shows the percentage of the calculation completed.

After a while VolSurf will display in the main window the results of the PLS validation:

For each component the following is shown:

SDEP: Standard Deviation of Error of Predictions.

SDEV(sdep): Standard Deviation of SDEP.

q2: Squared predictive correlation coefficient.

Y : Experimental value

Y' : Predicted value

Ȳ : Average value

N : Number of objects
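The formulas for SDEP and q2 are not reproduced in this text. The sketch below assumes the usual definitions (the same expressions as SDEC and r2, but with Y' predicted for objects that were left out of the model) and runs a Random Groups validation with N = 5 groups repeated M = 20 times. Ordinary least squares is used as a stand-in for PLS to keep the example short, and averaging the statistics over the repetitions is an assumption; the function names and data are illustrative and none of this is VolSurf's own code.

```python
import numpy as np

def fit_predict(X_train, y_train, X_test):
    """Stand-in model: ordinary least squares with an intercept (not PLS)."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_test)), X_test]) @ coef

def cross_validate(X, y, groups):
    """Predict every object from a model built without its own group."""
    y_pred = np.empty_like(y)
    for g in groups:
        test = np.asarray(g)
        train = np.setdiff1d(np.arange(len(y)), test)
        y_pred[test] = fit_predict(X[train], y[train], X[test])
    return y_pred

def sdep_q2(y, y_pred):
    """SDEP and q2 under the assumed definitions."""
    press = np.sum((y - y_pred) ** 2)
    sdep = np.sqrt(press / len(y))
    q2 = 1.0 - press / np.sum((y - y.mean()) ** 2)
    return sdep, q2

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))                       # made-up X-variables
y = 1.5 * X[:, 0] - 0.7 * X[:, 1] + rng.normal(scale=0.1, size=20)

sdeps, q2s = [], []
for _ in range(20):                                # M = 20 repetitions
    perm = rng.permutation(len(y))
    groups = [perm[g::5] for g in range(5)]        # N = 5 random groups
    sdep, q2 = sdep_q2(y, cross_validate(X, y, groups))
    sdeps.append(sdep)
    q2s.append(q2)

print(f"SDEP={np.mean(sdeps):.3f}  SDEV(sdep)={np.std(sdeps, ddof=1):.3f}  q2={np.mean(q2s):.3f}")
```

With Leave One Out or Specific Groups the split is deterministic, so there is a single SDEP and no spread to report; the standard deviation of SDEP is meaningful only when the random grouping is repeated.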

14.4. Modeling->External PCA pred...

A PCA model provides a simplified representation of the original X matrix as the product of two matrices: a loading (P) matrix and a scores (T) matrix. The latter can be used to study the structure of the objects in terms of clustering, similarities, outlier detection, etc.

It is possible to apply a given PCA model to an external dataset (X*) in order to obtain "predicted" scores (T*), which can be used to produce scores plots representing both the original series (T) and the external objects (T*). This representation can be seen as a projection of the external series onto the same dimensionally-reduced space obtained for the original series, with a rotation defined by the original loading matrix (P).

In the original PCA model:

X = TP' + E

for an external dataset X*:

X* = T*P' + E*

and using the NIPALS algorithm, for a certain dimension a:

t_a* = X* p_a / (p_a' p_a)
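The projection can be sketched with NumPy as follows. This is an illustration under stated assumptions, not VolSurf's implementation: the loadings are taken from a singular value decomposition rather than from NIPALS, the data is made up, and autoscaling is assumed. The scaling parameters are computed on the original series and reused for the external one, in line with the statement that the scaling used to develop the model is applied to the external data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(15, 4))        # original series (made-up data)
X_ext = rng.normal(size=(5, 4))     # external series X* (made-up data)

# Scaling parameters come from the ORIGINAL data and are reused for X*.
mean, std = X.mean(axis=0), X.std(axis=0, ddof=1)
Xs, Xs_ext = (X - mean) / std, (X_ext - mean) / std

# Loadings P from an SVD of the scaled original matrix (columns are p_a).
_, _, Vt = np.linalg.svd(Xs, full_matrices=False)
P = Vt[:2].T                        # keep two components
T = Xs @ P                          # scores of the original series

# Predicted scores, one component at a time: t_a* = X* p_a / (p_a' p_a)
T_ext = np.column_stack(
    [Xs_ext @ P[:, a] / (P[:, a] @ P[:, a]) for a in range(P.shape[1])]
)
print(T.shape, T_ext.shape)
```

NIPALS deflates the data matrix after each component, but because the loading vectors are mutually orthogonal the deflation does not change the later scores, so projecting the scaled X* directly onto each p_a gives the same T*.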

In VolSurf, the predicted scores can be used to obtain mixed 2D and 3D scores plots (using Plot->2D plots->PCA-scores... and Plot->3D plots->PCA-scores...).

A warning dialog is shown every time a prediction is made, because it is very important that the external data is:

  1. In GOLPE format: VolSurf data can be easily written in this format using the command File->Export data->GOLPE format... .

  2. Raw scaled (see Pretreatment->Scale... command): the same scaling used to develop the model will be automatically applied.

  3. Made up of exactly the same number of X and Y variables as the data used to build the model.

If any of these conditions is not fulfilled, wrong predictions may be obtained!

Once the User acknowledges the warning dialog by clicking OK, a Select a file dialog is opened and the User is asked to select the external data file. See the File->Open data file... command for details about the file selection dialog.

Once the selection has been made, VolSurf lists two lines in the main window for each object and for each model dimensionality: the first shows the percentage of the sum of squares (as %SS), and the second shows the predicted PCA score values. At the end, a summary is also listed of the total percentage of the sum of squares explained for each model dimensionality (as SSExp, and accumulated as SSAcum), referred to the whole external set.

14.5. Modeling->External PLS pred...

Once the PLS model is built it is possible to apply the model to an external data file in order to predict the activity of other molecules. This option is also useful to check the model predictive power on an external validation set, besides the self-consistency of the SDEP procedure along the stepwise process of variable selection.

Please note that the type and the number of variables in the external data file must exactly match the type and number of variables in the data file used to generate the PLS model. This also applies to the Y-variables; therefore, when these are unknown, dummy numerical values should be entered in the external data file.

The command first presents a warning dialog, because it is very important that the external data is:

  1. In GOLPE format: VolSurf data can be easily written in this format using the command File->Export data->GOLPE format... .

  2. Raw scaled (see Pretreatment->Scale... command): the same scaling used to develop the model will be automatically applied.

  3. Made up of exactly the same number of X and Y variables as the data used to build the model.

If any of these conditions is not fulfilled, wrong predictions may be obtained!

Once the User acknowledges the warning dialog, a standard file selection dialog is opened and the User is asked to select the external data file. Once the selection has been made, VolSurf shows in the main window the predicted Y-values for each PLS model dimensionality. See the File->Open data file... command for details about the file selection dialog.

Additionally, in order to use the external data file for validating the PLS model, VolSurf will use the Y-variables provided to calculate the SDEP (external), component by component.

SDEP (external): Standard Deviation of Error of Predictions (external).

Y : Y-value in the external data file.

Y' : Predicted value.

Ȳ : Average value.

N : Number of objects.

If the Y-variables for the external data file are unknown and the User has entered only dummy values, the SDEP value has no meaning and can be ignored.

14.6. Modeling->Project on library model

This option is similar to the File->Direct prediction... command. However, while Direct prediction starts from the molecular structures and automatically produces the predictions, this command only projects a descriptor X matrix on a precomputed model. Two options are available:

  • With spectra...

    choose this option in order to visualize the projected compounds by an X or Y variable value of the projected model. This command offers the same options as selecting Enable projected spectra in the File->Direct prediction... dialog.

  • Without spectra...

    select this command in order not to visualize the projected compounds by an X or Y variable value of the projected model.

With either option the model library window will appear:

After the selection of one model and pressing the Project button the Prediction palette dialog will be shown:

For more details about the Prediction palette dialog see the File->Direct prediction... command.

It should be noted that with this option it is possible to project one precomputed model (e.g. the BBB model) on another one (e.g. the Caco2 model) in order to compare results and/or find new patterns.

14.7. Modeling->Select subset...

When the following dialog appears, select the space in which the objects will be selected:

Space: the User can choose:

  • PCA (option active after generating a PCA model)

  • PLS (option active after generating a PLS model)

  • Descriptors

In the first case the criterion is applied in the PCA scores space, in the second in the PLS scores space, and in the last in the space of the selected descriptors. After pressing OK the Select selection dialog appears:

In the top left window

  • for selection of PCA or PLS space:

    click on the lines to select the model dimension which will be used in the design. Any combination of components (with an upper limit of five) can be selected.

  • for selection of Descriptors space:

    click on the lines to select the descriptors which will be used in the design. Any combination of descriptors (with an upper limit of five) can be selected.

selection method

  • Largest Minimum Distance (LMD)

    This method tries to extract a number of compounds maximizing their mutual distances. Therefore, the selected subset is well spread over all the space covered by the original series, but it does not take into account how that space is populated. Compounds with extreme values are nearly always included. The method can be slow when the number of compounds in the original series is very large (over 10,000). The algorithm was inspired by the work of Marengo et al. (Chemometrics and Intelligent Laboratory Systems, 16, 37, 1992), but uses different computational algorithms.

  • Most Descriptive Compound (MDC)

    This criterion favors a selection scheme that weights the compounds according to their population density. A full description of the method can be found in B.D. Hudson et al., Quant. Struct.-Act. Relat. 15, 285, 1996.
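To give a feeling for what maximizing mutual distances means, the following sketch applies a simple greedy max-min rule: starting from one compound, it repeatedly adds the compound whose distance to the nearest already-selected compound is largest. This is a generic technique chosen for illustration only. It is not the algorithm implemented in VolSurf, nor the one described by Marengo et al., and the function name and coordinates are made up.

```python
import numpy as np

def maxmin_selection(scores, subset_size, start=0):
    """Greedy max-min pick: repeatedly add the point farthest from the subset."""
    scores = np.asarray(scores, dtype=float)
    selected = [start]
    # Distance from every point to its nearest already-selected point.
    min_dist = np.linalg.norm(scores - scores[start], axis=1)
    while len(selected) < subset_size:
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        dist_to_new = np.linalg.norm(scores - scores[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return selected

# Six compounds in a two-component scores space (made-up coordinates).
pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 0.0],
                [0.0, 5.0], [5.0, 5.0], [2.5, 2.5]])
subset = maxmin_selection(pts, 4)
print(sorted(subset))
```

In this toy series the four corner compounds are picked and the near-duplicate at (0.1, 0.0) is skipped, which matches the behaviour described above: extreme compounds are nearly always included, regardless of how densely the space is populated.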

output

when this control is selected, the selected compounds will be stored in an ASCII file called SelectedSeries.txt using the following format:

100
129 caco129 5.602 4.686 1.000
91 caco91 2.906 -33.314 1.000
44 caco044 -2.682 -17.314 -1.000
.................

where 100 is the number of selected compounds; 129, 91 and 44 are the compound numbers; caco129, caco91 and caco044 are the compound names; and 5.602, 4.686 and 1.000 are, respectively, the PC1, PC2 and Y values of the first compound.
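A file in this format is easy to read back with a short script. The sketch below is a hypothetical helper, not part of VolSurf. It assumes whitespace-separated fields and treats every field after the name as a numeric value, since the number of value columns presumably depends on the components or descriptors selected. The sample header is set to 3 because only three of the rows are reproduced here.

```python
from typing import List, NamedTuple

class SelectedCompound(NamedTuple):
    number: int           # compound number in the original series
    name: str             # compound name
    values: List[float]   # e.g. PC1, PC2 and Y in the sample above

def parse_selected_series(text: str) -> List[SelectedCompound]:
    """Parse the contents of a SelectedSeries.txt-style file."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    count = int(rows[0][0])   # first line: number of selected compounds
    compounds = [
        SelectedCompound(int(f[0]), f[1], [float(v) for v in f[2:]])
        for f in rows[1:]
    ]
    if len(compounds) != count:
        raise ValueError(f"header says {count} compounds, found {len(compounds)}")
    return compounds

sample = """3
129 caco129 5.602 4.686 1.000
91 caco91 2.906 -33.314 1.000
44 caco044 -2.682 -17.314 -1.000
"""
compounds = parse_selected_series(sample)
print(compounds[0])
```

The same reader can be used to check a protected or excluded subset file before passing it to the dialog, since those files are stated to share this format.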

selection space

two selections are possible, complete and focused: the former allows an unconstrained selection and the latter allows a selection search constrained by the User.

When focused is selected, a new dialog is presented to define the region of interest on which to make the candidate selection:

By default, all compounds are considered candidates for being selected. By clicking on each compound one can change its status to eligible (in) or not eligible (out). Apart from this method, the dialog includes a number of tools to simplify the selection of compounds:

all in

all the compounds are included in the set of candidates

all out

all the compounds are excluded from the set of candidates

plot 2D

  • for selection of PCA or PLS space:

    generates an interactive 2D plot from which the User can select a subregion of allowed compounds. To select the compounds, place the mouse on the 2D plot and click the central mouse button until a magenta cross symbol appears. Then move the mouse to another position and click the button again. A line will be drawn in the plot. Repeat the procedure until a polygon has been drawn. Be sure to close the polygon (the line color will change to red).

  • for selection of Descriptors space:

    a new dialog appears:

    In the X axis and Y axis lists, click on the descriptors that define the space to use for the subset selection, and click OK to generate an interactive 2D plot. In this plot the User can select a subregion of allowed compounds. To select the compounds, place the mouse on the 2D plot and click the central mouse button until a magenta cross symbol appears. Then move the mouse to another position and click the button again. A line will be drawn in the plot. Repeat the procedure until a polygon has been drawn. Be sure to close the polygon (the line color will change to red).

When finished, press the capture clipboard button on the focus region palette. VolSurf will automatically count the number of compounds included in the region and will update the status (in or out) of the listed compounds.
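Deciding which compounds fall inside the drawn polygon is a standard point-in-polygon test. The sketch below uses the common ray-casting rule (count how many polygon edges a horizontal ray from the point crosses) to label each compound as in or out. It is an illustration of the idea only; how VolSurf performs this test is not stated, and the names and coordinates are made up.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: a point is inside if a ray to the right crosses an odd
    number of polygon edges."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]   # wrap around to close the polygon
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# A closed rectangular region and three compounds in a 2D scores plot.
polygon = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]
scores = {"cpd1": (1.0, 1.0), "cpd2": (5.0, 1.0), "cpd3": (3.5, 2.5)}
status = {
    name: ("in" if point_in_polygon(px, py, polygon) else "out")
    for name, (px, py) in scores.items()
}
print(status)
```

The test is only meaningful for a closed region, which is why the polygon must be closed before the region is captured.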

Expansion:

the slide bar at the bottom of the dialog shows the number of compounds selected so far. If the bar is moved to the right, the series selected as candidates is expanded by adding more compounds, in particular those nearest in the space. This technique is useful if the User wants to perform a selection only in the neighbourhood of a certain compound of interest (a very active one, for example). In this case, it is enough to select the compound in the list and then move the slide bar to choose the size of the subset. When the OK button is pressed, the selected compounds are stored in memory, ready for the subsequent LMD or MDC criterion.
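The expansion around a compound of interest amounts to ranking all compounds by their distance to it and keeping the nearest ones. A minimal sketch, assuming Euclidean distance in the selected space (the distance measure used by VolSurf is not stated) and using made-up coordinates and a hypothetical function name:

```python
import numpy as np

def expand_around(scores, seed_index, size):
    """Return the indices of the `size` compounds closest to the seed
    compound (the seed itself comes first, at distance zero)."""
    scores = np.asarray(scores, dtype=float)
    dist = np.linalg.norm(scores - scores[seed_index], axis=1)
    return np.argsort(dist, kind="stable")[:size].tolist()

pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0], [0.5, 0.5]])
candidates = expand_around(pts, seed_index=0, size=3)
print(candidates)
```

Moving the slide bar to the right corresponds to increasing size, which adds progressively more distant compounds to the set of candidates.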

subset size

move this slide bar to select the size of the subset

starting design (only applicable to the LMD criterion)

When the random button is checked, the LMD criterion is applied a number of times, starting from different randomly selected subsets, and the best subset found is chosen. When the random button is not selected, the method runs just once. In our experience, 10 different starting points are normally sufficient for a good estimation.

protected subset

in this text field the User can enter the name of a protected subset file. This is a file in the same format as SelectedSeries.txt, listing compounds that will always be taken into consideration in the selection phase. Accordingly, the LMD or MDC method will generate a selection of compounds that complements the protected subset in the optimal manner.

excluded subset

in this text field the User can enter the name of an excluded subset file. This is a file in the same format as SelectedSeries.txt, listing compounds that will always be excluded from the selection phase. Accordingly, the LMD or MDC method will generate a selection of compounds without taking the excluded set of chemicals into consideration.

Press the Select button to start the LMD or MDC algorithm, or the Exit button to leave the subset selection. The Plot 2D and Plot 3D buttons show the selected compounds colored in red in 2D or 3D plots, respectively.
