Modeling

Alt-M

 

Modeling Menu

[Generate PCA model...] [Generate PLS model...] [Validate PLS model...] [External PLS pred....] [External PCA pred....] [Project on library model...] [Select subset...]

 

The Modeling menu contains the commands for performing Principal Components Analysis (PCA), for building and validating Partial Least Squares (PLS) models, and for using these models to make predictions on external data sets.

 


 

Modeling>>>Generate PCA model...

 

PCA is carried out on the whole X-matrix, except for the variables and objects excluded using the commands Pretreatment>>>Exclude vars. and Pretreatment>>>Exclude objects. The data are scaled using the last scaling method defined in Pretreatment>>>Scale, or left raw scaled (unscaled) if no scaling has been defined by the User.

The item in the menu is insensitive when the data file contains fewer than 2 objects.

 

The number of PCs calculated for the model can be selected by the User.

 

PCA Dialog

 

Press the right arrow button to increase the number of components or the left arrow button to decrease it. When the desired model dimensionality appears in the dialog window, press the OK button, or press the Cancel button to abort the operation.

Remember that, in order to be safe, the number of components should always be smaller than one third of the number of objects.

 

 

The calculation will take from a few seconds to several minutes, depending on the number of X-variables and the number of objects in the data file. ALMOND reports the progress in a working dialog that shows the number of components processed.

 

After a while ALMOND will display the results of the PCA in the main window:

 

 

Principal Component Analysis (PCA)   24 objects     24 X-var
      components    XVarExp     XAccum
          1        59.7849     59.7849
          2        39.9664     99.7513
          3         0.1676     99.9188
          4         0.0466     99.9654
          5         0.0250     99.9905

 

 

For each component, the listing shows:

 

XVarExp: Percentage of the X-matrix variance explained by this component.

 

XAccum: Cumulative percentage of the X-matrix variance explained by the model.
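
The relationship between the two columns can be checked with a few lines of Python. This is only an illustrative sketch: the function name cumulative_percent is hypothetical, and the input values are copied from the example listing above.

```python
from itertools import accumulate


def cumulative_percent(xvarexp):
    """Running total of per-component percentages (XVarExp -> XAccum)."""
    return list(accumulate(xvarexp))


# XVarExp column copied from the example PCA listing.
xvarexp = [59.7849, 39.9664, 0.1676, 0.0466, 0.0250]
xaccum = cumulative_percent(xvarexp)

for a, (explained, total) in enumerate(zip(xvarexp, xaccum), start=1):
    print(f"{a:>11d} {explained:>14.4f} {total:>11.4f}")
```

Because the listing prints rounded values, the recomputed totals may differ from the XAccum column in the last decimal place.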

 


 

Modeling>>>Generate PLS model

 

This command generates the PLS model. PLS is carried out on the whole X-matrix, except for the variables and objects excluded using the commands Pretreatment>>>Exclude vars. and Pretreatment>>>Exclude objects. The data are scaled using the last scaling method defined in Pretreatment>>>Scale, or left raw scaled (unscaled) if no scaling has been defined by the User.

The item in the menu is insensitive when the data file contains fewer than 2 objects or does not contain Y-variables.

 

The number of components calculated for the model can be selected by the User.

 

PLS Dialog

 

Press the right arrow button to increase the number of components or the left arrow button to decrease it. When the desired model dimensionality appears in the dialog window, press the OK button, or press the Cancel button to abort the operation. Remember that, in order to be safe, the number of components should always be smaller than one third of the number of objects.

 

 

The calculation will take from a few seconds to several minutes, depending on the number of X and Y-variables and the number of objects in the data file. ALMOND reports the progress in a working dialog window that shows the number of components processed.

 

After a while ALMOND will display the results of the PLS in the main window:

 

Partial Least Squares        (PLS)   15 objects     449 X-var     1 Y-var
Y1    components    XVarExp     XAccum      SDEC       r2
          0         0.0000      0.0000     1.0675     0.0000
          1        18.7309     18.7309     0.5703     0.7146
          2        12.7664     31.4973     0.4179     0.8468
          3        19.7530     51.2503     0.3586     0.8871
          4        10.4417     61.6920     0.3052     0.9183
          5        14.4762     76.1682     0.2760     0.9331

 

 

For each component, the listing shows:

 

XVarExp: Percentage of the X-matrix variance explained by this component.

 

XAccum: Cumulative percentage of the X-matrix variance explained by the model.

 

SDEC: Standard Deviation of Error of Calculations.

SDEC = sqrt( Σ (Y - Y')^2 / N )

 

r2: Squared correlation coefficient.

r2 = 1 - Σ (Y - Y')^2 / Σ (Y - Ȳ)^2

 

Y : Experimental value

Y' : Value calculated by the model

Ȳ : Average value

N : Number of objects
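
As a minimal sketch, the two fitting statistics can be computed from the quantities defined above (Y, Y', the average value and N). The code assumes the usual definitions SDEC = sqrt(Σ(Y - Y')^2 / N) and r2 = 1 - Σ(Y - Y')^2 / Σ(Y - average)^2, which agree with the numbers in the example listing; the function names and the sample data are hypothetical.

```python
import math


def sdec(y, y_calc):
    """Standard Deviation of Error of Calculations (assumed definition)."""
    n = len(y)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_calc)) / n)


def r2(y, y_calc):
    """Squared correlation coefficient, 1 - SS(residual) / SS(total)."""
    mean = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_calc))
    ss_tot = sum((a - mean) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical experimental and calculated values for four objects.
y = [1.0, 2.0, 3.0, 4.0]
y_calc = [1.1, 1.9, 3.2, 3.8]

print(f"SDEC = {sdec(y, y_calc):.4f}")
print(f"r2   = {r2(y, y_calc):.4f}")
```

With these definitions, r2 for a given number of components equals 1 - (SDEC / SDEC0)^2, where SDEC0 is the value in the row for 0 components. In the example listing, 1 - (0.5703 / 1.0675)^2 is approximately 0.7146, the r2 reported for one component.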


 

Modeling>>>Validate PLS model

 

This command can be accessed only after a PLS model has been fitted.

 

Model Validation Dialog

 

Max. dimensionality

Selects the maximum dimensionality of the PLS model to validate. The optimal dimensionality of the model may be less than or equal to this maximum.

 

Validation mode

Selects the cross-validation method used to validate the model. It is possible to choose between:

 

 

Only with this last option, the User will be prompted, immediately after pressing the OK button, to define the groups in a dialog window like this:

 

Groups Dialog

 

The User should proceed as follows:

 

 

When all the objects have been assigned to a group, press the OK button to proceed with the validation, or Cancel to abort it.

 

Num. of SDEP

This scale is sensitive only when the Random Groups option is selected. The number shown on the scale is the number of times the whole validation procedure will be repeated, as stated above. The default is 20.

 

Number of groups

This control is sensitive only when the Random Groups or Specific Groups option is selected. It specifies the number of groups into which the objects in the data file will be split. We suggest using 5 groups when the number of objects is 20 or larger, and fewer groups when the number of objects is smaller.

 

Recalculate weights

Selecting yes will force ALMOND to recalculate the variable weights in each computation. The results are more reliable and stable although the computation is slightly slower.

 

 

When all the settings are correct, press the OK button to start the computation. Press the Cancel button to abort the validation, or the Defaults button to reset all the settings in this dialog window to their default values. Remember that, when the validation uses the Specific Groups option, a new dialog window will appear to define the groups.

 

The calculation will take from a few seconds to some minutes, depending on the number of X and Y-variables, the number of objects and, mainly, on the validation procedure chosen. Random Groups is the most time-consuming procedure, and its duration also depends on the Num. of SDEP defined. ALMOND reports the progress of the validation in a working dialog that shows the percentage of the calculation completed.

After a while ALMOND will display the results of the PLS validation in the main window:

 

 

PLS Model Validation - 5 Random Groups   20 SDEP-calc


Y1    components    SDEP        SDEV(sdep)    q2
          0        1.1599      0.0417       -0.1807
          1        0.9637      0.0592        0.1850
          2        0.9217      0.0888        0.2544
          3        0.8738      0.1087        0.3300
          4        0.8607      0.0933        0.3498
          5        0.8639      0.0732        0.3451

 

For each component, the listing shows:

 

SDEP: Standard Deviation of Error of Predictions.

SDEP = sqrt( Σ (Y - Y')^2 / N )

 

SDEV(sdep): Standard Deviation of SDEP.

 

q2: Squared predictive correlation coefficient.

q2 = 1 - Σ (Y - Y')^2 / Σ (Y - Ȳ)^2

 

Y : Experimental value

Y' : Predicted value

Ȳ : Average value

N : Number of objects
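
The Random Groups settings described above (Number of groups, Num. of SDEP) can be illustrated with the following sketch. It is not ALMOND's implementation. It assumes that the objects are shuffled and dealt into groups, that each group is left out in turn and predicted by a model built on the remaining objects, and that SDEP and q2 are averaged over the repetitions. A one-variable least-squares line stands in for the PLS model, and all names and data are hypothetical.

```python
import math
import random
import statistics


def fit_line(xs, ys):
    """Least-squares line y = a + b*x; a stand-in for the PLS model."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return my - b * mx, b


def random_groups_validation(xs, ys, n_groups=5, n_repeats=20, seed=0):
    """Return (mean SDEP, standard deviation of SDEP, q2) over the repeats."""
    rng = random.Random(seed)
    n = len(ys)
    y_mean = statistics.fmean(ys)
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    sdeps, press_values = [], []
    for _ in range(n_repeats):
        order = list(range(n))
        rng.shuffle(order)
        # Deal the shuffled objects into n_groups groups of similar size.
        groups = [order[g::n_groups] for g in range(n_groups)]
        press = 0.0
        for left_out in groups:
            held = set(left_out)
            train = [i for i in range(n) if i not in held]
            a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
            # Accumulate squared prediction errors for the left-out objects.
            press += sum((ys[i] - (a + b * xs[i])) ** 2 for i in left_out)
        sdeps.append(math.sqrt(press / n))
        press_values.append(press)
    sdev = statistics.stdev(sdeps) if n_repeats > 1 else 0.0
    q2 = 1.0 - statistics.fmean(press_values) / ss_tot
    return statistics.fmean(sdeps), sdev, q2


# Hypothetical data: 20 objects, one descriptor, a noisy linear response.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 + (0.5 if i % 2 else -0.5) for i, x in enumerate(xs)]
sdep, sdev, q2 = random_groups_validation(xs, ys)
print(f"SDEP = {sdep:.4f}  SDEV(sdep) = {sdev:.4f}  q2 = {q2:.4f}")
```

Because every repetition uses a different random split, SDEP varies from run to run; the spread of those values is what a column such as SDEV(sdep) summarizes.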

 


 

Modeling>>>External PLS pred.

 

Once the PLS model is built, it can be applied to an external data file in order to predict the activity of other molecules. This option is also useful for checking the predictive power of the model on an external validation set, in addition to the internal consistency check provided by the SDEP procedure during the stepwise process of variable selection.

Please note that the type and the number of variables in the external data file must exactly match the type and number of variables in the data file used to generate the PLS model. This also applies to the Y-variables; therefore, when these are unknown, dummy numerical values should be entered in the external data file.

 

The command first presents the following warning dialog.

 

Ext PLS warning

 

This warning is shown every time a prediction is made because it is very important that the external data are:

  1. In GOLPE format. ALMOND data can be easily written in this format using the command File>>>Export data>>>GOLPE format.
  2. Raw scaled. The same scaling used to develop the model will be automatically applied.
  3. Consistent with the model data. They must contain exactly the same number of X and Y variables. To obtain the same number of X variables, it is sometimes necessary to define the number of variables in the Calculation Parameters dialog.

If any of these conditions is not fulfilled, wrong predictions may be obtained!

 

Once the User acknowledges the warning dialog, a standard file selection dialog will open and the User will be asked to select the external data file. Once the selection has been made, ALMOND will show in the main window the predicted Y-values for each PLS model dimensionality. See the File>>>Open data file command for details about the file selection dialog.

 

Additionally, in order to use the external data file for validating the PLS model, ALMOND will use the Y-variables provided to calculate the SDEP (external), component by component.

 

SDEP (external) Standard Deviation of Error of Predictions (external).

 

Y : Y-value in the external data file.

Y' : Predicted value.

Ȳ : Average value.

N : Number of objects.

 

If the Y-variables for the external data file are unknown and the User entered only dummy values, the SDEP value has no meaning and can be ignored.

 


 

Modeling>>>External PCA pred.

 

A PCA model provides a simplified representation of the original X matrix as the product of two matrices: a loading (P) matrix and a scores (T) matrix. The latter can be used for detecting the structure of the objects in terms of clustering, similarities, outliers, etc.

It is possible to apply a given PCA model to an external dataset (X*) in order to obtain "predicted" scores (T*), which can be used to draw scores plots representing both the original series (T) and the external objects (T*). This representation can be seen as a projection of the external series into the same dimensionally reduced space obtained for the original series, with a rotation defined by the original loading matrix (P).

 

In the original PCA model:

X = T P' + E

For an external dataset X*:

X* = T* P' + E*

and, using the NIPALS algorithm, the predicted scores for a given dimension a are:

t_a* = X* p_a / (p_a' p_a)

where p_a is the a-th column of P.
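
The projection formula can be sketched in plain Python as follows. This is an illustration under stated assumptions, not ALMOND's code: the external data are assumed to be already scaled like the model data, and X* is deflated after each component (X* minus t_a* p_a'), as is customary in NIPALS. The function name project_scores and the numbers are hypothetical.

```python
def project_scores(x_ext, loadings):
    """Project external objects onto PCA loadings.

    x_ext    : list of rows (objects x variables), scaled like the model data
    loadings : list of loading vectors p_a, one per component
    Returns (scores T*, residual E*) as lists of rows.
    """
    residual = [row[:] for row in x_ext]
    scores = [[] for _ in x_ext]
    for p in loadings:
        pp = sum(v * v for v in p)  # p_a' p_a
        for i, row in enumerate(residual):
            # t_a* = x* p_a / (p_a' p_a) for this object
            t = sum(xv * pv for xv, pv in zip(row, p)) / pp
            scores[i].append(t)
            # Deflation (assumption): remove the part explained by component a.
            residual[i] = [xv - t * pv for xv, pv in zip(row, p)]
    return scores, residual


# Hypothetical example: one external object, two orthogonal loadings.
t_ext, e_ext = project_scores([[3.0, 4.0]], [[1.0, 0.0], [0.0, 1.0]])
print(t_ext)  # predicted scores of the external object
print(e_ext)  # what the model leaves unexplained
```

When the loadings are mutually orthogonal, as in PCA, the deflation step does not change the scores; it is kept here because it also yields the residual matrix E*.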

 

In ALMOND, the predicted scores can be used to obtain mixed 2D and 3D scores plots (using plot>>>2D plots>>>PCA-scores and plot>>>3D plots>>>PCA-scores).

 

Ext PCA warning

 

This warning is shown every time a prediction is made because it is very important that the external data are:

  1. In GOLPE format. ALMOND data can be easily written in this format using the command File>>>Export data>>>GOLPE format.
  2. Raw scaled. The same scaling used to develop the model will be automatically applied.
  3. Consistent with the model data. They must contain exactly the same number of X and Y variables. To obtain the same number of X variables, it is sometimes necessary to define the number of variables in the Calculation Parameters dialog.

If any of these conditions is not fulfilled, wrong predictions may be obtained!

 

Once the User acknowledges the warning dialog, a standard file selection dialog will open and the User will be asked to select the external data file. See the File>>>Open data file command for details about the file selection dialog.

Once the selection has been made, ALMOND will list in the main window the percentage of the sum of squares explained (as % SS) for each object and for each model dimensionality. At the end, a summary of the total percentage of the sum of squares explained for each model dimensionality is also listed (as SSExp, and accumulated as SSAccum), referred to the whole external set.

 

External PCA predictions for /usr/people/prb/FFD.dat

% SS explained for each object and model dimensionality
1    d1a      27.75    53.00    83.76    83.78    95.72
2    d2a       0.89    10.43    67.26    71.03    72.67
6    d3a      43.39    72.35    89.43    90.37    90.69
10   d4a      55.30    57.50    91.28    92.97    94.16
14   d5a      70.95    74.27    82.88    86.43    93.55
16   d6a       0.13     1.27    57.45    60.94    95.75
20   d7a      48.10    53.62    64.73    71.87    84.11
24   d8a       8.85    44.42    69.96    80.41    81.13
26   d9a      55.16    60.05    60.31    84.45    91.85
30   d10a     31.86    40.81    51.21    86.77    97.58
32   d11a     29.88    31.59    39.30    92.04    95.43
36   d12a      6.41    80.44    93.73    94.50    97.95
40   d13a     33.81    61.57    70.95    83.42    88.23
41   d14a      8.36    70.09    71.12    71.51    97.19
45   d15a     40.15    70.65    78.25    83.80    85.30

      components     SSExp     SSAccum
          1        30.3948     30.3948
          2        24.2823     54.6771
          3        17.8427     72.5198
          4        10.0963     82.6160
          5         8.9537     91.5697
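
One plausible way to obtain figures of this kind is sketched below. It assumes that the percentage of sum of squares explained is 100 * (1 - SS(residual) / SS(original)), computed per object and for the whole external set after each component; whether ALMOND uses exactly this formula is an assumption. The function name and the data are hypothetical, and every object is assumed to have a non-zero sum of squares.

```python
def ss_explained(x_ext, loadings):
    """Return (% SS per object, SSExp, SSAccum) for external data.

    x_ext    : list of rows, scaled like the model data
    loadings : list of loading vectors, one per component
    """
    residual = [row[:] for row in x_ext]
    total_obj = [sum(v * v for v in row) for row in x_ext]
    total_all = sum(total_obj)
    per_object = [[] for _ in x_ext]
    ss_accum = []
    for p in loadings:
        pp = sum(v * v for v in p)
        for i, row in enumerate(residual):
            t = sum(xv * pv for xv, pv in zip(row, p)) / pp
            residual[i] = [xv - t * pv for xv, pv in zip(row, p)]
            left = sum(v * v for v in residual[i])
            # Cumulative % SS explained for object i up to this component.
            per_object[i].append(100.0 * (1.0 - left / total_obj[i]))
        left_all = sum(v * v for row in residual for v in row)
        ss_accum.append(100.0 * (1.0 - left_all / total_all))
    # Per-component contribution is the increase of the accumulated value.
    ss_exp = [b - a for a, b in zip([0.0] + ss_accum, ss_accum)]
    return per_object, ss_exp, ss_accum


# Hypothetical external set: two objects, two variables, two components.
per_object, ss_exp, ss_accum = ss_explained(
    [[3.0, 4.0], [1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]]
)
for row in per_object:
    print("  ".join(f"{v:6.2f}" for v in row))
for a, (e, acc) in enumerate(zip(ss_exp, ss_accum), start=1):
    print(f"{a:>11d} {e:>14.4f} {acc:>11.4f}")
```

As in the listing above, the per-object values grow from left to right because each column includes all the components up to that dimensionality.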

 


Modeling>>>Project on library model

Any precomputed library model can be used to predict the PCA or PLS scores of the current objects. Please note that, as for external prediction, the type and the number of variables in the library model must exactly match the type and number of variables of the current objects. This also applies to the Y-variables; therefore, when these are unknown, some numerical values should be imported in order to activate PLS prediction.

 


 

After a model has been selected and the Project button pressed, the prediction palette will be displayed.

 


Modeling>>>Select subset

 

The subset selection option allows the User to select a subset of compounds according to statistical design criteria. ALMOND includes two alternative methods:

Largest Minimum Distance (LMD)
This method tries to extract a number of compounds that maximizes their mutual distances. The selected subset is therefore well spread over the whole space covered by the original series, but the method does not take into account how that space is populated. Compounds with extreme values are nearly always included. The method can be slow when the number of compounds in the original series is very large (over 10,000). The algorithm was inspired by the work of Marengo et al. (Chemometrics and Intelligent Laboratory Systems, 16, 37, 1992), but uses different computational algorithms.

Most Descriptive Compounds (MDC)
This criterion favors a selection scheme that weights the compounds according to their population density. A full description of the method can be found in B.D. Hudson et al., Quant. Struct.-Act. Relat. 15, 285, 1996.
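
To give an idea of what maximizing mutual distances means, here is a simple greedy max-min sketch in Python. It is not the algorithm used by ALMOND (which, as stated above, relies on its own computational algorithms) and it does not guarantee an optimal subset. The function name, the start parameter and the coordinates are hypothetical.

```python
import math


def lmd_greedy(points, k, start=0):
    """Greedily pick k points, each time taking the point farthest
    from its nearest already-selected neighbour."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    selected = [start]
    # Distance from every point to the closest selected point so far.
    nearest = [math.dist(p, points[start]) for p in points]
    while len(selected) < k:
        candidates = [i for i in range(len(points)) if i not in selected]
        best = max(candidates, key=lambda i: nearest[i])
        selected.append(best)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], math.dist(p, points[best]))
    return selected


# Hypothetical scores of five compounds on two components.
scores = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (5.0, 5.0), (0.5, 0.2)]
subset = lmd_greedy(scores, k=3)
print(subset)
```

In this toy example the compounds close to the starting one are skipped in favor of the distant ones, which shows why compounds with extreme values are nearly always included.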

In both cases, the criteria are applied in a scores space, which can be formed by any chosen dimensionality of a previously computed PCA or PLS model. When both PCA and PLS have been performed on an ALMOND model, the User must first decide which model to use. Clicking PCA forces the subset selection to work in the PCA model space. Conversely, clicking PLS runs the method in the PLS space. If only one model has been built by the User, the PCA or PLS choice will not be offered. Once the space is selected, a dialog window like this will be presented:

 

Subset selection dialog

 

dimensions

Click on the lines to select the model dimensions that will be used in the design. Any combination of components (with an upper limit of five) can be selected.

selection method

Select the Largest Minimum Distance (LMD) or the Most Descriptive Compounds (MDC) criterion, as described above.

output

When this control is selected, the selected compounds will be stored in an ASCII file called SelectedSeries.txt using the following format:

100
129 caco129 5.602 4.686 1.000
91 caco91 2.906 -33.314 1.000
44 caco044 -2.682 -17.314 -1.000
.................

where 100 is the number of selected compounds; 129, 91 and 44 are the compound numbers; caco129, caco91 and caco044 are the compound names; and 5.602, 4.686 and 1.000 are, respectively, the PC1, PC2 and Y values of the first compound.
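
A file in this format can be read with a short script such as the one below. This is a sketch that assumes the layout described above (first line: number of selected compounds; each following line: compound number, name, scores for the selected dimensions, Y value). The function name parse_selected_series is hypothetical, and the sample text reproduces only the three records shown in this manual.

```python
def parse_selected_series(text):
    """Parse the contents of a SelectedSeries.txt-style file.

    Returns (declared_count, records), where each record is
    (compound_number, compound_name, [values...]).
    """
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    declared_count = int(lines[0][0])
    records = []
    for fields in lines[1:]:
        # Skip anything that is not "number name value..." (e.g. elision dots).
        if len(fields) < 3 or not fields[0].isdigit():
            continue
        values = [float(v) for v in fields[2:]]
        records.append((int(fields[0]), fields[1], values))
    return declared_count, records


sample = """100
129 caco129 5.602 4.686 1.000
91 caco91 2.906 -33.314 1.000
44 caco044 -2.682 -17.314 -1.000
"""

count, records = parse_selected_series(sample)
for number, name, values in records:
    print(number, name, values)
```

To read a real file, pass its contents to the function, for example parse_selected_series(open("SelectedSeries.txt").read()).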

selection space

Two selections are possible: complete and focused.

When focused is selected, a new dialog is presented to define the region of interest in which the candidate selection is made.

 

Focused design dialog

 

By default, all compounds are considered candidates for being selected. By clicking on top of each compound one can change its status to eligible (in) or not eligible (out). Apart from this method, the dialog includes a number of tools to simplify the selection of compounds:

all in

All the compounds are included in the set of candidates.

all out

All the compounds are excluded from the set of candidates.

plot 2D

Generates an interactive 2D plot from which the User can select a subregion of allowed compounds. To select the compounds, place the mouse on the 2D plot and click the central mouse button until a magenta cross symbol appears. Then move the mouse to another position and click the button again. A line will be drawn in the plot. Repeat the procedure until a polygon has been drawn. Be sure to close the polygon (the line color will change to red). When finished, press the capture clipboard button on the focus region palette. ALMOND will automatically select the compounds situated within the polygon and will update the status of the listed compounds (in or out) accordingly.
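
The idea of marking as "in" the compounds that fall inside a closed polygon can be sketched with a standard ray-casting test. This is only an illustration of the concept, not ALMOND's implementation; the function name, the compound names and the coordinates are hypothetical.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: True if point lies inside the closed polygon.

    polygon is a list of (x, y) vertices; the last vertex is joined
    to the first one automatically.
    """
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray from the point cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical 2D scores and a square region drawn by the User.
scores = {"cmp1": (2.0, 2.0), "cmp2": (5.0, 2.0), "cmp3": (1.0, 3.5)}
region = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
status = {name: ("in" if point_in_polygon(xy, region) else "out")
          for name, xy in scores.items()}
print(status)
```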

Expansion:

The slide bar at the bottom of the dialog shows the number of compounds selected so far. If the bar is moved to the right, the series selected as candidates is expanded by adding more compounds, in particular those nearest in the space. This technique is useful if the User wants to perform a selection only in the influence space of a certain compound of interest (a very active one, for example). In this case, it is enough to select the compound in the list and then move the slide bar to choose the size of the subset. When the OK button is pressed, the selected compounds are stored in memory, ready for the subsequent LMD or MDC selection.
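
The expansion around a compound of interest can be pictured as a nearest-neighbour ranking in the scores space. The sketch below assumes Euclidean distance, which this manual does not state explicitly; the function name and the data are hypothetical.

```python
import math


def expand_candidates(scores, seed_name, size):
    """Return the names of the `size` compounds closest to the seed
    compound in the scores space (the seed itself comes first)."""
    seed = scores[seed_name]
    ranked = sorted(scores, key=lambda name: math.dist(scores[name], seed))
    return ranked[:size]


# Hypothetical scores of four compounds on two components.
scores = {
    "active1": (0.0, 0.0),
    "cmpA": (1.0, 0.0),
    "cmpB": (5.0, 5.0),
    "cmpC": (0.5, 0.0),
}
print(expand_candidates(scores, "active1", 3))
```

Increasing the size argument plays the role of moving the slide bar to the right: compounds are added in order of increasing distance from the compound of interest.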

subset size

Move this slide bar to select the size of the subset.

starting design (only applicable for LMD criteria)

When the random button is selected, the LMD criterion is run a number of times, each time starting from a different randomly selected subset, and the best subset found is chosen. When the random button is not selected, the method runs just once. In our experience, 10 different starting points are normally sufficient for a good estimation.

protected subset

In this text field the User can enter the name of a protected subset file. This is a file in the same format as SelectedSeries.txt, listing compounds that will always be taken into consideration in the selection phase. Accordingly, the LMD or MDC method will generate a selection of compounds that optimally complements the protected subset.

excluded subset

In this text field the User can enter the name of an excluded subset file. This is a file in the same format as SelectedSeries.txt, listing compounds that will always be excluded from the selection phase. Accordingly, the LMD or MDC method will generate a selection of compounds without taking the excluded compounds into consideration.

 

Press the Select button to start the LMD or MDC algorithm, or the Exit button to leave the subset selection. The Plot 2D and Plot 3D buttons show the selected compounds colored in red in 2D or 3D plots, respectively.

 
