Modeling (Alt-M)

[Generate PCA model...][Generate PLS model...][Validate PLS model....][External PLS pred....][External PCA pred....][Project on library model...] [Select subset...]
The Modeling menu contains the commands for performing Principal Components Analysis (PCA), for building and validating Partial Least Squares (PLS) models, and for using these models to make predictions on external data sets.
Modeling>>>Generate PCA model...
PCA is carried out on the whole X-matrix, except for the variables and objects excluded using the commands Pretreatment>>>Exclude vars. and Pretreatment>>>Exclude objects.... Data is scaled using the last scaling method defined in Pretreatment>>>Scale..., or is Raw scaled (unscaled) if no scaling has been defined by the User.
The item in the menu is insensitive when the data file contains fewer than 2 objects.
The number of PCs calculated for the model can be selected by the User.

Press the right arrow button to increase the number of components or the left arrow button to decrease it. When the desired model dimensionality appears in the dialog window, press the OK button, or press the Cancel button to abort the operation.
Remember that, to be safe, the number of components should always be smaller than one third of the number of objects.
The calculation will take from a few seconds to several minutes, depending on the number of X-variables and the number of objects in the data file. ALMOND reports progress in a working dialog that shows the number of components processed.
After a while ALMOND will display in the main window the results of the PCA:
Principal Component Analysis (PCA) 24 objects 24 X-var
components XVarExp XAccum
1 59.7849 59.7849
2 39.9664 99.7513
3 0.1676 99.9188
4 0.0466 99.9654
5 0.0250 99.9905
For each component the following is shown:
XVarExp Percentage of the X-matrix variance explained by this component.
XAccum Cumulative percentage of the X-matrix variance explained by the model.
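The relationship between the XVarExp and XAccum columns can be illustrated with a small sketch. It assumes a NIPALS-style PCA on a column-centered matrix; the function names and the centering step are illustrative assumptions, not a description of ALMOND's internal code or of its scaling options.

```python
import math

def center_columns(X):
    """Return a column-centered copy of X (a list of equal-length rows)."""
    n, m = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(m)]
    return [[row[j] - means[j] for j in range(m)] for row in X]

def pca_variance_explained(X, n_components, tol=1e-12, max_iter=500):
    """Return (XVarExp, XAccum) as lists of percentages, one per component."""
    E = center_columns(X)
    n, m = len(E), len(E[0])
    total_ss = sum(v * v for row in E for v in row)
    var_exp, var_acc = [], []
    for _component in range(n_components):
        # Start from the residual column with the largest sum of squares.
        col_ss = [sum(row[j] ** 2 for row in E) for j in range(m)]
        j0 = max(range(m), key=col_ss.__getitem__)
        if col_ss[j0] <= tol * total_ss:
            break  # nothing left to explain
        t = [row[j0] for row in E]
        for _iteration in range(max_iter):
            tt = sum(v * v for v in t)
            # Loading: p = E't / t't, then normalized to unit length.
            p = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
            norm = math.sqrt(sum(v * v for v in p))
            p = [v / norm for v in p]
            # Score: t = E p (p has unit length).
            t_new = [sum(E[i][j] * p[j] for j in range(m)) for i in range(n)]
            change = sum((a - b) ** 2 for a, b in zip(t_new, t))
            t = t_new
            if change <= tol * sum(v * v for v in t):
                break
        # Deflate the residual matrix: E <- E - t p'.
        E = [[E[i][j] - t[i] * p[j] for j in range(m)] for i in range(n)]
        pct = 100.0 * sum(v * v for v in t) / total_ss
        var_exp.append(pct)
        var_acc.append(pct + (var_acc[-1] if var_acc else 0.0))
    return var_exp, var_acc

# Toy data: most of the spread lies along the first variable.
X = [[2.0, 0.0], [0.0, 1.0], [-2.0, 0.0], [0.0, -1.0]]
var_exp, var_acc = pca_variance_explained(X, 2)
```

Each XAccum entry is simply the running sum of the XVarExp entries up to that component, as in the listing above.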
Modeling>>>Generate PLS model...
This command generates the PLS model. PLS is carried out on the whole X-matrix, except for the variables and objects excluded using the commands Pretreatment>>>Exclude vars. and Pretreatment>>>Exclude objects.... Data is scaled using the last scaling method defined in Pretreatment>>>Scale..., or is Raw scaled (unscaled) if no scaling has been defined by the User.
The item in the menu is insensitive when the data file contains fewer than 2 objects or does not contain Y-variables.
The number of components calculated for the model can be selected by the User.

Press the right arrow button to increase the number of components or the left arrow button to decrease it. When the desired model dimensionality appears in the dialog window, press the OK button, or press the Cancel button to abort the operation. Remember that, to be safe, the number of components should always be smaller than one third of the number of objects.
The calculation will take from a few seconds to several minutes, depending on the number of X and Y-variables and the number of objects in the data file. ALMOND reports progress in a working dialog window that shows the number of components processed.
After a while ALMOND will display in the main window the results of the PLS:
Partial Least Squares (PLS) 15 objects 449 X-var 1 Y-var
Y1 components XVarExp XAccum SDEC r2
0 0.0000 0.0000 1.0675 0.0000
1 18.7309 18.7309 0.5703 0.7146
2 12.7664 31.4973 0.4179 0.8468
3 19.7530 51.2503 0.3586 0.8871
4 10.4417 61.6920 0.3052 0.9183
5 14.4762 76.1682 0.2760 0.9331
For each component the following is shown:
XVarExp Percentage of the X-matrix variance explained by this component.
XAccum Cumulative percentage of the X-matrix variance explained by the model.
SDEC Standard Deviation of Error of Calculations.
$$\mathrm{SDEC} = \sqrt{\frac{\sum (Y - Y')^2}{N}}$$
r2 Squared correlation coefficient.
$$r^2 = 1 - \frac{\sum (Y - Y')^2}{\sum (Y - \bar{Y})^2}$$
Y : Experimental value
Y' : Value calculated by the model
Ȳ : Average value
N : Number of objects
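As a minimal sketch, the two fitting statistics can be computed as follows, assuming the usual definitions SDEC = sqrt(sum((Y - Y')^2) / N) and r2 = 1 - sum((Y - Y')^2) / sum((Y - Ȳ)^2). These definitions agree with the example listing (for one component, 1 - (0.5703 / 1.0675)^2 is approximately 0.7146), but the function names and the sample data below are hypothetical.

```python
import math

def sdec(y_exp, y_calc):
    """Standard Deviation of Error of Calculations for one model dimensionality."""
    n = len(y_exp)
    press = sum((y - yc) ** 2 for y, yc in zip(y_exp, y_calc))
    return math.sqrt(press / n)

def r2(y_exp, y_calc):
    """Squared correlation coefficient: 1 - residual SS / total SS about the mean."""
    mean = sum(y_exp) / len(y_exp)
    press = sum((y - yc) ** 2 for y, yc in zip(y_exp, y_calc))
    total = sum((y - mean) ** 2 for y in y_exp)
    return 1.0 - press / total

# Hypothetical experimental and calculated values for four objects.
y_exp = [1.0, 2.0, 3.0, 4.0]
y_calc = [1.1, 1.9, 3.2, 3.8]
fit_sdec = sdec(y_exp, y_calc)
fit_r2 = r2(y_exp, y_calc)
```

With zero components the calculated value is the average of Y for every object, so SDEC reduces to the standard deviation of Y and r2 is 0, as in the first row of the listing.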
Modeling>>>Validate PLS model...
This command is available only after a PLS model has been fitted.

Max. dimensionality
Selects the maximum dimensionality of the PLS model to validate. The optimal dimensionality of the model may be less than or equal to this maximum.
Validation mode
Selects the cross-validation method used to validate the model. It is possible to choose between:
- Leave One Out. Models are built leaving one object at a time out of the analysis and predicting its Y value, repeating until every object has been left out once.
- Leave Two Out. Models are built leaving two objects at a time out of the analysis and predicting their Y values, repeating until every possible combination of two objects has been left out once.
- Random Groups. The objects are assigned at random to N groups, each containing an equal (or nearly equal) number of objects. ALMOND then automatically builds models that leave one of these groups out of the analysis, until every object has been left out once. The formation of the groups and the validation are repeated M times. The parameters N (number of groups) and M (number of times the whole procedure is repeated) are defined in this dialog window with the Number of groups control and the Num. of SDEP scale.
- Specific Groups. The objects are assigned by the User to N groups. ALMOND then builds models that leave one of these groups out of the analysis, until every object has been left out once. The parameter N (number of groups) is defined in this dialog window with the Number of groups control.
For this last option only, immediately after pressing the OK button, the User will be prompted to define the groups in a dialog window like this:

The User should proceed as follows:
- Select a group from the Object Group list.
- In the Objects: list, click on the names of the objects to include in this group. Each item in the list changes to show the group to which it has been assigned. Note that, by default, all the objects are assigned to group A.
- Repeat this procedure until all the objects have been assigned to a group. No group can be empty.
When all the objects have been assigned to a group, press the OK button to proceed with the validation, or Cancel to abort it.
Num. of SDEP
This scale is sensitive only when the option Random Groups is selected. The number shown in the scale indicates the number of times that the whole validation procedure will be repeated, as stated above. The default is 20 times.
Number of groups
This control is sensitive only when the option Random Groups or Specific Groups is selected. It specifies the number of groups into which the objects in the data file will be split. We suggest using 5 groups when the number of objects is 20 or larger, and fewer groups when the number of objects is smaller.
Recalculate weights
Selecting yes will force ALMOND to recalculate the variable weights in each computation. The results are more reliable and stable although the computation is slightly slower.
When all the settings are correct, press the OK button to start the computation. Press the Cancel button to abort the validation, or the Defaults button to reset all the settings in this dialog window to their default values. Remember that, when the validation uses Specific Groups, a new dialog window will appear to define the groups.
The calculation will take from a few seconds to some minutes, depending on the number of X and Y-variables, the number of objects and, mainly, the validation procedure chosen. Random Groups is the most time-consuming procedure, and its duration also depends on the Num. of SDEP defined. ALMOND reports the progress of the validation in a working dialog that shows the percentage of the calculation completed.
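The way the validation modes partition the objects can be sketched as follows. This is only an illustration of the descriptions above, with hypothetical function names; how ALMOND actually draws the random groups is not documented here.

```python
import random
from itertools import combinations

def leave_one_out(n_objects):
    """One left-out set per object."""
    return [[i] for i in range(n_objects)]

def leave_two_out(n_objects):
    """One left-out set per possible pair of objects."""
    return [list(pair) for pair in combinations(range(n_objects), 2)]

def random_groups(n_objects, n_groups, rng):
    """Shuffle the objects and deal them into groups of (nearly) equal size."""
    order = list(range(n_objects))
    rng.shuffle(order)
    return [sorted(order[g::n_groups]) for g in range(n_groups)]

def repeated_random_groups(n_objects, n_groups, n_repeats, seed=0):
    """Repeat the random grouping n_repeats times (the Num. of SDEP setting)."""
    rng = random.Random(seed)
    return [random_groups(n_objects, n_groups, rng) for _ in range(n_repeats)]

# Example: 24 objects, 5 groups, 20 repetitions (the default Num. of SDEP).
partitions = repeated_random_groups(24, 5, 20)
```

In each partition, a model would be built once per group with that group left out, so that every object is predicted exactly once per repetition. This also explains why Random Groups is the most time-consuming mode: the number of models grows with both N and M.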
After a while ALMOND will display in the main window the results of the PLS validation:
PLS Model Validation - 5 Random Groups 20 SDEP-calc
Y1 components SDEP SDEV(sdep) q2
0 1.1599 0.0417 -0.1807
1 0.9637 0.0592 0.1850
2 0.9217 0.0888 0.2544
3 0.8738 0.1087 0.3300
4 0.8607 0.0933 0.3498
5 0.8639 0.0732 0.3451
For each component the following is shown:
SDEP Standard Deviation of Error of Predictions.
SDEV(sdep) Standard Deviation of SDEP.
q2 Squared predictive correlation coefficient.
Y : Experimental value
Y' : Predicted value
Ȳ : Average value
N : Number of objects
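In terms of the symbols defined above, the validation statistics are presumably of the usual form shown below. This is a reconstruction under that assumption, with the sums running over the N objects and Y' taken from the models in which the object was left out:

$$\mathrm{SDEP} = \sqrt{\frac{\sum (Y - Y')^2}{N}} \qquad q^2 = 1 - \frac{\sum (Y - Y')^2}{\sum (Y - \bar{Y})^2}$$

The example listings are numerically consistent with this form if both refer to the same Y data: taking the square root of the mean squared deviation of Y from its average as 1.0675 (the SDEC for zero components in the PLS listing), the one-component row gives 1 - (0.9637 / 1.0675)^2, approximately 0.1850. A negative q2, as in the zero-component row, means that the predictions are worse than using the average of Y. SDEV(sdep) is assumed here to be the standard deviation of the SDEP values obtained over the repeated random groupings.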
Modeling>>>External PLS pred....
Once the PLS model is built, it can be applied to an external data file in order to predict the activity of other molecules. This option is also useful for checking the predictive power of the model on an external validation set, in addition to the self-consistency of the SDEP procedure along the stepwise process of variable selection.
Please note that the type and the number of variables in the external data file must exactly match the type and number of variables in the data file used to generate the PLS model. This also applies to the Y-variables; therefore, when these are unknown, some numerical values must still be entered in the external data file.
The command first presents the following warning dialog.
This warning is shown every time a prediction is made, because it is very important that the external data meet the following conditions:
- The data is in GOLPE format. ALMOND data can easily be written in this format using the command File>>Export data>>GOLPE format.
- The data is Raw scaled. The same scaling used to develop the model will be applied automatically.
- The data contains exactly the same number of X and Y variables. In order to obtain the same number of X variables, it is sometimes necessary to define the number of variables in the Calculation Parameters dialog.
If any of these conditions is not fulfilled, wrong predictions may be obtained!
Once the User acknowledges the warning dialog, a standard file selection dialog opens and the User is asked to select the external data file. Once the selection is made, ALMOND shows in the main window the predicted Y-values for each PLS model dimensionality. See the File>>>Open data file command for details about the file selection dialog.
Additionally, in order to use the external data file for validating the PLS model, ALMOND will use the Y-variables provided to calculate the SDEP (external), component by component.
SDEP (external) Standard Deviation of Error of Predictions (external).
Y : Y-value in the external data file.
Y' : Predicted value.
Ȳ : Average value.
N : Number of objects.
If the Y-variables for the external data file are unknown and the User has entered only dummy values, the SDEP value has no meaning and can be ignored.
Modeling>>>External PCA pred....
A PCA model provides a simplified representation of the original X matrix as the product of two matrices: a loading matrix (P) and a scores matrix (T). The latter can be used to detect structure among the objects in terms of clustering, similarities, outliers, etc.
It is possible to apply a given PCA model to an external dataset (X*) in order to obtain "predicted" scores (T*), which can be used to draw scores plots representing both the original series (T) and the external objects (T*). This representation can be seen as a projection of the external series onto the same dimensionally reduced space obtained for the original series, with a rotation defined by the original loading matrix (P).
In the original PCA model:
$$X = TP' + E$$
for an external dataset X*:
$$X^{*} = T^{*}P' + E^{*}$$
and using the NIPALS algorithm, for a certain dimension a:
$$t_a^{*} = \frac{X^{*} p_a}{p_a' p_a}$$
In ALMOND, the predicted scores can be used to obtain mixed scores 2D plots and 3D plots (using plot>>2D plots>>PCA-scores and plot>>3D plots>>>PCA-scores).
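The projection of external objects onto an existing PCA model can be sketched as follows. The function name is hypothetical, and the sketch assumes that the external rows have already been pretreated (centered and scaled) exactly like the data used to build the model. After each component the external matrix is deflated in the NIPALS manner; with mutually orthogonal loadings this gives the same scores as applying the formula to X* directly.

```python
def project_scores(X_ext, loadings):
    """Project external objects onto PCA loadings.

    X_ext    : list of rows, pretreated like the training data.
    loadings : list of loading vectors p_a, one per model dimension.
    Returns (scores, residuals): scores[i][a] is t*_a for object i, and
    residuals[i] is the part of row i not explained by the model (E*).
    """
    residuals = [list(row) for row in X_ext]
    scores = [[] for _ in X_ext]
    for p in loadings:
        pp = sum(w * w for w in p)  # p_a' p_a
        for i, row in enumerate(residuals):
            t = sum(x * w for x, w in zip(row, p)) / pp  # t*_a = x* p_a / p_a' p_a
            scores[i].append(t)
            # Deflate: remove the part of the row explained by this component.
            residuals[i] = [x - t * w for x, w in zip(row, p)]
    return scores, residuals

# Hypothetical one-component model and a single external object.
scores, residuals = project_scores([[5.0, 0.0]], [[0.6, 0.8]])
```

The resulting scores can be plotted together with the scores of the original series, which is what the mixed scores plots represent.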
As for external PLS predictions, the command first presents a warning dialog. This warning is shown every time a prediction is made, because it is very important that the external data meet the following conditions:
- The data is in GOLPE format. ALMOND data can easily be written in this format using the command File>>Export data>>GOLPE format.
- The data is Raw scaled. The same scaling used to develop the model will be applied automatically.
- The data contains exactly the same number of X and Y variables. In order to obtain the same number of X variables, it is sometimes necessary to define the number of variables in the Calculation Parameters dialog.
If any of these conditions is not fulfilled, wrong predictions may be obtained!
Once the User acknowledges the warning dialog, a standard file selection dialog opens and the User is asked to select the external data file. See the File>>>Open data file command for details about the file selection dialog.
Once the selection is made, ALMOND lists in the main window the percentage of the sum of squares explained (as %SS) for each object and for each model dimensionality. At the end, a summary of the total percentage of the sum of squares explained for each model dimensionality (as SSExp, and accumulated as SSAccum) is also listed for the whole external set.
External PCA predictions for /usr/people/prb/FFD.dat
% SS explained for each object and model dimensionality
1 d1a 27.75 53.00 83.76 83.78 95.72
2 d2a 0.89 10.43 67.26 71.03 72.67
6 d3a 43.39 72.35 89.43 90.37 90.69
10 d4a 55.30 57.50 91.28 92.97 94.16
14 d5a 70.95 74.27 82.88 86.43 93.55
16 d6a 0.13 1.27 57.45 60.94 95.75
20 d7a 48.10 53.62 64.73 71.87 84.11
24 d8a 8.85 44.42 69.96 80.41 81.13
26 d9a 55.16 60.05 60.31 84.45 91.85
30 d10a 31.86 40.81 51.21 86.77 97.58
32 d11a 29.88 31.59 39.30 92.04 95.43
36 d12a 6.41 80.44 93.73 94.50 97.95
40 d13a 33.81 61.57 70.95 83.42 88.23
41 d14a 8.36 70.09 71.12 71.51 97.19
45 d15a 40.15 70.65 78.25 83.80 85.30
components SSExp SSAccum
1 30.3948 30.3948
2 24.2823 54.6771
3 17.8427 72.5198
4 10.0963 82.6160
5 8.9537 91.5697
Modeling>>>Project on library model
Any precomputed library model can be used to predict the PCA or PLS scores of the current objects. Please note that, as for external prediction, the type and the number of variables in the library model must exactly match the type and number of variables of the current objects. This also applies to the Y-variables; therefore, when these are unknown, some numerical values must be imported in order to activate PLS prediction.

After selecting a model and pressing the Project button, the prediction palette is displayed.
Modeling>>>Select subset...
The subset selection option allows the User to select a subset of compounds according to statistical design criteria. ALMOND includes two alternative methods:
Largest Minimum Distance (LMD)
This method tries to extract a number of compounds while maximizing their mutual distances. The selected subset is therefore well spread over the whole space covered by the original series, but the method does not take into account how that space is populated. Compounds with extreme values are nearly always included. The method can be slow when the number of compounds in the original series is very large (over 10,000). The algorithm was inspired by the work of Marengo et al. (Chemometrics and Intelligent Laboratory Systems, 16, 37, 1992), but uses different computational algorithms.
Most Descriptive Compounds (MDC)
This criterion favors a selection scheme that weights the compounds according to their population density. A full description of the method can be found in B. D. Hudson et al., Quant. Struct.-Act. Relat. 15, 285 (1996).
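The idea behind the Largest Minimum Distance criterion can be illustrated with a simple greedy sketch. This is not ALMOND's algorithm, which is not documented here, nor the published method of Marengo et al.; it is a common max-min heuristic in which each new compound is the one whose distance to the nearest already selected compound is largest. The function name and the toy coordinates are hypothetical.

```python
import math

def lmd_greedy(points, size, start=0):
    """Greedy max-min selection of `size` point indices, starting from `start`."""
    if not 0 < size <= len(points):
        raise ValueError("size must be between 1 and the number of points")
    selected = [start]
    chosen = {start}
    # Distance from every point to its nearest selected point.
    nearest = [math.dist(p, points[start]) for p in points]
    while len(selected) < size:
        candidates = [i for i in range(len(points)) if i not in chosen]
        nxt = max(candidates, key=lambda i: nearest[i])
        selected.append(nxt)
        chosen.add(nxt)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], math.dist(p, points[nxt]))
    return selected

# Toy scores space (PC1, PC2) for five compounds.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (0.0, 10.0), (5.0, 5.0)]
subset = lmd_greedy(points, 3)
```

In this toy case the two extreme compounds are picked right after the starting one, which matches the remark that compounds with extreme values are nearly always included.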
In both cases, the criteria are applied in a scores space, which can be formed by any chosen dimensions of a previously selected PCA or PLS model. When both PCA and PLS have been performed on an ALMOND model, the User must first decide which model to use. Clicking PCA makes the subset selection work in the PCA space; conversely, clicking PLS makes it run in the PLS space. If the User has built only one model, the PCA or PLS choice is not presented. Once the space is selected, a dialog window like this will be presented:
dimensions
Click on the lines to select the model dimension which will be used in the design. Any combination of components (with an upper limit of five) can be selected.
selection method
Select the Largest Minimum Distance (LMD) or the Most Descriptive Compounds (MDC) criterion, as described above.
output
When this control is selected, the selected compounds will be stored in an ASCII file called SelectedSeries.txt using the following format:
100
129 caco129 5.602 4.686 1.000
91 caco91 2.906 -33.314 1.000
44 caco044 -2.682 -17.314 -1.000
.................
where 100 is the number of selected compounds; 129, 91 and 44 are the compound numbers; caco129, caco91 and caco044 are the compound names; and 5.602, 4.686 and 1.000 are, respectively, the PC1, PC2 and Y values of the first compound.
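A file in this format can be read with a few lines of code. The sketch below assumes whitespace-separated fields, a first line holding the compound count, and one line per compound with the number, the name, the scores of the selected dimensions and finally the Y value. The function name and the dictionary keys are hypothetical.

```python
def parse_selected_series(text):
    """Parse the contents of a SelectedSeries.txt-style file.

    Returns (count, compounds), where count is the number on the first line
    and compounds is a list of dictionaries, one per compound line.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    count = int(lines[0])
    compounds = []
    for ln in lines[1:]:
        fields = ln.split()
        compounds.append({
            "number": int(fields[0]),
            "name": fields[1],
            # All columns between the name and the last one are scores.
            "scores": [float(v) for v in fields[2:-1]],
            # The last column is assumed to be the Y value.
            "y": float(fields[-1]),
        })
    return count, compounds

# Three-compound sample with the same layout as the example above.
sample = """3
129 caco129 5.602 4.686 1.000
91 caco91 2.906 -33.314 1.000
44 caco044 -2.682 -17.314 -1.000
"""
count, compounds = parse_selected_series(sample)
```

Comparing `count` with `len(compounds)` is a simple way to detect a truncated file. Since the protected and excluded subset files use the same format, the same reader would apply to them.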
selection space
Two selections are possible, Complete and Focused:
- Complete: allows an unconstrained selection search.
- Focused: allows a selection search constrained by the User.
When Focused is selected, a new dialog is presented to define the region of interest in which to make the candidate selection.
By default, all compounds are considered candidates for selection. Clicking on a compound changes its status to eligible (in) or not eligible (out). In addition, the dialog includes a number of tools to simplify the selection of compounds:
all in
All the compounds are included in the set of candidates
all out
All the compounds are excluded from the set of candidates
plot 2D
Generates an interactive 2D plot from which the User can select a subregion of allowed compounds. To select the compounds, place the mouse on the 2D plot and click the central mouse button until a magenta cross symbol appears. Then move the mouse to another position and click the button again; a line will be drawn in the plot. Repeat the procedure until a polygon has been drawn. Be sure to close the polygon (the line color will change to red). When finished, press the capture clipboard button on the focus region palette. ALMOND will automatically select the compounds located within the polygon and update the status of the listed compounds (in or out) accordingly.
Expansion:
The slide bar at the bottom of the dialog shows the number of compounds selected so far. If the bar is moved to the right, the series selected as candidates is expanded by adding more compounds, starting with those nearest in the space. This technique is useful if the User wants to perform a selection only in the neighborhood of a certain compound of interest (a very active one, for example). In this case, it is enough to select the compound in the list and then move the slide bar to choose the size of the subset. After pressing the OK button, the selected compounds are stored in memory, ready for the subsequent LMD or MDC selection.
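The expansion around a compound of interest can be pictured as a nearest-neighbor search in the scores space. The sketch below is only an illustration under that assumption: the function name is hypothetical, and the distance measure ALMOND uses is not stated here, so plain Euclidean distance is used.

```python
import math

def expand_candidates(points, seeds, size):
    """Return the seed indices plus their nearest neighbors, up to `size` indices."""
    seed_set = set(seeds)
    others = [i for i in range(len(points)) if i not in seed_set]
    # Rank the remaining compounds by distance to the closest seed.
    others.sort(key=lambda i: min(math.dist(points[i], points[s]) for s in seeds))
    return list(seeds) + others[:max(0, size - len(seeds))]

# Toy scores space: expand around compound 0 to a candidate set of three.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
candidates = expand_candidates(points, [0], 3)
```

Moving the slide bar to the right corresponds to increasing `size` in this picture.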
subset size
Move this slide bar to select the size of the subset
starting design (only applicable for LMD criteria)
When the random button is selected, the LMD criterion is run a number of times starting from different randomly selected subsets, and the best subset found is chosen. When the random button is not selected, the method runs just once. In our experience, 10 different starting points are normally sufficient for a good estimate.
protected subset
In this text field the User can enter the name of a protected subset file. This is a file with the same format as SelectedSeries.txt, listing compounds that will always be included in the selection. Accordingly, the LMD or MDC method will generate a selection of compounds that complements the protected subset in the optimal manner.
excluded subset
In this text field the User can enter the name of an excluded subset file. This is a file with the same format as SelectedSeries.txt, listing compounds that will always be excluded from the selection. Accordingly, the LMD or MDC method will generate a selection of compounds without taking the excluded compounds into consideration.
Press the Select button to start the LMD or MDC algorithm, or the Exit button to leave the subset selection. The Plot 2D and Plot 3D buttons show the selected compounds colored in red in 2D or 3D plots, respectively.