ipls Documentation


PLS_Toolbox Documentation: ipls

ipls

Purpose

Interval PLS (IPLS) and forward/reverse MLR variable selection.

Synopsis

results = ipls(X,Y,int_width,maxlv,options)

results = ipls(X,Y,int_width,maxlv,numintervals,options)

[use,fit,lvs,intervals,intcv,intlv] = ipls(X,Y,int_width,maxlv,options)

Description

Performs forward or reverse selection of variable windows ("intervals") based on the RMSECV obtained for each individual window. Multiple windows can also be selected iteratively by setting options.numintervals. The options.algorithm setting allows this function to behave as an IPLS or IPCR algorithm, or as a forward/reverse MLR variable selection algorithm. The default is PLS; setting options.algorithm = 'mlr' switches to MLR mode. See the other options below.
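The forward selection idea described above can be sketched in base MATLAB. This is only an illustration under stated assumptions, not the toolbox implementation: it uses ordinary least squares as a stand-in for MLR mode, interleaved split labels as a simple cross-validation scheme, and synthetic data. All variable names (nSamp, nVar, fold, press, and so on) are hypothetical.

```matlab
% Synthetic data: only variables 21-30 carry information about y.
rng(1);
nSamp = 40;  nVar = 60;
X = randn(nSamp,nVar);
y = X(:,21:30)*ones(10,1) + 0.05*randn(nSamp,1);

int_width = 10;
starts  = 1:int_width:(nVar - int_width + 1);   % non-overlapping windows
nSplits = 5;
fold    = mod((0:nSamp-1)', nSplits) + 1;       % interleaved split labels 1..nSplits

rmsecv = zeros(1,numel(starts));
for k = 1:numel(starts)
    cols  = starts(k):(starts(k) + int_width - 1);  % forward mode: ONLY this window
    press = 0;
    for f = 1:nSplits
        test  = (fold == f);
        train = ~test;
        % Least-squares fit with an intercept (stand-in for a regression model)
        b    = [ones(sum(train),1) X(train,cols)] \ y(train);
        yhat = [ones(sum(test),1)  X(test,cols)] * b;
        press = press + sum((y(test) - yhat).^2);
    end
    rmsecv(k) = sqrt(press/nSamp);                  % RMSECV for this window
end

% Select the window with the lowest RMSECV
[bestfit,best] = min(rmsecv);
use = starts(best):(starts(best) + int_width - 1);
```

With several intervals requested, the same loop would be repeated with the already selected windows kept in the model, stopping once the requested number is reached or, for an unlimited search, when RMSECV no longer improves.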

Inputs are (X,Y), the X and Y data; (int_width), the interval width, i.e. the window width in variables; and (maxlv), the maximum number of latent variables to use in any model (maxlv has no effect if options.algorithm = 'mlr'). Note that excluding a variable in X will prevent it from being used in any model.
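To make the meaning of int_width concrete, the following base MATLAB sketch lays out windows over a hypothetical set of 20 variables. The helper name layout is invented for illustration, the second call uses a spacing smaller than the window width (compare the stepsize option on this page), and the sketch simply drops any trailing partial window; how ipls itself treats a partial window, and how it orients its intervals matrix, is not stated here.

```matlab
% Hypothetical helper: one row of variable indices per window.
% nVar = number of variables, w = window width, step = distance between windows.
layout = @(nVar,w,step) (1:step:(nVar - w + 1))' + (0:w-1);

nVar      = 20;   % number of included X variables (hypothetical)
int_width = 5;    % window width in variables

% Spacing equal to int_width: windows do not overlap
intervals = layout(nVar, int_width, int_width);

% A smaller spacing gives overlapping windows
overlapping = layout(nVar, int_width, 2);
```

Here intervals has four rows covering 1-5, 6-10, 11-15 and 16-20, while overlapping has eight rows starting at variables 1, 3, 5, and so on.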

If options.plots is 'final', a plot is given of the minimum RMSECV versus window center. Windows which were used are indicated in blue; windows which were excluded are indicated in red. The number of latent variables (LVs) used to assess each interval (the model size that gives the indicated RMSECV) is shown at the bottom of each interval's bar, inside the axes. The best RMSECV that can be obtained using all intervals (the all-interval RMSECV) is shown as a dashed red line. The number of LVs used in this all-interval model is shown on the right of the axes. If this number of LVs differs from the number used for the best model of the selected interval(s) (the selected-interval model), a dashed magenta line indicates the RMSECV obtained when using all intervals at the selected-interval model size. The mean sample is superimposed on the plot for reference.

INPUTS:

X = X-block,

Y = Y-block,

int_width = the interval (window width in variables), and

maxlv = the maximum number of latent variables to use in any model.

NOTE that excluding a variable in X will prevent it from being used in any model.

OUTPUTS:

When a single output is requested, the output is a structure with the following fields:

use: the final selected indices which gave the best model,

fit: the RMSECV for the selected indices,

lvs: the number of latent variables which gives the best fit,

intervals: a matrix containing the indices used for each interval,

intcv: the RMSECV in the last selection cycle for all intervals (these values were used to select the last interval).

intlv: the number of latent variables used in the model which gave the RMSECV values returned in intcv.

Optionally, with multiple outputs, these variables will be returned as separate outputs (not in structure format) in the order shown above.
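A hedged usage sketch of the two calling forms follows. It requires PLS_Toolbox on the MATLAB path and uses synthetic data. The option field names and values are taken from this page, but building the options structure by hand and relying on ipls to fill in any omitted fields with defaults is an assumption, not something this page confirms.

```matlab
% Requires PLS_Toolbox on the MATLAB path.
% Synthetic data: 40 samples, 100 variables; y depends on variables 31-40.
rng(0);
X = randn(40,100);
y = X(:,31:40)*ones(10,1) + 0.1*randn(40,1);

int_width = 10;   % ten variables per window
maxlv     = 3;    % at most three latent variables per model

% Assumption: a partial options structure is accepted and completed with defaults.
options = struct();
options.display       = 'off';
options.plots         = 'none';
options.mode          = 'forward';
options.algorithm     = 'pls';
options.preprocessing = 'autoscale';
options.cvi           = {'vet' 5 1};   % {method splits iterations}

% Single output: a structure with fields use, fit, lvs, intervals, intcv, intlv
results  = ipls(X,y,int_width,maxlv,options);
selected = results.use;   % indices of the selected variables
rmsecv   = results.fit;   % RMSECV for the selected indices

% Multiple outputs: the same quantities, returned separately in the same order
[use,fit,lvs,intervals,intcv,intlv] = ipls(X,y,int_width,maxlv,options);
```

The structure form is convenient when passing the result on to other code, while the multiple-output form avoids field lookups in short scripts.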

Options

options = options structure containing the fields:

display: [ 'off' | {'on'} ], governs level of display to command window,

plots: [ 'none' | {'final'} ], governs level of plotting,

mode: [{'forward'} | 'reverse' ] Defines action to be performed with each interval.

'forward' mode: the RMSECV calculated for each interval represents how well the y-block can be predicted using ONLY the variables included in the interval.

'reverse' mode: the RMSECV calculated for each interval represents how well the y-block can be predicted when the given interval of variables is removed from the range of included X variables.

algorithm: [{'pls'} | 'pcr' | 'mlr' ] Defines regression algorithm to use. Selection is done for the specific algorithm. Note that when MLR is used, input (int_width) is most often = 1 (single variable per window).

numintervals: { [1] } Number of intervals to select or remove. If numintervals is Inf, intervals are iteratively selected and added/removed until no improvement in RMSECV is observed. NOTE: this can also be set by passing a scalar value before, or in place of, the options structure. When passed this way, any numintervals value in the options structure will be ignored.

mustuse: [ ] A vector of variable indices which MUST be used in all models. These variables will always be included in any model, whether or not they are included in the current interval.

stepsize: [ ] Distance between interval centers. An empty matrix gives the default spacing in which intervals do not overlap (stepsize = int_width).

preprocessing: defines preprocessing and can be one of the following:

(a) One of the following strings:

'none' : no preprocessing {default}

'meancenter' : mean centering

'autoscale' : autoscaling

(b) A single preprocessing structure defined using the function preprocess. The same preprocessing structure will be used on both the X and Y blocks.

(c) Two preprocessing structures, one for the X block and one for the Y block.

cvi: {'vet' [ ] 1} Three-element cell indicating the cross-validation leave-out settings to use {method splits iterations}. For valid methods, see the "cvi" input to crossval. If splits (the second element in the cell) is empty, the square root of the number of samples will be used. cvi can also be a vector (non-cell) of indices indicating leave-out groupings (see crossval for more info).
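The difference between the 'forward' and 'reverse' modes, and the effect of mustuse, can be illustrated with plain index sets. This base MATLAB sketch is one reading of the option descriptions above, not the toolbox code; the variable names and the example indices are hypothetical.

```matlab
included = 1:12;    % variables included in X (hypothetical)
interval = 5:8;     % the window currently being assessed
mustuse  = [1 6];   % variables that must appear in every model

% Forward mode: model uses ONLY the window, plus the mustuse variables
forward_cols = union(interval, mustuse);

% Reverse mode: model uses everything EXCEPT the window, plus the mustuse variables
reverse_cols = union(setdiff(included, interval), mustuse);
```

Here forward_cols is [1 5 6 7 8] and reverse_cols is [1 2 3 4 6 9 10 11 12]: variable 6 stays in the reverse-mode model even though it lies inside the removed window, because it is listed in mustuse.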