crossval
Purpose
Cross-validation for PCA, PLS, MLR, and PCR.
Synopsis
results = crossval(x,y,rm,cvi,ncomp,options)
[press,cumpress,rmsecv,rmsec,cvpred,misclassed] = crossval(x,y,rm,cvi,ncomp,options)
Description
CROSSVAL performs cross-validation for linear regression (PCR, PLS, MLR, CorrelationPCR, and Locally Weighted Regression) and for principal components analysis (PCA). Inputs are the predictor variable matrix x, the predicted variable y (empty, [], for rm = 'pca'), the regression method rm, the cross-validation method cvi, and the maximum number of latent variables / components ncomp.
rm = 'pca' performs cross-validation for PCA,
rm = 'mlr' performs cross-validation for MLR,
rm = 'pcr' performs cross-validation for PCR,
rm = 'nip' performs cross-validation for PLS using NIPALS,
rm = 'sim' or 'pls' performs cross-validation for PLS using SIMPLS,
rm = 'correlationpcr' performs cross-validation for CorrelationPCR, and
rm = 'lwr' performs cross-validation for Locally Weighted Regression (see LWRPRED).
cvi can be 1) a cell containing one of the cross-validation methods below with the appropriate parameters, {cvm splits iter}, or 2) a vector representing user-defined cross-validation groups.
cvi = {'loo'}; leave-one-out cross-validation,
cvi = {'vet' splits}; venetian blinds (every n-th sample together),
cvi = {'con' splits}; contiguous blocks, and
cvi = {'rnd' splits iter}; random subsets.
Except for leave-one-out, all methods require the number of data splits, splits, to be provided. Random data subsets ('rnd') also require the number of iterations, iter.
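To make the difference between 'vet' and 'con' concrete, here is a minimal sketch in base MATLAB of how 10 samples could be assigned to 3 subsets under each scheme. It shows the general idea only: the index arithmetic and the variable names (m, vet_groups, con_groups) are assumptions made for illustration and are not necessarily the assignment CROSSVAL uses internally.

```matlab
m = 10;        % number of samples (rows of x), chosen for illustration
splits = 3;    % number of data splits

% Venetian blinds: every splits-th sample lands in the same subset.
vet_groups = mod(0:m-1, splits) + 1;      % 1 2 3 1 2 3 1 2 3 1

% Contiguous blocks: neighboring samples land in the same subset.
con_groups = ceil((1:m) * splits / m);    % 1 1 1 2 2 2 3 3 3 3

disp(vet_groups);
disp(con_groups);
```

Venetian blinds spread each subset across the whole data set, while contiguous blocks keep neighboring samples (for example replicates or time-ordered measurements) together in one subset.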
For user-defined cross-validation, cvi is a vector of integers with one element per row of x (i.e. length(cvi) = size(x,1) when x is class "double", or length(cvi) = size(x.data,1) when x is class "dataset"). The elements define the test subsets. Each cvi(i) is defined as:
cvi(i) = -2 the sample is always in the test set,
cvi(i) = -1 the sample is always in the calibration set,
cvi(i) = 0 the sample is never used, and
cvi(i) = 1,2,3… the number of the test subset the sample belongs to.
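As an illustration of this coding, the sketch below builds a user-defined cvi vector in base MATLAB for a hypothetical data set of 12 samples. The sample count and the choice of which samples are fixed or excluded are made up for this example.

```matlab
m = 12;                        % hypothetical number of rows in x
cvi = zeros(1, m);             % start with every sample marked 0 (never used)

cvi(1:2)  = -1;                % samples 1-2: always in the calibration set
cvi(3)    = -2;                % sample 3: always in the test set
cvi(4:11) = repmat(1:4, 1, 2); % samples 4-11: rotate through test subsets 1-4
                               % sample 12 stays 0 and is left out entirely

% One element per sample, as required for user-defined groups.
assert(numel(cvi) == m);

% The vector would then take the place of the cell form of cvi, e.g.:
%   results = crossval(x, y, 'sim', cvi, 10);
```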
Options
Optional input options is an options structure
containing one or more of the following fields:
display: [ 'off' | {'on'} ] Governs output to the command window,
plots: [ 'none' | {'final'} ] Governs plotting,
preprocessing: {[1]} Controls preprocessing. Default is mean centering (1). Can be input in two ways:
a) As a single value: 0 = none, 1 = mean centering, 2 = autoscaling, or
b) As {xp yp}, a cell array containing preprocessing structures for the X- and Y-blocks (see PREPROCESS). To include preprocessing of each subset use pre = {xp yp}; or pre = {xp []}; for PCA. To avoid preprocessing of each subset use pre = {[] []}; or pre = 0 (zero).
threshold: {[]} Alternative PLSDA threshold level (default = [] = automatic),
prior: {[]} Used with PLSDA only. Vector of fractional prior probabilities. This is the probability (0-1) of observing a "1" for each column of y (i.e. each class). E.g. [.25 .50] defines that only 25% and 50% of future samples will likely be "true" for the classes identified by columns 1 and 2 of the y-block. [] (Empty) = equal priors.
structureoutput: [ {'no'} | 'yes' ] Governs output variables. 'yes' returns a structure instead of individual variables. 'yes' is the default if only one output is requested.
jackknife: [ {'no'} | 'yes' ] Governs storing of jackknifed regression vectors. Jackknifing may slow performance significantly or cause out-of-memory errors when both x and y blocks have many variables.
rmsec: [ 'no' | {'yes'} ] Governs calculation of RMSEC. When set to 'no', calculation of the "all variables" model is skipped (unless specifically required for plots or requested with multiple outputs).
pcacvi: {'loo'} Cell describing how PCA cross-validation should perform variable replacement. Variable replacement options are similar to cross-validation CVI options and include:
{'loo'} leave one variable out at a time,
{'con' splits} contiguous blocks (total of splits groups),
{'vet' splits} venetian blinds (every n-th variable), or
{'rnd' splits} random subsets (note: no iterations).
fastpca: [ 'off' | {'auto'} ] Governs use of the "fast" PCA cross-validation algorithm. 'off' never uses the fast algorithm; 'auto' uses the fast algorithm when other options permit. The fast algorithm can only be used with pcacvi set to 'loo'.
lwr: Sub-structure of options to use for
locally-weighted regression cross-validation. Most of these options are used as
defined in the LWRPRED function (see LWRPRED for more details) but there are
two additional options defined for cross-validation:
lwr.minimumpts : [20] the minimum number of points (samples) to use in any LWR sub-model.
lwr.ptsperterm : [20] the number of points to use per term (LV) in the LWR model. For example, when set to 20, 20 samples will be used for a 1 LV model, 40 samples will be used for a 2 LV model, etc. If set to zero, the number of points defined by lwr.minimumpts will be used for all models; that is, the number of samples used will be independent of the number of LVs in the model.
In all cases, the number
of samples in an individual test set will be the upper limit of samples to
include in any LWR prediction.
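One way to read lwr.minimumpts and lwr.ptsperterm together is sketched below in base MATLAB. The max() rule is an interpretation of the description above, chosen because it reproduces the stated examples; it is an assumption, not code taken from CROSSVAL, and it does not model the upper limit mentioned in the preceding paragraph.

```matlab
minimumpts = 20;   % lwr.minimumpts (default listed above)
ptsperterm = 20;   % lwr.ptsperterm (default listed above)
lvs = 1:4;         % numbers of latent variables to consider

% Assumed rule: scale with the number of LVs, but never drop below the minimum.
npts = max(minimumpts, ptsperterm * lvs);    % 20 40 60 80

% With ptsperterm = 0 the count no longer depends on the number of LVs.
npts_fixed = max(minimumpts, 0 * lvs);       % 20 20 20 20

disp(npts);
disp(npts_fixed);
```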
Outputs
press: predictive residual error sum of squares (PRESS) for each subset (subsets are rows of this matrix, numbers of components are columns).
cumpress: cumulative PRESS (sum of columns of press).
rmsecv: root mean square error of cross-validation.
rmsec: root mean square error of calibration.
cvpred: cross-validation y-predictions (regression methods only). If the cross-validation method was random, this is the average prediction of all replicates.
misclassed: fractional misclassifications for each class (valid for regression methods only, and only when y is a logical, i.e. discrete-value, vector).
reg: jackknifed regression vectors from each subset. This will be size [k*ny nx splits] such that reg(1,:,:) will be the regression vectors for the 1-component model of the first column of y for all subsets (a 1 by nx by splits matrix). Use squeeze to reduce to an nx by splits matrix. (Note: options.jackknife must be 'yes' to use reg.)
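The indexing described for reg can be tried on a placeholder array. The sketch below uses random numbers in place of real jackknifed regression vectors, and the sizes (k, ny, nx, splits) are arbitrary values chosen for illustration; the per-coefficient standard deviation at the end is one possible use of the result, not something CROSSVAL returns.

```matlab
k = 3; ny = 2; nx = 5; splits = 4;   % arbitrary illustrative sizes
reg = rand(k*ny, nx, splits);        % placeholder for jackknifed vectors

first = reg(1,:,:);                  % 1 by nx by splits
b = squeeze(first);                  % nx by splits

disp(size(first));
disp(size(b));

% Spread of each regression coefficient across the subsets
b_std = std(b, 0, 2);                % nx by 1
```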
If options.structureoutput is 'yes', a single output (results) will return all the above outputs as fields in a structure. If options.rmsec is 'no', then RMSEC is not returned (this provides faster iterative calculation).
Note that for multivariate y, the output press is grouped by output variable, i.e. all of the PRESS values for the first variable are followed by all of the PRESS values for the second variable, etc.
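The relationship between press and cumpress given in the output list can be sketched in base MATLAB for a single y column. The PRESS values and the sample count m below are made up for illustration, and the RMSECV line uses the conventional definition (square root of cumulative PRESS divided by the number of samples), which is an assumption about how CROSSVAL computes it.

```matlab
% Made-up PRESS values: 3 subsets (rows) by 4 components (columns)
press = [4.0 2.0 1.0 1.5;
         3.0 1.0 1.0 0.5;
         2.0 1.0 1.0 1.0];
m = 12;                          % assumed total number of samples tested

cumpress = sum(press, 1);        % sum down each column: 9 4 3 3
rmsecv = sqrt(cumpress / m);     % conventional definition (assumption)

[~, best] = min(cumpress);       % first component count with the lowest PRESS
disp(cumpress);
disp(best);
```

Here min returns the first of the tied minima, so the smaller model (3 components) is preferred over the 4-component model with the same cumulative PRESS.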
When options.plots is not 'none', plots of both RMSECV and RMSEC are provided.
Examples
[press,cumpress] = crossval(x,y,'nip',{'loo'},10);
[press,cumpress] = crossval(x,y,'pcr',{'vet',3},10);
[press,cumpress] = crossval(x,y,'nip',{'con',5},10);
[press,cumpress] = crossval(x,y,'sim',{'rnd',3,20},10);
res = crossval(x,y,'sim',{'rnd',3,20},10);
pre = {preprocess('autoscale') preprocess('autoscale')};
opts.preprocessing = pre;
opts.plots = 'none';
[press,cumpress] = crossval(x,y,'sim',{'rnd',3,20},10,opts);
res = crossval(x,y,'sim',{'rnd',3,20},10,opts);
[press,cumpress] = crossval(x,[],'pca',{'loo'},10);
[press,cumpress] = crossval(x,[],'pca',{'vet',3},10);
res = crossval(x,[],'pca',{'con',5},10);
See Also
pca, pcr, pls, preprocess, ncrossval