Preprocess

From Eigenvector Documentation Wiki

Purpose

Selection and application of standard preprocessing methods.

Synopsis

s = preprocess(s)  %GUI preprocessing selection
s = preprocess('default','methodname')  %Non-GUI selection
[datap,sp] = preprocess('calibrate',s,data)  %single block calibrate
[datap,sp] = preprocess('calibrate',s,xblock,yblock)  %multi-block
datap = preprocess('apply',sp,data)  %apply to new data
data = preprocess('undo',sp,datap)  %undo preprocessing
preprocess('keywords') %Show valid method names.

Description

PREPROCESS is a general tool for choosing preprocessing steps and performing these steps on data. It can be used as a graphical interface or as a command-line tool. See Model Building - PreProcessing Methods for a description of the use of the graphical user interface. See User Defined Preprocessing and preprouser for a description of how custom preprocessing can be added to the standard preprocessing options listed below.

From the command line, PREPROCESS can be used to perform 4 different tasks:

  • 1) Specification of Preprocessing
  • 2) Estimate preprocessing parameters (calibrate)
  • 3) Apply preprocessing to new data (apply)
  • 4) Remove the effect of previously-done preprocessing on data (undo)

Case 1) Specification of Preprocessing

The purpose of the following calls to PREPROCESS is to generate standard structure arrays that contain the desired preprocessing steps.

s = preprocess;

generates a GUI and allows the user to select preprocessing steps interactively. The output s is a standard preprocessing structure.

s = preprocess(s);

allows the user to interactively edit a previously-built preprocessing structure s. The output s is the edited preprocessing structure.

s = preprocess('default','methodname');

returns the default structure for method methodname. A list of strings that can be used for methodname can be viewed using the command:

preprocess('keywords')

The technical description of the different types of preprocessing can be found on the Model Building: Preprocessing Methods page. Below is a list of standard methods that can be used for 'methodname':

  • 'abs': takes the absolute value of the data (see abs),
  • 'autoscale': centers columns to zero mean and scales to unit variance (see auto),
  • 'simple baseline': baseline (specified points, see baseline),
  • 'baseline': baseline (weighted least squares, see wlsbaseline),
  • 'classcenter': centers classes in data to the mean of each class (see classcenter),
  • 'derivative': derivative (see savgol),
  • 'detrend': remove a linear trend (see baseline),
  • 'epo': External Parameter Orthogonalization - remove clutter covariance (see glsw),
  • 'gls weighting': generalized least squares weighting (see glsw),
  • 'gscale': group/block scaling (see gscale),
  • 'holoreact': Kaiser HoloReact Method (see hrmethodreadr),
  • 'logdecay': log decay scaling,
  • 'log10': calculates the base 10 logarithm of the data (see log10),
  • 'mean center': center columns to have zero mean (see mncn),
  • 'median center': center columns to have zero median (see medcn),
  • 'msc': multiplicative scatter correction with offset, the mean is the reference spectrum (see mscorr),
  • 'centering': multiway center,
  • 'scaling': multiway scale,
  • 'normalize': normalization of the rows (see normaliz),
  • 'osc': orthogonal signal correction (see osccalc and oscapp),
  • 'pareto': Pareto (sqrt std) scaling, scales each variable by the square root of its standard deviation,
  • 'sqmnsc': sqrt mean scale, scales each variable by the square root of its mean, a.k.a. Poisson scaling (see poissonscale),
  • 'smooth': Savitzky-Golay smoothing and derivatives (see savgol),
  • 'snv': standard normal deviate (autoscale the rows, see snv),
  • 'specalign': variable alignment via cow and registerspec,
  • 'trans2abs': transmission to absorbance (log(1/T)),
  • 'autoscalenomean': variance (std) scaling, scales each variable by its standard deviation without mean-centering.

The output is a standard preprocessing structure array s, where each preprocessing method to apply is contained in a separate record.
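
As a rough sketch of how a multi-step structure might be assembled from the command line, the example below builds default structures for two methods and joins them into one structure array. It assumes PLS_Toolbox is on the MATLAB path. Concatenating the default structures with square brackets, and the steps being run in array order, are assumptions that follow from s being a structure array with one record per method; the choice of methods is only illustrative.

```matlab
% Assumes PLS_Toolbox is installed and on the MATLAB path.
s_snv = preprocess('default','snv');          % default structure for SNV
s_mc  = preprocess('default','mean center');  % default structure for mean centering

% Assumption: default structures share the same fields, so they can be
% concatenated, and the records are used in array order (SNV first).
s = [s_snv s_mc];

disp(numel(s))   % number of preprocessing steps held in s
```

The resulting s can then be passed to preprocess(s) for interactive editing or to a 'calibrate' call.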

Case 2) Estimate preprocessing parameters (calibrate)

Many preprocessing methods derive statistics and other numerical values from the calibration data. These values must be stored so that new data (test or other future data) can be preprocessed in the same way as the calibration data. Examples include the mean of the calibration data and the scaling value for each variable.

The objective of the calibrate call to PREPROCESS is to estimate the preprocessing parameters, if any, from the calibration data set and to perform preprocessing on the data. The I/O format is:

[datap,sp] = preprocess('calibrate',s,data);

The inputs are s, a standard preprocessing structure, and data, the calibration data. The preprocessed data is returned in datap, and the preprocessing parameters are returned in a modified preprocessing structure sp. Note that sp is used as an input with the 'apply' and 'undo' commands described below.

As a shortcut, a method name can be passed in place of s. Examples for 'mean center' and 'autoscale' are:

[datap,sp] = preprocess('calibrate','mean center',data);
[datap,sp] = preprocess('calibrate','autoscale',data);
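
A small check of the shortcut form might look like the following sketch. It assumes PLS_Toolbox is on the MATLAB path and uses made-up data; the handling of a DataSet object through its .data field is an assumption about the returned class, so the code falls back to using the result as-is.

```matlab
% Assumes PLS_Toolbox is installed and on the MATLAB path.
data = [1 10; 2 20; 3 30; 4 40];   % made-up calibration data (4 samples x 2 variables)

[datap,sp] = preprocess('calibrate','autoscale',data);

% Assumption: the result may be a DataSet object whose numeric values are
% held in its .data field; otherwise it is used directly.
if isa(datap,'dataset'), vals = datap.data; else, vals = datap; end

disp(mean(vals,1))   % column means, expected to be close to zero after autoscaling
```

The structure sp returned here carries the calibrated means and scales needed by later 'apply' and 'undo' calls.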

Preprocessing for some multi-block methods (specifically, 'osc' and 'gls weighting') requires that the y-block also be passed. The I/O format in these cases is:

[datap,sp] = preprocess('calibrate',s,xblock,yblock);
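
The multi-block call could be exercised along these lines. This is only a sketch: it assumes PLS_Toolbox is on the MATLAB path, that the default 'osc' settings are acceptable for data of this size, and it uses random numbers in place of real x- and y-blocks.

```matlab
% Assumes PLS_Toolbox is installed and on the MATLAB path.
rng(0);                      % make the random example data repeatable
xblock = randn(20,6);        % made-up x-block (20 samples x 6 variables)
yblock = randn(20,1);        % made-up y-block (one value per sample)

s = preprocess('default','osc');                         % default OSC structure
[datap,sp] = preprocess('calibrate',s,xblock,yblock);    % y-block is required here

disp(size(datap))            % same size as xblock
```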

Case 3) Apply preprocessing to new data (apply)

Once the preprocessing steps have been calibrated on a set of data, the same steps can be "applied" to new data as it is obtained. This differs from the initial calibration step because the values calculated from the calibration data are reused to transform the new data in exactly the same way. For example, with mean centering, the mean of the calibration data is used to center the new data relative to the same point.
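
The distinction between calibrating and applying can be illustrated without the toolbox. The sketch below mean-centers by hand in plain MATLAB; the variable names and data are made up for illustration and this is not how PREPROCESS is implemented internally.

```matlab
% Plain MATLAB illustration; no toolbox calls are used here.
Xcal = [1 2; 3 4; 5 6];      % calibration data (3 samples x 2 variables)
Xnew = [2 2; 6 8];           % new data obtained later

mu     = mean(Xcal, 1);                 % "calibrate": column means of Xcal, [3 4]
Xcal_p = bsxfun(@minus, Xcal, mu);      % centered calibration data
Xnew_p = bsxfun(@minus, Xnew, mu);      % "apply": reuse the calibration means

disp(Xnew_p)                 % [-1 -2; 3 4]
```

Because the calibration means are reused, the columns of Xnew_p need not have zero mean; recomputing the mean from Xnew would place the new data relative to a different point.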

The following call to PREPROCESS

datap = preprocess('apply',sp,data)

applies the calibrated preprocessing in sp to new data. Inputs are sp, the modified preprocessing structure (see Case 2 above), and data, the data to which the preprocessing is applied. The output is the preprocessed data datap, which is of class "dataset".
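
Put together, a calibrate-then-apply sequence might look like the sketch below. It assumes PLS_Toolbox is on the MATLAB path; the data values are made up, and reading the numeric values from the .data field assumes the output is a DataSet object as stated above.

```matlab
% Assumes PLS_Toolbox is installed and on the MATLAB path.
xcal = [0 10; 2 20; 4 30];   % made-up calibration data, column means [2 20]
xnew = [3 25];               % made-up new sample

[xcalp,sp] = preprocess('calibrate','mean center',xcal);  % estimate the means
xnewp      = preprocess('apply',sp,xnew);                 % reuse them on new data

% Pull the numeric values out of the DataSet object, if one was returned.
if isa(xnewp,'dataset'), xnewp = xnewp.data; end

disp(xnewp)                  % expected [1 5], i.e. xnew minus the calibration means
```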

Case 4) Remove the effect of previously-done preprocessing on data (undo)

The inverse operation of applying preprocessing is performed in the following call to PREPROCESS:

data = preprocess('undo',sp,datap);

Inputs are sp, the modified preprocessing structure (see Case 2 above), and datap, the data (class "double" or "dataset") from which the preprocessing is removed. Note that for some preprocessing methods (for example, 'osc' and 'sg') an inverse does not exist or has not been defined; in such cases an 'undo' call causes an error. One reason for not defining an inverse, or undo, is that it would require a significant amount of memory when data sets get large.
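
Since 'undo' can raise an error for methods without an inverse, a caller may want to guard the call. The following sketch assumes PLS_Toolbox is on the MATLAB path and uses made-up data with 'mean center', for which the round trip is expected to succeed; the catch branch shows one possible way to handle a method with no defined inverse.

```matlab
% Assumes PLS_Toolbox is installed and on the MATLAB path.
data = [1 10; 2 20; 3 30];   % made-up data

[datap,sp] = preprocess('calibrate','mean center',data);

try
    back = preprocess('undo',sp,datap);      % remove the preprocessing
    if isa(back,'dataset'), back = back.data; end
    disp(max(abs(back(:) - data(:))))        % expected to be close to zero
catch err
    % Reached when a method in sp has no inverse defined.
    back = [];
    fprintf('undo is not available: %s\n', err.message);
end
```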

The undo operation is used mostly on y-block predictions in regression models and in missing data replacement algorithms (see mdcheck and replace), in which the data is preprocessed and an estimate of the data is then converted back to the original data space.
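
To make the y-block case concrete, the sketch below shows by hand what undoing autoscaling on predictions amounts to. It is plain MATLAB with made-up numbers and hypothetical variable names, intended only to illustrate the idea rather than the toolbox's internal code.

```matlab
% Plain MATLAB illustration; no toolbox calls are used here.
ycal = [10; 20; 30; 40];         % made-up calibration y-block

mu = mean(ycal);                 % calibration mean, 25
sd = std(ycal);                  % calibration standard deviation
ycal_p = (ycal - mu) / sd;       % autoscaled y used to build a model

yhat_p = [0; 1];                 % hypothetical predictions in the scaled space
yhat   = yhat_p * sd + mu;       % "undo": back to the original units

disp(yhat)                       % first value is 25, second is 25 + sd
```

A prediction of zero in the preprocessed space maps back to the calibration mean, which is why the calibrated parameters must be kept alongside the model.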

See Also

crossval, pca, pcr, pls, preprouser, preprocatalog
