glsw Documentation


PLS_Toolbox Documentation: glsw

glsw

Purpose

Calculate or apply Generalized Least Squares weighting.

Synopsis

modl = glsw(x,a); %GLS on matrix

modl = glsw(x1,x2,a); %GLS between two data sets

modl = glsw(x,y,a); %GLS on matrix in groups based on y

modl = glsw(modl,a); %Update model to use a new value

xt = glsw(newx,modl,options); %apply correction

xt = glsw(newx,modl,a); %apply correction

Description

Uses Generalized Least Squares to down-weight variable features identified from the singular value decomposition of a data matrix. The input data usually represents two or more measured populations which should otherwise be the same (e.g. the same samples measured on two different analyzers or using two different solvents) and can be input in one of several forms, as explained below. In all cases, the downweighting is performed by taking the eigenvectors and eigenvalues of the differences.

If the singular value decomposition (SVD) of the input matrix x is X = USV^T, then the deweighting matrix is estimated with the following pseudo-inverse: W = U diag(sqrt(1/(diag(S)/a² + 1))) V^T, where the center term defines S_inv. The adjustable parameter a is used to scale the singular values prior to calculating their inverse. As a gets larger, the extent of deweighting decreases (because S_inv approaches 1). As a gets smaller (e.g. 0.1 to 0.001), the extent of deweighting increases (because S_inv approaches 0) and the deweighting includes increasing amounts of the directions represented by smaller singular values.
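The effect of a on S_inv can be illustrated with a short sketch in base MATLAB. This is not the PLS_Toolbox implementation: it assumes the decomposition is taken of a square covariance matrix of the differences (so that U and V have matching sizes) and that the deweighting matrix is applied by right-multiplication. The variable names and the random data are hypothetical.

```matlab
% Sketch only, under the assumptions stated above (not the glsw source code).
rng(0);                                % reproducible placeholder data
d = randn(20,5);                       % hypothetical differences: 20 samples, 5 variables
C = (d'*d)/(size(d,1)-1);              % assumed: covariance of the differences (5 x 5)
[U,S,V] = svd(C);                      % C = U*S*V'
s = diag(S);                           % singular values as a column vector

% Show how the range of S_inv changes with a
avals = [1e3 1 1e-2 1e-3];
for k = 1:numel(avals)
    a = avals(k);
    sinv = sqrt(1./(s/a^2 + 1));       % center term S_inv of the formula
    fprintf('a = %g: S_inv between %.4g and %.4g\n', a, min(sinv), max(sinv));
end

% Build and apply the deweighting matrix for a = 1e-2
a = 1e-2;
sinv = sqrt(1./(s/a^2 + 1));
W = U*diag(sinv)*V';                   % deweighting matrix from the formula
xt = d*W;                              % assumed: applied by right-multiplication
```

For large a the printed S_inv values sit close to 1 (little deweighting), and for small a they move toward 0, with the largest singular values suppressed most strongly.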

A good initial guess for a is 1x10^-2, but the best value will vary depending on the covariance structure of X and the specific application. It is recommended that a number of different values be investigated using some external cross-validated metric for performance evaluation.

An alternative method to use GLSW is in quantitative analysis where a continuous y-variable is used to develop pseudo-groupings of samples in X by comparing the differences in the corresponding y values. This is referred to as the "gradient method" because it utilizes a gradient of the sorted X and y blocks to calculate a covariance matrix. For more information on this method, see the chapter discussing Preprocessing in the PLS_Toolbox Manual.

For calibration, inputs can be provided by one of four methods:

1) x = data matrix containing features to be downweighted, and

a = scalar parameter limiting downweighting {default = 1e-2}.

Note: If x is a dataset with classes, the differences within each class will be downweighted rather than the entire matrix. This reduces the within-class variation while ignoring the between-class variation.

2) x1 = an M by N data matrix and

x2 = an M by N data matrix.

The row-by-row differences between x1 and x2 will be used to estimate the downweighting.

a = scalar parameter limiting downweighting {default = 1e-2}.

3) x = an MxN data matrix,

y = column vector with M rows which specifies sample groups in x within which differences should be downweighted. Note that this method is identical to method (1) when classes of the X block are used to identify groups. The only difference is that these groupings are passed as a separate input. In fact, if y is empty, this defaults to method (1) above.

a = scalar parameter limiting downweighting {default = 1e-2}.

4) x = an MxN data matrix,

y = column vector with M rows specifying a y-block continuous variable. In this input, the "gradient method" is used to identify similar samples and downweight differences between them. See also the gradientthreshold option below.

a = scalar parameter limiting downweighting {default = 1e-2}.
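The numbered input forms above correspond to calls like the ones sketched below. The call signatures follow the Synopsis; everything else is an assumption for illustration: PLS_Toolbox must be installed and on the MATLAB path, the data are random placeholders, and the variable names are hypothetical. The class-based variant of form (1) needs a dataset with classes and is not shown.

```matlab
% Assumes PLS_Toolbox is on the path. All data are random placeholders.
rng(0);
x1 = randn(12,40);                     % 12 samples, 40 variables, "first instrument"
x2 = x1 + 0.05*randn(12,40);           % same samples, "second instrument"
xc = 0.05*randn(12,40);                % hypothetical matrix of features to downweight
a  = 0.02;                             % scalar parameter limiting downweighting

% Form 1: downweight the features of a single matrix
modlA = glsw(xc, a);

% Form 2: use the row-by-row differences between two M by N matrices
modlB = glsw(x1, x2, a);

% Form 3: discrete groups given as a column vector with M rows
ygroups = repmat((1:4)', 3, 1);        % four groups of three samples, no singletons
modlC = glsw(x1, ygroups, a);

% Form 4: continuous y; every value is unique, so all groups are singletons
yconc = linspace(0.1, 5, 12)';
modlD = glsw(x1, yconc, a);
```

Forms 3 and 4 share the same signature. According to the gradientthreshold option, the choice between them depends on the fraction of samples that fall in singleton groups of y.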

An options structure can be used in place of (a) for any call or as the third input in an apply call. This structure consists of any of the fields:

a: [ 0.02 ] scalar parameter limiting downweighting {default = 1e-2},

applymean: [ 'no' | {'yes'} ] governs the use of the mean difference calculated between two instruments (difference between two instruments mode). When applying a GLS filter to data collected on the first (x1) instrument, the mean should NOT be applied. Data collected on the SECOND instrument should have the mean applied.

gradientthreshold: [ .25 ] "continuous variable" threshold fraction above which the column gradient method will be used with a continuous y. Usually, when (y) is supplied, it is assumed to be the identification of discrete groups of samples. However, when calibrating, the number of samples in each "group" is calculated and the fraction of samples in "singleton" groups (i.e. in their own group) is determined.

fraction = (# Samples in Singleton Groups) / Total Samples

If this fraction is above the value specified by this option, (y) is considered a continuous variable (such as a concentration or other property to predict). In these cases, the "sample similarity" (a.k.a. "column gradient") method of calculating the covariance matrix will be used. The sample similarity method determines the down-weighting required based mostly on samples which are the most similar (on the specified y-scale). Set to >=1 to disable and to 0 (zero) to always use.

maxpcs: [ 50 ] maximum number of components (factors) to allow in the GLSW model. Typically, the number of factors included in a model will be the smallest of this number, the number of variables or the number of samples. Having a limit set here is useful when deriving a GLSW model from a large number of samples and variables. Often, a GLSW model effectively uses fewer than 20 components. Thus, this option can be used to keep the GLSW model smaller in size. It may, however, decrease its effectiveness if critical factors are not included in the model.
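The singleton fraction that the gradientthreshold option is compared against can be sketched in base MATLAB. This is an illustration of the fraction as defined in the option description, not the toolbox code; the y vector and the variable names are hypothetical.

```matlab
% Sketch of the singleton-fraction test described for gradientthreshold.
y = [1 1 2 2 2 3 4 5]';                % hypothetical y: values 3, 4 and 5 occur once each
[~,~,idx] = unique(y);                 % idx maps each sample to its group
counts = accumarray(idx, 1);           % number of samples in each group
nsingle = sum(counts == 1);            % samples that are alone in their group
fraction = nsingle/numel(y);           % (# samples in singleton groups) / total samples

gradientthreshold = 0.25;              % value shown in the option description
usegradient = fraction > gradientthreshold;   % true: treat y as continuous
```

Here three of the eight samples are singletons, so the fraction is 0.375, which exceeds 0.25, and y would be treated as a continuous variable.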

When applying a GLSW model the inputs are newx, the x-block to be deweighted, and modl, a GLSW model structure.

Outputs are modl, a GLSW model structure, and xt, the deweighted x-block.
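A calibrate, update and apply sequence might look like the sketch below. The call signatures are those listed in the Synopsis and the applymean settings follow the option description. It assumes that PLS_Toolbox is on the MATLAB path and that an options structure containing only some of the fields is accepted. The data are random placeholders and the variable names are hypothetical.

```matlab
% Assumes PLS_Toolbox is on the path. All data are random placeholders.
rng(1);
x1 = randn(12,40);                     % calibration samples on the first instrument
x2 = x1 + 0.05*randn(12,40) + 0.2;     % same samples on the second instrument (with offset)

modl = glsw(x1, x2, 0.02);             % GLS between two data sets
modl = glsw(modl, 0.005);              % update the model to use a new value of a

newx1 = randn(5,40);                   % new data measured on the first instrument
newx2 = randn(5,40) + 0.2;             % new data measured on the second instrument

opts1.applymean = 'no';                % first instrument: mean should NOT be applied
xt1 = glsw(newx1, modl, opts1);

opts2.applymean = 'yes';               % second instrument: mean should be applied
xt2 = glsw(newx2, modl, opts2);
```

Updating a through glsw(modl,a) lets several values be compared without passing the calibration data again.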