glsw
Purpose
Calculate or apply Generalized Least Squares weighting.
Synopsis
modl = glsw(x,a); %GLS on matrix
modl = glsw(x1,x2,a); %GLS between two data sets
modl = glsw(x,y,a); %GLS on matrix in groups based on y
modl = glsw(modl,a); %Update model to use a new value of a
xt = glsw(newx,modl,options); %apply correction
xt = glsw(newx,modl,a); %apply correction
Description
Uses Generalized Least Squares to down-weight variable
features identified from the singular value decomposition of a data matrix. The
input data usually represents two or more measured populations which should
otherwise be the same (e.g. the same samples measured on two different
analyzers or using two different solvents) and can be input in one of several
forms, as explained below. In all cases, the downweighting is performed by
taking the eigenvectors and eigenvalues of the differences.
If the singular value decomposition (SVD) of the input matrix x is $X = U S V^T$, then the deweighting matrix is estimated with the following pseudo-inverse:
$W = U \, \mathrm{diag}\left(\sqrt{1/(\mathrm{diag}(S)/a^2 + 1)}\right) V^T$,
where the center term defines $S_{inv}$. The adjustable parameter a is used to scale the
singular values prior to calculating their inverse. As a gets larger, the extent of deweighting
decreases (because $S_{inv}$ approaches 1). As a gets smaller (e.g. 0.1 to
0.001), the extent of deweighting increases (because $S_{inv}$
approaches 0) and the deweighting includes increasing amounts of the
directions represented by smaller singular values.
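The formula above can be sketched in base MATLAB as follows. This is only an illustration of the expression as written in this section, not the internals of glsw; the matrix X and the variable names s, sinv and W are made up for the example.

```matlab
% Illustrative data: 4 samples by 3 variables (hypothetical values).
X = [1 2 3; 4 5 6; 7 8 10; 2 0 1];
a = 1e-2;                         % scaling parameter described in the text

[U,S,V] = svd(X,'econ');          % X = U*S*V'
s    = diag(S);                   % singular values, largest first
sinv = sqrt(1./(s./a^2 + 1));     % center term S_inv, one value per singular value
W    = U*diag(sinv)*V';           % W = U*diag(S_inv)*V', following the formula above

disp([s sinv]);                   % larger singular values receive smaller weights
```

Directions with large singular values receive the smallest S_inv values, so they are the most strongly deweighted.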
A good initial guess for a is 1e-2, but the best value will vary depending on
the covariance structure of X and the specific application. It is
recommended that a number of different values be investigated using some
external cross-validated metric for performance evaluation.
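To get a feel for the range of a before running such a comparison, the center term can be tabulated for a few candidate values. The sketch below uses base MATLAB only; the singular value s = 1 and the candidate list are arbitrary choices for illustration, and it shows only the effect on S_inv, not any cross-validated performance metric.

```matlab
s = 1;                               % one fixed singular value (arbitrary)
a = [1e-3 1e-2 1e-1 1 10];           % candidate values of a (illustrative)

sinv = sqrt(1./(s./a.^2 + 1));       % center term for each candidate a

for k = 1:numel(a)
    fprintf('a = %-6g  S_inv = %.4g\n', a(k), sinv(k));
end
```

Small values of a drive S_inv toward 0 (strong deweighting), while large values leave it close to 1 (little deweighting).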
An alternative way to use GLSW is in quantitative analysis,
where a continuous y-variable is used to develop pseudo-groupings of samples in
X by comparing the differences in the corresponding y values. This is referred
to as the "gradient method" because it utilizes a gradient of the
sorted X and y blocks to calculate a covariance matrix. For more information on
this method, see the chapter discussing Preprocessing in the PLS_Toolbox
Manual.
For calibration, inputs can be provided by one of four
methods:
1) x = data matrix containing features to be downweighted, and
a = scalar parameter limiting downweighting {default = 1e-2}.
Note: If x is a dataset with classes, the differences within each class will be downweighted rather
than the entire matrix. This reduces the within-class variation while ignoring the
between-class variation.
2) x1 = an M by N data matrix and
x2 = an M by N data matrix.
The row-by-row differences between x1 and x2 will be used to estimate the downweighting.
a = scalar parameter limiting downweighting {default = 1e-2}.
3) x = an M by N data matrix,
y = column vector with M rows which specifies sample groups in x
within which differences should be downweighted. Note that this method is
identical to method (1) when classes of the X block are used to identify
groups. The only difference is that these groupings are passed as a separate
input. In fact, if y is empty, this defaults to method (1) above.
a = scalar parameter limiting downweighting {default = 1e-2}.
4) x = an M by N data matrix,
y = column vector with M rows specifying a
y-block continuous variable. With this input, the "gradient method" is
used to identify similar samples and downweight differences between them. See
also the gradientthreshold option below.
a = scalar parameter limiting downweighting {default = 1e-2}.
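As a rough illustration of what "differences" means for methods (2) and (3), the base-MATLAB sketch below builds a difference matrix for each case. The data values and the names d2 and d3 are hypothetical. For method (3), removing each group's mean is an assumption about how within-group differences could be formed; it is not taken from the glsw source.

```matlab
% Method (2): the same samples measured twice (e.g. on two instruments).
x1 = [1.0 2.0 3.0; 2.0 3.0 4.0; 3.0 5.0 6.0];
x2 = [1.1 2.2 2.9; 2.1 3.1 4.2; 3.2 5.1 5.9];
d2 = x1 - x2;                         % row-by-row differences, M by N

% Method (3): group labels in y; assumed here: subtract each group's mean.
x = [1 2; 1.2 2.1; 5 6; 5.1 6.3; 4.9 5.8];
y = [1; 1; 2; 2; 2];
d3 = zeros(size(x));
groups = unique(y);
for g = groups'                       % loop over the distinct group labels
    idx = (y == g);                   % rows belonging to this group
    d3(idx,:) = x(idx,:) - repmat(mean(x(idx,:),1), sum(idx), 1);
end
```

In both cases the resulting matrix contains only variation that should ideally be absent, which is the variation the weighting is meant to suppress.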
An options structure can be used in place of (a) for any call
or as the third input in an apply call. This structure can contain any of the
following fields:
a: [ 0.02 ] scalar parameter limiting downweighting {default = 1e-2},
applymean: [ 'no' | {'yes'} ] governs
the use of the mean difference calculated between two instruments (difference
between two instruments mode). When applying a GLS filter to data collected on
the first (x1) instrument, the mean should NOT be applied. Data collected on the SECOND
instrument should have the mean applied.
gradientthreshold: [ .25 ] "continuous
variable" threshold fraction above which the column gradient method will
be used with a continuous y. Usually, when (y) is supplied, it is assumed to
identify discrete groups of samples. However, when calibrating,
the number of samples in each "group" is calculated and the fraction
of samples in "singleton" groups (i.e. in their own group) is
determined:
fraction = (# Samples in Singleton Groups) / Total Samples
If this fraction is above the value specified by this option, (y) is considered a
continuous variable (such as a concentration or other property to predict). In
these cases, the "sample similarity" (a.k.a. "column
gradient") method of calculating the covariance matrix will be used.
The sample similarity method determines the down-weighting required based mostly
on the samples which are the most similar (on the specified y-scale). Set to >=1
to disable and to 0 (zero) to always use.
maxpcs: [ 50 ] maximum number of
components (factors) to allow in the GLSW model. Typically, the number of
factors included in a model will be the smallest of this number, the number
of variables, or the number of samples. Having a limit set here is useful when
deriving a GLSW model from a large number of samples and variables. Often, a
GLSW model effectively uses fewer than 20 components. Thus, this option can be
used to keep the GLSW model smaller in size. It may, however, decrease its
effectiveness if critical factors are not included in the model.
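The singleton-fraction test described under the gradientthreshold option can be sketched in base MATLAB as follows. The y vector and the variable names are hypothetical, and the code only mirrors the fraction formula given above; it does not reproduce the glsw implementation.

```matlab
y = [1; 1; 2; 3; 3; 3; 4; 5];         % hypothetical y values, one per sample
gradientthreshold = 0.25;             % option value shown in the text

[~,~,idx] = unique(y);                % map each sample to its group index
counts    = accumarray(idx(:), 1);    % number of samples in each group
nsingle   = sum(counts == 1);         % samples that sit alone in their group
fraction  = nsingle / numel(y);       % (# samples in singleton groups) / total samples

% y is treated as continuous when the fraction is above the threshold
usegradient = fraction > gradientthreshold;
```

Here three of the eight samples are in singleton groups, so the fraction exceeds 0.25 and y would be treated as a continuous variable.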
When applying a GLSW model, the inputs are newx, the x-block to be deweighted, and modl, a GLSW model structure.
Outputs are modl, a GLSW model structure, and xt, the deweighted x-block.
See Also
osccalc, pca, pls, preprocess