Mdcheck

From Eigenvector Documentation Wiki


toomuch: [ {'error'} | 'exclude' ] specifies the action to take if too much missing data is found. 'error' exits with an error message; 'exclude' removes the elements (rows, columns, slabs, etc.) that contain too much missing data before replacement. 'exclude' requires a dataset object as input for (data).

algorithm: [ {'svd'} | 'nipals' ] specifies the missing data algorithm to use. NIPALS is typically used for large amounts of missing data or large multi-way arrays.

Note: MDCHECK captures up to options.frac_ssq of the variance using options.max_pcs or fewer PCA components.
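As a rough illustration of the note above, the number of components could be chosen as sketched below. This is Python rather than the toolbox's MATLAB code, and it is one reading of how the two options interact, not the toolbox implementation. It assumes frac_ssq is a fraction between 0 and 1; the function name choose_n_components and its eigenvalues argument are hypothetical.

```python
def choose_n_components(eigenvalues, frac_ssq, max_pcs):
    """Return the smallest number of leading components whose cumulative
    share of the total sum of squares reaches frac_ssq, capped at max_pcs.

    eigenvalues: PCA eigenvalues sorted from largest to smallest (assumed).
    frac_ssq:    target fraction of variance, between 0 and 1 (assumed).
    max_pcs:     upper limit on the number of components.
    """
    total = sum(eigenvalues)
    captured = 0.0
    for k, value in enumerate(eigenvalues[:max_pcs], start=1):
        captured += value
        if captured / total >= frac_ssq:
            return k
    # Target not reached within the limit: use as many components as allowed.
    return min(max_pcs, len(eigenvalues))
```

With eigenvalues 6, 3 and 1, a target of 0.85 is met by two components (90% captured), while a target of 0.95 with max_pcs = 2 stops at two components without reaching the target.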

The default options can be retrieved using: options = mdcheck('options');

Algorithm

The replacement algorithm is a successive approximations routine using PCA models to replace the data. Values for the missing data are first estimated using the mean of each variable. Then, a PCA model which captures a given percentage of the variance is calculated and the missing values are replaced again to be most consistent with the loadings of the PCA model (see the replace function.) The PCA model is recalculated using the newly replaced data and the process is repeated until the change in the replaced values drops below a threshold.
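The loop described above can be sketched in a few lines of standard-library Python. This is a simplified illustration, not the MDCHECK code. It assumes a single PCA component, no mean-centering, a power iteration in place of the 'svd' or 'nipals' options, and at least one observed value in every row. The names pca_replace and leading_loading are hypothetical.

```python
import math


def leading_loading(rows, iterations=200):
    """Unit-length first loading of the data (leading eigenvector of X'X),
    found by power iteration."""
    n_vars = len(rows[0])
    p = [1.0 / math.sqrt(n_vars)] * n_vars
    for _ in range(iterations):
        scores = [sum(x * w for x, w in zip(row, p)) for row in rows]
        p = [sum(t * row[j] for t, row in zip(scores, rows))
             for j in range(n_vars)]
        norm = math.sqrt(sum(w * w for w in p))
        p = [w / norm for w in p]
    return p


def pca_replace(data, tol=1e-10, max_iter=200):
    """Replace NaN entries by successive approximation with a 1-PC model."""
    n_vars = len(data[0])
    missing = [[math.isnan(v) for v in row] for row in data]

    # Step 1: initial estimate is the mean of the observed values per column.
    means = []
    for j in range(n_vars):
        observed = [row[j] for row, m in zip(data, missing) if not m[j]]
        means.append(sum(observed) / len(observed))
    filled = [[means[j] if m[j] else row[j] for j in range(n_vars)]
              for row, m in zip(data, missing)]

    for _ in range(max_iter):
        # Step 2: fit the PCA model to the currently filled data.
        p = leading_loading(filled)
        change = 0.0
        for row, m in zip(filled, missing):
            if not any(m):
                continue
            known = [j for j in range(n_vars) if not m[j]]
            # Score estimated from the observed entries only; placing the
            # missing entries at p[j] * t minimizes the model residual.
            t = (sum(p[j] * row[j] for j in known)
                 / sum(p[j] ** 2 for j in known))
            for j in range(n_vars):
                if m[j]:
                    new_value = p[j] * t
                    change = max(change, abs(new_value - row[j]))
                    row[j] = new_value
        # Step 3: stop once the replaced values no longer move.
        if change < tol:
            break
    return filled
```

The input list is left unmodified; the function returns a filled copy in which only the entries that were NaN differ from the original.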

Using PCA to replace data generally works better than using the mean of a variable because it uses the covariance in the data to estimate what the missing values should be. For example, consider two variables, A and B, in which column A contains one missing value (NaN).

Replacing the NaN in column A with the mean of A would give a value of 2.5. However, given that column B is always 2x column A, a value of 5 would be more consistent with the covariance of the two variables. A PCA model gives the correct value, whereas simple mean replacement leads to a 50% error.
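The arithmetic behind these figures can be checked with a short Python snippet. The values of A and B below are assumed for illustration; they were chosen only to match the numbers quoted above (mean of A equal to 2.5, B always twice A). A least-squares ratio fitted to the complete rows stands in for the PCA model here, since both recover the same A-to-B relationship when the data are exactly proportional.

```python
# Assumed illustrative data: B is always 2x A, and the last A is missing.
a = [1.0, 2.0, 3.0, 4.0, float("nan")]
b = [2.0, 4.0, 6.0, 8.0, 10.0]

# Rows in which both variables were observed.
complete = [(x, y) for x, y in zip(a, b) if x == x]  # NaN != NaN

# Mean replacement ignores B entirely.
mean_estimate = sum(x for x, _ in complete) / len(complete)

# Covariance-based estimate: fit A = ratio * B on the complete rows,
# then apply the ratio to the B value of the incomplete row.
ratio = (sum(x * y for x, y in complete)
         / sum(y * y for _, y in complete))
covariance_estimate = ratio * b[-1]

# Error of the mean replacement relative to the consistent value.
relative_error = abs(mean_estimate - covariance_estimate) / covariance_estimate

print(mean_estimate, covariance_estimate, relative_error)
```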
