Mdcheck
From Eigenvector Documentation Wiki
Purpose
Missing Data Checker and infiller.
Synopsis
- [flag,missmap,infilled] = mdcheck(data,options)
Description
This function checks for missing data and, if desired, infills it using a PCA model. The input data is the data to be checked, supplied as either a double array or a dataset object. The optional input options is a structure containing options for how the function is to run (see below).
The outputs are flag, the fraction of missing data; missmap, a map of the locations of the missing data as a uint8 variable; and infilled, the data with the missing values filled in. Depending on the plots option, a plot of the missing data may also be produced.
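As a rough illustration of what the first two outputs contain, the sketch below computes a missing-data fraction and a 0/1 location map for a small array. It is plain Python rather than MATLAB and is not the toolbox code. The helper name missing_summary and the sample values are hypothetical, and it assumes missing entries are marked with NaN.

```python
import math

def missing_summary(data):
    """Return (fraction_missing, missmap) for a list of rows; NaN marks a missing value."""
    # 1 where a value is missing, 0 elsewhere (a stand-in for a uint8 map)
    missmap = [[1 if math.isnan(v) else 0 for v in row] for row in data]
    total = sum(len(row) for row in data)
    n_missing = sum(sum(row) for row in missmap)
    return n_missing / total, missmap

nan = float("nan")
sample = [[1.0, 2.0], [2.0, nan], [nan, 6.0], [4.0, 8.0]]  # hypothetical data
frac, missmap = missing_summary(sample)
print(frac)     # 0.25
print(missmap)  # [[0, 0], [0, 1], [1, 0], [0, 0]]
```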
Options
- options = a structure array with the following fields:
- frac_ssq: [{0.95}] desired fraction between 0 and 1 of variance to be captured by the PCA model,
- max_pcs: [{5}] maximum number of PCs in the model; if 0, the mean is used instead,
- meancenter: ['no' | {'yes'}], tells whether to use mean centering in the algorithm,
- recalcmean: ['no' | {'yes'}], recalculate mean center after each cycle of replacement (may improve results for small matrices),
- display: [{'off'} | 'on'], governs level of display,
- tolerance: [{1e-6 100}] convergence criteria; the first element is the minimum change and the second is the maximum number of iterations,
- max_missing: [{0.4}] maximum fraction of missing data with which MDCHECK will operate,
- toomuch: [{'error'} | 'exclude'] what action should be taken if too much missing data is found. 'error' exits with an error message; 'exclude' excludes elements (rows/columns/slabs/etc.) which contain too much missing data from the data before replacement. 'exclude' requires a dataset object as input for (data), and
- algorithm: [ {'svd'} | 'nipals' ] specifies the missing data algorithm to use. NIPALS is typically used for large amounts of missing data or large multi-way arrays.
Note: MDCHECK captures up to options.frac_ssq of the variance using options.max_pcs or fewer PCA components.
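One way to read the note above is as a rule for choosing the number of components. The sketch below is plain Python, not the toolbox code. The function name choose_n_pcs and the variance values are hypothetical, and the exact rule MDCHECK applies is an assumption here: take the fewest components whose cumulative variance reaches frac_ssq, never exceeding max_pcs.

```python
def choose_n_pcs(variances, frac_ssq=0.95, max_pcs=5):
    """Pick a number of PCs from per-component variances sorted largest first.

    Assumed rule (not confirmed by the documentation): stop at the first
    component count whose cumulative variance fraction reaches frac_ssq,
    capped at max_pcs.
    """
    total = sum(variances)
    if max_pcs == 0 or total == 0:
        return 0  # no PCA components: only the mean would be used
    captured = 0.0
    n_pcs = 0
    for var in variances[:max_pcs]:
        captured += var
        n_pcs += 1
        if captured / total >= frac_ssq:
            break
    return n_pcs

variances = [6.0, 3.0, 0.8, 0.2]           # hypothetical component variances
print(choose_n_pcs(variances))             # 3 (0.98 of the variance)
print(choose_n_pcs(variances, max_pcs=2))  # 2 (cap reached at 0.90)
```

With the cap in effect, less than frac_ssq of the variance may be captured, which matches the "up to" wording of the note.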
The default options can be retrieved using: options = mdcheck('options');
Algorithm
The replacement algorithm is a successive approximations routine using PCA models to replace the data. Values for the missing data are first estimated using the mean of each variable. Then, a PCA model which captures a given percentage of the variance is calculated and the missing values are replaced again to be most consistent with the loadings of the PCA model (see the replace function). The PCA model is recalculated using the newly replaced data and the process is repeated until the change in the replaced values drops below a threshold or the maximum number of iterations is reached (see the tolerance option).
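The loop described above can be sketched in a few lines. This is a simplified illustration in plain Python and not the MDCHECK implementation. It assumes a single principal component, finds the loading by power iteration instead of SVD or NIPALS, always mean centers and recalculates the mean each cycle, and marks missing values with NaN. The names pca_infill and first_loading and the demo data are hypothetical.

```python
import math

def first_loading(centered, n_iter=200):
    """Leading PCA loading via power iteration on X'X (one-PC stand-in for an SVD)."""
    m = len(centered[0])
    xtx = [[sum(r[i] * r[j] for r in centered) for j in range(m)] for i in range(m)]
    p = [1.0] * m  # fixed start; assumes it is not orthogonal to the loading
    for _ in range(n_iter):
        q = [sum(xtx[i][j] * p[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(v * v for v in q))
        p = [v / norm for v in q]
    return p

def pca_infill(data, tol=1e-6, max_iter=100):
    """Iteratively replace NaN entries using a one-component PCA model."""
    n, m = len(data), len(data[0])
    miss = [[math.isnan(v) for v in row] for row in data]
    # Step 1: start from the mean of the observed values of each variable
    start = []
    for j in range(m):
        vals = [row[j] for row in data if not math.isnan(row[j])]
        start.append(sum(vals) / len(vals))
    filled = [[start[j] if miss[i][j] else data[i][j] for j in range(m)]
              for i in range(n)]
    for _ in range(max_iter):
        # Step 2: mean center (recalculated every cycle) and fit the PCA model
        mean = [sum(row[j] for row in filled) / n for j in range(m)]
        centered = [[row[j] - mean[j] for j in range(m)] for row in filled]
        p = first_loading(centered)
        change = 0.0
        for i in range(n):
            if not any(miss[i]):
                continue
            # Step 3: score from the observed entries only, then fill from the model
            num = sum(p[j] * centered[i][j] for j in range(m) if not miss[i][j])
            den = sum(p[j] * p[j] for j in range(m) if not miss[i][j])
            t = num / den
            for j in range(m):
                if miss[i][j]:
                    new = mean[j] + t * p[j]
                    change = max(change, abs(new - filled[i][j]))
                    filled[i][j] = new
        # Step 4: stop once the replaced values have settled
        if change < tol:
            break
    return filled

nan = float("nan")
# Hypothetical data: column 2 is 2x column 1, column 3 is 3x column 1
demo = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 6.0, 9.0],
        [4.0, 8.0, 12.0], [5.0, nan, 15.0]]
filled = pca_infill(demo)
print(round(filled[4][1], 3))  # 10.0
```

The tol and max_iter arguments play the role of the two elements of the tolerance option. The sketch does not guard against rows with no observed values or loadings that are zero on every observed variable, both of which would need handling in real use.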
Using PCA to replace data generally works better than using the mean of a variable because it uses the covariance in the data to estimate what the missing values should be. For example, given two variables:
A      B
1      2
2      4
3      6
4      8
NaN    10
Replacement of the NaN in column A with the mean of A would make the value 2.5. However, given that column B is always 2x column A, a value of 5 would be more consistent with the covariance of the two variables. A PCA model gives the correct value but using a simple mean-variable replacement leads to a 50% error.
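The figures quoted in this example can be checked with a few lines of arithmetic. The snippet below is plain Python with hypothetical variable names. It does not run a PCA model; it simply uses the stated 2x relationship between the columns to obtain the covariance-consistent value.

```python
a_observed = [1.0, 2.0, 3.0, 4.0]  # column A without the missing entry
b_last = 10.0                      # B value in the row where A is missing

mean_fill = sum(a_observed) / len(a_observed)  # simple mean replacement
consistent = b_last / 2.0                      # B is 2x A in every complete row
rel_error = abs(mean_fill - consistent) / consistent

print(mean_fill)   # 2.5
print(consistent)  # 5.0
print(rel_error)   # 0.5
```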