PLS_Toolbox Documentation: gaselctr

gaselctr

Purpose

Genetic algorithm for variable selection with PLS.

Synopsis

 

model = gaselctr(x,y,options)

[fit,pop,avefit,bstfit] = gaselctr(x,y,options)

options = gaselctr('options')

Description

GASELCTR uses a genetic algorithm to select variables by minimizing the cross-validation error.

INPUTS:

                         x =   the predictor block (x-block), and

                         y =   the predicted block (y-block) (note that all scaling should be done prior to running GASELCTR).

Options

             options =   a structure array with the following fields:

         plots:  ['none' | {'intermediate'} | 'replicates' | 'final' ] Governs plots.

                                 'final' gives only a final summary plot.

                                 'replicates' gives plots at the end of each replicate.

                                 'intermediate' gives plots during analysis.

                                 'none' gives no plots.

               popsize:   {64} the population size (16<=popsize<=256 and popsize must be divisible by 4),

maxgenerations:   {100} the maximum number of generations (25<=mg<=500),

     mutationrate:   {0.005} the mutation rate (typically 0.001<=mt<=0.01),

       windowwidth:   {1} the number of variables in a window (integer window width),

       convergence:   {50} percent of population the same at convergence (typically cn=80),

     initialterms:   {30} percent terms included at initiation (10<=bf<=50),

           crossover:   {2} breeding cross-over rule (cr = 1: single cross-over; cr = 2: double cross-over),

           algorithm:   [ 'mlr' | {'pls'} ] regression algorithm,

                   ncomp:   {10} maximum number of latent variables for PLS models,

                         cv:   [ 'rnd' | {'con'} ] cross-validation option ('rnd': random subset cross-validation; 'con': contiguous block subset cross-validation),

                   split:   {5} number of subsets to divide data into for cross-validation,

                     iter:   {1} number of iterations for cross-validation at each generation,

   preprocessing:   {[] []} a cell containing standard preprocessing structures for the X- and Y-blocks respectively (see PREPROCESS),

             preapply:   [ {0} | 1 ] If 1, preprocessing is applied to the data prior to the GA. This speeds up the selection, but may reduce the accuracy of the cross-validation results. Output "fit" values should only be compared to each other. A full cross-validation should be run after analysis to get more accurate RMSECV values.

                     reps:   {1} the number of replicate runs to perform,

                 target:   a two element vector [target_min target_max] describing the target range for the number of variables/terms, n, included in a model. Outside of this range, the penaltyslope option is applied by multiplying the fitness for each member of the population by:

                                    penaltyslope*(target_min-n) when n<target_min, or

                                    penaltyslope*(n-target_max) when n>target_max.

                                 Field target is used to bias models towards a given range of included variables (see penaltyslope below),

           targetpct:   {1} flag indicating if values in field target are given in percent of variables (1) or in absolute number of variables (0), and

     penaltyslope:   {0} the slope of the penalty function (see target above).
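
As a rough illustration of the two penalty expressions given under target, the sketch below evaluates them for a few values of n. The target range and penaltyslope values are hypothetical, target is taken in absolute numbers of variables (targetpct = 0), and a result of 0 is used here only to mark values of n inside the target range, where neither expression applies. It does not reproduce how GASELCTR combines the penalty with the fitness internally.

```matlab
% Hypothetical settings, chosen only for illustration.
target       = [10 20];   % [target_min target_max], absolute counts (targetpct = 0)
penaltyslope = 0.01;

% The two expressions from the text. A value of 0 marks n inside the
% target range, where neither expression applies.
penaltyterm = @(n) penaltyslope*(target(1)-n).*(n<target(1)) + ...
                   penaltyslope*(n-target(2)).*(n>target(2));

n = [5 10 15 20 30];      % numbers of included variables to evaluate
p = penaltyterm(n);

disp([n; p]);             % first row: n, second row: penalty expression value
```

The term grows linearly with the distance of n from the nearest edge of the target range, so a larger penaltyslope biases the population more strongly toward models inside the range.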

The default options can be retrieved using: options = gaselctr('options');
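
Before starting a long run it can help to check custom settings against the ranges listed above. The following is a minimal sketch with hypothetical values. The structure opts is a stand-in built by hand, not the structure returned by gaselctr('options'), and whether GASELCTR performs such checks itself is not stated here.

```matlab
% Hypothetical settings to check; field names follow the options list above.
opts.popsize        = 96;
opts.maxgenerations = 200;
opts.initialterms   = 30;

% Ranges as documented in the options list.
okpop  = opts.popsize >= 16 && opts.popsize <= 256 && mod(opts.popsize,4) == 0;
okgen  = opts.maxgenerations >= 25 && opts.maxgenerations <= 500;
okinit = opts.initialterms >= 10 && opts.initialterms <= 50;

if ~(okpop && okgen && okinit)
  error('One or more GA settings fall outside the documented ranges.');
end
```

The same field values could then be assigned to the structure returned by gaselctr('options') before calling GASELCTR.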

OUTPUT:

                  model =   a standard GENALG model structure with the following fields:

           modeltype:   'GENALG' This field will always have this value,

         datasource:   {[1x1 struct] [1x1 struct]}, structures defining where the X- and Y-blocks came from

                     date:   date stamp for when GASELCTR was run,

                     time:   time stamp for when GASELCTR was run,

                     info:   'Fit results in "rmsecv", population included variables in "icol"', information field describing where the fitness results for each member of the population are contained,

                 rmsecv:   fitness results for each member of the population. For an MxN x-block and Mp unique members of the population at convergence, rmsecv will be 1xMp,

                     icol:   each row of icol corresponds to the variables used for that member of the population (a 1 [one] means that the variable was used and a 0 [zero] means that it was not). For an MxN x-block and Mp unique members of the population at convergence, icol will be MpxN, and

                 detail:   [1x1 struct], a structure array containing model details including the following fields:

                                 avefit: the average fitness at each generation,

                                 bestfit: the best fitness at each generation, and

                                 options: a structure corresponding to the options discussed above.
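
The rmsecv and icol fields can be read together to find which variables the best member of the population used. The sketch below uses a small hand-built stand-in for those two fields (4 members, 6 variables, made-up values) instead of real GASELCTR output, and it assumes that a lower rmsecv value indicates a better member, consistent with the GA minimizing cross-validation error.

```matlab
% Hypothetical stand-in for two fields of a GENALG model structure.
model.rmsecv = [0.42 0.31 0.38 0.35];     % 1xMp fitness values
model.icol   = [1 0 1 0 0 1
                0 1 1 0 1 0
                1 1 0 0 0 1
                0 0 1 1 1 0];             % MpxN inclusion flags

[bestfit, ibest] = min(model.rmsecv);     % lowest cross-validation error
selected = find(model.icol(ibest,:));     % column indices used by that member
usage    = mean(model.icol, 1);           % fraction of members using each variable

disp(selected);
disp(usage);
```

With real output, x(:,selected) would give the x-block restricted to the variables chosen by the best member, and usage shows which variables recur across the final population.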

Examples

To use mean centering outside the genetic algorithm (no additional centering will be performed within the algorithm) do the following:

x2 = mncn(x);

y2 = mncn(y);

[fit,pop] = gaselctr(x2,y2);

To use mean centering inside the genetic algorithm (centering will be performed for each cross-validation subset) do the following:

options = gaselctr('options');

options.preprocessing{1} = preprocess('default', 'mean center');

options.preprocessing{2} = preprocess('default', 'mean center');

[fit,pop] = gaselctr(x,y,options);

See Also

calibsel, genalg, genalgplot
