- 11.1. Feature extraction
- 11.1.1. Principal Component Analysis (PCA)
- 11.1.2. Linear Discriminant Analysis (LDA)
- 11.2. Feature selection
- 11.2.1. Individual feature selection (feature ranking)
- 11.2.2. Random feature selection
- 11.2.3. Forward search
- 11.2.4. Backward search
- 11.2.5. Floating search
- 11.2.6. Initialization of the selection searches
An important pre-requisite for training robust classifiers is construction of an informative data representation. perClass provides a number of tools for building data representations based on features or proximities to prototype examples.
This chapter is organized into two parts. First, we discuss feature extraction and then feature selection methods.
Two basic types of dimensionality reduction include feature extraction and feature selection. Feature extraction results in a lower-dimensional space constructed using information from all original features. Feature selection, on the other hand, chooses a smaller subset of the original features. This is useful to identify the informative features, and to limit computational demands when executing the recognition system on new observations.
11.1. Feature extraction ↩
Two prominent feature extraction methods are Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA). The major difference is that PCA is an unsupervised approach (not using ground-truth labels) while LDA is fully supervised.
11.1.1. Principal Component Analysis (PCA) ↩
PCA derives lower-dimensional linear subspace by preserving a specific
fraction of variation in the original data. It is implemented using the sdpca command. This feature extraction method ia unsupervised. This
means that the class labels are not taken into account. Therefore, the presence of labels in the data set does not alter the resulting PCA projection.
Running sdpca without additional arguments results in a decorrelation of
the data (mere rotation) preserving all information in the original data
set. This is useful, for example, when building classifiers that makes use
of thresholding (decision trees) or assumptions of independence (naive
Bayes).
>> p=sdpca(a)
PCA pipeline 300x20 100% of variance (sdp_affine)

In order to reduce number of features, we may either provide the desired dimensionality directly or specify the fraction of preserved variance:
>> p=sdpca(a,3) % specifying output dimensionality
PCA pipeline 300x3 74% of variance (sdp_affine)
>> p=sdpca(a,0.95) % specifying fraction of preserved variance
PCA pipeline 300x16 95% of variance (sdp_affine)
Note that sdpca always displays how much of original data variance is
preserved.
TIP: It is a good practice to perform PCA on a data set we start working on in order to understand its complexity. If the PCA preserving some 95 or 99% of variance results in significant dimensionality reduction, we probably deal with data located along a simple manifold structure. This is often happening for signals or image data sets. If the resulting dimensionality is high compared with the original number of features, we face a complicated problem and may need more complex classifier and consequently larger number of training examples.
11.1.2. Linear Discriminant Analysis (LDA) ↩
LDA is a supervised feature extraction method that finds a linear subspace maximizing separability between classes. The dimensionality of the resulting subspace is fixed to the minimum between: number of features, number of samples, number of classes - 1). Usually, the output dimensionality is determined by the number of classes. For a two-class problem LDA results in a projection on a line, for thee-class problem on a 2D plane.
>> a
'medical D/ND' 6400 by 11 sddata, 3 classes: 'disease'(1495) 'no-disease'(4267) 'noise'(638)
>> p=sdlda(a)
LDA pipeline 11x2 (sdp_affine)
LDA projection may be computed for data sets with number of features larger than number of classes by using the pseudo-inverse. However, we should carefuly investigated the generalization capability of such projections, i.e. the capability of reliably mapping unseen data.
One possible remedy for poor generalization capability of LDA in high-dimensional small-sample size problems is to first perform PCA and estimate LDA projection only in lower-dimensional subspace of the original data set.
>> d
'Difficult Dataset' 100 by 200 sddata, 2 classes: '1'(48) '2'(52)
>> p=sdpca([],10)*sdlda
untrained pipeline 2 steps: sdpca+sdlda
>> p2=d*p
sequential pipeline 200x1 'PCA+LDA'
1 PCA 200x10 38%% of variance (sdp_affine)
2 LDA 10x1 (sdp_affine)
11.2. Feature selection ↩
The feature selection strategies aim at selecting an informative subset of
features, out of the complete set. The goodness of a solution is judged
based on a criterion or on the classification error. perClass includes
fast feature selection via the sdfeatsel function. sdfeatsel selects
features by minimizing the error of 1-NN classifier trained on a part of
the data and tested on the remaining test set. Leveraging classifier error
as a feature selection criterion has the advantage of providing a clear
decision on how many features are needed.
11.2.1. Individual feature selection (feature ranking) ↩
The individual search evaluates each feature separately. The advantage of
individual search is high speed. It is therefore useful for pre-selection
of a candidate feature subset from a large set of features. Note, however,
that individually poor features may yield high class separability when used
together. It is possible to provide additional test set directly using the
'test' option. The data set tr is then used for classifier training and
the test set ts for error estimation.
>> [tr,ts]=randsubset(a,0.5)
'medical D/ND' 3199 by 11 sddata, 3 classes: 'disease'(747) 'no-disease'(2133) 'noise'(319)
'medical D/ND' 3201 by 11 sddata, 3 classes: 'disease'(748) 'no-disease'(2134) 'noise'(319)
>> pf=sdfeatsel(tr,'test',ts,'method','individual')
individual:
Feature subset pipeline 11x1 (sdp_fsel)
The pipeline 'pf' applied to the sddata object will provide the reduced data set. In this case the single best feature.
>> ts2=ts*pf
'medical D/ND' 3201 by 1 sddata, 3 classes: 'disease'(748) 'no-disease'(2134) 'noise'(319)
To inspect the feature selection result sdfeatsel provides also the structure with detailed information on the selection process.
>> [pf,res]=sdfeatsel(tr,'test',ts,'method','individual')
individual:
Feature subset pipeline 11x1 (sdp_fsel)
res =
feat: [6 8 2 9 7 1 11 5 3 10 4]
crit: [1x11 double]
best_feat: 6
best_crit: 0.5081
method: 'individual ranking'
Let's get a feeling for the speed:
>> tic; [pf,res]=sdfeatsel(tr,'test',ts,'method','individual'); toc
individual:
Elapsed time is 0.569798 seconds.
>> tic; [pf2,res2]=sdfeatsel(tr,'test',ts); toc
forward:...........
Elapsed time is 5.486352 seconds.
By default sdfeatsel performs the forward search. Note that the individual search is almost an order of magnitude faster than the forward search.
11.2.2. Random feature selection ↩
Random search evaluates a large set of random feature subsets and returns the best one. It is useful to implement more complex feature selection strategies such as genetic feature selection or as an initialization of greedy searches. For example, the forward search suffers if all features judged individually are poor. Bad choices made early on in greedy search cannot be undone. Random search allows us to start the greedy search from a meaningful feature subset.
>> tic; [pf,res]=sdfeatsel(tr,'test',ts,'method','random'); toc
random:
Elapsed time is 16.202571 seconds.
>> pf
Feature subset pipeline 11x5 (sdp_fsel)
The res structure contains all analyzed feature subsets in a binary matrix
res.subsets (subsets vs features). The criterion values are stored into res.crit.
>> res
subsets: [200x11 uint8]
crit: [200x1 double]
best_subset: [1 2 3 9 11]
best_count: 5
best_crit: 0.2868
method: 'random search'
sdfeatsel may be also used to quickly select a set of features manually. We
may simply provide the data set and desired feature indices:
>> pf=sdfeatsel(a,[1:4 8])
Feature subset pipeline 11x5 (sdp_fsel)
>> +pf
ans =
ind: [1 2 3 4 8]
>> pf.lab'
1 StdDev
2 Skew
3 Kurtosis
4 Energy Band 1
5 Moment 02
>>
>> a2=a*pf
'medical D/ND' 5762 by 5 sddata, 2 classes: 'disease'(1495) 'no-disease'(4267)
>> a2.featlab'
1 StdDev
2 Skew
3 Kurtosis
4 Energy Band 1
5 Moment 02
11.2.3. Forward search ↩
The forward search is a greedy strategy. It start with the best individual feature and adds to it the feature that together with the first provides the best result. The procedure is repeated until all features are used. Therefore, in the forward search, the merit of additional features are judged together with the already selected subset.
>> [pf,res]=sdfeatsel(tr,'test',ts,'method','forward');
forward:...........
Feature subset pipeline 11x7 (sdp_fsel)
res =
method: 'forward search'
improvement: 1
best_subset: [6 11 9 3 10 7 8]
best_count: 7
best_crit: 0.2699
feat_init: []
feat: [6 11 9 3 10 7 8 2 1 5 4]
crit: [1x11 double]
Number of features is selected automatically because the selection criteria uses classifier error. If no test set is provided the input data set is split randomly into two parts, one used to train 1-NN classifier used as criterion, and the other to estimate the performance. Due to the random splitting, repeating the search on the same data may result in different subset of selected features.
>> [pf,res]=sdfeatsel(a)
res =
method: 'forward search'
improvement: 1
best_subset: [6 11 8 3 7 1 9 2]
best_count: 8
best_crit: 0.2655
feat_init: []
feat: [6 11 8 3 7 1 9 2 10 5 4]
crit: [1x11 double]
>> [pf,res2]=sdfeatsel(a)
forward:...........
Feature subset pipeline 11x5 (sdp_fsel)
res2 =
method: 'forward search'
improvement: 1
best_subset: [6 11 8 3 7]
best_count: 5
best_crit: 0.2572
feat_init: []
feat: [6 11 8 3 7 9 10 2 1 5 4]
crit: [1x11 double]
The default splitting fraction is 0.75 (75% of data used for training). The
fraction may be changed using 'trfrac' option. Additional information on the feature selection process may be retrieved using the second output: the structure res provides the best subset found and also all subsets analyzed together with the criteria values.
11.2.4. Backward search ↩
The backward search starts from all features and gradually removes the features until the performance improves.
>> [pf,res]=sdfeatsel(tr,'test',ts,'method','backward')
backward:..........
Feature subset pipeline 11x6 (sdp_fsel)
res =
method: 'backward search'
improvement: 1
best_subset: [1 11 9 3 2 8]
best_count: 6
best_crit: 0.2833
feat_init: []
feat: [1 11 9 3 2 8 7 6 10 5 4]
crit: [1x11 double]
11.2.5. Floating search ↩
The floating search combines multiple forward and backward steps gradually improving the best feature subset found. If not initial feature set is given, the floating search is initialized by the random search.
>> [pf,res]=sdfeatsel(tr,'test',ts,'method','floating')
floating: random:(25), backward:........(22), forward:.........(44), backward:...........(44), forward:........(44)
Feature subset pipeline 60x44 (sdp_fsel)
res =
best_subset: [1x41 double]
best_crit: 0.0191
feat: {1x5 cell}
feat_count: [25 22 44 41 44]
crit: [0.0765 0.0383 0.0281 0.0191 0.0281]
method: 'floating'
stop: 5
Floating search uses full forward and backward search steps. Therefore, it
cannot provide a single nested list of features. Instead, the res
structure contains a cell array with a feature subset for each floating
step and corresponding criterion value.
>> res.feat
ans =
Columns 1 through 4
[1x25 double] [1x22 double] [1x44 double] [1x41 double]
Column 5
[1x44 double]
The res structure also provides lengths of feature subsets in
res.feat_count field. Thereby, we may quickly visualize the process of
floating search minimizing the criterion:
>> figure; plot(res.feat_count,res.crit,'b-x')
>> xlabel('feature count')
>> ylabel('criterion (1-NN error) on the test set')

The x axis shows the number of features in each floating step, the y axis then the corresponding 1-NN error on the test set.
11.2.6. Initialization of the selection searches ↩
Forward, backward or floating search may be initiated from an existing subset of features. For example, we can start the forward search from the random search results:
>> [pf,res]=sdfeatsel(tr,'test',ts,'method','random');
>> [pf2,res2]=sdfeatsel(tr,'test',ts,'method','forward','from',pf);
forward:......
Feature subset pipeline 11x6 (sdp_fsel)
res2 =
method: 'forward search'
improvement: 1
best_subset: [1 2 3 9 11 8]
best_crit: 0.2833
best_count: 6
feat_init: [1 2 3 9 11]
crit_init: 0.2868
feat: [1 2 3 9 11 8 7 6 10 5 4]
crit: [0.2833 0.2837 0.2886 0.2985 0.3202 0.3755]
>> [pf1,res1]=sdfeatsel(tr,'test',ts)
forward:...........
Feature subset pipeline 11x7 (sdp_fsel)
res1 =
method: 'forward search'
improvement: 1
best_subset: [8 11 1 3 7 9 6]
best_count: 7
best_crit: 0.2961
feat_init: []
feat: [8 11 1 3 7 9 6 2 10 5 4]
crit: [1x11 double]
If we attempt to initiate another forward search from the subset found
(pf1), we will not improve:
>> [pf2,res2]=sdfeatsel(tr,'test',ts,'from',pf1)
forward:
Feature subset pipeline 11x7 (sdp_fsel)
res2 =
method: 'forward search'
improvement: 0
best_subset: [8 11 1 3 7 9 6]
best_crit: 0.2961
best_count: 7
feat_init: [8 11 1 3 7 9 6]
crit_init: 0.2961
feat: [8 11 1 3 7 9 6 2 10 5 4]
crit: [0.3160 0.3160 0.3527 0.3850]
However, if we perform a backward search we may reach better performance by removing some of the features:
>> [pf2,res2]=sdfeatsel(tr,'test',ts,'from',pf1,'method','backward')
backward:......
Feature subset pipeline 11x5 (sdp_fsel)
res2 =
method: 'backward search'
improvement: 1
best_subset: [7 11 8 3 6]
best_count: 5
best_crit: 0.2789
feat_init: [8 11 1 3 7 9 6]
feat: [7 11 8 3 6 9 1]
crit: [0.5377 0.4268 0.3429 0.2976 0.2789 0.2823 0.2961]
