perClass Documentation
version 5.0 (21-sep-2016)

kb21: Feature selection in perClass

Keywords: feature selection, visual inspection, greedy search

Problem: Why to perform feature selection?

Solution: Feature selection allows us to identify the informative features and to reduce the dimensionality, helping the classifiers to do a better job. perClass offers interactive tools to perform feature selection, together with a number of greedy searches that are scalable to large data sets.

There are number of reasons why we might want to reduce the dimensionality of our data set:

21.1. Interactive selection of features ↩

With perClass you have several ways to evaluate the relevance of your features. A direct visual inspection gives you a feeling about the feature informativeness. The sdfeatplot shows the per-class density plot, allowing you to inspect each feature individually. Moving with the up/down cursor keys you can browse between the densities for different features. An example is illustrated in the figure below, an a medical data set with two classes (disease/no disease).

In the first figure we can see that the distributions of the two classes are overlapping. We can judge that this feature is, therefore, not important, at least if taken individually. The second figure shows some separation in the class distributions, which increases in the third feature. This last one seems to carry the most information of the three.

We might want to inspect the effect of combination of features. The "show feature distribution" option of sdscatter helps us to evaluate the combination of two features. By pressing the d key we can visualize the scatterplot and the corresponding density plots. In this way we can identify combination of features that show good class separability.

Once we identify a set of features that we judged informative, we can create a pipeline that allows us to select them directly from the original data set. The example below shows how to select features 3, 5, and 8 from the data set b containing 11 features.

>> b
'medical D/ND' 5762 by 11 sddata, 2 classes: 'disease'(1495) 'no-disease'(4267) 
>> pf=sdfeatsel(b,[3 5 8])
Feature subset pipeline 11x3   (sdp_fsel)
>> c=b*pf
'medical D/ND' 5762 by 3 sddata, 2 classes: 'disease'(1495) 'no-disease'(4267) 

Practical remark: Typically, during classifier design you start with large number of potentially useful features and evaluate classifiers using different feature subsets. Your final classifier will start with a feature selection pipeline step. In production you need to make sure that correct features are provided to the classifier in a good order. Do you need to remember by heart what features were used? Luckily not, perClass stores feature names in the feature selection pipeline:

>> pf.lab'
   1 Kurtosis     
   2 Energy Band 2
   3 Moment 02        

21.2. Feature selection: individual search ↩

More objective evaluation of the feature relevance is provided by sdfeatsel. This function provides several ways to perform feature selection.

The fastest method is the individual selection. Each feature is judged individually based on a criterion. As default the 1-NN (first nearest neighbor) classifier is used.

>> b
'medical D/ND' 5762 by 11 sddata, 2 classes: 'disease'(1495) 'no-disease'(4267) 
>> [pf,res]=sdfeatsel(b,'method','individual')
Feature subset pipeline 11x1   (sdp_fsel)

res = 

         feat: [6 8 7 9 2 1 11 5 4 10 3]
         crit: [0.2920 0.3127 0.3143 0.3319 0.3458 0.3534 0.3963 0.4317 0.4806 0.4928 0.5138]
    best_feat: 6
    best_crit: 0.2920
       method: 'individual ranking'

The features are ranked according to the criterion value from the most to the least informative. The classification error of a different classifier can also be used as criterion to judged the feature informativeness.

When to use it: when you have large number of features and would like to make a quick preselection.

Keep in mind: The individual selection does not take into account how the features work as a group. Individually poor features may yield high class separability when used together. To take this into account a forward or a backward search may be performed.

21.3. Feature selection: Forward and backward searches ↩

The forward search starts with the best individual feature and adds to it the feature that together with it provides the best result. The procedure is repeated until all features are used. Therefore, in the forward search, the merit of additional features are judged together with the already selected subset.

>> [pf,res]=sdfeatsel(b,'method','forward')
Feature subset pipeline 11x8   (sdp_fsel)

res = 

         method: 'forward search'
    improvement: 1
    best_subset: [6 8 11 1 3 7 9 10]
     best_count: 8
      best_crit: 0.1483
      feat_init: []
           feat: [6 8 11 1 3 7 9 10 2 5 4]
           crit: [0.3162 0.2647 0.2005 0.1661 0.1565 0.1485 0.1519 0.1483 0.1659 0.1738 0.1947]

The backward search starts from all features and gradually removes them until the performance improves.

>> [pf,res]=sdfeatsel(b,'method','backward')
Feature subset pipeline 11x3   (sdp_fsel)

res = 

         method: 'backward search'
    improvement: 1
    best_subset: [1 11 9]
     best_count: 3
      best_crit: 0.1492
      feat_init: []
           feat: [1 11 9 3 2 5 10 6 8 7 4]
           crit: [0.3417 0.1843 0.1492 0.1699 0.1711 0.1658 0.1573 0.1577 0.1563 0.1563 0.1664]

Keep in mind: Both forward and backward searches are sensitive to the starting features. If the first selection is not optimal we will never recover. A selection method that does not have this shortcoming is the floating search.

21.4. Feature selection: floating search ↩

The floating search combines multiple forward and backward steps, gradually improving the best feature subset found.

>> [pf,res]=sdfeatsel(b,'method','floating')
floating: random:(5), backward:....(4), forward:.......(4)
Feature subset pipeline 11x4   (sdp_fsel)

res = 

    best_subset: [1 11 8 7]
      best_crit: 0.1527
           feat: {[1 7 8 9 11]  [1 11 8 7]  [1 11 8 7]}
     feat_count: [5 4 4]
           crit: [0.1658 0.1527 0.1527]
         method: 'floating'
           stop: 4

Note that in perClass the starting set used by the floating search is the best subset selected out of a number of randomly chosen sets. In this way we aim at speeding up and improving the search by starting from a good selection. The random selection can also be used as a starting set for the forward and backward searches. perClass performs full forward and backward searches in each floating round.

Keep in mind: The major limitation of the floating method is the speed. This approach is the slowest between the presented approaches.

21.5. Scalability of feature selection methods ↩

Feature selection is important to improve the classification performance, and often is, therefore, a necessary step. In many applications, however, the speed is an issue. perClass C implementation makes feature selection scalable also to very large data sets. Below is a performance comparison of perClass with PRTools feature selection implemented in Matlab. A small data set with 5000 samples and 11 features is used.

Three searches approaches are compared, namely the individual, forward and floating searches using the 1-NN classifier error as a criterion.

perClass feature selection scales to large data sets. Below we illustrate the scalability of the individual search with different data sizes, up to 5 million samples. Two criterion are used. The first is the 1-NN error (in red), whose speed is directly proportional to the number of samples. The second one, in blue, is the error of the Fisher classifier (sdfisher). With 5 million samples the individual search with 1-NN does the job in little more then 200 seconds, while the Fisher criterion takes only 75 seconds. For data sets smaller than 50000 samples the time goes below 2 seconds, with the 1-NN criterion being much faster.

Practical remark: For very large data sets, a criterion based on a model, such as Fisher, is faster than non-parametric criteria such as 1-NN.