perClass Documentation
development version 3.2 (14-Mar-2012)
Content

Comments? Ideas? Compliments?

Your email (only if you wish to be contacted)

Chapter 12: From models to classifiers

Table of contents

perClass provides several tools for training detectors, multi-class classifiers (discriminants) and hierarchical classifiers.

12.1. Detectors ↩

A detector is a classifier that focuses only at one class of interest, the target class. Detectors may be constructed using the sddetector command. It takes a data set, the target class and an untrained model as parameters and returns the pipeline object providing decisions.

pd = sddetector( data, target_class, model )

In the example below a Gaussian detector is constructed for the class 'apple' of the fruit data:

>> load fruit
260 by 2 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(60) 
>> pd=sddetector(a,'apple',sdgauss)
 1: apple  -> apple    
 2: banana -> non-apple
 3: stone  -> non-apple
sequential pipeline     2x1 'Gaussian model+Decision'
 1  Gaussian model          2x1  one class, 1 component (sdp_normal)
 2  Decision                1x1  thresholding on apple at op 32 (sdp_decide)

 >> sdscatter(a,pd)

Two approaches may be used to set the operating point (threshold) of sddetector:

  • One-class scenario, where the threshold is fixed by specifying the fraction of objects rejected from a target class

  • Two-class situation where the model is trained on a subset of the data and possible thresholds (operating points) are defined based on ROC analysis of the remaining validation set. By default, the operating point is set to minimize mean error between target and non-target classes.

If the user specifies an existing class as the target class, the sddetector will use the remaining classes as non-targets and set the operating points by the ROC analysis. Non-existing name of target class results in building a detector for all data using the one-class strategy. The reject parameter must then specify the fraction of samples to be rejected. If used when building detector for an existing class, sddetector simply fixes the threshold based on the target class samples only, skipping the ROC analysis.

12.1.1. Setting the detector operating point via thresholding ↩

The threshold is set by fixing the reject fraction of the samples in the target class. In this examples the detector rejects 10% of the banana class:

>> pd=sddetector(a,'banana',sdgauss,'reject',0.1)
  1: apple  -> non-banana
  2: banana -> banana    
  3: stone  -> non-banana
sequential pipeline     2x1 'Gaussian model+Decision'
 1  Gaussian model          2x1  one class, 1 component (sdp_normal)
 2  Decision                1x1  thresholding on banana at op 1 (sdp_decide)    
 >> sdscatter(a,pd)

12.1.2. Detector for the all data ↩

A detector can be build for the complete data set. In the example below the target class is named 'all'. Since this class name is not defined in the data set a, all samples are used as target:

>> pd=sddetector(a,'all',sdgauss,'reject',0.1)
sequential pipeline     2x1 'Gaussian model+Decision'
 1  Gaussian model          2x1  one class, 1 component (sdp_normal)
 2  Decision                1x1  thresholding on all at op 1 (sdp_decide)    
 >> sdscatter(a,pd)

12.1.3. Setting the detector operating point via ROC analysis ↩

sddetector may returns also the ROC object r with a set of alternative operating points. We can then change the operating point directly from the ROC curve plotted with the sddrawroc function.

>> load fruit;
>> [pd,r]=sddetector(a,'apple',sdmixture([],'comp',5,'iter',10))  
  1: apple  -> apple    
  2: banana -> non-apple
  3: stone  -> non-apple
[class 'apple' EM:.......... 5 comp] 
sequential pipeline     2x1 'Mixture of Gaussians+Decision'
 1  Mixture of Gaussians    2x1  one class, 5 components (sdp_normal)
 2  Decision                1x1  thresholding on apple at op 34 (sdp_decide)
ROC (52 thr-based op.points, 3 measures), curop: 34
est: 1:err(apple)=0.05, 2:err(non-apple)=0.00, 3:mean-error=0.03
>> sdscatter(a,pd,'roc',r) 

The scatter plot shows the corresponding boundary of the two-Gaussians model trained for the apple class.

12.1.4. Visualizing detector decisions on the image data ↩

sdimage may visualize decisions of any classifier pipeline trained in the feature space spanned by the image bands. We may also inspect decisions in an image at different operating points.

We save the hand-painted road labels into data2 data set:

>> data2
412160 by 3 sddata, 2 classes: 'unknown'(399848) 'road'(12312) 

Now we may train the road detector. We use a subset of 500 pixels for road sign background classes and train the detector:

>> b=randsubset(data2,500)
1000 by 3 sddata, 2 classes: 'unknown'(500) 'road'(500) 

>> [pd,r]=sddetector(b,'road',sdgauss)
  1: unknown -> non-road
  2: road   -> road
  1: road   -> road
  2: non-road -> non-road
sequential pipeline     3x1 'Gaussian model+Decision'
 1  Gaussian model          3x1  one class, 1 component (sdp_normal)
 2  Decision                1x1  thresholding on road at op 80 (sdp_decide)
ROC (192 thr-based op.points, 3 measures), curop: 80
est: 1:err(road)=0.00, 2:err(non-road)=0.19, 3:mean-error [0.50,0.50]=0.10

We can now visualize detector decisions on another image:

>> im2=imread('roadsign11.bmp');
>> sdimage(im2,pd)
ans =
 3

sdimage may visualize both the decisions and the ROC of the detector:

>> sdimage(im2,pd,'roc',r)
ans =
 1

This allows us to interactively analyze detector decisions at different operating points.

12.2. Discriminants ↩

perClass brings number of tools for designing two- or multi-class discriminants.

12.2.1. Nearest mean classifier ↩

sdnmean implements the nearest mean classifier. It uses normal model with identical covariance matrix for all classes. Its output is a similarity.

>> load fruit
260 by 2 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(60) 
>> p=sdnmean(a)
sequential pipeline     2x3 'Nearest mean'
 1  sdp_normal          2x3  3 classes, 3 components

>> sdscatter(a,sddecide(p))

12.2.2. Linear classifier assuming normal densities ↩

sdlinear is a linear discriminant based on assumption of normal densities. Covariance matrix is averaged using apparent class priors.

>> load fruit;
>> a=a(:,:,[1 2])
260 by 2 sddata, 2 classes: 'apple'(100) 'banana'(100)  

>> p=sdlinear(a)
sequential pipeline     2x2 'Linear discriminant'
 1  Gauss eq.cov.           2x2  2 classes, 2 components (sdp_normal)
 2  Output normalization    2x2  (sdp_norm)

Confusion matrix estimated on the training set at default operating point:

>> sdconfmat(a.lab,a*sddecide(p),'norm')

ans =

 True      | Decisions
 Labels    | apple  banana  | Totals
-------------------------------------
 apple     | 0.890  0.110   | 1.00
 banana    | 0.150  0.850   | 1.00
-------------------------------------

In order to use specific priors, provide them using the priors option:

>> p=sdlinear(a,'priors',[0.8 0.2])
sequential pipeline     2x2 'Linear discriminant'
 1  Gauss eq.cov.           2x2  2 classes, 2 components (sdp_normal)
 2  Output normalization    2x2  (sdp_norm)

>> sdconfmat(a.lab,a*sddecide(p),'norm')

ans =

 True      | Decisions
 Labels    | apple  banana  | Totals
-------------------------------------
 apple     | 0.950  0.050   | 1.00
 banana    | 0.200  0.800   | 1.00
-------------------------------------

Note that by increasing the apple class prior we lower the apple error. However, the banana error will increase accordingly.

12.2.3. Quadratic classifier assuming normal densities ↩

sdquadratic implements quadratic discriminant based on assumption of normal densities. Specific covariance matrix is estimated for each class.

>> p=sdquadratic(a)
sequential pipeline     2x2 'Quadratic discr.'
 1  Gauss full cov.         2x2  2 classes, 2 components (sdp_normal)
 2  Output normalization    2x2  (sdp_norm)
>> sdscatter(a,sddecide(p))

12.2.4. Gaussian mixture models ↩

sdmixture implements training of Gaussian mixture models using EM algorithm. By default, sdmixture estimates the number of components automatically from the data using the approach proposed in (J. Grim, J. Novovicova, P. Pudil, P. Somol, F.J. Ferri, Initializing Normal Mixtures of Densities, Prof.of ICPR 1998.).

>> load fruit
260 by 2 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(60) 
>> p=sdmixture(a)
[class 'apple' initialization:....................... 3 clusters  
EM:.............................. 3 comp] 
[class 'banana' initialization:....................... 3 clusters  
EM:.............................. 3 comp] 
[class 'stone' initialization:....................... 1 cluster  
EM:.............................. 1 comp] 
Mixture of Gaussians pipeline 2x3  3 classes, 7 components (sdp_normal)

>> sdscatter(a,p) %  visualize the output on 1st class ('apple')

We can visualize the mixture decisions by adding default operating point using sddecide:

The number of mixture components may be also specified manually using the comp option. If a scalar number is specified, it is used for each class. Alternatively, we may provide the vector with number of components for each class (in the lab.list order):

>> p2=sdmixture(a,'comp',[4 4 1])
[class 'apple' EM:.............................. 4 comp] 
[class 'banana' EM:.............................. 4 comp] 
[class 'stone' EM:.............................. 1 comp] 
Mixture of Gaussians pipeline 2x3  3 classes, 9 components (sdp_normal)

By default, the EM algorithm is terminated after 30 iterations. The iteration count may be changed using the iter option. If iter is set to [], the EM algorithm stops when the likelihood difference becomes lower that delta (by default 1e-4).

The EM algorithm may be initialized using init option by providing a sdp_normal pipeline. In this example, we estimate the mixture starting from a simple two-component Gaussian model. Note that the provided model initializes each per-class EM algorithm. Therefore, we eventually obtain a six-component mixture model.

>> pinit=sdp_normal(sddata([0 0; 1 1]), {eye(2) eye(2)},[0.5 0.5])
Gauss pipeline          2x1  one class, 2 components (sdp_normal)
>> p3=sdmixture(a,'initmodel',pinit)
[class 'apple' EM:.............................. 2 comp] 
[class 'banana' EM:.............................. 2 comp] 
[class 'stone' EM:.............................. 2 comp] 
Mixture of Gaussians pipeline 2x3  3 classes, 6 components (sdp_normal)

12.2.5. k-NN classifier ↩

k-NN (k-th nearest neighbor) is a non-parametric classifier. This means that instead of building a model from training examples it uses them directly as evidence for computing its output on new observations. The stored training examples are called "prototypes".

By default sdknn implements the first nearest neighbor rule (k=1) using all provided training examples as prototypes:

>> load fruit
260 by 2 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(60) 
>> p=sdknn(a)
1-NN pipeline           2x3  3 classes, 260 prototypes (sdp_1nn) 
>> sdscatter(a,sddecide(p))

Using larger neighborhoods (k>1), the nearest neighbor classifier becomes more robust against noise in the areas of overlap. sdknn implements two approaches:

  • kappa - computes the distance to the k-th neighbor within each class. This is used by default.
  • classfrac - returns the fraction of samples of each class between the k prototypes closest to the new observation.

The kappa method is used by default for any k>1:

>> p1=sdknn(a,'k',10) %  by default 'kappa'
10-NN classifier (dist) pipeline 2x3   (sdp_stack)    
>> sdscatter(a,sddecide(p1))

Note that the decision boundary becomes much more stable between the overlapping classes.

The nearest neighbor computing class fractions is created by setting the method to classfrac:

>> p2=sdknn(a,'k',10,'method','classfrac')
10-NN pipeline          2x3  k=10, 260 prototypes (sdp_knnmc)

Note that explicit specification of 'k' is not necessary:

>> p2=sdknn(a,10,'method','classfrac')
10-NN pipeline          2x3  k=10, 260 prototypes (sdp_knnmc)

Although both kappa and classfrac methods may look similar at the first sight, the different way of computing the per-class output has important practical consequences. The kappa method returns distance while the classfrac the similarity (fraction):

>> getoutput(p1)
ans =
class distance

>> getoutput(p2)
ans =
class similarity

Because classfrac is computing class fractions between k neighbors, it provides a discriminant applicable to two or more classes. This means that it splits the feature space into open sub-spaces. It separates classes but cannot be used to detect one class protecting it from all directions.

The kappa method, on the other hand, computes distance to k-th nearest neighbor for each class separately.

Finally, let us compare the outputs of both k-NN methods on a small test set:

>> ts=sddata(gendatf([3 3 3]))
'Fruit set' 9 by 2 sddata, 3 classes: 'apple'(3) 'banana'(3) 'stone'(3) 

>> out1=ts*p1 %  kappa
'Fruit set' 9 by 2 sddata, 3 classes: 'apple'(3) 'banana'(3) 'stone'(3) 
>> out2=ts*p2 %  classfrac
'Fruit set' 9 by 2 sddata, 3 classes: 'apple'(3) 'banana'(3) 'stone'(3) 

>> [+out1 +out2]

ans =

   1.4888   44.7181  115.8823    0.8462    0.0769    0.0769
   1.9581   26.9135   71.0346    0.8462    0.0769    0.0769
   1.4120   40.8463  109.4723    0.8462    0.0769    0.0769
  48.1386    3.0872    5.4234    0.0769    0.6154    0.3077
  26.9787    6.6342    6.6689    0.0769    0.5385    0.3846
  26.4837    2.0395   14.6985    0.0769    0.6154    0.3077
  40.3134   34.9477    4.5834    0.0769    0.0769    0.8462
  49.8063   20.2838    2.3588    0.0769    0.0769    0.8462
  23.6764   35.0027   12.8724    0.2308    0.0769    0.6923

In the first three columns, we can see distances returned by kappa k-NN. In the last three columns, we get the fractions of classfrac. Note that the later ones attain only handful of fixed values (fractions). This results in sparse ROC plots for the classfrac classifier.

12.2.5.1. Prototype selection ↩

By default sdknn uses all training examples as prototypes. It may, however, also select a smaller subset. Two prototype selection strategies are used:

  • random - selecting a specified number of prototypes per class
  • kcentres - using PRTools kcentres clustering algorithm to systematically select prototypes.

In this example, we randomly select 40 prototypes per class:

>> p=sdknn(a,'k',5,'proto',40)
5-NN classifier (dist) pipeline 2x3   (sdp_stack)

Number of prototypes may be adjusted per class. Simpler classes may be described by smaller prototype sets. This increases execution speed of the k-NN classifier:

>> p=sdknn(a,'k',5,'proto',[40 40 10])
5-NN classifier (dist) pipeline 2x3   (sdp_stack)

To select prototypes using kcentres, use protosel option:

>> p=sdknn(a,'k',5,'proto',40,'protosel','kcentres')
5-NN classifier (dist) pipeline 2x3   (sdp_stack)

k-centres algorithm computes dissimilarity matrix. For large data sets, sdknn uses only randomly drawn 2000 examples to compute k-centres.

12.2.5.2. Using large data sets ↩

sdknn can handle large data sets with tens of thousands of samples. Here we train 10-NN classifier on 100 000 samples and test its execution speed on 1000 test objects:

>> a=sddata(gendatf(100000))
'Fruit set' 100000 by 2 sddata, 3 classes: 'apple'(33333) 'banana'(33333) 'stone'(33334)

>> p=sdknn(a,'k',10)
10-NN classifier (dist) pipeline 2x3   (sdp_stack)

>> ts=sddata(gendatf(1000))
'Fruit set' 100000 by 2 sddata, 3 classes: 'apple'(333) 'banana'(333) 'stone'(334)

>> tic; out=ts*p; toc
Elapsed time is 1.338342 seconds. % using 2.33 GHz laptop

12.2.6. Parzen classifier ↩

Parzen classifier estimates probability density for each class using a non-parametric approach based on stored training examples. When computing output for a new observation, the contribution of each training example is integrated. The contribution is modeled by a kernel function and is influenced by the smoothing parameter (kernel width).

By default, sdparzen trains a Parzen classifier with Laplace kernel function which is less computationally demanding than frequently-adopted Gaussian kernel. Smoothing parameter is optimized using EM algorithm optimizing cross-validated log-likelihood.

>> load fruit
260 by 2 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(60) 

>> p=sdparzen(a)
.....
Parzen pipeline         2x3  3 classes, 260 prototypes (sdp_parzen)    
>> sdscatter(a,sddecide(p))

The default Parzen classifier uses scalar smoothing i.e. equal kernel width for each feature. Smoothing parameter is stored in h field of parzen pipeline:

>> p{1}

ans = 

kernel: 'laplace'
     h: 0.6482
 proto: [260x2 double]
 prior: [0.3846 0.3846 0.2308]

Smoothing may be fixed manually. Here we use very small kernel with:

>> p=sdparzen(a,'h',0.04)
Parzen pipeline         2x3  3 classes, 260 prototypes (sdp_parzen)    
>> sdscatter(a,sddecide(p))

Note that the decision boundary becomes very complicated emphasizing very local changes of the class distributions.

Smoothing parameter may be also estimated for each dimension, specifying h as vector:

>> p=sdparzen(a,'h','vector')
.........
Parzen pipeline         2x3  3 classes, 260 prototypes (sdp_parzen)

>> p{1}

ans = 

kernel: 'laplace'
     h: [0.5782 0.7266]
 proto: [260x2 double]
 prior: [0.3846 0.3846 0.2308]

Vector smoothing requires extra multiplication for each dimension of each training sample. Alternative strategy is to scale the data to unit variance so that scalar smoothing is sufficient.

12.2.7. Neural network ↩

Feed-forward neural network may be trained using sdneural function. By default, 10 hidden units is used and the optimization runs for 1000 iterations (epochs):

>> load fruit
260 by 2 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(60) 
>> p=sdneural(a)
epochs (*100):..........
Neural network pipeline 2x3   (sdp_neural)

>> sdscatter(a,sddecide(p))

The provided data set is split into training and validation subsets (75%/25% by default). The training subset is used in optimization and the validation subset to estimate the generalization error. The validation fraction may be changed using valfrac option. Eventually, sdneural returns the network with lowest mean square error (MSE) on the validation set. Thanks to this approach the sdneural does not overfit training data when trained for a large number of epochs.

12.2.8. Naive Bayes classifier ↩

Naive Bayes classifier is implemented by the sdnbayes function. For each feature, it estimates a class-conditional distribution using a histogram. Assuming independence of features, the per-class output is computed as a product of per-feature class conditional densities.

sdnbayes estimates the number of histogram bins from the data using non-parametric density estimation approach.

>> a=sddata(gendatf(3000))
3000 by 2 sddata, 3 classes: 'apple'(1000) 'banana'(1000) 'stone'(1000)     
>> p=sdnbayes(a)
 class 'apple':.. class 'banana':.. class 'stone':..
Naive Bayes pipeline    2x3   (sdp_nbayes) 
>> sdscatter(a,sddecide(p))

The number of histogram bins may be fixed manually using the bins option:

>> p=sdnbayes(a,'bins',20)
  class 'apple': class 'banana': class 'stone':
Naive Bayes pipeline    2x3   (sdp_nbayes)

12.2.9. Decision tree classifier ↩

perClass sdtree classifier offers a fast and scalable decision tree implementation. Decision tree is build by finding the best threshold on one of the features that improves class separatation. The process is applied recursively until the stopping condition is met. If the tree is fully built, each data sample ends in a separate terminal node. This solution, however, does not yield good generalization in case of class overlap. Therefore, sdtree applies pruning strategy that stops tree growth at the earlier stage improving the generalization performance. The pruning uses a separate validation set to estimate tree generalization error.

To illustrate the basic use of the decision tree, lets consider the fruit_large data set.

>> load fruit_large
>> a
'Fruit set' 2000 by 2 sddata, 3 classes: 'apple'(667) 'banana'(667) 'stone'(666) 

We split the data to training and test subsets. The test subset will be used later to estimate tree performance.

>> [tr,ts]=randsubset(a,0.5)
'Fruit set' 999 by 2 sddata, 3 classes: 'apple'(333) 'banana'(333) 'stone'(333) 
'Fruit set' 1001 by 2 sddata, 3 classes: 'apple'(334) 'banana'(334) 'stone'(333) 

We train the tree on the training subset tr. By default, the data set passed to sdtree is split internally into two subsets. The first part is used to grow the tree and the second part to limit its growth by identifying the sufficient number of thresholds. This process happens inside sdtree. We will see later, how to take closer control over the pruning process.

>> p=sdtree(tr)
Decision tree pipeline  2x2   (sdp_tree)

>> pd=sddecide(p)  %  we add the default operating point
sequential pipeline     2x1 'Decision tree+Decision'
 1  Decision tree           2x2  (sdp_tree)
 2  Decision                2x1  weighting, 2 classes, 1 ops at op 1 (sdp_decide)

Let's visualize the decisions at the default operating point on the test set:

>> sdscatter(ts,pd)

Default decision tree trained using sdtree.

We now estimate the mean test error using sdtest on the test subset:

>> sdtest(ts,pd)

ans =

0.0730

12.2.9.1. Growing tree without pruning ↩

We may suppress the pruning process and grow the full tree, using the 'full' option.

>> p2=sdtree(tr,'full')
Decision tree pipeline  2x3   (sdp_tree)

In order to see details of the trained decision tree, use the unary plus operator:

>> +p2

ans = 

     table: [195x7 double]
thresholds: 97

We may compare the fully grown tree with the tree built using the default pruning strategy:

>> +p

ans = 

     table: [49x7 double]
thresholds: 24

>> sdscatter(ts,sddecide(p2))

Fully-grown decision tree trained using sdtree.

Notice, the overfitting appearing in the area of overlap. The fully-grown tree solves perfectly separates the training set but fails to do so on the independent test set.

12.2.9.2. Controlling the tree pruning ↩

We may control the sdtree pruning algorithm in several ways. Firstly, we may specify the fraction of the input data set used for training the tree using the 'trfrac' option. By default, 80% of input data is used to grow the tree and 20% to stop the growth process.

>> p=sdtree(tr,'trfrac',0.5)
Decision tree pipeline  2x3   (sdp_tree)

Secondly, we may remove the random splitting step by supplyin the separate validation data used for pruning.

>> [val,tr2]=randsubset(tr,100)
'Fruit set' 300 by 2 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(100) 
'Fruit set' 699 by 2 sddata, 3 classes: 'apple'(233) 'banana'(233) 'stone'(233) 

We train on the tr2 part and pass the validation set val separately using the 'test' option:

>> p=sdtree(tr2,'test',val)
Decision tree pipeline  2x3   (sdp_tree)
>> +p

ans = 

     table: [31x7 double]
thresholds: 15

This removes the random element from the sdtree training. Repeating the training, we obtain identical tree.

>> p=sdtree(tr2,'test',val)
Decision tree pipeline  2x3   (sdp_tree)
>> +p

ans = 

     table: [31x7 double]
thresholds: 15

Thirdly, the number of tree levels (thresholds) may be also specified using the 'levels' option during the tree trainig:

>> p=sdtree(tr,'levels',10)
Decision tree pipeline  2x3   (sdp_tree)
>> +p

ans = 

     table: [21x7 double]
thresholds: 10

We may also inspect the pruning process in detail. The second output parameter, returned by sdtree, is a structure containing the full tree and the error criterion estimated from the validation set at each tree threshold.

>> [p,res]=sdtree(tr2,'test',val)
Decision tree pipeline  2x3   (sdp_tree)

res = 

full_tree: [2x3 sdppl]
      err: [50x1 int32]

>> figure; plot(res.err)

Error of the decision tree on the validation set.

The res.full_tree field contains the fully grown tree. To be more precise, it contains the tree grown as much as possible depending on the 'maxsamples' option. By default, at least 10 samples much be present in a node to continue growing process.

To prune the tree manually, we may select a specific number of thresholds by passing tree pipeline to the sdtree function. Here, we select 10 thresholds:

>> p3=sdtree(p,10)
Decision tree pipeline  2x3   (sdp_tree)

12.2.10. Random forest classifier ↩

Random forest classifier sdrandforest combines a large number of decision trees. Each tree is built in a special way considering only randomly selected subset of features at each tree node. The combination of their outputs is based on the sum rule.

By default, sdrandforest builds 20 trees and considers a subset of 20% of features at each node. let us train a random forest classifier on the medical data set. We use data of the first five patients as a test set and remaining data as a training set:

>> load medical
>> a=a(:,:,1:2)
'medical D/ND' 5762 by 11 sddata, 2 classes: 'disease'(1495) 'no-disease'(4267) 
>> [ts,tr]=subset(a,'patient',1:5)
'medical D/ND' 1887 by 11 sddata, 2 classes: 'disease'(597) 'no-disease'(1290) 
'medical D/ND' 3875 by 11 sddata, 2 classes: 'disease'(898) 'no-disease'(2977) 

>> p=sdrandforest(tr)
..........
Random forest pipeline  11x2   (sdp_randforest)
>> sdtest(ts,sddecide(p))

ans =

0.2884

sdrandforest offers two user-adjustable parameters: the number of trees and the number of randomly-selected dimensions.

We may provide the number of trees directly as the second parameter:

>> p=sdrandforest(tr,200)
..........
Random forest pipeline  11x2   (sdp_randforest)
>> sdtest(ts,sddecide(p))

ans =

0.2865

To adjust the number of dimensions that are randomly selected at each tree node during training, use the 'dim' option. Here we combine it with the request for the 200 trees:

>> p=sdrandforest(tr,'dim',0.5,'count',200)
..........
Random forest pipeline  11x2   (sdp_randforest)
>> sdtest(ts,sddecide(p))

ans =

0.2798

12.2.11. Support Vector Machine ↩

perClass sdsvc command trains a support vector machine classifier using libSVM library. sdsvc supports linear, RBF and polynomial kernels. The trained support vector machines are executed through the perClass runtime library.

By default, RBF kernel is used with sigma and C parameters selected automatically based on grid search. The mean error on a validation set (25% of the training set is optimized during the grid search.

>> b
'Fruit set' 200 by 2 sddata, 2 classes: 'apple'(100) 'banana'(100) 

>> p=sdsvc(b)
....................sigma=1.85511 C=54.6 err=0.000 SVs=15
sequential pipeline     2x1 'Scaling+Support Vector Machine'
 1  Scaling                 2x2  (sdp_affine)
 2  Support Vector Machine    2x1  (sdp_svc)

sdsvc displays the sigma and C parameters together with the error on validation set and number of support vectors. Keep in mind that the number of support vectors is important indicator of well-trained support vector machine. Large number of SVs may mean that we use wrong sigma or C or that the libsvm optimizer did not find good solution.

Note that sdsvc performs scaling of input data by default. This may be switched off using the 'noscale' option.

The two-class support vector machine yields one soft output which needs to be thresholded in order to make a decision. We may add a default operating point with zero threshold using sddecide:

>> pd=sddecide(p)
sequential pipeline     2x1 'Scaling+Support Vector Machine+Decision'
 1  Scaling                 2x2  (sdp_affine)
 2  Support Vector Machine    2x1  (sdp_svc)
 3  Decision                1x1  thresholding on apple at op 1 (sdp_decide)

>> sdscatter(b,pd)

12.2.11.1. Linear support vector machine ↩

Linear support vector machine is trained by specifying the kernel type 'linear':

>> p=sdsvc(b,'type','linear')
....................C=0.0785 err=0.160 SVs=52
Support Vector Machine (linear) pipeline 2x1   (sdp_svc)

By default, sdsvc performs search for the C parameter. The list of C parameters may be also specified explicitly:

>> p=sdsvc(b,'type','linear','C',[1 2 5 10 100])
.....C=1 err=0.100 SVs=49
Support Vector Machine (linear) pipeline 2x1   (sdp_svc)

Note that the explicit definition of 'type' is not required:

>> p=sdsvc(b,'linear','C',[1 2 5 10 100])
.....C=1 err=0.100 SVs=49
Support Vector Machine (linear) pipeline 2x1   (sdp_svc)  

12.2.11.2. Polynomial support vector machine ↩

Polynomial SVM optimizes degree and C parameters using a grid search.

>> p=sdsvc(b,'type','poly')
.....degree=5.00000 C=1e+03 err=0.060 SVs=16
sequential pipeline     2x1 'standardization+Support Vector Machine (poly)'
 1  standardization         2x2  (sdp_affine)
 2  Support Vector Machine (poly)    2x1  (sdp_svc)

Polynomial support vector machine

By default, the scaling is applied to the data (may be switched off using 'noscale' option.

Both degree and C parameters may be specified explicitly:

>> p=sdsvc(b,'type','poly','degree',[2 3 4 5],'C',[0.1 1 2 4 10 30 50 80])
....degree=3.00000 C=10 err=0.040 SVs=15
sequential pipeline     2x1 'standardization+Support Vector Machine (poly)'
 1  standardization         2x2  (sdp_affine)
 2  Support Vector Machine (poly)    2x1  (sdp_svc)

12.2.11.3. Grid search for sigma and C parameters ↩

sdsvc allows us to specify sigma and C parameters explicitly:

>> p=sdsvc(b,'sigma',1.5,'C',10)
SVs=21
sequential pipeline     2x1 'Scaling+Support Vector Machine'
 1  Scaling                 2x2  (sdp_affine)
 2  Support Vector Machine    2x1  (sdp_svc)

We may also specify the vector of sigmas and Cs. Second output of sdsvc is a structure with estimated errors and numbers of support vectors:

>> [p,E]=sdsvc(b,'sigma',0.1:0.1:5,'C',[0.01 0.1 1 3 5 10 20])
..................................................sigma=1.10000 C=20 err=0.020 SVs=21
sequential pipeline     2x1 'Scaling+Support Vector Machine'
 1  Scaling                 2x2  (sdp_affine)
 2  Support Vector Machine    2x1  (sdp_svc)

E = 

sigmas: [1x50 double]
    Cs: [0.0100 0.1000 1 3 5 10 20]
   err: [50x7 double]
   svs: [50x7 double]

We may visualize the errors and number of suport vectors in a 3D surface plot:

>> figure; surf(E.Cs,E.sigmas,E.err)
>> xlabel('C'); ylabel('sigma'); zlabel('error')
>> figure; surf(E.Cs,E.sigmas,E.svs)
>> xlabel('C'); ylabel('sigma'); zlabel('SVs')

12.2.11.4. Multi-class support vector machines (RBF) ↩

For multi-class data sets, sdsvc trains one support vector machine per class in one-against-all manner optimizing the parameters specifically for each sub-problem.

>> a
'Fruit set' 260 by 2 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(60) 

>> p=sdsvc(a)
one-against-all: ['apple' ....................sigma=0.90884 C=0.695 err=0.062 SVs=58] 
['banana' ....................sigma=0.98847 C=0.336 err=0.113 SVs=96] 
['stone' ....................sigma=1.37335 C=1.44 err=0.043 SVs=45] 
sequential pipeline     2x3 'Scaling+Support Vector Machine+Output normalization'
 1  Scaling                 2x2  (sdp_affine)
 2  Support Vector Machine    2x3  (sdp_stack)
 3  Output normalization    3x3  (sdp_norm)

>> pd=sddecide(p)
sequential pipeline     2x1 'Scaling+Support Vector Machine+Output normalization+Decision'
 1  Scaling                 2x2  (sdp_affine)
 2  Support Vector Machine    2x3  (sdp_stack)
 3  Output normalization    3x3  (sdp_norm)
 4  Decision                3x1  weighting, 3 classes, 1 ops at op 1 (sdp_decide)

>> sdscatter(a,pd)

Grid-search estimated errors in multi-class problems are returned in a cell array with one E structure per class:

>> [p,E]=sdsvc(a,'sigma',0.1:0.1:5,'C',[0.01 0.1 1 3 5 10 20])
one-against-all: ['apple' ..................................................sigma=0.80000 C=1 err=0.050 SVs=51] 
['banana' ..................................................sigma=0.80000 C=0.1 err=0.125 SVs=150] 
['stone' ..................................................sigma=1.10000 C=10 err=0.010 SVs=33] 
sequential pipeline     2x3 'Scaling+Support Vector Machine+Output normalization'
 1  Scaling                 2x2  (sdp_affine)
 2  Support Vector Machine    2x3  (sdp_stack)
 3  Output normalization    3x3  (sdp_norm)

E = 

[1x1 struct]    [1x1 struct]    [1x1 struct]

>> E{2}

ans = 

sigmas: [1x50 double]
    Cs: [0.0100 0.1000 1 3 5 10 20]
   err: [50x7 double]
   svs: [50x7 double]

12.2.11.5. Accessing support vectors ↩

Support vector objects are stored in the proto sddata set inside the trained sdsvc pipeline.

>> b
'Fruit set' 200 by 2 sddata, 2 classes: 'apple'(100) 'banana'(100) 

>> p2=sdsvc(b)
....................sigma=0.92990 C=26.4 err=0.000 SVs=19
sequential pipeline     2x1 'standardization+Support Vector Machine'
 1  standardization         2x2  (sdp_affine)
 2  Support Vector Machine    2x1  (sdp_svc)

In case of this two-class SVC, we look for support vectors in the second step of the resulting pipeline (the first step performs data scaling).

>> p2{2}

ans = 

   type: 'rbf'
    par: 0.9299
  proto: [19x2x2 sddata]
weights: [19x1 double]
 offset: -0.0757
   mean: [0 0]

>> p2{2}.proto
19 by 2 sddata, 2 classes: 'apple'(9) 'banana'(10) 

Multi-class SVC contains for each one-against-others problem one SVC model in the stacked pipeline.

>> a
'Fruit set' 260 by 2 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(60) 

>> p=sdsvc(a)
one-against-all: ['apple' ....................sigma=0.97084 C=26.4 err=0.025 SVs=21] 
['banana' ....................sigma=0.93841 C=0.336 err=0.050 SVs=96] 
['stone' ....................sigma=0.94301 C=12.7 err=0.000 SVs=36] 
sequential pipeline     2x3 'standardization+Support Vector Machine+Output normalization'
 1  standardization         2x2  (sdp_affine)
 2  Support Vector Machine    2x3  (sdp_stack)
 3  Output normalization    3x3  (sdp_norm)

>> p{2}
stacked pipeline        2x3 '++'
 1  Support Vector Machine    2x1  (pipeline)
 2  Support Vector Machine    2x1  (pipeline)
 3  Support Vector Machine    2x1  (pipeline)

>> p{2}{1}
Support Vector Machine pipeline 2x1   (sdp_svc)

>> p{2}{1}{1}

ans = 

   type: 'rbf'
    par: 0.9708
  proto: [21x2x2 sddata]
weights: [21x1 double]
 offset: 0.6474
   mean: [0 0]

>> p{2}{1}{1}.proto
21 by 2 sddata, 2 classes: 'apple'(8) 'others'(13) 

In order to access indices of the support vectors in the original data set, use the original property of the proto data set.

>> proto=p{2}{1}{1}.proto
21 by 2 sddata, 2 classes: 'apple'(8) 'others'(13) 

>> proto'
21 by 2 sddata, 2 classes: 'apple'(8) 'others'(13) 
sample props: 'lab'->'class' 'class'(L) 'original'(N)
feature props: 'featlab'->'featname' 'featname'(L)
data props:  'data'(N)

>> proto(1:3).original

ans =

14
21
22

The first three support vectors are:

>> +proto(1:3)

ans =

0.5759   -0.6914
1.0284    1.1254
1.1145    0.1547

Support vectors in the original (unscaled) feature space are:

>> +b( proto(1:3).original )

ans =

0.1362   -3.2968
2.2507    4.8269
2.6530    0.4864

Scaling with the first pipeline step, we receive identical numbers:

>> +b( proto(1:3).original ) * p(1)

ans =

0.5759   -0.6914
1.0284    1.1254
1.1145    0.1547

12.3. Classifier combining ↩

perClass supports both major approaches to classifier combining, namely fixed and trained combiners. Fixed combiners are rules chosen on the basis of our assumptions on the problem. For example, if we know that classifiers trained on different data representations are independent, we might choose product rule. The second type of classifier combiner is trained. Instead of assuming fusion strategy, we learn it from examples. Trained combiners simply use per-class outputs of base classifiers as new features.

12.3.1. Fixed combiners ↩

In order to build fixed combiner, we need to train the base classifiers, stack them into a single pipeline and add the fixed combiner stem.

>> load fruit
260 by 2 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(60) 
>> p1=sdquadratic(a)
sequential pipeline     2x3 'Quadratic discr.'
 1  Gauss full cov.         2x3  3 classes, 3 components (sdp_normal)
 2  Output normalization    3x3  (sdp_norm)
>> p2=sdparzen(a)
....
Parzen pipeline         2x3  3 classes, 260 prototypes (sdp_parzen)

>> P=sdp_stack({p1,p2})
stack pipeline          2x6   (sdp_stack)

Note that the stacked pipeline has two inputs (2D feature space) and 6 outputs corresponding to 3 per-class outputs for the p1 and 3 per-class outputs for p2.

Fixed combiner using mean rule is constructed by supplying also the number of classes, number of classifiers and class names:

>> pcomb=sdp_combine('mean',3,2,a.lab)
sequential pipeline     6x3 ''
 1  sdp_combine         6x3  Fixed combiner

Let us now compose the full pipeline delivering decisions of the fixed combiner at default operating point:

>> Pfull=sddecide(P*pcomb)
sequential pipeline     2x1 ''
 1  sdp_stack           2x6  
 2  sdp_combine         6x3  Fixed combiner
 3  sdp_decide          3x1  Weight-based decision (3 classes)
>> sdscatter(a,Pfull)

12.4. Hierarchical classifiers ↩

perClass provides tools for designing hierarchies of classifiers, based on apriori decision-level rules. Let us, for example, consider a problem separating 'apple', 'bananas' and 'lemons' on the conveyor belt. We want to build a detector separating all fruit examples from outliers on the conveyor belt (rocks, dirt, mice,...). Only the samples classified as "fruit" will pass on to the next stage where different types of fruit get separated.

% preparing simple artificial data set:
>> rand('state',42)
>> data=[gauss(1000,[-2 3],[1 0.4; 0.1 1]); gauss(1000,[0 2],[1 0; 0 1]); gauss(1000,[1 4],[2 0.4; 0.1 2]); gauss(100, [-3 4],[6 0; 0 6])];
>> lab=sdlab({'apple','banana','lemon','outlier'},[1000 1000 1000 100])
sdlab with 3100 entries, 4 groups: 'apple'(1000) 'banana'(1000) 'lemon'(1000) 'outlier'(100) 
>> a=sddata(+data,lab)
3100 by 2 sddata, 4 classes: 'apple'(1000) 'banana'(1000) 'lemon'(1000) 'outlier'(100) 

We will split the data into training and testing parts:

>> [tr,ts]=randsubset(a,0.5)
1550 by 2 sddata, 4 classes: 'apple'(500) 'banana'(500) 'lemon'(500) 'outlier'(50) 
1550 by 2 sddata, 4 classes: 'apple'(500) 'banana'(500) 'lemon'(500) 'outlier'(50) 

To train the detector, we need to re-label the data so it reflects our fruit/outlier problem:

>> tr_det=sdrelab(tr,{'~outlier','fruit'})
1: apple   -> fruit  
2: banana  -> fruit  
3: lemon   -> fruit  
4: outlier -> outlier
1550 by 2 sddata, 2 classes: 'outlier'(50) 'fruit'(1500) 

Now, we train a detector using a Gaussian mixture with three components. We fix the operating point using the reject option rejecting 1% of fruit samples:

>> pd=sddetector(tr_det,'fruit',sdmixture([],'n',3,'iter',20),'reject',0.01)    
  1: outlier -> non-fruit
  2: fruit  -> fruit    
[class 'fruit' EM:.................... 3 comp] 
sequential pipeline     2x1 'Mixture of Gaussians+Decision'
 1  Mixture of Gaussians    2x1  one class, 3 components (sdp_normal)
 2  Decision                1x1  thresholding on fruit at op 1 (sdp_decide)

Our detector may be now tested. We will print the confusion matrix:

>> sdconfmat(ts.lab,ts*pd)

ans =

 True      | Decisions
 Labels    | fruit  non-fr  | Totals
-------------------------------------
 apple     |   494      6   |   500
 banana    |   494      6   |   500
 lemon     |   489     11   |   500
 outlier   |    25     25   |    50
-------------------------------------
 Totals    |  1502     48   |  1550

In the second step, we want to train the fruit discriminant:

>> tr_clf=tr(:,:,{'apple','banana','lemon'})
1500 by 2 sddata, 3 classes: 'apple'(500) 'banana'(500) 'lemon'(500) 

>> p=sdgauss(tr_clf)
Gaussian model pipeline 2x3  3 classes, 3 components (sdp_normal)

We use a default operating point in this example:

>> pclf=sddecide(p)
sequential pipeline     2x1 'Gaussian model+Decision'
 1  Gaussian model          2x3  3 classes, 3 components (sdp_normal)
 2  Decision                3x1  weighting, 3 classes, 1 ops at op 

With both stages trained, we can construct the classifier cascade using the sdcascade command. It takes the detector pipeline, the name of the decision to "pass through" and the discriminant pipeline:

>> pc=sdcascade(pd,'fruit',pclf)
2-stage cascade pipeline 2x1   (sdp_cascade)

Let's visualize the decisions on the test set and the confusion matrix of the two-stage system:

>> sdscatter(ts,pc)
>> sdconfmat(ts.lab,ts*pc)

ans =

 True      | Decisions
 Labels    | non-fr  apple banana  lemon  | Totals
---------------------------------------------------
 apple     |     2    425     38     35   |   500
 banana    |     5     57    375     63   |   500
 lemon     |    20     57     41    382   |   500
 outlier   |    30     16      1      3   |    50
---------------------------------------------------
 Totals    |    57    555    455    483   |  1550