perClass Documentation
version 3.4 (9-Oct-2012)

Chapter 3: Getting Started

This chapter provides a simple example of training a classifier in perClass and deploying it outside of Matlab in a custom application.

3.1. Loading data

Let us start by loading the Fruit data set:

>> load fruit.mat 
>> a
'Fruit set' 260 by 2 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(60) 

The 'apple' and 'banana' classes represent genuine fruit to be processed on the conveyor belt. The samples labeled as 'stone' are the outliers that should be rejected. The data set a contains 260 samples, each represented by two features.

The data set object is a data matrix augmented with meta-data information such as sample labels or feature names. The samples are stored as data rows and features as columns.

We will visualize the scatter plot of the Fruit data set using

>> sdscatter(a)

We will now split our available data set into training and test subsets. We will use 50% of data for training the classifier and the rest for estimating its performance:

>> [tr,ts]=randsubset(a,0.5)
'Fruit set' 130 by 2 sddata, 3 classes: 'apple'(50) 'banana'(50) 'stone'(30) 
'Fruit set' 130 by 2 sddata, 3 classes: 'apple'(50) 'banana'(50) 'stone'(30) 
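
The per-class counts above are consistent with a split drawn separately within each class. As a rough illustration of that idea outside perClass, here is a minimal sketch in plain Python. The function name `stratified_split` and the fixed seed are assumptions made for this example and are not part of perClass; how randsubset works internally is not covered here.

```python
import random
from collections import defaultdict

def stratified_split(labels, fraction, seed=0):
    """Return (train_idx, test_idx), drawing `fraction` of each class for training."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, lab in enumerate(labels):
        by_class[lab].append(i)
    train, test = [], []
    for idx in by_class.values():
        rng.shuffle(idx)                  # random order within the class
        k = round(fraction * len(idx))    # number of training samples for this class
        train.extend(idx[:k])
        test.extend(idx[k:])
    return sorted(train), sorted(test)

# Class sizes taken from the Fruit set summary above.
labels = ['apple'] * 100 + ['banana'] * 100 + ['stone'] * 60
train_idx, test_idx = stratified_split(labels, 0.5)
```

Both index lists contain 130 entries, and each class contributes half of its samples to each subset.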

3.2. Training a fruit classifier

We will now train a classifier discriminating between the fruit classes. First, let's extract the subset with samples labeled as 'apple' and 'banana'. In perClass, the third dimension represents classes. Therefore, we may extract a subset simply by listing class names:

>> tr2=tr(:,:,{'apple','banana'})
'Fruit set' 100 by 2 sddata, 2 classes: 'apple'(50) 'banana'(50) 

In order to capture the specific shape of the class distributions, we use the Gaussian mixture model:

>> p=sdmixture(tr2)
[class 'apple' initialization:...................... 2 clusters  EM:.....................
......... 2 comp] [class 'banana' initialization:...................... 2 clusters EM:...
........................... 2 comp] 
Mixture of Gaussians pipeline 2x2  2 classes, 4 components (sdp_normal)

The output of the sdmixture function is a trained pipeline. Pipelines in perClass represent components of a pattern recognition system.

Note that we have not specified the number of mixture components. By default, sdmixture estimates the number of components in each class from the data automatically, using the EM algorithm and a non-parametric density estimation approach.

3.3. Executing the classifier on new data

The pipeline may be executed on new data using the multiplication operator:

>> out=tr*p
'Fruit set' 130 by 2 sddata, 3 classes: 'apple'(50) 'banana'(50) 'stone'(30) 

When executed on new data, the pipeline p returns estimates of the probability density for each of the two trained classes. We may display the content of the resulting data set for the first five samples using:

>> +out(1:5)

ans =

0.0157    0.0002
0.0085    0.0000
0.0140    0.0000
0.0012    0.0001
0.0021    0.0001

The unary plus operator is just a convenient shortcut for sddata conversion to double (double(out)).

We may visualize this soft output of the pipeline p using sdscatter:

>> sdscatter(ts,p)

The scatter plot backdrop now shows the estimated class conditional density computed in a dense grid over the feature space. You may use up and down cursor keys to flip between the soft outputs of both classes.

3.4. Performing crisp decisions

To move from the soft output to decisions, we need to add a decision step to our pipeline. We can do that using the sddecide function:

>> pd=sddecide(p)
sequential pipeline     2x1 'Mixture of Gaussians+Decision'
 1  Mixture of Gaussians    2x2  2 classes, 4 components (sdp_normal)
 2  Decision                2x1  weighting, 2 classes, 1 ops at op 1 (sdp_decide)

It adds a default operating point that assigns each sample to the class with the maximum conditional probability density (the maximum a posteriori probability rule with equal class priors).
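
The maximum-density rule can be sketched in a few lines of plain Python. This is only an illustration of the rule itself, not of how sddecide is implemented. The helper name `decide` is hypothetical, and the column order (apple first, banana second) is an assumption.

```python
def decide(densities, class_names):
    """Pick the class with the largest density for each sample (equal priors)."""
    decisions = []
    for row in densities:
        best = max(range(len(row)), key=lambda j: row[j])  # index of the largest density
        decisions.append(class_names[best])
    return decisions

# Soft outputs of the five samples displayed in section 3.3.
# Assumption: the first column belongs to 'apple', the second to 'banana'.
soft = [
    [0.0157, 0.0002],
    [0.0085, 0.0000],
    [0.0140, 0.0000],
    [0.0012, 0.0001],
    [0.0021, 0.0001],
]
decisions = decide(soft, ['apple', 'banana'])
```

Under that column-order assumption, all five samples are assigned to 'apple', because the first density is the larger one in every row.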

When applying the resulting pipeline pd to any data or matrix of doubles with 2 columns, we obtain crisp decisions:

>> dec=ts*pd
sdlab with 130 entries, 2 groups: 'apple'(54) 'banana'(76) 

Both labels and decisions in perClass are stored in sdlab objects. We may access true labels in the test set using:

>> ts.lab
sdlab with 130 entries, 3 groups: 'apple'(50) 'banana'(50) 'stone'(30) 

A confusion matrix helps us compare the ground-truth labels with the decisions:

>> sdconfmat(ts.lab,dec)

ans =

 True      | Decisions
 Labels    |  apple banana  | Totals
-------------------------------------
 apple     |    49      1   |    50
 banana    |     0     50   |    50
 stone     |     5     25   |    30
-------------------------------------
 Totals    |    54     76   |   130
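
The counts in such a table are simply the number of times each true label co-occurs with each decision. A minimal sketch of that bookkeeping in plain Python follows. The function name `confusion_matrix` and the toy labels are made up for illustration and are not perClass output.

```python
def confusion_matrix(true_labels, decisions):
    """Count how often each true label receives each decision."""
    counts = {}
    for t, d in zip(true_labels, decisions):
        row = counts.setdefault(t, {})
        row[d] = row.get(d, 0) + 1
    return counts

# Toy labels, made up for illustration only.
truth = ['apple', 'apple', 'banana', 'stone', 'stone']
dec = ['apple', 'banana', 'banana', 'banana', 'apple']
cm = confusion_matrix(truth, dec)
```

Because the discriminant never outputs 'stone', every sample in the 'stone' row necessarily lands in one of the fruit columns.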

We may see that, on our test set, most of the stones get labeled as bananas. To understand why, let's visualize the classifier decisions in the feature space:

>> sdscatter(ts,pd)

Our mixture model is a discriminant. This means that it labels each incoming observation as either 'apple' or 'banana'. Although this works well in the neighborhood of our training observations, the discriminant's decision may become meaningless far away from them. In production, we may encounter observations that are not reliably represented by our training data. In such a situation, we may want to avoid making decisions based only on the discrimination scheme. We can accomplish this in two ways: either by adding a reject option to our discriminant or by training a separate detector for genuine fruit examples. In this chapter, we illustrate the latter approach. See Chapter 9 for an explanation of how to add a reject option to a classifier trained in perClass.

3.5. Building a fruit detector

A detector is a statistical model with a thresholded soft output. Training a detector in perClass may be done in a single step using the sddetector function. We provide it with the data set, the name of the class to be modelled, and the untrained model:

>> pd2=sddetector(tr2,'fruit',sdparzen,'reject',0.01)
...sequential pipeline     2x1 'Parzen+Decision'
 1  Parzen                  2x1  one class, 100 prototypes (sdp_parzen)
 2  Decision                1x1  thresholding on fruit at op 1 (sdp_decide)

We train our detector on all examples in the tr2 data set. We specify the 'fruit' target class. Because 'fruit' is not present in the tr2 set, sddetector trains the model on all data. We adopt the non-parametric Parzen density estimator.

Finally, because we train the model on all available data, we need to specify how to actually set the detector threshold. Here, we set it by rejecting 1% of the data (one-class approach).
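
To make the one-class idea concrete, here is a minimal sketch in plain Python: a Parzen density estimate with Gaussian kernels, and a threshold chosen so that a given fraction of the training samples falls below it. It is not the perClass implementation. The function names, the synthetic training data, and the hand-picked kernel width `h` are all assumptions made for illustration.

```python
import math
import random

def parzen_density(x, prototypes, h):
    """Average of isotropic 2-D Gaussian kernels of width h centred on the prototypes."""
    norm = 1.0 / (2.0 * math.pi * h * h)
    total = 0.0
    for p in prototypes:
        d2 = (x[0] - p[0]) ** 2 + (x[1] - p[1]) ** 2
        total += math.exp(-0.5 * d2 / (h * h))
    return norm * total / len(prototypes)

def reject_threshold(train, h, reject_fraction):
    """Density value below which `reject_fraction` of the training samples fall."""
    dens = sorted(parzen_density(x, train, h) for x in train)
    k = round(reject_fraction * len(dens))  # number of training samples to reject
    return dens[k]

# Synthetic stand-in for the training set (assumption: 100 two-feature samples).
rng = random.Random(0)
train = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(100)]
h = 0.5                                     # kernel width, fixed by hand here
thr = reject_threshold(train, h, 0.01)

def is_fruit(x):
    """Accept a sample when its estimated density reaches the threshold."""
    return parzen_density(x, train, h) >= thr
```

With 100 training samples and a reject fraction of 0.01, exactly one training sample falls below the threshold, and a point far from the training data, such as (10, 10), is rejected.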

The detector pd2 returns two decisions, namely 'fruit' and 'non-fruit'. We visualize its outputs using sdscatter:

>> sdscatter(ts,pd2)

3.6. Creating a detector/classifier cascade

We will now create a two-stage classifier composed of the 'fruit' detector and the 'apple'/'banana' discriminant. The detector will be executed on all input samples. Only the samples accepted by the detector as 'fruit' will be passed to the second stage, which performs the 'apple'/'banana' discrimination.

>> pc=sdcascade(pd2,'fruit',pd)
2-stage cascade pipeline 2x1   (sdp_cascade)
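
The control flow of such a two-stage cascade can be sketched in plain Python as follows. This only illustrates the logic described above, not the sdcascade implementation. The function `cascade` and the two toy stages are hypothetical stand-ins.

```python
def cascade(sample, detector, discriminant, pass_label='fruit'):
    """Run the discriminant only on samples the detector labels as `pass_label`."""
    first = detector(sample)
    if first != pass_label:
        return first                 # keep the first-stage decision, e.g. 'non-fruit'
    return discriminant(sample)      # refine the accepted sample in the second stage

# Toy stand-ins for the two stages, made up for illustration only.
def toy_detector(x):
    return 'fruit' if abs(x[0]) < 5 and abs(x[1]) < 5 else 'non-fruit'

def toy_discriminant(x):
    return 'apple' if x[0] < 0 else 'banana'

results = [cascade(x, toy_detector, toy_discriminant)
           for x in [(-1, 0), (2, 1), (9, 9)]]
```

The third toy sample lies outside the detector's acceptance region, so it never reaches the discriminant and keeps the 'non-fruit' decision.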

Executing the cascade on our test set, we may see that the 'stone' examples get mostly rejected as 'non-fruit':

>> sdconfmat(ts.lab,ts*pc)

ans =

 True      | Decisions
 Labels    | non-fr  apple banana  | Totals
--------------------------------------------
 apple     |     2     47      1   |    50
 banana    |     4      0     46   |    50
 stone     |    25      0      5   |    30
--------------------------------------------
 Totals    |    31     47     52   |   130

Finally, we visualize the cascade decisions in a scatter plot:

>> sdscatter(ts,pc)

3.7. Executing the classifier in application outside of Matlab

In order to execute the cascade classifier outside Matlab, we need two things. First, we link our application with the runtime library perclass.dll. In our example, we use the simple PRSDDemo application written in RealBasic. You may find it in the interfaces\GUIDemo directory of the perClass distribution. This demo serves only as an illustrative example. perClass may be embedded in any other language or environment as long as it can call a DLL.

Second, we export our classifier from Matlab using the sdexport command:

>> sdexport(pc,'cascade.ppl')
Exporting pipeline..ok
This pipeline requires perClass runtime version 3.0 (18-jun-2011) or higher.

The sdexport command creates a binary file cascade.ppl with a complete description of our cascade classifier.

We also export the test data into a comma-separated file:

>> dlmwrite('test_data.txt',+ts)
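
A custom application needs to read this file back before passing the samples to the classifier. As a rough sketch of that step, the following plain Python reads one sample per row and one feature per comma-separated column. The function name `read_samples` is hypothetical, and the tiny file written here only mimics the layout so that the sketch runs on its own.

```python
import csv
import os
import tempfile

def read_samples(path):
    """Read one sample per row, one feature per comma-separated column."""
    with open(path, newline='') as f:
        return [[float(v) for v in row] for row in csv.reader(f) if row]

# Write a tiny file in the same layout (made-up values, two features per row).
path = os.path.join(tempfile.mkdtemp(), 'test_data.txt')
with open(path, 'w', newline='') as f:
    f.write('1.5,-2.25\n0.75,3\n')

samples = read_samples(path)
```

When comparing decisions with Matlab, it is worth checking how many significant digits dlmwrite stored, since values rounded on export may differ slightly from the originals.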

We may now load this pipeline in our perClassDemo application:

We load the text file with comma-separated test samples:

Finally, we execute the cascade classifier:

We may check that we are getting the same decisions as in Matlab:

>> dec=ts*pc
sdlab with 130 entries, 3 groups: 'non-fruit'(33) 'apple'(46) 'banana'(51) 

>> +dec(1:10)

ans =

apple    
apple    
apple    
apple    
non-fruit
apple    
apple    
apple    
apple    
apple    

Note that the classifier running in our application may be changed at any time without recompilation. This allows us to quickly test our classifiers in real-world production conditions.

This concludes our short Getting started intro to perClass.