perClass Documentation
version 5.0 (21-sep-2016)

Chapter 3: Getting Started

This chapter provides a simple example of importing a data set into perClass, training a classifier, and executing it on new data outside of Matlab.

3.1. Quick perClass installation

Installing perClass is simple. Add the perclass and data sub-directories from the perClass distribution to your Matlab path. This can be done either via the File/Set Path menu or with the addpath command.

Example: If your perClass distribution is available in c:\software\perClass_Demo:

>> addpath c:\software\perClass_Demo\perclass
>> addpath c:\software\perClass_Demo\data

A quick way to test whether perClass is installed correctly is to run the sdversion command:

>> sdversion
perClass 4.3 (07-May-2014), Copyright (C) 2007-2014, PR Sys Design, All rights reserved
 Customer: PR Sys Design  Issued: 3-feb-2014
 Toolbox with DB,imaging: The license expires on 1-jul-2014.
 Installation directory: '/Users/pavel/Desktop/perclass'
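
The path setup and the installation check can also be scripted, for example in a startup file. The following is a minimal sketch: the root directory is the example location used above and must be adjusted, and the variable names are illustrative only.

```matlab
% Minimal sketch: add the perClass directories to the path and verify the setup.
% The root below is the example location from the text; adjust it to your system.
root = 'c:\software\perClass_Demo';
subdirs = {'perclass', 'data'};

for k = 1:numel(subdirs)
    d = fullfile(root, subdirs{k});
    if exist(d, 'dir') == 7          % 7 means "is a directory"
        addpath(d);
    else
        warning('Directory not found: %s', d);
    end
end

% exist returns a non-zero code when sdversion is found on the path.
installed = exist('sdversion') ~= 0;
if installed
    sdversion                        % prints version and license details
else
    disp('perClass does not appear to be on the Matlab path.');
end
```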

For details on the perClass installation process, see the chapter on Installation.

3.2. Importing data

In this example, we will use the "Fruit" data set. It is stored in the text file fruit.txt in the data sub-directory of the perClass distribution.

To create a perClass data set, we will first need to import it into Matlab.

fruit.txt is a comma-separated text file in which each row represents one data sample. The first two columns correspond to two features, and the third column contains a string class label.

3.993477,-0.535440,apple
-4.922709,2.519118,stone
-0.052968,-4.946727,apple
2.364367,-5.600644,apple
3.129976,-4.014243,apple
-8.996251,-4.330067,banana
-2.155181,-0.548931,stone
....

We may import such data using the perClass sdimport command, which loads both the data matrix and the labels in one step:

>> a=sdimport('fruit.txt','data',1:2,'lab',3)
260 by 2 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(60) 

We obtain variable a which is an sddata object.

The data set a contains measurements on 260 objects, each represented by two features. Each object also has a class label corresponding to one of the three classes. The 'apple' and 'banana' classes represent genuine fruit to be processed on the conveyor belt. The samples labeled 'stone' are outliers that should be rejected.

The data set object is a data matrix augmented with meta-data such as sample labels or feature names. The samples are stored as rows and the features as columns.

Instead of using sdimport, we may also load the data matrix and string labels separately with Matlab commands or custom scripts and construct an sddata object simply by:

>> a=sddata(data,sdlab(lab))
260 by 2 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(60) 

Here, data is a numerical matrix with samples as rows and features as columns, and lab holds the sample labels in a character array or cell array.
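
As one way to load the data matrix and labels with plain Matlab commands, the sketch below parses a comma-separated file with textscan. To stay self-contained, it first writes the seven sample rows shown earlier to a temporary file; with the real data you would open fruit.txt instead. The sddata call is the one from the text and is left commented out so that the snippet also runs without perClass.

```matlab
% Write a few rows in the fruit.txt format to a temporary file (illustration only).
rows = {'3.993477,-0.535440,apple'
        '-4.922709,2.519118,stone'
        '-0.052968,-4.946727,apple'
        '2.364367,-5.600644,apple'
        '3.129976,-4.014243,apple'
        '-8.996251,-4.330067,banana'
        '-2.155181,-0.548931,stone'};
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '%s\n', rows{:});
fclose(fid);

% Parse two numeric columns and one string column per row.
fid = fopen(fname, 'r');
C = textscan(fid, '%f %f %s', 'Delimiter', ',');
fclose(fid);
delete(fname);

data = [C{1} C{2}];   % samples as rows, features as columns
lab  = C{3};          % cell array of class names

% With perClass on the path, the data set is then built as in the text:
% a = sddata(data, sdlab(lab));
```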

We can assign meaningful feature names to the featlab field of our data set. Say the first feature describes the length and the second the color of our objects:

>> a.featlab=sdlab('length','color')
260 by 2 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(60) 

We can visualize the Fruit data set in a scatter plot using:

>> sdscatter(a)

Each marker represents one data sample; the marker color and shape encode the class.

Our goal is to build a classifier that distinguishes between apples and bananas and discards any other observation, including known stones. Such a statistical classifier is built by training it on a set of labeled observations. To test its performance, we need data unseen in training. Only then is our performance estimate realistic.

Therefore, we need to keep aside a subset of data for testing. We will split our available data set into training and test subsets. We will use 50% of data for training the classifier and the rest for estimating its performance:

>> [tr,ts]=randsubset(a,0.5)
'Fruit set' 130 by 2 sddata, 3 classes: 'apple'(50) 'banana'(50) 'stone'(30) 
'Fruit set' 130 by 2 sddata, 3 classes: 'apple'(50) 'banana'(50) 'stone'(30) 
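
The output above shows each class split in half (50, 50 and 30 samples per subset), which is what a random split performed per class would produce. How randsubset works internally is not described here, so the following plain-Matlab sketch is only an illustration of such a per-class split; the label vector is synthetic and all variable names are hypothetical.

```matlab
% Synthetic labels with the class sizes of the Fruit data set.
lab = [repmat({'apple'}, 100, 1); repmat({'banana'}, 100, 1); repmat({'stone'}, 60, 1)];
[names, ~, cls] = unique(lab);      % cls holds a class index per sample

rng(0);                             % fixed seed for a repeatable split
trIdx = false(numel(lab), 1);
for k = 1:numel(names)
    members = find(cls == k);
    members = members(randperm(numel(members)));   % shuffle within the class
    nTrain  = round(0.5 * numel(members));         % 50% of each class
    trIdx(members(1:nTrain)) = true;
end

trCounts = accumarray(cls(trIdx), 1)';    % samples per class in the training part
tsCounts = accumarray(cls(~trIdx), 1)';   % samples per class in the test part
```

Splitting within each class keeps the class proportions identical in both subsets, so the test estimate is not distorted by an unlucky draw.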

3.3. Training a fruit classifier

We will train a classifier discriminating between the fruit classes. First, let us extract the subset with samples labeled as 'apple' and 'banana'. In perClass, the third dimension represents classes. Therefore, we may extract a subset simply by listing class names:

>> tr2=tr(:,:,{'apple','banana'})
'Fruit set' 100 by 2 sddata, 2 classes: 'apple'(50) 'banana'(50) 

Anywhere in perClass, you may use indices instead of names. The apple and banana classes are the first and second in the class list. Therefore, we may get the test set as:

>> ts2=ts(:,:,1:2)
'Fruit set' 100 by 2 sddata, 2 classes: 'apple'(50) 'banana'(50) 
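
For readers who keep data and labels in ordinary Matlab arrays, the same class selection can be sketched with logical indexing. This is only a plain-Matlab analogue with placeholder data, not a description of how perClass implements class subsetting.

```matlab
% Synthetic labels with the class sizes of the training subset.
lab  = [repmat({'apple'}, 50, 1); repmat({'banana'}, 50, 1); repmat({'stone'}, 30, 1)];
data = zeros(numel(lab), 2);               % placeholder feature values

keep  = ismember(lab, {'apple', 'banana'}); % true for samples of the wanted classes
data2 = data(keep, :);
lab2  = lab(keep);
```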

To capture the specific shape of the class distributions, we use a feed-forward neural network:

>> p=sdneural(tr2)
sequential pipeline       2x1 'Neural network'
 1 Neural network          2x2  10 units
 2 Decision                2x1  weighting, 2 classes

The output of the sdneural command is a trained classifier, represented in perClass by a "pipeline". A pipeline is a sequence of operations that may be applied to new data. In the example above, the pipeline p is trained on the data set tr2. Note that starting with perClass 4, every classifier returns decisions by default.

We may visualize classifier decisions on the training set using sdscatter:

>> sdscatter(tr2,p)

The backdrop color indicates classifier decisions in each position of the feature space.

3.4. Decisions and performance estimation

We may execute the pipeline p on new data using the multiplication operator:

>> dec=ts2*p
sdlab with 100 entries, 2 groups: 'apple'(52) 'banana'(48) 

In perClass, decisions and labels are represented by sdlab objects. The ground-truth labels in the test set ts2 are accessible by:

>> ts2.lab
sdlab with 100 entries, 2 groups: 'apple'(50) 'banana'(50) 

We may compare true labels and decisions using the confusion matrix:

>> sdconfmat(ts2.lab,dec)

ans =

 True      | Decisions
 Labels    |   apple  banana  | Totals
---------------------------------------
 apple     |     49       1   |     50
 banana    |      3      47   |     50
---------------------------------------
 Totals    |     52      48   |    100

We can see that, on our test set, three banana examples are misclassified as apple. To find these problematic examples, we may use simple logical operations on labels:

>> ind=find(ts2.lab=='banana' & dec=='apple')

 ind =

 72
 76
100

>> ts2(ind)
'Fruit set' 3 by 2 sddata, class: 'banana'
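
To make the tallying behind the confusion matrix and the logical selection concrete, here is a minimal plain-Matlab sketch. It uses synthetic integer label vectors (1 for apple, 2 for banana) arranged to reproduce the counts in the table above; because the ordering is synthetic, the resulting indices differ from the 72, 76 and 100 found in the real test set.

```matlab
% Synthetic label vectors (1 = apple, 2 = banana) that reproduce the table above.
truth = [ones(50, 1); 2 * ones(50, 1)];
dec   = [ones(49, 1); 2; ones(3, 1); 2 * ones(47, 1)];

% Count each (true class, decision) pair: rows are true classes, columns decisions.
C = accumarray([truth dec], 1, [2 2]);

% Indices of bananas that were decided as apple.
ind = find(truth == 2 & dec == 1);
```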

How do we estimate the mean classification error of our classifier? We may use the sdtest function:

>> sdtest(ts2.lab,dec)

ans =

0.0400

sdtest offers a number of other performance measures. In our application, we may prefer sensitivity and precision, treating banana as the target class:

>> sdtest(ts2.lab,dec,'measures',{'sensitivity','banana','precision','banana'})

ans =

0.9400    0.9792
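
These numbers can be checked by hand from the confusion matrix above. The sketch below assumes that the mean error is the average of the per-class error rates; with equally sized classes, as here, this equals the overall fraction of misclassified samples. The exact definitions used by sdtest are not stated in this chapter, so treat this as an illustration only.

```matlab
C = [49 1; 3 47];   % rows: true apple, banana; columns: decided apple, banana

classErr = 1 - diag(C) ./ sum(C, 2);   % error rate per true class: 0.02 and 0.06
meanErr  = mean(classErr);             % 0.04

banana = 2;                            % index of the target class
sensitivity = C(banana, banana) / sum(C(banana, :));   % 47/50: bananas that were found
precision   = C(banana, banana) / sum(C(:, banana));   % 47/48: banana decisions that are right
```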

3.5. Classifier confidences

Often, we need to know not only the final decision of a classifier, but also its level of confidence. How do we inspect the "soft" classifier output?

In perClass, we may display detailed information on any object using the ' (transpose) operator:

>> p'
sequential pipeline     2x1 'Neural network'
 1 Neural network          2x2  10 units
   inlab: 'length','color'
     lab: 'apple','banana'
  output: confidence
 2 Decision                2x1  weighting, 2 classes
   inlab: 'apple','banana'
  output: decision ('apple','banana')

We can see that our pipeline is a sequence of two steps, namely the neural network model and the decision. The first step returns confidences.

We may access each step by its index:

>> p(1)
Neural network pipeline   2x2  10 units

Let us take three test examples and estimate the confidence of their classification using our neural network:

>> ts2([10 70 89])
'Fruit set' 3 by 2 sddata, 2 classes: 'apple'(1) 'banana'(2) 

The unary plus (+) operator provides a quick shorthand to extract the content of a data set, returning the data matrix D:

>> D=+ts2([10 70 89])

 D =

 6.0009    1.2879
-5.5805   -7.7396
 0.1531    0.5740

Multiplying the data matrix by the entire pipeline gives us decisions:

>> D*p

ans =

       1
       2
       2

>> p.list
sdlist (2 entries)
 ind name
   1 apple 
   2 banana

Because we now provide a plain data matrix as the input, we receive a numerical vector of integer decisions (and not an sdlab object as we did earlier). The integers are indices into the decision list shown by p.list.

Applying only the first pipeline step yields the desired confidence values:

>> D*p(1)

ans =

0.8677    0.1933
0.2150    0.8362
0.0366    0.9205
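
The link between these confidences and the integer decisions shown earlier can be sketched as follows. The decision step is listed as "weighting, 2 classes"; the sketch assumes equal class weights, in which case the decision is simply the class with the highest confidence. The weights and variable names are assumptions for illustration, not values read from the pipeline.

```matlab
conf  = [0.8677 0.1933
         0.2150 0.8362
         0.0366 0.9205];               % confidences from D*p(1) above
names = {'apple', 'banana'};           % decision list, as shown by p.list
w     = [1 1];                         % assumed equal class weights

% Weight each column, then pick the class with the largest value per row.
[~, dec] = max(bsxfun(@times, conf, w), [], 2);
decNames = names(dec);
```

Under this assumption the result is 1, 2, 2, that is apple, banana, banana, which agrees with the decisions returned by D*p.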

A quick shortcut to strip the decision step off any pipeline is the unary minus operator (-):

>> p
sequential pipeline       2x1 'Neural network'
 1 Neural network          2x2  10 units
 2 Decision                2x1  weighting, 2 classes

>> -p
Neural network pipeline   2x2  10 units

We may visualize the confidences in our feature space using sdscatter:

>> sdscatter(ts2,-p)

ans =

 3

Use arrows on the toolbar or cursor keys to move between the two per-class confidence outputs.

3.6. Classifier execution out of Matlab

Eventually, we want to run our classifier outside of Matlab, in a machine or a custom application. perClass provides a simple mechanism to export classifiers for execution using the perClass Runtime library. This functionality is available in the Pro and Enterprise versions.

>> sdexport(p,'classifier1.ppl')
Exporting pipeline..ok
This pipeline requires perClass runtime version 4.0 (29-mar-2013) or higher.

Apart from the standard C/C++ API, the perClass distribution comes with the sdrun command-line tool, which can execute any exported classifier from the command line without additional programming.

Open a command prompt for your platform (on MS Windows, press Windows key+R to open the Run dialog and enter cmd). Change the current directory to perclass\interfaces\sdrun\YOUR_PLATFORM and type:

> ./sdrun.exe PATH_TO_EXPORTED_CLASSIFIER\classifier1.ppl

where PATH_TO_EXPORTED_CLASSIFIER is the current directory of your Matlab session (use the pwd command to display it).

This prints basic information on the pipeline classifier1.ppl, such as:

Pipeline name: 'Neural network'
Minimum required runtime version: 4.0 (29-mar-2013)
Input type: double, dimensionality: 2
Output type: int, dimensionality: 1, decisions
Operating point count: 1, current: 1
Possible decisions: 1:apple, 2:banana

We may directly provide the numerical values of input features, samples separated by semicolons:

> ./sdrun.exe classifier1.ppl -d "6.0009 1.2879; -5.5805 -7.7396; 0.1531 0.5740"
apple
banana
banana

We have received the same decisions as in Matlab. To compute confidences at runtime, export the pipeline returning soft outputs:

>> sdexport(-p,'classifier2.ppl')
Exporting pipeline..ok
This pipeline requires perClass runtime version 4.0 (29-mar-2013) or higher.

> ./sdrun.exe classifier2.ppl -d "6.0009 1.2879; -5.5805 -7.7396; 0.1531 0.5740"
0.867726,0.193307
0.214956,0.836249
0.036589,0.920537
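
When testing an exported classifier, it can be convenient to build the -d argument from a Matlab matrix and to read the printed confidences back into a matrix. The sketch below shows one way to do this with standard Matlab string functions. The actual system call is left commented out because it assumes that sdrun.exe and classifier2.ppl are both in the current directory; the output text parsed here is copied from the listing above.

```matlab
D = [ 6.0009  1.2879
     -5.5805 -7.7396
      0.1531  0.5740];

% One string per sample, features separated by spaces.
rows = cellfun(@(r) strtrim(sprintf('%.4f ', r)), num2cell(D, 2), ...
               'UniformOutput', false);
arg = strjoin(rows', '; ');            % samples separated by semicolons
cmd = sprintf('sdrun.exe classifier2.ppl -d "%s"', arg);

% Uncomment to execute (assumes sdrun.exe and the exported file are here):
% [status, out] = system(cmd);

% For illustration, parse the output text shown above into a matrix.
out  = sprintf('0.867726,0.193307\n0.214956,0.836249\n0.036589,0.920537\n');
conf = sscanf(out, '%f,%f', [2 Inf])'; % one row per sample, one column per class
```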

The perClass Runtime may be embedded into custom applications using the standard C/C++ interface or a separate .NET interface.