perClass Documentation
version 5.1 (31-May-2017)

Chapter 11: Pipelines

This chapter describes pipelines, trainable operations on data.

11.1. Introduction

In perClass, processing or transformation of data is described using the concept of a pipeline. Let us take, as an example, training of a linear classifier:

>> load fruit
>> a
'Fruit set' 260 by 2 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(60) 

>> p=sdlinear(a)
sequential pipeline       2x1 'Gaussian model+Normalization+Decision'
 1 Gaussian model          2x3  single cov.mat.
 2 Normalization           3x3 
 3 Decision                3x1  weighting, 3 classes

The object p is a pipeline composed of three steps. The first is a Gaussian model estimated in the 2D input feature space. The model describes three classes ('apple', 'banana' and 'stone') and therefore provides three corresponding outputs (probability densities). The second step is a normalization that turns the densities into posteriors. Finally, the third step converts the posteriors into a decision, providing a single integer output.
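The three steps can be illustrated with a minimal sketch in plain MATLAB, without perClass. The class means, the shared covariance matrix, the equal weights and the samples below are all hypothetical values chosen for illustration. The sketch shows only the general idea (density, then posterior, then decision) under the assumption of equal class priors; it is not how perClass implements these steps.

```matlab
% Hypothetical 2-D model parameters (NOT estimated from the fruit set).
mu = [0 0; 3 0; 0 3];      % one mean per class (rows)
S  = eye(2);               % single covariance matrix shared by all classes
x  = [0.2 0.1; 2.9 0.3];   % two samples to classify (rows)

n = size(x, 1);            % number of samples
c = size(mu, 1);           % number of classes

% Step 1: Gaussian model, one density per class (n-by-c output).
dens = zeros(n, c);
normConst = 1 / (2*pi*sqrt(det(S)));
for k = 1:c
    d = bsxfun(@minus, x, mu(k,:));                 % deviations from class mean
    dens(:,k) = normConst * exp(-0.5 * sum((d/S).*d, 2));
end

% Step 2: normalization, densities become posteriors (each row sums to one).
% Equal class priors are assumed here.
post = bsxfun(@rdivide, dens, sum(dens, 2));

% Step 3: decision, index of the largest weighted posterior.
w = [1 1 1];                                        % equal weights (assumption)
[~, dec] = max(bsxfun(@times, post, w), [], 2);
```

The sizes mirror the pipeline display: two inputs, three soft outputs, one decision per sample.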

Pipelines in perClass are not limited to classification. They describe all types of data processing, including data scaling, feature extraction and selection.

11.1.1. Execution on new data

The pipeline p may be applied to any data set with two features using the multiplication operator *. Here, data is a 5-by-2 numeric matrix that we first convert into an sddata object:

>> data=sddata(data)
5 by 2 sddata, class: 'unknown'
>> out=data*p
sdlab with 5 entries, 2 groups: 'apple'(3) 'stone'(2) 

The output of our pipeline is an sdlab object holding the classifier decisions. In perClass 4, all classifiers produce decisions by default.

Pipeline execution is analogous to matrix multiplication. Our pipeline p acts as a matrix with two rows (feature inputs) and one column (decision output).
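The analogy can be checked on the step sizes shown in the pipeline display. The following plain MATLAB sketch (no perClass needed) treats each step as an inputs-by-outputs pair; the variable names are illustrative only.

```matlab
% Step sizes as displayed for pipeline p: [inputs outputs].
dims = [2 3;    % Gaussian model
        3 3;    % Normalization
        3 1];   % Decision

% As in a matrix product, the outputs of each step must match
% the inputs of the next step.
compatible = all(dims(1:end-1, 2) == dims(2:end, 1));

% The composed pipeline keeps the first input size and the last output size.
total = [dims(1,1) dims(end,2)];
```

Here total is [2 1], which agrees with the 2x1 size reported for the whole pipeline.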

The multiplication operator is only syntactic sugar; the real work is done by the sdexe function:

>> sdexe(p,data)
sdlab with 5 entries, 2 groups: 'apple'(3) 'stone'(2) 

If we execute the pipeline on a raw data matrix, we obtain raw numerical output:

>> data=rand(5,2)*100

data =

   41.4248   77.6399
   36.8954    4.8470
   85.0896   59.0271
   79.7602   15.8238
   35.0236   93.7622

>> out=data*p

out =

       3
       1
       1
       1
       3

The mapping between integer decisions and decision names is handled by the pipeline list:

>> p.list
sdlist (3 entries)
 ind name
   1 apple 
   2 banana
   3 stone 

We may use the list object to convert decisions into names and vice versa:

>> p.list(3)

ans =

stone

>> p.list('apple')

ans =

 1

>> p.list(out)

ans =

stone
apple
apple
apple
stone
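Conceptually, the list behaves like an ordered set of names indexed by the integer decisions. The plain MATLAB sketch below mimics both directions of the lookup with a cell array; it only illustrates the idea and does not use or reproduce the sdlist object.

```matlab
names = {'apple', 'banana', 'stone'};   % same order as shown by p.list
out   = [3; 1; 1; 1; 3];                % integer decisions from the example

% Decisions -> names: index the cell array, keep a column layout.
decNames = reshape(names(out), [], 1);

% Name -> decision: position of the matching entry.
idx = find(strcmp(names, 'apple'));
```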

11.1.2. Accessing pipeline steps

Unless specified explicitly, all pipeline operations refer to the last step:

>> p
sequential pipeline       2x1 'Gaussian model+Normalization+Decision'
 1 Gaussian model          2x3  single cov.mat.
 2 Normalization           3x3 
 3 Decision                3x1  weighting, 3 classes

>> p.output

ans =

decision

We may access individual pipeline steps using parentheses ():

>> p(1).output

ans =

probability density

Say we wish to extract the "soft outputs" of our classifier, just before they are turned into decisions:

>> p(1:2)
sequential pipeline       2x3 'Gaussian model+Normalization'
 1 Gaussian model          2x3  single cov.mat.
 2 Normalization           3x3 

>> out=data*p(1:2)
5 by 3 sddata, class: 'unknown'

The output is now a data set, because the second pipeline step returns real-valued output.

A quick shorthand for removing the decision step is the unary minus (-) operator:

>> -p
sequential pipeline       2x3 'Gaussian model+Normalization'
 1 Gaussian model          2x3  single cov.mat.
 2 Normalization           3x3 

We may, therefore, get classifier soft outputs using:

>> data*-p
5 by 3 sddata, class: 'unknown'

Applying the unary minus to a pipeline that already returns soft output has no effect:

>> --p
sequential pipeline       2x3 'Gaussian model+Normalization'
 1 Gaussian model          2x3  single cov.mat.
 2 Normalization           3x3 

11.1.3. Displaying pipeline details

As with data sets and labels, perClass provides a quick shortcut for displaying details about a pipeline: the transpose operator ('):

>> p'
sequential pipeline     2x1 'Gaussian model+Normalization+Decision'
 1 Gaussian model          2x3  single cov.mat.
   inlab: 'length','color'
     lab: 'apple','banana','stone'
  output: probability density
 2 Normalization           3x3 
   inlab: 'apple','banana','stone'
     lab: 'apple','banana','stone'
  output: posterior
 3 Decision                3x1  weighting, 3 classes
   inlab: 'apple','banana','stone'
  output: decision ('apple','banana','stone')

For each step, we can see the input and output labels and the type of output. We can see that our pipeline p expects two input features, namely 'length' and 'color'.

This information may be accessed using pipeline fields inlab, lab and output:

>> p(1).inlab
sdlab with 2 entries: 'length','color'

>> p(3).output

ans =

decision

11.1.4. Untrained pipelines

Usually, we create pipelines by training them on a data set. However, in some situations, it is more convenient to create a pipeline description without a concrete data set. Such a pipeline is called untrained.

An untrained pipeline is created by passing an empty matrix [] as the first argument, in place of the data set.

The trained Parzen classifier:

>> a
'Fruit set' 260 by 2 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(60) 

>> p=sdparzen(a)
.....sequential pipeline       2x1 'Parzen model+Decision'
 1 Parzen model            2x3  260 prototypes, h=0.8
 2 Decision                3x1  weighting, 3 classes

The untrained Parzen classifier:

>> u=sdparzen([])
untrained pipeline 'sdparzen'

By multiplying a data set with an untrained pipeline, we train it:

>> p2=a*u
.....sequential pipeline       2x1 'Parzen model+Decision'
 1 Parzen model            2x3  260 prototypes, h=0.8
 2 Decision                3x1  weighting, 3 classes

Note that the order is always data * pipeline.
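The distinction between an untrained and a trained pipeline resembles the distinction between a training recipe and the model it produces. The plain MATLAB sketch below expresses this with function handles and a deliberately simple two-class nearest-mean rule. The data, the rule and all names (sqdist, makeClassifier, untrained, trained) are hypothetical; the sketch does not reproduce sdparzen or any perClass internals.

```matlab
% Hypothetical training set: two classes in 2-D (not the fruit data).
X = [0 0; 1 0; 0 1; 5 5; 6 5; 5 6];
y = [1; 1; 1; 2; 2; 2];

% Squared distance of each row of Z to the row vector m.
sqdist = @(Z, m) sum(bsxfun(@minus, Z, m).^2, 2);

% A trained classifier built from two class means: returns 2 where
% the sample is closer to m2 than to m1, and 1 otherwise.
makeClassifier = @(m1, m2) @(Z) 1 + (sqdist(Z, m2) < sqdist(Z, m1));

% "Untrained" classifier: a recipe that still needs data. Calling it with
% a training set estimates the class means and returns a trained classifier.
untrained = @(X, y) makeClassifier(mean(X(y==1,:), 1), mean(X(y==2,:), 1));

trained = untrained(X, y);            % plays the role of p2 = a*u
dec = trained([0.5 0.5; 5.5 5.5]);    % plays the role of data*p2
```

Keeping the recipe separate from the data means the same definition can be trained repeatedly on different subsets, which is what cross-validation needs.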

Untrained pipelines are useful for separating the definition of a classifier from its training on data. We may provide any parameters when defining an untrained pipeline:

>> u2=sdneural([],'units',20,'iters',1000)
untrained pipeline 'sdneural'

Untrained pipelines are used, for example, by sdcrossval to perform evaluation by cross-validation:

>> sdcrossval(u,a)
10 folds: [1: ....] [2: .....] [3: ....] [4: ....] [5: .....] [6: .....] [7: ....] [8: .....] [9: .....] [10: ....] 

ans =

10-fold rotation

ind mean (std)  measure
  1 0.09 (0.02) mean error over classes, priors [0.3,0.3,0.3]

>> sdcrossval(u2,a)
10 folds: [1: ] [2: ] [3: ] [4: ] [5: ] [6: ] [7: ] [8: ] [9: ] [10: ] 

ans =

10-fold rotation

ind mean (std)  measure
  1 0.08 (0.01) mean error over classes, priors [0.3,0.3,0.3]
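The reported measure, "mean error over classes", can be read as the average of the per-class error rates, with the displayed priors presumably being 1/3 rounded to one decimal. That reading is an assumption; the exact computation inside sdcrossval is not shown here. Under that assumption, a minimal plain MATLAB sketch for a single test fold, using hypothetical labels, looks as follows.

```matlab
% Hypothetical true and predicted labels for one test fold.
truth = [1 1 1 1 2 2 2 2 3 3];
pred  = [1 1 2 1 2 2 2 2 3 1];

classes = unique(truth);
err = zeros(size(classes));
for k = 1:numel(classes)
    inClass = (truth == classes(k));
    err(k)  = mean(pred(inClass) ~= classes(k));   % error rate within class k
end

% Equal class priors: every class counts the same, whatever its size.
meanErr = mean(err);

% For comparison: the plain fraction of misclassified samples.
overallErr = mean(pred ~= truth);
```

In this example the per-class errors are 0.25, 0 and 0.5, so meanErr is 0.25 while overallErr is 0.2; the two differ because the third class is smaller than the others.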