perClass Documentation
version 5.0 (21-sep-2016)

kb13: How to find samples with a specific type of error in a confusion matrix?

Keywords: confusion matrices

Published on: 17-sep-2015 (updated for 4.6)

perClass version used: 4.6 (29-jun-2015)

Problem: Find out which samples suffer from a specific type of error (a specific cell of a confusion matrix).

Solution: Use the sdconfmatind function to find indices of samples in a specific cell of a confusion matrix.

Let us assume a two-class fruit data set (apple and banana) split into a training and a test set:

>> load fruit; a=a(:,:,[1 2])

'Fruit set' 200 by 2 sddata, 2 classes: 'apple'(100) 'banana'(100)

>> [tr,ts]=randsubset(a,0.5)
'Fruit set' 100 by 2 sddata, 2 classes: 'apple'(50) 'banana'(50)
'Fruit set' 100 by 2 sddata, 2 classes: 'apple'(50) 'banana'(50)

We train a Gaussian model, apply the model to the test set, and obtain the classifier decisions (dec):

>> p=sdgauss(tr)
sequential pipeline       2x1 'Gaussian model+Decision'
 1 Gaussian model          2x2  full cov.mat.
 2 Decision                2x1  weighting, 2 classes

>> dec=ts*p
sdlab with 100 entries, 2 groups: 'apple'(57) 'banana'(43) 

The confusion matrix compares the ground-truth labels, stored in the test data set ts, to the decisions (ts*p, the same decisions as dec above):

>> sdconfmat(ts.lab,ts*p)

ans =

 True      | Decisions
 Labels    |   apple  banana  | Totals
---------------------------------------
 apple     |     46       4   |     50
 banana    |     11      39   |     50
---------------------------------------
 Totals    |     57      43   |    100
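
As a rough illustration of what the table above counts, the following sketch builds a confusion matrix by hand in plain MATLAB. It does not use perClass and is not how sdconfmat is implemented; the toy labels and the variable names (trueLab, decLab, classes, C) are assumptions made up for this example.

```matlab
% Hypothetical toy data: ground-truth labels and decisions stored as
% plain cell arrays of strings (not perClass sdlab objects).
trueLab = {'apple','apple','banana','apple','banana','banana'};
decLab  = {'apple','banana','banana','banana','apple','banana'};
classes = {'apple','banana'};

% Map each label to its class number (1 = apple, 2 = banana).
[~, t] = ismember(trueLab, classes);
[~, d] = ismember(decLab, classes);

% Count the samples for every (true class, decision) pair.
% Rows are true classes, columns are decisions.
C = accumarray([t(:) d(:)], 1, [2 2]);
disp(C)

% Fraction of misclassified samples: the off-diagonal counts.
err = (sum(C(:)) - trace(C)) / sum(C(:));
```

Each off-diagonal entry of C corresponds to one type of error. In this toy example C(1,2) counts the apples that were decided as banana.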

We would now like to find the 4 apple samples that our classifier misclassifies as banana. We call the sdconfmatind function with the ground-truth labels, the decisions, and the true and estimated class that define the confusion matrix cell (here 'apple' and 'banana'):

>> ind=sdconfmatind(ts.lab,ts*p,'apple','banana')

ind =

15
23
25
48

The four test samples are:

>> ts(ind)
'Fruit set' 4 by 2 sddata, class: 'apple'
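
Conceptually, picking the samples of one confusion matrix cell means selecting the samples whose true label and whose decision both match the requested classes. The sketch below shows this idea in plain MATLAB on made-up data. It is only an illustration under these assumptions and makes no claim about the internals of sdconfmatind; all variable names are hypothetical.

```matlab
% Hypothetical toy data as plain cell arrays of strings
% (not perClass sdlab objects).
trueLab = {'apple','apple','banana','apple','banana','banana'};
decLab  = {'apple','banana','banana','banana','apple','banana'};

% Logical masks: true class is 'apple', decision is 'banana'.
isTrueApple = strcmp(trueLab, 'apple');
isDecBanana = strcmp(decLab, 'banana');

% Indices of the samples that fall into the (apple, banana) cell.
ind = find(isTrueApple & isDecBanana);
disp(ind)
```

The resulting indices refer to positions in the label vectors, in the same way that ind above refers to positions in the test set ts.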

How do we find these samples in the original data set a? It is quite easy. When we display the details of data set a with the transpose operator...

>> a'
'Fruit set' 200 by 2 sddata, 2 classes: 'apple'(100) 'banana'(100) 
sample props: 'lab'->'class' 'class'(L) 'ident'(N)
feature props: 'featlab'->'featname' 'featname'(L)
data props:  'data'(N)

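... we can see there is a numerical ident property. We stored the index of each sample in it using the command below. The assignment has to be made before the data set is split, so that the subsets carry the original indices: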

>> a.ident=1:length(a)
'Fruit set' 200 by 2 sddata, 2 classes: 'apple'(100) 'banana'(100) 

We may retrieve the ident property of any sample:

>> a(10).ident

ans =

10

Therefore, we can quickly see the original indices of the misclassified test samples:

>> ts(ind).ident

ans =

33
52
55
97
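
The reason this works is that each sample carries its original index along when the data set is split. A minimal sketch of the same bookkeeping in plain MATLAB follows, assuming a fixed hypothetical shuffle in place of a random split; none of the names below come from perClass.

```matlab
% Original index of each sample, assigned before any splitting.
n = 10;
ident = 1:n;

% Hypothetical fixed shuffle standing in for a random split:
% the first five shuffled samples form the test subset.
perm = [7 3 9 1 5 2 8 10 4 6];
tsIdent = ident(perm(1:5));   % ident values carried by the test subset

% Hypothetical positions of misclassified samples within the test subset.
ind = [2 5];

% Look up the original indices through the stored ident values.
origInd = tsIdent(ind);
disp(origInd)
```

Positions in the subset (here 2 and 5) are translated to positions in the original data (here 3 and 5) by a single indexing step, which is what ts(ind).ident does above.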