17.08.2011  pavel

How to find out which samples are misclassified?

Often, we need a more detailed understanding of the errors our classifier makes. Let us go through a simple step-by-step example showing how to retrieve the misclassified samples.


We consider a medical data set with data extracted from 16 patient scans:
>> a
'medical D/ND' 5762 by 10 sddata, 2 classes: 'disease'(1495) 'no-disease'(4267) 

>> a.patient
sdlab with 5762 entries, 16 groups
We split our data set into a training part and a test part, using different patient scans for each. Because no patient contributes to both sets, the resulting performance estimates realistically reflect generalization to unseen patients.
>> [tr,ts]=subset(a,'patient',1:8)
'medical D/ND' 2920 by 10 sddata, 2 classes: 'disease'(888) 'no-disease'(2032) 
'medical D/ND' 2842 by 10 sddata, 2 classes: 'disease'(607) 'no-disease'(2235) 

>> tr.patient'
 ind name        size percentage
   1 Alex         350 (12.0%)
   2 Bob          400 (13.7%)
   3 Chris        400 (13.7%)
   4 Dick         337 (11.5%)
   5 Emily        400 (13.7%)
   6 Frank        302 (10.3%)
   7 Gabriel      331 (11.3%)
   8 Hanz         400 (13.7%)

>> ts.patient'
 ind name        size percentage
   1 Irene        305 (10.7%)
   2 Monica       221 ( 7.8%)
   3 Nick         316 (11.1%)
   4 Olaf         400 (14.1%)
   5 Paul         400 (14.1%)
   6 Rob          400 (14.1%)
   7 Steffany     400 (14.1%)
   8 Tom          400 (14.1%)
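The idea behind a patient-wise split can be sketched in plain MATLAB without the toolbox. This is only an illustration on made-up toy data, not how subset is implemented; the variable names (patient, data, trainPatients) are hypothetical.

```matlab
% Hypothetical toy data: 6 samples, each tagged with a patient id (1..3).
patient = [1 1 2 2 3 3]';
data    = (1:6)';                  % stand-in for the feature vectors

trainPatients = [1 2];             % patients assigned to the training set
isTrain = ismember(patient, trainPatients);

trData = data(isTrain, :);         % all samples of patients 1 and 2
tsData = data(~isTrain, :);        % all samples of the remaining patient

% No patient contributes samples to both sets.
assert(isempty(intersect(patient(isTrain), patient(~isTrain))));
```

Splitting on the patient id rather than on individual samples is what keeps correlated samples from the same scan out of both sets at once.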
Now, we will train a classifier. In this example, we use a linear discriminant assuming normal densities:
>> p=sdlinear(tr)*sddecide
sequential pipeline     10x1 'Gauss eq.cov.+Output normalization+Decision'
 1  Gauss eq.cov.          10x2  2 classes, 2 components (sdp_normal)
 2  Output normalization    2x2  (sdp_norm)
 3  Decision                2x1  weighting, 2 classes, 1 ops at op 1 (sdp_decide)
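For readers who want to see what a linear discriminant under the normal-density assumption computes, here is a minimal textbook sketch in plain MATLAB. It assumes Gaussian classes sharing one pooled covariance matrix; it is not the sdlinear implementation, and the toy data and all variable names are hypothetical.

```matlab
% Hypothetical toy data: two well-separated classes in 2-D.
X = [1 2; 2 1; 2 3; 3 2;  6 5; 7 6; 6 7; 7 8];
y = [1 1 1 1 2 2 2 2]';

K = 2;
[n, d] = size(X);
mu = zeros(K, d);
prior = zeros(K, 1);
S = zeros(d);
for k = 1:K
    Xk = X(y == k, :);
    mu(k, :) = mean(Xk, 1);                      % class mean
    prior(k) = size(Xk, 1) / n;                  % class prior from frequencies
    Xc = Xk - repmat(mu(k, :), size(Xk, 1), 1);  % centered class samples
    S = S + Xc' * Xc;                            % accumulate within-class scatter
end
S = S / (n - K);                                 % pooled covariance shared by all classes

% Linear discriminant score of every sample for every class.
scores = zeros(n, K);
for k = 1:K
    w = S \ mu(k, :)';
    scores(:, k) = X * w - 0.5 * mu(k, :) * w + log(prior(k));
end
[~, dec] = max(scores, [], 2);                   % decide for the highest score
```

Because the covariance is shared, the quadratic terms cancel between classes and the decision boundary is linear in the features.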
By including sddecide, we make the pipeline p return decisions at a default operating point. We apply the pipeline to our test set to get the decisions and display the confusion matrix:
>> dec=ts*p
sdlab with 2842 entries, 2 groups: 'disease'(562) 'no-disease'(2280) 

>> sdconfmat(ts.lab,dec)

ans =

 True        | Decisions
 Labels      |  diseas  no-dis  | Totals
-----------------------------------------
 disease     |    278     329   |    607
 no-disease  |    284    1951   |   2235
-----------------------------------------
 Totals      |    562    2280   |   2842
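The counts in this confusion matrix translate directly into per-class error rates. The following plain-MATLAB sketch copies the four counts from the output above and does the arithmetic; the variable names are chosen for illustration only.

```matlab
% Confusion matrix copied from the output above
% (rows: true class, columns: decision; order: disease, no-disease).
C = [278  329;
     284 1951];

rowTotals     = sum(C, 2);                  % 607 disease, 2235 no-disease samples
missedDisease = C(1, 2) / rowTotals(1);     % disease decided as no-disease
falseAlarm    = C(2, 1) / rowTotals(2);     % no-disease decided as disease
overallError  = (C(1, 2) + C(2, 1)) / sum(C(:));

fprintf('missed disease: %.1f%%\n', 100 * missedDisease);
fprintf('false alarms:   %.1f%%\n', 100 * falseAlarm);
fprintf('overall error:  %.1f%%\n', 100 * overallError);
```

At this operating point, roughly 54.2% of the disease samples are missed and about 12.7% of the no-disease samples are wrongly accepted as disease, for an overall error of about 21.6%.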
We would like to understand where the 284 'no-disease' samples wrongly accepted as 'disease' come from. To find them, we use the sdconfmatind command and ask for the ('no-disease', 'disease') entry of the confusion matrix, given as (true class, decision):
>> ind=sdconfmatind(ts.lab,ts*p,'no-disease','disease');
>> length(ind)

ans =

   284
Using these indices, we extract the data set containing only the misclassified samples:
>> err=ts(ind)
'medical D/ND' 284 by 10 sddata, class: 'no-disease'
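Conceptually, picking one cell of the confusion matrix amounts to combining two conditions, one on the true labels and one on the decisions. The plain-MATLAB sketch below shows this with logical indexing on made-up toy labels; it illustrates the idea only and makes no claim about how sdconfmatind works internally.

```matlab
% Hypothetical toy labels; 1 = 'disease', 2 = 'no-disease'.
trueLab = [1 2 2 1 2 2]';
decLab  = [1 1 2 2 1 2]';

% Samples whose true class is 'no-disease' but which were decided as 'disease'.
ind = find(trueLab == 2 & decLab == 1);

data    = (10:10:60)';             % stand-in for the feature vectors
errData = data(ind, :);            % the misclassified samples only
```

Here ind contains the positions 2 and 5, so errData holds the second and fifth sample.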
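Now we can analyze these errors more closely, for example by displaying which patient and tissue type they originate from. Here sdconfmat serves as a cross-tabulation of two sets of labels, so despite the generic headers in the output, the rows are patients and the columns are tissue types: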
>> sdconfmat(err.patient,err.tissue)

ans =

 True      | Decisions
 Labels    |  no-dis  health    bone  organ   | Totals
-------------------------------------------------------
 Irene     |      0       0       9       5   |     14
 Monica    |      0       0       0       5   |      5
 Nick      |     56       0       0       0   |     56
 Olaf      |     48       0       0       0   |     48
 Paul      |      5       0       0       0   |      5
 Rob       |     37       0       0       0   |     37
 Steffany  |     28       0       0       0   |     28
 Tom       |      0      91       0       0   |     91
-------------------------------------------------------
 Totals    |    174      91       9      10   |    284
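A cross-tabulation of this kind simply counts how often each pair of labels occurs together. As a rough plain-MATLAB sketch on made-up toy indices (the variable names are hypothetical and this is not the toolbox code), accumarray does the counting:

```matlab
% Hypothetical toy meta-data for five misclassified samples.
patient = [1 1 2 3 3]';    % patient index per sample
tissue  = [1 2 2 1 1]';    % tissue-type index per sample

% counts(i, j) = number of samples from patient i with tissue type j.
counts = accumarray([patient tissue], 1, [3 2]);

rowTotals = sum(counts, 2);   % errors per patient
colTotals = sum(counts, 1);   % errors per tissue type
```

The row and column sums correspond to the Totals column and row of the table above.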
To conclude, sdconfmatind allows us to easily extract samples based on label or decision values.
