17.08.2011 pavel
How to find out which samples are misclassified?

Often, we need a more detailed understanding of our classifier's errors. Let us go through a simple step-by-step example of how to retrieve misclassified examples.
We consider a medical data set with data extracted from 16 patient scans:
>> a
'medical D/ND' 5762 by 10 sddata, 2 classes: 'disease'(1495) 'no-disease'(4267)
>> a.patient
sdlab with 5762 entries, 16 groups

We split our data set into a training part and a test part, using different patient scans for each. This is especially useful for obtaining realistic performance estimates that reflect generalization to unseen patients.
>> [tr,ts]=subset(a,'patient',1:8)
'medical D/ND' 2920 by 10 sddata, 2 classes: 'disease'(888) 'no-disease'(2032)
'medical D/ND' 2842 by 10 sddata, 2 classes: 'disease'(607) 'no-disease'(2235)
>> tr.patient'
 ind name        size percentage
   1 Alex         350 (12.0%)
   2 Bob          400 (13.7%)
   3 Chris        400 (13.7%)
   4 Dick         337 (11.5%)
   5 Emily        400 (13.7%)
   6 Frank        302 (10.3%)
   7 Gabriel      331 (11.3%)
   8 Hanz         400 (13.7%)
>> ts.patient'
 ind name        size percentage
   1 Irene        305 (10.7%)
   2 Monica       221 ( 7.8%)
   3 Nick         316 (11.1%)
   4 Olaf         400 (14.1%)
   5 Paul         400 (14.1%)
   6 Rob          400 (14.1%)
   7 Steffany     400 (14.1%)
   8 Tom          400 (14.1%)

Now, we will train a classifier. In this example, we use a linear discriminant assuming normal densities:
>> p=sdlinear(tr)*sddecide
sequential pipeline        10x1 'Gauss eq.cov.+Output normalization+Decision'
 1  Gauss eq.cov.          10x2  2 classes, 2 components (sdp_normal)
 2  Output normalization    2x2  (sdp_norm)
 3  Decision                2x1  weighting, 2 classes, 1 ops at op 1 (sdp_decide)

By including sddecide, we let the pipeline p return decisions at a default operating point.
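To make the idea of a decision at an operating point more concrete, here is a minimal sketch in plain MATLAB. It is not the perClass implementation. The soft outputs in out, the weights in w, and the rule itself (multiply each class output by its weight and pick the largest) are assumptions made for illustration; the pipeline summary only tells us that the decision step is of the 'weighting' type.

```matlab
% Hypothetical normalized outputs for four samples (rows) and two classes
% (columns: 'disease', 'no-disease'). These numbers are made up.
out = [0.80 0.20
       0.45 0.55
       0.30 0.70
       0.55 0.45];
classes = {'disease','no-disease'};

% Assumed operating point: one weight per class. With equal weights,
% the class with the largest output wins.
w = [0.5 0.5];

% Weight each column, then pick the class with the largest weighted output.
weighted = out .* repmat(w, size(out,1), 1);
[~, k] = max(weighted, [], 2);

% Turn the winning column indices into a column of decision names.
dec = reshape(classes(k), [], 1);
disp(dec)
```

Under this assumed rule, raising the weight of 'disease' relative to 'no-disease' makes more samples come out as 'disease', which is the kind of trade-off an operating point controls.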
We get the decisions on our test set and display the confusion matrix:
>> dec=ts*p
sdlab with 2842 entries, 2 groups: 'disease'(562) 'no-disease'(2280)
>> sdconfmat(ts.lab,dec)

ans =

 True        | Decisions
 Labels      | diseas no-dis  | Totals
 -----------------------------------------
 disease     |    278    329  |    607
 no-disease  |    284   1951  |   2235
 -----------------------------------------
 Totals      |    562   2280  |   2842

We would like to understand where the 284 'no-disease' samples wrongly classified as 'disease' come from. Therefore, we use the sdconfmatind command, asking for the 'no-disease', 'disease' entry (true class, decision):
>> ind=sdconfmatind(ts.lab,ts*p,'no-disease','disease');
>> length(ind)

ans =

   284

The data set with misclassified samples:
>> err=ts(ind)
'medical D/ND' 284 by 10 sddata, class: 'no-disease'

Now, we may wish to analyze these errors more closely, for example by displaying which patient and tissue type they originate from:
>> sdconfmat(err.patient,err.tissue)

ans =

 True        | Decisions
 Labels      | no-dis health   bone  organ  | Totals
 -------------------------------------------------------
 Irene       |      0      0      9      5  |     14
 Monica      |      0      0      0      5  |      5
 Nick        |     56      0      0      0  |     56
 Olaf        |     48      0      0      0  |     48
 Paul        |      5      0      0      0  |      5
 Rob         |     37      0      0      0  |     37
 Steffany    |     28      0      0      0  |     28
 Tom         |      0     91      0      0  |     91
 -------------------------------------------------------
 Totals      |    174     91      9     10  |    284

To conclude, sdconfmatind allows us to easily extract samples based on label or decision values.
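
For readers who want to see the underlying idea without the toolbox, the sketch below reproduces the two steps in plain MATLAB: picking the sample indices that fall into one cell of the confusion matrix, and counting those samples per patient. It only illustrates the logic and is not how sdconfmatind or sdconfmat are implemented. The variables truth, dec and patient, and the eight toy samples in them, are made up for this example.

```matlab
% Toy data, made up for illustration: true labels, decisions and a
% second label set (patient names) for eight samples.
truth   = {'disease','no-disease','no-disease','disease', ...
           'no-disease','no-disease','disease','no-disease'};
dec     = {'disease','disease','no-disease','no-disease', ...
           'disease','no-disease','disease','disease'};
patient = {'Irene','Irene','Nick','Nick','Nick','Tom','Tom','Nick'};

% Indices of one confusion-matrix cell: true class 'no-disease',
% decision 'disease'.
ind = find(strcmp(truth,'no-disease') & strcmp(dec,'disease'));

% Count the selected errors per patient. unique returns the sorted
% patient names and, for each selected sample, the index of its name.
[names, ~, g] = unique(patient(ind));
counts = accumarray(g(:), 1);

for i = 1:numel(names)
    fprintf('%-8s %d\n', names{i}, counts(i));
end
```

In this toy set the selected errors sit at positions 2, 5 and 8. The same pattern works for any other cell of the confusion matrix by changing the two class names in the strcmp calls.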
