How to set up leave-one-patient-out cross-validation
In many applications, we need to make sure our classifier generalizes to unseen patients, objects, events, etc. Therefore, we need to take these entities into account when cross-validating our algorithm. PRSD Studio provides leave-one-object-out cross-validation through the sdcrossval routine. In this example, however, we show how to build a very simple leave-one-object-out scheme in two lines of code, with every step open to direct inspection.
Let's start with a data set. We will use the medical data, which contains 6400 samples representing neighborhoods in medical images from different patients:
>> a
'medical D/ND' 6400 by 11 sddata, 3 classes: 'disease'(1495) 'no-disease'(4267) 'noise'(638)
We are only interested in disease/no-disease examples:
>> b=a(:,:,{'disease','no-disease'})
'medical D/ND' 5762 by 11 sddata, 2 classes: 'disease'(1495) 'no-disease'(4267)
Apart from class labels, this data set contains additional sample properties such as ‘patient’ and ‘tissue’ labels and a ‘pixel’ index:
>> b'
'medical D/ND' 5762 by 11 sddata, 2 classes: 'disease'(1495) 'no-disease'(4267)
sample props: 'lab'->'class' 'class'(L) 'pixel'(N) 'patient'(L) 'tissue'(L)
feature props: 'featlab'->'featname' 'featname'(L)
data props: 'data'(N) 'license'(S)
>> b.patient' % patients present in the data set
 ind name        size percentage
   1 Alex         350 ( 6.1%)
   2 Bob          400 ( 6.9%)
   3 Chris        400 ( 6.9%)
   4 Dick         337 ( 5.8%)
   5 Emily        400 ( 6.9%)
   6 Frank        302 ( 5.2%)
   7 Gabriel      331 ( 5.7%)
   8 Hanz         400 ( 6.9%)
   9 Irene        305 ( 5.3%)
  10 Monica       221 ( 3.8%)
  11 Nick         316 ( 5.5%)
  12 Olaf         400 ( 6.9%)
  13 Paul         400 ( 6.9%)
  14 Rob          400 ( 6.9%)
  15 Steffany     400 ( 6.9%)
  16 Tom          400 ( 6.9%)
Now the promised two-line leave-one-patient-out. First, we define the untrained pipeline we wish to cross-validate (here, a k-NN classifier trained in a PCA-reduced subspace, with the default operating point):
>> p=sdpca([],4)*sdknn*sddecide
untrained pipeline 3 steps: sdpca+sdknn+sdp_decide
Second, a simple loop over the patients:
>> err=[]; for i=1:length(b.patient.list), [ts,tr]=subset(b,'patient',i); pd=tr*p; err(i)=sdtest(ts,pd); end
For each patient in b, we split the data so that the data set ts contains only the samples from this patient and tr contains the rest of the data. Then we train on tr and estimate the error on ts. The vector err will contain the per-patient mean errors:
>> err
err =
Columns 1 through 7
0.4850 0.2230 0.2939 0.3599 0.3426 0.3254 0.1939
Columns 8 through 14
0.2473 0.4206 0.2355 0.4421 0.5048 0.2556 0.3323
Columns 15 through 16
0.2250 0.5700
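The per-patient errors can be condensed into a few summary numbers. The following is a minimal sketch in plain MATLAB (no PRSD Studio calls), using the err values printed above. It assumes that err(i) corresponds to the i-th entry of the patient listing shown earlier. The variable names (names, mean_err, std_err, worst) are introduced here for illustration only.

```matlab
% Per-patient errors copied from the output above
err = [0.4850 0.2230 0.2939 0.3599 0.3426 0.3254 0.1939 0.2473 ...
       0.4206 0.2355 0.4421 0.5048 0.2556 0.3323 0.2250 0.5700];

% Patient names in listing order (assumption: err(i) belongs to the i-th patient)
names = {'Alex','Bob','Chris','Dick','Emily','Frank','Gabriel','Hanz', ...
         'Irene','Monica','Nick','Olaf','Paul','Rob','Steffany','Tom'};

mean_err = mean(err);            % average error over patients
std_err  = std(err);             % spread of the error between patients
[worst_err, worst] = max(err);   % patient with the highest error

fprintf('mean error %.3f, std %.3f over %d patients\n', ...
        mean_err, std_err, numel(err));
fprintf('hardest patient: %s (error %.3f)\n', names{worst}, worst_err);
```

A large spread between patients suggests that a single pooled error figure would hide how differently the classifier behaves on individual patients. Which summary is most useful depends on the application.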
That’s it :-)
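For readers who want to see what the per-patient split amounts to, here is a toy sketch in plain MATLAB. It does not use PRSD Studio and is not the toolbox implementation of subset. It only illustrates the idea of holding out all samples of one patient at a time. The vector patient_idx is hypothetical toy data, not taken from the medical data set.

```matlab
% Toy illustration of a leave-one-patient-out split (plain MATLAB, no toolbox).
% patient_idx is a hypothetical per-sample patient index.
patient_idx = [1 1 2 2 2 3 3 1 3 2]';
n_patients  = max(patient_idx);

n_test = zeros(1, n_patients);
for i = 1:n_patients
    test_mask  = (patient_idx == i);   % samples of the held-out patient
    train_mask = ~test_mask;           % all remaining samples
    n_test(i)  = sum(test_mask);       % size of the test set in this fold

    % The two sets never share a sample and together cover the whole data set
    assert(~any(test_mask & train_mask) && all(test_mask | train_mask));
end
```

Because the split is made on the patient index rather than on individual samples, no patient contributes to both the training and the test set within a fold.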
