22.04.2010  pavel

Protecting clusters from outliers using a reject option

Often, when trying to understand our data with cluster analysis, we wonder where new examples will map. Many cluster analysis techniques, such as k-means or mixtures of Gaussians, allow us to apply the trained clustering to new data because they, in fact, train a classifier.

But these trained clustering models act as discriminants: they assign every new data sample to one of the clusters found. This includes samples very distinct from anything encountered when performing the cluster analysis.
Wouldn't it be great if we could identify samples sticking out from our clustering?

That’s exactly what you may now do using the sdreject command introduced in PRSD Studio 2.1. We may simply add a reject option to a trained clustering model and so protect it from outliers.


Let’s take the fruit data set as an example. We use only the fruit examples (the apple and banana classes) and remove the sample labels to mimic a typical unsupervised problem:

>> a
'Fruit set' 260 by 2 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(60) 
>> b=sddata(+a(:,:,1:2))
200 by 2 sddata, class: 'unknown'

We now cluster the data with a mixture of Gaussians using sdmixture, which automatically estimates the number of clusters. Note the ‘cluster’ option: it means that each individual cluster found forms one of the mixture outputs. Without it, the mixture would return a single output (corresponding to the class ‘unknown’ in our data):
>> p=sdmixture(b,'cluster')
[class 'unknown' initialization: 6 clusters  EM:.............................. 6 comp] 
Mixture of Gaussians pipeline 2x6  6 classes, 6 components (sdp_normal)

To visualize the mixture output (probability density estimated for each cluster), we may use sdscatter:
>> sdscatter(b,p)

Mixture of Gaussians clustering (soft outputs)
By pressing the up/down cursor keys, you may flip between the 6 mixture outputs corresponding to the 6 clusters found.

To return decisions, we need to add an operating point to the mixture model p. We may use the sddecide function, which adds the default operating point that weights each model output equally:

>> pd=sddecide(p)
sequential pipeline     2x1 'Mixture of Gaussians+Decision'
 1  Mixture of Gaussians    2x6  6 classes, 6 components (sdp_normal)
 2  Decision                6x1  weighting, 6 classes, 1 ops at op 1 (sdp_decide)
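
To make the idea of an equal-weight operating point concrete, here is a minimal sketch in plain Python rather than PRSD Studio. It assumes that the decision simply picks the largest weighted model output; the function name decide and all numbers are made up for illustration and are not part of the toolbox.

```python
# Conceptual sketch only: plain Python, not PRSD Studio code.
# Assumption: the decision is the index of the largest weighted output.

def decide(outputs, weights=None):
    """Return the 0-based index of the largest weighted output."""
    if weights is None:
        weights = [1.0] * len(outputs)  # assumed default: equal weights
    weighted = [w * o for w, o in zip(weights, outputs)]
    return max(range(len(weighted)), key=weighted.__getitem__)


# Made-up soft outputs of a 6-cluster model for one sample
sample_outputs = [0.01, 0.20, 0.05, 0.60, 0.10, 0.04]
print("Cluster", decide(sample_outputs) + 1)

# A sample far from all clusters has tiny outputs everywhere,
# yet it is still assigned to whichever cluster is largest.
far_outputs = [1e-30, 5e-30, 2e-30, 1e-31, 1e-32, 1e-33]
print("Cluster", decide(far_outputs) + 1)
```

The second call shows why such a rule never says "none of the above": some output is always the largest, no matter how small all of them are.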

The decisions on our unsupervised data set are:
>> dec=b*pd
sdlab with 200 entries, 6 groups: 
'Cluster 1'(27) 'Cluster 2'(43) 'Cluster 3'(30) 'Cluster 4'(33) 'Cluster 5'(28) 'Cluster 6'(39) 

To quickly visualize the decisions, we may add them directly to the data set b:
>> b.lab=dec
200 by 2 sddata, 6 classes: [27  43  30  33  28  39]
>> sdscatter(b,pd)

clustering decisions
The decision regions of our cluster-based classifier extend to infinity because it is a true discriminant: every point in the feature space, however far from the data, is assigned to some cluster.

Now we may add the reject option using sdreject:

>> pr=sdreject(p,b)
sequential pipeline     2x1 'Mixture of Gaussians+Decision'
 1  Mixture of Gaussians    2x6  6 classes, 6 components (sdp_normal)
 2  Decision                6x1  weight+reject, 7 decisions, 1 ops at op 1 (sdp_decide)

sdreject takes the model p and a data set b. It adds the default operating point to p (just as sddecide did) and sets a rejection threshold such that a pre-specified fraction of the samples in b is rejected. By default, this fraction is 1%. The resulting pipeline pr therefore returns 7 decisions (6 clusters + reject).
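
The thresholding idea can be sketched in a few lines of plain Python. This is not how sdreject is implemented; it is an illustration under the assumption that a sample is rejected when its largest model output falls below a threshold chosen on the training data. The helper names reject_threshold and decide_with_reject and all numbers are hypothetical.

```python
# Conceptual sketch only: plain Python, not PRSD Studio code.
# Assumption: a sample is rejected when its largest model output
# falls below a threshold estimated on the training set.

def reject_threshold(best_outputs, reject_fraction=0.01):
    """Pick a threshold that rejects about reject_fraction of the samples."""
    ranked = sorted(best_outputs)
    n_reject = int(round(reject_fraction * len(ranked)))
    if n_reject <= 0:
        return float("-inf")   # reject nothing
    if n_reject >= len(ranked):
        return float("inf")    # reject everything
    # Halfway between the last rejected and the first accepted value
    return 0.5 * (ranked[n_reject - 1] + ranked[n_reject])


def decide_with_reject(outputs, threshold):
    """Return 'Cluster k' for the largest output, or 'reject' if it is too small."""
    best = max(range(len(outputs)), key=outputs.__getitem__)
    if outputs[best] < threshold:
        return "reject"
    return "Cluster %d" % (best + 1)


# Made-up largest outputs for 200 training samples: 0.005, 0.010, ..., 1.000
train_best = [i / 200 for i in range(1, 201)]
threshold = reject_threshold(train_best, reject_fraction=0.01)
n_rejected = sum(value < threshold for value in train_best)
print(n_rejected, "of", len(train_best), "training samples rejected")

# A typical sample keeps its cluster; a far-away sample is rejected
print(decide_with_reject([0.01, 0.20, 0.05, 0.60, 0.10, 0.04], threshold))
print(decide_with_reject([1e-9, 3e-9, 2e-9, 1e-9, 1e-9, 1e-9], threshold))
```

With 200 training samples and a 1% fraction, the threshold is placed so that the 2 samples with the weakest outputs fall below it. New samples with even weaker outputs are then rejected instead of being forced into a cluster.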

We will visualize the pr decisions on the original labeled data set a:

>> sdscatter(a,pr)

clustering decisions using reject option
We may observe that samples from the ‘stone’ class, which was not present in the data set b used for clustering, are largely rejected.

To quickly see what fraction of data is rejected, use the transpose operator on the decision object:

>> dec=a*pr
sdlab with 260 entries, 7 groups: 
'Cluster 1'(25) 'Cluster 2'(43) 'Cluster 3'(30) 'Cluster 4'(33) 'Cluster 5'(28) 'Cluster 6'(49) 'reject'(52) 
>> dec'
 ind name         size percentage
   1 Cluster 1      25 ( 9.6%)
   2 Cluster 2      43 (16.5%)
   3 Cluster 3      30 (11.5%)
   4 Cluster 4      33 (12.7%)
   5 Cluster 5      28 (10.8%)
   6 Cluster 6      49 (18.8%)
   7 reject         52 (20.0%)
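
The percentages in this listing are simply each group's size divided by the total number of decisions. As a quick check outside the toolbox, the following plain-Python sketch redoes that arithmetic using the counts shown above; the variable names are only illustrative.

```python
# Group sizes copied from the decision summary above
counts = {
    "Cluster 1": 25,
    "Cluster 2": 43,
    "Cluster 3": 30,
    "Cluster 4": 33,
    "Cluster 5": 28,
    "Cluster 6": 49,
    "reject": 52,
}

total = sum(counts.values())  # number of samples in data set a

# Print each group's share of all decisions
for name, size in counts.items():
    print("%-10s %4d (%4.1f%%)" % (name, size, 100.0 * size / total))

# Fraction of all samples that were rejected
reject_fraction = counts["reject"] / total
```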
