perClass Documentation
version 5.1 (31-May-2017)

Chapter 18: Cluster analysis

This chapter describes how to identify clusters in the data using unsupervised cluster analysis methods.

Table of contents

18.1. Introduction ↩

Cluster analysis, also called unsupervised learning, groups similar observations into clusters. In this way, we may learn what typical situations occur in our data set. Cluster analysis aims at answering questions such as:

There are two typical uses of cluster analysis. The first one aims at interpretation of clustering result. For example, we cluster an image (set of pixels in RGB feature space) and obtain a set of 10 clusters that represent different image regions. We then interpret (give meaning) to some of these clusters. In this way, we identify that some clusters represent sky, buildings or trees. With these new labels, we may train a specific classifiers applicable to new images.

Secondly, cluster analysis is often adopted as a tool for building a flexible data representation in complex supervised problems. The resulting set of clusters is used, for example, as prototypes for building better classifiers in multi-modal problems.

With perClass, you may quickly cluster even very large data sets, easily interpret the results, and directly leverage this information for building your classifiers.

18.1.1. Example of using cluster analysis to build a pixel classifier ↩

Say we want to detect road in images. However, all we have are images without road labels. Instead of painting the road region by hand, we will use clustering to define class labels.

First, we will load the image:

>> im=sdimage('roadsign18.bmp','sddata')
412160 by 3 sddata, class: 'unknown'

and visualize it:

>> sdimage(im)

Image without labels

The image shows a scene captured from a car. Note that we started from an image without any label information. Therefore, the image data set contains only the 'unknown' class.

We will now use cluster analysis to find clusters in this data set. We use the high-level sdcluster command on the im data set and specify the clustering algorithm (sdkmeans for k-means) and number of clusters (5):

>> a=sdcluster(im,sdkmeans,5)
[class 'unknown' 5 clusters] 
412160 by 3 sddata, 5 'cluster' groups: [42222   53456   47132   51678  217672]

sdcluster returns the data set a with new 'cluster' labels. We may visualize this image:

>> sdimage(a)

Image with labels defined by clustering

By hovering over the image, we may see that clusters C1 and C3 correspond mostly to the road, clusters C2 and C4 to grass and the cluster C5 to sky. Assigning meaningful names to the clusters is the interpretation step. We may rename the cluster labels directly in the sdimage figure with the 'rename class' command in the right-click menu or use the sdrelab function to quickly assign descriptive names:

>> b=sdrelab(a,{[1 3],'road',[2 4],'grass',5,'sky'})
  1: C1     -> road
  2: C2     -> grass
  3: C3     -> road
  4: C4     -> grass
  5: C5     -> sky  
412160 by 3 sddata, 3 'cluster' groups: 'road'(89354) 'grass'(105134) 'sky'(217672) 

We may now build the road detector on our three-class problem. We will model the road class with a Gaussian model and use the other two classes to fix the detector threshold by minimizing the error:

>> pd=sddetect(b,'road',sdgauss)
  1: road   -> road    
  2: grass  -> non-road
  3: sky    -> non-road
sequential pipeline     3x1 'Gaussian model+Decision'
 1  Gaussian model          3x1  one class, 1 component
 2  Decision                1x1  thresholding ROC on road at op 1367

The detector may be applied to a new image:

>> im2=sdimage('roadsign11.bmp','sddata')
412160 by 3 sddata, class: 'unknown'

>> c=im2.*pd
412160 by 3 sddata, 2 classes: 'road'(96068) 'non-road'(316092) 

The .* operator is a shorthand for execution of pipeline pd returning decisions and setting these decisions back in the data set.

>> sdimage(c)

Applying road detector on new image

In this example, we used cluster analysis to define meaningful class labels used for training a supervised classifier.

18.2. Clustering with sdcluster command ↩

sdcluster is a high-level command that allows us to perform cluster analysis on a data set with different clustering algorithms. It accepts a data set, untrained pipeline of the clustering algorithm and number of clusters. Optionally, we may provide extra options.

In 3.0.0 release, sdcluster supports only sdkmeans and sdkcentres clustering algorithms.

Consistently with all other functions in perClass toolbox, sdcluster executes the clustering algorithm on each class of the current set of labels:

>> a
'Fruit set' 260 by 2 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(60) 

>> b=sdcluster(a,sdkmeans,2)
[class 'apple' 2 clusters] [class 'banana' 2 clusters] [class 'stone' 2 clusters] 
'Fruit set' 260 by 2 sddata, 6 'cluster' groups: [61  38  60  48  20  33]

The result is a data set b with cluster labels. Note that sdcluster does not overwrite the original labels but adds a new 'cluster' property, setting is as current:

>> b'
'Fruit set' 260 by 2 sddata, 6 'cluster' groups: [61  38  60  48  20  33]
sample props: 'lab'->'cluster' 'class'(L) 'ident'(N) 'cluster'(L)
feature props: 'featlab'->'featname' 'featname'(L)
data props:  'data'(N)

The lab field points to the new 'cluster' labels. The class field contains the original class labels:

>> b.lab
sdlab with 260 entries, 6 groups: 
'C1'(61) 'C2'(38) 'C3'(60) 'C4'(48) 'C5'(20) 'C6'(33) 
>> b.class
sdlab with 260 entries, 3 groups: 'apple'(100) 'banana'(100) 'stone'(60) 

In this way, we may quickly assess the mapping between the classes and clusters:

>> sdconfmat(b.class,b.lab)

ans =

 True      | Decisions
 Labels    |      C1      C2      C3      C4      C5      C6  | Totals
-----------------------------------------------------------------------
 apple     |     60      32       8       0       0       0   |    100
 banana    |      1       6      43      48       0       2   |    100
 stone     |      0       0       9       0      20      31   |     60
-----------------------------------------------------------------------
 Totals    |     61      38      60      48      20      33   |    260

18.2.1. Clustering all data irrespective of classes ↩

If we wish to cluster the total data set, we may simply set a single class to all samples:

>> c=a
'Fruit set' 260 by 2 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(60) 
>> c.lab='unknown'
'Fruit set' 260 by 2 sddata, class: 'unknown'
>> d=sdcluster(c,sdkmeans,2)
[class 'unknown' 2 clusters] 
'Fruit set' 260 by 2 sddata, 2 'cluster' groups: 'C1'(102) 'C2'(158) 

However, sdcluster provides a faster way with the 'all' option which executes clustering on all data:

>> a
'Fruit set' 260 by 2 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(60) 
>> d2=sdcluster(a,sdkmeans,2,'all')
[class 'Unknown' 2 clusters] 
'Fruit set' 260 by 2 sddata, 2 'cluster' groups: 'C1'(150) 'C2'(110) 

Original labels are still available:

>> d2.class
sdlab with 260 entries, 3 groups: 'apple'(100) 'banana'(100) 'stone'(60) 

18.2.2. How to cluster already clustered data set? ↩

Say we would like to cluster a data set d2 from the previous example again, now with a different number of clusters:

>> d3=sdcluster(d2,sdkmeans,3)
{??? Error using ==> sdcluster at 59
property 'cluster' is already present in the data set. Choose a different
property for storing clustering result using sdcluster 'lab' option.
} 

sdcluster will throw an error because the 'cluster' property is already present in d2. perClass will never remove a set of labels you created in the data set. We may either remove the 'cluster' property manually or simply specify a new name for the property holding clustering result with 'lab' option. Here we store new clustering result in 'newcluster':

>> d3=sdcluster(d2,sdkmeans,3,'lab','newcluster')
[class 'C1' 3 clusters] [class 'C2' 3 clusters] 
'Fruit set' 260 by 2 sddata, 6 'newcluster' groups: [42  45  63  39  37  34]

The data set d3 still contains the 'cluster' and 'class' label sets:

>> d3'
'Fruit set' 260 by 2 sddata, 6 'newcluster' groups: [42  45  63  39  37  34]
sample props: 'lab'->'newcluster' 'class'(L) 'ident'(N) 'cluster'(L) 'newcluster'(L)
feature props: 'featlab'->'featname' 'featname'(L)
data props:  'data'(N)

Because sdcluster always adds clustering outcome as new labels, we may easily compare two clustering results:

>> sdconfmat(d3.cluster,d3.newcluster)

ans =

 True      | Decisions
 Labels    |      C1      C2      C3      C4      C5      C6  | Totals
-----------------------------------------------------------------------
 C1        |     42      45      60       2       0       1   |    150
 C2        |      0       0       3      37      37      33   |    110
-----------------------------------------------------------------------
 Totals    |     42      45      63      39      37      34   |    260

Note that 'C1' and 'C2' names in the rows refer to different clusters than the 'C1' and 'C2' in the columns. Next section shows how to avoid such confusion.

18.2.3. Changing default cluster names ↩

By default, sdcluster uses 'C' as a prefix for cluster names. We may change it using the 'prefix' option:

>> d4=sdcluster(d2,sdkmeans,3,'lab','newcluster','prefix','D')
[class 'C1' 3 clusters] [class 'C2' 3 clusters] 
'Fruit set' 260 by 2 sddata, 6 'newcluster' groups: [57  47  42  34  39  41]
>> sdconfmat(d3.cluster,d4.newcluster)

ans =

 True      | Decisions
 Labels    |      D1      D2      D3      D4      D5      D6  | Totals
-----------------------------------------------------------------------
 C1        |     56      46      42       0       6       0   |    150
 C2        |      1       1       0      34      33      41   |    110
-----------------------------------------------------------------------
 Totals    |     57      47      42      34      39      41   |    260

We may now easily distinguish the clusters in 'cluster' and in 'newcluster' properties.

18.2.4. Removing cluster labels ↩

If you wish to manually remove the 'cluster' property in d2 set from the example above, note that sdcluster set it as current labels. Therefore, we first need to set a different property (label set) as current and only then remove the 'cluster' property:

>> dd=setlab(d2,'class')
'Fruit set' 260 by 2 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(60) 
>> dd=rmprop(dd,'cluster')
'Fruit set' 260 by 2 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(60) 

The set dd may be now clustered with sdcluster default setting which will add 'cluster' property.

18.3. Clustering algorithms ↩

This section describes low-level clustering routines available in perClass. These routines may be used directly, without sdcluster. They provide a trained model (pipeline) which still needs to be applied to data to get clustering decisions.

18.3.1. k-means algorithm ↩

The k-means clustering algorithm minimizes within-cluster variability. It requires the parameter k denoting the number of desired clusters. Starting from a set of k randomly selected examples as cluster prototypes, it iteratively defines partitioning of the data set and shifts these prototypes to means of the new partitions. Note, that some of the initial k prototypes may disappear in the optimization process. Therefore, the resulting number of clusters may be lower.

perClass provides high-speed sdkmeans implementation which is scalable to very large data sets. By default, sdkmeans works as a classifier. If we train it on a data set, it describes each class by a set of k prototypes and return 1-NN classifier.

>> a
'Fruit set' 260 by 2 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(60) 
>> p=sdkmeans(a,10)
[class 'apple' 3 clusters pruned,  7 clusters] 
[class 'banana' 5 clusters pruned,  5 clusters] 
[class 'stone' 3 clusters pruned,  7 clusters] 
1-NN (k-means prototypes) pipeline 2x3  3 classes, 19 prototypes

Note that it provides three outputs, each for one of the classes in a.

To use sdkmeans for clustering, we need to add the 'cluster' option:

>> p=sdkmeans(a,10,'cluster')
[class 'apple' 10 clusters] [class 'banana' 10 clusters] [class 'stone' 10 clusters] 
1-NN (k-means prototypes) pipeline 2x30  30 classes, 30 prototypes

As a result, the pipeline returned has 30 outputs, each returning a distance to one of the 30 prototypes trained. Note that clustering is performed per class - use 'all' option to cluster all data irrespective of classes:

>> p=sdkmeans(a,10,'cluster','all')
[class 'Unknown' 10 clusters] 
1-NN (k-means prototypes) pipeline 2x10  10 classes, 10 prototypes

18.3.1.1. Obtaining cluster labels with sdkmeans ↩

To get cluster labels, we need to make p returning decisions with sddecide and apply the resulting pipeline to the original data:

>> pd=sddecide(p)
sequential pipeline     2x1 '1-NN (k-means prototypes)+Decision'
 1  1-NN (k-means prototypes)    2x10 10 classes, 10 prototypes
 2  Decision               10x1  weighting, 10 classes, 1 ops at op 1

>> dec=a*pd
sdlab with 260 entries, 10 groups

>> dec.list
sdlist (10 entries)
 ind name
   1 C 1
   2 C 2
   3 C 3
   4 C 4
   5 C 5
   6 C 6
   7 C 7
   8 C 8
   9 C 9
  10 C10

In order to get a data set with decisions as new labels dec in one step, use the .* operator:

>> b=a.*pd
'Fruit set' 260 by 2 sddata, 10 classes: [25  37  29  28   9  35  17  27  23  30]

This is roughly analogous to the use of sdcluster:

>> c=sdcluster(a,sdkmeans,10,'all')
[class 'Unknown' 10 clusters] 
'Fruit set' 260 by 2 sddata, 10 'cluster' groups: [22  23  34  24  26  20  38  20  25  28]

The only difference of the later approach is that sdcluster included new 'cluster' labels while the data set b got the decisions set into the current label set referenced via b.lab.

18.3.1.2. Accessing prototypes derived by sdkmeans ↩

The set of prototypes, derived by sdkmeans is stored in the pipeline p:

>> p
1-NN (k-means prototypes) pipeline 2x10  10 classes, 10 prototypes
>> p{1}

ans = 

proto: [10x2x10 sddata]

>> proto=p{1}.proto
10 by 2 sddata, 10 classes: [1  1  1  1  1  1  1  1  1  1]

>> proto'
10 by 2 sddata, 10 classes: [1  1  1  1  1  1  1  1  1  1]
sample props: 'lab'->'class' 'class'(L)
feature props: 'featlab'->'featname' 'featname'(L)
data props:  'data'(N)

Because k-means algorithm derives (extracts) new prototypes by averaging original observations, prototypes do not contain any oroginal properties set in the clustered data set. Use k-centers algorithm if you wish to preserve prototype sample properties.

18.3.2. k-centers algorithm ↩

k-centres algorithm selects k prototypes from the original data set by in such a way that the maximum distance within each cluster is minimized. The main difference with respect to the k-means algorithm is that k-centres prototypes are observations existing in the original data, while k-means exctracted new "virtual" prototypes by averaging. Use k-centres if you wish to find typical observations and use further their original properties.

In this example, we set a unique sample id to each sample:

>> a
'Fruit set' 260 by 2 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(60) 
>> a.id=1:length(a)
'Fruit set' 260 by 2 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(60) 

>> a(127).id

ans =

   127

Now, we cluster the data set with sdkcentres asking for 10 clusters:

>> p=sdkcentres(a,10,'cluster','all')
[class 'unknown' 10 comp] 
1-NN (k-centers prototypes) pipeline 2x10  10 classes, 10 prototypes

The derived prototypes contain the id property:

>> p{1}.proto'
'Fruit set' 10 by 2 sddata, 10 'cluster' groups: [1  1  1  1  1  1  1  1  1  1]
sample props: 'lab'->'cluster' 'class'(L) 'ident'(N) 'id'(N) 'cluster'(L)
feature props: 'featlab'->'featname' 'featname'(L)
data props:  'data'(N)

So we may easily identify what samples were selected:

>> p{1}.proto.id

ans =

     8
   235
   246
   185
    66
   207
    62
    71
   174
   117

18.3.3. Gaussian mixture model ↩

Gaussian mixture model available through sdmixture function may be used for data clustering. It has one substantial advantage with respect to k-means and k-centres algorithms: sdmixture may automatically select the number of clusters in the data.

By default, sdmixture operates per class and returns per-class output:

>> p=sdmixture(a)
[class 'apple' initialization: 4 clusters  EM:.............................. 4 comp] 
[class 'banana' initialization: 4 clusters  EM:.............................. 4 comp] 
[class 'stone' initialization: 1 cluster  EM:.............................. 1 comp] 
Mixture of Gaussians pipeline 2x3  3 classes, 9 components

We received three outputs for the three classes in a.

sdmixture with 'cluster' option will return one outputfor each of the mixture components (clusters) found. The operation is sill on the per-class basis:

>> p=sdmixture(a,'cluster')
[class 'apple' initialization: 4 clusters  EM:.............................. 4 comp]
[class 'banana' initialization: 4 clusters  EM:.............................. 4 comp] 
[class 'stone' initialization: 1 cluster  EM:.............................. 1 comp] 
Mixture of Gaussians pipeline 2x9  9 classes, 9 components

If we wish to cluster the whole data set, we may set a single class:

>> a
'Fruit set' 260 by 2 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(60) 
>> a.lab='unknown'
'Fruit set' 260 by 2 sddata, class: 'unknown'

>> p=sdmixture(a,'cluster')
[class 'unknown' initialization: 3 clusters  EM:.............................. 3 comp] 
Mixture of Gaussians pipeline 2x3  3 classes, 3 components

>> p.lab
sdlab with 3 entries: 'Cluster 1','Cluster 2','Cluster 3'

Assigning data to clusters:

>> pd=sddecide(p)
sequential pipeline     2x1 'Mixture of Gaussians+Decision'
 1  Mixture of Gaussians    2x3  3 classes, 3 components
 2  Decision                3x1  weighting, 3 classes

>> b=a.*pd
'Fruit set' 260 by 2 sddata, 3 classes: 'Cluster 1'(72) 'Cluster 2'(80) 'Cluster 3'(108) 

We may visualize the clustering results in the feature space:

>> sdscatter(b,pd)

Gaussian mixture model automatically estimating number of clusters

Using pd, any 2D feature vector will be assigned into one of the three clusters. Sometimes, we may wish to distinguish the feature space domain observed when performing the clustering from far away outliers or previously unseen clusters. We may do this by adding reject option to the pd pipeline:

>> pr=~sdreject`(pd,b)
Weight-based operating point,3 classes,[0.33,0.33,...]
sequential pipeline     2x1 'Mixture of Gaussians+Decision'
 1  Mixture of Gaussians    2x3  3 classes, 3 components (sdp_normal)
 2  Decision                3x1  weight+reject, 4 decisions, ROC 1 ops at op 1 (sdp_decide)

>> sdscatter(b,pr)

>> pr.list
sdlist (4 entries)
 ind name
   1 Cluster 1
   2 Cluster 2
   3 Cluster 3
   4 reject   

You may find more detailed discussion on this technique in this blog entry

Gaussian mixture model estimating number of components, protection from outliers