13.05.2009  pavel

Automatic estimation of number of components in Gaussian mixtures

We’re happy to share a new addition to PRSD Studio that greatly simplifies construction of more sophisticated classifiers.
Gaussian mixture models are often used to design classifiers for multi-modal problems and for clustering. Given enough training examples, a mixture model can describe an arbitrary data distribution. Furthermore, mixture models usually execute orders of magnitude faster than non-parametric approaches such as k-NN or the Parzen classifier, which is very important in on-line industrial applications.
In order to effectively use mixture models in practice, one needs to provide the number of mixture components as an input parameter. sdmixture can now estimate the number of components robustly using a non-parametric density estimation approach.


The algorithm starts from a non-parametric (Parzen) density estimate and uses the EM algorithm to identify the most salient structures in the data, which are then used to initialize the EM algorithm of the mixture model. The approach is based on the paper:
J. Grim, J. Novovicova, P. Pudil, P. Somol, F.J. Ferri, Initializing Normal Mixtures of Densities, Proc. of ICPR 1998.
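
To give a feel for the idea, here is a minimal one-dimensional sketch in plain Python. It is not PRSD Studio code and not the exact algorithm from the paper: it only illustrates EM re-estimation of the weights of a Parzen estimate whose kernels stay fixed. The function names, the kernel width h, the pruning rule (keep_ratio) and the final grouping of surviving kernels (gap) are all assumptions made for this illustration.

```python
import math

def gauss_kernel(x, center, h):
    """1-D Gaussian kernel of width h centred on `center`."""
    z = (x - center) / h
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * h)

def estimate_components(xs, h=1.0, n_iter=200, keep_ratio=0.1, gap=3.0):
    """Toy estimate of the number of components in 1-D data (illustration only)."""
    n = len(xs)
    # Parzen estimate: one kernel per training sample, all weights equal.
    w = [1.0 / n] * n
    # The kernels never move, so their values can be computed once:
    # k[i][j] is the kernel centred on xs[j], evaluated at xs[i].
    k = [[gauss_kernel(xi, xj, h) for xj in xs] for xi in xs]
    for _ in range(n_iter):
        new_w = [0.0] * n
        for i in range(n):
            denom = sum(w[j] * k[i][j] for j in range(n))
            for j in range(n):
                # Responsibility of kernel j for sample i (E-step).
                new_w[j] += w[j] * k[i][j] / denom
        # M-step for the weights only.
        w = [v / n for v in new_w]
    # Assumption: a kernel "survives" if it keeps at least
    # keep_ratio times its initial weight of 1/n.
    kept = sorted(xs[j] for j in range(n) if w[j] > keep_ratio / n)
    # Assumption: survivors closer than gap * h belong to one component.
    groups = [[kept[0]]]
    for c in kept[1:]:
        if c - groups[-1][-1] > gap * h:
            groups.append([c])
        else:
            groups[-1].append(c)
    return w, groups

# Two well-separated clumps of points.
xs = [-0.8, -0.3, 0.0, 0.2, 0.7, 9.4, 9.9, 10.0, 10.3, 10.6]
weights, groups = estimate_components(xs)
print(len(groups), "components suggested")
```

For this toy data set the sketch suggests two components, one per clump. In a real mixture trainer the surviving structures would then seed the means and covariances of the mixture before the usual EM training.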

From the user's perspective, there is now one less parameter to set when training sdmixture:

>> a=gendatm(1000)
1000 by 2 dataset with 8 classes: [125  125  120  123  119  141  137  110]
>> pm=sdmixture(a)
[class 'a' initialization:.............. 1 cluster  EM:..............................]
[class 'b' initialization:............ 3 clusters  EM:..............................]
[class 'c' initialization:............ 1 cluster  EM:..............................]
[class 'd' initialization:............. 1 cluster  EM:..............................]
[class 'e' initialization:............ 3 clusters  EM:..............................]
[class 'f' initialization:............. 4 clusters  EM:..............................]
[class 'g' initialization:............. 2 clusters  EM:..............................]
[class 'h' initialization:............ 3 clusters  EM:..............................] 
sequential pipeline     2x8 ''
 1  sdp_normal          2x8  8 classes, 18 components

>> sdscatter(a,[pm sdops(pm)])

Gaussian mixture model classifier with automatically estimated number of components

It also helps us to design powerful detectors describing multi-modal problems:

>> b=gendatf(1000)
Fruit set, 1000 by 2 dataset with 3 classes: [333  333  334]
% let's build a detector for all the data (use class 'all', which is not present in lablist):
>> pd=sddetector(b,'all',sdmixture,'reject',0.02)
[class 'all' initialization:................. 4 clusters  EM:......................] 
sequential pipeline     2x1 ''
 1  sdp_normal          2x1  one class, 4 components
 2  sdp_decide          1x1  Threshold-based decision on all at op 1
>> sdscatter(b,pd)

Gaussian mixture model detector with automatically estimated number of components
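
The threshold-based decision in step 2 of the pipeline above can be pictured with a small sketch in plain Python. Again, this is not PRSD Studio code. It assumes that the 'reject' option means the fraction of training examples allowed to fall below the density threshold, and it uses a hand-written one-dimensional mixture with made-up parameters. All function names are hypothetical.

```python
import math

def mixture_density(x, weights, means, sigmas):
    """Density of a 1-D Gaussian mixture at point x."""
    total = 0.0
    for w, m, s in zip(weights, means, sigmas):
        z = (x - m) / s
        total += w * math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * s)
    return total

def fit_threshold(train, reject, weights, means, sigmas):
    """Pick a density threshold so that about `reject` of `train` falls below it."""
    dens = sorted(mixture_density(x, weights, means, sigmas) for x in train)
    # Index of the first density that is still accepted (assumed convention).
    k = min(round(reject * len(dens)), len(dens) - 1)
    return dens[k]

def accept(x, threshold, weights, means, sigmas):
    """True if x looks like the training data, False if it is rejected."""
    return mixture_density(x, weights, means, sigmas) >= threshold

# Made-up two-component model and training data, for illustration only.
weights, means, sigmas = [0.5, 0.5], [0.0, 5.0], [1.0, 1.0]
train = [-1.0 + 0.04 * i for i in range(50)] + [4.0 + 0.04 * i for i in range(50)]

threshold = fit_threshold(train, 0.02, weights, means, sigmas)
n_rejected = sum(not accept(x, threshold, weights, means, sigmas) for x in train)
print("rejected training samples:", n_rejected)
print("far-away point accepted?", accept(100.0, threshold, weights, means, sigmas))
```

Points in low-density regions, such as the far-away point in the last line, fall below the threshold and are rejected, while the bulk of the training data is accepted.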
