Automatic estimation of number of components in Gaussian mixtures
We’re happy to share a new addition to PRSD Studio that greatly simplifies construction of more sophisticated classifiers.
Gaussian mixture models are often used to design classifiers for multi-modal problems and for clustering. Given enough training examples, a mixture model can describe an arbitrary data distribution. Furthermore, execution is usually orders of magnitude faster than with non-parametric approaches such as the k-NN or Parzen classifiers, which is very important in on-line industrial applications.
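For reference, the density modelled by a Gaussian mixture is usually written as follows. The symbols (K components with weights \pi_k, means \mu_k and covariances \Sigma_k) are standard textbook notation and are not taken from the PRSD Studio documentation; K is the number of components discussed in the rest of this post.

```latex
p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1
```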
To use mixture models effectively in practice, one normally has to provide the number of mixture components as an input parameter. sdmixture can now estimate the number of components robustly using a non-parametric density estimation approach.
The algorithm starts from a non-parametric (Parzen) density estimate and uses the EM algorithm to identify the most salient structures in the data, which are then used to initialize the EM algorithm of the mixture model. The approach is based on the paper:
J. Grim, J. Novovicova, P. Pudil, P. Somol, F.J. Ferri, Initializing Normal Mixtures of Densities, Proc. of ICPR 1998.
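The idea of starting from a Parzen estimate and letting EM pick out the salient structures can be illustrated with a small Python sketch. This is a simplified assumption of how such a step might look, not the implementation in sdmixture and not a faithful reproduction of the paper: it works in 1-D, keeps one fixed Gaussian kernel per sample, updates only the kernel weights with EM, and then prunes kernels whose weight has shrunk. The function names, the bandwidth h and the pruning fraction keep_frac are all hypothetical.

```python
import math

def gauss(x, mu, h):
    """1-D Gaussian kernel centred at mu with standard deviation h."""
    z = (x - mu) / h
    return math.exp(-0.5 * z * z) / (h * math.sqrt(2.0 * math.pi))

def estimate_components(data, h, iters=200, keep_frac=0.1):
    """Weights-only EM on a Parzen estimate, followed by pruning.

    Assumption: kernel centres and bandwidth stay fixed; only the
    mixing weights are re-estimated.
    """
    n = len(data)
    # k[i][j] = value of kernel j (centred at data[j]) at sample data[i]
    k = [[gauss(xi, xj, h) for xj in data] for xi in data]
    w = [1.0 / n] * n  # uniform weights reproduce the Parzen estimate
    for _ in range(iters):
        acc = [0.0] * n
        for i in range(n):
            # E-step: responsibility of each kernel for sample i
            total = sum(w[j] * k[i][j] for j in range(n))
            for j in range(n):
                acc[j] += w[j] * k[i][j] / total
        # M-step: new weight is the average responsibility
        w = [a / n for a in acc]
    # Keep kernels whose weight is still above a fraction of the uniform weight
    kept = [(data[j], w[j]) for j in range(n) if w[j] > keep_frac / n]
    return w, kept

# Hypothetical toy data: two well-separated groups of samples
data = [-0.6, -0.3, 0.0, 0.3, 0.6, 9.4, 9.7, 10.0, 10.3, 10.6]
weights, kept = estimate_components(data, h=1.0)
print(len(kept), "of", len(data), "kernels survive pruning")
```

In this sketch the surviving kernels would serve as candidate starting points for a full mixture EM that also updates means and covariances. How many kernels survive depends on the bandwidth, the number of iterations and the pruning fraction, so a practical method needs a principled way to choose them.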
From the user's perspective, there is now one parameter fewer to set when training sdmixture:
>> a=gendatm(1000)
1000 by 2 dataset with 8 classes: [125 125 120 123 119 141 137 110]
>> pm=sdmixture(a)
[class 'a' initialization:.............. 1 cluster EM:..............................]
[class 'b' initialization:............ 3 clusters EM:..............................]
[class 'c' initialization:............ 1 cluster EM:..............................]
[class 'd' initialization:............. 1 cluster EM:..............................]
[class 'e' initialization:............ 3 clusters EM:..............................]
[class 'f' initialization:............. 4 clusters EM:..............................]
[class 'g' initialization:............. 2 clusters EM:..............................]
[class 'h' initialization:............ 3 clusters EM:..............................]
sequential pipeline 2x8 ''
 1 sdp_normal 2x8 8 classes, 18 components
>> sdscatter(a,[pm sdops(pm)])
It also helps us design powerful detectors for multi-modal problems:
>> b=gendatf(1000)
Fruit set, 1000 by 2 dataset with 3 classes: [333 333 334]
% let's build a detector for all the data (use class 'all' not present in lablist):
>> pd=sddetector(b,'all',sdmixture,'reject',0.02)
[class 'all' initialization:................. 4 clusters EM:......................]
sequential pipeline 2x1 ''
 1 sdp_normal 2x1 one class, 4 components
 2 sdp_decide 1x1 Threshold-based decision on all at op 1
>> sdscatter(b,pd)
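The threshold-based decision in the detector above can be pictured with a short Python sketch. It assumes that the 'reject' value of 0.02 is the fraction of training samples allowed to fall below the threshold on the mixture output; this reading, the function name reject_threshold and the density values are assumptions made for illustration, not the toolbox's code.

```python
def reject_threshold(densities, reject=0.02):
    """Pick a threshold so that about a `reject` fraction of the
    training densities falls strictly below it (assumed meaning)."""
    ordered = sorted(densities)
    # Number of training samples we allow the detector to reject
    n_reject = min(round(reject * len(ordered)), len(ordered) - 1)
    return ordered[n_reject]

def accept(density, threshold):
    """Threshold-based decision: accept when the model output is high enough."""
    return density >= threshold

# Hypothetical model outputs for 100 training samples
train_density = [i / 100.0 for i in range(1, 101)]
threshold = reject_threshold(train_density, reject=0.02)
rejected = [d for d in train_density if not accept(d, threshold)]
print(threshold, len(rejected))
```

A new sample would then be scored by the mixture model and passed through accept with the stored threshold.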
