18.11.2008  pavel

Fast approximated k-NN classifier

The k-nearest neighbor (k-NN) rule is a robust, data-driven classifier. However, the more training samples it uses, the slower its execution becomes, because distances from each new observation to all stored training examples (prototypes) must be computed.
We have developed an approximated k-NN that computes distances only to potential nearest neighbors, thereby significantly speeding up k-NN execution. Although our strategy of localizing the nearest-neighbor search is similar to the well-known kd-tree approach, it does not employ per-feature splitting but works directly on distances. It therefore scales well to higher dimensionalities, unlike the kd-tree, which becomes impractical for feature spaces with more than about 20 dimensions.
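To make the per-query cost concrete, here is a brute-force k-NN prediction sketched in plain Python. This is only an illustration of the standard textbook algorithm, not the perClass implementation; the function name `knn_predict` and its arguments are invented for this sketch:

```python
import math
from collections import Counter

def knn_predict(prototypes, labels, x, k=10):
    """Brute-force k-NN: compute the distance from x to EVERY stored
    prototype -- exactly the per-query cost the approximation avoids."""
    dists = sorted((math.dist(p, x), lab)
                   for p, lab in zip(prototypes, labels))
    votes = Counter(lab for _, lab in dists[:k])  # labels of the k nearest
    return votes.most_common(1)[0][0]             # majority vote
```

With n prototypes, every call computes n distances; for the 20 000-prototype classifier trained below, that means 20 000 distance evaluations per test sample.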

As an example, let us train a k-NN classifier with k=10 on 20 000 samples. We simply construct the pipeline action sdp_knnmc implementing the k-NN multi-class classifier, providing the desired k, the training samples, numerical labels, and class names:

>> a=sdrelab(gendatb(20000)) % we use sdrelab to ensure string class names
Banana Set, 20000 by 2 dataset with 2 classes: [9942  10058]

>> p=sdp_knnmc(10,+a,getnlab(a),getlablist(a))
sequential pipeline     2x2 ''
 1  sdp_knnmc           2x2  k=10, 20000 prototypes

We will benchmark the execution of this classifier on 100 000 samples:

>> data=rand(100000,2);

>> tic; out1=data*p; toc
Elapsed time is 27.622024 seconds.

The approximated k-NN is created using the sdconvert function:

>> p2=sdconvert(p,'approx')
using C=50 
sequential pipeline     2x2 ''
 1  sdp_knnmc_approx    2x2  

C is a parameter of the approximation algorithm. The higher it is, the faster the classifier we get, but also the less accurate.
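The approximation algorithm itself is not described in this post, so the following is only a hypothetical Python sketch of one possible distance-based localization scheme (pivot buckets searched under a candidate budget). All names here (`build_index`, `approx_knn`, `n_candidates`) are invented; `n_candidates` plays a speed/accuracy trade-off role analogous to, but not the same as, the C parameter above:

```python
import math, random
from collections import Counter

def build_index(prototypes, labels, n_pivots=10, seed=0):
    """Assign each prototype to its nearest pivot (hypothetical scheme)."""
    rng = random.Random(seed)
    pivots = rng.sample(prototypes, n_pivots)
    buckets = [[] for _ in range(n_pivots)]
    for p, lab in zip(prototypes, labels):
        j = min(range(n_pivots), key=lambda i: math.dist(p, pivots[i]))
        buckets[j].append((p, lab))
    return pivots, buckets

def approx_knn(pivots, buckets, x, k=10, n_candidates=50):
    """Visit buckets in order of pivot distance until the candidate
    budget is reached, then run exact k-NN on the candidates only."""
    order = sorted(range(len(pivots)), key=lambda i: math.dist(x, pivots[i]))
    candidates = []
    for j in order:
        candidates.extend(buckets[j])
        if len(candidates) >= n_candidates:
            break
    dists = sorted((math.dist(p, x), lab) for p, lab in candidates)
    return Counter(lab for _, lab in dists[:k]).most_common(1)[0][0]
```

The key point the sketch shares with the real algorithm is that only a subset of prototypes is ever compared against the query, and the size of that subset controls the trade-off between speed and accuracy.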
Execution on our benchmark set:
>> tic; out2=data*p2; toc
Elapsed time is 1.514126 seconds.

We can observe an 18-fold speed-up due to the approximation. Let us compare the classifier outputs in our 2D feature space:
>> b=sdrelab(gendatb(300))
Banana Set, 300 by 2 dataset with 2 classes: [166  134]

>> sdscatter(b,p2)
>> title('approximated k-NN')

>> sdscatter(b,p)
>> title('full k-NN')

Note that the approximated k-NN exhibits more step-like changes in its soft output than the full k-NN.

Let us estimate the confusion matrices to see how the coarser soft outputs impact the accuracy of our classifier. We will use the default 50/50 operating point and a separate test set with 3000 samples:

>> pd=sdp_decide(sdops('w',[0.5 0.5],getlablist(a)))
sequential pipeline     2x1 ''
 1  sdp_decide          2x1  Weight-based decision (2 classes, 1 ops) at op 1

>> b=gendatb(3000)
Banana Set, 3000 by 2 dataset with 2 classes: [1543  1457]
>> confmat(getlab(b),b*p*pd) % decisions of the full k-NN

  True   | Estimated Labels
  Labels |    1      2  | Totals
  1      | 1517     26  | 1543
  2      |   30   1427  | 1457
  Totals | 1547   1453  | 3000

>> confmat(getlab(b),b*p2*pd) % decisions of the approximated k-NN

  True   | Estimated Labels
  Labels |    1      2  | Totals
  1      | 1508     35  | 1543
  2      |   19   1438  | 1457
  Totals | 1527   1473  | 3000

Although we can observe differences between the confusion matrices, the error percentages remain very similar. Of course, cross-validation and proper statistical testing are needed to decide whether the approximated k-NN performs worse than the full one.
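The error rates implied by the two confusion matrices can be checked with a few lines of arithmetic (plain Python, off-diagonal counts copied from the matrices above):

```python
n = 3000                      # test set size

err_full   = (26 + 30) / n    # full k-NN misclassifications
err_approx = (35 + 19) / n    # approximated k-NN misclassifications

print(f"full:   {100 * err_full:.2f} %")    # 1.87 %
print(f"approx: {100 * err_approx:.2f} %")  # 1.80 %
```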

Stay tuned for more examples as our approximation algorithm offers plenty of room for improvement!





