### Fast approximated k-NN classifier

The k-nearest neighbor (k-NN) rule is a robust, data-driven classifier. However, the more training samples it stores, the slower its execution becomes. This is because distances from each new observation to all stored training examples (prototypes) must be computed.
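To see where the cost comes from, here is a minimal brute-force k-NN sketch in NumPy (an illustration, not the toolbox code): every query requires a distance to each of the N prototypes, so execution time grows linearly with the training set size.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=10):
    """Brute-force k-NN: O(N) distance computations per query."""
    preds = []
    for x in X_query:
        # distances from the query to ALL stored prototypes
        d = np.linalg.norm(X_train - x, axis=1)
        idx = np.argsort(d)[:k]           # indices of the k nearest prototypes
        votes = np.bincount(y_train[idx]) # count class labels among them
        preds.append(np.argmax(votes))    # majority vote
    return np.array(preds)

# tiny example: two well-separated clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(3, 0.1, (50, 2))])
y = np.array([0]*50 + [1]*50)
print(knn_predict(X, y, np.array([[0.0, 0.0], [3.0, 3.0]]), k=5))  # → [0 1]
```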

We have developed an approximated k-NN that computes distances only to potential nearest neighbors, significantly speeding up k-NN execution. Although our strategy of localizing the nearest-neighbor search is similar to the well-known kd-tree approach, it does not employ per-feature splitting but works directly on distances. It therefore scales well to higher dimensionalities, unlike the kd-tree, which becomes impractical for feature spaces beyond roughly 20 dimensions.
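The exact algorithm inside perClass is not described here. As a generic illustration of the idea of pruning a nearest-neighbor search by distances alone (rather than per-feature splits), the sketch below groups prototypes around pivot points and examines only the clusters whose pivots lie closest to the query. The class name and the `n_probe` parameter are our own, hypothetical choices.

```python
import numpy as np

class ApproxNN:
    """Illustrative distance-based pruning: group prototypes around C pivots
    and search only the clusters whose pivots are closest to the query.
    A generic approximate-NN sketch, NOT perClass's actual algorithm."""

    def __init__(self, X, C=50, n_probe=3, seed=0):
        rng = np.random.default_rng(seed)
        self.pivots = X[rng.choice(len(X), C, replace=False)]
        # assign every prototype to its nearest pivot
        assign = np.argmin(
            np.linalg.norm(X[:, None, :] - self.pivots[None], axis=2), axis=1)
        self.buckets = [np.flatnonzero(assign == c) for c in range(C)]
        self.X = X
        self.n_probe = n_probe  # how many closest clusters to examine

    def query(self, x):
        d_piv = np.linalg.norm(self.pivots - x, axis=1)
        # candidate prototypes: only those in the n_probe closest clusters
        cand = np.concatenate(
            [self.buckets[c] for c in np.argsort(d_piv)[:self.n_probe]])
        d = np.linalg.norm(self.X[cand] - x, axis=1)
        return cand[np.argmin(d)]  # index of the (approximate) nearest prototype
```

With `n_probe` equal to `C` every cluster is searched and the result matches the exact nearest neighbor; smaller values trade accuracy for speed, mirroring the role a coarseness parameter plays in any such scheme.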

As an example, let us train a k-NN classifier with k=10 on 20 000 samples. We simply construct the pipeline action `sdp_knnmc` implementing the k-NN multi-class classifier, providing the desired k, the training samples, numerical labels and class names:

```
>> a=sdrelab(gendatb(20000)) % we use sdrelab to ensure string class names
Banana Set, 20000 by 2 dataset with 2 classes: [9942 10058]
>> p=sdp_knnmc(10,+a,getnlab(a),getlablist(a))
sequential pipeline     2x2 ''
 1  sdp_knnmc           2x2  k=10, 20000 prototypes
```

We will benchmark the execution of this classifier on 100 000 samples:

```
>> data=rand(100000,2);
>> tic; out1=data*p; toc
Elapsed time is 27.622024 seconds.
```

The approximated k-NN is created using the `sdconvert` function:

```
>> p2=sdconvert(p,'approx')
using C=50
sequential pipeline     2x2 ''
 1  sdp_knnmc_approx    2x2
```

C is a parameter of the approximation algorithm: the higher its value, the faster the resulting classifier, but also the less accurate.

Execution on our benchmark set:

```
>> tic; out2=data*p2; toc
Elapsed time is 1.514126 seconds.
```

We can observe an **18-fold speed-up** due to the approximation. Let us compare the classifier outputs in our 2D feature space:

```
>> b=sdrelab(gendatb(300))
Banana Set, 300 by 2 dataset with 2 classes: [166 134]
>> sdscatter(b,p2)
>> title('approximated k-NN')
>> sdscatter(b,p)
>> title('full k-NN')
```

Note that the approximated k-NN exhibits more step-like changes of its soft output than the full k-NN.

Let us estimate the confusion matrices to see how the coarser soft outputs impact the accuracy of our classifier. We will use the default 50/50 operating point and a separate test set of 3000 samples:

```
>> pd=sdp_decide(sdops('w',[0.5 0.5],getlablist(a)))
sequential pipeline     2x1 ''
 1  sdp_decide          2x1  Weight-based decision (2 classes, 1 ops) at op 1
>> b=gendatb(3000)
Banana Set, 3000 by 2 dataset with 2 classes: [1543 1457]
>> confmat(getlab(b),b*p*pd)  % decisions of the full k-NN
 True   | Estimated Labels
 Labels |    1     2   | Totals
--------|--------------|-------
   1    | 1517    26   |  1543
   2    |   30  1427   |  1457
--------|--------------|-------
 Totals | 1547  1453   |  3000
>> confmat(getlab(b),b*p2*pd) % decisions of the approximated k-NN
 True   | Estimated Labels
 Labels |    1     2   | Totals
--------|--------------|-------
   1    | 1508    35   |  1543
   2    |   19  1438   |  1457
--------|--------------|-------
 Totals | 1527  1473   |  3000
```

Although we can observe differences between the confusion matrices, the error percentages remain very similar. Of course, cross-validation and proper statistical testing would be needed to decide whether the approximated k-NN performs worse than the full one.
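For reference, the overall error rates implied by the two confusion matrices above follow directly from the off-diagonal counts over the 3000 test samples:

```python
# off-diagonal (misclassified) counts from the confusion matrices above
err_full   = (26 + 30) / 3000   # full k-NN
err_approx = (35 + 19) / 3000   # approximated k-NN
print(f"full: {err_full:.2%}, approx: {err_approx:.2%}")
# → full: 1.87%, approx: 1.80%
```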

Stay tuned for more examples as our approximation algorithm offers plenty of room for improvement!