17.06.2008  pavel

Support for decision trees


We’re happy to announce that PRSD Studio supports execution of decision trees trained in PRTools. A decision tree is a classifier trained one feature at a time: each node tests a single feature, so the tree splits the feature space into rectangular subspaces. The two key advantages of decision trees are interpretability (why was the decision made?) and speed. It is the speed of execution that makes decision tree classifiers particularly interesting for industrial practitioners!
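To make the "rectangular subspaces" idea concrete, here is a minimal sketch in plain MATLAB of a hand-written two-level tree on two features. The thresholds, the sample points and the class numbers are made up for illustration; this is not how treec stores or evaluates its trees.

```matlab
% Toy axis-aligned tree on 2-D data.
% All thresholds and labels below are illustrative assumptions.
x = [0.2 0.9;    % four samples, one per row, two features per sample
     0.2 0.1;
     0.8 0.7;
     0.8 0.3];

labels = zeros(size(x,1),1);

left = x(:,1) < 0.5;                  % root node: tests feature 1 only
labels( left & x(:,2) <  0.5) = 1;    % leaf: x1 <  0.5 and x2 <  0.5
labels( left & x(:,2) >= 0.5) = 2;    % leaf: x1 <  0.5 and x2 >= 0.5
labels(~left)                 = 3;    % leaf: x1 >= 0.5 (no further split)

disp(labels')                         % the four samples get labels 2 1 3 3
```

Each sample needs at most two comparisons here, which hints at why tree execution is fast, and the path of tests taken by a sample is itself the explanation of its decision.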


A decision tree may be trained using the PRTools commands treec or stumpc. The first builds the full tree and optionally prunes it to improve generalization. The second builds only a simple tree with the first few nodes, which is useful when fusing many decision trees (bagging/boosting).
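As a rough sketch of the stump-plus-fusion idea, the session below trains a single stump and a boosted combination of stumps on the banana set. It assumes the PRTools signatures stumpc(A,CRIT,N) and adaboostc(A,CLASSF,N); please check help stumpc and help adaboostc in your PRTools version before relying on them. Whether such a combined classifier can be converted for execution in PRSD Studio is not covered here.

```matlab
% Requires PRTools on the MATLAB path.
a = gendatb(1000);               % banana set, as used elsewhere in this post

% Assumed signature: stumpc(A,CRIT,N), N = number of tree nodes to compute
w1 = stumpc(a,'maxcrit',1);      % a single split on one feature

% Assumed signature: adaboostc(A,CLASSF,N), CLASSF = untrained base classifier
w2 = adaboostc(a,stumpc,50);     % fuse 50 boosted stumps

% Apparent (training-set) error of both classifiers; optimistic by nature
e1 = testc(a*w1)
e2 = testc(a*w2)
```

A single stump can only cut the banana set with one axis-parallel boundary, so the fused classifier would be expected to do better; the actual numbers depend on the random data.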

Let us first train a simple tree on the banana dataset. We will use the Fisher criterion to build the tree, together with “test-set” pruning:

>> a=gendatb(1000)
Banana Set, 1000 by 2 dataset with 2 classes: [529  471]
>> w=treec(a,'fishcrit',-2)
Decision Tree, 2 to 2 trained  mapping   --> tree_map
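
The pruning mode is the third argument of treec (here -2 for test-set pruning). A hedged variant of the call above passes an explicit tuning set as a fourth argument, assuming the PRTools signature treec(A,CRIT,PRUNE,T); the 50/50 split and the variable names tr, tun and wp are arbitrary choices for illustration.

```matlab
% Requires PRTools on the MATLAB path.
a = gendatb(1000);                 % banana set, two classes
[tr,tun] = gendat(a,0.5);          % random split: half for growing, half for pruning

% Assumed signature: treec(A,CRIT,PRUNE,T) with T used for test-set pruning
wp = treec(tr,'fishcrit',-2,tun);

% Error on the tuning set; this is biased because tun was used for pruning,
% so an independent test set would be needed for an honest estimate
e = testc(tun*wp)
```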

We can visualize the output of the tree using the PRSD Toolbox sdscatter command:

>> sdscatter(a,w*sddecide(w))
ans =
    35

Note that we also pass the default decision mapping to sdscatter, so that the plot shows the decisions of the tree.

Here is an example on a multi-class problem (using 'maxcrit' instead of the Fisher criterion, which is defined only for two-class problems):

>> a=gendatm(1000)
Multi-Class Problem, 1000 by 2 dataset with 8 classes: [132  125  152  132  103  121  121  114]
>> w=treec(a,'maxcrit',-2)
Decision Tree, 2 to 8 trained  mapping   --> tree_map
>> sdscatter(a,w*sddecide(w))
ans =
    36

Let us measure the execution speed of this tree. We randomly generate 100 000 samples in 2D space, prepare an execution pipeline and run it using sdexe:

>> data=rand(100000,2);
>> p=sdconvert(w*sddecide(w))
sequential pipeline 2x1 'Weight-based decision (8 cls)'
 1  sdp_tree        2x8  Decision Tree
 2  sdp_decide      8x1  Weight-based decision (8 cls)
>> tic; sdexe(p,data); toc
Elapsed time is 0.023148 seconds.

In this simple example, we’re classifying 100,000 samples in about 23 ms.
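
To put that figure in perspective, the short sketch below converts the measured time into throughput. The arithmetic part runs in plain MATLAB; the optional loop assumes that p and data from the session above are still in the workspace and simply repeats the same sdexe call, since a single tic/toc measurement can be noisy.

```matlab
n = 100000;          % number of samples, from the session above
t = 0.023148;        % elapsed time in seconds, from the session above

fprintf('%.2f million samples per second\n', n/t/1e6);   % prints 4.32
fprintf('%.3f microseconds per sample\n',   t/n*1e6);    % prints 0.231

% Optional: repeat the measurement and report the median run time.
% Skipped automatically when p and data are not in the workspace.
if exist('p','var') && exist('data','var')
    runs = zeros(1,20);
    for i = 1:numel(runs)
        tic; sdexe(p,data); runs(i) = toc;
    end
    fprintf('median over %d runs: %.4f s\n', numel(runs), median(runs));
end
```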
