perClass Documentation
version 5.1 (31-May-2017)

Chapter 9: Handling nominal features

9.1. Introduction

Nominal, or categorical, features describe qualitative aspects of an object. For example, an object's color may be "blue", "red" or "green". The color feature is nominal because the available values do not share any ordering relationship ("blue" is not higher/lower or better/worse than "green").

perClass with the DB licensing option supports nominal features and the training of classifiers on this type of data. You can check whether the DB option is present using sdversion.

9.2. Creating data sets with nominal features

With the DB option, sddata sets may be created from cell arrays containing string fields. The data matrix remains numerical; perClass, however, defines and maintains the mapping between nominal values and their numerical representation.

Let us consider this cell array:

>> C={1.5 'aaa'; -5 'bbb'; 1.7 'aaa'; 4 'ccc'}

C = 

[1.5000]    'aaa'
[    -5]    'bbb'
[1.7000]    'aaa'
[     4]    'ccc'

>> a=sddata(C)
4 by 2 sddata (nominal), class: 'unknown'

The data set a contains four samples and two features.

The content of the data set a is numerical, as in any other sddata set:

>> +a

ans =

1.5000    1.0000
-5.0000    2.0000
1.7000    1.0000
4.0000    3.0000
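
The mapping between nominal values and numbers can be pictured with plain Matlab. The following is a minimal sketch, not the perClass implementation. It assumes, based on the outputs shown in this chapter, that values are numbered in order of first appearance, and it uses only the built-in unique function. All variable names are illustrative.

```matlab
% Illustration only: base Matlab, not the perClass implementation.
names = {'aaa'; 'bbb'; 'aaa'; 'ccc'};   % nominal column of C

% 'stable' keeps the values in order of first appearance;
% codes(i) is the position of names{i} in the value list.
[values, ~, codes] = unique(names, 'stable');

disp(values')   % list of nominal values
disp(codes')    % numerical representation of each sample

% Decoding: index the value list with the codes.
decoded = values(codes);
assert(isequal(decoded, names))
```

The codes obtained this way agree with the second column of +a above, and indexing the value list with the codes recovers the original strings.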

9.3. Testing if data set contains nominal features

We can use isnominal to test whether a data set contains nominal information.

>> isnominal(a)

ans =

 1

The isnominal function returns 1 if the data set contains at least one nominal feature and 0 otherwise:

>> isnominal(a(:,1))

ans =

 0

9.4. Display info about nominal data

Detailed information on nominal data is provided by the sdnominal command:

>> sdnominal(a)
Data set contains one nominal feature:
1 'Feature 1' (real)  
2 'Feature 2' (nominal)  1:aaa 2:bbb 3:ccc 

For each nominal feature, it displays all nominal values with the corresponding numerical representation.

9.5. Converting nominal feature to labels

Any nominal feature may be converted into an sdlab object:

>> L=sdlab(a(:,2))
sdlab with 4 entries, 3 groups: 'aaa'(2) 'bbb'(1) 'ccc'(1) 

>> +L

ans =

aaa
bbb
aaa
ccc

9.6. Training a pipeline on nominal data set

A pipeline remembers that it was trained on a nominal data set. Therefore, we may test it with isnominal and display its nominal representation with sdnominal:

>> p=sdknn(a)
sequential pipeline       2x1 '1-NN+Decision'
 1 1-NN                    2x1  4 prototypes
 2 Decision                1x1  threshold on 'unknown'

>> isnominal(p)

ans =

 1

>> sdnominal(p)
Pipeline expects on input one nominal feature:
1 'Feature 1' (real)  
2 'Feature 2' (nominal)  1:aaa 2:bbb 3:ccc 

9.7. Combining nominal data sets

Imagine we have another cell array:

>> C2={5 'ccc'; -3 'bbb'}

C2 = 

[ 5]    'ccc'
[-3]    'bbb'

We convert it into a data set:

>> a2=sddata(C2)
2 by 2 sddata (nominal), class: 'unknown'
>> +a2

ans =

 5     1
-3     2

If we concatenate the data sets a and a2, we receive an error message:

>> [a;a2]
??? Error using ==> sddata.vertcat at 50
Data sets being concatenated do not share identical nominal representation. Use
sdnominal to either use one existing representation for all data sets ('from'
option) or create a new representation for all sets ('join' option).

The reason is that the two data sets encode the nominal values by different numbers. While the value 'ccc' is represented by 3 in data set a, it is represented by 1 in data set a2:

>> sdnominal(a)
Data set contains one nominal feature:
1 'Feature 1' (real)  
2 'Feature 2' (nominal)  1:aaa 2:bbb 3:ccc 

>> sdnominal(a2)
Data set contains one nominal feature:
1 'Feature 1' (real)  
2 'Feature 2' (nominal)  1:ccc 2:bbb 

We need an identical numerical representation of nominal features in all data sets and, consequently, in all classifiers we build! The fundamental rule when working with nominal data is: In your project, use a single nominal representation of each nominal feature.

9.8. Testing if two nominal representations are identical

The sdnominal function allows us to test whether two objects share the same nominal representation:

>> sdnominal(a,a2)
ISSUE: Each object represents nominal features by different numerical values.

ans =

 0

A subset of the same data set has an identical nominal representation:

>> sdnominal(a, a(3:end) )
OK: Both objects share identical numerical representation of nominal data.

ans =

 1

As with other perClass functions, the additional display output may be suppressed using the 'nodisplay' option:

>> sdnominal(a, a(3:end), 'nodisplay' )

ans =

 1
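
Conceptually, two representations are identical when they list the same values in the same order. The following is a minimal sketch in plain Matlab, with the two value lists copied by hand from the sdnominal output for a and a2. The variable names are illustrative, and this is not how sdnominal itself is implemented.

```matlab
% Illustration only: base Matlab, not the perClass implementation.
valuesA  = {'aaa'; 'bbb'; 'ccc'};   % representation of a
valuesA2 = {'ccc'; 'bbb'};          % representation of a2

% Identical representation: same values in the same positions
identical = isequal(valuesA, valuesA2);

% Values present in both lists but stored under different codes
[common, ia, ib] = intersect(valuesA, valuesA2);
conflicting = common(ia ~= ib);

% Values that have a code in a but none in a2
onlyInA = setdiff(valuesA, valuesA2);
```

Here identical is false, conflicting holds 'ccc' (code 3 in a, code 1 in a2) and onlyInA holds 'aaa'.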

9.9. Making two nominal representations identical

In our example above, the data sets a and a2 use different numerical representations of nominal data. We may use sdnominal to transfer the nominal representation from one object to another:

>> b2=sdnominal(a2,'from',a)
Nominal representation in the data matrix updated based on the 'from' data set.
2 by 2 sddata (nominal), class: 'unknown'

The new nominal representation in b2 data set considers three values of 'Feature 2', namely 'aaa','bbb' and 'ccc':

>> sdnominal(b2)
Data set contains one nominal feature:
1 'Feature 1' (real)  
2 'Feature 2' (nominal)  1:aaa 2:bbb 3:ccc 

>> [+a2 +b2]

ans =

 5     1     5     3
-3     2    -3     2

The data sets a and b2 now share identical representation:

>> sdnominal(a,b2)
OK: Both objects share identical numerical representation of nominal data.

ans =

 1

Therefore, they may be concatenated:

>> [a;b2]
6 by 2 sddata (nominal), class: 'unknown'

Note, however, that we cannot pass the nominal representation from a2 to a, because a contains a category ('aaa') that is not present in a2:

>> b=sdnominal(a,'from',a2)
The following values are not present in the nominal list.

ans =

aaa

??? Error using ==> sddata.sddata at 129
Some categories in the label object were not found in the list of nominal
values
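
The effect of the 'from' option on the data matrix can be sketched in plain Matlab: decode the codes to strings with the old list, then look each string up in the target list. This is a hedged illustration using the built-in ismember function, not the perClass implementation. The value lists and codes are copied by hand from the outputs above, and the variable names are illustrative.

```matlab
% Illustration only: base Matlab, not the perClass implementation.
valuesA  = {'aaa'; 'bbb'; 'ccc'};   % target list (from a)
valuesA2 = {'ccc'; 'bbb'};          % source list (from a2)
codesA2  = [1; 2];                  % nominal column of +a2

% Decode to strings, then find each string in the target list
[found, newCodes] = ismember(valuesA2(codesA2), valuesA);
assert(all(found), 'Some values are missing in the target list.');

% Reverse direction: the codes of a cannot all be expressed with the a2 list
codesA   = [1; 2; 1; 3];            % nominal column of +a
foundRev = ismember(valuesA(codesA), valuesA2);
missing  = unique(valuesA(codesA(~foundRev)));
```

In this sketch newCodes is [3; 2], which agrees with the last column of [+a2 +b2], while missing holds 'aaa', the value reported in the error message above.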

9.10. Applying pipelines to nominal data sets

We may apply the classifier trained above on data set a to a itself or to any subset of it:

>> p
sequential pipeline       2x1 '1-NN+Decision'
 1 1-NN                    2x1  4 prototypes
 2 Decision                1x1  threshold on 'unknown'
>> sdnominal(p)
Pipeline expects on input one nominal feature:
1 'Feature 1' (real)  
2 'Feature 2' (nominal)  1:aaa 2:bbb 3:ccc 

>> a*p
sdlab with 4 entries from 'unknown'
>> a(3)*p
sdlab with one entry: 'unknown'

However, applying it to data set a2 raises an error:

>> a2*p
??? Error using ==> sdexe at 102
Nominal representations in data set and pipeline do not agree! Use sdnominal to
validate and/or update nominal representation.

This operation is not allowed because it would lead to incorrect results (recall that a2 encodes the nominal value 'ccc' as 1, while the classifier expects it to be represented by 3).

We may, however, execute p on the data set b2, created from a2 in the previous section:

>> b2*p
sdlab with 2 entries from 'reject'

Important: perClass checks whether nominal representations match only when working with sddata sets, not with numerical matrices:

>> +a2*p

ans =

       2
       2

Also, the C-based execution runtime does not check the correctness of the nominal representation when executing classifiers outside of Matlab.
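
To see why an unchecked numerical matrix is risky, the following sketch decodes the same raw codes with both value lists. It is an illustration in plain Matlab under the assumption that the lists are exactly those shown by sdnominal for a and a2. It does not call perClass, and the variable names are illustrative.

```matlab
% Illustration only: base Matlab, not the perClass implementation.
valuesA  = {'aaa'; 'bbb'; 'ccc'};   % list the pipeline p was trained with
valuesA2 = {'ccc'; 'bbb'};          % list used by a2
raw = [5 1; -3 2];                  % +a2: codes only, the mapping is lost

intended    = valuesA2(raw(:,2));   % what a2 meant by each code
interpreted = valuesA(raw(:,2));    % what the pipeline's list assigns to each code

% Samples whose nominal value would be silently misread
misread = ~strcmp(intended, interpreted);
```

The first sample is flagged in misread: it was recorded as 'ccc' but its code 1 corresponds to 'aaa' in the representation the pipeline expects, and no error is raised to signal this.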

9.11. Turning labels into nominal features

Above, we have seen how to convert a nominal feature into an sdlab object. That is very useful if we want to work with categories using powerful logical operations or regular expressions.

But how do we bring a label object back to nominal feature values?

Let us consider this label object:

>> L=sdlab(a(3:end,2))
sdlab with 2 entries, 2 groups: 'aaa'(1) 'ccc'(1) 
>> +L

ans =

aaa
ccc

If we simply convert L into an sddata set, the categories are represented differently than in the original set:

>> c=sddata(L)
2 by 1 sddata (nominal), class: 'unknown'
>> +c

ans =

 1
 2

This is because label objects in perClass represent only the categories that are present.

A useful construction for converting the labels back into the original nominal representation is:

>> d=sdnominal(sddata(L),'from',a(:,2))
Nominal representation in the data matrix updated based on the 'from' data set.
2 by 1 sddata (nominal), class: 'unknown'
>> +d

ans =

 1
 3