- 9.1. Introduction
- 9.2. Creating data sets with nominal features
- 9.3. Testing if data set contains nominal features
- 9.4. Display info about nominal data
- 9.5. Converting nominal feature to labels
- 9.6. Training a pipeline on nominal data set
- 9.7. Combining nominal data sets
- 9.8. Testing if two nominal representations are identical
- 9.9. Making two nominal representations identical
- 9.10. Applying pipelines to nominal data sets
- 9.11. Turning labels into nominal features

# 9.1. Introduction

Nominal, or categorical, features describe qualitative aspects of an
object. For example, an object's color may be "blue", "red" or "green". The
*color* feature is nominal because the available values do not share any
ordering relationship ("blue" is not higher/lower or better/worse than
"green").

With the DB licensing option, perClass supports nominal features and the
training of classifiers on this type of data. You can check whether the DB
option is present using `sdversion`.

# 9.2. Creating data sets with nominal features

With the DB option, `sddata` sets may be created from cell arrays
containing string fields. The data matrix remains numerical; perClass,
however, defines and maintains the mapping between nominal values and their
numerical representation.

Let us consider this cell array:

**>> C={1.5 'aaa'; -5 'bbb'; 1.7 'aaa'; 4 'ccc'}**
C =
[1.5000] 'aaa'
[ -5] 'bbb'
[1.7000] 'aaa'
[ 4] 'ccc'
**>> a=sddata(C)**
4 by 2 sddata (nominal), class: 'unknown'

The data set `a` contains four samples and two features.

The content of the data set `a` is numerical, as in any other `sddata` set:

**>> +a**
ans =
1.5000 1.0000
-5.0000 2.0000
1.7000 1.0000
4.0000 3.0000
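
The mapping between nominal values and numbers can be pictured with a short sketch. This is plain Python, not perClass code: the helper `encode_nominal` is hypothetical, and the assumption that codes are assigned in order of first appearance is inferred from the example output above, not from a documented perClass guarantee.

```python
def encode_nominal(values):
    """Map each distinct string to an integer code, starting at 1.

    Assumption: codes follow the order of first appearance, which matches
    the example output in this section.
    """
    mapping = {}
    codes = []
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping) + 1
        codes.append(mapping[value])
    return codes, mapping


codes, mapping = encode_nominal(["aaa", "bbb", "aaa", "ccc"])
print(codes)    # [1, 2, 1, 3]
print(mapping)  # {'aaa': 1, 'bbb': 2, 'ccc': 3}
```

The numeric column alone carries no meaning; it can only be interpreted together with the mapping that produced it.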

# 9.3. Testing if data set contains nominal features

We can use `isnominal` to test whether a data set contains nominal
information.

**>> isnominal(a)**
ans =
1

`isnominal` returns `1` if at least one nominal feature exists in the data
set and `0` otherwise:

**>> isnominal(a(:,1))**
ans =
0

# 9.4. Display info about nominal data

Detailed information on nominal data is provided by the `sdnominal` command:

**>> sdnominal(a)**
Data set contains one nominal feature:
1 'Feature 1' (real)
2 'Feature 2' (nominal) 1:aaa 2:bbb 3:ccc

For each nominal feature, it displays all nominal values with the corresponding numerical representation.

# 9.5. Converting nominal feature to labels

Any nominal feature may be converted into an `sdlab` object:

**>> L=sdlab(a(:,2))**
sdlab with 4 entries, 3 groups: 'aaa'(2) 'bbb'(1) 'ccc'(1)
**>> +L**
ans =
aaa
bbb
aaa
ccc

# 9.6. Training a pipeline on nominal data set

A pipeline remembers that it was trained on a nominal data set. Therefore,
we may test it with `isnominal` and even inspect its nominal representation
with `sdnominal`:

**>> p=sdknn(a)**
sequential pipeline 2x1 '1-NN+Decision'
1 1-NN 2x1 4 prototypes
2 Decision 1x1 threshold on 'unknown'
**>> isnominal(p)**
ans =
1
**>> sdnominal(p)**
Pipeline expects on input one nominal feature:
1 'Feature 1' (real)
2 'Feature 2' (nominal) 1:aaa 2:bbb 3:ccc

# 9.7. Combining nominal data sets

Imagine we have another cell array:

**>> C2={5 'ccc'; -3 'bbb'}**
C2 =
[ 5] 'ccc'
[-3] 'bbb'

We convert it into a data set:

**>> a2=sddata(C2)**
2 by 2 sddata (nominal), class: 'unknown'
**>> +a2**
ans =
5 1
-3 2

If we concatenate the data sets `a` and `a2`, we receive an error message:

**>> [a;a2]**
??? Error using ==> sddata.vertcat at 50
Data sets being concatenated do not share identical nominal representation. Use
sdnominal to either use one existing representation for all data sets ('from'
option) or create a new representation for all sets ('join' option).

The reason is that the two data sets encode the nominal values with
different numbers. While the value 'ccc' is represented by 3 in data set
`a`, it is represented by 1 in data set `a2`.

**>> sdnominal(a)**
Data set contains one nominal feature:
1 'Feature 1' (real)
2 'Feature 2' (nominal) 1:aaa 2:bbb 3:ccc
**>> sdnominal(a2)**
Data set contains one nominal feature:
1 'Feature 1' (real)
2 'Feature 2' (nominal) 1:ccc 2:bbb

We need an identical numerical representation of nominal features in all
data sets and, consequently, in all classifiers we build! The fundamental
rule when working with nominal data is: **In your project, use a single
nominal representation of each nominal feature.**

# 9.8. Testing if two nominal representations are identical

The `sdnominal` function allows us to test whether two objects share the
same nominal representation:

**>> sdnominal(a,a2)**
ISSUE: Each object represents nominal features by different numerical values.
ans =
0

A subset of the same data set has an identical nominal representation:

**>> sdnominal(a, a(3:end))**
OK: Both objects share identical numerical representation of nominal data.
ans =
1

As with other perClass functions, the additional display output may be suppressed using the 'nodisplay' option:

**>> sdnominal(a, a(3:end), 'nodisplay')**
ans =
1

# 9.9. Making two nominal representations identical

In our example above, the data sets `a` and `a2` use different numerical
representations of nominal data. We may use `sdnominal` to pass the nominal
representation *from* one object to another:

**>> b2=sdnominal(a2,'from',a)**
Nominal representation in the data matrix updated based on the 'from' data set.
2 by 2 sddata (nominal), class: 'unknown'

The new nominal representation in the `b2` data set considers three values of 'Feature 2', namely 'aaa', 'bbb' and 'ccc':

**>> sdnominal(b2)**
Data set contains one nominal feature:
1 'Feature 1' (real)
2 'Feature 2' (nominal) 1:aaa 2:bbb 3:ccc
**>> [+a2 +b2]**
ans =
5 1 5 3
-3 2 -3 2

The data sets `a` and `b2` now share an identical representation:

**>> sdnominal(a,b2)**
OK: Both objects share identical numerical representation of nominal data.
ans =
1

Therefore, they may be concatenated:

**>> [a;b2]**
6 by 2 sddata (nominal), class: 'unknown'

Note, however, that we cannot pass the nominal representation from `a2` to
`a`, because some categories used in `a` are not present in `a2`:

**>> b=sdnominal(a,'from',a2)**
The following values are not present in the nominal list.
ans =
aaa
??? Error using ==> sddata.sddata at 129
Some categories in the label object were not found in the list of nominal
values
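
The effect of the 'from' option can be pictured with a small sketch. It is plain Python rather than perClass code: `remap_codes` and the two dictionaries are hypothetical stand-ins for the nominal representations of `a` and `a2`, and the error handling only mimics the behaviour shown above.

```python
def remap_codes(codes, own_mapping, target_mapping):
    """Re-express codes from own_mapping in terms of target_mapping."""
    code_to_value = {code: value for value, code in own_mapping.items()}
    values = [code_to_value[code] for code in codes]
    # Every value in the data must exist in the target representation.
    missing = sorted(set(values) - set(target_mapping))
    if missing:
        raise ValueError(f"values not present in the target mapping: {missing}")
    return [target_mapping[value] for value in values]


map_a = {"aaa": 1, "bbb": 2, "ccc": 3}   # representation of data set a
map_a2 = {"ccc": 1, "bbb": 2}            # representation of data set a2

# a2 re-expressed in the representation of a
print(remap_codes([1, 2], map_a2, map_a))  # [3, 2]

# a cannot be re-expressed in the representation of a2
try:
    remap_codes([1, 2, 1, 3], map_a, map_a2)
except ValueError as err:
    print(err)  # values not present in the target mapping: ['aaa']
```

Remapping works only toward a representation that covers every value present in the data, which is why the direction of 'from' matters.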

# 9.10. Applying pipelines to nominal data sets

We may apply the classifier `p`, trained above on data set `a`, to `a`
itself or to any subset of `a`:

**>> p**
sequential pipeline 2x1 '1-NN+Decision'
1 1-NN 2x1 4 prototypes
2 Decision 1x1 threshold on 'unknown'
**>> sdnominal(p)**
Pipeline expects on input one nominal feature:
1 'Feature 1' (real)
2 'Feature 2' (nominal) 1:aaa 2:bbb 3:ccc
**>> a*p**
sdlab with 4 entries from 'unknown'
**>> a(3)*p**
sdlab with one entry: 'unknown'

However, an error is raised if we apply it to data set `a2`:

**>> a2*p**
??? Error using ==> sdexe at 102
Nominal representations in data set and pipeline do not agree! Use sdnominal to
validate and/or update nominal representation.

This operation is not allowed because it would lead to incorrect results (recall that `a2` encodes the nominal value 'ccc' as 1, while the classifier expects it to be represented by 3).

We may, however, execute `p` on the data set `b2`, created from `a2` in the
previous section:

**>> b2*p**
sdlab with 2 entries from 'reject'

*Important:* perClass only checks whether nominal representations match
when working with `sddata` sets, but not for numerical matrices:

**>> +a2*p**
ans =
2
2

Also, the C-based execution runtime does not check the correctness of the nominal representation when executing classifiers outside of Matlab.
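
To see why an unchecked numerical matrix is risky, consider the following sketch. It is plain Python, not perClass code: `decode` and the two dictionaries are hypothetical stand-ins for the representations of the pipeline `p` and the data set `a2`.

```python
map_a = {"aaa": 1, "bbb": 2, "ccc": 3}   # representation the pipeline was trained with
map_a2 = {"ccc": 1, "bbb": 2}            # representation used by data set a2


def decode(codes, mapping):
    """Translate integer codes back to nominal values under a given mapping."""
    code_to_value = {code: value for value, code in mapping.items()}
    return [code_to_value[code] for code in codes]


raw_codes = [1, 2]  # second column of +a2

print(decode(raw_codes, map_a2))  # ['ccc', 'bbb']  what the data means
print(decode(raw_codes, map_a))   # ['aaa', 'bbb']  what the pipeline assumes
```

The same numbers are silently read as different categories, so the decisions returned for a raw matrix may be wrong without any error being reported.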

# 9.11. Turning labels into nominal features

Above, we have seen how to convert a nominal feature into an `sdlab`
object. That is very useful if we want to work with categories using
powerful logical operations or regular expressions.

But how do we bring the label object back to nominal feature values?

Let us consider this label object:

**>> L=sdlab(a(3:end,2))**
sdlab with 2 entries, 2 groups: 'aaa'(1) 'ccc'(1)
**>> +L**
ans =
aaa
ccc

If we only convert `L` into an `sddata` set, the categories are
represented differently than in the original set:

**>> c=sddata(L)**
2 by 1 sddata (nominal), class: 'unknown'
**>> +c**
ans =
1
2

This is because label objects in perClass only represent the categories that are present.

A useful way to convert the labels back into the original nominal representation is:

**>> d=sdnominal(sddata(L),'from',a(:,2))**
Nominal representation in the data matrix updated based on the 'from' data set.
2 by 1 sddata (nominal), class: 'unknown'
**>> +d**
ans =
1
3
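
The difference between the two conversions can be sketched as follows. This is plain Python, not perClass code: `encode_with` is a hypothetical helper, and the first-appearance ordering of the fresh mapping is an assumption inferred from the outputs above.

```python
def encode_with(values, mapping):
    """Encode values using an existing mapping instead of building a new one."""
    return [mapping[value] for value in values]


map_a = {"aaa": 1, "bbb": 2, "ccc": 3}   # original representation of 'Feature 2'
labels = ["aaa", "ccc"]                  # categories present in the label object

# A fresh mapping only knows the categories that are present.
fresh = {value: i for i, value in enumerate(dict.fromkeys(labels), start=1)}

print(encode_with(labels, fresh))  # [1, 2]
print(encode_with(labels, map_a))  # [1, 3]
```

Only the second encoding stays compatible with data sets and pipelines built on the original representation.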