perClass Documentation
version 5.1 (11-May-2017)

kb28: Example on building a classifier on database records

Published on: 23-sep-2013

perClass version used: 4.0 (26-jul-2013)

Problem: How to build a classifier for SQL database records that contain nominal values?

Solution: Use sdnominal to make sure your data sets share common representation of nominal data.

28.1. Introduction ↩

When we build classifiers for database applications, we often encounter nominal features. A nominal feature captures an object quality. An example is the country a person is from. As designers, we must make sure our classifiers encounter identical representation of nominal features both in training and in execution.

In this article, we walk through an database application example that explains how to deal with nominal data representation.

28.2. Creating a data set from SQL database ↩

In our example, we use Adult data set from UCI repository. The data describes education and occupation of people and can be used to predict their annual income.

The data set was imported into SQLite database. Therefore, we may pull in Matlab only a small subset we actually want to use.

Let's say we want to build a classifier only concerning people in England.

We will first open the database for use in perClass:

>> d=sddb('adult.db')
perClass DB: 1 connections

The database contains one table called 'data':

>> tables(d)

ans = 

'data'

The 'data' table contains the following fields:

>> F=fields(d,'data')

ans = 

  Columns 1 through 5

'age'    'workclass'    'fnlwgt'    'education'    'education_num'

  Columns 6 through 10

'marital_status'    'occupation'    'relationship'    'race'    'sex'

  Columns 11 through 14

'capital_gain'    'capital_loss'    'hours_per_week'    'native_country'

  Column 15

'income'

And create an SQL query only for data from England:

>> s=sdsql(d,'select * from data where native_country="England";')
127 by 15 set in 'adult.db':

The query returned 127 records, each with 15 fields.

We may convert the sdsql object into a Matlab cell array:

>> C=cell(s);

and view its first record:

>> C(1,:)

ans = 

  Columns 1 through 6

[48]    'Private'    [171095]    'Assoc-acdm'    [12]    'Divorced'

  Columns 7 through 12

'Exec-managerial'    'Unmarried'    'White'    'Female'    [0]    [0]

  Columns 13 through 15

[40]    'England'    '<=50K'

Each record is a mix of numerical and string (nominal) values.

We will construct an sddata object from the SQL query:

>> a=sddata(s)
127 by 15 sddata (nominal), class: 'unknown'

and add field names as feature labels:

>> a.featlab=F
127 by 15 sddata (nominal), class: 'unknown'
>> a.featlab'
   1 age           
   2 workclass     
   3 fnlwgt        
   4 education     
   5 education_num 
   6 marital_status
   7 occupation    
   8 relationship  
   9 race          
  10 sex           
  11 capital_gain  
  12 capital_loss  
  13 hours_per_week
  14 native_country
  15 income        

Note, that sddata indicates that the object contains nominal values. With sdnominal command, we may display details on the nominal data set:

>> sdnominal(a)
Data set contains 9 nominal features:
1 'age' (real)  
2 'workclass' (nominal)  1:? 2:Federal-gov 3:Local-gov 4:Private 5:Self-emp-inc 6:Self-emp-not-inc 7:State-gov 
3 'fnlwgt' (real)  
4 'education' (nominal)  1:10th 2:11th 3:12th 4:1st-4th 5:5th-6th 6:9th 7:Assoc-acdm 8:Assoc-voc 9:Bachelors 10:Doctorate 11:HS-grad 12:Masters 13:Prof-school 14:Some-college 
5 'education_num' (real)  
6 'marital_status' (nominal)  1:Divorced 2:Married-civ-spouse 3:Married-spouse-absent 4:Never-married 5:Separated 6:Widowed 
7 'occupation' (nominal)  1:? 2:Adm-clerical 3:Craft-repair 4:Exec-managerial 5:Farming-fishing 6:Handlers-cleaners 7:Machine-op-inspct 8:Other-service 9:Priv-house-serv 10:Prof-specialty 11:Protective-serv 12:Sales 13:Tech-support 14:Transport-moving 
8 'relationship' (nominal)  1:Husband 2:Not-in-family 3:Other-relative 4:Own-child 5:Unmarried 6:Wife 
9 'race' (nominal)  1:Asian-Pac-Islander 2:Black 3:Other 4:White 
10 'sex' (nominal)  1:Female 2:Male 
11 'capital_gain' (real)  
12 'capital_loss' (real)  
13 'hours_per_week' (real)  
14 'native_country' (nominal)  1:England 
15 'income' (nominal)  1:<=50K 2:>50K 

For each nominal attribute, sdnominal lists the values both in numerical representation used inside sddata and the original string value.

For our example, we will use the 15th 'income' feature as a label. We can convert any nominal feature into sdlab object:

>> a.lab=sdlab(a(:,'income'))
127 by 15 sddata (nominal), 2 classes: '<=50K'(80) '>50K'(47) 

and 14th feature (native_country) as a "country" labels:

>> a.country=sdlab(a(:,'native_country'))
127 by 15 sddata (nominal), 2 classes: '<=50K'(80) '>50K'(47) 
>> a.country
sdlab with 127 entries from 'England'

Finally, we preserve only the first 13 columns as features:

>> a=a(:,1:13)
127 by 13 sddata (nominal), 2 classes: '<=50K'(80) '>50K'(47) 

28.3. Training a classifier on nominal data set ↩

We can directly train any classifier on our nominal data set a:

>> pr=sdrandforest(a)
..........
sequential pipeline       13x1 'Random forest+Decision'
 1 Random forest          13x2  20 trees
 2 Decision                2x1  weighting, 2 classes

Random forest error on the training set:

>> sdconfmat(a.lab,a*pr)

ans =

 True      | Decisions
 Labels    |   <=50K    >50K  | Totals
---------------------------------------
 <=50K     |     79       1   |     80
 >50K      |      3      44   |     47
---------------------------------------
 Totals    |     82      45   |    127

Note, that perClass classifiers, trained on nominal data, preserve detailed nominal information:

>> sdnominal(pr)
Pipeline expects on input 7 nominal features:
1 'age' (real)  
2 'workclass' (nominal)  1:? 2:Federal-gov 3:Local-gov 4:Private 5:Self-emp-inc 6:Self-emp-not-inc 7:State-gov 
3 'fnlwgt' (real)  
4 'education' (nominal)  1:10th 2:11th 3:12th 4:1st-4th 5:5th-6th 6:9th 7:Assoc-acdm 8:Assoc-voc 9:Bachelors 10:Doctorate 11:HS-grad 12:Masters 13:Prof-school 14:Some-college 
5 'education_num' (real)  
6 'marital_status' (nominal)  1:Divorced 2:Married-civ-spouse 3:Married-spouse-absent 4:Never-married 5:Separated 6:Widowed 
7 'occupation' (nominal)  1:? 2:Adm-clerical 3:Craft-repair 4:Exec-managerial 5:Farming-fishing 6:Handlers-cleaners 7:Machine-op-inspct 8:Other-service 9:Priv-house-serv 10:Prof-specialty 11:Protective-serv 12:Sales 13:Tech-support 14:Transport-moving 
8 'relationship' (nominal)  1:Husband 2:Not-in-family 3:Other-relative 4:Own-child 5:Unmarried 6:Wife 
9 'race' (nominal)  1:Asian-Pac-Islander 2:Black 3:Other 4:White 
10 'sex' (nominal)  1:Female 2:Male 
11 'capital_gain' (real)  
12 'capital_loss' (real)  
13 'hours_per_week' (real)  

28.4. Creating test data set ↩

We will pull records from Germany for testing:

>> s2=sdsql(d,'select * from data where native_country="Germany";')
206 by 15 set in 'adult.db':
SQL='select * from data where native_country="Germany";'

>> b=sddata(s2)
206 by 15 sddata (nominal), class: 'unknown'
>> b.featlab=fields(d,'data')
206 by 15 sddata (nominal), class: 'unknown'

>> b.lab=sdlab( b(:,'income') )
206 by 15 sddata (nominal), 2 classes: '<=50K'(148) '>50K'(58) 

>> b.country=sdlab( b(:,'native_country') )
206 by 15 sddata (nominal), 2 classes: '<=50K'(148) '>50K'(58) 

>> b=b(:,1:13)
206 by 13 sddata (nominal), 2 classes: '<=50K'(148) '>50K'(58) 

Let's apply our classifier pr on the data set b:

>> b*pr
{??? Error using ==> sdexe
Nominal representations in data set and pipeline do not agree! Use sdnominal to
validate and/or update nominal representation.

What is the cause of the reported error?

The nominal representation in data set b and classifier pr do not match. In other words, different numerical coding is used for nominal categories in b and in pr. Applying one to another would yield meaningless results.

>> sdnominal(b)
Data set contains 7 nominal features:
1 'age' (real)  
2 'workclass' (nominal)  1:? 2:Federal-gov 3:Local-gov 4:Private 5:Self-emp-inc 6:Self-emp-not-inc 7:State-gov 
3 'fnlwgt' (real)  
4 'education' (nominal)  1:10th 2:11th 3:12th 4:7th-8th 5:Assoc-acdm 6:Assoc-voc 7:Bachelors 8:Doctorate 9:HS-grad 10:Masters 11:Prof-school 12:Some-college 
5 'education_num' (real)  
6 'marital_status' (nominal)  1:Divorced 2:Married-civ-spouse 3:Married-spouse-absent 4:Never-married 5:Separated 6:Widowed 
7 'occupation' (nominal)  1:? 2:Adm-clerical 3:Craft-repair 4:Exec-managerial 5:Farming-fishing 6:Handlers-cleaners 7:Machine-op-inspct 8:Other-service 9:Priv-house-serv 10:Prof-specialty 11:Protective-serv 12:Sales 13:Tech-support 14:Transport-moving 
8 'relationship' (nominal)  1:Husband 2:Not-in-family 3:Other-relative 4:Own-child 5:Unmarried 6:Wife 
9 'race' (nominal)  1:Amer-Indian-Eskimo 2:Asian-Pac-Islander 3:Black 4:Other 5:White 
10 'sex' (nominal)  1:Female 2:Male 
11 'capital_gain' (real)  
12 'capital_loss' (real)  
13 'hours_per_week' (real)  

Comparing detailed nominal information on data set b and the classifier pr (see previous section) shows, for example, that our test set contains Amer-Indian-Eskimo race but the classifier does not know about it as this category was not present in the training data set a.

28.5. Making sure nominal representation are identical ↩

The message is: We must make sure the nominal representation of our problem is identical in all data sets and classifiers we use.

The sdnominal command may create a joined representation updating both data sets a and b so that they are compatible:

>> [a2,b2]=sdnominal(a,'join',b)
OK: Both returned data sets share the same joined nominal data representation (data matrices updated).
127 by 13 sddata (nominal), 2 classes: '<=50K'(80) '>50K'(47) 
206 by 13 sddata (nominal), 2 classes: '<=50K'(148) '>50K'(58) 

Note, that our classifier pr is now obsolete because its representation does not match the one in the updated training set a2:

>> sdnominal(a2,pr);
ISSUE: Each object represents nominal features by different numerical values.

We need to retrain it.

>> pr2=sdrandforest(a2)
..........
sequential pipeline       13x1 'Random forest+Decision'
 1 Random forest          13x2  20 trees
 2 Decision                2x1  weighting, 2 classes

Our classifier pr2 may be now applied to data set b2:

>> sdconfmat(b2.lab,b2*pr2)

ans =

 True      | Decisions
 Labels    |   <=50K    >50K  | Totals
---------------------------------------
 <=50K     |    131      17   |    148
 >50K      |     20      38   |     58
---------------------------------------
 Totals    |    151      55   |    206

28.6. Conclusions ↩

We have seen how to pull data from SQL database and train a classifier on nominal feature represenation. perClass helps us to make sure that the nominal representation used in our data sets and classifiers is identical.

TIP: To avoid the need for re-training our classifiers, we should define our data sets with the complete set of nominal categories for each attribute and use the same setup in the entire project.