perClass Documentation
development version 3.2 (14-Mar-2012)

Chapter 6: Data visualization

6.1. Interactive scatter plot

perClass provides an interactive scatter plot, sdscatter. We can launch it on any data set. Here we create a data set with three features computed from road sign images: the mean, standard deviation and median of each data set row (an image reshaped to a vector):

>> a
381 by 1024 sddata, 17 classes: [31  28  24  33  19  21  57  26  21   9  13  15  14   1  14  29  26]

>> a2=setdata(a,[mean(+a,2) std(+a,0,2) median(+a,2)])
Warning: Feature names reset to 'Feature X' format.
> In sddata.setdata at 31
381 by 3 sddata, 17 classes: [31  28  24  33  19  21  57  26  21   9  13  15  14   1  14  29  26]

Note the warning message stating that feature labels of the new data set were set automatically.

>> getfeatlab(a2)
sdlab with 3 entries, 3 groups: 'Feature 1'(1) 'Feature 2'(1) 'Feature 3'(1) 

We may set the feature labels to more descriptive names using setfeatlab:

>> a2=setfeatlab(a2,sdlab('mean','std','median'))
381 by 3 sddata, 17 classes: [31  28  24  33  19  21  57  26  21   9  13  15  14   1  14  29  26]

Alternatively, we may provide the feature labels directly in the setdata call:

>> a2=setdata(a,[mean(+a,2) std(+a,0,2) median(+a,2)],sdlab('mean','std','median'))
381 by 3 sddata, 17 classes: [31  28  24  33  19  21  57  26  21   9  13  15  14   1  14  29  26]
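The three features above are ordinary row-wise statistics. As a minimal sketch in plain Matlab, assuming that +a yields the raw numeric data matrix (as the calls above suggest), the same computation on a small made-up matrix X looks as follows. X and the other variable names are illustrative only and are not part of perClass.

```matlab
% Toy stand-in for the raw data matrix: 4 samples (rows), 5 values each.
X = [1 2 3 4 5;
     2 2 2 2 2;
     0 0 10 0 0;
     5 4 3 2 1];

m  = mean(X,2);      % mean of each row
s  = std(X,0,2);     % standard deviation of each row (normalized by N-1)
md = median(X,2);    % median of each row

F = [m s md];        % 4-by-3 feature matrix, one row per sample
disp(F)
```

Each row of F corresponds to one sample, which is the layout setdata receives in the example above.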

In order to visualize the scatter plot, we invoke the sdscatter command:

>> sdscatter(a2)
ans =
 1

sdscatter opens a new figure and returns its handle:

Scatter plot with multiple classes.

The figure shows a scatter plot of the first two features in the data set. Each point represents one data sample (here, a road sign). The colors and marker styles correspond to the different classes.

Moving the mouse over the plot shifts the focus to the closest data sample, which is highlighted by a black marker. The figure title provides details about the highlighted sample, such as its index in the data set and its class.

6.1.1. Legend

The legend may be switched on either by pressing the l key (as in legend) or by using the Show legend command in the Scatter menu.

Scatter showing legend with class names.

Note that pressing the legend toolbar button does not show correct class names in the legend; this is a known issue.

6.1.2. Changing features

We can change the features shown in sdscatter using the cursor keys. The "Left" and "Right" arrows flip through the features on the horizontal axis, and "Up" and "Down" flip through the features on the vertical axis.

To select a feature of interest directly, right-click on the axis legend. A pop-up menu will appear listing the available features.

Changing features in a scatter plot by right mouse click.

If more than 25 features are present in the data set, a dialog will appear allowing us to select a feature by its index.

6.1.3. Sample inspector

The sample inspector shows a detailed view of the current sample. It is especially useful if the samples in the data set represent images (as in our road sign example).

We select the Show sample inspector command from the Scatter menu. A dialog opens asking for the name of the data set which contains the image data. We type a, the original data set holding the 1024 pixel values of each road sign, and click on OK. A separate window opens showing the road sign image of the currently highlighted sample:

Showing an image corresponding to the sample in a scatter plot.

You can use the sample inspector to identify outliers or to understand which objects fall in the area of overlap.

6.1.4. Switching between different sets of labels

It is often beneficial to use multiple sets of labels. For example, in a medical problem, we may be interested not only in the top-level class such as 'cancer'/'non-cancer' but also in the specific type of tissue or in the patient the sample originates from.

sdscatter may visualize any sample labeling available in the data set. Any sdlab object stored as a sample property is available.

Let's use a medical data set from a cancer detection problem in this example. It contains information on pixels in scans of multiple patients. For each pixel, we know the high-level label ('cancer' or 'non-cancer'), the more precise tissue type, and the patient:

>> load medical;
>> a'
'medical all' 225119 by 11 sddata, 2 classes: 'cancer'(56652) 'non-cancer'(168467) 
sample props: 'lab'->'class' 'class'(L) 'pixel'(N) 'patient'(L) 'tissue'(L)
feature props: 'featlab'->'featname' 'featname'(L)
data props:  'data'(N)

>> sdscatter(a)

Visualizing medical data set in a scatter plot.

We may switch between different labels via the Use property command in the Scatter menu.

Selecting a set of labels in the scatter plot.

Switching to patient labeling:

Scatter plot showing patient labels.

We may switch quickly to a specific property using the 1-9 shortcut keys. In our example, the tissue property is accessible by pressing '3':

Scatter plot showing tissue labels.

6.1.5. Visualizing subsets of samples

sdscatter allows us to show only a subset of samples defined by label values. This feature is accessible via the Sample filter command in the Scatter menu.

For example, we may be interested only in non-cancer tissues. We can select only the non-cancer samples in Scatter/Sample filter/class.

Handling class visibility in scatter plot.

We may combine multiple filters. For example, we might be interested only in the non-cancer samples of patient 'Dick':

Scatter plot showing non-cancer data of a single patient.

Note that sdscatter preserves the axes limits of the total data set also for sample subsets. This gives us important clues about the position of the subset within the total data distribution. If we are interested in a detailed view of the subset, we may enter the automatic mode by pressing the 'a' key. The limits will then be set according to the subset. Pressing 'a' again returns us to the full data set limits.

When visualizing sample subsets, we may freely move between different sets of labels. For example, by pressing '3' we switch to the 'tissue' property, which shows us the specific non-cancer tissues of Dick:

Scatter plot with tissue data

To quickly return to the previous filter, use the 'f' key or the Sample filter/Apply previous filter command. This allows us to understand differences between distributions.

The visible subset of samples may be stored in a new data set in the Matlab workspace using the Create data set with visible samples menu command.

6.1.6. Bringing class to top, z-order of classes

Overlapping classes may easily obscure scatter plots of large data sets. sdscatter provides the Class to top command in the Scatter menu, which allows us to bring the desired class to the top. In this way, we can better understand what happens in the area of overlap.

We will demonstrate this function on the artificially-generated three-class data set created by the gendatf function:

>> a=sddata(gendatf(10000))
'Fruit set' 10000 by 2 sddata, 3 classes: 'apple'(3333) 'banana'(3333) 'stone'(3334) 
>> sdscatter(a)

Large data set in an interactive scatter plot

The stone class obscures the banana distribution. By selecting Class to top and banana, we change the order in which the classes are plotted, so that banana appears on top.

Changing plotting order of classes in a scatter plot.

sdscatter also offers two keystrokes for flipping through the plotting order (z-order) of classes: the + and - keys (to make things simpler, = works as +, so there is no need to hold SHIFT).

6.1.7. Hand-painting class labels

sdscatter allows us to define class labels directly by painting. In this way, we can interactively label interesting groups of samples such as outliers, areas of overlap or class modes.

Painting is accessible both from the Scatter menu and from the context-sensitive menu.

Painting class labels in a scatter plot

We need to specify which class to paint. It can be either one of the existing classes or we can create a new class. In our example, we are interested in the area of overlap and will, therefore, create a new class called overlap.

In painting mode, a square is added to the scatter plot axis. By holding the left mouse button, we assign the samples inside the square to the desired class.

Note that while painting, you can freely switch between features to find the best views for your problem. You can also hide some of the classes using the Class visibility command. Painting assigns labels only to visible data samples.

Stop hand-painting sample labels.

When finished, choose Stop painting from the context menu or from the Scatter menu.

6.1.8. Renaming classes

sdscatter provides a simple way to rename classes. This facility is helpful to re-arrange the data set or to assign meaning to labels generated by cluster analysis.

The function is accessible through the Rename class command in the context menu or in the Scatter menu.

Renaming classes in a scatter plot

We can, for example, rename both the apple and banana classes to fruit. Using the Create data set in workspace command from the Scatter menu, we can save this data set into the Matlab workspace. The resulting data set will have only two classes, namely stone and fruit.

>> b  %  Created sddata b with all label sets.
Fruit set, 10000 by 2 sddata, 2 classes: 'stone'(3334) 'fruit'(6666)
>> b.lab.list
sdlist (2 entries)
 ind name
   1 stone 
   2 fruit 

Note that interactive renaming of classes makes most sense for interactively defined classes. For classes already existing in the data set, it is simpler to use the sdrelab function, discussed elsewhere in this documentation.

6.1.9. Visualizing live feature distributions in scatter plot

When visualizing large data sets, the scatter plot alone is often not sufficient to judge the class overlap. To visualize the overlap, sdscatter can add a feature distribution plot for each of the axes.

Select Show feature distributions in the Scatter menu or press 'd'. The scatter figure will be extended with an additional distribution plot for the horizontal and the vertical axis:

>> a
'medical D/ND' 6400 by 11 sddata, 3 classes: 'disease'(1495) 'no-disease'(4267) 'noise'(638) 

>> sdscatter(a)

Scatter plot showing per-feature class distributions.

The distribution sub-plots show histograms for each of the available classes. Because the axes limits are aligned, we may better understand where the true area of high class density is located. When you focus on a subset of classes, switch between sets of labels or paint labels, the plots are updated accordingly.

To remove the class histograms, select Hide feature distributions from the Scatter menu or press the 'd' key again.

6.2. Interactive plot of per-class feature distributions

Visualizing the per-class distribution of each feature gives an indication of the class overlap. The sdfeatplot command provides this plot. To view the distributions of different features, use the up/down cursor keys.

Per class feature distribution.

The image shows the distribution for the two classes present in the data. By default, the labels are used, but the 'lab' option allows us to visualize the distribution of other properties present in the data set. The distribution is obtained by computing a histogram. The default number of bins is 30, but it can be customized using the 'bin' option. The distribution may also be visualized using the unique values of the data itself as bins; this is achieved by pressing the 'u' key. Naturally, the distribution then looks more "noisy"; see the right plot in the figure below.

Per class feature distribution with automatic or unique values grid.
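The difference between the regular 30-bin grid and the unique-value grid can be sketched in plain Matlab. This is only an illustration of the two binning schemes, not the code sdfeatplot uses internally; the vector x and all variable names are made up for the example.

```matlab
% Hypothetical feature values for one class (integers, so many repeats).
x = [1 1 2 2 2 3 5 5 8 8 8 8 13 21 21];

% (1) Regular grid: 30 equally spaced bins over the data range.
nbins = 30;
[countsGrid, centers] = hist(x, nbins);

% (2) Unique-value grid: one bin per distinct value in the data.
[vals, ~, idx] = unique(x);
countsUnique = accumarray(idx(:), 1)';

disp([vals; countsUnique])   % first row: values, second row: counts
```

With sparse, discrete data most of the 30 regular bins stay empty, while the unique-value grid contains exactly one bin per observed value.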

The style of the distribution may be customized using the 'style' option.

>> sdfeatplot(out2,'style',{'g-','m--'})

Per class feature distribution with custom markers.

sdfeatplot provides several options to enhance the visualization if the features of interest are themselves obtained by computing histograms. For example, pressing the 's' key switches to a stem plot, highlighting individual histogram bins. The bins for the grid may be computed automatically, spread linearly over the data range. Alternatively, the unique values may be visualized using the 'u' key. This is especially useful when the feature histogram is very sparse. In that case, direct inspection of the bin values (right plot in the figure below) gives a better understanding than the distribution plot (left plot).

Stem-plot

Using the 'x' key, the x-axis for the bins may be specified. This is especially useful if the data has a logarithmic scale.

6.3. Working with image data

perClass provides a set of tools for working with image data. It allows us to quickly build classifiers based on local image information. The central component of this framework is the sdimage command. It provides a powerful interactive visualization tool and also allows the construction of data sets with image data.

6.3.1. Visualizing images with sdimage

Let us consider an RGB image of a traffic scene, 'roadsign09.bmp', loaded with the Matlab imread command:

>> im = imread('roadsign09.bmp');
>> figure; imagesc(im)

Road sign image

Using the sdimage command on the matrix im opens an interactive viewer:

>> sdimage(im);

Interactive image plot.

The blue layer on top of the image represents the set of labels of the image data set used internally by sdimage. As in any other sddata set, each sample (pixel) has a label, which is set to "unknown" by default. We may toggle this label layer using the space bar. Additionally, we may adjust the label transparency from very transparent to opaque in the Image menu.

The three image channels are visualized as three separate image bands. We can move between the bands with the 'up' and 'down' cursor keys. Each pixel is a data sample; the figure title shows the pixel's value and class label ('unknown' by default). Because sdimage represents the image matrix by

sdimage loads the image itself if provided with a filename string:

>> sdimage('roadsign09.bmp')

6.3.2. Creating image data set objects directly

The sdimage command allows us to create an image data set directly at the Matlab prompt using the 'sddata' option:

>> a=sdimage(im,'sddata')
412160 by 3 sddata, class: 'unknown'

sdimage also accepts an image filename, attempting to load the file using imread:

>> b=sdimage('roadsign11.bmp','sddata')
412160 by 3 sddata, class: 'unknown'

The objects a and b are standard sddata sets with one sample for each pixel and three features corresponding to the R, G, and B bands, respectively. Note that the pixel values were also converted to double precision.
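The pixel-per-row layout can be illustrated in plain Matlab with reshape. The sketch below uses a tiny synthetic image in place of the output of imread. It assumes pixels are taken in Matlab's column-major order; whether sdimage orders the samples the same way internally is an assumption of this example.

```matlab
% Small synthetic RGB image standing in for the output of imread (uint8).
im = uint8(cat(3, [10 20; 30 40], [50 60; 70 80], [90 100; 110 120]));

[rows, cols, bands] = size(im);

% One row per pixel, one column per band, converted to double precision.
% Pixels are taken in column-major order (an assumption of this sketch).
X = reshape(double(im), rows*cols, bands);

disp(size(X))      % rows*cols by bands
disp(class(X))     % double
```

For a 560 by 736 RGB image, this layout gives 560*736 = 412160 rows and 3 columns, which matches the size reported for a and b above.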

6.3.3. Hand-painting class labels

The interactive sdimage figure allows us to paint class labels for image regions. To enter the 'paint' mode, open the Paint menu in the Image menu and select the Create new class command:

Painting pixel labels in an image plot

A dialog window will ask for the name of the class. Let's say we are interested in labeling the road: we provide the name of the class and paint over the corresponding image region. Via the Image menu, or by clicking the right mouse button, we can change the brush size or exit the paint mode.

Hand-painted image labels

6.3.4. Cropping images

Often, we only want to work with a smaller area of a large image. sdimage offers a crop function which makes this very quick.

Select the Crop image item in the Image menu.

Cropping multi-band images with perClass

A cross-hair will appear. Choose two corners of the region you wish to crop. The process may be terminated by clicking the right mouse button.

Cropped version of the image data set

A new sdimage figure is opened containing the data from the specified region. The cropped data contains all labels and properties of the complete image. The image size of the new data set is set to the specified region, as we can see if we save the cropped image into data set c:

>> Creating data set c in the workspace.
12840 by 3 sddata, class: 'unknown'
>> getiminfo(a)

ans = 

imsize: [560 736]

>> getiminfo(c)

ans = 

imsize: [120 108]

6.3.5. Saving image data set to workspace

The Create data set in workspace command in the Image menu lets us store the image data together with the painted labels in a new sddata object in the Matlab workspace. We are asked to provide the variable name for this new data set.

Storing our image data set with the labeled road region in the data2 variable, we will see the following message in the Matlab command window:

>> Creating data set data2 in the workspace.
412160 by 3 sddata, 2 classes: 'unknown'(403001) 'road'(9159) 

6.3.6. Working with image subsets

Image data sets preserve the image information. We may, for example, use only a subset of the data, e.g. the pixels labeled as 'road' in the data2 object above.

>> sub=data2(:,:,'road')
9159 by 3 sddata, class: 'road'

The image subset may still be visualized as an image:

>> sdimage(sub)

sdimage showing image subset

6.3.7. Creating image matrix from a data set

The data set representation of image data is useful for training pattern recognition algorithms. However, we often need to apply imaging operations, such as filtering, to our image regions. sdimage allows us to create an image matrix with pixel values using the 'matrix' option:

>> sub
9159 by 3 sddata, class: 'road'

>> I=sdimage(sub,'matrix');

>> size(I)

ans =

   560   736     3

Matrix I has the size of the original image from which the sub data was extracted. The matrix is filled with zeros, and only the pixels available in the sub data set are inserted into it.

Note that the matrix I uses double precision:

>> class(I)

ans =

double
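The construction described above (a zero matrix of the original image size with only the subset pixels filled in) can be sketched in plain Matlab. The variables imsize, pix and vals are hypothetical stand-ins for the image size, the linear pixel indices and the pixel values of a subset. This is not the sdimage implementation.

```matlab
imsize = [4 5];                       % hypothetical original image size
pix    = [6; 7; 10; 11];              % hypothetical linear indices of subset pixels
vals   = [1 2 3; 4 5 6; 7 8 9; 10 11 12];   % one row per pixel, one column per band

nbands = size(vals, 2);
npix   = prod(imsize);                % number of pixels in one band

I = zeros([imsize nbands]);           % start from an all-zero image matrix
for b = 1:nbands
    % shift the linear indices into band b and insert the pixel values
    I(pix + (b-1)*npix) = vals(:, b);
end

disp(size(I))
```

All pixels outside the subset remain zero, so the result has the full image shape even though only four pixels carry data.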

We may now perform any image processing operation, such as filtering, and bring the resulting data back into a data set format. This can be done using the linear indices stored in the 'pixel' property of the sub data set:

>> sub2=setdata(sub, I(sub.pixel))
9159 by 1 sddata, class: 'road'
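In plain Matlab, linear indices no larger than the number of pixels in one band address the first band of a three-dimensional matrix. This is consistent with the single-feature result above. The following sketch, using a made-up matrix I and a hypothetical index vector pix in place of sub.pixel, filters one band and reads the values back, and also shows one way to collect all bands by offsetting the indices.

```matlab
% Synthetic 4-by-5 image with 3 bands standing in for matrix I.
I = zeros(4,5,3);
I(:,:,1) = reshape(1:20, 4, 5);     % band 1
I(:,:,2) = 100 + I(:,:,1);          % band 2
I(:,:,3) = 200 + I(:,:,1);          % band 3

pix = [6; 7; 10; 11];               % hypothetical stand-in for sub.pixel

% Example operation: 3-by-3 mean filter applied to the first band.
band1 = conv2(I(:,:,1), ones(3)/9, 'same');

% One filtered value per subset pixel (a column vector).
v1 = band1(pix);

% To collect all bands, offset the indices by the number of pixels per band.
npix = size(I,1) * size(I,2);
vAll = [I(pix) I(pix+npix) I(pix+2*npix)];

disp(v1')
disp(vAll)
```

A matrix such as vAll, with one row per pixel and one column per band, has the shape that a multi-feature data set would need.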

6.3.8. Storing multiple images in data sets

Image data sets created from multiple images may be joined. This feature allows us to create larger training sets with pixel-level data from multiple images and train robust classifiers.

Each image data set created using sdimage contains an 'image' property (labels). If the image is loaded by providing a filename, the filename is used as its image label.

>> im1=sdimage('roadsign09.bmp','sddata')
412160 by 3 sddata, class: 'unknown'
>> im2=sdimage('roadsign11.bmp','sddata')
412160 by 3 sddata, class: 'unknown'

>> a=[im1; im2]
824320 by 3 sddata, class: 'unknown'

>> a'
824320 by 3 sddata, class: 'unknown'
sample props: 'lab'->'class' 'class'(L) 'pixel'(N) 'image'(L)
feature props: 'featlab'->'featname' 'featname'(L)
data props:  'data'(N)

>> a.image
sdlab with 824320 entries, 2 groups: 'roadsign09.bmp'(412160) 'roadsign11.bmp'(412160) 

If we create an image data set from a matrix, sdimage creates a random image label to avoid a name clash with other images.

>> im1=sdimage(imread('roadsign09.bmp'),'sddata')
412160 by 3 sddata, class: 'unknown'
>> im2=sdimage(imread('roadsign11.bmp'),'sddata')
412160 by 3 sddata, class: 'unknown'

>> a=[im1; im2]
824320 by 3 sddata, class: 'unknown'
>> a.image
sdlab with 824320 entries, 2 groups: 'image9552'(412160) 'image6571'(412160) 

Note that the image name is generated randomly and that no check for identical names is performed when concatenating image data sets. It is the responsibility of the user to make sure that different images in one data set are labeled differently.

6.3.9. Connecting sdimage and sdscatter

It is often useful to inspect the connection between image neighborhoods and the scatter plot. To visualize this connection, the sdscatter and sdimage commands can be used together.

We may simply show a data set with image data using sdscatter and then connect the sdimage plot to the scatter figure using the returned figure handle.

>> data2    %  Created data set data2 in the workspace.
412160 by 3 sddata, 2 classes: 'unknown'(399848) 'road'(12312) 

>> h=sdscatter(data2)
h =
     2

>> sdimage(data2,h); 

Connected image and scatter plot.

By moving the mouse pointer over the image, we may see where the image pixel appears in the feature space. Similarly, moving over the scatter plot shows us the corresponding pixel.

When we paint in the scatter plot, the linked image plot is updated as well. This helps us to analyze the position of specific feature space clusters in the image domain:

Painting sample labels and visualizing the result in an image.

6.3.10. Clustering image with k-means

One way to quickly group image data is to perform clustering. With the Cluster with k-means command in the Image menu, the data set underlying our image is clustered by the k-means algorithm, which considers individual pixels as separate data samples and image bands as features.

We are prompted for the desired number of clusters.

Clustering image - selecting number of clusters

We will obtain a new set of image labels called 'cluster' containing classes called 'C1','C2' etc.

Clustering image using k-means

A typical next step is to interpret the clusters. This may be done by assigning meaningful names using the Rename class command.
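To make the idea of clustering pixels by their band values concrete, here is a minimal k-means (Lloyd's algorithm) sketch in plain Matlab. It is not the perClass implementation. The data matrix X, the deterministic initialization and the iteration limit are assumptions chosen to keep the example small and reproducible.

```matlab
% Toy "image": 6 pixels (rows) with 3 band values each (columns).
X = [ 10  10  10;
      12  11   9;
      11   9  10;
     200 200 200;
     198 202 201;
     201 199 199];

k = 2;
C = X([1 4],:);   % deterministic start: pixels 1 and 4 as initial means
                  % (practical implementations usually start randomly)
for iter = 1:20
    % Squared Euclidean distance of every pixel to every cluster mean.
    D = zeros(size(X,1), k);
    for j = 1:k
        diffs = X - repmat(C(j,:), size(X,1), 1);
        D(:,j) = sum(diffs.^2, 2);
    end
    [~, lab] = min(D, [], 2);   % assign each pixel to its nearest mean

    % Recompute the means; stop when they no longer change.
    Cnew = C;
    for j = 1:k
        if any(lab == j)
            Cnew(j,:) = mean(X(lab == j,:), 1);
        end
    end
    if isequal(Cnew, C), break; end
    C = Cnew;
end

% Cluster names in the 'C1', 'C2', ... style mentioned above.
names = arrayfun(@(j) sprintf('C%d', j), lab, 'UniformOutput', false);
disp(names')
```

Each pixel ends up with one cluster name, which plays the role of a new label for that sample.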

6.3.11. Defining connected components

sdimage allows us to define spatially-connected components. This allows us to quickly access individual objects or regions in an image data set. The Connected components menu is available only if the current set of labels contains two or more classes.

Defining connected components

The Connected components command processes the current set of labels. For each class, the connected components are found separately.

Connected components with sdimage

Small isolated components are joined together into a special class (called 'small objects'). This helps us to quickly remove noise. By default, objects smaller than 10 pixels are treated as small. This threshold can be changed via the first item in the Connected components menu. To keep all isolated objects separate, use the value of 1.

When we save the data set back to the Matlab workspace (by pressing the 's' key), we can see the 'object' labels. Note that, because we saved the image data while the 'object' label set was selected, the resulting data set keeps it as the current label set. Therefore, we may address it as data2.lab:
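The grouping of small components can be illustrated outside perClass. The sketch below assumes the Matlab Image Processing Toolbox is installed and uses its bwconncomp function on a made-up binary mask for a single class. The connectivity (8-connected, the bwconncomp default for 2-D images) and the threshold minSize are assumptions of the example and need not match what sdimage does.

```matlab
% Binary mask of one class (1 = pixel belongs to the class).
BW = logical([1 1 0 0 0;
              1 1 0 0 1;
              0 0 0 0 0;
              0 1 0 0 0]);

CC = bwconncomp(BW);                        % 8-connected components by default
sizes = cellfun(@numel, CC.PixelIdxList);   % number of pixels per component

minSize = 2;                                % hypothetical size threshold
isSmall = sizes < minSize;                  % components to group as 'small objects'

fprintf('%d components, %d of them small\n', CC.NumObjects, nnz(isSmall));
```

Components flagged in isSmall would share one label, while the remaining components each keep their own.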

>> Creating data set data2 in the workspace.
12210 by 3 sddata, 11 'object' groups: [242   160    66  2643   114  6481    72     4  2346    62    20]
>> data2.lab'
 ind name                size percentage
   1 C1-object1           242 ( 2.0%)
   2 C1-object2           160 ( 1.3%)
   3 C1-object3            66 ( 0.5%)
   4 C1-object4          2643 (21.6%)
   5 C1-small objects     114 ( 0.9%)
   6 C2-object1          6481 (53.1%)
   7 C2-object2            72 ( 0.6%)
   8 C2-small objects       4 ( 0.0%)
   9 C3-object1          2346 (19.2%)
  10 C3-object2            62 ( 0.5%)
  11 C3-small objects      20 ( 0.2%)

We can remove the small objects quickly with a regular expression. We simply select all classes that do not contain the 'small' substring:

>> data2(:,:,'~/small').lab'
 ind name                size percentage
   1 C1-object1           242 ( 2.0%)
   2 C1-object2           160 ( 1.3%)
   3 C1-object3            66 ( 0.5%)
   4 C1-object4          2643 (21.9%)
   5 C2-object1          6481 (53.7%)
   6 C2-object2            72 ( 0.6%)
   7 C3-object1          2346 (19.4%)
   8 C3-object2            62 ( 0.5%)
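The effect of this selection can be reproduced on the class names and sizes listed above using the standard Matlab regexp function. This sketch only mimics the name filtering and the recomputed percentages. It does not use or describe the perClass '~/small' syntax itself.

```matlab
names = {'C1-object1','C1-object2','C1-object3','C1-object4','C1-small objects', ...
         'C2-object1','C2-object2','C2-small objects', ...
         'C3-object1','C3-object2','C3-small objects'};
sizes = [242 160 66 2643 114 6481 72 4 2346 62 20];

% Keep every class whose name does NOT match the pattern 'small'.
keep = cellfun(@isempty, regexp(names, 'small', 'once'));

keptNames = names(keep);
keptSizes = sizes(keep);
pct = 100 * keptSizes / sum(keptSizes);   % percentages relative to the kept samples

for i = 1:numel(keptNames)
    fprintf('%4d %-18s %6d (%4.1f%%)\n', i, keptNames{i}, keptSizes(i), pct(i));
end
```

Because the percentages are computed relative to the remaining 12072 samples, they differ slightly from those in the full listing.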