06.05.2010  pavel

Selecting a random subset based on a specific set of labels

Often, we need to generate a random subset of samples based on a specific set of labels. For example, in a medical problem we may want to sample not from the top-level disease/no-disease classes but from each patient or from each tissue type. randsubset lets you do just that in a simple way.


By default, randsubset samples each class separately. In a medical problem, running randsubset(a,500) may therefore pick very few samples of a specific tissue, or skip it entirely:

>> a
'medical D/ND' 6400 by 11 sddata, 3 classes: 'disease'(1495) 'no-disease'(4267) 'noise'(638) 

>> b=randsubset(a,500)
'medical D/ND' 1500 by 11 sddata, 3 classes: 'disease'(500) 'no-disease'(500) 'noise'(500) 

>> b.tissue'   %   quick overview of tissue labels with transpose operator
 ind name             size percentage
   1 disease           500 (33.3%)
   2 no-disease        300 (20.0%)
   3 healthy            63 ( 4.2%)
   4 bone               32 ( 2.1%)
   5 organ wall         75 ( 5.0%)
   6 muscle             30 ( 2.0%)
   7 unknown           500 (33.3%)
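A rough way to see why the smaller tissues end up thinly represented is to compute how many samples of each tissue we would expect among 500 draws from the 'no-disease' class. The sketch below is plain MATLAB arithmetic, not a perClass call. It uses the full-data tissue counts listed in this post and assumes that randsubset draws uniformly without replacement within each class, which is not stated explicitly here.

```matlab
% Tissue counts inside the 'no-disease' class, copied from the a.tissue'
% listing of the full data set in this post.
names  = {'no-disease', 'healthy', 'bone', 'organ wall', 'muscle'};
counts = [2613 400 231 780 243];      % sums to 4267, the size of the class
n      = 500;                         % samples requested per class

% Assumption: uniform sampling within the class, so the expected count
% of each tissue is proportional to its share of the class.
expected = n * counts / sum(counts);

for k = 1:numel(names)
    fprintf('%-12s %6.1f\n', names{k}, expected(k));
end
```

Under this assumption we expect only about 47 'healthy' and 27 'bone' samples. A single random draw scatters around these values, so the counts in any one subset will differ somewhat.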

Alternatively, we may sample the tissues directly using randsubset(a,'tissue',500). This command raises an error, though, because some of the tissue classes contain fewer than 500 samples:

>> b=randsubset(a,'tissue',500)
??? Error using ==> sddata.randsubset at 146
More samples requested from class 'healthy' than available. Use 'atmax' option
to return at maximum N samples.

>> a.tissue'
 ind name             size percentage
   1 disease          1495 (23.4%)
   2 no-disease       2613 (40.8%)
   3 healthy           400 ( 6.2%)
   4 bone              231 ( 3.6%)
   5 organ wall        780 (12.2%)
   6 muscle            243 ( 3.8%)
   7 unknown           638 (10.0%)

The solution is to add the 'atmax' option, which makes randsubset return at most 500 samples per tissue. Tissues with fewer than 500 samples are returned in full:

>> b=randsubset(a,'tissue',500,'atmax')
'medical D/ND' 2874 by 11 sddata, 3 classes: 'disease'(500) 'no-disease'(1874) 'noise'(500) 
>> b.tissue'
 ind name             size percentage
   1 disease           500 (17.4%)
   2 no-disease        500 (17.4%)
   3 healthy           400 (13.9%)
   4 bone              231 ( 8.0%)
   5 organ wall        500 (17.4%)
   6 muscle            243 ( 8.5%)
   7 unknown           500 (17.4%)
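The "at most N per label" behaviour can be sketched in plain MATLAB as follows. This is only an illustration of the idea, not the perClass implementation. It works on a hypothetical numeric label vector rather than an sddata object, and the variable names (labels, sel, take) are made up for this example.

```matlab
% Hypothetical label vector: one tissue index (1..7) per sample, with the
% same per-tissue counts as in the a.tissue' listing above.
labels = [ones(1495,1); 2*ones(2613,1); 3*ones(400,1); 4*ones(231,1); ...
          5*ones(780,1); 6*ones(243,1); 7*ones(638,1)];
n = 500;                               % requested samples per label

ids = unique(labels);
sel = [];                              % indices of the selected samples
for k = 1:numel(ids)
    members = find(labels == ids(k));  % all samples carrying this label
    take = min(n, numel(members));     % "at most n": cap at what is available
    perm = randperm(numel(members));   % random order within this label
    sel  = [sel; members(perm(1:take))];
end

counts = accumarray(labels(sel), 1)';  % per-label counts in the subset
disp(counts)
```

The selection itself is random, but the per-label counts are not: they come out as 500, 500, 400, 231, 500, 243 and 500, for 2874 samples in total, matching the sizes reported above.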
