Selecting a random subset based on a specific set of labels
Often, we need to generate a random subset of samples using a specific set of labels. For example, in the medical problem we may want to sample not from the top-level disease/no-disease classes but from each patient or from each tissue type. randsubset lets you do just that in a simple way.
By default, randsubset samples each class separately. Running randsubset(a,500) in a medical problem may therefore pick very few samples of a specific tissue, or skip it entirely:
>> a
'medical D/ND' 6400 by 11 sddata, 3 classes: 'disease'(1495) 'no-disease'(4267) 'noise'(638)
>> b=randsubset(a,500)
'medical D/ND' 1500 by 11 sddata, 3 classes: 'disease'(500) 'no-disease'(500) 'noise'(500)
>> b.tissue' % quick overview of tissue labels with transpose operator
 ind name          size percentage
   1 disease        500 (33.3%)
   2 no-disease     300 (20.0%)
   3 healthy         63 ( 4.2%)
   4 bone            32 ( 2.1%)
   5 organ wall      75 ( 5.0%)
   6 muscle          30 ( 2.0%)
   7 unknown        500 (33.3%)
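The uneven tissue counts follow from the tissue proportions inside each class. The base-MATLAB sketch below estimates how many samples of each tissue to expect when 500 samples are drawn from the 'no-disease' class. It assumes the draw is uniform within the class, which this page does not state explicitly, and it takes the tissue sizes from the a.tissue' listing shown later in this section. It does not call randsubset.

```matlab
% Tissue sizes inside the 'no-disease' class (from the a.tissue' listing)
names = {'no-disease','healthy','bone','organ wall','muscle'};
sizes = [2613 400 231 780 243];   % sums to 4267, the size of the class
N     = 500;                      % samples drawn from the class

% Assumption: uniform sampling within the class, so each tissue is
% expected to appear in proportion to its share of the class.
expected = N * sizes / sum(sizes);

for i = 1:numel(names)
    fprintf('%-12s expected about %6.1f samples\n', names{i}, expected(i));
end
```

Under this assumption the small tissues (bone, muscle) are expected to contribute only a few dozen samples each, the same order of magnitude as the counts in the b.tissue' listing.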
Alternatively, we may sample the tissues directly using randsubset(a,'tissue',500). This command, however, raises an error because some of the tissue classes contain fewer than 500 samples:
>> b=randsubset(a,'tissue',500)
??? Error using ==> sddata.randsubset at 146
More samples requested from class 'healthy' than available. Use 'atmax' option to return at maximum N samples.
>> a.tissue'
 ind name          size percentage
   1 disease       1495 (23.4%)
   2 no-disease    2613 (40.8%)
   3 healthy        400 ( 6.2%)
   4 bone           231 ( 3.6%)
   5 organ wall     780 (12.2%)
   6 muscle         243 ( 3.8%)
   7 unknown        638 (10.0%)
The solution is to add the 'atmax' option, which ensures that randsubset returns at most 500 samples per tissue, and all available samples for tissues that have fewer:
>> b=randsubset(a,'tissue',500,'atmax')
'medical D/ND' 2874 by 11 sddata, 3 classes: 'disease'(500) 'no-disease'(1874) 'noise'(500)
>> b.tissue'
 ind name          size percentage
   1 disease        500 (17.4%)
   2 no-disease     500 (17.4%)
   3 healthy        400 (13.9%)
   4 bone           231 ( 8.0%)
   5 organ wall     500 (17.4%)
   6 muscle         243 ( 8.5%)
   7 unknown        500 (17.4%)
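The sizes in this listing can be reproduced by capping the request at the number of samples available in each tissue. The following base-MATLAB sketch shows that arithmetic and a hypothetical per-tissue draw. It is an illustration only, not the implementation of randsubset: the index ranges stand in for real samples, and drawing without replacement is an assumption.

```matlab
% Tissue sizes copied from the a.tissue' listing above
names = {'disease','no-disease','healthy','bone','organ wall','muscle','unknown'};
avail = [1495 2613 400 231 780 243 638];   % samples available per tissue
N     = 500;                               % requested samples per tissue

% 'atmax'-style cap: take N samples, or all of them when fewer are available
take = min(avail, N);

for i = 1:numel(names)
    fprintf('%-12s %4d of %4d\n', names{i}, take(i), avail(i));
end
fprintf('%-12s %4d\n', 'total', sum(take));

% Hypothetical draw: indices 1..avail(i) stand in for the samples of tissue i.
% randperm(n,k) returns k distinct indices, i.e. sampling without replacement.
idx = cell(1, numel(names));
for i = 1:numel(names)
    idx{i} = randperm(avail(i), take(i));
end
```

The capped sizes are 500, 500, 400, 231, 500, 243 and 500, which sum to 2874 and match the size of b reported above.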
