2009-02-25

bucket cross-validation? subject-wise k-fold cross-validation?

by Forrest Sheng Bao http://fsbao.net

This is an article about machine learning and biomedical engineering. You will need MATLAB and the MATLAB Neural Network Toolbox to run my demo code. A simpler official example of using the MATLAB NN Toolbox is here: http://www.mathworks.com/access/helpdesk/help/toolbox/nnet/radial_4.html#8370

I am currently doing a special kind of k-fold cross-validation. I have 12 different sample sources, which are actually 12 persons. From each person, I have thousands of EEG samples. So I do a 12-fold cross-validation in which the samples from one person serve as the test samples and the samples from the remaining 11 persons serve as the training samples for my classifier. Thus, each fold is the data from one person. I am not sure what to call this kind of CV, since it is not a regular 12-fold CV in which samples are uniformly partitioned into 12 groups; I believe it is often called leave-one-subject-out cross-validation. In my experiments, the folds have a special constraint: they are divided by the source of the EEG data.
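To show the fold-splitting logic in isolation, here is a minimal Python sketch (my own illustration, not part of the original MATLAB demo) that turns per-subject sample counts into train/test index sets, one fold per subject, the same way the `deliminator` array does in the MATLAB code below:

```python
from itertools import accumulate

def subject_folds(group_sizes):
    """Yield (train_idx, test_idx) pairs, one fold per subject/group."""
    bounds = list(accumulate(group_sizes))  # cumulative sums, e.g. [3, 5, 8, 10, 12]
    n = bounds[-1]                          # total number of samples
    lb = 0
    for ub in bounds:
        test_idx = list(range(lb, ub))      # this subject's samples are held out
        train_idx = [i for i in range(n) if i < lb or i >= ub]
        yield train_idx, test_idx
        lb = ub

# With group sizes [3, 2, 3, 2, 2], the first fold tests on samples 0..2
# and trains on samples 3..11.
for train_idx, test_idx in subject_folds([3, 2, 3, 2, 2]):
    print(test_idx)
```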

I wrote a small MATLAB script to verify that my MATLAB code has no problem. You can also consider it a very simple application of machine learning. Suppose I have many dots on a Cartesian plane. One group of them lies in the first quadrant and the other group lies in the third quadrant. Now I want my classifier to classify these two groups of dots by their coordinates, so this is a binary classification problem. I assume those dots come from 5 groups: [2 2; 3 4; 5 6], [-0.5 -0.3; -1 -1], [1 4; 2 3; 4 5], [-3 -4; -2 -2], and [5 7; 8 9]. Here is how this kind of 5-fold CV works:


% This code is released under GNU General Public License version 3 or later
% Forrest Sheng Bao http://fsbao.net Feb. 25, 2009

feature=[2 2; 3 4; 5 6; -0.5 -0.3; -1 -1; 1 4; 2 3; 4 5; -3 -4; -2 -2; 5 7; 8 9];
Tc=[1 1 1 2 2 1 1 1 2 2 1 1];
deliminator=cumsum([3 2 3 2 2]); % number of samples in each group

total_correct_rate = 0;
total_time = 0;

true_positive=0;
true_negative=0;

wrong1=zeros(1,length(Tc));
wrong2=zeros(1,length(Tc));

result=zeros(1,length(Tc));

for bucket=1:length(deliminator) % main test loop, leave one bucket's samples out
    display(bucket)
    if bucket==1
        lb = 1;
        ub = deliminator(1);
        % number of samples in this fold is ub-lb+1
        training_feature = feature(ub+1:length(feature),:);
        training_target = Tc(ub+1:length(feature));
    elseif bucket < length(deliminator)
        lb = deliminator(bucket-1)+1;
        ub = deliminator(bucket);
        % number of samples in this fold is ub-lb+1
        training_feature = [feature(1:lb-1,:); feature(ub+1:length(feature),:)];
        training_target = [Tc(1:lb-1) Tc(ub+1:length(feature))];
    else % bucket == length(deliminator), the last fold
        lb = deliminator(bucket-1)+1;
        ub = deliminator(bucket);
        % number of samples in this fold is ub-lb+1
        training_feature = feature(1:lb-1,:);
        training_target = Tc(1:lb-1);
    end
    test_feature = feature(lb:ub,:);
    test_target = Tc(lb:ub);

    % transpose so that each column is one sample, as the NN Toolbox expects
    P = training_feature';
    test_feature = test_feature';

    T = ind2vec(training_target);

    net = newpnn(P,T); % train a probabilistic neural network
    Y = sim(net,test_feature);

    Yc = vec2ind(Y);
    correct=0;
    for x=1:length(Yc)
        Loo = lb+x-1; % index of this sample in the full data set
        if Yc(x)==test_target(x)
            correct=correct+1;
            if Yc(x)==1
                true_positive=true_positive+1;
                result(Loo)=1;
            else
                true_negative=true_negative+1;
                result(Loo)=-1;
            end
        else
            if Yc(x)==1
                wrong1(Loo)=1;
                result(Loo)=1;
            else
                wrong2(Loo)=1;
                result(Loo)=-1;
            end
        end
    end % end of for
    % sprintf('Accuracy of this test loop is %f', correct/length(Yc))
    total_correct_rate = total_correct_rate + correct;
end
total_correct_rate = total_correct_rate/length(Tc)
The final accuracy is 91.67%, i.e., 11/12, since exactly one of the 12 samples is misclassified.

Here is another version, because I found that the MATLAB NN Toolbox consumes an extremely large amount of memory if you feed in all the test samples as one matrix. So I replaced the single sim call with a FOR loop that simulates one test sample at a time.

% reset the counters before re-running
total_correct_rate = 0;
true_positive=0; true_negative=0;
wrong1=zeros(1,length(Tc)); wrong2=zeros(1,length(Tc));
result=zeros(1,length(Tc));

for bucket=1:length(deliminator) % main test loop, leave one bucket's samples out
    display(bucket)
    if bucket==1
        lb = 1;
        ub = deliminator(1);
        % number of samples in this fold is ub-lb+1
        training_feature = feature(ub+1:length(feature),:);
        training_target = Tc(ub+1:length(feature));
    elseif bucket < length(deliminator)
        lb = deliminator(bucket-1)+1;
        ub = deliminator(bucket);
        % number of samples in this fold is ub-lb+1
        training_feature = [feature(1:lb-1,:); feature(ub+1:length(feature),:)];
        training_target = [Tc(1:lb-1) Tc(ub+1:length(feature))];
    else % bucket == length(deliminator), the last fold
        lb = deliminator(bucket-1)+1;
        ub = deliminator(bucket);
        % number of samples in this fold is ub-lb+1
        training_feature = feature(1:lb-1,:);
        training_target = Tc(1:lb-1);
    end
    test_feature = feature(lb:ub,:);
    test_target = Tc(lb:ub);

    % transpose so that each column is one sample
    P = training_feature';
    test_feature = test_feature';

    T = ind2vec(training_target);

    net = newpnn(P,T);

    for y=1:length(test_target) % simulate one test sample at a time to save memory
        Y = sim(net,test_feature(:,y));
        Yc = vec2ind(Y);
        correct=0;
        if length(Yc)~=1
            display('error on dimension of Yc')
        end

        Loo = lb+y-1; % index of this sample in the full data set
        if Yc==test_target(y)
            correct=correct+1;
            if Yc==1
                true_positive=true_positive+1;
                result(Loo)=1;
            else
                true_negative=true_negative+1;
                result(Loo)=-1;
            end
        else
            if Yc==1
                wrong1(Loo)=1;
                result(Loo)=1;
            else
                wrong2(Loo)=1;
                result(Loo)=-1;
            end
        end

        total_correct_rate = total_correct_rate + correct;
    end % end of y=1:length(test_target)
end
