### bucket cross-validation? subject-wise k-fold cross-validation?

by Forrest Sheng Bao http://fsbao.net

This is an article about machine learning and biomedical engineering. You will need MATLAB and the MATLAB Neural Network Toolbox to run my demo code. A simpler official example of using the MATLAB NN Toolbox is here: http://www.mathworks.com/access/helpdesk/help/toolbox/nnet/radial_4.html#8370

I have recently been doing a very special kind of k-fold cross-validation. I have 12 different sample sources, which are actually 12 persons. From each person, I have thousands of EEG samples. So I do a 12-fold cross-validation in which the samples from one person serve as the test set and the samples from the remaining 11 persons serve as the training set for my classifier. Thus, each fold is the data from one person. I don't know what to call this kind of CV, since it is not a regular 12-fold CV in which samples are uniformly segmented into 12 groups. In my experiments, the folds have a special constraint: they are divided by the source of the EEG data.
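The idea can be sketched independently of any classifier: from the number of samples contributed by each subject, build fold boundaries with cumsum, then take one subject's index range as the test set and everything else as training. This is only an illustration of the index bookkeeping (the `setdiff` shortcut is my own; the demo below builds the training set by concatenation instead), using the same per-group counts as the toy example:

```matlab
% Sketch: leave-one-subject-out fold indices.
samples_per_subject = [3 2 3 2 2];      % samples contributed by each subject
bounds = cumsum(samples_per_subject);   % upper index bound of each fold
n = bounds(end);                        % total number of samples
for fold = 1:length(bounds)
    ub = bounds(fold);
    lb = ub - samples_per_subject(fold) + 1;
    test_idx  = lb:ub;                  % all samples from one subject
    train_idx = setdiff(1:n, test_idx); % samples from every other subject
end
```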

I wrote a small MATLAB script to verify that my MATLAB code has no problem. You can also consider it a very simple application of machine learning. Suppose I have many dots on a Cartesian plane. One group of them is located in the first quadrant and the other group in the third quadrant. Now I want my classifier to separate these two groups of dots by their coordinates. Thus, this is a binary classification problem. I assume the dots come from 5 groups: [2 2; 3 4; 5 6], [-0.5 -0.3; -1 -1], [1 4; 2 3; 4 5], [-3 -4; -2 -2], and [5 7; 8 9]. Here is how this kind of 5-fold CV works:

```matlab
% This code is released under GNU General Public License version 3 or later
% Forrest Sheng Bao http://fsbao.net Feb. 25, 2009
feature=[2 2; 3 4; 5 6; -0.5 -0.3; -1 -1; 1 4; 2 3; 4 5; -3 -4; -2 -2; 5 7; 8 9];
Tc=[1 1 1 2 2 1 1 1 2 2 1 1];
deliminator=cumsum([3 2 3 2 2]); % number of samples in each group
total_correct_rate = 0;
total_time = 0;
true_positive=0;
true_negative=0;
wrong1=zeros(1,length(Tc));
wrong2=zeros(1,length(Tc));
result=zeros(1,length(Tc));
for bucket=1:length(deliminator) % main test loop, leave one bucket of samples out
  display(bucket)
  if bucket==1
    lb = 1;
    ub = deliminator(1);
    % number of samples in this fold is ub-lb+1
    training_feature = feature(ub+1:length(feature),:);
    training_target = Tc(ub+1:length(feature));
  elseif bucket < length(deliminator)
    lb = deliminator(bucket-1)+1;
    ub = deliminator(bucket);
    training_feature = [feature(1:lb-1,:); feature(ub+1:length(feature),:)];
    training_target = [Tc(1:lb-1) Tc(ub+1:length(feature))];
  else
    lb = deliminator(bucket-1)+1;
    ub = deliminator(bucket);
    training_feature = feature(1:lb-1,:);
    training_target = Tc(1:lb-1);
  end
  test_feature = feature(lb:ub,:);
  test_target = Tc(lb:ub);
  % transpose: the NN Toolbox expects one sample per column
  P = training_feature';
  test_feature = test_feature';
  T = ind2vec(training_target);
  net = newpnn(P,T); % train a probabilistic neural network
  Y = sim(net,test_feature);
  Yc = vec2ind(Y);
  correct=0;
  for x=1:length(Yc)
    Loo = lb+x-1; % global index of this test sample
    if Yc(x)==test_target(x)
      correct=correct+1;
      if Yc(x)==1
        true_positive=true_positive+1;
        result(Loo)=1;
      else
        true_negative=true_negative+1;
        result(Loo)=-1;
      end
    else
      if Yc(x)==1
        wrong1(Loo)=1;
        result(Loo)=1;
      else
        wrong2(Loo)=1;
        result(Loo)=-1;
      end
    end
  end % end of for x
  % sprintf('Accuracy of this test loop is %f', correct/length(Yc))
  total_correct_rate = total_correct_rate + correct;
end
total_correct_rate = total_correct_rate/length(Tc)
```
The final accuracy is 91.67%, i.e. 11/12, since there is one misclassification.
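The script also accumulates true_positive, true_negative, wrong1 and wrong2, which the demo never reports. A minimal sketch of turning those counters into sensitivity and specificity (the counts below are hypothetical placeholders, not the demo's actual output; substitute the values accumulated by the script):

```matlab
% Hypothetical counts for illustration only.
true_positive  = 7; true_negative  = 4;  % correctly classified, per class
false_positive = 0; false_negative = 1;  % e.g. sum(wrong1) and sum(wrong2)
sensitivity = true_positive/(true_positive+false_negative);
specificity = true_negative/(true_negative+false_positive);
```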

Here is another version, because I found that the MATLAB NN Toolbox consumes an extremely large amount of memory if you feed in the test samples as one matrix. So I replaced the batch call to sim with a FOR loop that classifies one test sample at a time.

```matlab
for bucket=1:length(deliminator) % main test loop, leave one bucket of samples out
  display(bucket)
  if bucket==1
    lb = 1;
    ub = deliminator(1);
    % number of samples in this fold is ub-lb+1
    training_feature = feature(ub+1:length(feature),:);
    training_target = Tc(ub+1:length(feature));
  elseif bucket < length(deliminator)
    lb = deliminator(bucket-1)+1;
    ub = deliminator(bucket);
    training_feature = [feature(1:lb-1,:); feature(ub+1:length(feature),:)];
    training_target = [Tc(1:lb-1) Tc(ub+1:length(feature))];
  else
    lb = deliminator(bucket-1)+1;
    ub = deliminator(bucket);
    training_feature = feature(1:lb-1,:);
    training_target = Tc(1:lb-1);
  end
  test_feature = feature(lb:ub,:);
  test_target = Tc(lb:ub);
  % transpose: the NN Toolbox expects one sample per column
  P = training_feature';
  test_feature = test_feature';
  T = ind2vec(training_target);
  net = newpnn(P,T);
  for y=1:length(test_target)
    Y = sim(net,test_feature(:,y)); % classify one test sample at a time
    Yc = vec2ind(Y);
    correct=0;
    if length(Yc)~=1
      display('error on dimension of Yc')
    end
    Loo = lb+y-1; % global index of this test sample
    if Yc==test_target(y)
      correct=correct+1;
      if Yc==1
        true_positive=true_positive+1;
        result(Loo)=1;
      else
        true_negative=true_negative+1;
        result(Loo)=-1;
      end
    else
      if Yc==1
        wrong1(Loo)=1;
        result(Loo)=1;
      else
        wrong2(Loo)=1;
        result(Loo)=-1;
      end
    end
    total_correct_rate = total_correct_rate + correct;
  end % end of for y
end
```