function [ids trn_samps tst_samps trn_zs tst_zs] = prepOneVsRest( ...
    c1labels, topics, member, ...
    maxNumCachedPerClass, numFolds, numTrainPerClass, numTestPerClass )
%PREPONEVSREST Prepares a sampling experiment.
%   Prepares a cross-validation experiment for the RCV1 corpus as we've
%   formatted it.
%
%   Arguments:
%     c1labels              regular expressions to match against topics (cell)
%     topics                list of possible topics (cell)
%     member                binary membership matrix (length(topics) rows
%                           by number of docs)
%     maxNumCachedPerClass  max number of docs per class in the pool from
%                           which we sample
%     numFolds              number of times to sample from the pool
%     numTrainPerClass      number of training docs per class per fold
%     numTestPerClass       number of testing docs per class per fold
%
%   Returns:
%     ids        corpus docs used in the sample pool
%     trn_samps  randperms used to index into the pool (training)
%     tst_samps  randperms used to index into the pool (testing)
%     trn_zs     corresponding training labels (+1 for c1, -1 otherwise)
%     tst_zs     corresponding testing labels (+1 for c1, -1 otherwise)
%
%   Authors:
%     Joshua V. Dillon; Feb. 21, 2007; jvdillon AAT purdue DDOT edu

% find the topics that match c1labels
topic_ids = find(regexplist(c1labels,topics));

% if any doc has 1 or more c1labels, z=1, else z=0
z = any(member(topic_ids,:),1);

% determine class representation
sumz = full(sum(z)); % how many c1 docs are there?
N = min([sumz,length(z)-sumz,maxNumCachedPerClass]);

% (doing this in case a future version supports asymmetric validation)
N1=N; N2=N;

% randomly find N1 class1 indices (cols in the corpus)
I = randperm(sumz);
z1 = full(find(z));
ids1 = z1( I(1:N1) );

% randomly find N2 class2 indices (cols in the corpus)
I = randperm(length(z)-sumz);
z2 = full(find(~z));
ids2 = z2( I(1:N2) );

% ids2 needs no offset here since ids1 is mutually exclusive with ids2
% (the offset is applied to the per-fold sample indices below)
ids = [ids1 ids2];

% set CV parameters based on function input
% (doing this in case a future version supports asymmetric validation)
N1trn = numTrainPerClass; N1tst = numTestPerClass;
N2trn = numTrainPerClass; N2tst = numTestPerClass;

% each fold draws disjoint train and test sets from a pool of N per class,
% so the pool must hold at least train + test docs per class
if N1trn+N1tst > N1 || N2trn+N2tst > N2
    error('prepOneVsRest:poolTooSmall', ...
        'Pool has %d docs per class; need at least %d (train + test).', ...
        N, numTrainPerClass+numTestPerClass);
end

% preallocate the outputs
trn_samps = zeros(N1trn+N2trn,numFolds);
trn_zs    = zeros(N1trn+N2trn,numFolds);
tst_samps = zeros(N1tst+N2tst,numFolds);
tst_zs    = zeros(N1tst+N2tst,numFolds);

% for each "fold" form some sample indices (which will index into the
% length(ids) corpus indices)
for ii=1:numFolds
    I = randperm(N1);
    trn1 = I(1:N1trn);
    tst1 = I((1+N1trn):(N1tst+N1trn)); % tst1 is disjoint from trn1

    I = randperm(N2);
    trn2 = I(1:N2trn);
    tst2 = I((1+N2trn):(N2tst+N2trn)); % tst2 is disjoint from trn2

    % offsetting class2 by N1 since the cached kernel is constructed from
    % ids=[ids1 ids2]. this will "key" into the cached matrix to extract a
    % training kernel and a testing kernel
    trn_samps(:,ii) = [trn1 trn2+N1]';
    tst_samps(:,ii) = [tst1 tst2+N1]';

    % labels are unpermuted: class1 rows first (+1), then class2 rows (-1)
    trn_zs(:,ii) = [ones(N1trn,1) ; -ones(N2trn,1)];
    tst_zs(:,ii) = [ones(N1tst,1) ; -ones(N2tst,1)];
end
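
% Example usage (an illustrative sketch, not part of the original code).
% Assumptions: K is a hypothetical cached kernel matrix of size
% length(ids)-by-length(ids), computed over the pooled documents in the
% order given by ids; the numeric arguments are arbitrary placeholders.
%
%   [ids, trn_samps, tst_samps, trn_zs, tst_zs] = prepOneVsRest( ...
%       c1labels, topics, member, 500, 10, 100, 100 );
%   for ii = 1:size(trn_samps,2)
%       Ktrn = K(trn_samps(:,ii), trn_samps(:,ii)); % train-vs-train block
%       Ktst = K(tst_samps(:,ii), trn_samps(:,ii)); % test-vs-train block
%       ztrn = trn_zs(:,ii);                        % +1/-1 training labels
%       ztst = tst_zs(:,ii);                        % +1/-1 testing labels
%   end
%
% With these placeholder values each class needs at least 200 pooled docs
% (100 train + 100 test), since train and test are drawn without overlap.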