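function [docs, voc, tf, member, infrequent, nulldoc, wordHistThresh] = shrinkByInfrequent(docs, voc, tf, member, vocabSize)
%SHRINKBYINFREQUENT Shrinks corpus to a vocabulary size of vocabSize
%   Designed for RCV1 as we've formatted it.
%   Does NO sanity checking as of yet.
%
%   Arguments:
%     docs       cell array, one entry per document; each entry is a vector
%                of word indices into voc
%     voc        vocabulary, one entry per word
%     tf         term-frequency matrix, documents x words (may be sparse)
%     member     per-document matrix, one row per document
%     vocabSize  number of most frequent words to keep
%
%   Returns:
%     docs            documents with infrequent words removed and empty
%                     documents dropped (word indices are not renumbered)
%     voc             vocabulary with infrequent words removed
%     tf              tf with infrequent-word columns and empty-document
%                     rows removed
%     member          member with empty-document rows removed
%     infrequent      0/1 column vector over the original vocabulary;
%                     1 marks a removed word
%     nulldoc         indices of documents dropped because no words remained
%     wordHistThresh  frequency of the vocabSize-th most frequent word;
%                     every kept word has frequency >= this value
%
%   Authors:
%     Joshua V. Dillon; Feb. 21, 2007; jvdillon AAT purdue DDOT edu

%wordFreq = histc([docs{:}],1:length(voc))';
wordFreq = sum(tf,1);                     % total count of each word over all docs
[Y,I] = sort(full(wordFreq),'descend');

% flag every word ranked below the vocabSize most frequent ones
infrequent = zeros(length(voc),1);
infrequent( I((vocabSize+1):length(voc)) ) = 1;
wordHistThresh = Y(vocabSize);            % jvd: word freq >= this val

% jvd: remove low frequency words from consideration
I = find(infrequent);
tf(:,I) = [];
voc(I) = [];

% jvd: remove infrequent words from each document
for i=1:length(docs)
    % jvd: find all the words of this doc that are infrequent, then remove them
    docs{i}( find(infrequent(docs{i})) ) = [];
end

% jvd: remove docs with no words
nulldoc = find(sum(tf,2)==0);
docs(nulldoc) = [];
tf(nulldoc,:) = [];
member(nulldoc,:) = [];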
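
% Example usage (an illustrative sketch, not from the original author; the
% toy corpus below is hypothetical and only shows the expected input shapes):
%
%   voc    = {'a'; 'b'; 'c'; 'd'};                  % 4-word vocabulary
%   docs   = {[1 1 2], [3], [1 4 4]};               % word indices per document
%   tf     = sparse([2 1 0 0; 0 0 1 0; 1 0 0 2]);   % documents x words counts
%   member = [1; 2; 1];                             % one row per document
%   [docs, voc, tf, member, infrequent, nulldoc, thresh] = ...
%       shrinkByInfrequent(docs, voc, tf, member, 2);
%
% With these inputs the word totals are [3 1 1 2], so 'a' and 'd' are kept,
% thresh is 2, and document 2 is dropped (nulldoc is 2) because its only
% word was infrequent. Note that the indices stored in docs are not
% renumbered: the last document is still [1 4 4], even though 'd' is now
% the second entry of voc.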