forked from mwgeurts/libra
-
Notifications
You must be signed in to change notification settings - Fork 0
/
csimca.m
executable file
·447 lines (415 loc) · 19.4 KB
/
csimca.m
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
function result = csimca(x,group,varargin);
%CSIMCA performs the SIMCA method. This is a classification
% method on a data matrix x with a known group structure. On each group a
% robust PCA analysis is performed. Afterwards a classification
% rule is developped to determine the assignment of new observations.
%
% Required input arguments:
% x : training data set (matrix of size n by p).
% group : column vector containing the group numbers of the training
% set x. For the group numbers, any strict positive integer is
% allowed.
%
% Optional input arguments:
% k : Is a vector with size equal to the number of groups, or otherwise 0. Represents the number
% of components to be retained in each group. (default = 0.)
% method : Indicates which classification rule is wanted. `1' results in rule (R1)
% based on the scaled orthogonal and score distances. `2' corresponds with
% (R2) based on the squared scaled orthogonal and score distances. Default is 2.
% gamma : Represents the value(s) used in the classification rule: weight gamma is given to the od's,
% weight (1-gamma) to the sd's. (default = 0.5).
% misclassif : String which indicates how to estimate the probability of
% misclassification. It can be based on the training data ('training'),
% a validation set ('valid'), or cross-validation ('cv'). Default is 'training'.
% membershipprob : Vector which contains the membership probability of each
% group (sorted by increasing group number). These values are used to
% obtain the total misclassification percentage.
% valid : If misclassif was set to 'valid', this field should contain
% the validation set (a matrix of size m by p).
% groupvalid : If misclassif was set to 'valid', this field should contain the group numbers
% of the validation set (a column vector).
% predictset : Contains a new data set (a matrix of size mp by p) from which the
% class memberships are unknown and should be predicted.
% plots : If equal to 1, one figure is created with the training data and the
% boundaries for each group. This plot is
% only available for trivariate (or smaller) data sets. For technical reasons, a maximum
% of 6 groups is allowed. Default is one.
% plotspca : If equal to one, a score diagnostic plot is
% drawn (default). If 'plots' is equal to zero, this plot is suppressed.
% See also makeplot.m
% labsd : The 'labsd' observations with largest score distance are
% labeled on the diagnostic plot. (default = 3)
% labod : The 'labod' observations with largest orthogonal distance are
% labeled on the diagnostic plot. default = 3)
%
% Options for advanced users (input comes from the program RSIMCA.m with option 'classic' = 1):
%
% weightstrain : The weights for the training data. Corresponds to the flagtrain from RSIMCA. (default = 1)
% weightsvalid : The weights for the validation data. Corresponds to the flagvalid from RSIMCA. (default = 1)
%
% I/O: result=csimca(x,group,'method',1,'misclassif','training',...
% 'membershipprob',proportions,'valid',y,'groupvalid',groupy,'plots',0);
%
% The user should only give the input arguments that have to change their default value.
% The name of the input arguments needs to be followed by their value.
% The order of the input arguments is of no importance.
%
% Examples: out=csimca(x,group,'method','1')
% out=csimca(x,group,'plots',0)
% out=csimca(x,group,'valid',y,'groupvalid',groupy)
%
% The output is a structure containing the following fields:
% result.assignedgroup : If there is a validation set, this vector contains the assigned group numbers
% for the observations of the validation set. Otherwise it contains the
% assigned group numbers of the original observations based on the discriminant rules.
% result.pca : A cell containing the results of the different PCA analysis on the training sets.
% result.method : String containing the method used to obtain
% the discriminant rules (either 1 for 'R1' or 2 for 'R2'). This
% corresponds to the input argument method.
% result.flagtrain : Observations from the training set whose score distance and/or orthogonal distance
% exceeds a certain cut-off value can be considered as outliers and receive a flag equal
% to zero. The regular observations receive a flag 1. (See also robpca.m)
% result.flagvalid : Observations from the validation set whose score distance and/or orthogonal distance
% exceeds a certain cut-off value can be considered as outliers and receive a
% flag equal to zero. The regular observations receive a flag 1.
% If there is no validation set, this field is equal to zero.
% result.grouppredict : If there is a prediction set, this vector contains the assigned group numbers
% for the observations of the prediction set.
% result.flagpredict : Observations from the new data set (predict) whose robust distance (to the center of their group)
% exceeds a certain cut-off value can be considered as overall outliers and receive a
% flag equal to zero. The regular observations receive a flag 1.
% If there is no prediction set, this field is
% equal to zero.
% result.membershipprob : A vector with the membership probabilities. If no priors are given, they are estimated
% as the proportions of observations in the training set.
% result.misclassif : String containing the method used to estimate the misclassification probabilities
% (same as the input argument misclassif)
% result.groupmisclasprob : A vector containing the misclassification probabilities for each group.
% result.avemisclasprob : Overall probability of misclassification (weighted average of the misclassification
% probabilities over all groups).
% result.class : 'CSIMCA'
% result.x : The training data set (same as the input argument x) (only in output when p<=3).
% result.group : The group numbers of the training set (same as the input argument group) (only in output when p<=3).
%
% This function is part of LIBRA: the Matlab Library for Robust Analysis,
% available at:
% http://wis.kuleuven.be/stat/robust.html
%
% Written by Karlien Vanden Branden
% Last Update: 05/07/2005
%
if nargin<2
error('There are too few input arguments.')
end
% assigning default-values
[n,p]=size(x);
if size(group,1)~=1
group=group';
end
if n ~= length(group)
error('The number of observations is not the same as the length of the group vector!')
end
g=group;
counts=tabulate(g); %contingency table (outputmatrix with 3 colums): value - number - percentage
[lev,levi,levj]=unique(g);
if ~all(counts(:,2)) %some groups have zero values, omit those groups
disp(['Warning: group(s) ', num2str(counts(counts(:,2)==0,1)'), 'are empty']);
empty=counts(counts(:,2)==0,:);
counts=counts(counts(:,2)~=0,:);
else
empty=[];
end
ng=size(counts,1);
proportions = zeros(ng,1);
y=0; %initial values of the validation data set and its groupsvector
groupy=0;
labsd = 3;
labod = 3;
counter=1;
gamma = 0.5;
k = zeros(ng,1);
weightstrain = ones(1,n);
weightsvalid = 0;
default=struct('k',k,'method',2,'gamma',0.5,'misclassif','training',...
'membershipprob',proportions,'valid',y,'groupvalid',groupy,'plots',1,'plotspca',0,'labsd',labsd,...
'labod',labod,'weightstrain',weightstrain,'weightsvalid',0,'predictset',[]);
list=fieldnames(default);
options=default;
IN=length(list);
i=1;
%reading the user's input
if nargin>2
%
%placing inputfields in array of strings
%
for j=1:nargin-2
if rem(j,2)~=0
chklist{i}=varargin{j};
i=i+1;
end
end
%
%Checking which default parameters have to be changed
% and keep them in the structure 'options'.
%
while counter<=IN
index=strmatch(list(counter,:),chklist,'exact');
if ~isempty(index) %in case of similarity
for j=1:nargin-2 %searching the index of the accompanying field
if rem(j,2)~=0 %fieldnames are placed on odd index
if strcmp(chklist{index},varargin{j})
I=j;
end
end
end
options=setfield(options,chklist{index},varargin{I+1});
index=[];
end
counter=counter+1;
end
end
%Checking gamma
gamma = options.gamma;
if gamma >1 | gamma <0
error('An inappropriate number for gamma is given. A correct value lies between 0 and 1.');
end
%Checking prior (>0 )
prior=options.membershipprob;
if size(prior,1)~=1
prior=prior';
end
epsilon=10^-4;
if sum(prior) ~=0 & (any(prior < 0) | (abs(sum(prior)-1)) > epsilon)
error('Invalid membership probabilities.')
end
if length(prior)~=ng
error('The number of membership probabilities is not the same as the number of groups.')
end
%%%%%%%%%%%%%%%%%%MAIN FUNCTION %%%%%%%%%%%%%%%%%%%%%
%Checking if a validation set is given
if strmatch(options.misclassif, 'valid','exact')
if options.valid==0
error(['The misclassification error will be estimated through a validation set',...
'but no validation set is given!'])
else
validx = options.valid;
validgroup = options.groupvalid;
if size(validx,1)~=length(validgroup)
error('The number of observations in the validation set is not the same as the length of its group vector!')
end
if size(validgroup,1)~=1
validgroup = validgroup';
end
countsvalid=tabulate(validgroup);
countsvalid=countsvalid(countsvalid(:,2)~=0,:);
if size(countsvalid,1)==1
error('The validation set must contain observations from more than one group!')
elseif any(ismember(empty,countsvalid(:,1)))
error(['Group(s) ' ,num2str(empty(ismember(empty,countsvalid(:,1)))), 'was/were empty in the original dataset.'])
end
end
if (length(options.weightsvalid) == 1) | (length(options.weightsvalid)~=size(validx,1))
options.weightsvalid = ones(size(validx,1),1);
end
elseif options.valid~=0
validx = options.valid;
validgroup = options.groupvalid;
if size(validx,1) ~= length(validgroup)
error('The number of observations in the validation set is not the same as the length of its group vector!')
end
if size(validgroup,1)~=1
validgroup = validgroup';
end
options.misclassif='valid';
countsvalid=tabulate(validgroup);
countsvalid=countsvalid(countsvalid(:,2)~=0,:);
if size(countsvalid,1)==1
error('The validation set must contain more than one group!')
elseif any(ismember(empty,countsvalid(:,1)))
error(['Group(s) ' , num2str(empty(ismember(empty,countsvalid(:,1)))), ' was/were empty in the original dataset.'])
end
if (length(options.weightsvalid) == 1) | (length(options.weightsvalid)~=size(validx,1))
options.weightsvalid = ones(size(validx,1),1);
end
end
model.counts = counts(:,2);
model.x = x;
model.group = group;
labsd = floor(max(0,min(options.labsd,n)));
labod = floor(max(0,min(options.labod,n)));
%CSIMCA:
%PRINCIPAL COMPONENT ANALYSIS
%Perform PCA on each group separately:
% a) if k is not given: decide on the optimal number of components using CV.
% b) if k is given: perform cpca with the optimal number of components.
for iClass = 1:ng
indexgroup = find(g==iClass);
groupi = x(indexgroup,:);
if options.k(iClass) == 0
disp(['A scree plot is drawn for group ',num2str(iClass),'.'])
end
model.result{iClass} = cpca(groupi,'k',options.k(iClass),'plots',options.plotspca,...
'labsd',labsd,'labod',labod);
model.flag(1,indexgroup) = model.result{iClass}.flag.all;
end
%CLASSIFICATION
%Discriminant rule based on the training set x
[odsc,sdsc] = testmodel(model,x);
finalgrouptrain = assigngroup(odsc,sdsc,options.method,ng,gamma);
if sum(prior) == 0
result1.prior = counts(:,3)'./100;
else
result1.prior = prior;
end
%Compute scaled orthogonal and scaled score distances for the validation set
if strmatch(options.misclassif,'valid','exact')
[odsc,sdsc] = testmodel(model,validx);
finalgroup = assigngroup(odsc,sdsc,options.method,ng,gamma);
elseif strmatch(options.misclassif,'cv','exact') %use cv
model.k = k;
[odsc,sdsc] = leave1out(model);
finalgroup = assigngroup(odsc,sdsc,options.method,ng,gamma);
end
switch options.misclassif
case 'valid'
[v,vi,vj]=unique(validgroup);
odscgroup = [];
sdscgroup = [];
for iClass = 1:ng
indexgroup = find(validgroup == iClass);
odscgroup = [odscgroup;odsc(indexgroup,iClass)];
sdscgroup = [sdscgroup;sdsc(indexgroup,iClass)];
end
weightsvalid=zeros(length(odscgroup),1);
weightsvalid(((odscgroup <= 1) & (sdscgroup <= 1)))=1;
for igamma = 1:length(gamma)
for iClass=1:ng
misclas(iClass)=sum((validgroup(options.weightsvalid==1)==finalgroup(options.weightsvalid==1,igamma)') & ...
(validgroup(options.weightsvalid==1)==repmat(v(iClass),1,sum(options.weightsvalid))));
ingroup(iClass) = sum((validgroup(options.weightsvalid == 1) == repmat(v(iClass),1,sum(options.weightsvalid))));
end
misclas = (1 - (misclas./ingroup));
misclasprobpergroup(igamma,:)=misclas;
misclas=misclas.*result1.prior;
misclasprob(igamma)=sum(misclas);
end
case 'training'
for igamma = 1:length(gamma)
for iClass = 1:ng
result1.misclas(iClass) = sum((group(options.weightstrain==1)==finalgrouptrain(options.weightstrain==1,igamma)')& ...
(group(options.weightstrain==1)==repmat(lev(iClass),1,sum(options.weightstrain))));
result1.ingroup(iClass) = sum((group(options.weightstrain == 1) == repmat(lev(iClass),1,sum(options.weightstrain))));
end
misclas = (1 - (result1.misclas./result1.ingroup));
misclasprobpergroup(igamma,:) = misclas;
misclas = misclas.*result1.prior;
misclasprob(igamma) = sum(misclas);
end
weightsvalid=0;%only available with validation set
finalgroup = finalgrouptrain;
case 'cv'
for igamma = 1:length(gamma)
for iClass=1:ng
misclas(iClass)=sum((group(options.weightstrain==1)==finalgroup(options.weightstrain==1,igamma)') & ...
(group(options.weightstrain==1)==repmat(lev(iClass),1,sum(options.weightstrain))));
ingroup(iClass) = sum((group(options.weightstrain == 1) == repmat(lev(iClass),1,sum(options.weightstrain))));
end
misclas = (1 - (misclas./ingroup));
misclasprobpergroup(igamma,:)=misclas;
misclas=misclas.*result1.prior;
misclasprob(igamma)=sum(misclas);
end
weightsvalid=0; %only available with validation set
end
if ~isempty(options.predictset)
[odscpredict,sdscpredict] = testmodel(model,options.predictset);
finalgrouppredict = assigngroup(odscpredict,sdscpredict,options.method,ng,gamma)';
weightspredict = max((odscpredict <= 1) & (sdscpredict <= 1),[],2)';
else
finalgrouppredict = 0;
weightspredict = 0;
end
%Output structure
result = struct('assignedgroup',{finalgroup'},'pca',{model.result},'method',options.method,...
'flagtrain',{model.flag},'flagvalid',{weightsvalid'},'grouppredict',finalgrouppredict,'flagpredict',weightspredict,...
'membershipprob',{result1.prior},'misclassif',{options.misclassif},'groupmisclasprob',{misclasprobpergroup},...
'avemisclasprob',{misclasprob},'class',{'CSIMCA'},'x',x,'group',group);
if size(x,2)>3
result=rmfield(result,{'x','group'});
end
%Plots:
try
if options.plots
makeplot(result)
end
catch %output must be given even if plots are interrupted
%> delete(gcf) to get rid of the menu
end
%---------------------------
%Leave-One-Out procedure
function [odsc,sdsc] = leave1out(model)
nClass = length(model.result);
for iClass = 1:nClass
index = 1;
indexgroup = find(model.group == iClass);
teller_if_lus = 0;
groupi = model.x(indexgroup,:);
for i = 1:model.counts(iClass)
groupia = removal(groupi,i,0);
GRes = model.result;
GRes{iClass} = cpca(groupia,'k',model.result{iClass}.k,'plots',0);
%Calculate for each class the sd and the od for the observation that was left out:
for jClass = 1:nClass
dataicentered = model.x(indexgroup(index),:)-GRes{jClass}.M;
scorei = dataicentered*GRes{jClass}.P;
dataitilde = scorei*GRes{jClass}.P';
sd(indexgroup(index),jClass) = sqrt(scorei*(diag(1./GRes{jClass}.L))*scorei');
od(indexgroup(index),jClass) = norm(dataicentered-dataitilde);
if GRes{jClass}.cutoff.od ~= 0
odsc(indexgroup(index),jClass) = od(indexgroup(index),jClass)/GRes{jClass}.cutoff.od;
else
odsc(indexgroup(index),jClass) = 0;
end
sdsc(indexgroup(index),jClass) = sd(indexgroup(index),jClass)/GRes{jClass}.cutoff.sd;
end
index = index + 1;
end
end
%--------------------------------
function [odsc,sdsc] = testmodel(model,validx);
%Apply the given model on the test data to obtain different
%orthogonal distances and score distances.
nClass = length(model.result);
n = size(validx,1);
for jClass = 1:nClass
for index = 1:n
out{jClass} = model.result{jClass};
dataicentered = validx(index,:)-out{jClass}.M;
scorei = dataicentered*out{jClass}.P;
dataitilde = scorei*out{jClass}.P';
sd(index,jClass) = sqrt(scorei*(diag(1./out{jClass}.L))*scorei');
od(index,jClass) = norm(dataicentered-dataitilde);
if out{jClass}.cutoff.od ~= 0
odsc(index,jClass) = od(index,jClass)/out{jClass}.cutoff.od;
else
odsc(index,jClass) = 0;
end
sdsc(index,jClass) = sd(index,jClass)/out{jClass}.cutoff.sd;
end
end
%-------------------------
function result = assigngroup(odsc,sdsc,method,nClass,gamma);
%Obtain the group assignments for given od's and sd's.
if method == 1
sd = sdsc;
od = odsc;
elseif method == 2
sd = sdsc.^2;
od = odsc.^2;
end
for igamma = 1:length(gamma)
tdist = gamma(igamma).*od + (1-gamma(igamma)).*sd;
for i = 1:size(od,1)
result(i,igamma) = find(tdist(i,:) == min(tdist(i,:)));
end
end