forked from gbm-developers/gbm
-
Notifications
You must be signed in to change notification settings - Fork 1
/
CHANGES
341 lines (236 loc) · 13.7 KB
/
CHANGES
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
Changes in version 2.1
- The cross-validation loop is now parallelized. The functions attempt
to guess a sensible number of cores to use, or the user can specify
how many through new argument n.cores.
- A fair amount of code refactoring.
- Added type='response' for predict when distribution='adaboost'.
- Fixed a bug that caused offset not to be used if the first element
of offset was 0.
- Updated predict.gbm and plot.gbm to cope with objects created using
gbm version 1.6.
- Changed default value of verbose to 'CV'. gbm now defaults to letting
the user know which block of CV folds it is running. If verbose=TRUE
is specified, the final run of the model also prints its progress
to screen as in earlier versions.
- Fixed bug that caused predict to return wrong result when
distribution == 'multinomial' and length(n.trees) > 1.
- Fixed bug that caused n.trees to be wrong in relative.influence
if no CV or validation set was used.
- Relative influence was computed wrongly when distribution="multinomial". Fixed.
- Cross-validation predictions now included in the output object.
- Fixed bug in relative.influence that caused labels to be wrong when sort.=TRUE.
- Modified interact.gbm to do additional sanity check, updated help file
- Fixed bug in interact.gbm so that it now works for distribution="multinomial"
- Modified predict.gbm to improve performance on large datasets
Changes in version 2.0
Lots of new features added so it warrants a change to the first digit
of the version number.
Major changes:
- Several new distributions are now available thanks to Harry Southworth
and Daniel Edwards: multinomial and tdist.
- New distribution 'pairwise' for Learning to Rank Applications (LambdaMART),
including four different ranking measures, thanks to Stefan Schroedl.
- The gbm package is now managed on R-Forge by Greg Ridgeway and
Harry Southworth. Visit http://r-forge.r-project.org/projects/gbm/
to get the latest or to contribute to the package
Minor changes:
- the "quantile" distribution now handles weighted data
- relative.influence changed to give names to the returned vector
- Added print.gbm and show.gbm. These give basic summaries of the fitted
model
- Added support function and reconstructGBMdata() to facilitate reconstituting
the data for certain plots and summaries
- gbm was not using the weights when using cross-validation due to
a bug. That's been fixed (Thanks to Trevor Hastie for catching this)
- predict.gbm now tries to guess the number of trees, also defaults to using the training data if no newdata is given.
- relative.influence has has 2 new arguments, scale. and sort. that default to FALSE. The returned vector now has names.
- gbm now tries to guess what distribution you meant if you didn't specify.
- gbm has a new argument, class.stratifiy.cv, to control if cross-validation is stratified by class with distribution is "bernoulli" or "multinomial". Defaults to TRUE for multinomial, FALSE for bernoulli. The purpose is to avoid unusable training sets.
- gbm.perf now puts a vertical line at the best number of trees when method = "cv" or "test". Tries to guess what method you meant if you don't tell it.
- .First.lib had a bug that would crash gbm if gbm was installed as a local library. Fixed.
- plot.gbm has a new argument, type, defaulting to "link". For bernoulli, multinomial, poisson, "response" is allowed.
- models with large interactions (>24) were using up all the terminal nodes in the stack. The stack has been increased to 101 nodes allowing interaction.depth up to 49. A more graceful error is now issued if interaction.depth exceeds 49. (Thanks to Tom Dietterich for catching this).
- gbm now uses the R macro R_NaN in the C++ code rather than NAN, which would not compile on Sun OS.
- If covariates marked missing values with NaN instead of NA, the model fit would not be consistent (Thanks to JR Lockwood for noting this)
Changes in version 1.6
- Quantile regression is now available thanks to a contribution from
Brian Kriegler. Use list(name="quantile",alpha=0.05) as the
distribution parameter to construct a predictor of the 5% of the
conditional distribution
- gbm() now stores cv.folds in the returned gbm object
- Added a normalize parameter to summary.gbm that allows one to
choose whether or not to normalize the variable influence to sum to
100 or not
- Corrected a minor bug in plot.gbm that put the wrong variable
label on the x axis when plotting a numeric variable and a factor
variable
- the C function gbm_plot can now handle missing values. This does
not effect the R function plot.gbm(), but it makes gbm_plot
potentially more useful for computing partial dependence plots
- mgcv is no longer a required package, but the splines package is
needed for calibrate.plot()
- minor changes for compatibility with R 2.6.0 (thanks to Seth
Falcon)
- corrected a bug in the cox model computation when all terminal
nodes had exactly the minimum number of observations permitted,
which caused gbm and R to crash ungracefully. This was likely to
occur with small datasets (thanks to Brian Ring)
- corrected a bug in Laplace that always made the terminal node
predictions slightly larger than the median. Corrected again in a
minor release due to a bug caught by Jon McAuliffe
- corrected a bug in interact.gbm that caused it to crash for factors.
Caught by David Carslaw
- added a plot of cross-validated error to the plots generated by gbm.perf
Changes in version 1.5
- gbm would fail if there was only one x. Now drop=FALSE is set in all
data.frame subsetting (thanks to Gregg Keller for noticing this).
- Corrected gbm.perf() to check if bag.fraction=1 and skips trying to
create the OOB plots and estimates.
- Corrected a typo in the vignette specifying the gradient for the Cox
model.
- Fixed the OOB-reps.R demo. For non-Gaussian cases it was maximizing the
deviance rather than minimizing.
- Increased the largest factor variable allowed from 256 levels to 1024
levels. gbm stops if any factor variable exceeds 1024. Will try to make this
cleaner in the future.
- predict.gbm now allows n.trees to be a vector and efficiently computes
predictions for each indicated model. Avoids having to call predict.gbm
several times for different choices of n.trees.
- fixed a bug that occurred when using cross-validation for coxph. Was
computing length(y) when y is a Surv object which return 2*N rather than N.
This generated out-of-range indices for the training dataset.
- Changed the method for extracting the name of the outcome variable to work
around a change in terms.formula() when using "." in formulas.
Changes in version 1.4
- The formula interface now allows for "-x" to indicate not including certain
variables in the model fit.
- Fixed the formula interface to allow offset(). The offset argument has now
been removed from gbm().
- Added basehaz.gbm that computes the Breslow estimate of the baseline hazard.
At a later stage this will be substituted with a call to survfit, which is
much more general handling not only left-censored data.
- OOB estimator is known to be conservative. A warning is now issued when
using method="OOB" and there is no longer a default method for gbm.perf()
- cv.folds now an option to gbm and method="cv" is an option for gbm.perf.
Performs v-fold cross validation for estimating the optimal number of
iterations
- There is now a package vignette with details on the user options and the
mathematics behind the gbm engine.
Changes in version 1.3
- All likelihood based loss functions are now in terms of Deviance (-2*log
likelihood). As a result, gbm always minimizes the loss. Previous versions
minimized losses for some choices of distribution and maximized a likelihood
for other choices.
- Fixed the Poisson regression to avoid predicting +/- infinity which occurs
when a terminal node has only observations with y=0. The largest predicted
value is now +/-19, similar to what glm predicts for these extreme cases for
linear Poisson regression. The shrinkage factor will be applied to the -19
predictions so it will take 1/shrinkage gbm iterations locating pure terminal
nodes before gbm would actually return a predicted value of +/-19.
- Introduces shrink.gbm.pred() that does a lasso-style variable selection
Consider this function as still in an experimental phase.
- Bug fix in plot.gbm
- All calls to ISNAN now call ISNA (avoids using isnan)
Changes in version 1.2
- fixed gbm.object help file and updated the function to check for missing
values to the latest R standard.
- gbm.plot now allows i.var to be the names of the variables to plot or the
index of the variables used
- gbm now requires "stats" package into which "modreg" has been merged
- documentation for predict.gbm corrected
Changes in version 1.1
- all calculations of loss functions now compute averages rather than totals.
That is, all performance measures (text of progress, gbm.perf) now report
average log-likelihood rather than total log-likelihood (e.g. mean squared
error rather than sum of squared error). A slight exception applies to
distribution="coxph". For these models the averaging pertains only to the
uncensored observations. The denominator is sum(w[i]*delta[i]) rather than the
usual sum(w[i]).
- summary.gbm now has an experimental "method" argument. The default computes
the relative influence as before. The option "method=permutation.test.gbm"
performs a permutation test for the relative influence. Give it a try and let
me know how it works. It currently is not implemented for "distribution=coxph".
- added gbm.fit, a function that avoids the model.frame call, which is
tragically slow with lots of variables. gbm is now just a formula/model.frame
wrapper for the gbm.fit function. (based on a suggestion and code from Jim
Garrett)
- corrected a bug in the use of offsets. Now the user must pass the offset
vector with the offset argument rather than in the formula. Previously, offsets
were being used once as offsets and a second time as a predictor.
- predict.gbm now has a single.tree option. When set to TRUE the function will
return predictions from only that tree. The idea is that this may be useful for
reweighting the trees using a post-model fit adjustment.
- corrected a bug in CPoisson::BagImprovement that incorrectly computed the
bagged estimate of improvement
- corrected a bug for distribution="coxph" in gbm() and gbm.more(). If there
was a single predictor the functions would drop the unused array dimension issuing
an error.
- corrected gbm() distribution="coxph" when train.fraction=1.0. The program would
set two non-existant observations in the validation set and issue a warning.
- if a predictor variable has no variation a warning (rather than an error) is
now issued
- updated the documentation for calibrate.plot to match the implementation
- changed the some of the default values in gbm(), bag.fraction=0.5,
train.fraction=1.0, and shrinkage=0.001.
- corrected a bug in predict.gbm. The C code producing the predictions would
go into an infinite loop if predicting an observation with a level of a
categorical variable not seen in the training dataset. Now the routine uses
the missing value prediction. (Feng Zeng)
- added a "type" parameter to predict.gbm. The default ("link") is the same as
before, predictions are on the canonical scale (gradient scale). The new
option ("response") converts back the same scale as the outcome (probability
for bernoulli, mean for gaussian, etc.).
- gbm and gbm.more now have verbose options which can be set to FALSE to
suppress the progress and performance indicators. (several users requested
this nice feature)
- gbm.perf no longer prints out verbose information about the best iteration
estimate. It simply returns the estimate and creates the plots if requested.
- ISNAN, since R 1.8.0, R.h changed declarations for ISNAN(). These changes
broke gbm 1.0. I added the following code to buildinfo.h to fix this
#ifdef IEEE_754
#undef ISNAN
#define ISNAN(x) R_IsNaNorNA(x)
#endif
seems to work now but I'll look for a more elegant solution.
Changes in version 0.8
- Additional documentation about the loss functions, graphics, and methods is
now available with the package
- Fixed the initial value for the adaboost exponential loss. Prior
to version 0.8 the initial value was 0.0, now half the baseline
log-odds
- Changes in some headers and #define's to compile under gcc 3.2
(Brian Ripley)
Changes in version 0.7
- gbm.perf, the argument named best.iter.calc has been renamed
"method" for greater simplicity
- all entries in the design matrix are now coerced to doubles
(Thanks to Bonnie Ghosh)
- now checks that all predictors are either numeric, ordinal, or
factor
- summary.gbm now reports the correct relative influence when some
variables do not enter the model. (Thanks to Hugh Chipman)
- renamed several #define'd variables in buildinfo.h so they do
not conflict with standard winerror.h names.
Planned future changes
1. Add weighted median functionality to Laplace
2. Automate the fitting process, ie, selecting shrinkage and
number of iterations
3. Add overlay factor*continuous predictor plot as an option
rather than lattice plots
4. Add multinomial and ordered logistic regression procedures
Thanks to
RAND for sponsoring the development of this software through
statistical methods funding.
Kurt Hornik, Brian Ripley, and Jan De Leeuw for helping me get gbm up to the R
standard and into CRAN.
Dan McCaffrey for testing and evangelizing the utility of this
program.
Bonnie Ghosh for finding bugs.
Arnab Mukherji for testing and suggesting new features.
Daniela Golinelli for finding bugs and marrying me.
Andrew Morral for suggesting improvements and finding new
applications of the method in the evaluation of drug treatment
programs.
Katrin Hambarsoomians for finding bugs.
Hugh Chipman for finding bugs.
Jim Garrett for many suggestions and contributions.