<<introduction>> The field of machine learning offers many potent models for inference. Unfortunately, simply optimizing how well these models perform on a fixed training sample often leads to relatively poor performance on new test data compared to models that fit the training data less well. Regularization schemes are used to constrain the fitted model to improve performance on new data.
One popular regularization tactic is to corrupt the training data with independently sampled noise. This constrains the model to work on data that is different from the original training data in a way that does not change the correct inference. Sietsma and Dow demonstrated that adding Gaussian noise to the inputs improved the generalization of neural networks cite:sietsma1991creating. More recently, Srivastava et al. showed that setting a random collection of layer inputs of a neural network to zero for each training example greatly improved model test performance cite:srivastava2014dropout.
Both of these types of stochastic regularization have been shown to be
roughly interpretable as types of weight penalization, similar to
traditional statistical shrinkage techniques. Bishop showed that
using a small amount of additive noise is approximately a form of
generalized Tikhonov regularization cite:bishop1995training. Van der
Maaten et al. showed that dropout and several other types of sampled
noise can be replaced in linear models with modified loss functions
that have the same effect cite:van2013learning. Similarly, Wager et
al. showed that, for generalized linear models, using dropout is
approximately equivalent to an adaptive form of $L_2$ (ridge)
penalization of the model coefficients.
As noted by Goodfellow et al., corrupting noise can be viewed as a form of dataset augmentation cite:goodfellow2016deep. Traditional data augmentation seeks to transform training points in ways that may drastically alter the point but minimally change the correct inference; corruption, by contrast, generally makes the correct inference more ambiguous. Effective data augmentation often requires domain-specific knowledge, but it also tends to be much more effective than corruption, presumably because it prepares models for data similar to that which they may actually encounter. For example, DropConnect is a stochastic corruption method that is similar to dropout except that it randomly sets neural network weights, rather than inputs, to zero. Wan et al. showed that DropConnect (and dropout) could be used to reduce the error of a neural network on the MNIST cite:lecun1998gradient digits benchmark by roughly 20 percent, but using only traditional augmentation they were able to reduce the error by roughly 70 percent cite:wan2013regularization. Since corruption seems to be a less effective regularizer than traditional data augmentation, we improved dropout by modifying it to be more closely related to the underlying data generation process.
<<thehb>> An obvious criticism of dropout as a data augmentation
scheme is that one does not usually expect to encounter randomly
zeroed features in real data, except perhaps in the “nightmare at test
time” cite:globerson2006nightmare scenario where important features
are actually anticipated to be missing. One therefore may wish to
replace some of the elements of a training point with values more
plausible than zeros. A natural solution is to sample a replacement
from the other training points. This guarantees that the replacement
arises from the correct joint distribution of the elements being
replaced. We call this scheme the hybrid bootstrap because it
produces hybrids of the training points by bootstrap
cite:efron1994introduction sampling. More formally, define
\begin{equation}
\label{dropout_def}
\tilde{\boldsymbol{x}} = \frac{1}{1-p}\,\boldsymbol{x} \circ \boldsymbol{\varepsilon},
\end{equation}
where $\boldsymbol{\varepsilon}$ is a vector of independent Bernoulli$(1-p)$ random variables, $\circ$ denotes elementwise multiplication, and $p$ is the probability that an element is dropped. The hybrid bootstrap instead replaces the dropped elements with the corresponding elements of a resampled training point:
\begin{equation}
\label{hb_def}
\boldsymbol{\overset{\overset{\large.}{\cup}}{x}} = \boldsymbol{x} \circ \boldsymbol{\varepsilon}
+ \boldsymbol{\overset{\large.}{\cup}} \circ (\boldsymbol{1} - \boldsymbol{\varepsilon}),
\end{equation}
where $\boldsymbol{\overset{\large.}{\cup}}$ denotes a point drawn, with replacement, from the training set (a bootstrap sample) and $\boldsymbol{1}$ is a vector of ones, so that each element of $\boldsymbol{x}$ is either kept or replaced by the corresponding element of the resampled point, with $p$ now the probability of replacement.
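As a concrete illustration, the hybrid bootstrap of Equation hb_def can be applied to a minibatch with a few lines of NumPy. The sketch below is ours rather than the implementation used for the experiments in this paper; for simplicity the donor points are resampled from the batch being corrupted, but they could equally be drawn from the full training set.
#+BEGIN_SRC python
import numpy as np

def hybrid_bootstrap(x, p, rng=None):
    """Corrupt a batch x (n_points x n_features) per Equation hb_def: each element
    is kept with probability 1 - p and otherwise replaced by the corresponding
    element of a point resampled with replacement."""
    rng = rng if rng is not None else np.random.default_rng()
    n = x.shape[0]
    donors = x[rng.integers(0, n, size=n)]           # bootstrap-sampled replacement points
    eps = (rng.random(x.shape) > p).astype(x.dtype)  # 1 = keep the original element
    return x * eps + donors * (1.0 - eps)

# Example: resample roughly 20% of the pixels of each flattened image.
batch = np.random.rand(128, 784).astype(np.float32)
corrupted = hybrid_bootstrap(batch, p=0.2)
#+END_SRC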
#+RESULTS[463fd966c634e1314c7b36149da43c6f09d087e1]: Get
Typically dropout is performed with the normalization given in Equation dropout_def, but we do not use that normalization for this figure because it would make the lightly corrupted images dim; we do use the normalization elsewhere for dropout. This normalization does not seem to be useful for the hybrid bootstrap. One clear difference between the hybrid bootstrap and dropout for the image data of Figure hyb_drop_visual is that the dropout-corrupted sample point remains recognizable even for corruption levels greater than 0.5, whereas the hybrid bootstrap sample, unsurprisingly, more strongly resembles the corrupting digit 0 at such levels. In general, we find that lower fractions of covariates should be resampled for the hybrid bootstrap than should be dropped in dropout.
<<outline>> In this paper, we focus on applying the hybrid
bootstrap to image classification using glspl:cnn
cite:lecun1989backpropagation in the same layerwise way dropout is
typically incorporated. The basic hybrid bootstrap is an effective
tool in its own right, but we have also developed several
refinements that improve its performance both for general
prediction purposes and particularly for image classification. In
Section choose_p, we discuss a technique for simplifying the choice
of the hyperparameter $p$. Subsequent sections consider sampling
patterns adapted to convolutional structure, performance as a
function of training set size, results on standard benchmarks, and
applications of the hybrid bootstrap to other inferential
algorithms.
<<implementation>> We fit the glspl:cnn in this paper using
backpropagation and gls:sgd with momentum. The hybrid bootstrap
requires selection of the resampling probability $p$; we discuss how
we choose it in the next section.
<<choose_p>>
#+RESULTS[56a57a04e0fe3ba07fa5e5329b84195fe6f95ee1]: Loss
The basic hybrid bootstrap requires selection of a hyperparameter $p$, the probability that each covariate is resampled. Rather than fixing $p$, we draw it from a uniform distribution on $[0, u]$ each time a point is corrupted, so that only the upper bound $u$ must be chosen. This has two advantages (a short sketch of the sampled-$p$ scheme follows the list):
- Performance is much less sensitive to the choice of $u$ than it is to the choice of $p$ (i.e., tuning is easier).
- Occasionally employing near-zero levels of corruption ensures that the model performs well on the real training data.
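The following sketch (ours; the value of $u$ is only illustrative) modifies the function above so that each point receives its own corruption level, broadcast across its features.
#+BEGIN_SRC python
import numpy as np

def hybrid_bootstrap_sampled_p(x, u=0.5, rng=None):
    """Like hybrid_bootstrap above, but each point receives its own resampling
    probability drawn from Uniform(0, u) instead of a single fixed p."""
    rng = rng if rng is not None else np.random.default_rng()
    n = x.shape[0]
    donors = x[rng.integers(0, n, size=n)]           # bootstrap-sampled replacement points
    p = rng.uniform(0.0, u, size=(n, 1))             # one corruption level per point
    eps = (rng.random(x.shape) > p).astype(x.dtype)  # broadcasts p across the features
    return x * eps + donors * (1.0 - eps)
#+END_SRC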
[[file:choosing_p.pdf]]
The first advantage is illustrated in the top panel of Figure
loss_vs_p. Clearly there are many satisfactory choices of $u$. To see
how differently corrupted points influence training, consider the
gls:sgd minibatch gradient
\begin{equation}
\frac{1}{m}\nabla_{\boldsymbol{\theta}}\sum_{i=1}^{m} L(f(\boldsymbol{x}^{(i)};\boldsymbol{\theta}), \boldsymbol{y}^{(i)}),
\end{equation}
where $m$ is the minibatch size, $L$ is the loss function, $f$ is the model output, $\boldsymbol{\theta}$ is the vector of parameters, and $(\boldsymbol{x}^{(i)}, \boldsymbol{y}^{(i)})$ are the training examples in the minibatch. By the chain rule, this gradient can be written as
\begin{equation}
\label{chain_sgd_gradient}
\frac{1}{m}\sum_{i=1}^{m} \left[ \nabla_{f(\boldsymbol{x}^{(i)};\boldsymbol{\theta})} L(f(\boldsymbol{x}^{(i)};\boldsymbol{\theta}), \boldsymbol{y}^{(i)}) \cdot \frac{D f(\boldsymbol{x}^{(i)};\boldsymbol{\theta})}{d\boldsymbol{\theta}} \right].
\end{equation}
The gradient of the loss in Equation chain_sgd_gradient is “small” when the loss is small; therefore, training examples with small losses contribute little to the minibatch gradient. As training progresses, the model tends to have relatively small losses for relatively less-corrupted training points, so less-corrupted examples contribute less to the gradient after many epochs of training. We illustrate this in Figure gradient_figure by observing the Euclidean norm of the gradient in each layer as training of our experimental architecture on 1,000 MNIST training digits progresses.
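The claim that small losses produce small gradients is easy to check for the common case of a softmax output trained with cross-entropy loss, where the gradient of the loss with respect to the model output (the logits) is the predicted probability vector minus the one-hot target. The toy verification below is our illustration, not part of the original experiments.
#+BEGIN_SRC python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

target = np.eye(10)[3]                # the true class is 3
for scale in [0.1, 1.0, 5.0]:         # increasingly confident predictions
    probs = softmax(scale * target)
    loss = -np.log(probs[3])          # cross-entropy loss for this example
    grad = probs - target             # gradient with respect to the logits
    print(f"loss={loss:.3f}  ||grad||={np.linalg.norm(grad):.3f}")
# As the loss shrinks, so does this example's contribution to the
# minibatch gradient in Equation chain_sgd_gradient.
#+END_SRC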
#+RESULTS[813ee72e7d75562717e8fb0506cb96234cbf4816]: compute_grads
Clearly low probabilities of resampling are associated with smaller gradients. This relationship is somewhat less obvious for layers far from the output because the gradient size is affected by the amount of corruption between these layers and the output.
We have no reason to suppose that the uniform distribution is optimal for sampling the hyperparameter $p$, but it has several practical advantages:
- We can easily ensure that $p$ is between zero and one.
- Uniformly distributed random numbers are readily available in most software packages.
- Using the uniform distribution ensures that values of $p$ near zero are relatively probable compared to some symmetric, hump-shaped alternatives. This is a hedge to ensure regularized networks do not perform much worse than unregularized networks. For instance, using the uniform distribution helps assure that the optimization can “get started,” whereas heavily corrupted networks can sometimes fail to improve at all.
There are other plausible substitutes, such as the Beta distribution, which we have not investigated.
<<conv_sampling>>
The hybrid bootstrap of Equation hb_def does not account for the spatial structure exploited by glspl:cnn, so we investigated whether changing the sampling pattern based on this structure would improve the hybrid bootstrap’s performance on image tasks.
#+RESULTS[3dbfc3c01fd95df7365ae1d4d2c7f6497045f22a]: sampling_visualization_data_generator
In particular, we wondered if glspl:cnn would develop redundant filters to “solve” the problem of the hybrid bootstrap since the resampling locations are chosen independently for each filter. We therefore considered using the same spatial swapping pattern for every filter, which we call the spatial grid hybrid bootstrap since pixel positions are either swapped or not. Tompson et al. considered dropping whole filters as a modified form of dropout that they call SpatialDropout (their justification is also spatial) cite:tompson2015efficient. This approach seems a little extreme in the case of the hybrid bootstrap because the whole feature map would be swapped, but perhaps it could work since the majority of feature maps will still be associated with the target class. We call this variant the channel hybrid bootstrap to avoid confusion with the spatial grid hybrid bootstrap.
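To make the three sampling patterns concrete, the hedged NumPy sketch below builds each type of mask for a batch of feature maps in (batch, height, width, channels) layout; the shapes and the fixed value of $p$ are our illustrative choices.
#+BEGIN_SRC python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 28, 28, 16))    # (batch, height, width, channels)
p = 0.2                                      # illustrative resampling probability
donors = x[rng.integers(0, 32, size=32)]     # bootstrap-sampled donor feature maps

# Basic: every element of every feature map is resampled independently.
basic = np.where(rng.random(x.shape) > p, x, donors)

# Spatial grid: one swap decision per pixel position, shared by all channels.
spatial_grid = np.where(rng.random((32, 28, 28, 1)) > p, x, donors)

# Channel: entire feature maps are either kept or swapped (cf. SpatialDropout).
channel = np.where(rng.random((32, 1, 1, 16)) > p, x, donors)
#+END_SRC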
The feature maps following regularization corresponding to these schemes are visualized in Figure sampling_visualization. It is difficult to visually distinguish the spatial grid hybrid bootstrap from the basic hybrid bootstrap even though the feature maps for the spatial grid hybrid bootstrap are all swapped at the same locations, whereas the locations for the basic hybrid bootstrap are independently chosen. This may explain their similar performance.
We compare the error rates of the three hybrid bootstrap schemes in
the top left panel of Figure sampling_correlation_figure for various
values of the corruption hyperparameter.
#+RESULTS[11f1252b5adc993237f6d8cb8120156192cfec1f]: sampling_validation
#+RESULTS[fb3d18c85f75bd548227c34e8e623f01ab52f6a8]: build_basic_and_2d_hybrid_bootstrap_networks
One possible measure of the redundancy of filters in a particular layer of a gls:cnn is the average absolute correlation between the output of the filters. We consider the median absolute correlation for 10 different initializations in the bottom panel of Figure sampling_correlation_figure. The middle two layers exhibit the pattern we expected: the spatial grid hybrid bootstrap leads to relatively small correlations between filters. However, this pattern does not hold for the first and last convolutional layer. If we attempt to reduce the initial absolute correlations of the filters with a rotation, even this pattern does not hold up.
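One way to compute such a redundancy measure is sketched below (ours; the exact pooling used to produce the figure may differ): treat each filter's activations over all images and spatial positions as one long vector and average the absolute off-diagonal entries of the resulting correlation matrix.
#+BEGIN_SRC python
import numpy as np

def mean_abs_filter_correlation(feature_maps):
    """feature_maps: array of shape (batch, height, width, channels).
    Returns the average absolute correlation between pairs of filter outputs."""
    n_filters = feature_maps.shape[-1]
    flat = feature_maps.reshape(-1, n_filters)    # (batch * height * width, channels)
    corr = np.corrcoef(flat, rowvar=False)        # channels x channels correlation matrix
    off_diagonal = np.abs(corr[~np.eye(n_filters, dtype=bool)])
    return off_diagonal.mean()
#+END_SRC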
Overall, the difference in performance between the spatial grid
hybrid bootstrap and the basic hybrid bootstrap is modest,
particularly near their optimal parameter value. We use the spatial
grid hybrid bootstrap for glspl:cnn on the basis that it seems to
perform at least as well as the basic hybrid bootstrap, and
outperforms the basic hybrid bootstrap when the corruption
hyperparameter is chosen suboptimally.
<<perf_vs_size>>
#+RESULTS[ab3559fa092d4232a3feac65bd41837b4968865d]: accuracy_vs_n_data_generator
We find the hybrid bootstrap to be particularly effective when only
a small number of training points are available. In the most
extreme case, only one training point per class exists. So-called
one-shot learning seeks to discriminate based on a single training
example. In Figure perf_vs_size_fig, we compare the performance of
dropout and the hybrid bootstrap for different training set sizes
using the same corruption hyperparameters for every training set size.
Both techniques perform remarkably well even for small dataset sizes, but the hybrid bootstrap has a clear advantage. If one considers the logloss as a measure of model performance, the hybrid bootstrap works even when only one or two examples from each class are available, whereas dropout is less effective than assigning equal odds to each class for those dataset sizes. The error rate of the network on dropout-corrupted data (shown in the top left panel of Figure perf_vs_size_fig) is quite low even though there is a large amount of dropout. This comparison is potentially unfair to dropout, as an experienced practitioner may suspect that our test architecture contains too many parameters for such a small training set before using it. Less experienced practitioners, however, must rely on cross validation, which is challenging with only one training point per class.
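For reference, the log loss (natural logarithm) of the equal-odds baseline on a ten-class problem is $-\log(1/10) \approx 2.30$; a model whose log loss exceeds this value is, by that measure, doing worse than guessing.
#+BEGIN_SRC python
import numpy as np
# Log loss of the baseline that assigns probability 1/10 to every class.
print(-np.log(1 / 10))  # approximately 2.303
#+END_SRC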
<<benchmarks>> The previous sections employed smaller versions of the MNIST training digits for the sake of speed, but clearly the hybrid bootstrap is only useful if it works for larger datasets and for data besides the MNIST digits. To evaluate the hybrid bootstrap’s performance on three standard image benchmarks, we adopt a gls:cnn architecture very similar to the glspl:wrn of Zagoruyko and Komodakis cite:Zagoruyko2016WRN with three major differences. First, they applied dropout immediately prior to certain weight layers; because their network uses skip connections, difficult corruption patterns can be bypassed through those connections, defeating the regularization. We therefore apply the hybrid bootstrap prior to each set of network blocks at a particular resolution. Second, we use 160 rather than 16 filters in the initial convolutional layer, which allows us to use the same level of hybrid bootstrap for each of the three regularization layers. Third, their training schedule halted after decreasing the learning rate three times by 80\%. Our version of the network continues to improve significantly at lower learning rates, so we decrease by the same amount five times. Our architecture is visualized in Figure benchmark_architecture.
We test this network on the CIFAR10 and CIFAR100 datasets, which
consist of RGB images with 50,000 training examples and 10,000 test
cases each and 10 and 100 classes respectively
cite:krizhevsky2009learning. We also evaluate this network on the
MNIST digits. We augment the CIFAR data with 15\% translations and
horizontal flips. We do not use data augmentation for the MNIST
digits. The images are preprocessed by centering and scaling
according to the channel-wise mean and standard deviation of the
training data. We use gls:sgd with Nesterov momentum 0.9 and start
with learning rate 0.1. The learning rate is decreased by 80\%
every 60 epochs and the network is trained for 360 epochs total.
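A minimal sketch of this learning rate schedule (the function is ours; only the constants come from the description above):
#+BEGIN_SRC python
def learning_rate(epoch, base_lr=0.1, drop=0.8, every=60):
    """Start at 0.1 and cut the learning rate by 80% every 60 epochs."""
    return base_lr * (1.0 - drop) ** (epoch // every)

# Epochs 0-59 train at 0.1, epochs 60-119 at 0.02, ..., epochs 300-359 at 3.2e-5,
# i.e. five decreases over the 360 training epochs.
#+END_SRC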
The results are given in Table benchmark_table. We attempted to use
dropout in the same position as we use the hybrid bootstrap, but
this worked very poorly. At the dropout levels we tried, the CIFAR100
error was 50.56\% and, with less dropout, 28.83\%, compared with
20.1\% using no stochastic regularization at all.
#+RESULTS[da46a2f06789207e800b767c8049030d1a786154]: wide_resnet_cifar100_bench
18.36
#+RESULTS[9fc0245db595f828d08f043a651df93545639195]: wide_resnet_cifar10_bench
3.4
#+RESULTS[bb75a19140b001ff85adef7f0b0218a9e3de7770]: wide_resnet_mnist_bench
0.3
#+RESULTS[4dc86460558ed3bb550f7cc2ccbd103d9ab82501]: wide_resnet_dropout_cifar100_bench
50.56
#+RESULTS[7a2ce1ac6464f24c197bc9cb4cb60963b9a7d874]: wide_resnet_less_dropout_cifar100_bench
28.83
#+RESULTS[74b919d88ea1c46c1a0cf290d0493f58d743c225]: wide_resnet_cifar100_bench_no_stoch_reg
20.1
#+RESULTS[0b3f3a1fd63d0993e0ede3772bb243e5496adbde]: wide_resnet_cifar10_bench_no_stoch_reg
4.13
#+RESULTS[38ce3805a006ea626b82eec1f4cb28ef111ce4d6]: wide_resnet_mnist_bench_no_stoch_reg
0.66
#+NAME: benchmark_table
#+CAPTION: Test set error rates (\%).
| Dataset  | Hybrid Bootstrap (Our Architecture) | No Stochastic Reg. (Our Architecture) | Dropout (gls:wrn 28-10) | No Stochastic Reg. (gls:wrn 28-10) |
|----------+-------------------------------------+---------------------------------------+-------------------------+------------------------------------|
| CIFAR10  | 3.4                                 | 4.13                                  | 3.89                    | 4.00                               |
| CIFAR100 | 18.36                               | 20.1                                  | 18.85                   | 19.25                              |
| MNIST    | 0.3                                 | 0.66                                  | NA                      | NA                                 |
<<other_algorithms>> The hybrid bootstrap is not only useful for glspl:cnn. It is also applicable to other inferential algorithms and can be applied without modifying their underlying code by expanding the training set in the manner of traditional data augmentation.
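For example, a training set can be expanded as sketched below, reusing the hybrid_bootstrap_sampled_p function defined in Section choose_p, and then passed to any learner unchanged; the expansion factor and corruption bound shown are illustrative, not the values used for the results that follow.
#+BEGIN_SRC python
import numpy as np

def expand_with_hybrid_bootstrap(X, y, n_copies=10, u=0.2, rng=None):
    """Stack the original data with n_copies hybrid-bootstrap-corrupted copies,
    repeating the original labels, so unmodified algorithms can train on it."""
    rng = rng if rng is not None else np.random.default_rng()
    copies = [X] + [hybrid_bootstrap_sampled_p(X, u=u, rng=rng) for _ in range(n_copies)]
    return np.vstack(copies), np.tile(y, n_copies + 1)

# The expanded data can be fed directly to, for example,
# xgboost.XGBClassifier().fit(X_big, y_big).
#+END_SRC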
#+RESULTS[9b50646efc11a2c3a2b9e3763b0b965c0afaf051]: mlp_hb
0.81
#+RESULTS[18d1d31eefa9c9efb3195006a0f9fec445ec9c6c]: mlp_drop
1.06
The multilayer perceptron is not of tremendous modern interest for
image classification, but it is still an effective model for other
tasks. Dropout is commonly used to regularize the multilayer
perceptron, but the hybrid bootstrap is even more effective. As an
example, we train a multilayer perceptron on the MNIST digits with
2 “hidden” layers of $2^{13}$ neurons each with gls:relu activations,
regularized with either dropout or the hybrid bootstrap; the hybrid
bootstrap network achieves a test error of 0.81\%, compared with
1.06\% for dropout.
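To show how the hybrid bootstrap can be dropped in wherever dropout would normally appear, here is a hedged Keras sketch of an in-batch hybrid bootstrap layer and a perceptron of the size described above. The framework, the placement of the corruption layers, the value of $u$, the optimizer settings, and the within-minibatch donors are our illustrative choices, not necessarily those used for the reported errors.
#+BEGIN_SRC python
import tensorflow as tf
from tensorflow import keras

class HybridBootstrap(keras.layers.Layer):
    """Replace a random fraction of activations with those of another example
    drawn (with replacement) from the same minibatch; identity at test time."""
    def __init__(self, u=0.2, **kwargs):
        super().__init__(**kwargs)
        self.u = u                                   # upper bound of the Uniform(0, u) level

    def call(self, x, training=False):
        if not training:
            return x
        n = tf.shape(x)[0]
        donors = tf.gather(x, tf.random.uniform([n], 0, n, dtype=tf.int32))
        p = tf.random.uniform([n, 1], 0.0, self.u)   # per-example corruption level
        keep = tf.cast(tf.random.uniform(tf.shape(x)) > p, x.dtype)
        return x * keep + donors * (1.0 - keep)

model = keras.Sequential([
    keras.Input(shape=(784,)),
    HybridBootstrap(0.2),
    keras.layers.Dense(8192, activation="relu"),
    HybridBootstrap(0.2),
    keras.layers.Dense(8192, activation="relu"),
    HybridBootstrap(0.2),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer=keras.optimizers.SGD(0.1, momentum=0.9),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
#+END_SRC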
#+RESULTS[128d6dfea7fc8ae7036ed96708d1a4eea0249af6]: accuracy_vs_expansion_generator
One of the most effective classes of prediction algorithms is that based on gradient boosted trees described by Friedman cite:friedman2001greedy. Boosted tree algorithms are not very competitive with glspl:cnn on image classification problems, but they are remarkably effective for prediction problems in general and have the same need for regularization as other nonparametric models. We use XGBoost cite:chen2016xgboost, a popular implementation of gradient boosted trees.
Vinayak and Gilad-Bachrach proposed dropping the constituent models
of the booster during training, similar to dropout
cite:pmlr-v38-korlakaivinayak15. This requires modifying the
underlying model fitting, which we have not attempted with the
hybrid bootstrap. However, if we naively generate hybrid bootstrap
data on 1,000 MNIST digits with the hyperparameters used above, the
expanded training set can be supplied to XGBoost without any change
to its code.
#+RESULTS[82484748dba8f5baa13fd4e8e3f5e23c05fa2409]: xgboost_cancer_generator
#+RESULTS[a93c78f32107ab2b560bfb2a11b8297ec62cf9e3]: titanic_data_prep
#+RESULTS[d81e7d084baf57d6a74b8d2429be530e5b6c18d8]: xgboost_titanic_generator
We also compare dropout and the hybrid bootstrap for the breast
cancer dataset where malignancy is predicted from a set of 30
features cite:street1993nuclear. XGBoost provides its own
regularization options, such as shrinkage and row and column
subsampling.
The hybrid bootstrap is an effective form of regularization. It can be applied in the same fashion as the tremendously popular dropout technique but offers superior performance. The hybrid bootstrap can easily be incorporated into other existing algorithms. Simply construct hybrid bootstrap data as we do in Section other_algorithms. Unlike other noising schemes, the hybrid bootstrap does not change the support of the data. However, the hybrid bootstrap does have some disadvantages. The hybrid bootstrap requires the choice of at least one additional hyperparameter. We have attempted to mitigate this disadvantage by sampling the hybrid bootstrap level, which makes performance less sensitive to the hyperparameter. The hybrid bootstrap performs best when the original dataset is greatly expanded. The magnitude of this disadvantage depends on the scenario in which supervised learning is being used. We think that any case where dropout is being used is a good opportunity to use the hybrid bootstrap. However, there are some cases, such as linear regression, where the hybrid bootstrap seems to offer roughly the same predictive performance as existing methods, such as ridge regression, but at a much higher computational cost. The hybrid bootstrap’s performance may depend on the basis in which the data are presented. This disadvantage is common to many algorithms. One reason we think the hybrid bootstrap works so well for neural networks is that they can create features in a new basis at each layer that can themselves be hybrid bootstrapped, so the initial basis is not as important as it may be for other algorithms.
We have given many examples of the hybrid bootstrap working, but have devoted little attention to explaining why it works. There is a close relationship between hypothesis testing and regularization. For instance, the limiting behavior of ridge regression is to drive regression coefficients to zero, a state which is a common null hypothesis. The limiting behavior of the hybrid bootstrap is to make the class (or continuous target) statistically independent of the regressors, as in a permutation test. Perhaps the hybrid bootstrap forces models to possess a weaker dependence between predictor variables and the quantity being predicted than they otherwise would. We recognize this is a vague explanation (and could be said of other forms of regularization), but we do find that the hybrid bootstrap has a lot of practical utility.
While we were writing this paper, Michael Jahrer independently used the basic hybrid bootstrap as input noise (under the alias “swap noise”) for denoising autoencoders as a component of his winning submission to the Porto Seguro Safe Driver Prediction Kaggle competition. Clearly this further establishes the utility of the hybrid bootstrap!
We have also recently learned that there are currently at least three distinct groups that have papers at various points in the publishing process concerning convex combinations of training points, which are similar to hybrid bootstrap combinations cite:convex1,convex2,convex3.
@@latex:\printglossaries@@ bibliographystyle:unsrt bibliography:refs