
Commit

Updates
Alexander März committed Aug 10, 2023
1 parent 890844b commit 5401417
Showing 3 changed files with 15 additions and 8 deletions.
17 changes: 11 additions & 6 deletions docs/dgbm.md
# Introduction

The development of modelling approaches that approximate and describe the data generating processes underlying the observed data in as much detail as possible is a guiding principle in both statistics and machine learning. We therefore strongly agree with the statement of Hothorn et al. (2014) that *"the ultimate goal of any regression analysis is to obtain information about the entire conditional distribution $F_{Y}(y|\mathbf{x})$ of a response given a set of explanatory variables"*.

Until recently, however, most regression models focused on estimating the conditional mean $\mathbb{E}(Y|\mathbf{X} = \mathbf{x})$ only, implicitly treating higher moments of the conditional distribution $F_{Y}(y|\mathbf{x})$ as fixed nuisance parameters. Models that minimize an $\ell_{2}$-type loss for the conditional mean therefore cannot fully exploit the information contained in the data, since this is equivalent to assuming a Normal distribution with constant variance. In real-world situations, however, the data generating process is usually less well behaved, exhibiting characteristics such as heteroskedasticity, varying degrees of skewness and kurtosis, or intermittent and sporadic behaviour. In recent years, there has been a clear shift in both academic and corporate research toward modelling the entire conditional distribution. This change in attention is most evident in the recent M5 forecasting competition (Makridakis et al., 2022a,b), which differed from previous ones in that it consisted of two parallel competitions: in addition to providing accurate point forecasts, participants were also asked to forecast nine different quantiles to approximate the distribution of future sales.

# Distributional Gradient Boosting Machines


### Multivariate Targets

To allow for a more flexible framework that explicitly models the dependencies of a $D$-dimensional response $\mathbf{y}_{i} = (y_{i1}, \ldots, y_{iD})^{T}$, $i=1, \ldots, N$, Klein et al. (2015a) introduce a multivariate version of distributional regression. Similar to the univariate case, multivariate distributional regression relates all parameters $\theta_{k}$, $k = 1, \ldots, K$, of a multivariate density $f_{i}\big(y_{i1}, \ldots, y_{iD} \,|\, \theta_{i1}(\mathbf{x}), \ldots, \theta_{iK}(\mathbf{x})\big)$ to a set of covariates $\mathbf{x}$. A common choice for multivariate probabilistic regression is to assume a multivariate Gaussian distribution, with the density given by

\begin{equation}
f\big(\mathbf{y}|\theta_{\mathbf{x}}\big) = \frac{1}{\sqrt{(2\pi)^{D}|\Sigma_{\mathbf{x}}|}}\exp\left(-\frac{1}{2}(\mathbf{y} - \mu_{\mathbf{x}})^{T} \Sigma^{-1}_{\mathbf{x}} (\mathbf{y} - \mu_{\mathbf{x}})\right)
\end{equation}

where $\mu_{\mathbf{x}} \in \mathbb{R}^{D}$ represents a vector of conditional means and $\Sigma_{\mathbf{x}} \in \mathbb{R}^{D \times D}$ a conditional covariance matrix

\begin{equation}
\Sigma_{\mathbf{x}} =
\begin{bmatrix}
\sigma^{2}_{i1}(\mathbf{x}) & \cdots & \sigma_{i,1D}(\mathbf{x}) \\
\vdots & \ddots & \vdots \\
\sigma_{i,D1}(\mathbf{x}) & \cdots & \sigma^{2}_{iD}(\mathbf{x})
\end{bmatrix}
\end{equation}

with the variances on the diagonal and the covariances on the off-diagonal, for $i=1, \ldots, N$. Other examples include a Cholesky decomposition or a low-rank approximation of the covariance matrix. For additional details and available distributions, see März (2022a).
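
To make the covariance parametrization concrete, the following sketch illustrates how unconstrained parameter estimates can be mapped to a valid covariance matrix via a Cholesky factor and plugged into the multivariate Gaussian log-likelihood using PyTorch. This is an illustrative example only; the variable names and the exponential mapping of the diagonal are assumptions, not the library's internal implementation.

```python
import torch
from torch.distributions import MultivariateNormal

D = 3  # dimension of the response

# Unconstrained values, as they might come out of the per-parameter trees
mu = torch.randn(D)                        # conditional means
log_diag = torch.randn(D)                  # log-diagonal of the Cholesky factor
off_diag = torch.randn(D * (D - 1) // 2)   # unconstrained lower-triangular entries

# Build a lower-triangular Cholesky factor L with a positive diagonal,
# so that Sigma = L @ L.T is symmetric positive definite by construction
L = torch.diag(torch.exp(log_diag))
rows, cols = torch.tril_indices(D, D, offset=-1)
L[rows, cols] = off_diag

dist = MultivariateNormal(loc=mu, scale_tril=L)
y = torch.randn(D)                         # one observed D-dimensional response

print("log-likelihood:", dist.log_prob(y).item())
print("implied covariance matrix:\n", L @ L.T)
```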

### Normalizing Flows

Based on the complete transformation function $h=h_{J}\circ\ldots\circ h_{1}$, the density of the target $\mathbf{y}$ is obtained from the density $f_{Z}$ of the base distribution via the change-of-variables theorem

\begin{equation}
f_{Y}(\mathbf{y}) = f_{Z}\big(h(\mathbf{y})\big) \cdot \big|h^{\prime}(\mathbf{y})\big|
\end{equation}

where scaling with the Jacobian determinant $|h^{\prime}(\mathbf{y})| = |\partial h(\mathbf{y}) / \partial \mathbf{y}|$ ensures that $f_{Y}(\mathbf{y})$ is a proper density integrating to one.
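
The effect of the Jacobian scaling can be verified with a minimal PyTorch sketch that pushes a standard Normal base density through an invertible transformation; the specific transformation (an exponential map) is chosen purely for illustration and is not the flow used in the library.

```python
import torch
from torch.distributions import Normal, TransformedDistribution
from torch.distributions.transforms import ExpTransform

# Base distribution f_Z: standard Normal
base = Normal(loc=torch.tensor(0.0), scale=torch.tensor(1.0))

# Target distribution f_Y obtained as Y = exp(Z), so h(y) = log(y)
# maps back to the base space
flow = TransformedDistribution(base, [ExpTransform()])

y = torch.tensor(2.5)
log_fy = flow.log_prob(y)  # internally: base.log_prob(log(y)) + log|h'(y)|

# Manual check of the change-of-variables formula f_Y(y) = f_Z(h(y)) * |h'(y)|
manual = base.log_prob(torch.log(y)) + torch.log(torch.abs(1.0 / y))
print(log_fy.item(), manual.item())  # both values agree
```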

## Gradient Boosting Machines for Location, Scale and Shape

We draw inspiration from GAMLSS and label our model XGBoost for Location, Scale and Shape (XGBoostLSS). Despite its nominal reference to GAMLSS, our framework is designed to accommodate the modelling of a wide range of parametrizable distributions that go beyond location, scale and shape. XGBoostLSS requires the specification of a suitable distribution from which Gradients and Hessians are derived. These represent the first- and second-order partial derivatives of the log-likelihood with respect to the parameter of interest. GBMLSS are based on multi-parameter optimization, where a separate tree is grown for each parameter. Estimation of Gradients and Hessians, as well as the evaluation of the loss function, is done simultaneously for all parameters. Gradients and Hessians are derived using PyTorch's automatic differentiation capabilities. The flexibility offered by automatic differentiation allows users to easily implement novel or customized parametric distributions for which Gradients and Hessians are difficult to derive analytically. It also facilitates the use of Normalizing Flows and the addition of constraints to the loss function. To improve the convergence and stability of GBMLSS estimation, unconditional Maximum Likelihood estimates of the parameters are used as offset values. To enable a deeper understanding of the data generating process, GBMLSS also provide attribute importance and partial dependence plots using the Shapley-value approach.
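
As a sketch of how automatic differentiation can supply these quantities, the toy example below computes per-observation Gradients and (diagonal) Hessians of a Gaussian negative log-likelihood with respect to the distributional parameters using PyTorch. The parameter names and the exponential response function for the scale are illustrative assumptions, not the library's internal code.

```python
import torch

y = torch.tensor([0.3, -1.2, 0.8])            # observed responses

# Unconstrained predictions for the two distributional parameters,
# one value per observation (as produced by the per-parameter trees)
loc = torch.zeros_like(y, requires_grad=True)
log_scale = torch.zeros_like(y, requires_grad=True)

scale = torch.exp(log_scale)                   # response function keeps the scale positive
nll = -torch.distributions.Normal(loc, scale).log_prob(y).sum()

params = [loc, log_scale]
grads = torch.autograd.grad(nll, params, create_graph=True)

# Diagonal of the Hessian (second derivative of the NLL per parameter and
# observation), which is what a tree-boosting backend typically consumes
hessians = [
    torch.autograd.grad(g.sum(), p, retain_graph=True)[0]
    for g, p in zip(grads, params)
]

print(grads)
print(hessians)
```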

# References

- Nadja Klein, Thomas Kneib, Stephan Klasen, and Stefan Lang. Bayesian structured additive distributional regression for multivariate responses. Journal of the Royal Statistical Society: Series C (Applied Statistics), 64(4):569–591, 2015a.
- Nadja Klein, Thomas Kneib, and Stefan Lang. Bayesian Generalized Additive Models for Location, Scale, and Shape for Zero-Inflated and Overdispersed Count Data. Journal of the American Statistical Association, 110(509):405–419, 2015b.
- Alexander März. Multi-Target XGBoostLSS Regression. arXiv pre-print, 2022a.
- Alexander März and Thomas Kneib. Distributional Gradient Boosting Machines. arXiv pre-print, 2022b.
- Alexander März. XGBoostLSS - An extension of XGBoost to probabilistic forecasting. arXiv pre-print, 2019.
- Spyros Makridakis, Evangelos Spiliotis, and Vassilios Assimakopoulos. The M5 competition: Background, organization, and implementation. International Journal of Forecasting, 38(4):1325–1336, 2022a.
- Spyros Makridakis, Evangelos Spiliotis, Vassilios Assimakopoulos, Zhi Chen, Anil Gaba, Ilia Tsetlin, and Robert L. Winkler. The M5 uncertainty competition: Results, findings and conclusions. International Journal of Forecasting, 38(4):1365–1385, 2022b.
- R. A. Rigby and D. M. Stasinopoulos. Generalized additive models for location, scale and shape. Journal of the Royal Statistical Society: Series C (Applied Statistics), 54(3):507–554, 2005.
4 changes: 3 additions & 1 deletion docs/distributions.md
# Available Distributions
XGBoostLSS is built upon PyTorch and Pyro, enabling users to harness a diverse set of distributional families and to leverage automatic differentiation capabilities. This greatly expands the options for probabilistic modeling and uncertainty estimation and allows users to tackle complex regression tasks.

XGBoostLSS currently supports the following distributions.

| Distribution | Usage | Type | Support | Number of Parameters |
| :----------------------------------------------------------------------------------------------------------------------------------: |:------------------------: |:-------------------------------------: | :-----------------------------: | :-----------------------------: |
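
As an illustration of how one of these distributions might be plugged into a model, the following sketch assumes the documented XGBoostLSS Python interface (the `XGBoostLSS` class, a `Gaussian` distribution class, and `train`/`predict` methods); exact module paths, class names, and argument names are assumptions and may differ between versions.

```python
import numpy as np
import xgboost as xgb

# Assumed package layout; see the XGBoostLSS documentation for the exact API
from xgboostlss.model import XGBoostLSS
from xgboostlss.distributions.Gaussian import Gaussian

# Toy heteroskedastic data: the spread of y grows with the covariate
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 1))
y = rng.normal(loc=1.0, scale=0.1 + X[:, 0])

dtrain = xgb.DMatrix(X, label=y)

# One tree ensemble is grown per distributional parameter (loc and scale)
xgblss = XGBoostLSS(Gaussian())
xgblss.train({"eta": 0.1, "max_depth": 2}, dtrain, num_boost_round=50)

# Predicted distributional parameters for new observations
dtest = xgb.DMatrix(X[:5])
print(xgblss.predict(dtest, pred_type="parameters"))
```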
2 changes: 1 addition & 1 deletion mkdocs.yml
plugins:
show_submodules: true

nav:
- Overview: index.md
- Distributional Modelling: dgbm.md
- Available Distributions: distributions.md
- Examples:
