Add support for mixture distributions
Alexander März committed Aug 25, 2023
1 parent 6661fc2 commit 7ac3bb3
Showing 1 changed file with 12 additions and 8 deletions.
20 changes: 12 additions & 8 deletions docs/examples/GaussianMixture_Regression_CaliforniaHousing.ipynb
@@ -18,7 +18,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Mixture densities or mixture distributions offer an extension to the notion of traditional univariate distributions by allowing the observed data to be thought of as arising from multiple underlying processes. In its essence, a mixture distribution is a weighted combination of several component distributions, where each component contributes to the overall mixture distribution, with the weights indicating the importance of each component. For instance, if you imagine the observed data distribution having multiple modes, a mixture of Gaussians could be employed to capture each mode with a separate Gaussian distribution. For each component of the mixture, there would be a set of parameters that depend on covariates, and additional mixing coefficients which are also modeled as a function of covariates. This is particularly useful when a single parametric distribution cannot adequately capture the underlying data generating process. A mixture distribution can be represented as follows:\n",
"Mixture densities or mixture distributions offer an extension to the notion of traditional univariate distributions by allowing the observed data to be thought of as arising from multiple underlying processes. In its essence, a mixture distribution is a weighted combination of several component distributions, where each component contributes to the overall mixture distribution, with the weights indicating the importance of each component. For instance, if you imagine the observed data distribution having multiple modes, a mixture of Gaussians could be employed to capture each mode with a separate Gaussian distribution. \n",
"\n",
"<center>\n",
"<img src=\"https://raw.githubusercontent.com/StatMixedML/XGBoostLSS/master/docs/mixture.png\" width=400/>\n",
"</center>\n",
"\n",
"For each component of the mixture, there would be a set of parameters that depend on covariates, and additional mixing coefficients which are also modeled as a function of covariates. This is particularly useful when a single parametric distribution cannot adequately capture the underlying data generating process. A mixture distribution can be represented as follows:\n",
"\n",
"\\begin{equation}\n",
"f\\bigl(y_{i} | \\boldsymbol{\\theta}_{i}(x_{i})\\bigr) = \\sum_{m=1}^{M} w_{i,m}(x_{i}) \\cdot f_{m}\\bigl(y_{i} | \\boldsymbol{\\theta}_{i,m}(x_{i})\\bigr)\n",
@@ -95,10 +101,10 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Distribution Selection\r\n",
"\r\n",
"In the following, we specify a list of candidate distributions. The function dist_select returns the negative log-likelihood of each distribution for the target variable. The distribution with the lowest negative log-likelihood is selected. The function also plots the density of the target variable and the fitted density, using the best suitable distribution among the specified ones. However, note that choosing the best performing mixture-distribution based solely on training data may lead to overfitting, since mixture-densities can approximate any distribution arbitrarily well. It is therefore crucial to carefully select the specifications to strike a balance between model complexity and generalization abilit.\r\n",
".\r\n"
"# Distribution Selection\n",
"\n",
"In the following, we specify a list of candidate distributions. The function `dist_select` returns the negative log-likelihood of each distribution for the target variable; the distribution with the lowest negative log-likelihood is selected. The function also plots the density of the target variable together with the fitted density of the best-suited distribution among the specified ones. Note, however, that choosing the best-performing mixture distribution based solely on training data may lead to overfitting, since mixture densities can approximate any distribution arbitrarily well. It is therefore crucial to choose the specification carefully, striking a balance between model complexity and generalization ability.\n"
]
},
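The selection step described in that cell can be sketched by hand: fit each candidate, score it by negative log-likelihood, and keep the minimizer. The helper names below (`select_by_nll`, `fit_gaussian`) are hypothetical illustrations, not the XGBoostLSS `dist_select` API:

```python
import math

def gaussian_nll(y, mu, sigma):
    # Negative log-likelihood of the sample y under a single Gaussian.
    n = len(y)
    return 0.5 * n * math.log(2.0 * math.pi * sigma ** 2) \
        + sum((v - mu) ** 2 for v in y) / (2.0 * sigma ** 2)

def fit_gaussian(y):
    # Moment estimates as candidate parameters.
    mu = sum(y) / len(y)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in y) / len(y))
    return mu, sigma

def select_by_nll(y, candidates):
    # candidates: name -> (fit, nll) callables; the lowest NLL wins.
    scores = {name: nll(y, *fit(y)) for name, (fit, nll) in candidates.items()}
    return min(scores, key=scores.get), scores

best, scores = select_by_nll(
    [0.1, 0.4, -0.3, 0.2],
    {"Gaussian": (fit_gaussian, gaussian_nll)},
)
```

As the cell warns, scoring flexible mixtures on training data alone favors the most complex candidate, so the comparison should be tempered by held-out data or a complexity penalty.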
{
@@ -1129,9 +1135,7 @@
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
"text": []
}
],
"source": [
