This exercise first explores sample statistics such as the mean and variance. It continues with the definition of the Gaussian distribution. It ends by applying Gaussian mixture models (GMMs) and the expectation-maximization algorithm used to optimize them.
- To get started, take a look at the src/sample_mean_corr_rhein.py file.
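Implement functions to compute the sample mean and the standard deviation. Use
$$ \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i $$
to calculate the mean and
$$ \hat{\sigma} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \hat{\mu})^2} $$
to compute the standard deviation.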
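The two statistics can be sketched in a few lines of NumPy. The function names `my_mean` and `my_std` are placeholders, not names taken from the exercise files, and the sketch assumes the $1/(n-1)$ normalization for the standard deviation. The built-in NumPy functions serve as a cross-check.

```python
import numpy as np


def my_mean(data_sample: np.ndarray) -> float:
    """Sample mean: sum of all values divided by their count."""
    return float(np.sum(data_sample) / len(data_sample))


def my_std(data_sample: np.ndarray) -> float:
    """Sample standard deviation, assuming the 1/(n-1) normalization."""
    mean = my_mean(data_sample)
    squared_diff = (data_sample - mean) ** 2
    return float(np.sqrt(np.sum(squared_diff) / (len(data_sample) - 1)))


sample = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(my_mean(sample), np.mean(sample))
# ddof=1 makes np.std use the same 1/(n-1) normalization.
print(my_std(sample), np.std(sample, ddof=1))
```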
Return to the Rhine data set. Load the data from ./data/pegel.tab. Compute the mean and standard deviation of the water level before and after the year 2000.
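Splitting the measurements by year is a matter of boolean masking. The layout of pegel.tab is not described here, so the sketch below uses synthetic stand-in arrays `years` and `levels` (hypothetical names and made-up values). It also assumes that the year 2000 itself counts as "after".

```python
import numpy as np

# Synthetic stand-in data: twelve made-up measurements per year, 1990 to 2009.
rng = np.random.default_rng(0)
years = np.repeat(np.arange(1990, 2010), 12)
levels = rng.normal(loc=300.0, scale=50.0, size=years.shape)

# Boolean masks select the measurements on each side of the split.
before = levels[years < 2000]
after = levels[years >= 2000]  # assumption: 2000 belongs to "after"

print("before 2000:", np.mean(before), np.std(before, ddof=1))
print("since 2000: ", np.mean(after), np.std(after, ddof=1))
```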
We now want to use autocorrelation to analyse the discrete-time signal of the Rhine level measurements. Implement the auto_corr function in src/sample_mean_corr_rhein.py. It should implement the engineering version without normalization and return the autocorrelation
$$ R_{xx}(k) = \sum_{t=1}^{N-|k|} x_t x_{t+|k|} $$
with the signal $x$ of length $N$ and the lag $k$. Once you have checked your auto_corr function using nox -s test, you can use np.correlate for efficiency. Plot the autocorrelation for the Rhine level measurements since 2000.
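A loop-based version of the unnormalized autocorrelation could look like the sketch below. The name `auto_corr_loop` is a placeholder. Returning all lags from $-(N-1)$ to $N-1$ in that order is an assumption, chosen so that the result can be compared directly with `np.correlate` in `"full"` mode.

```python
import numpy as np


def auto_corr_loop(x: np.ndarray) -> np.ndarray:
    """Unnormalized autocorrelation for the lags -(N-1), ..., N-1."""
    n = len(x)
    lags = np.arange(-(n - 1), n)
    result = np.zeros(len(lags))
    for pos, k in enumerate(lags):
        shift = abs(k)
        # Sum the products of all sample pairs that are `shift` steps apart.
        result[pos] = np.sum(x[: n - shift] * x[shift:])
    return result


signal = np.array([1.0, 2.0, 3.0, 4.0])
slow = auto_corr_loop(signal)
fast = np.correlate(signal, signal, mode="full")
print(slow)
print(np.allclose(slow, fast))
```

The zero-lag value sits in the middle of the output and equals the sum of squares of the signal.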
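Normalize your data via
$$ x_{\text{norm},i} = \frac{x_i - \hat{\mu}}{\hat{\sigma}} $$
for all measurements $x_i$, where $\hat{\mu}$ and $\hat{\sigma}$ are the sample mean and standard deviation. Compare the autocorrelation of the normalized measurements to that of a random signal from np.random.randn by plotting both results with plt.plot.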
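The effect of the normalization and the contrast with a random signal can be tried out on synthetic data first. In the sketch below a sine wave stands in for the periodic water-level signal. It is made up for illustration, and `normalize` is a hypothetical helper name.

```python
import numpy as np


def normalize(x: np.ndarray) -> np.ndarray:
    """Shift to zero mean and scale to unit sample standard deviation."""
    return (x - np.mean(x)) / np.std(x, ddof=1)


t = np.arange(500)
# Made-up periodic signal with a period of 50 samples.
periodic = 300.0 + 50.0 * np.sin(2.0 * np.pi * t / 50.0)
np.random.seed(0)
noise = np.random.randn(500)

periodic_norm = normalize(periodic)
noise_norm = normalize(noise)

corr_periodic = np.correlate(periodic_norm, periodic_norm, mode="full")
corr_noise = np.correlate(noise_norm, noise_norm, mode="full")

zero_lag = len(t) - 1  # index of lag 0 in the "full" output
print(corr_periodic[zero_lag + 50], corr_noise[zero_lag + 50])
```

Passing `corr_periodic` and `corr_noise` to plt.plot shows the difference: the periodic signal produces repeated peaks at multiples of its period, while the noise has a single spike at lag zero.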
- Consider the src/plot_gaussian.py module. Implement the gaussian_pdf function.
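In one dimension, the Gaussian probability density function is defined as
$$ \phi(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) $$
with the mean $\mu$ and the standard deviation $\sigma$.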
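A direct translation of the one-dimensional density into NumPy might look as follows. The argument order `(x, mu, sigma)` is an assumption, since the expected signature is not given here. Summing the density over a fine grid is a quick sanity check, because the area under the curve should be close to one.

```python
import numpy as np


def gaussian_pdf(x: np.ndarray, mu: float, sigma: float) -> np.ndarray:
    """One-dimensional Gaussian density evaluated at every entry of x."""
    norm = 1.0 / np.sqrt(2.0 * np.pi * sigma**2)
    return norm * np.exp(-0.5 * ((x - mu) / sigma) ** 2)


x = np.linspace(-8.0, 8.0, 2001)
density = gaussian_pdf(x, mu=1.0, sigma=1.5)

# Riemann sum as a rough numerical integral over the grid.
area = np.sum(density) * (x[1] - x[0])
print(area)
print(x[np.argmax(density)])
```

Changing `mu` shifts the bell along the x axis, and a larger `sigma` makes it wider and flatter.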
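- Consider the src/mixture_concepts.py module. Implement a two-dimensional Gaussian pdf following
$$ \phi(x \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^2 |\Sigma|}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right) $$
with the mean vector $\mu \in \mathbb{R}^2$ and the covariance matrix $\Sigma \in \mathbb{R}^{2 \times 2}$. Plot the result using the plt.imshow function. np.linspace and np.meshgrid will help you.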
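One way to evaluate a two-dimensional Gaussian on a regular grid is sketched below. The function name `gaussian_pdf_2d`, its signature, and the example values for the mean and covariance are assumptions made for illustration.

```python
import numpy as np


def gaussian_pdf_2d(points: np.ndarray, mu: np.ndarray, sigma: np.ndarray) -> np.ndarray:
    """Evaluate a 2D Gaussian at points of shape (..., 2)."""
    diff = points - mu
    sigma_inv = np.linalg.inv(sigma)
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu) for every point.
    quad = np.einsum("...i,ij,...j->...", diff, sigma_inv, diff)
    norm = 1.0 / np.sqrt((2.0 * np.pi) ** 2 * np.linalg.det(sigma))
    return norm * np.exp(-0.5 * quad)


axis = np.linspace(-6.0, 6.0, 241)
grid_x, grid_y = np.meshgrid(axis, axis)
grid = np.stack([grid_x, grid_y], axis=-1)  # shape (241, 241, 2)

mu = np.array([0.5, -1.0])
sigma = np.array([[1.0, 0.3], [0.3, 0.5]])
pdf = gaussian_pdf_2d(grid, mu, sigma)

# The density summed over the grid cells should be close to one.
cell_area = (axis[1] - axis[0]) ** 2
print(pdf.shape, np.sum(pdf) * cell_area)
```

The array `pdf` can be passed to plt.imshow directly. Its rows correspond to the y axis and its columns to the x axis, because that is how np.meshgrid orders the grid by default.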
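We can use bell-curve sums for classification! A Gaussian mixture model has the density
$$ f(x \mid \theta) = \sum_{k=1}^{K} \rho_k \phi(x \mid \mu_k, \Sigma_k). $$
Here $\phi(x \mid \mu_k, \Sigma_k)$ is the normal distribution with mean $\mu_k$ and covariance matrix $\Sigma_k$, and $\rho_k$ is the weight of Gaussian $k$; the weights sum to one. The parameter set $\theta$ collects all $\rho_k$, $\mu_k$, and $\Sigma_k$. After guessing an initial choice for all $\rho_k$, $\mu_k$, and $\Sigma_k$,
$$ z_{ik} = \frac{\rho_k \phi(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \rho_j \phi(x_i \mid \mu_j, \Sigma_j)} $$
tells us the probability with which point $x_i$ came from Gaussian $k$. Implement this computation in get_classification in src/mixture_concepts.py.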
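The membership probabilities can be computed by evaluating every weighted Gaussian at every point and normalizing each row. The sketch below uses the hypothetical names `gaussian_pdf` and `responsibilities`. The signature expected for get_classification is not given here, so this is not a drop-in solution.

```python
import numpy as np


def gaussian_pdf(points: np.ndarray, mu: np.ndarray, sigma: np.ndarray) -> np.ndarray:
    """Multivariate Gaussian density for points of shape (n, d)."""
    diff = points - mu
    quad = np.einsum("ni,ij,nj->n", diff, np.linalg.inv(sigma), diff)
    dim = points.shape[1]
    norm = np.sqrt((2.0 * np.pi) ** dim * np.linalg.det(sigma))
    return np.exp(-0.5 * quad) / norm


def responsibilities(points, weights, mus, sigmas) -> np.ndarray:
    """Return an array of shape (n_points, n_gaussians) whose rows sum to one."""
    weighted = np.stack(
        [w * gaussian_pdf(points, m, s) for w, m, s in zip(weights, mus, sigmas)],
        axis=1,
    )
    return weighted / np.sum(weighted, axis=1, keepdims=True)


points = np.array([[0.0, 0.0], [4.0, 4.0], [2.0, 2.0]])
weights = np.array([0.5, 0.5])
mus = np.array([[0.0, 0.0], [4.0, 4.0]])
sigmas = np.stack([np.eye(2), np.eye(2)])

z = responsibilities(points, weights, mus, sigmas)
print(z)
```

The third point lies exactly halfway between the two means, so both Gaussians claim it with equal probability.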
The np.argmax function gets you an association between the data points and the Gaussians.
Use its output to select the points which belong to each class.
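As a small illustration of this selection step, the sketch below uses a hand-written probability table `z` with one row per point and one column per Gaussian. The values are made up.

```python
import numpy as np

points = np.array([[0.1, 0.2], [3.9, 4.1], [0.3, -0.1], [4.2, 3.8]])
# Made-up membership probabilities, one row per point.
z = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.05, 0.95]])

# Index of the most likely Gaussian for every point.
labels = np.argmax(z, axis=1)

# Boolean masks pick the points of each class.
class_0 = points[labels == 0]
class_1 = points[labels == 1]
print(labels)
print(class_0)
print(class_1)
```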
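Optimizing the Gaussian parameters $\theta$ requires four steps per Gaussian and iteration:
- update $z_{ik} = \frac{\rho_k \phi(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \rho_j \phi(x_i \mid \mu_j, \Sigma_j)}$,
- update $\rho_k = \frac{n_k}{n}$,
- update $\mu_k = \frac{1}{n_k} \sum_{i=1}^{n} z_{ik} x_i$,
- update $\Sigma_k = \frac{1}{n_k} \sum_{i=1}^{n} z_{ik} (x_i - \mu_k)(x_i - \mu_k)^T$.

Above, $n$ is the number of data points and $n_k = \sum_{i=1}^{n} z_{ik}$ is the effective number of points assigned to Gaussian $k$. Implement fit_gmm using these four steps. np.expand_dims makes it possible to process multiple samples simultaneously due to broadcasting (https://numpy.org/doc/stable/user/basics.broadcasting.html).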
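The sketch below runs the standard expectation-maximization updates for a Gaussian mixture (membership probabilities, weights, means, covariances) on synthetic data. The names `fit_gmm_sketch` and `gaussian_pdf`, the signature, the fixed number of iterations, and the two made-up clusters are all assumptions for illustration. The sketch has no convergence check and no safeguard against degenerate covariances.

```python
import numpy as np


def gaussian_pdf(points, mu, sigma):
    """Multivariate Gaussian density for points of shape (n, d)."""
    diff = points - mu
    quad = np.einsum("ni,ij,nj->n", diff, np.linalg.inv(sigma), diff)
    dim = points.shape[1]
    return np.exp(-0.5 * quad) / np.sqrt((2.0 * np.pi) ** dim * np.linalg.det(sigma))


def fit_gmm_sketch(points, mus, sigmas, weights, steps=50):
    """Run a fixed number of EM iterations and return the fitted parameters."""
    for _ in range(steps):
        # Step 1: membership probabilities z with shape (n, K).
        weighted = np.stack(
            [w * gaussian_pdf(points, m, s) for w, m, s in zip(weights, mus, sigmas)],
            axis=1,
        )
        z = weighted / np.sum(weighted, axis=1, keepdims=True)
        n_k = np.sum(z, axis=0)  # effective number of points per Gaussian

        # Step 2: mixture weights.
        weights = n_k / len(points)
        # Step 3: means, shape (K, d).
        mus = (z.T @ points) / np.expand_dims(n_k, -1)
        # Step 4: covariances, shape (K, d, d), via broadcasting.
        diff = np.expand_dims(points, 1) - np.expand_dims(mus, 0)    # (n, K, d)
        outer = np.expand_dims(diff, -1) * np.expand_dims(diff, -2)  # (n, K, d, d)
        sigmas = np.sum(np.expand_dims(z, (-1, -2)) * outer, axis=0)
        sigmas = sigmas / np.expand_dims(n_k, (-1, -2))
    return mus, sigmas, weights, z


# Two made-up, well-separated clusters.
rng = np.random.default_rng(0)
points = np.concatenate([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2)),
    rng.normal(loc=[4.0, 4.0], scale=0.5, size=(200, 2)),
])
init_mus = np.array([[1.0, 1.0], [3.0, 3.0]])
init_sigmas = np.stack([np.eye(2), np.eye(2)])
init_weights = np.array([0.5, 0.5])

mus, sigmas, weights, z = fit_gmm_sketch(points, init_mus, init_sigmas, init_weights)
print(mus)
print(weights)
```

The calls to np.expand_dims insert axes of length one, so that the per-point and per-Gaussian arrays broadcast against each other without explicit loops over the samples.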
- The data in mixture_mixture_diabetes.py is real. It originally appeared in a medical journal (https://doi.org/10.1007/BF00423145). The plot below shows healthy and sick patients.
Train a GMM to find the diabetic patients.
- Standard packages like scikit-learn implement GMMs, too. Take a minute to read https://scikit-learn.org/stable/modules/mixture.html .
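For comparison, a minimal scikit-learn sketch on synthetic data is shown below. The two clusters are made up, and the settings `n_components=2` and `random_state=0` are illustrative choices, not recommendations from the exercise.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two made-up, well-separated clusters.
rng = np.random.default_rng(0)
points = np.concatenate([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2)),
    rng.normal(loc=[4.0, 4.0], scale=0.5, size=(200, 2)),
])

gmm = GaussianMixture(n_components=2, random_state=0).fit(points)

labels = gmm.predict(points)               # most likely Gaussian per point
probabilities = gmm.predict_proba(points)  # membership probabilities per point
print(gmm.means_)
print(gmm.weights_)
```

The fitted means, weights, and covariances are available as `gmm.means_`, `gmm.weights_`, and `gmm.covariances_`, which makes it easy to compare them with a hand-written implementation.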