MaxCorr is a Python package for the estimation of Maximal (Non-Linear) Correlations in sets of bivariate and multivariate data.
The package is available on PyPI and can be installed using the following command:

```
pip install maxcorr
```
Depending on the chosen algorithm and/or the backend library used to perform the numerical computation, additional dependencies might be required. Specifically, such dependencies can be installed by passing a parameter between square brackets during the installation, i.e.:
- `maxcorr[torch]`, which also includes a version of PyTorch (required for `DensityIndicator`, or any indicator using `torch` as backend);
- `maxcorr[tensorflow]`, which also includes a version of Tensorflow (required for any indicator using `tensorflow` as backend);
- `maxcorr[lattice]`, which also includes a version of Tensorflow and Tensorflow Lattice (required for `LatticeIndicator`).
A full installation including all the optional dependencies can be obtained using:

```
pip install 'maxcorr[full]'
```
MaxCorr currently includes three semantics for non-linear correlations, namely:
[HGR]
The Hirschfeld-Gebelein-Rényi coefficient (Rényi, 1959) is a non-linear correlation coefficient based on Pearson's correlation.
It is defined as the maximal correlation that can be achieved by transforming the two random variables $a$ and $b$ through two mapping functions $f$ and $g$ (the copula transformations), i.e.:

$$\text{HGR}(a, b) = \max_{f, g} \rho(f(a), g(b))$$

where $\rho$ denotes Pearson's correlation.
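As a quick illustration of the difference with linear correlation, consider a noiseless quadratic dependency: Pearson's coefficient is close to zero, while the maximal correlation is close to one. A minimal sketch using the package's `indicator` function (built as shown later in this document):

```python
import numpy as np
from maxcorr import indicator

a = np.random.normal(size=1000)
b = a ** 2  # deterministic but non-linear dependency

print(np.corrcoef(a, b)[0, 1])  # Pearson's correlation: close to 0
ind = indicator(semantics='hgr', algorithm='dk', backend='numpy')
print(ind.compute(a, b))        # maximal correlation: close to 1
```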
[GeDI]
The Generalized Disparate Impact (Giuliani et al., 2023) is a non-linear correlation coefficient which extends the concept of Disparate Impact (Aghaei et al., 2019), defined in the field of algorithmic fairness, and whose value is computed as the ratio between the covariance of the two transformed variables and the variance of the first one, i.e.:

$$\text{GeDI}(a, b) = \max_{f, g} \frac{\text{cov}(f(a), g(b))}{\text{var}(a)} \quad \text{s.t.} \quad \sigma(f(a)) = \sigma(a), \;\; \sigma(g(b)) = \sigma(b)$$

namely, we bound the two copula transformations to maintain the same standard deviation of the original random variables, in order to avoid the explosion of the indicator by means of a simple rescaling of the mapping functions. Since HGR is based on Pearson's correlation, hence scale invariant, this aspect was not problematic in its definition; however, this also allows us to redefine HGR by imposing the same constraints used in GeDI without loss of generality, i.e.:

$$\text{HGR}(a, b) = \max_{f, g} \frac{\text{cov}(f(a), g(b))}{\sigma(a) \cdot \sigma(b)} \quad \text{s.t.} \quad \sigma(f(a)) = \sigma(a), \;\; \sigma(g(b)) = \sigma(b)$$

making GeDI equivalent to HGR up to a scaling factor which depends on the standard deviations of the original random variables, i.e.:

$$\text{GeDI}(a, b) = \frac{\sigma(b)}{\sigma(a)} \cdot \text{HGR}(a, b)$$
[NLC]
The Non-Linear Covariance is a non-linear extension of the covariance measure that comes naturally after the definition of the first two semantics. By leveraging the same constraints used in GeDI, we define NLC as:

$$\text{NLC}(a, b) = \max_{f, g} \, \text{cov}(f(a), g(b)) \quad \text{s.t.} \quad \sigma(f(a)) = \sigma(a), \;\; \sigma(g(b)) = \sigma(b)$$

hence, adopting the same strategy used before, NLC is also equivalent to HGR up to a scaling factor, i.e.:

$$\text{NLC}(a, b) = \sigma(a) \cdot \sigma(b) \cdot \text{HGR}(a, b)$$
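The scaling relations above can be checked numerically. A minimal sketch, assuming the semantics flags for GeDI and NLC are `'gedi'` and `'nlc'` (the `'hgr'` flag appears later in this document; the other two names are an assumption here):

```python
import numpy as np
from maxcorr import indicator

a = np.random.normal(loc=3.0, scale=2.0, size=100)
b = np.random.normal(loc=2.0, scale=5.0, size=100)

hgr = indicator(semantics='hgr', algorithm='dk', backend='numpy').compute(a, b)
gedi = indicator(semantics='gedi', algorithm='dk', backend='numpy').compute(a, b)  # assumed flag
nlc = indicator(semantics='nlc', algorithm='dk', backend='numpy').compute(a, b)    # assumed flag

print(gedi, b.std() / a.std() * hgr)   # GeDI(a, b) ≈ σ(b) / σ(a) · HGR(a, b)
print(nlc, a.std() * b.std() * hgr)    # NLC(a, b) ≈ σ(a) · σ(b) · HGR(a, b)
```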
MaxCorr currently implements six algorithms to estimate the non-linear correlations:
[Double Kernel]
This algorithm is inspired by the one proposed by Giuliani et al. (2023) for the computation of the Generalized Disparate Impact, and extended to account for HGR computation as well.
Given two vectors, the algorithm searches for the optimal mapping functions within two finite kernel expansions of the inputs. When using `torch` or `tensorflow` as backend, the algorithm also returns gradient information along with the solution; moreover, if no mapping functions are specified, the indicator uses polynomial kernel expansions, as in the sketch below.
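A minimal sketch using the `DoubleKernelIndicator` class with the `kernel_a` and `kernel_b` parameters shown in the configuration example at the end of this document (interpreting these integers as the degrees of the polynomial expansions is an assumption):

```python
import numpy as np
from maxcorr.indicators import DoubleKernelIndicator

a = np.random.normal(size=100)
b = a ** 3 + np.random.normal(scale=0.1, size=100)

# degree-3 expansion for a, linear mapping for b (degree semantics assumed)
dk = DoubleKernelIndicator(kernel_a=3, kernel_b=1)
print(dk.compute(a, b))
```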
[Single Kernel]
This is a variant of the previous algorithm which accounts for functional dependencies only, although in either direction.
Formally, given a set of functions to be applied to one of the two vectors only, the algorithm computes the optimal mapping; gradient information is again returned when using `torch` or `tensorflow` as backend.
Again, if no mapping functions are specified, the indicator uses polynomial kernel expansions.
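A minimal sketch, assuming the algorithm flag for this variant is `'sk'` (the name is an assumption, mirroring the `'dk'` flag shown later in this document):

```python
import numpy as np
from maxcorr import indicator

a = np.random.normal(size=100)
b = np.cos(a)  # b is a function of a, the kind of dependency this variant targets

sk = indicator(algorithm='sk', backend='numpy')  # 'sk' flag is an assumption
print(sk.compute(a, b))
```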
[Neural]
This approach was proposed by Grari et al. (2020) and models the two copula transformations as neural networks. The algorithm is implemented for both the `torch` and `tensorflow` backends, hence it can return gradient information in both cases.
However, in order to work, at least one of the two libraries must be installed to perform the neural training, even when using the `numpy` backend.
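A minimal sketch, assuming the algorithm flag for this estimator is `'nn'` (the name is an assumption):

```python
import numpy as np
from maxcorr import indicator

a = np.random.normal(size=100)
b = np.sin(a) + np.random.normal(scale=0.1, size=100)

# neural training requires torch and/or tensorflow, even with backend='numpy'
nn = indicator(algorithm='nn', backend='numpy')  # 'nn' flag is an assumption
print(nn.compute(a, b))
```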
[Lattice]
This is a custom variant of Grari et al.'s approach which uses Lattice Models to approximate the two copula transformations.
In order to use this algorithm, the additional dependency `[lattice]` must be added when installing the package; moreover, since the computational method relies on Tensorflow Lattice, gradient information is only returned when using `tensorflow` as backend.
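A minimal sketch using the `LatticeIndicator` class mentioned in the installation notes above (the import path mirrors that of `DoubleKernelIndicator` shown later and is an assumption):

```python
import numpy as np
from maxcorr.indicators import LatticeIndicator  # requires pip install 'maxcorr[lattice]'

a = np.random.normal(size=100)
b = np.abs(a) + np.random.normal(scale=0.1, size=100)

lat = LatticeIndicator()  # import path and default arguments are assumptions
print(lat.compute(a, b))
```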
[Density]
This approach was proposed by Mary et al. (2019) and computes the maximal correlation using a theoretical upper-bound of HGR known as Witsenhausen's characterization (Witsenhausen, 1975).
Since the algorithm needs to perform Kernel-Density Estimation using `torch` primitives, PyTorch must be installed on the machine, and gradient information is only returned when using `torch` as backend.
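A minimal sketch using the `DensityIndicator` class mentioned in the installation notes above (the import path is an assumption):

```python
import numpy as np
from maxcorr.indicators import DensityIndicator  # requires PyTorch (maxcorr[torch])

a = np.random.normal(size=100)
b = np.random.normal(size=100)

den = DensityIndicator()  # import path is an assumption
print(den.compute(a, b))
```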
[Randomized]
This approach was proposed by Lopez-Paz et al. (2013) and computes the maximal correlation by mapping the input vectors through randomized non-linear projections and then measuring the canonical correlation between the projected data.
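A minimal sketch, assuming the algorithm flag for this estimator is `'rdc'` (the name is an assumption):

```python
import numpy as np
from maxcorr import indicator

a = np.random.uniform(size=100)
b = np.sin(4 * np.pi * a)  # strongly non-linear dependency

rdc = indicator(algorithm='rdc', backend='numpy')  # 'rdc' flag is an assumption
print(rdc.compute(a, b))
```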
Indicators can be built using the main function `maxcorr.indicator`, specifying these three options (semantics, algorithm, and backend). Eventually, the correlation can be computed by calling the `compute(a, b)` method, passing the input vectors/matrices as parameters.
```python
import numpy as np
from maxcorr import indicator

a = np.random.normal(loc=3.0, scale=2.0, size=100)
b = np.random.normal(loc=2.0, scale=5.0, size=100)

ind = indicator(semantics='hgr', algorithm='dk', backend='numpy')
ind.compute(a, b)
```
Moreover, algorithms might have specific parameters that can be passed as keyword arguments to the `indicator` function, or to the constructor of each respective indicator class, which explicitly exposes its specific parameters.
```python
from maxcorr import indicator
from maxcorr.indicators import DoubleKernelIndicator

dk1 = indicator(algorithm='dk', kernel_a=5, kernel_b=1)
dk2 = DoubleKernelIndicator(kernel_a=5, kernel_b=1)
```
For a more in-depth exposition, follow the tutorial notebook `tutorial.ipynb`.