Data generation module for benchmarking methods for causal structure learning (CSL) from mixed discrete-continuous and nonlinear observational data based upon the mixed additive noise model (MANM). The related paper "MANM-CS: Data Generation for Benchmarking Causal Structure Learning from Mixed Discrete-Continuous and Nonlinear Data" was published at the NeurIPS-21 Workshop "Causal Inference & Machine Learning: Why now?".
python3 -m pip install manm-cs
python3 -m manm_cs --num_nodes 10 --edge_density 0.5 --num_samples 10000 --discrete_node_ratio 0.5
Start by cloning this repository.
git clone git@github.com:hpi-epic/manm-cs.git
cd manm-cs
Please make sure you have Python 3 installed. We tested the execution of our data generation with Python 3.9. We recommend installing the requirements defined in requirements.txt using virtualenv.
MacOS / Linux
# Install virtualenv
python3 -m pip install --user virtualenv
# Create a new virtual environment
python3 -m venv env
# Activate the virtual environment
source env/bin/activate
Windows
# Install virtualenv
py -m pip install --user virtualenv
# Create a new virtual environment
py -m venv env
# Activate the virtual environment
.\env\Scripts\activate
After the creation of a new virtual environment, we can install the project dependencies defined in setup.cfg for both platforms.
python3 -m pip install .
You can start the data generation with following command. The generated graph and the dataset are saved as ground_truth.gml and samples.csv in the current working directory. Available parameters for data generation can be seen with python3 -m manm_cs --help
.
python3 -m manm_cs \
--num_nodes 10 \
--edge_density 0.5 \
--num_samples 10000 \
--discrete_node_ratio 0.5
python3 -m pip install --upgrade build twine
python3 -m build
# Upload to testPyPi
# use __token__ as username and the pypi token as password
python3 -m twine upload --repository testpypi dist/*
# Upload to PyPi
python3 -m twine upload dist/*
name | Value Range | Default | Description |
---|---|---|---|
num_nodes | [1, Inf) | None | Defines the number of nodes to be in the generated DAG. |
edge_density | [0, 1] | None | Defines the density of edges in the generated DAG. |
discrete_node_ratio | [0, 1] | None | Defines the percentage of nodes that shall be of discrete type. Depending on its value the appropriate model (multivariate normal, mixed gaussian, discrete only) is chosen. |
num_samples | [1, Inf) | None | Defines the number of samples that shall be generated from the DAG. |
discrete_signal_to_noise_ratio | [0, 1] | 0.5 | Defines the ratio of uniform noise added within the mixed additive noise model, i.e., 0 = no noise, and 1 = uniform discrete noise |
min_discrete_value_classes | [2, Inf) | 3 | Defines the minimum number of discrete classes a discrete variable shall have. |
max_discrete_value_classes | [2, Inf) | 4 | Defines the maximum number of discrete classes a discrete variable shall have. |
continuous_noise_std | [0, Inf) | 1.0 | Defines the standard deviation of gaussian noise added to continuous variables. |
functions | ([0, 1], func) | id | A list of probabilities and mathmatical functions for relationships between two continuous nodes. Note, the input are tuples (probability, function), where the sum of all probabilities has to equal 1. Command line supported functions are: [linear, quadratic, cubic, tanh, sin, cos] |
num_processes | [1, Inf) | 1 | Number of processes used for data sampling |
conditional_gaussian | 0 or 1 | 1 | '1' Defines that conditional gaussian model is assumed for a mixture of variables. Otherwise '0', discrete variables can have continuous parents. |
beta_lower_limit | (0, Inf) | 0.5 | Lower limit for beta values for influence of continuous parents. Betas are sampled uniform from the union of [-upper,-lower] and [lower,upper]. Upper limit see below. |
beta_upper_limit | (0, Inf) | 1 | Upper limit for beta values for influence of continuous parents. Betas are sampled uniform from the union of [-upper,-lower] and [lower,upper]. Lower limit see above. |
graph_structure_file | None | Defines a path to a .gml file for a fixed DAG structure (ignoring node and edge characteristics) used during manm_cs graph building. Note graph_structure_file is mutually exclusive to num_nodes and edge_density. | |
variables_scaling | {'normal', 'standard', 'rank', 'rank', 'uniform'} | None | 'Scale the continuous variables ('normal' or standard') or all variables ('rank' or 'uniform’) in the dataset once all samples are generated. |
scale_parents | 0 or 1 | 0 | Defines if the influence of the parents on a child node is scaled, e.g., the sum of values of the parents is divided by the number of parents for a continuous child node. |
MIT