(Logo generated by DALL·E 3)
Systematic Assessment of Tabular Data Synthesis Algorithms

A principled library for tuning, training, and evaluating tabular data synthesis.

What's New

[Sep 18, 2024] We added a new SOTA HP synthesizer, TabSyn, to SynMeter. Try it out!

Why SynMeter:

  • 💫 Easy to add new synthesizers, and to seamlessly tune, train, and evaluate them.
  • 🌀 Principled evaluation metrics for fidelity, privacy, and utility, covering both differentially private (DP) and heuristically private (HP) synthesizers.
  • 🔥 Several SOTA synthesizers, by type:
    • Statistical methods: MST, PrivSyn
    • GAN-based: CTGAN, PATE-GAN
    • VAE-based: TVAE
    • Diffusion-based: TabDDPM, TabSyn, TableDiffusion
    • LLM-based: GReaT

🚀 Installation

Create a new conda environment and set it up:

conda create -n synmeter python==3.9
conda activate synmeter
pip install -r requirements.txt # install dependencies
pip install -e . # package the library

Change the base directory in ./lib/info/ROOT_DIR:

ROOT_DIR = root_to_synmeter
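
For example, if the repository were cloned to /home/user/SynMeter (a hypothetical path; follow the quoting style of the placeholder in the file), the entry would read:

ROOT_DIR = "/home/user/SynMeter"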

💥 Usage

Datasets

  • SynMeter provides 12 standardized datasets with train/val/test splits for benchmarking, which can be downloaded from Google Drive.
  • You can also use an additional dataset by placing it in ./dataset (see the preparation sketch below).
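
If you prepare a custom dataset yourself, a minimal sketch along the following lines may help. The file names, split ratios, and CSV format are assumptions; mirror one of the provided datasets for the exact layout the loaders expect.

# prepare_my_dataset.py -- hypothetical helper, not part of SynMeter
import os
import pandas as pd
from sklearn.model_selection import train_test_split

name = "my_dataset"                      # assumed dataset name
df = pd.read_csv("my_dataset.csv")       # assumed raw file

# 70/15/15 train/val/test split (ratios are an assumption)
train, rest = train_test_split(df, test_size=0.3, random_state=0)
val, test = train_test_split(rest, test_size=0.5, random_state=0)

out_dir = os.path.join("dataset", name)  # mirrors the ./dataset convention
os.makedirs(out_dir, exist_ok=True)
train.to_csv(os.path.join(out_dir, "train.csv"), index=False)
val.to_csv(os.path.join(out_dir, "val.csv"), index=False)
test.to_csv(os.path.join(out_dir, "test.csv"), index=False)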

Tune evaluators for utility evaluations

  • Machine learning affinity requires machine learning models with tuned hyperparameters; SynMeter provides 8 commonly used machine learning models and their configurations in ./exp/evaluators.
  • You can tune these evaluators on your custom dataset:
python scripts/tune_evaluator.py -d [dataset] -c [cuda]
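
For example, to tune the evaluators on the adult dataset on GPU 0 (the dataset name is illustrative, and we assume -c takes a CUDA device index):

python scripts/tune_evaluator.py -d adult -c 0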

Tune synthesizer

We provide a unified tuning objective for model tuning; thus, every synthesizer can be tuned with a single command:

python scripts/tune_synthesizer.py -d [dataset] -m [synthesizer] -s [seed] -c [cuda]
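
For instance, to tune TabSyn on the adult dataset with seed 0 on GPU 0 (the lowercase model keyword is an assumption; check ./synthesizer for the registered names):

python scripts/tune_synthesizer.py -d adult -m tabsyn -s 0 -c 0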

Train synthesizer

After tuning, a configuration is recorded under ./exp/[dataset]/[synthesizer]; SynMeter can use it to train and store the synthesizer:

python scripts/train_synthesizer.py -d [dataset] -m [synthesizer] -s [seed] -c [cuda]
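
Continuing the illustrative example above:

python scripts/train_synthesizer.py -d adult -m tabsyn -s 0 -c 0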

Evaluate synthesizer

Assessing the fidelity of the synthetic data:

python scripts/eval_fidelity.py -d [dataset] -m [synthesizer] -s [seed] -t [target] 

Assessing the privacy of the synthetic data:

python scripts/eval_privacy.py -d [dataset] -m [synthesizer] -s [seed]

Assessing the utility of the synthetic data:

python scripts/eval_utility.py -d [dataset] -m [synthesizer] -s [seed]
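
For example, to run all three evaluations for the illustrative TabSyn/adult run above (the -t value is a guess; see scripts/eval_fidelity.py for the accepted targets):

python scripts/eval_fidelity.py -d adult -m tabsyn -s 0 -t train
python scripts/eval_privacy.py -d adult -m tabsyn -s 0
python scripts/eval_utility.py -d adult -m tabsyn -s 0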

The results of the evaluations are saved under the corresponding directory ./exp/[dataset]/[synthesizer].

📖 Customize your own synthesizer

One advantage of SynMeter is that it makes adding new synthesis algorithms easy; only three steps are needed:

  1. Write the new synthesis code as a module under ./synthesizer/my_synthesizer.
  2. Create a base configuration in ./exp/base_config.
  3. Create a calling Python file in ./synthesizer that contains three functions: train, sample, and tune (a sketch of this interface is given below).

Then, you are free to tune, run, and test the new synthesizer!
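
A minimal sketch of such a calling file is shown below; the function signatures are assumptions, so mirror one of the bundled synthesizer wrappers for the real interface.

# synthesizer/my_synthesizer.py -- hypothetical skeleton; argument names are assumptions
def train(config, dataset_path, save_path):
    """Fit the synthesizer described by config on the data at dataset_path
    and persist the trained model to save_path."""
    raise NotImplementedError

def sample(config, model_path, n_samples):
    """Load the trained model from model_path and return n_samples synthetic rows."""
    raise NotImplementedError

def tune(config_space, dataset_path, seed):
    """Search config_space with the unified tuning objective and return the best configuration."""
    raise NotImplementedError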

🔑 Methods

Statistical Methods

Method | Type | Description | Reference
MST | DP | Uses probabilistic graphical models to learn the dependence structure of low-dimensional marginals for data synthesis. | Paper, Code
PrivSyn | DP | A non-parametric DP synthesizer that iteratively updates the synthetic dataset to match the noisy target marginals. | Paper, Code

Generative Adversarial Networks (GANs)

Method | Type | Description | Reference
CTGAN | HP | A conditional generative adversarial network that can handle tabular data. | Paper, Code
PATE-GAN | DP | Uses the Private Aggregation of Teacher Ensembles (PATE) framework and applies it to GANs. | Paper, Code

Variational Autoencoders (VAE)

Method | Type | Description | Reference
TVAE | HP | A conditional VAE network that can handle tabular data. | Paper, Code

Diffusion Models

Method | Type | Description | Reference
TabDDPM | HP | Uses a diffusion model for tabular data synthesis. | Paper, Code
TabSyn | HP | Uses a VAE together with a latent diffusion model for synthesis. | Paper, Code
TableDiffusion | DP | Generates tabular datasets under differential privacy. | Paper, Code

Large Language Model (LLM)-based Methods

Method | Type | Description | Reference
GReaT | HP | Fine-tunes a large language model on tabular data for synthesis. | Paper, Code

⚡ Evaluation Metrics

  • Fidelity metrics: we consider the Wasserstein distance as a principled fidelity metric, computed over all one-way and two-way marginals (a rough one-way sketch is given at the end of this section).

  • Privacy metrics: we devise the Membership Disclosure Score (MDS) to measure the membership privacy risks of both HP and DP synthesizers.

  • Utility metrics: we use machine learning affinity and query error to measure the utility of synthetic data.

Please see our paper for details and usage.
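
As a rough illustration of the fidelity idea only (not SynMeter's actual metric code), a one-way marginal comparison based on scipy's Wasserstein distance might look like the sketch below; the min-max scaling of numerical columns and the frequency-based handling of categorical columns are assumptions.

# wasserstein_oneway.py -- illustrative sketch, not SynMeter's implementation
import numpy as np
import pandas as pd
from scipy.stats import wasserstein_distance

def one_way_wasserstein(real: pd.DataFrame, synth: pd.DataFrame) -> float:
    """Average distance over all one-way marginals: 1-Wasserstein on
    min-max scaled numerical columns, total variation on category
    frequencies for the rest (choices made for this sketch only)."""
    distances = []
    for col in real.columns:
        if pd.api.types.is_numeric_dtype(real[col]):
            lo, hi = real[col].min(), real[col].max()
            scale = (hi - lo) or 1.0
            distances.append(wasserstein_distance(
                (real[col] - lo) / scale, (synth[col] - lo) / scale))
        else:
            cats = real[col].astype("category").cat.categories
            p = real[col].value_counts(normalize=True).reindex(cats, fill_value=0)
            q = synth[col].value_counts(normalize=True).reindex(cats, fill_value=0)
            # total variation distance as a simple stand-in for categorical columns
            distances.append(0.5 * np.abs(p.values - q.values).sum())
    return float(np.mean(distances))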

🌈 Acknowledgements

Many excellent synthesis algorithms and open-source libraries are used in this project.
