A library for data valuation.

pyDVL collects algorithms for Data Valuation and Influence Function computation.

Data Valuation is the task of estimating the intrinsic value of a data point with respect to the training set, the model and a scoring function. We currently implement methods from the following papers:

Ghorbani, Amirata, and James Zou. Data Shapley: Equitable Valuation of Data for Machine Learning. In International Conference on Machine Learning, 2242–51. PMLR, 2019.
Wang, Tianhao, Yu Yang, and Ruoxi Jia. Improving Cooperative Game Theory-Based Data Valuation via Data Utility Learning. arXiv, 2022.
Jia, Ruoxi, David Dao, Boxin Wang, Frances Ann Hubis, Nezihe Merve Gurel, Bo Li, Ce Zhang, Costas Spanos, and Dawn Song. Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms. Proceedings of the VLDB Endowment 12, no. 11 (1 July 2019): 1610–23.
Okhrati, Ramin, and Aldo Lipani. A Multilinear Sampling Algorithm to Estimate Shapley Values. In 25th International Conference on Pattern Recognition (ICPR 2020), 7992–99. IEEE, 2021.
Yan, Tom, and Ariel D. Procaccia. If You Like Shapley Then You’ll Love the Core. Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 6 (2021): 5751–59.
Jia, Ruoxi, David Dao, Boxin Wang, Frances Ann Hubis, Nick Hynes, Nezihe Merve Gürel, Bo Li, Ce Zhang, Dawn Song, and Costas J. Spanos. Towards Efficient Data Valuation Based on the Shapley Value. In 22nd International Conference on Artificial Intelligence and Statistics, 1167–76. PMLR, 2019.

Influence Functions compute the effect that single points have on an estimator / model. We implement methods from the following papers:

Koh, Pang Wei, and Percy Liang. Understanding Black-Box Predictions via Influence Functions. In Proceedings of the 34th International Conference on Machine Learning, 70:1885–94. Sydney, Australia: PMLR, 2017.

Installation

To install the latest release use:

pip install pyDVL

You can also install the latest development version from TestPyPI:

pip install pyDVL --index-url https://test.pypi.org/simple/

For more instructions and information refer to Installing pyDVL in the documentation.
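
As a quick check after installing, the following sketch reads the installed version using only the Python standard library. The helper name `installed_version` is hypothetical and not part of pyDVL; the distribution name `pyDVL` is the one used in the pip command above.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str = "pyDVL") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent.

    `installed_version` is an illustrative helper, not a pyDVL function.
    """
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# Prints the version string if pyDVL is installed, otherwise a short notice.
print(installed_version() or "pyDVL is not installed in this environment")
```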

Usage

The steps required to compute values for your samples are:

Create a Dataset object with your train and test splits.
Create an instance of a SupervisedModel (basically any scikit-learn-compatible predictor).
Create a Utility object to wrap the Dataset, the model and a scoring function.
Use one of the methods defined in the library to compute the values.

This is how it looks for Truncated Monte Carlo Shapley, an efficient method for computing Data Shapley values:

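from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from pydvl.value import (
    AbsoluteStandardError,
    Dataset,
    MaxUpdates,
    RelativeTruncation,
    Scorer,
    ShapleyMode,
    Utility,
    compute_shapley_values,
)

# Build a Dataset with a 70/30 train/test split.
data = Dataset.from_sklearn(load_breast_cancer(), train_size=0.7)
model = LogisticRegression()
# The Utility wraps the model, the data and the scoring function.
u = Utility(model, data, Scorer("accuracy", default=0.0))
values = compute_shapley_values(
    u,
    mode=ShapleyMode.TruncatedMontecarlo,
    # Stop on either criterion: 100 updates, or standard error below 0.01.
    done=MaxUpdates(100) | AbsoluteStandardError(threshold=0.01),
    truncation=RelativeTruncation(u, rtol=0.01),
)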

For more instructions and information refer to Getting Started in the documentation. We provide several examples with details on the algorithms and their applications.
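
To give an intuition for what truncated Monte Carlo Shapley does, here is a minimal standard-library sketch in the spirit of Ghorbani and Zou (2019). It is not pyDVL's implementation: the function `truncated_mc_shapley`, its parameters and the toy additive utility are assumptions made for illustration. Each random permutation adds points one at a time and credits every point with its marginal change in utility; a permutation is cut short once the running utility is within `rtol` of the full-set utility.

```python
import random
from typing import Callable, FrozenSet, List


def truncated_mc_shapley(
    n_points: int,
    utility: Callable[[FrozenSet[int]], float],
    n_permutations: int = 100,
    rtol: float = 0.01,
    seed: int = 0,
) -> List[float]:
    """Average marginal contributions over random permutations,
    truncating a permutation once the utility is close to the full-set value.

    Illustrative only; names and defaults are not taken from pyDVL.
    """
    rng = random.Random(seed)
    indices = list(range(n_points))
    total = utility(frozenset(indices))  # utility of the full training set
    values = [0.0] * n_points

    for _ in range(n_permutations):
        rng.shuffle(indices)
        subset = set()
        previous = utility(frozenset())
        for idx in indices:
            # Relative truncation: the remaining points in this permutation
            # are credited with a zero marginal contribution.
            if abs(total - previous) <= rtol * abs(total):
                break
            subset.add(idx)
            current = utility(frozenset(subset))
            values[idx] += current - previous
            previous = current

    return [v / n_permutations for v in values]


# Toy additive utility (an assumption): each point contributes a fixed weight,
# so its exact Shapley value equals that weight.
weights = [1, 2, 3, 4]


def toy_utility(subset: FrozenSet[int]) -> float:
    return sum(weights[i] for i in subset)


estimates = truncated_mc_shapley(len(weights), toy_utility, n_permutations=50, rtol=0.0)
print(estimates)  # [1.0, 2.0, 3.0, 4.0]
```

With `rtol=0.0` no permutation is truncated early in this toy case, so the estimate matches the exact values. A larger `rtol` saves utility evaluations at the cost of some bias, which is the trade-off that the truncation policy controls.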

Caching

pyDVL can cache certain results to speed up computation. It uses Memcached for that.

You can run it either locally or using Docker:

docker container run --rm -p 11211:11211 --name pydvl-cache -d memcached:latest

You can read more in the caching module's documentation.
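
Before starting a long computation it can help to confirm that the Memcached server is reachable. The sketch below uses only the standard library and Memcached's plain-text `version` command. The helper `memcached_version` is hypothetical and not part of pyDVL, and it assumes the server listens on port 11211, as mapped by the Docker command above.

```python
import socket
from typing import Optional


def memcached_version(
    host: str = "localhost", port: int = 11211, timeout: float = 1.0
) -> Optional[str]:
    """Return the Memcached server version, or None if it is unreachable.

    Illustrative helper, not a pyDVL function.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            # The text protocol answers "version" with "VERSION <number>".
            conn.sendall(b"version\r\n")
            reply = conn.recv(128).decode("ascii", errors="replace").strip()
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return None
    prefix = "VERSION "
    return reply[len(prefix):] if reply.startswith(prefix) else None


if __name__ == "__main__":
    print(memcached_version() or "No Memcached server reachable on localhost:11211")
```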

Contributing

Please open new issues for bugs, feature requests and extensions. You can read about the structure of the project, the toolchain and workflow in the guide for contributions.

License

pyDVL is distributed under LGPL-3.0. The complete license text can be found in two files in the repository: LICENSE and COPYING.LESSER.

All contributions will be distributed under this license.
