diff --git a/.github/CODE_OF_CONDUCT.md b/.github/CODE_OF_CONDUCT.md index fac374d..be21695 100644 --- a/.github/CODE_OF_CONDUCT.md +++ b/.github/CODE_OF_CONDUCT.md @@ -55,7 +55,7 @@ further defined and clarified by project maintainers. ## Enforcement Instances of abusive, harassing, or otherwise unacceptable behavior may be -reported by contacting the project team. All +reported by contacting the project team at team@uplift-modeling.com. All complaints will be reviewed and investigated and will result in a response that is deemed necessary and appropriate to the circumstances. The project team is obligated to maintain confidentiality with regard to the reporter of an incident. diff --git a/.github/CONTRIBUTING.md b/.github/CONTRIBUTING.md index 250e4d8..75df633 100644 --- a/.github/CONTRIBUTING.md +++ b/.github/CONTRIBUTING.md @@ -19,8 +19,8 @@ So, please make a pull request to the ``dev`` branch. 1. Fork the [project repository](https://github.com/maks-sh/scikit-uplift). 2. Clone your fork of the scikit-uplift repo from your GitHub account to your local disk: ``` bash - $ git clone git@github.com:YourLogin/scikit-uplift.git - $ cd scikit-learn + $ git clone https://github.com/YourName/scikit-uplift + $ cd scikit-uplift ``` 3. Add the upstream remote. This saves a reference to the main scikit-uplift repository, which you can use to keep your repository synchronized with the latest changes: ``` bash @@ -36,7 +36,7 @@ So, please make a pull request to the ``dev`` branch. $ git checkout -b feature/my_new_feature ``` and start making changes. Always use a feature branch. It’s a good practice. -6. Develop the feature on your feature branch on your computer, using Git to do the version control. When you’re done editing, add changed files using ``git add`` and then ``git commit``. +6. Develop the feature on your feature branch on your computer, using Git to do the version control. 
When you’re done editing, add changed files using ``git add .`` and then ``git commit``. Then push the changes to your GitHub account with: ``` bash @@ -55,4 +55,4 @@ We follow the PEP8 style guide for Python. Docstrings follow google style. * Use the present tense ("Add feature" not "Added feature") * Use the imperative mood ("Move cursor to..." not "Moves cursor to...") * Limit the first line to 72 characters or less -* Reference issues and pull requests liberally after the first line \ No newline at end of file +* Reference issues and pull requests liberally after the first line diff --git a/.github/PULL_REQUEST_TEMPLATE/pull_request_template.md b/.github/pull_request_template.md similarity index 100% rename from .github/PULL_REQUEST_TEMPLATE/pull_request_template.md rename to .github/pull_request_template.md diff --git a/.github/workflows/PyPi_upload.yml b/.github/workflows/PyPi_upload.yml new file mode 100644 index 0000000..19fc7f9 --- /dev/null +++ b/.github/workflows/PyPi_upload.yml @@ -0,0 +1,28 @@ +name: Upload to PyPi + +on: + release: + types: [published] + +jobs: + deploy: + + runs-on: ubuntu-latest + + steps: + - uses: actions/checkout@v2 + - name: Set up Python + uses: actions/setup-python@v2 + with: + python-version: '3.x' + - name: Install dependencies + run: | + python -m pip install --upgrade pip + pip install setuptools wheel twine + - name: Build and publish + env: + TWINE_USERNAME: ${{ secrets.PYPI_USERNAME }} + TWINE_PASSWORD: ${{ secrets.PYPI_PASSWORD }} + run: | + python setup.py sdist bdist_wheel + twine upload dist/* diff --git a/.github/workflows/ci-test.yml b/.github/workflows/ci-test.yml new file mode 100644 index 0000000..13566c7 --- /dev/null +++ b/.github/workflows/ci-test.yml @@ -0,0 +1,48 @@ +name: Python package + +on: + push: + branches: [ master ] + pull_request: + + +jobs: + test: + name: Check tests + runs-on: ${{ matrix.operating-system }} + strategy: + matrix: + operating-system: [ubuntu-latest, windows-latest, macos-latest]
+ python-version: [3.6, 3.7, 3.8, 3.9] + fail-fast: false + + steps: + - uses: actions/checkout@v2 + - name: Set up Python ${{ matrix.python-version }} + uses: actions/setup-python@v2 + with: + python-version: ${{ matrix.python-version }} + - name: Install dependencies and lints + run: pip install pytest .[tests] + - name: Run PyTest + run: pytest + + check_sphinx_build: + name: Check Sphinx build for docs + runs-on: ubuntu-latest + strategy: + matrix: + python-version: [3.8] + steps: + - name: Checkout + uses: actions/checkout@v2 + - name: Set up Python + uses: actions/setup-python@v2 + with: + python-version: ${{ matrix.python-version }} + - name: Update pip + run: python -m pip install --upgrade pip + - name: Install dependencies + run: pip install -r docs/requirements.txt + - name: Run Sphinx + run: sphinx-build -b html docs /tmp/_docs_build \ No newline at end of file diff --git a/Readme.rst b/Readme.rst index 1c8e853..e53623d 100644 --- a/Readme.rst +++ b/Readme.rst @@ -36,27 +36,36 @@ scikit-uplift =============== -**scikit-uplift** is a Python module for classic approaches for uplift modeling built on top of scikit-learn. +**scikit-uplift (sklift)** is an uplift modeling Python package that provides fast sklearn-style model implementations, evaluation metrics and visualization tools. -Uplift prediction aims to estimate the causal impact of a treatment at the individual level. +Uplift modeling estimates the causal effect of a treatment and uses it to effectively target customers that are most likely to respond to a marketing campaign. + +**Use cases for uplift modeling:** + +* Target customers in the marketing campaign. Especially useful when promoting a popular product for which a large share of customers make the target action by themselves, without any influence. By modeling uplift you can find customers who are likely to make the target action (for instance, install an app) only when treated (for instance, received a push).
+ +* Combine a churn model and an uplift model to offer some bonus to a group of customers who are likely to churn. + +* Select a tiny group of customers in the campaign where the price per customer is high. Read more about the uplift modeling problem in `User Guide `__. -also articles in russian on habr.com: `Part 1 `__ + +Articles in Russian on habr.com: `Part 1 `__ and `Part 2 `__. **Features**: -* Comfortable and intuitive style of modelling like scikit-learn; +* Comfortable and intuitive scikit-learn-like API; -* Applying any estimator adheres to scikit-learn conventions; +* Applying any estimator compatible with scikit-learn (e.g. XGBoost, LightGBM, CatBoost, etc.); * All approaches can be used in sklearn.pipeline (see example (`EN `__ |Open In Colab3|_, `RU `__ |Open In Colab4|_)); -* Almost all implemented approaches solve both the problem of classification and regression; +* Almost all implemented approaches solve both classification and regression problems; -* A lot of metrics (Such as *Area Under Uplift Curve* or *Area Under Qini Curve*) are implemented to evaluate your uplift model; +* More uplift metrics than you have ever seen in one place! Includes brilliant ones like *Area Under Uplift Curve* (AUUC) or *Area Under Qini Curve* (Qini coefficient) with ideal cases; -* Useful graphs for analyzing the built model. +* Nice and useful visualizations for analyzing model performance. Installation ------------- @@ -149,7 +158,7 @@ See the **RetailHero tutorial notebook** (`EN `_ for more details. - By participating in this project, you agree to abide by its `Code of Conduct `__. +If you have any questions, please contact us at team@uplift-modeling.com + Contributing ~~~~~~~~~~~~~~~ diff --git a/docs/Readme.rst b/docs/Readme.rst index b998e78..6ea4463 100644 --- a/docs/Readme.rst +++ b/docs/Readme.rst @@ -1,9 +1,9 @@ -.. _scikit-uplift.readthedocs.io: https://scikit-uplift.readthedocs.io/en/latest/ +..
_uplift-modeling.com: https://www.uplift-modeling.com/en/latest/index.html Documentation =============== -The full documentation is available at `scikit-uplift.readthedocs.io`_. +The full documentation is available at `uplift-modeling.com`_. Or you can build the documentation locally using `Sphinx `_ 1.4 or later: diff --git a/docs/_static/images/user_guide/ug_uplift_approaches.png b/docs/_static/images/user_guide/ug_uplift_approaches.png new file mode 100644 index 0000000..86a9f60 Binary files /dev/null and b/docs/_static/images/user_guide/ug_uplift_approaches.png differ diff --git a/docs/api/datasets/clear_data_dir.rst b/docs/api/datasets/clear_data_dir.rst new file mode 100644 index 0000000..c5db1cc --- /dev/null +++ b/docs/api/datasets/clear_data_dir.rst @@ -0,0 +1,5 @@ +***************************************** +`sklift.datasets <./>`_.clear_data_dir +***************************************** + +.. autofunction:: sklift.datasets.datasets.clear_data_dir \ No newline at end of file diff --git a/docs/api/datasets/fetch_criteo.rst b/docs/api/datasets/fetch_criteo.rst new file mode 100644 index 0000000..b3f72da --- /dev/null +++ b/docs/api/datasets/fetch_criteo.rst @@ -0,0 +1,9 @@ +.. _Criteo: + +************************************** +`sklift.datasets <./>`_.fetch_criteo +************************************** + +.. autofunction:: sklift.datasets.datasets.fetch_criteo + +.. include:: ../../../sklift/datasets/descr/criteo.rst \ No newline at end of file diff --git a/docs/api/datasets/fetch_hillstrom.rst b/docs/api/datasets/fetch_hillstrom.rst new file mode 100644 index 0000000..d71d722 --- /dev/null +++ b/docs/api/datasets/fetch_hillstrom.rst @@ -0,0 +1,9 @@ +.. _Hillstrom: + +**************************************** +`sklift.datasets <./>`_.fetch_hillstrom +**************************************** + +.. autofunction:: sklift.datasets.datasets.fetch_hillstrom + +.. 
include:: ../../../sklift/datasets/descr/hillstrom.rst \ No newline at end of file diff --git a/docs/api/datasets/fetch_lenta.rst b/docs/api/datasets/fetch_lenta.rst new file mode 100644 index 0000000..dd2f225 --- /dev/null +++ b/docs/api/datasets/fetch_lenta.rst @@ -0,0 +1,9 @@ +.. _Lenta: + +*********************************** +`sklift.datasets <./>`_.fetch_lenta +*********************************** + +.. autofunction:: sklift.datasets.datasets.fetch_lenta + +.. include:: ../../../sklift/datasets/descr/lenta.rst \ No newline at end of file diff --git a/docs/api/datasets/fetch_x5.rst b/docs/api/datasets/fetch_x5.rst new file mode 100644 index 0000000..cb42b2f --- /dev/null +++ b/docs/api/datasets/fetch_x5.rst @@ -0,0 +1,9 @@ +.. _X5: + +*********************************** +`sklift.datasets <./>`_.fetch_x5 +*********************************** + +.. autofunction:: sklift.datasets.datasets.fetch_x5 + +.. include:: ../../../sklift/datasets/descr/x5.rst \ No newline at end of file diff --git a/docs/api/datasets/get_data_dir.rst b/docs/api/datasets/get_data_dir.rst new file mode 100644 index 0000000..33b7486 --- /dev/null +++ b/docs/api/datasets/get_data_dir.rst @@ -0,0 +1,5 @@ +***************************************** +`sklift.datasets <./>`_.get_data_dir +***************************************** + +.. autofunction:: sklift.datasets.datasets.get_data_dir \ No newline at end of file diff --git a/docs/api/datasets/index.rst b/docs/api/datasets/index.rst new file mode 100644 index 0000000..2103149 --- /dev/null +++ b/docs/api/datasets/index.rst @@ -0,0 +1,13 @@ +************************ +`sklift <../>`_.datasets +************************ + +.. 
toctree:: + :maxdepth: 3 + + ./clear_data_dir + ./get_data_dir + ./fetch_lenta + ./fetch_x5 + ./fetch_criteo + ./fetch_hillstrom \ No newline at end of file diff --git a/docs/api/index.rst b/docs/api/index.rst index 658e657..363c63d 100644 --- a/docs/api/index.rst +++ b/docs/api/index.rst @@ -15,3 +15,4 @@ This is the modules reference of scikit-uplift. ./models/index ./metrics/index ./viz/index + ./datasets/index \ No newline at end of file diff --git a/docs/changelog.md b/docs/changelog.md index 309cedc..4eee210 100644 --- a/docs/changelog.md +++ b/docs/changelog.md @@ -8,68 +8,90 @@ * 🔨 something that previously didn’t work as documented – or according to reasonable expectations – should now work. * ❗️ you will need to change your code to have the same effect in the future; or a feature will be removed in the future. +## Version 0.3.0 + +### [sklift.datasets](https://www.uplift-modeling.com/en/latest/api/datasets/index.html) + +* 🔥 Add [sklift.datasets](https://www.uplift-modeling.com/en/latest/api/datasets/index.html) by [@ElisovaIra](https://github.com/ElisovaIra), [@RobbStarkk](https://github.com/RobbStarkk), [@acssar](https://github.com/acssar), [@tankudo](https://github.com/tankudo), [@flashlight101](https://github.com/flashlight101), [@semenova-pd](https://github.com/semenova-pd), [@timfex](https://github.com/timfex) + +### [sklift.models](https://www.uplift-modeling.com/en/latest/api/models.html) + +* 📝 Add different checkers by [@ElisovaIra](https://github.com/ElisovaIra) + +### [sklift.metrics](https://www.uplift-modeling.com/en/latest/api/metrics.html) + +* 📝 Add different checkers by [@ElisovaIra](https://github.com/ElisovaIra) + +### [sklift.viz](https://www.uplift-modeling.com/en/latest/api/viz.html) + +* 📝 Fix conflicting and duplicating default values by [@denniskorablev](https://github.com/denniskorablev) + +### [User 
Guide](https://www.uplift-modeling.com/en/latest/user_guide/index.html) + +* 📝 Fix typos + ## Version 0.2.0 -### [User Guide](https://scikit-uplift.readthedocs.io/en/latest/user_guide/index.html) +### [User Guide](https://www.uplift-modeling.com/en/latest/user_guide/index.html) -* 🔥 Add [User Guide](https://scikit-uplift.readthedocs.io/en/latest/user_guide/index.html) +* 🔥 Add [User Guide](https://www.uplift-modeling.com/en/latest/user_guide/index.html) -### [sklift.models](https://scikit-uplift.readthedocs.io/en/latest/api/models.html) +### [sklift.models](https://www.uplift-modeling.com/en/latest/api/models.html) -* 💥 Add `treatment interaction` method to [SoloModel](https://scikit-uplift.readthedocs.io/en/latest/api/models/SoloModel.html) approach by [@AdiVarma27](https://github.com/AdiVarma27). +* 💥 Add `treatment interaction` method to [SoloModel](https://www.uplift-modeling.com/en/latest/api/models/SoloModel.html) approach by [@AdiVarma27](https://github.com/AdiVarma27). -### [sklift.metrics](https://scikit-uplift.readthedocs.io/en/latest/api/metrics.html) +### [sklift.metrics](https://www.uplift-modeling.com/en/latest/api/metrics.html) -* 💥 Add [uplift_by_percentile](https://scikit-uplift.readthedocs.io/en/latest/api/metrics/uplift_by_percentile.html) function by [@ElisovaIra](https://github.com/ElisovaIra). -* 💥 Add [weighted_average_uplift](https://scikit-uplift.readthedocs.io/en/latest/api/metrics/weighted_average_uplift.html) function by [@ElisovaIra](https://github.com/ElisovaIra). -* 💥 Add [perfect_uplift_curve](https://scikit-uplift.readthedocs.io/en/latest/api/metrics/perfect_uplift_curve.html) function. -* 💥 Add [perfect_qini_curve](https://scikit-uplift.readthedocs.io/en/latest/api/metrics/perfect_qini_curve.html) function.
-* 🔨 Add normalization in [uplift_auc_score](https://scikit-uplift.readthedocs.io/en/latest/api/metrics/uplift_auc_score.html) and [qini_auc_score](https://scikit-uplift.readthedocs.io/en/latest/api/metrics/qini_auc_score.html) functions. -* ❗ Remove metrics `auuc` and `auqc`. In exchange for them use respectively [uplift_auc_score](https://scikit-uplift.readthedocs.io/en/latest/api/metrics/uplift_auc_score.html) and [qini_auc_score](https://scikit-uplift.readthedocs.io/en/latest/api/metrics/qini_auc_score.html) +* 💥 Add [uplift_by_percentile](https://www.uplift-modeling.com/en/latest/api/metrics/uplift_by_percentile.html) function by [@ElisovaIra](https://github.com/ElisovaIra). +* 💥 Add [weighted_average_uplift](https://www.uplift-modeling.com/en/latest/api/metrics/weighted_average_uplift.html) function by [@ElisovaIra](https://github.com/ElisovaIra). +* 💥 Add [perfect_uplift_curve](https://www.uplift-modeling.com/en/latest/api/metrics/perfect_uplift_curve.html) function. +* 💥 Add [perfect_qini_curve](https://www.uplift-modeling.com/en/latest/api/metrics/perfect_qini_curve.html) function. +* 🔨 Add normalization in [uplift_auc_score](https://www.uplift-modeling.com/en/latest/api/metrics/uplift_auc_score.html) and [qini_auc_score](https://www.uplift-modeling.com/en/latest/api/metrics/qini_auc_score.html) functions. +* ❗ Remove metrics `auuc` and `auqc`. In exchange for them use respectively [uplift_auc_score](https://www.uplift-modeling.com/en/latest/api/metrics/uplift_auc_score.html) and [qini_auc_score](https://www.uplift-modeling.com/en/latest/api/metrics/qini_auc_score.html) -### [sklift.viz](https://scikit-uplift.readthedocs.io/en/latest/api/viz.html) +### [sklift.viz](https://www.uplift-modeling.com/en/latest/api/viz.html) -* 💥 Add [plot_uplift_curve](https://scikit-uplift.readthedocs.io/en/latest/api/viz/plot_uplift_curve.html) function.
-* 💥 Add [plot_qini_curve](https://scikit-uplift.readthedocs.io/en/latest/api/viz/plot_qini_curve.html) function. +* 💥 Add [plot_uplift_curve](https://www.uplift-modeling.com/en/latest/api/viz/plot_uplift_curve.html) function. +* 💥 Add [plot_qini_curve](https://www.uplift-modeling.com/en/latest/api/viz/plot_qini_curve.html) function. * ❗ Remove `plot_uplift_qini_curves`. ### Miscellaneous * 💥 Add contributors in main Readme and in main page of docs. -* 💥 Add [contributing guide](https://scikit-uplift.readthedocs.io/en/latest/contributing.html). +* 💥 Add [contributing guide](https://www.uplift-modeling.com/en/latest/contributing.html). * 💥 Add [code of conduct](https://github.com/maks-sh/scikit-uplift/blob/master/.github/CODE_OF_CONDUCT.md). -* 📝 Reformat [Tutorials](https://scikit-uplift.readthedocs.io/en/latest/tutorials.html) page. +* 📝 Reformat [Tutorials](https://www.uplift-modeling.com/en/latest/tutorials.html) page. * 📝 Add github buttons in docs. * 📝 Add logo compatibility with pypi. ## Version 0.1.2 -### [sklift.models](https://scikit-uplift.readthedocs.io/en/v0.1.2/api/models.html) +### [sklift.models](https://www.uplift-modeling.com/en/v0.1.2/api/models.html) -* 🔨 Fix bugs in [TwoModels](https://scikit-uplift.readthedocs.io/en/v0.1.2/api/models.html#sklift.models.models.TwoModels) for regression problem. +* 🔨 Fix bugs in [TwoModels](https://www.uplift-modeling.com/en/v0.1.2/api/models.html#sklift.models.models.TwoModels) for regression problem. * 📝 Minor code refactoring. -### [sklift.metrics](https://scikit-uplift.readthedocs.io/en/v0.1.2/api/metrics.html) +### [sklift.metrics](https://www.uplift-modeling.com/en/v0.1.2/api/metrics.html) * 📝 Minor code refactoring.
-### [sklift.viz](https://scikit-uplift.readthedocs.io/en/v0.1.2/api/viz.html) +### [sklift.viz](https://www.uplift-modeling.com/en/v0.1.2/api/viz.html) -* 💥 Add bar plot in [plot_uplift_by_percentile](https://scikit-uplift.readthedocs.io/en/v0.1.2/api/viz.html#sklift.viz.base.plot_uplift_by_percentile) by [@ElisovaIra](https://github.com/ElisovaIra). -* 🔨 Fix bug in [plot_uplift_by_percentile](https://scikit-uplift.readthedocs.io/en/v0.1.2/api/viz.html#sklift.viz.base.plot_uplift_by_percentile). +* 💥 Add bar plot in [plot_uplift_by_percentile](https://www.uplift-modeling.com/en/v0.1.2/api/viz.html#sklift.viz.base.plot_uplift_by_percentile) by [@ElisovaIra](https://github.com/ElisovaIra). +* 🔨 Fix bug in [plot_uplift_by_percentile](https://www.uplift-modeling.com/en/v0.1.2/api/viz.html#sklift.viz.base.plot_uplift_by_percentile). * 📝 Minor code refactoring. ## Version 0.1.1 -### [sklift.viz](https://scikit-uplift.readthedocs.io/en/v0.1.1/api/viz.html) +### [sklift.viz](https://www.uplift-modeling.com/en/v0.1.1/api/viz.html) -* 💥 Add [plot_uplift_by_percentile](https://scikit-uplift.readthedocs.io/en/v0.1.1/api/viz.html#sklift.viz.base.plot_uplift_by_percentile) by [@ElisovaIra](https://github.com/ElisovaIra). -* 🔨 Fix bug with import [plot_treatment_balance_curve](https://scikit-uplift.readthedocs.io/en/v0.1.1/api/viz.html#sklift.viz.base.plot_treatment_balance_curve). +* 💥 Add [plot_uplift_by_percentile](https://www.uplift-modeling.com/en/v0.1.1/api/viz.html#sklift.viz.base.plot_uplift_by_percentile) by [@ElisovaIra](https://github.com/ElisovaIra). +* 🔨 Fix bug with import [plot_treatment_balance_curve](https://www.uplift-modeling.com/en/v0.1.1/api/viz.html#sklift.viz.base.plot_treatment_balance_curve).
-### [sklift.metrics](https://scikit-uplift.readthedocs.io/en/v0.1.1/api/metrics.html) +### [sklift.metrics](https://www.uplift-modeling.com/en/v0.1.1/api/metrics.html) -* 💥 Add [response_rate_by_percentile](https://scikit-uplift.readthedocs.io/en/v0.1.1/api/viz.html#sklift.metrics.metrics.response_rate_by_percentile) by [@ElisovaIra](https://github.com/ElisovaIra). -* 🔨 Fix bug with import [uplift_auc_score](https://scikit-uplift.readthedocs.io/en/v0.1.1/api/metrics.html#sklift.metrics.metrics.uplift_auc_score) and [qini_auc_score](https://scikit-uplift.readthedocs.io/en/v0.1.1/metrics.html#sklift.metrics.metrics.qini_auc_score). +* 💥 Add [response_rate_by_percentile](https://www.uplift-modeling.com/en/v0.1.1/api/viz.html#sklift.metrics.metrics.response_rate_by_percentile) by [@ElisovaIra](https://github.com/ElisovaIra). +* 🔨 Fix bug with import [uplift_auc_score](https://www.uplift-modeling.com/en/v0.1.1/api/metrics.html#sklift.metrics.metrics.uplift_auc_score) and [qini_auc_score](https://www.uplift-modeling.com/en/v0.1.1/metrics.html#sklift.metrics.metrics.qini_auc_score). * 📝 Fix typos in docstrings. ### Miscellaneous @@ -79,25 +101,25 @@ ## Version 0.1.0 -### [sklift.models](https://scikit-uplift.readthedocs.io/en/v0.1.0/api/models.html) +### [sklift.models](https://www.uplift-modeling.com/en/v0.1.0/api/models.html) -* 📝 Fix typo in [TwoModels](https://scikit-uplift.readthedocs.io/en/v0.1.0/api/models.html#sklift.models.models.TwoModels) docstring by [@spiaz](https://github.com/spiaz). +* 📝 Fix typo in [TwoModels](https://www.uplift-modeling.com/en/v0.1.0/api/models.html#sklift.models.models.TwoModels) docstring by [@spiaz](https://github.com/spiaz). * 📝 Improve docstrings and add references to all approaches.
-### [sklift.metrics](https://scikit-uplift.readthedocs.io/en/v0.1.0/api/metrics.html) +### [sklift.metrics](https://www.uplift-modeling.com/en/v0.1.0/api/metrics.html) -* 💥 Add [treatment_balance_curve](https://scikit-uplift.readthedocs.io/en/v0.1.0/api/metrics.html#sklift.metrics.metrics.treatment_balance_curve) by [@spiaz](https://github.com/spiaz). -* ❗️ The metrics `auuc` and `auqc` are now respectively renamed to [uplift_auc_score](https://scikit-uplift.readthedocs.io/en/v0.1.0/api/metrics.html#sklift.metrics.metrics.uplift_auc_score) and [qini_auc_score](https://scikit-uplift.readthedocs.io/en/v0.1.0/metrics.html#sklift.metrics.metrics.qini_auc_score). So, `auuc` and `auqc` will be removed in 0.2.0. -* ❗️ Add a new parameter `startegy` in [uplift_at_k](https://scikit-uplift.readthedocs.io/en/v0.1.0/metrics.html#sklift.metrics.metrics.uplift_at_k). +* 💥 Add [treatment_balance_curve](https://www.uplift-modeling.com/en/v0.1.0/api/metrics.html#sklift.metrics.metrics.treatment_balance_curve) by [@spiaz](https://github.com/spiaz). +* ❗️ The metrics `auuc` and `auqc` are now respectively renamed to [uplift_auc_score](https://www.uplift-modeling.com/en/v0.1.0/api/metrics.html#sklift.metrics.metrics.uplift_auc_score) and [qini_auc_score](https://www.uplift-modeling.com/en/v0.1.0/metrics.html#sklift.metrics.metrics.qini_auc_score). So, `auuc` and `auqc` will be removed in 0.2.0. +* ❗️ Add a new parameter `strategy` in [uplift_at_k](https://www.uplift-modeling.com/en/v0.1.0/metrics.html#sklift.metrics.metrics.uplift_at_k). -### [sklift.viz](https://scikit-uplift.readthedocs.io/en/v0.1.0/api/viz.html) +### [sklift.viz](https://www.uplift-modeling.com/en/v0.1.0/api/viz.html) -* 💥 Add [plot_treatment_balance_curve](https://scikit-uplift.readthedocs.io/en/v0.1.0/api/viz.html#sklift.viz.base.plot_treatment_balance_curve) by [@spiaz](https://github.com/spiaz).
-* 📝 fix typo in [plot_uplift_qini_curves](https://scikit-uplift.readthedocs.io/en/v0.1.0/api/viz.html#sklift.viz.base.plot_uplift_qini_curves) by [@spiaz](https://github.com/spiaz). +* 💥 Add [plot_treatment_balance_curve](https://www.uplift-modeling.com/en/v0.1.0/api/viz.html#sklift.viz.base.plot_treatment_balance_curve) by [@spiaz](https://github.com/spiaz). +* 📝 Fix typo in [plot_uplift_qini_curves](https://www.uplift-modeling.com/en/v0.1.0/api/viz.html#sklift.viz.base.plot_uplift_qini_curves) by [@spiaz](https://github.com/spiaz). ### Miscellaneous * ❗️ Remove sklift.preprocess submodule. * 💥 Add compatibility of tutorials with colab and add colab buttons by [@ElMaxuno](https://github.com/ElMaxuno). * 💥 Add Changelog. -* 📝 Change the documentation structure. Add next pages: [Tutorials](https://scikit-uplift.readthedocs.io/en/v0.1.0/tutorials.html), [Release History](https://scikit-uplift.readthedocs.io/en/v0.1.0/changelog.html) and [Hall of fame](https://scikit-uplift.readthedocs.io/en/v0.1.0/hall_of_fame.html). \ No newline at end of file +* 📝 Change the documentation structure. Add the following pages: [Tutorials](https://www.uplift-modeling.com/en/v0.1.0/tutorials.html), [Release History](https://www.uplift-modeling.com/en/v0.1.0/changelog.html) and [Hall of fame](https://www.uplift-modeling.com/en/v0.1.0/hall_of_fame.html). \ No newline at end of file diff --git a/docs/conf.py b/docs/conf.py index 14cdfdc..3a6f2fa 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -51,9 +51,12 @@ def get_version(): "sphinx.ext.mathjax", "sphinx.ext.napoleon", "recommonmark", - "sphinx.ext.intersphinx" + "sphinx.ext.intersphinx", + "sphinxcontrib.bibtex" ] +bibtex_bibfiles = ['refs.bib'] + master_doc = 'index' # Add any paths that contain templates here, relative to this directory.
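The conf.py hunk above registers `sphinxcontrib.bibtex` and points it at a BibTeX file. A minimal standalone sketch of the resulting configuration (only the extensions list, `bibtex_bibfiles = ['refs.bib']`, and `master_doc` come from the diff; treat the rest as illustrative assumptions about the project's conf.py):

```python
# Sketch of docs/conf.py after the change above: the extensions list gains
# "sphinxcontrib.bibtex", and bibtex_bibfiles names the BibTeX file(s) that
# the .. bibliography:: directive will read. Paths are resolved relative to
# the documentation source directory.

extensions = [
    "sphinx.ext.mathjax",
    "sphinx.ext.napoleon",
    "recommonmark",
    "sphinx.ext.intersphinx",
    "sphinxcontrib.bibtex",
]

# Required by sphinxcontrib-bibtex 2.x: an explicit list of .bib files.
bibtex_bibfiles = ["refs.bib"]

master_doc = "index"
```

Note that the docs/requirements.txt hunk further down adds `sphinxcontrib-bibtex` to the doc-build dependencies, which is what makes this extension importable during the Sphinx CI build.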
diff --git a/docs/contributing.md b/docs/contributing.md index a50a828..36275e0 100644 --- a/docs/contributing.md +++ b/docs/contributing.md @@ -19,8 +19,8 @@ So, please make a pull request to the ``dev`` branch. 1. Fork the [project repository](https://github.com/maks-sh/scikit-uplift). 2. Clone your fork of the scikit-uplift repo from your GitHub account to your local disk: ``` bash - $ git clone git@github.com:YourLogin/scikit-uplift.git - $ cd scikit-learn + $ git clone https://github.com/YourName/scikit-uplift + $ cd scikit-uplift ``` 3. Add the upstream remote. This saves a reference to the main scikit-uplift repository, which you can use to keep your repository synchronized with the latest changes: ``` bash @@ -36,7 +36,7 @@ So, please make a pull request to the ``dev`` branch. $ git checkout -b feature/my_new_feature ``` and start making changes. Always use a feature branch. It’s a good practice. -6. Develop the feature on your feature branch on your computer, using Git to do the version control. When you’re done editing, add changed files using ``git add`` and then ``git commit``. +6. Develop the feature on your feature branch on your computer, using Git to do the version control. When you’re done editing, add changed files using ``git add .`` and then ``git commit``. Then push the changes to your GitHub account with: ``` bash diff --git a/docs/hall_of_fame.rst b/docs/hall_of_fame.rst index 3e093ce..721926b 100644 --- a/docs/hall_of_fame.rst +++ b/docs/hall_of_fame.rst @@ -4,7 +4,10 @@ Hall of Fame Here are the links to the competitions, names of the winners and to their solutions, where scikit-uplift was used.
-`X5 RetailHero Uplift Modeling contest `_ -============================================================================================= +`X5 Retail Hero: Uplift Modeling for Promotional Campaign `_ +======================================================================================================================== + +Predict how much the purchase probability could increase as a result of sending an advertising SMS. + 2. `Kirill Liksakov `_ `solution `_ diff --git a/docs/index.rst b/docs/index.rst index f6604a5..dfe04f6 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -8,28 +8,39 @@ scikit-uplift ************** -**scikit-uplift (sklift)** is a Python module for basic approaches of uplift modeling built on top of scikit-learn. +**scikit-uplift (sklift)** is an uplift modeling Python package that provides fast sklearn-style model implementations, evaluation metrics and visualization tools. -Uplift prediction aims to estimate the causal impact of a treatment at the individual level. +The main idea is to provide an easy-to-use and fast Python package for uplift modeling. It delivers the model interface with the familiar scikit-learn API. One can use any popular estimator (for instance, from the CatBoost library). -Read more about uplift modeling problem in :ref:`User Guide `, -also articles in russian on habr.com: `Part 1 `__ +*Uplift modeling* estimates the causal effect of a treatment and uses it to effectively target customers that are most likely to respond to a marketing campaign. + +**Use cases for uplift modeling:** + +* Target customers in the marketing campaign. Especially useful when promoting a popular product for which a large share of customers make the target action by themselves, without any influence. By modeling uplift you can find customers who are likely to make the target action (for instance, install an app) only when treated (for instance, received a push).
+ +* Combine a churn model and an uplift model to offer some bonus to a group of customers who are likely to churn. + +* Select a tiny group of customers in the campaign where the price per customer is high. + +Read more about the *uplift modeling* problem in `User Guide `__. + +Articles in Russian on habr.com: `Part 1 `__ and `Part 2 `__. Features ######### -- Comfortable and intuitive style of modelling like scikit-learn; +- Comfortable and intuitive scikit-learn-like API; -- Applying any estimator adheres to scikit-learn conventions; +- Applying any estimator compatible with scikit-learn (e.g. XGBoost, LightGBM, CatBoost, etc.); -- All approaches can be used in sklearn.pipeline. See example of usage: |Open In Colab3|_; +- All approaches can be used in `sklearn.pipeline`. See the example of usage: |Open In Colab3|_; -- Almost all implemented approaches solve both the problem of classification and regression; +- Almost all implemented approaches solve both classification and regression problems; -- A lot of metrics (Such as *Area Under Uplift Curve* or *Area Under Qini Curve*) are implemented to evaluate your uplift model; +- More uplift metrics than you have ever seen in one place! Includes brilliant ones like *Area Under Uplift Curve* (AUUC) or *Area Under Qini Curve* (Qini coefficient) with ideal cases; -- Useful graphs for analyzing the built model. +- Nice and useful visualizations for analyzing model performance.
**The package currently supports the following methods:** @@ -50,18 +61,20 @@ Project info * GitHub repository: https://github.com/maks-sh/scikit-uplift * Github examples: https://github.com/maks-sh/scikit-uplift/tree/master/notebooks -* Documentation: https://scikit-uplift.readthedocs.io/en/latest/ -* Contributing guide: https://scikit-uplift.readthedocs.io/en/latest/contributing.html +* Documentation: https://www.uplift-modeling.com/en/latest/index.html +* Contributing guide: https://www.uplift-modeling.com/en/latest/contributing.html * License: `MIT `__ Community ############# -We welcome new contributors of all experience levels. +Sklift is being actively maintained and welcomes new contributors of all experience levels. -- Please see our `Contributing Guide `_ for more details. +- Please see our `Contributing Guide `_ for more details. - By participating in this project, you agree to abide by its `Code of Conduct `__. +If you have any questions, please contact us at team@uplift-modeling.com + .. image:: https://sourcerer.io/fame/maks-sh/maks-sh/scikit-uplift/images/0 :target: https://sourcerer.io/fame/maks-sh/maks-sh/scikit-uplift/links/0 :alt: Top contributor 1 diff --git a/docs/quick_start.rst b/docs/quick_start.rst index 89b3cc9..92794e2 100644 --- a/docs/quick_start.rst +++ b/docs/quick_start.rst @@ -13,7 +13,10 @@ Quick Start See the **RetailHero tutorial notebook** (`EN`_ |Open In Colab1|_, `RU`_ |Open In Colab2|_) for details. -**Train and predict your uplift model** +Train and predict your uplift model +==================================== + +Use the intuitive python API to train uplift models. .. code-block:: python :linenos: @@ -38,7 +41,8 @@ See the **RetailHero tutorial notebook** (`EN`_ |Open In Colab1|_, `RU`_ |Open I # predict uplift uplift_preds = tm.predict(X_val) -**Evaluate your uplift model** +Evaluate your uplift model +=========================== .. 
code-block:: python :linenos: @@ -66,14 +70,15 @@ See the **RetailHero tutorial notebook** (`EN`_ |Open In Colab1|_, `RU`_ |Open I tm_wau = weighted_average_uplift(y_true=y_val, uplift=uplift_preds, treatment=treat_val) -**Vizualize the results** +Visualize the results +====================== .. code-block:: python :linenos: from sklift.viz import plot_qini_curve - plot_qini_curve(y_true=y_val, uplift=uplift_preds, treatment=treat_val) + plot_qini_curve(y_true=y_val, uplift=uplift_preds, treatment=treat_val, negative_effect=True) .. image:: _static/images/quick_start_qini.png :width: 514px diff --git a/docs/requirements.txt b/docs/requirements.txt index 342151a..05ffdcc 100644 --- a/docs/requirements.txt +++ b/docs/requirements.txt @@ -1,3 +1,4 @@ sphinx-autobuild sphinx_rtd_theme -recommonmark \ No newline at end of file +recommonmark +sphinxcontrib-bibtex \ No newline at end of file diff --git a/docs/user_guide/index.rst b/docs/user_guide/index.rst index 11c433f..0b9ab78 100644 --- a/docs/user_guide/index.rst +++ b/docs/user_guide/index.rst @@ -7,7 +7,7 @@ User Guide .. image:: https://habrastorage.org/webt/hf/7i/nu/hf7inuu3agtnwl1yo0g--mznzno.jpeg :alt: Cover of User Guide for uplift modeling and causal inference -Uplift modeling estimates the effect of communication action on some customer outcome and gives an opportunity to efficiently target customers which are most likely to respond to a marketing campaign. +Uplift modeling estimates the effect of a communication action on some customer outcomes and gives an opportunity to efficiently target the customers who are most likely to respond to a marketing campaign. It is relatively easy to implement, but surprisingly poorly covered in the machine learning courses and literature. This guide is going to shed some light on the essentials of causal inference estimating and uplift modeling.
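The uplift@k metric used in the quick start above has a simple core: rank customers by predicted uplift, keep the top k fraction, and compare mean responses between treated and control inside that group. A hand-rolled sketch on toy data (illustrative only — not the sklift implementation, which also validates inputs and supports a `strategy` argument):

```python
# Hand-rolled uplift@k: sort by predicted uplift, take the top k fraction,
# then difference of mean responses between treated and control customers.
def uplift_at_k(y_true, uplift, treatment, k=0.3):
    order = sorted(range(len(uplift)), key=lambda i: uplift[i], reverse=True)
    top = order[: int(len(order) * k)]
    treated = [y_true[i] for i in top if treatment[i] == 1]
    control = [y_true[i] for i in top if treatment[i] == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)

y_true = [1, 0, 1, 0, 1, 0, 0, 0, 1, 0]
treatment = [1, 0, 1, 0, 1, 1, 0, 0, 0, 1]
uplift = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
print(uplift_at_k(y_true, uplift, treatment, k=0.5))  # 1.0
```

In the top half the treated customers all responded and the controls did not, hence the maximal value of 1.0.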
@@ -44,5 +44,5 @@ If you find this User Guide useful for your research, please consider citing: year = {2020}, publisher = {GitHub}, journal = {GitHub repository}, - howpublished = {\url{https://scikit-uplift.readthedocs.io/en/latest/user_guide/index.html}} + howpublished = {\url{https://www.uplift-modeling.com/en/latest/user_guide/index.html}} } \ No newline at end of file diff --git a/docs/user_guide/introduction/cate.rst b/docs/user_guide/introduction/cate.rst index 4e4374b..4cb07c5 100644 --- a/docs/user_guide/introduction/cate.rst +++ b/docs/user_guide/introduction/cate.rst @@ -2,7 +2,7 @@ Causal Inference: Basics ****************************************** -In a perfect world, we want to calculate a difference in a person's reaction received communication and the reaction without receiving any communication. +In a perfect world, we want to calculate the difference between a person's reaction after receiving a communication and the reaction without receiving any communication. But there is a problem: we can not make a communication (send an e-mail) and do not make a communication (no e-mail) at the same time. .. image:: https://habrastorage.org/webt/fl/fi/dz/flfidz416o7of5j0nmgdjqqkzfe.jpeg @@ -29,8 +29,8 @@ But we can estimate CATE or *uplift* of an object: Where: -- :math:`W_i \in {0, 1}` - a binary variable: 1 if person :math:`i` receives the treatment :guilabel:`treatment group`, and 0 if person :math:`i` receives no treatment :guilabel:`control group`; -- :math:`Y_i` - person :math:`i`’s observed outcome, which is actually equal: +- :math:`W_i \in {0, 1}` - a binary variable: 1 if person :math:`i` receives the treatment :guilabel:`treatment group`, and 0 if person :math:`i` receives no treatment :guilabel:`control group`; +- :math:`Y_i` - person :math:`i`’s observed outcome, which is equal: ..
math:: Y_i = W_i * Y_i^1 + (1 - W_i) * Y_i^0 = \ @@ -41,12 +41,12 @@ Where: This won’t identify the CATE unless one is willing to assume that :math:`W_i` is independent of :math:`Y_i^1` and :math:`Y_i^0` conditional on :math:`X_i`. This assumption is the so-called *Unconfoundedness Assumption* or the *Conditional Independence Assumption* (CIA) found in the social sciences and medical literature. This assumption holds true when treatment assignment is random conditional on :math:`X_i`. -Briefly this can be written as: +Briefly, this can be written as: .. math:: CIA : \{Y_i^0, Y_i^1\} \perp \!\!\! \perp W_i \vert X_i -Also introduce additional useful notation. +Also, introduce additional useful notation. Let us define the :guilabel:`propensity score`, :math:`p(X_i) = P(W_i = 1| X_i)`, i.e. the probability of treatment given :math:`X_i`. References diff --git a/docs/user_guide/introduction/clients.rst b/docs/user_guide/introduction/clients.rst index 0506112..cc8cf3f 100644 --- a/docs/user_guide/introduction/clients.rst +++ b/docs/user_guide/introduction/clients.rst @@ -2,7 +2,7 @@ Types of customers ****************************************** -We can determine 4 types of customers based on a response to a treatment: +We can determine 4 types of customers based on a response to treatment: .. image:: ../../_static/images/user_guide/ug_clients_types.jpg :alt: Classification of customers based on their response to a treatment @@ -10,10 +10,10 @@ We can determine 4 types of customers based on a response to a treatment: :height: 282 px :align: center -- :guilabel:`Do-Not-Disturbs` *(a.k.a. Sleeping-dogs)* have a strong negative response to a marketing communication. They are going to purchase if *NOT* treated and will *NOT* purchase *IF* treated. It is not only a wasted marketing budget but also a negative impact. For instance, customers targeted could result in rejecting current products or services. In terms of math: :math:`W_i = 1, Y_i = 0` or :math:`W_i = 0, Y_i = 1`. 
+- :guilabel:`Do-Not-Disturbs` *(a.k.a. Sleeping-dogs)* have a strong negative response to marketing communication. They are going to purchase if *NOT* treated and will *NOT* purchase *IF* treated. It is not only a wasted marketing budget but also a negative impact. For instance, targeting these customers could result in their rejecting current products or services. In terms of math: :math:`W_i = 1, Y_i = 0` or :math:`W_i = 0, Y_i = 1`. - :guilabel:`Lost Causes` will *NOT* purchase the product *NO MATTER* they are contacted or not. The marketing budget in this case is also wasted because it has no effect. In terms of math: :math:`W_i = 1, Y_i = 0` or :math:`W_i = 0, Y_i = 0`. - :guilabel:`Sure Things` will purchase *ANYWAY* no matter they are contacted or not. There is no motivation to spend the budget because it also has no effect. In terms of math: :math:`W_i = 1, Y_i = 1` or :math:`W_i = 0, Y_i = 1`. -- :guilabel:`Persuadables` will always respond *POSITIVE* to the marketing communication. They is going to purchase *ONLY* if contacted (or sometimes they purchase *MORE* or *EARLIER* only if contacted). This customer's type should be the only target for the marketing campaign. In terms of math: :math:`W_i = 0, Y_i = 0` or :math:`W_i = 1, Y_i = 1`. +- :guilabel:`Persuadables` will always respond *POSITIVELY* to marketing communication. They are going to purchase *ONLY* if contacted (or sometimes they purchase *MORE* or *EARLIER* only if contacted). This customer type should be the only target for the marketing campaign. In terms of math: :math:`W_i = 0, Y_i = 0` or :math:`W_i = 1, Y_i = 1`. Because we can't communicate and not communicate with the customer at the same time, we will never be able to observe exactly which type a particular customer belongs to.
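The four types map cleanly onto the observed (W, Y) pairs from the math conditions above. A tiny sketch (editor's illustration, not part of the docs) makes the fundamental ambiguity explicit — every single observation is consistent with exactly two types:

```python
# Map an observed (treatment W, outcome Y) pair to the set of customer
# types it is consistent with; the true type itself is never observable.
CONSISTENT_TYPES = {
    (1, 1): {"Persuadable", "Sure Thing"},
    (1, 0): {"Lost Cause", "Do-Not-Disturb"},
    (0, 1): {"Sure Thing", "Do-Not-Disturb"},
    (0, 0): {"Lost Cause", "Persuadable"},
}

# Only by combining both observation patterns can a type be pinned down:
print(CONSISTENT_TYPES[(1, 1)] & CONSISTENT_TYPES[(0, 0)])  # {'Persuadable'}
```

This is exactly why uplift must be *estimated* across groups rather than read off individual customers.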
diff --git a/docs/user_guide/introduction/comparison.rst b/docs/user_guide/introduction/comparison.rst index e6ced35..a3e3b0c 100644 --- a/docs/user_guide/introduction/comparison.rst +++ b/docs/user_guide/introduction/comparison.rst @@ -9,14 +9,14 @@ There are several ways to use machine learning to select customers for a marketi :alt: Comparison with other models - :guilabel:`The Look-alike model` (or Positive Unlabeled Learning) evaluates a probability that the customer is going to accomplish a target action. A training dataset contains known positive objects (for instance, users who have installed an app) and random negative objects (a random subset of all other customers who have not installed the app). The model searches for customers who are similar to those who made the target action. -- :guilabel:`The Response model` evaluates the probability that the customer is going to accomplish the target action if there was a communication (a.k.a treatment). In this case the training dataset is data collected after some interaction with the customers. In contrast to the first approach, we have confirmed positive and negative observations at our disposal (for instance, the customer who decides to issue a credit card or to decline an offer). +- :guilabel:`The Response model` evaluates the probability that the customer is going to accomplish the target action if there was a communication (a.k.a treatment). In this case, the training dataset is data collected after some interaction with the customers. In contrast to the first approach, we have confirmed positive and negative observations at our disposal (for instance, the customer who decides to issue a credit card or to decline an offer). - :guilabel:`The Uplift model` evaluates the net effect of communication by trying to select only those customers who are going to perform the target action only when there is some advertising exposure presenting to them. 
The model predicts a difference between the customer's behavior when there is a treatment (communication) and when there is no treatment (no communication). When should we use uplift modeling? Uplift modeling is used when the customer's target action is likely to happen without any communication. For instance, we want to promote a popular product but we don't want to spend our marketing budget on customers who will buy the product anyway with or without communication. -If the product is not popular and it is has to be promoted to be bought, then a task turns to the response modeling task. +If the product is not popular and it has to be promoted to be bought, then the task turns into a response modeling task. References ========== diff --git a/docs/user_guide/introduction/data_collection.rst b/docs/user_guide/introduction/data_collection.rst index b614a81..5e582c0 100644 --- a/docs/user_guide/introduction/data_collection.rst +++ b/docs/user_guide/introduction/data_collection.rst @@ -11,7 +11,7 @@ There are few additional steps different from a standard data collection procedu Data collected from the marketing experiment consists of the customer's responses to the marketing offer (target). -The only difference between the experiment and the future uplift model's campaign is a fact that in the first case we choose random customers to make a promotion. In the second case the choice of a customer to communicate with is based on the predicted value returned by the uplift model. If the marketing campaign significantly differs from the experiment used to collect data, the model will be less accurate. +The only difference between the experiment and the future uplift model's campaign is that in the first case we choose random customers to make a promotion. In the second case, the choice of a customer to communicate with is based on the predicted value returned by the uplift model.
If the marketing campaign significantly differs from the experiment used to collect data, the model will be less accurate. There is a trick: before running the marketing campaign, it is recommended to randomly subset a small part of the customer base and divide it into a control and a treatment group again, similar to the previous experiment. Using this data, you will not only be able to accurately evaluate the effectiveness of the campaign but also collect additional data for a further model retraining. diff --git a/docs/user_guide/models/classification.rst b/docs/user_guide/models/classification.rst new file mode 100644 index 0000000..e363950 --- /dev/null +++ b/docs/user_guide/models/classification.rst @@ -0,0 +1,34 @@ +*********************** +Approach classification +*********************** + +Uplift modeling techniques can be grouped into :guilabel:`data preprocessing` and :guilabel:`data processing` approaches. + +.. image:: ../../_static/images/user_guide/ug_uplift_approaches.png + :align: center + :alt: Classification of uplift modeling techniques: data preprocessing and data processing + +Data preprocessing +==================== + +In the :guilabel:`preprocessing` approaches, existing out-of-the-box learning methods are used, after pre- or post-processing of the data and outcomes. + +A popular and generic data preprocessing approach is :ref:`the flipped label approach `, also called class transformation approach. + +Other data preprocessing approaches extend the set of predictor variables to allow for the estimation of uplift. An example is :ref:`the single model with treatment as feature `. + +Data processing +==================== + +In the :guilabel:`data processing` approaches, new learning methods and methodologies are developed that aim to optimize expected uplift more directly. + +Data processing techniques include two categories: :guilabel:`indirect` and :guilabel:`direct` estimation approaches. 
+ +:guilabel:`Indirect` estimation approaches include :ref:`the two-model approach `. + +:guilabel:`Direct` estimation approaches are typically adaptations of decision tree algorithms. The adaptations include modified splitting criteria and dedicated pruning techniques. + +References +========== + +1️⃣ Devriendt, Floris, Tias Guns and Wouter Verbeke. “Learning to rank for uplift modeling.” ArXiv abs/2002.05897 (2020): n. pag. diff --git a/docs/user_guide/models/index.rst b/docs/user_guide/models/index.rst index 93a4fb6..dcccfce 100644 --- a/docs/user_guide/models/index.rst +++ b/docs/user_guide/models/index.rst @@ -13,6 +13,7 @@ Models :maxdepth: 3 :caption: Contents + ./classification ./solo_model ./revert_label ./two_models \ No newline at end of file diff --git a/docs/user_guide/models/revert_label.rst b/docs/user_guide/models/revert_label.rst index b08634f..935e701 100644 --- a/docs/user_guide/models/revert_label.rst +++ b/docs/user_guide/models/revert_label.rst @@ -13,9 +13,9 @@ The main idea is to predict a slightly changed target :math:`Z_i`: .. math:: Z_i = Y_i \cdot W_i + (1 - Y_i) \cdot (1 - W_i), -* :math:`Z_i` - new target for the :math:`i` customer; +* :math:`Z_i` - a new target for the :math:`i` customer; -* :math:`Y_i` - previous target for the :math:`i` customer; +* :math:`Y_i` - a previous target for the :math:`i` customer; * :math:`W_i` - treatment flag assigned to the :math:`i` customer. diff --git a/docs/user_guide/models/two_models.rst b/docs/user_guide/models/two_models.rst index e7e210c..0a6477e 100644 --- a/docs/user_guide/models/two_models.rst +++ b/docs/user_guide/models/two_models.rst @@ -38,7 +38,7 @@ The authors of this method proposed to use the same idea to solve the problem of .. hint:: In sklift this approach corresponds to the :class:`.TwoModels` class and the **ddr_control** method.
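The flipped-label target Z_i from the revert_label patch above, and the standard way to recover uplift from a model trained on it, fit in a few lines. This is a hedged sketch assuming balanced treatment and control groups (as the class transformation method requires), not the sklift `ClassTransformation` code:

```python
# Flipped label: Z = Y*W + (1 - Y)*(1 - W), i.e. Z = 1 for treated
# responders and for untreated non-responders.
def flipped_label(y, w):
    return y * w + (1 - y) * (1 - w)

# With P(W=1) = 0.5, uplift(x) = 2 * P(Z=1 | x) - 1, where P(Z=1 | x)
# comes from any ordinary binary classifier trained on Z.
def uplift_from_z(p_z):
    return 2 * p_z - 1

assert flipped_label(1, 1) == 1 and flipped_label(0, 0) == 1
assert flipped_label(1, 0) == 0 and flipped_label(0, 1) == 0
print(uplift_from_z(0.75))  # 0.5
```

A classifier that is 75% sure of Z = 1 thus implies an estimated uplift of 0.5 for that customer.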
-At the beginning we train the classifier based on the control data: +At the beginning, we train the classifier based on the control data: .. math:: P^C = P(Y=1| X, W = 0), @@ -68,7 +68,7 @@ the :math:`P_C` classifier. In sklift this approach corresponds to the :class:`.TwoModels` class and the **ddr_treatment** method. There is an important remark about the data nature. -It is important to calibrate model's scores into probabilities if treatment and control data have a different nature. +It is important to calibrate the model's scores into probabilities if treatment and control data have a different nature. Model calibration techniques are well described `in the scikit-learn documentation`_. References diff --git a/notebooks/pipeline_usage_EN.ipynb b/notebooks/pipeline_usage_EN.ipynb index 018c9e7..da90436 100644 --- a/notebooks/pipeline_usage_EN.ipynb +++ b/notebooks/pipeline_usage_EN.ipynb @@ -51,8 +51,8 @@ "execution_count": 1, "metadata": { "ExecuteTime": { - "end_time": "2020-05-30T22:38:40.696778Z", - "start_time": "2020-05-30T22:38:40.692482Z" + "end_time": "2021-02-07T01:01:39.897817Z", + "start_time": "2021-02-07T01:01:39.890409Z" } }, "outputs": [], @@ -60,32 +60,6 @@ "!pip install scikit-uplift xgboost==1.0.2 category_encoders==2.1.0 -U" ] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Secondly, load the data:" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": { - "ExecuteTime": { - "end_time": "2020-05-30T22:38:40.705782Z", - "start_time": "2020-05-30T22:38:40.701316Z" - } - }, - "outputs": [], - "source": [ - "import urllib.request\n", - "\n", - "\n", - "csv_path = '/content/Hilstorm.csv'\n", - "url = 'http://www.minethatdata.com/Kevin_Hillstrom_MineThatData_E-MailAnalytics_DataMiningChallenge_2008.03.20.csv'\n", - "urllib.request.urlretrieve(url, csv_path)" - ] - }, { "cell_type": "markdown", "metadata": {}, @@ -99,20 +73,21 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 2, 
"metadata": { "ExecuteTime": { - "end_time": "2020-05-30T22:38:41.739525Z", - "start_time": "2020-05-30T22:38:40.711390Z" - } + "end_time": "2021-02-07T01:01:42.438253Z", + "start_time": "2021-02-07T01:01:39.901510Z" + }, + "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Shape of the dataset before processing: (64000, 12)\n", - "Shape of the dataset after processing: (42693, 10)\n" + "Shape of the dataset before processing: (64000, 8)\n", + "Shape of the dataset after processing: (42693, 8)\n" ] }, { @@ -144,8 +119,6 @@ " zip_code\n", " newbie\n", " channel\n", - " visit\n", - " treatment\n", " \n", " \n", " \n", @@ -159,8 +132,6 @@ " Surburban\n", " 0\n", " Phone\n", - " 0\n", - " 1\n", " \n", " \n", " 1\n", @@ -172,8 +143,6 @@ " Rural\n", " 1\n", " Web\n", - " 0\n", - " 0\n", " \n", " \n", " 2\n", @@ -185,8 +154,6 @@ " Surburban\n", " 1\n", " Web\n", - " 0\n", - " 1\n", " \n", " \n", " 4\n", @@ -198,8 +165,6 @@ " Urban\n", " 0\n", " Web\n", - " 0\n", - " 1\n", " \n", " \n", " 5\n", @@ -211,49 +176,46 @@ " Surburban\n", " 0\n", " Phone\n", - " 1\n", - " 1\n", " \n", " \n", "\n", "" ], "text/plain": [ - " recency history_segment history mens womens zip_code newbie channel \\\n", - "0 10 2) $100 - $200 142.44 1 0 Surburban 0 Phone \n", - "1 6 3) $200 - $350 329.08 1 1 Rural 1 Web \n", - "2 7 2) $100 - $200 180.65 0 1 Surburban 1 Web \n", - "4 2 1) $0 - $100 45.34 1 0 Urban 0 Web \n", - "5 6 2) $100 - $200 134.83 0 1 Surburban 0 Phone \n", - "\n", - " visit treatment \n", - "0 0 1 \n", - "1 0 0 \n", - "2 0 1 \n", - "4 0 1 \n", - "5 1 1 " + " recency history_segment history mens womens zip_code newbie channel\n", + "0 10 2) $100 - $200 142.44 1 0 Surburban 0 Phone\n", + "1 6 3) $200 - $350 329.08 1 1 Rural 1 Web\n", + "2 7 2) $100 - $200 180.65 0 1 Surburban 1 Web\n", + "4 2 1) $0 - $100 45.34 1 0 Urban 0 Web\n", + "5 6 2) $100 - $200 134.83 0 1 Surburban 0 Phone" ] }, - "execution_count": 3, + "execution_count": 2, 
"metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", + "from sklift.datasets import fetch_hillstrom\n", "\n", "\n", "%matplotlib inline\n", "\n", - "dataset = pd.read_csv(csv_path)\n", + "bunch = fetch_hillstrom(target_col='visit')\n", + "\n", + "dataset, target, treatment = bunch['data'], bunch['target'], bunch['treatment']\n", + "\n", "print(f'Shape of the dataset before processing: {dataset.shape}')\n", - "dataset = dataset[dataset['segment']!='Mens E-Mail']\n", - "dataset.loc[:, 'treatment'] = dataset['segment'].map({\n", + "\n", + "# Selecting two segments\n", + "dataset = dataset[treatment!='Mens E-Mail']\n", + "target = target[treatment!='Mens E-Mail']\n", + "treatment = treatment[treatment!='Mens E-Mail'].map({\n", " 'Womens E-Mail': 1,\n", " 'No E-Mail': 0\n", "})\n", "\n", - "dataset = dataset.drop(['segment', 'conversion', 'spend'], axis=1)\n", "print(f'Shape of the dataset after processing: {dataset.shape}')\n", "dataset.head()" ] @@ -267,11 +229,11 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 3, "metadata": { "ExecuteTime": { - "end_time": "2020-05-30T22:38:42.307545Z", - "start_time": "2020-05-30T22:38:41.743319Z" + "end_time": "2021-02-07T01:01:42.579775Z", + "start_time": "2021-02-07T01:01:42.442595Z" } }, "outputs": [], @@ -279,15 +241,9 @@ "from sklearn.model_selection import train_test_split\n", "\n", "\n", - "Xyt_tr, Xyt_val = train_test_split(dataset, test_size=0.5, random_state=42)\n", - "\n", - "X_tr = Xyt_tr.drop(['visit', 'treatment'], axis=1)\n", - "y_tr = Xyt_tr['visit']\n", - "treat_tr = Xyt_tr['treatment']\n", - "\n", - "X_val = Xyt_val.drop(['visit', 'treatment'], axis=1)\n", - "y_val = Xyt_val['visit']\n", - "treat_val = Xyt_val['treatment']" + "X_tr, X_val, y_tr, y_val, treat_tr, treat_val = train_test_split(\n", + " dataset, target, treatment, test_size=0.5, random_state=42\n", + ")" ] }, { @@ -299,11 +255,11 @@ }, { "cell_type": "code", - "execution_count": 5, + 
"execution_count": 4, "metadata": { "ExecuteTime": { - "end_time": "2020-05-30T22:38:42.330862Z", - "start_time": "2020-05-30T22:38:42.310277Z" + "end_time": "2021-02-07T01:01:42.600915Z", + "start_time": "2021-02-07T01:01:42.585066Z" } }, "outputs": [ @@ -329,11 +285,11 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 5, "metadata": { "ExecuteTime": { - "end_time": "2020-05-30T22:38:42.430704Z", - "start_time": "2020-05-30T22:38:42.333721Z" + "end_time": "2021-02-07T01:01:42.703537Z", + "start_time": "2021-02-07T01:01:42.603875Z" } }, "outputs": [], @@ -363,11 +319,11 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 6, "metadata": { "ExecuteTime": { - "end_time": "2020-05-30T22:38:43.630594Z", - "start_time": "2020-05-30T22:38:42.433041Z" + "end_time": "2021-02-07T01:01:44.020040Z", + "start_time": "2021-02-07T01:01:42.707311Z" } }, "outputs": [ @@ -402,11 +358,11 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 7, "metadata": { "ExecuteTime": { - "end_time": "2020-05-30T22:38:43.777122Z", - "start_time": "2020-05-30T22:38:43.632881Z" + "end_time": "2021-02-07T01:01:44.184968Z", + "start_time": "2021-02-07T01:01:44.047865Z" } }, "outputs": [ diff --git a/notebooks/pipeline_usage_RU.ipynb b/notebooks/pipeline_usage_RU.ipynb index be4a3e9..16892f5 100644 --- a/notebooks/pipeline_usage_RU.ipynb +++ b/notebooks/pipeline_usage_RU.ipynb @@ -45,44 +45,13 @@ "execution_count": 1, "metadata": { "ExecuteTime": { - "end_time": "2020-05-30T22:40:55.967561Z", - "start_time": "2020-05-30T22:40:55.963558Z" + "end_time": "2021-02-07T01:01:58.302718Z", + "start_time": "2021-02-07T01:01:58.298524Z" } }, "outputs": [], "source": [ - "!pip install scikit-uplift xgboost==1.0.2 category_encoders==2.1.0 -U" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "ExecuteTime": { - "end_time": "2020-04-26T14:28:36.188277Z", - "start_time": "2020-04-26T14:28:36.106561Z" - } - }, - "source": [ - "Загрузим данные:" 
- ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": { - "ExecuteTime": { - "end_time": "2020-05-30T22:40:55.981317Z", - "start_time": "2020-05-30T22:40:55.976955Z" - } - }, - "outputs": [], - "source": [ - "import urllib.request\n", - "\n", - "\n", - "csv_path = '/content/Hilstorm.csv'\n", - "url = 'http://www.minethatdata.com/Kevin_Hillstrom_MineThatData_E-MailAnalytics_DataMiningChallenge_2008.03.20.csv'\n", - "urllib.request.urlretrieve(url, csv_path)" + "# !pip install scikit-uplift xgboost==1.0.2 category_encoders==2.1.0 -U" ] }, { @@ -98,11 +67,11 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 2, "metadata": { "ExecuteTime": { - "end_time": "2020-05-30T22:40:56.877657Z", - "start_time": "2020-05-30T22:40:55.985275Z" + "end_time": "2021-02-07T01:01:59.884250Z", + "start_time": "2021-02-07T01:01:58.315398Z" } }, "outputs": [ @@ -110,8 +79,8 @@ "name": "stdout", "output_type": "stream", "text": [ - "Размер датасета до обработки: (64000, 12)\n", - "Размер датасета после обработки: (42693, 10)\n" + "Размер датасета до обработки: (64000, 8)\n", + "Размер датасета после обработки: (42693, 8)\n" ] }, { @@ -143,8 +112,6 @@ " zip_code\n", " newbie\n", " channel\n", - " visit\n", - " treatment\n", " \n", " \n", " \n", @@ -158,8 +125,6 @@ " Surburban\n", " 0\n", " Phone\n", - " 0\n", - " 1\n", " \n", " \n", " 1\n", @@ -171,8 +136,6 @@ " Rural\n", " 1\n", " Web\n", - " 0\n", - " 0\n", " \n", " \n", " 2\n", @@ -184,8 +147,6 @@ " Surburban\n", " 1\n", " Web\n", - " 0\n", - " 1\n", " \n", " \n", " 4\n", @@ -197,8 +158,6 @@ " Urban\n", " 0\n", " Web\n", - " 0\n", - " 1\n", " \n", " \n", " 5\n", @@ -210,49 +169,46 @@ " Surburban\n", " 0\n", " Phone\n", - " 1\n", - " 1\n", " \n", " \n", "\n", "" ], "text/plain": [ - " recency history_segment history mens womens zip_code newbie channel \\\n", - "0 10 2) $100 - $200 142.44 1 0 Surburban 0 Phone \n", - "1 6 3) $200 - $350 329.08 1 1 Rural 1 Web \n", - "2 7 2) $100 - $200 180.65 0 1 
Surburban 1 Web \n", - "4 2 1) $0 - $100 45.34 1 0 Urban 0 Web \n", - "5 6 2) $100 - $200 134.83 0 1 Surburban 0 Phone \n", - "\n", - " visit treatment \n", - "0 0 1 \n", - "1 0 0 \n", - "2 0 1 \n", - "4 0 1 \n", - "5 1 1 " + " recency history_segment history mens womens zip_code newbie channel\n", + "0 10 2) $100 - $200 142.44 1 0 Surburban 0 Phone\n", + "1 6 3) $200 - $350 329.08 1 1 Rural 1 Web\n", + "2 7 2) $100 - $200 180.65 0 1 Surburban 1 Web\n", + "4 2 1) $0 - $100 45.34 1 0 Urban 0 Web\n", + "5 6 2) $100 - $200 134.83 0 1 Surburban 0 Phone" ] }, - "execution_count": 3, + "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", + "from sklift.datasets import fetch_hillstrom\n", "\n", "\n", "%matplotlib inline\n", "\n", - "dataset = pd.read_csv(csv_path)\n", + "bunch = fetch_hillstrom(target_col='visit')\n", + "\n", + "dataset, target, treatment = bunch['data'], bunch['target'], bunch['treatment']\n", + "\n", "print(f'Размер датасета до обработки: {dataset.shape}')\n", - "dataset = dataset[dataset['segment']!='Mens E-Mail']\n", - "dataset.loc[:, 'treatment'] = dataset['segment'].map({\n", + "\n", + "# Selecting two segments\n", + "dataset = dataset[treatment!='Mens E-Mail']\n", + "target = target[treatment!='Mens E-Mail']\n", + "treatment = treatment[treatment!='Mens E-Mail'].map({\n", " 'Womens E-Mail': 1,\n", " 'No E-Mail': 0\n", "})\n", "\n", - "dataset = dataset.drop(['segment', 'conversion', 'spend'], axis=1)\n", "print(f'Размер датасета после обработки: {dataset.shape}')\n", "dataset.head()" ] @@ -266,11 +222,11 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 3, "metadata": { "ExecuteTime": { - "end_time": "2020-05-30T22:40:57.345775Z", - "start_time": "2020-05-30T22:40:56.881856Z" + "end_time": "2021-02-07T01:01:59.976727Z", + "start_time": "2021-02-07T01:01:59.889576Z" } }, "outputs": [], @@ -278,15 +234,9 @@ "from sklearn.model_selection import train_test_split\n", 
"\n", "\n", - "Xyt_tr, Xyt_val = train_test_split(dataset, test_size=0.5, random_state=42)\n", - "\n", - "X_tr = Xyt_tr.drop(['visit', 'treatment'], axis=1)\n", - "y_tr = Xyt_tr['visit']\n", - "treat_tr = Xyt_tr['treatment']\n", - "\n", - "X_val = Xyt_val.drop(['visit', 'treatment'], axis=1)\n", - "y_val = Xyt_val['visit']\n", - "treat_val = Xyt_val['treatment']" + "X_tr, X_val, y_tr, y_val, treat_tr, treat_val = train_test_split(\n", + " dataset, target, treatment, test_size=0.5, random_state=42\n", + ")" ] }, { @@ -298,11 +248,11 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 4, "metadata": { "ExecuteTime": { - "end_time": "2020-05-30T22:40:57.360026Z", - "start_time": "2020-05-30T22:40:57.348343Z" + "end_time": "2021-02-07T01:02:00.003357Z", + "start_time": "2021-02-07T01:01:59.983254Z" } }, "outputs": [ @@ -328,11 +278,11 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 5, "metadata": { "ExecuteTime": { - "end_time": "2020-05-30T22:40:57.422245Z", - "start_time": "2020-05-30T22:40:57.365310Z" + "end_time": "2021-02-07T01:02:00.079199Z", + "start_time": "2021-02-07T01:02:00.009314Z" } }, "outputs": [], @@ -367,11 +317,11 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 6, "metadata": { "ExecuteTime": { - "end_time": "2020-05-30T22:40:58.579795Z", - "start_time": "2020-05-30T22:40:57.424949Z" + "end_time": "2021-02-07T01:02:01.332880Z", + "start_time": "2021-02-07T01:02:00.085047Z" } }, "outputs": [ @@ -401,11 +351,11 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 7, "metadata": { "ExecuteTime": { - "end_time": "2020-05-30T22:40:58.719049Z", - "start_time": "2020-05-30T22:40:58.581922Z" + "end_time": "2021-02-07T01:02:01.476617Z", + "start_time": "2021-02-07T01:02:01.335371Z" } }, "outputs": [ diff --git a/requirements.txt b/requirements.txt index 11e054c..806e482 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,4 +1,5 @@ scikit-learn>=0.21.0 
numpy>=1.16 pandas -matplotlib \ No newline at end of file +matplotlib +requests \ No newline at end of file diff --git a/sklift/__init__.py b/sklift/__init__.py index 7fd229a..0404d81 100644 --- a/sklift/__init__.py +++ b/sklift/__init__.py @@ -1 +1 @@ -__version__ = '0.2.0' +__version__ = '0.3.0' diff --git a/sklift/datasets/__init__.py b/sklift/datasets/__init__.py new file mode 100644 index 0000000..f3304c6 --- /dev/null +++ b/sklift/datasets/__init__.py @@ -0,0 +1,13 @@ +from .datasets import ( + get_data_dir, + clear_data_dir, + fetch_x5, fetch_lenta, + fetch_criteo, fetch_hillstrom +) + +__all__ = [ + 'get_data_dir', + 'clear_data_dir', + 'fetch_x5', 'fetch_lenta', + 'fetch_criteo', 'fetch_hillstrom' +] \ No newline at end of file diff --git a/sklift/datasets/datasets.py b/sklift/datasets/datasets.py new file mode 100644 index 0000000..0451482 --- /dev/null +++ b/sklift/datasets/datasets.py @@ -0,0 +1,456 @@ +import os +import shutil + +import pandas as pd +import requests +from sklearn.utils import Bunch + + +def get_data_dir(): + """Return the path of the scikit-uplift data dir. + + This folder is used by some large dataset loaders to avoid downloading the data several times. + + By default, the data dir is set to a folder named 'scikit-uplift-data' in the user home folder. + + Returns: + string: The path to scikit-uplift data dir. + + """ + return os.path.join(os.path.expanduser("~"), "scikit-uplift-data") + + +def _create_data_dir(path): + """Creates a directory, which stores the datasets. + + Args: + path (str): The path to scikit-uplift data dir. + + """ + if not os.path.isdir(path): + os.makedirs(path) + + +def _download(url, dest_path): + """Download the file from url and save it locally. + + Args: + url (str): URL address, must be a string. + dest_path (str): Destination of the file.
+ + """ + if isinstance(url, str): + req = requests.get(url, stream=True) + req.raise_for_status() + + with open(dest_path, "wb") as fd: + for chunk in req.iter_content(chunk_size=2 ** 20): + fd.write(chunk) + else: + raise TypeError("URL must be a string") + + +def _get_data(data_home, url, dest_subdir, dest_filename, download_if_missing): + """Return the path to the dataset. + + Args: + data_home (str): The path to scikit-uplift data dir. + url (str): The URL to the dataset. + dest_subdir (str): The name of the folder in which the dataset is stored. + dest_filename (str): The name of the dataset. + download_if_missing (bool): If False, raise a IOError if the data is not locally available instead of + trying to download the data from the source site. + + Returns: + string: The path to the dataset. + + """ + if data_home is None: + if dest_subdir is None: + data_dir = get_data_dir() + else: + data_dir = os.path.join(get_data_dir(), dest_subdir) + else: + if dest_subdir is None: + data_dir = os.path.abspath(data_home) + else: + data_dir = os.path.join(os.path.abspath(data_home), dest_subdir) + + _create_data_dir(data_dir) + + dest_path = os.path.join(data_dir, dest_filename) + + if not os.path.isfile(dest_path): + if download_if_missing: + _download(url, dest_path) + else: + raise IOError("Dataset missing") + return dest_path + + +def clear_data_dir(path=None): + """Delete all the content of the data home cache. + + Args: + path (str): The path to scikit-uplift data dir + + """ + if path is None: + path = get_data_dir() + if os.path.isdir(path): + shutil.rmtree(path, ignore_errors=True) + + +def fetch_lenta(data_home=None, dest_subdir=None, download_if_missing=True, return_X_y_t=False, as_frame=True): + """Load and return the Lenta dataset (classification). + + An uplift modeling dataset containing data about Lenta's customers grociery shopping and + related marketing campaigns. 
+ + Major columns: + + - ``group`` (str): treatment/control group flag + - ``response_att`` (binary): target + - ``gender`` (str): customer gender + - ``age`` (float): customer age + - ``main_format`` (int): store type (1 - grocery store, 0 - superstore) + + Read more in the :ref:`docs `. + + Args: + data_home (str): The path to the folder where datasets are stored. + dest_subdir (str): The name of the folder in which the dataset is stored. + download_if_missing (bool): Download the data if not present. Raises an IOError if False and data is missing. + return_X_y_t (bool): If True, returns (data, target, treatment) instead of a Bunch object. + as_frame (bool): If True, returns a pandas Dataframe or Series for the data, target and treatment objects + in the Bunch returned object; Bunch return object will also have a frame member. + + Returns: + Bunch or tuple: dataset. + + Bunch: + By default dictionary-like object, with the following attributes: + + * ``data`` (ndarray or DataFrame object): Dataset without target and treatment. + * ``target`` (Series object): Column target by values. + * ``treatment`` (Series object): Column treatment by values. + * ``DESCR`` (str): Description of the Lenta dataset. + * ``feature_names`` (list): Names of the features. + * ``target_name`` (str): Name of the target. + * ``treatment_name`` (str): Name of the treatment.
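The ``as_frame=True`` behaviour described above amounts to a simple column split, sketched here on a toy frame (hypothetical values, not the real Lenta download):

```python
import pandas as pd

# Toy stand-in for the Lenta frame; the real data comes from fetch_lenta().
df = pd.DataFrame({
    'gender': ['M', 'F', 'F', 'M'],
    'age': [34.0, 41.0, 29.0, 52.0],
    'main_format': [1, 0, 1, 1],
    'group': ['test', 'control', 'test', 'control'],
    'response_att': [1, 0, 0, 1],
})

# The same split fetch_lenta performs when as_frame=True.
target = df['response_att']
treatment = df['group']
data = df.drop(['response_att', 'group'], axis=1)
feature_names = list(data.columns)

print(feature_names)  # ['gender', 'age', 'main_format']
```

With ``as_frame=False`` the same columns are additionally converted with ``to_numpy()``.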
+ + Tuple: + tuple (data, target, treatment) if `return_X_y_t` is True + + """ + + url = 'https://winterschool123.s3.eu-north-1.amazonaws.com/lentadataset.csv.gz' + filename = 'lentadataset.csv.gz' + + csv_path = _get_data(data_home=data_home, url=url, dest_subdir=dest_subdir, + dest_filename=filename, + download_if_missing=download_if_missing) + + data = pd.read_csv(csv_path) + if as_frame: + target = data['response_att'] + treatment = data['group'] + data = data.drop(['response_att', 'group'], axis=1) + feature_names = list(data.columns) + else: + target = data[['response_att']].to_numpy() + treatment = data[['group']].to_numpy() + data = data.drop(['response_att', 'group'], axis=1) + feature_names = list(data.columns) + data = data.to_numpy() + + module_path = os.path.dirname(__file__) + with open(os.path.join(module_path, 'descr', 'lenta.rst')) as rst_file: + fdescr = rst_file.read() + + if return_X_y_t: + return data, target, treatment + + return Bunch(data=data, target=target, treatment=treatment, DESCR=fdescr, + feature_names=feature_names, target_name='response_att', treatment_name='group') + + +def fetch_x5(data_home=None, dest_subdir=None, download_if_missing=True, as_frame=True): + """Load and return the X5 RetailHero dataset (classification). + + The dataset contains raw retail customer purchases, raw information about products and general info about customers. + + Major columns: + + - ``treatment_flg`` (binary): treatment/control group flag + - ``target`` (binary): target + - ``customer_id`` (str): customer id aka primary key for joining + + Read more in the :ref:`docs `. + + Args: + data_home (str, unicode): The path to the folder where datasets are stored. + dest_subdir (str, unicode): The name of the folder in which the dataset is stored. + download_if_missing (bool): Download the data if not present. Raises an IOError if False and data is missing.
+ as_frame (bool): If True, returns a pandas Dataframe or Series for the data, target and treatment objects + in the Bunch returned object; Bunch return object will also have a frame member. + + Returns: + Bunch: dataset. + + Dictionary-like object, with the following attributes. + + * ``data`` (Bunch object): dictionary-like object without target and treatment: + + * ``clients`` (ndarray or DataFrame object): General info about clients. + * ``train`` (ndarray or DataFrame object): A subset of clients for training. + * ``purchases`` (ndarray or DataFrame object): clients’ purchase history prior to communication. + * ``target`` (Series object): Column target by values. + * ``treatment`` (Series object): Column treatment by values. + * ``DESCR`` (str): Description of the X5 dataset. + * ``feature_names`` (Bunch object): Names of the features. + * ``target_name`` (str): Name of the target. + * ``treatment_name`` (str): Name of the treatment. + + References: + https://ods.ai/competitions/x5-retailhero-uplift-modeling/data + """ + + url_clients = 'https://timds.s3.eu-central-1.amazonaws.com/clients.csv.gz' + file_clients = 'clients.csv.gz' + csv_clients_path = _get_data(data_home=data_home, url=url_clients, dest_subdir=dest_subdir, + dest_filename=file_clients, + download_if_missing=download_if_missing) + clients = pd.read_csv(csv_clients_path) + clients_names = list(clients.columns) + + url_train = 'https://timds.s3.eu-central-1.amazonaws.com/uplift_train.csv.gz' + file_train = 'uplift_train.csv.gz' + csv_train_path = _get_data(data_home=data_home, url=url_train, dest_subdir=dest_subdir, + dest_filename=file_train, + download_if_missing=download_if_missing) + train = pd.read_csv(csv_train_path) + train_names = list(train.columns) + + url_purchases = 'https://timds.s3.eu-central-1.amazonaws.com/purchases.csv.gz' + file_purchases = 'purchases.csv.gz' + csv_purchases_path = _get_data(data_home=data_home, url=url_purchases, dest_subdir=dest_subdir, +
dest_filename=file_purchases, + download_if_missing=download_if_missing) + purchases = pd.read_csv(csv_purchases_path) + purchases_names = list(purchases.columns) + + if as_frame: + target = train['target'] + treatment = train['treatment_flg'] + else: + target = train[['target']].to_numpy() + treatment = train[['treatment_flg']].to_numpy() + train = train.to_numpy() + clients = clients.to_numpy() + purchases = purchases.to_numpy() + + data = Bunch(clients=clients, train=train, purchases=purchases) + data_names = Bunch(clients_names=clients_names, train_names=train_names, + purchases_names=purchases_names) + + module_path = os.path.dirname(__file__) + with open(os.path.join(module_path, 'descr', 'x5.rst')) as rst_file: + fdescr = rst_file.read() + + return Bunch(data=data, target=target, treatment=treatment, DESCR=fdescr, + data_names=data_names, target_name='target', treatment_name='treatment_flg') + + +def fetch_criteo(target_col='visit', treatment_col='treatment', data_home=None, dest_subdir=None, + download_if_missing=True, percent10=True, return_X_y_t=False, as_frame=True): + """Load and return the Criteo Uplift Prediction Dataset (classification). + + This dataset is constructed by assembling data resulting from several incrementality tests, a particular randomized + trial procedure where a random part of the population is prevented from being targeted by advertising. + + Major columns: + + * ``treatment`` (binary): treatment + * ``exposure`` (binary): treatment + * ``visit`` (binary): target + * ``conversion`` (binary): target + * ``f0, ... , f11`` (float): feature values + + Read more in the :ref:`docs `. + + Args: + target_col (string, 'visit' or 'conversion', default='visit'): Selects which column from dataset + will be target. + treatment_col (string,'treatment' or 'exposure' default='treatment'): Selects which column from dataset + will be treatment. + data_home (string): Specify a download and cache folder for the datasets. 
+ dest_subdir (string): The name of the folder in which the dataset is stored. + download_if_missing (bool, default=True): If False, raise an IOError if the data is not locally available + instead of trying to download the data from the source site. + percent10 (bool, default=True): Whether to load only 10 percent of the data. + return_X_y_t (bool, default=False): If True, returns (data, target, treatment) instead of a Bunch object. + as_frame (bool): If True, returns a pandas Dataframe or Series for the data, target and treatment objects + in the Bunch returned object; Bunch return object will also have a frame member. + + Returns: + Bunch or tuple: dataset. + + Bunch: + By default dictionary-like object, with the following attributes: + + * ``data`` (ndarray or DataFrame object): Dataset without target and treatment. + * ``target`` (Series object): Column target by values. + * ``treatment`` (Series object): Column treatment by values. + * ``DESCR`` (str): Description of the Criteo dataset. + * ``feature_names`` (list): Names of the features. + * ``target_name`` (str): Name of the target. + * ``treatment_name`` (str): Name of the treatment.
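``fetch_criteo``'s column-at-a-time reads with a nullable ``Int8`` dtype can be sketched on a toy in-memory CSV (hypothetical values, not the real 13M-row file):

```python
import io
import pandas as pd

# Toy stand-in for the Criteo file; the real one is downloaded gzip-compressed.
csv = io.StringIO("f0,treatment,exposure,visit,conversion\n"
                  "0.1,1,0,1,0\n"
                  "0.2,0,0,0,0\n")

# Read a single column with a nullable integer dtype, as fetch_criteo does.
treatment = pd.read_csv(csv, usecols=['treatment'], dtype={'treatment': 'Int8'})
treatment = treatment['treatment']  # as_frame=True keeps it as a Series

print(treatment.dtype)  # Int8
```

Reading only the needed column via ``usecols`` keeps memory usage low on the full-size file, and ``Int8`` is the smallest dtype that still fits the 0/1 labels.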
+ + Tuple: + tuple (data, target, treatment) if `return_X_y` is True + + References: + “A Large Scale Benchmark for Uplift Modeling” + Eustache Diemert, Artem Betlei, Christophe Renaudin; (Criteo AI Lab), Massih-Reza Amini (LIG, Grenoble INP) + """ + if percent10: + url = 'https://criteo-bucket.s3.eu-central-1.amazonaws.com/criteo10.csv.gz' + csv_path = _get_data(data_home=data_home, url=url, dest_subdir=dest_subdir, + dest_filename='criteo10.csv.gz', + download_if_missing=download_if_missing) + else: + url = "https://criteo-bucket.s3.eu-central-1.amazonaws.com/criteo.csv.gz" + csv_path = _get_data(data_home=data_home, url=url, dest_subdir=dest_subdir, + dest_filename='criteo.csv.gz', + download_if_missing=download_if_missing) + + if treatment_col == 'exposure': + data = pd.read_csv(csv_path, usecols=[i for i in range(12)]) + treatment = pd.read_csv(csv_path, usecols=['exposure'], dtype={'exposure': 'Int8'}) + if as_frame: + treatment = treatment['exposure'] + elif treatment_col == 'treatment': + data = pd.read_csv(csv_path, usecols=[i for i in range(12)]) + treatment = pd.read_csv(csv_path, usecols=['treatment'], dtype={'treatment': 'Int8'}) + if as_frame: + treatment = treatment['treatment'] + else: + raise ValueError(f"treatment_col value must be from {['treatment', 'exposure']}. " + f"Got value {treatment_col}.") + feature_names = list(data.columns) + + if target_col == 'conversion': + target = pd.read_csv(csv_path, usecols=['conversion'], dtype={'conversion': 'Int8'}) + if as_frame: + target = target['conversion'] + elif target_col == 'visit': + target = pd.read_csv(csv_path, usecols=['visit'], dtype={'visit': 'Int8'}) + if as_frame: + target = target['visit'] + else: + raise ValueError(f"target_col value must be from {['visit', 'conversion']}. 
" + f"Got value {target_col}.") + + if return_X_y_t: + if as_frame: + return data, target, treatment + else: + return data.to_numpy(), target.to_numpy(), treatment.to_numpy() + else: + target_name = target_col + treatment_name = treatment_col + + module_path = os.path.dirname(__file__) + with open(os.path.join(module_path, 'descr', 'criteo.rst')) as rst_file: + fdescr = rst_file.read() + + if as_frame: + return Bunch(data=data, target=target, treatment=treatment, DESCR=fdescr, feature_names=feature_names, + target_name=target_name, treatment_name=treatment_name) + else: + return Bunch(data=data.to_numpy(), target=target.to_numpy(), treatment=treatment.to_numpy(), DESCR=fdescr, + feature_names=feature_names, target_name=target_name, treatment_name=treatment_name) + + +def fetch_hillstrom(target_col='visit', data_home=None, dest_subdir=None, download_if_missing=True, + return_X_y_t=False, as_frame=True): + """Load and return Kevin Hillstrom Dataset MineThatData (classification or regression). + + This dataset contains 64,000 customers who last purchased within twelve months. + The customers were involved in an e-mail test. + + Major columns: + + * ``Visit`` (binary): target. 1/0 indicator, 1 = Customer visited website in the following two weeks. + * ``Conversion`` (binary): target. 1/0 indicator, 1 = Customer purchased merchandise in the following two weeks. + * ``Spend`` (float): target. Actual dollars spent in the following two weeks. + * ``Segment`` (str): treatment. The e-mail campaign the customer received + + Read more in the :ref:`docs `. + + Args: + target_col (string, 'visit' or 'conversion' or 'spend', default='visit'): Selects which column from dataset + will be target + data_home (str): The path to the folder where datasets are stored. + dest_subdir (str): The name of the folder in which the dataset is stored. + download_if_missing (bool): Download the data if not present. Raises an IOError if False and data is missing. 
+ return_X_y_t (bool, default=False): If True, returns (data, target, treatment) instead of a Bunch object. + as_frame (bool): If True, returns a pandas Dataframe for the data, target and treatment objects + in the Bunch returned object; Bunch return object will also have a frame member. + + Returns: + Bunch or tuple: dataset. + + Bunch: + By default dictionary-like object, with the following attributes: + + * ``data`` (ndarray or DataFrame object): Dataset without target and treatment. + * ``target`` (Series object): Column target by values. + * ``treatment`` (Series object): Column treatment by values. + * ``DESCR`` (str): Description of the Hillstrom dataset. + * ``feature_names`` (list): Names of the features. + * ``target_name`` (str): Name of the target. + * ``treatment_name`` (str): Name of the treatment. + + Tuple: + tuple (data, target, treatment) if `return_X_y_t` is True + + References: + https://blog.minethatdata.com/2008/03/minethatdata-e-mail-analytics-and-data.html + + """ + + url = 'https://hillstorm1.s3.us-east-2.amazonaws.com/hillstorm_no_indices.csv.gz' + csv_path = _get_data(data_home=data_home, + url=url, + dest_subdir=dest_subdir, + dest_filename='hillstorm_no_indices.csv.gz', + download_if_missing=download_if_missing) + + if target_col not in ('visit', 'conversion', 'spend'): + raise ValueError(f"target_col value must be from {['visit', 'conversion', 'spend']}. 
" + f"Got value {target_col}.") + + data = pd.read_csv(csv_path, usecols=[i for i in range(8)]) + feature_names = list(data.columns) + treatment = pd.read_csv(csv_path, usecols=['segment']) + target = pd.read_csv(csv_path, usecols=[target_col]) + if as_frame: + target = target[target_col] + treatment = treatment['segment'] + else: + data = data.to_numpy() + target = target.to_numpy() + treatment = treatment.to_numpy() + + module_path = os.path.dirname(os.path.abspath(__file__)) + with open(os.path.join(module_path, 'descr', 'hillstrom.rst')) as rst_file: + fdescr = rst_file.read() + + if return_X_y_t: + return data, target, treatment + else: + target_name = target_col + return Bunch(data=data, target=target, treatment=treatment, DESCR=fdescr, + feature_names=feature_names, target_name=target_name, treatment_name='segment') diff --git a/sklift/datasets/descr/criteo.rst b/sklift/datasets/descr/criteo.rst new file mode 100644 index 0000000..8721ae7 --- /dev/null +++ b/sklift/datasets/descr/criteo.rst @@ -0,0 +1,41 @@ +Criteo Uplift Modeling Dataset +================================ +This is a copy of `Criteo AI Lab Uplift Prediction dataset `_. + +Data description +################ + +This dataset is constructed by assembling data resulting from several incrementality tests, a particular randomized trial procedure where a random part of the population is prevented from being targeted by advertising. + + +Fields +################ + +Here is a detailed description of the fields (they are comma-separated in the file): + +* **f0, f1, f2, f3, f4, f5, f6, f7, f8, f9, f10, f11**: feature values (dense, float) +* **treatment**: treatment group. Flag if a company participates in the RTB auction for a particular user (binary: 1 = treated, 0 = control) +* **exposure**: treatment effect, whether the user has been effectively exposed. 
Flag if a company wins in the RTB auction for the user (binary) +* **conversion**: whether a conversion occurred for this user (binary, label) +* **visit**: whether a visit occurred for this user (binary, label) + + +Key figures +################ +* Format: CSV +* Size: 297M (compressed) 3.2GB (uncompressed) +* Rows: 13,979,592 +* Average Visit Rate: .046992 +* Average Conversion Rate: .00292 +* Treatment Ratio: .85 + + + +This dataset is released along with the paper: +“*A Large Scale Benchmark for Uplift Modeling*” +Eustache Diemert, Artem Betlei, Christophe Renaudin; (Criteo AI Lab), Massih-Reza Amini (LIG, Grenoble INP) +This work was published in: `AdKDD 2018 `_ Workshop, in conjunction with KDD 2018. + + + + diff --git a/sklift/datasets/descr/hillstrom.rst b/sklift/datasets/descr/hillstrom.rst new file mode 100644 index 0000000..98fed87 --- /dev/null +++ b/sklift/datasets/descr/hillstrom.rst @@ -0,0 +1,45 @@ +Kevin Hillstrom Dataset: MineThatData +===================================== + +Data description +################ + +This is a copy of `MineThatData E-Mail Analytics And Data Mining Challenge dataset `_. + +This dataset contains 64,000 customers who last purchased within twelve months. +The customers were involved in an e-mail test. + +* 1/3 were randomly chosen to receive an e-mail campaign featuring Mens merchandise. +* 1/3 were randomly chosen to receive an e-mail campaign featuring Womens merchandise. +* 1/3 were randomly chosen to not receive an e-mail campaign. + +During a period of two weeks following the e-mail campaign, results were tracked. +Your job is to tell the world if the Mens or Womens e-mail campaign was successful. + +Fields +################ + +Historical customer attributes at your disposal include: + +* Recency: Months since last purchase. +* History_Segment: Categorization of dollars spent in the past year. +* History: Actual dollar value spent in the past year.
+* Mens: 1/0 indicator, 1 = customer purchased Mens merchandise in the past year. +* Womens: 1/0 indicator, 1 = customer purchased Womens merchandise in the past year. +* Zip_Code: Classifies zip code as Urban, Suburban, or Rural. +* Newbie: 1/0 indicator, 1 = New customer in the past twelve months. +* Channel: Describes the channels the customer purchased from in the past year. + +Another variable describes the e-mail campaign the customer received: + +* Segment + + * Mens E-Mail + * Womens E-Mail + * No E-Mail + +Finally, we have a series of variables describing activity in the two weeks following delivery of the e-mail campaign: + +* Visit: 1/0 indicator, 1 = Customer visited website in the following two weeks. +* Conversion: 1/0 indicator, 1 = Customer purchased merchandise in the following two weeks. +* Spend: Actual dollars spent in the following two weeks. \ No newline at end of file diff --git a/sklift/datasets/descr/lenta.rst b/sklift/datasets/descr/lenta.rst new file mode 100644 index 0000000..e5c28ff --- /dev/null +++ b/sklift/datasets/descr/lenta.rst @@ -0,0 +1,116 @@ +Lenta Uplift Modeling Dataset +================================ + +Data description +################ + +An uplift modeling dataset containing data about Lenta's customers' grocery shopping and related marketing campaigns. + +Source: **BigTarget Hackathon** hosted by Lenta and Microsoft in summer 2020. + +Fields +################ + +Major features: + + * ``group`` (str): treatment/control group flag + * ``response_att`` (binary): target + * ``gender`` (str): customer gender + * ``age`` (float): customer age + * ``main_format`` (int): store type (1 - grocery store, 0 - superstore) + + +.. 
list-table:: + :align: center + :header-rows: 1 + :widths: 5 5 + + * - Feature + - Description + * - CardHolder + - customer id + * - customer + - age + * - children + - number of children + * - cheque_count_[3,6,12]m_g* + - number of customer receipts collected within last 3, 6, 12 months + before campaign. g* is a product group + * - crazy_purchases_cheque_count_[1,3,6,12]m + - number of customer receipts with items purchased on "crazy" + marketing campaign collected within last 1, 3, 6, 12 months before campaign + * - crazy_purchases_goods_count_[6,12]m + - items amount purchased on "crazy" marketing campaign collected + within last 6, 12 months before campaign + * - disc_sum_6m_g34 + - discount sum for past 6 month on a 34 product group + * - food_share_[15d,1m] + - food share in customer purchases for 15 days, 1 month + * - gender + - customer gender + * - group + - treatment/control group flag + * - k_var_cheque_[15d,3m] + - average check coefficient of variation for 15 days, 3 months + * - k_var_cheque_category_width_15d + - coefficient of variation of the average number of purchased + categories (2nd level of the hierarchy) in one receipt for 15 days + * - k_var_cheque_group_width_15d + - coefficient of variation of the average number of purchased + groups (1st level of the hierarchy) in one receipt for 15 days + * - k_var_count_per_cheque_[15d,1m,3m,6m]_g* + - unique product id (SKU) coefficient of variation for 15 days, 1, 3 ,6 months + for g* product group + * - k_var_days_between_visits_[15d,1m,3m] + - coefficient of variation of the average period between visits + for 15 days, 1 month, 3 months + * - k_var_disc_per_cheque_15d + - discount sum coefficient of variation for 15 days + * - k_var_disc_share_[15d,1m,3m,6m,12m]_g* + - discount amount coefficient of variation for 15 days, 1 month, 3 months, 6 months, 12 months + for g* product group + * - k_var_discount_depth_[15d,1m] + - discount amount coefficient of variation for 15 days, 1 month + * - 
k_var_sku_per_cheque_15d + - number of unique product ids (SKU) coefficient of variation + for 15 days + * - k_var_sku_price_12m_g* + - price coefficient of variation for 15 days, 3, 6, 12 months + for g* product group + * - main_format + - store type (1 - grocery store, 0 - superstore) + * - mean_discount_depth_15d + - mean discount depth for 15 days + * - months_from_register + - number of months since registration + * - perdelta_days_between_visits_15_30d + - time delta, in percent, between visits during the first half + of the month and visits during the second half of the month + * - promo_share_15d + - promo goods share in the customer bucket + * - response_att + - binary target variable = store visit + * - response_sms + - share of customer responses to previous SMS. + Response = store visit + * - response_viber + - share of responses to previous Viber messages. + Response = store visit + * - sale_count_[3,6,12]m_g* + - number of purchased items from the group * for 3, 6, 12 months + * - sale_sum_[3,6,12]m_g* + - sum of sales from the group * for 3, 6, 12 months + * - stdev_days_between_visits_15d + - coefficient of variation of the days between visits for 15 days + * - stdev_discount_depth_[15d,1m] + - discount sum coefficient of variation for 15 days, 1 month + +Key figures +################ + +* Format: CSV +* Size: 153M (compressed) 567M (uncompressed) +* Rows: 687,029 +* Response Ratio: 0.1 +* Treatment Ratio: 0.75 + diff --git a/sklift/datasets/descr/x5.rst b/sklift/datasets/descr/x5.rst new file mode 100644 index 0000000..8fff6e7 --- /dev/null +++ b/sklift/datasets/descr/x5.rst @@ -0,0 +1,26 @@ +X5 RetailHero Uplift Modeling Dataset +===================================== + +The dataset is provided by X5 Retail Group at the RetailHero hackathon hosted in winter 2019. + +The dataset contains raw retail customer purchases, raw information about products and general info about customers. + + +`Machine learning competition website `_.
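Since ``customer_id`` is the primary key for joining the X5 parts, client attributes are typically attached to the training subset with a left merge; a toy sketch with hypothetical ids (the real frames come from ``fetch_x5``):

```python
import pandas as pd

# Hypothetical stand-ins for clients.csv and uplift_train.csv.
clients = pd.DataFrame({'customer_id': ['a1', 'b2', 'c3'],
                        'age': [30, 45, 27]})
train = pd.DataFrame({'customer_id': ['a1', 'c3'],
                      'treatment_flg': [1, 0],
                      'target': [1, 0]})

# Attach client attributes to the training subset via the primary key.
merged = train.merge(clients, on='customer_id', how='left')
print(merged.shape)  # (2, 4)
```

A left merge keeps exactly the training rows; ``purchases.csv`` can be aggregated per ``customer_id`` and joined the same way.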
+ +Data description +################ + +Data contains several parts: + +* train.csv: a subset of clients for training. The column *treatment_flg* indicates if there was a communication. The column *target* shows if there was a purchase afterward; +* clients.csv: general info about clients; +* purchases.csv: clients’ purchase history prior to communication. + +Fields +################ + +* treatment_flg (binary): information on performed communication +* target (binary): customer purchasing + + diff --git a/sklift/metrics/metrics.py b/sklift/metrics/metrics.py index 9196637..e40a913 100644 --- a/sklift/metrics/metrics.py +++ b/sklift/metrics/metrics.py @@ -4,6 +4,8 @@ from sklearn.utils.extmath import stable_cumsum from sklearn.utils.validation import check_consistent_length +from ..utils import check_is_binary + def uplift_curve(y_true, uplift, treatment): """Compute Uplift curve. @@ -31,8 +33,10 @@ def uplift_curve(y_true, uplift, treatment): Devriendt, F., Guns, T., & Verbeke, W. (2020). Learning to rank for uplift modeling. ArXiv, abs/2002.05897. """ - # TODO: check the treatment is binary + check_consistent_length(y_true, uplift, treatment) + check_is_binary(treatment) y_true, uplift, treatment = np.array(y_true), np.array(uplift), np.array(treatment) + desc_score_indices = np.argsort(uplift, kind="mergesort")[::-1] y_true, uplift, treatment = y_true[desc_score_indices], uplift[desc_score_indices], treatment[desc_score_indices] @@ -84,7 +88,9 @@ def perfect_uplift_curve(y_true, treatment): :func:`.plot_uplift_curve`: Plot Uplift curves from predictions. """ + check_consistent_length(y_true, treatment) + check_is_binary(treatment) y_true, treatment = np.array(y_true), np.array(treatment) cr_num = np.sum((y_true == 1) & (treatment == 0)) # Control Responders @@ -121,8 +127,9 @@ def uplift_auc_score(y_true, uplift, treatment): :func:`.qini_auc_score`: Compute normalized Area Under the Qini Curve from prediction scores. 
""" - check_consistent_length(y_true, uplift, treatment) + check_consistent_length(y_true, uplift, treatment) + check_is_binary(treatment) y_true, uplift, treatment = np.array(y_true), np.array(uplift), np.array(treatment) x_actual, y_actual = uplift_curve(y_true, uplift, treatment) @@ -164,7 +171,9 @@ def qini_curve(y_true, uplift, treatment): Devriendt, F., Guns, T., & Verbeke, W. (2020). Learning to rank for uplift modeling. ArXiv, abs/2002.05897. """ - # TODO: check the treatment is binary + + check_consistent_length(y_true, uplift, treatment) + check_is_binary(treatment) y_true, uplift, treatment = np.array(y_true), np.array(uplift), np.array(treatment) desc_score_indices = np.argsort(uplift, kind="mergesort")[::-1] @@ -220,7 +229,9 @@ def perfect_qini_curve(y_true, treatment, negative_effect=True): :func:`.plot_qini_curves`: Plot Qini curves from predictions.. """ + check_consistent_length(y_true, treatment) + check_is_binary(treatment) n_samples = len(y_true) y_true, treatment = np.array(y_true), np.array(treatment) @@ -274,9 +285,10 @@ def qini_auc_score(y_true, uplift, treatment, negative_effect=True): Nicholas J Radcliffe. (2007). Using control groups to target on predicted lift: Building and assessing uplift model. Direct Marketing Analytics Journal, (3):14–21, 2007. """ - # ToDO: Add Continuous Outcomes - check_consistent_length(y_true, uplift, treatment) + # TODO: Add Continuous Outcomes + check_consistent_length(y_true, uplift, treatment) + check_is_binary(treatment) y_true, uplift, treatment = np.array(y_true), np.array(uplift), np.array(treatment) if not isinstance(negative_effect, bool): @@ -328,9 +340,10 @@ def uplift_at_k(y_true, uplift, treatment, strategy, k=0.3): :func:`.qini_auc_score`: Compute normalized Area Under the Qini Curve from prediction scores. 
""" - # ToDo: checker that treatment is binary and all groups is not empty - check_consistent_length(y_true, uplift, treatment) + # TODO: checker all groups is not empty + check_consistent_length(y_true, uplift, treatment) + check_is_binary(treatment) y_true, uplift, treatment = np.array(y_true), np.array(uplift), np.array(treatment) strategy_methods = ['overall', 'by_group'] @@ -424,12 +437,14 @@ def response_rate_by_percentile(y_true, uplift, treatment, group, strategy='over variance of the response rate at each percentile, group size at each percentile. """ - + + check_consistent_length(y_true, uplift, treatment) + check_is_binary(treatment) + group_types = ['treatment', 'control'] strategy_methods = ['overall', 'by_group'] n_samples = len(y_true) - check_consistent_length(y_true, uplift, treatment) if group not in group_types: raise ValueError(f'Response rate supports only group types in {group_types},' @@ -494,10 +509,12 @@ def weighted_average_uplift(y_true, uplift, treatment, strategy='overall', bins= float: Weighted average uplift. """ + check_consistent_length(y_true, uplift, treatment) + check_is_binary(treatment) + strategy_methods = ['overall', 'by_group'] n_samples = len(y_true) - check_consistent_length(y_true, uplift, treatment) if strategy not in strategy_methods: raise ValueError(f'Response rate supports only calculating methods in {strategy_methods},' @@ -559,10 +576,12 @@ def uplift_by_percentile(y_true, uplift, treatment, strategy='overall', bins=10, pandas.DataFrame: DataFrame where metrics are by columns and percentiles are by rows. 
""" + check_consistent_length(y_true, uplift, treatment) + check_is_binary(treatment) + strategy_methods = ['overall', 'by_group'] n_samples = len(y_true) - check_consistent_length(y_true, uplift, treatment) if strategy not in strategy_methods: raise ValueError(f'Response rate supports only calculating methods in {strategy_methods},' @@ -612,10 +631,8 @@ def uplift_by_percentile(y_true, uplift, treatment, strategy='overall', bins=10, response_rate_ctrl_total, variance_ctrl_total, n_ctrl_total = response_rate_by_percentile( y_true, uplift, treatment, strategy=strategy, group='control', bins=1) - weighted_avg_uplift = 1 / n_trmnt_total * np.dot(n_trmnt, uplift_scores) - df.loc[-1, :] = ['total', n_trmnt_total, n_ctrl_total, response_rate_trmnt_total, - response_rate_ctrl_total, weighted_avg_uplift] + response_rate_ctrl_total, response_rate_trmnt_total - response_rate_ctrl_total] if std: std_treatment = np.sqrt(variance_trmnt) @@ -649,6 +666,9 @@ def treatment_balance_curve(uplift, treatment, winsize): Returns: array (shape = [>2]), array (shape = [>2]): Points on a curve. """ + + check_consistent_length(uplift, treatment) + check_is_binary(treatment) uplift, treatment = np.array(uplift), np.array(treatment) desc_score_indices = np.argsort(uplift, kind="mergesort")[::-1] diff --git a/sklift/models/models.py b/sklift/models/models.py index dc84c5c..b194bb7 100644 --- a/sklift/models/models.py +++ b/sklift/models/models.py @@ -1,11 +1,12 @@ import warnings - import numpy as np import pandas as pd from sklearn.base import BaseEstimator from sklearn.utils.multiclass import type_of_target from sklearn.utils.validation import check_consistent_length +from ..utils import check_is_binary + class SoloModel(BaseEstimator): """aka Treatment Dummy approach, or Single model approach, or S-Learner. 
@@ -92,6 +93,7 @@ def fit(self, X, y, treatment, estimator_fit_params=None): """ check_consistent_length(X, y, treatment) + check_is_binary(treatment) treatment_values = np.unique(treatment) if len(treatment_values) != 2: @@ -239,8 +241,8 @@ def fit(self, X, y, treatment, estimator_fit_params=None): object: self """ - # TODO: check the treatment is binary check_consistent_length(X, y, treatment) + check_is_binary(treatment) self._type_of_target = type_of_target(y) if self._type_of_target != 'binary': @@ -382,8 +384,9 @@ def fit(self, X, y, treatment, estimator_trmnt_fit_params=None, estimator_ctrl_f Returns: object: self """ - # TODO: check the treatment is binary + check_consistent_length(X, y, treatment) + check_is_binary(treatment) self._type_of_target = type_of_target(y) X_ctrl, y_ctrl = X[treatment == 0], y[treatment == 0] diff --git a/sklift/utils/__init__.py b/sklift/utils/__init__.py new file mode 100644 index 0000000..981a544 --- /dev/null +++ b/sklift/utils/__init__.py @@ -0,0 +1,3 @@ +from .utils import check_is_binary + +__all__ = ['check_is_binary'] \ No newline at end of file diff --git a/sklift/utils/utils.py b/sklift/utils/utils.py new file mode 100644 index 0000000..9aa2690 --- /dev/null +++ b/sklift/utils/utils.py @@ -0,0 +1,13 @@ +import numpy as np + +def check_is_binary(array): + """Check that the array consists only of int or float binary values 0 (0.) and 1 (1.). + + Args: + array (1d array-like): Array to check. + """ + + if not np.array_equal(np.unique(array), np.array([0, 1])): + raise ValueError(f"Input array is not binary. " + f"Array should contain only int or float binary values 0 (or 0.) and 1 (or 1.). 
" + f"Got values {np.unique(array)}.") diff --git a/sklift/viz/base.py b/sklift/viz/base.py index 2572a23..14340ee 100644 --- a/sklift/viz/base.py +++ b/sklift/viz/base.py @@ -2,6 +2,7 @@ import numpy as np from sklearn.utils.validation import check_consistent_length +from ..utils import check_is_binary from ..metrics import ( uplift_curve, perfect_uplift_curve, uplift_auc_score, qini_curve, perfect_qini_curve, qini_auc_score, @@ -15,8 +16,8 @@ def plot_uplift_preds(trmnt_preds, ctrl_preds, log=False, bins=100): Args: trmnt_preds (1d array-like): Predictions for all observations if they are treatment. ctrl_preds (1d array-like): Predictions for all observations if they are control. - log (bool, default False): Logarithm of source samples. Default is False. - bins (integer or sequence, default 100): Number of histogram bins to be used. + log (bool): Logarithm of source samples. Default is False. + bins (integer or sequence): Number of histogram bins to be used. If an integer is given, bins + 1 bin edges are calculated and returned. If bins is a sequence, gives bin edges, including left edge of first bin and right edge of last bin. In this case, bins is returned unmodified. Default is 100. @@ -24,11 +25,14 @@ def plot_uplift_preds(trmnt_preds, ctrl_preds, log=False, bins=100): Returns: Object that stores computed values. """ + # TODO: Add k as parameter: vertical line on plots check_consistent_length(trmnt_preds, ctrl_preds) + # no check_is_binary call here: plot_uplift_preds takes no treatment argument if not isinstance(bins, int) or bins <= 0: - raise ValueError(f'Bins should be positive integer. Invalid value for bins: {bins}') + raise ValueError( + f'Bins should be a positive integer. Invalid value for bins: {bins}') if log: trmnt_preds = np.log(trmnt_preds + 1) @@ -61,13 +65,15 @@ def plot_uplift_curve(y_true, uplift, treatment, random=True, perfect=True): y_true (1d array-like): Ground truth (correct) labels. uplift (1d array-like): Predicted uplift, as returned by a model.
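The ``check_is_binary`` guard that this diff threads through the metrics, models, and viz modules behaves like the following stand-alone sketch (``np.array_equal`` is used here for a shape-safe exact comparison against ``[0, 1]``):

```python
import numpy as np

def check_is_binary(array):
    # Accept only arrays whose unique values are exactly {0, 1} (ints or floats);
    # a treatment column with a single group or extra labels is rejected.
    if not np.array_equal(np.unique(array), [0, 1]):
        raise ValueError(f"Input array is not binary. Got values {np.unique(array)}.")

check_is_binary([0, 1, 1, 0])           # valid int flags: passes silently
check_is_binary([0.0, 1.0, 1.0, 0.0])   # float flags are fine too
```

Calling it with, say, ``[0, 1, 2]`` raises a ``ValueError``, which is why every metric now invokes it right after ``check_consistent_length``.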
        treatment (1d array-like): Treatment labels.
-        random (bool, default True): Draw a random curve. Default is True.
-        perfect (bool, default False): Draw a perfect curve. Default is True.
+        random (bool): Draw a random curve. Default is True.
+        perfect (bool): Draw a perfect curve. Default is True.
 
     Returns:
         Object that stores computed values.
     """
+    check_consistent_length(y_true, uplift, treatment)
+    check_is_binary(treatment)
     y_true, uplift, treatment = np.array(y_true), np.array(uplift), np.array(treatment)
 
     fig, ax = plt.subplots(ncols=1, nrows=1, figsize=(8, 6))
@@ -76,7 +82,8 @@ def plot_uplift_curve(y_true, uplift, treatment, random=True, perfect=True):
     ax.plot(x_actual, y_actual, label='Model', color='blue')
 
     if random:
-        x_baseline, y_baseline = x_actual, x_actual * y_actual[-1] / len(y_true)
+        x_baseline, y_baseline = x_actual, x_actual * \
+            y_actual[-1] / len(y_true)
         ax.plot(x_baseline, y_baseline, label='Random', color='black')
         ax.fill_between(x_actual, y_actual, y_baseline, alpha=0.2, color='b')
 
@@ -85,7 +92,8 @@ def plot_uplift_curve(y_true, uplift, treatment, random=True, perfect=True):
         ax.plot(x_perfect, y_perfect, label='Perfect', color='Red')
 
     ax.legend(loc='lower right')
-    ax.set_title(f'Uplift curve\nuplift_auc_score={uplift_auc_score(y_true, uplift, treatment):.2f}')
+    ax.set_title(
+        f'Uplift curve\nuplift_auc_score={uplift_auc_score(y_true, uplift, treatment):.4f}')
     ax.set_xlabel('Number targeted')
     ax.set_ylabel('Gain: treatment - control')
 
@@ -99,8 +107,8 @@ def plot_qini_curve(y_true, uplift, treatment, random=True, perfect=True, negati
         y_true (1d array-like): Ground truth (correct) labels.
         uplift (1d array-like): Predicted uplift, as returned by a model.
         treatment (1d array-like): Treatment labels.
-        random (bool, default True): Draw a random curve. Default is True.
-        perfect (bool, default False): Draw a perfect curve. Default is True.
+        random (bool): Draw a random curve. Default is True.
+        perfect (bool): Draw a perfect curve. Default is True.
         negative_effect (bool): If True, optimum Qini Curve contains the negative effects
             (negative uplift because of campaign). Otherwise, optimum Qini Curve will not
             contain the negative effects. Default is True.
@@ -108,7 +116,9 @@ def plot_qini_curve(y_true, uplift, treatment, random=True, perfect=True, negati
     Returns:
         Object that stores computed values.
     """
+    check_consistent_length(y_true, uplift, treatment)
+    check_is_binary(treatment)
     y_true, uplift, treatment = np.array(y_true), np.array(uplift), np.array(treatment)
 
     fig, ax = plt.subplots(ncols=1, nrows=1, figsize=(8, 6))
@@ -117,16 +127,19 @@ def plot_qini_curve(y_true, uplift, treatment, random=True, perfect=True, negati
     ax.plot(x_actual, y_actual, label='Model', color='blue')
 
     if random:
-        x_baseline, y_baseline = x_actual, x_actual * y_actual[-1] / len(y_true)
+        x_baseline, y_baseline = x_actual, x_actual * \
+            y_actual[-1] / len(y_true)
         ax.plot(x_baseline, y_baseline, label='Random', color='black')
         ax.fill_between(x_actual, y_actual, y_baseline, alpha=0.2, color='b')
 
     if perfect:
-        x_perfect, y_perfect = perfect_qini_curve(y_true, treatment, negative_effect)
+        x_perfect, y_perfect = perfect_qini_curve(
+            y_true, treatment, negative_effect)
         ax.plot(x_perfect, y_perfect, label='Perfect', color='Red')
 
     ax.legend(loc='lower right')
-    ax.set_title(f'Qini curve\nqini_auc_score={qini_auc_score(y_true, uplift, treatment, negative_effect):.2f}')
+    ax.set_title(
+        f'Qini curve\nqini_auc_score={qini_auc_score(y_true, uplift, treatment, negative_effect):.4f}')
     ax.set_xlabel('Number targeted')
     ax.set_ylabel('Number of incremental outcome')
 
@@ -170,8 +183,9 @@ def plot_uplift_by_percentile(y_true, uplift, treatment, strategy='overall', kin
     strategy_methods = ['overall', 'by_group']
     kind_methods = ['line', 'bar']
 
-    n_samples = len(y_true)
     check_consistent_length(y_true, uplift, treatment)
+    check_is_binary(treatment)
+    n_samples = len(y_true)
 
     if strategy not in strategy_methods:
         raise ValueError(f'Response rate supports only calculating methods in {strategy_methods},'
@@ -182,10 +196,12 @@ def plot_uplift_by_percentile(y_true, uplift, treatment, strategy='overall', kin
                          f' got {kind}.')
 
     if not isinstance(bins, int) or bins <= 0:
-        raise ValueError(f'Bins should be positive integer. Invalid value bins: {bins}')
+        raise ValueError(
+            f'Bins should be positive integer. Invalid value bins: {bins}')
 
     if bins >= n_samples:
-        raise ValueError(f'Number of bins = {bins} should be smaller than the length of y_true {n_samples}')
+        raise ValueError(
+            f'Number of bins = {bins} should be smaller than the length of y_true {n_samples}')
 
     df = uplift_by_percentile(y_true, uplift, treatment, strategy=strategy, std=True, total=True, bins=bins)
@@ -214,19 +230,23 @@ def plot_uplift_by_percentile(y_true, uplift, treatment, strategy='overall', kin
                       linewidth=2, color='orange', label='control\nresponse rate')
         axes.errorbar(percentiles, uplift_score, yerr=std_uplift,
                       linewidth=2, color='red', label='uplift')
-        axes.fill_between(percentiles, response_rate_trmnt, response_rate_ctrl, alpha=0.1, color='red')
+        axes.fill_between(percentiles, response_rate_trmnt,
+                          response_rate_ctrl, alpha=0.1, color='red')
 
         if np.amin(uplift_score) < 0:
             axes.axhline(y=0, color='black', linewidth=1)
 
         axes.set_xticks(percentiles)
         axes.legend(loc='upper right')
-        axes.set_title(f'Uplift by percentile\nweighted average uplift = {uplift_weighted_avg:.2f}')
+        axes.set_title(
+            f'Uplift by percentile\nweighted average uplift = {uplift_weighted_avg:.4f}')
         axes.set_xlabel('Percentile')
-        axes.set_ylabel('Uplift = treatment response rate - control response rate')
+        axes.set_ylabel(
+            'Uplift = treatment response rate - control response rate')
 
     else:  # kind == 'bar'
         delta = percentiles[0]
-        fig, axes = plt.subplots(ncols=1, nrows=2, figsize=(8, 6), sharex=True, sharey=True)
+        fig, axes = plt.subplots(ncols=1, nrows=2, figsize=(
+            8, 6), sharex=True, sharey=True)
         fig.text(0.04, 0.5, 'Uplift = treatment response rate - control response rate',
                  va='center', ha='center', rotation='vertical')
 
@@ -240,7 +260,8 @@ def plot_uplift_by_percentile(y_true, uplift, treatment, strategy='overall', kin
     axes[0].legend(loc='upper right')
     axes[0].tick_params(axis='x', bottom=False)
     axes[0].axhline(y=0, color='black', linewidth=1)
-    axes[0].set_title(f'Uplift by percentile\nweighted average uplift = {uplift_weighted_avg:.2f}')
+    axes[0].set_title(
+        f'Uplift by percentile\nweighted average uplift = {uplift_weighted_avg:.4f}')
 
     axes[1].set_xticks(percentiles)
     axes[1].legend(loc='upper right')
@@ -257,16 +278,22 @@ def plot_treatment_balance_curve(uplift, treatment, random=True, winsize=0.1):
     Args:
         uplift (1d array-like): Predicted uplift, as returned by a model.
         treatment (1d array-like): Treatment labels.
-        random (bool, default True): Draw a random curve.
-        winsize (float, default 0.1): Size of the sliding window to apply. Should be between 0 and 1, extremes excluded.
+        random (bool): Draw a random curve. Default is True.
+        winsize (float): Size of the sliding window to apply. Should be between 0 and 1, extremes excluded. Default is 0.1.
 
     Returns:
         Object that stores computed values.
     """
+
+    check_consistent_length(uplift, treatment)
+    check_is_binary(treatment)
+
     if (winsize <= 0) or (winsize >= 1):
-        raise ValueError('winsize should be between 0 and 1, extremes excluded')
+        raise ValueError(
+            'winsize should be between 0 and 1, extremes excluded')
 
-    x_tb, y_tb = treatment_balance_curve(uplift, treatment, winsize=int(len(uplift)*winsize))
+    x_tb, y_tb = treatment_balance_curve(
+        uplift, treatment, winsize=int(len(uplift) * winsize))
 
     _, ax = plt.subplots(ncols=1, nrows=1, figsize=(14, 7))
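The validation helper this diff introduces can be exercised on its own. Below is a minimal sketch assuming only NumPy; note it swaps the diff's strict `np.all(np.unique(array) == np.array([0, 1]))` comparison for `np.array_equal`, which accepts the same inputs but avoids a confusing broadcasting error when the array holds more than two distinct values:

```python
import numpy as np

def check_is_binary(array):
    """Raise ValueError unless the array contains exactly the binary values 0 and 1."""
    # np.unique returns a sorted array, so any valid treatment vector
    # (ints or floats, in any order) reduces to exactly [0, 1].
    if not np.array_equal(np.unique(array), np.array([0, 1])):
        raise ValueError(
            f"Input array is not binary. "
            f"Array should contain only int or float binary values "
            f"0 (or 0.) and 1 (or 1.). Got values {np.unique(array)}."
        )

check_is_binary(np.array([0, 1, 1, 0]))   # valid: passes silently
check_is_binary(np.array([0., 1., 1.]))   # float 0./1. also accepted

try:
    check_is_binary(np.array([0, 1, 2]))  # a third value is rejected
except ValueError as err:
    print(type(err).__name__)             # → ValueError
```

This mirrors why the diff calls `check_is_binary(treatment)` right after `check_consistent_length` in each `fit` and plotting function: a non-binary treatment vector now fails fast with a descriptive message instead of producing silently wrong uplift estimates.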