"NA Figure": visualization of the missing data distribution

VladimirShitov/nafig

nafig

Do you want to visualize missing values in your data? There are plenty of amazing methods (check missingno, for example), but they all look bulky when your data has too many columns. nafig will help you build a perfect NA figure!

Installation

$ pip install -U nafig

or install with Poetry

$ poetry add nafig

Usage

Here are some usage examples for both simulated and real-world data. Check this notebook to play with the code yourself!

First, let's import the core function and other useful things:

>>> from nafig.plots import na_text_barplot  # The core function
>>> from nafig.utils import create_example_data  # To simulate data
>>> import pandas as pd  # To work with tables
>>> df, feature_types = create_example_data()

df is just a pandas DataFrame with missing values. feature_types is an array containing a data type description for each column. This is just an example, so the labels don't correspond to actual data types.

>>> feature_types[:10]
array(['Categorical', 'Categorical', 'Binary', 'Continuous', 'Continuous',
       'Continuous', 'Binary', 'Continuous', 'Continuous', 'Binary'],
      dtype='<U11')
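As a quick sanity check on such labels, you can count how often each one occurs. The sketch below uses only the ten labels printed above and the standard library; the variable names are illustrative and not part of nafig.

```python
from collections import Counter

# The first ten labels, copied from the output above.
first_labels = [
    "Categorical", "Categorical", "Binary", "Continuous", "Continuous",
    "Continuous", "Binary", "Continuous", "Continuous", "Binary",
]

# Count how many of these columns carry each label.
counts = Counter(first_labels)
for label, count in counts.most_common():
    print(f"{label}: {count}")
```

Each label describes one column, so the counts add up to the number of columns inspected.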

This toy dataframe contains 300 columns. Visualizing its missing data with a heatmap would unfortunately be too bulky. How can you explore the missing data distribution in this dataset? Try the NA text barplot!

>>> na_text_barplot(df, hue=feature_types, line_height=1.5)

1_simulated_data.png

Columns of the dataset are binned by the percentage of missing data in them. Coloring by feature type helps you understand which types of data are missing. On the Y-axis you can see the number of features in each group.
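The binning idea can be sketched in plain Python. This is a minimal illustration, not nafig's actual implementation: the helper names, the use of None as the missing-value marker, and the equal-width bin rule are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Sequence


def missing_fraction(values: Sequence[Optional[object]]) -> float:
    """Return the share of entries that are None (our stand-in for NA)."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def bin_columns_by_missing(
    columns: Dict[str, Sequence[Optional[object]]], num_bins: int = 10
) -> List[List[str]]:
    """Group column names into equal-width bins of missing-data fraction.

    Bin i covers [i / num_bins, (i + 1) / num_bins); a fully empty column
    (fraction 1.0) is placed in the last bin. Hypothetical helper.
    """
    bins: List[List[str]] = [[] for _ in range(num_bins)]
    for name, values in columns.items():
        fraction = missing_fraction(values)
        index = min(int(fraction * num_bins), num_bins - 1)
        bins[index].append(name)
    return bins


toy = {
    "age": [31, 45, 27, 52],                # 0% missing
    "income": [None, 10, 20, 30],           # 25% missing
    "city": [None, None, "Oslo", "Rome"],   # 50% missing
    "notes": [None, None, None, None],      # 100% missing
}
bins = bin_columns_by_missing(toy, num_bins=4)
# The bar height for each bin is the number of columns that fell into it.
heights = [len(group) for group in bins]
```

In the plot, the column names themselves form the bars, which is why the height of a bar equals the number of features in that bin.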

You can vary the number of bins using the num_bins parameter:

>>> na_text_barplot(df, hue=feature_types, line_height=1.5, num_bins=20)

2_20_bins.png

>>> na_text_barplot(df, hue=feature_types, line_height=2, num_bins=2, fig_width=8, font_size=3)

3_2_bins.png
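To see how num_bins changes the resolution of the X-axis, here is a small sketch that lists the percentage range each bin would cover. It assumes equal-width bins over 0-100%; the function name is hypothetical and the exact edges nafig uses may differ.

```python
from typing import List, Tuple


def bin_edges_percent(num_bins: int) -> List[Tuple[float, float]]:
    """Return (lower, upper) percentage bounds for equal-width bins over 0-100%."""
    if num_bins < 1:
        raise ValueError("num_bins must be a positive integer")
    width = 100 / num_bins
    return [(i * width, (i + 1) * width) for i in range(num_bins)]


two_bins = bin_edges_percent(2)      # coarse: below vs. above 50% missing
twenty_bins = bin_edges_percent(20)  # fine: one bin per 5 percentage points
```

Fewer bins give a compact overview, while more bins separate columns with similar but not identical amounts of missing data.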

Now let's see some real data examples!

House prices missing data visualization

Data source: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data?select=train.csv

>>> DATA_PATH = "data/house-prices/train.csv"
>>> house_prices_df = pd.read_csv(DATA_PATH, index_col=0)

This is a reasonably good dataset with most of the values present. But thanks to this plot, we can see which features are the bad guys!

>>> na_text_barplot(house_prices_df, fig_width=17, num_bins=20, line_height=1.5)

4_house_prices_data.png

Note that if you don't pass the hue parameter, features will be colored by the data type of the column. If you don't want to colorize features at all, set hue to False.

By setting remove_empty_bins to True, you can remove the empty bins. This requires the reader to pay more attention to the X-axis but saves you some space.

>>> na_text_barplot(house_prices_df, fig_width=10, num_bins=20, 
                    line_height=1.5, remove_empty_bins=True)

5_house_prices_no_bins.png
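Conceptually, removing empty bins is a filter over the binned columns. The sketch below is an illustration under assumed names and bin labels, not the library's code.

```python
from typing import Dict, List


def drop_empty_bins(bins: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Keep only bins that contain at least one column, preserving order."""
    return {label: names for label, names in bins.items() if names}


# Hypothetical binned columns: two of the four ranges are empty.
binned = {
    "0-25%": ["age", "income"],
    "25-50%": [],
    "50-75%": [],
    "75-100%": ["notes"],
}
compact = drop_empty_bins(binned)
```

After filtering, neighboring bars no longer cover adjacent ranges (here "0-25%" sits next to "75-100%"), which is why the X-axis labels deserve a closer look.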

Seattle Airbnb dataset missing values visualization

Data source: https://www.kaggle.com/datasets/airbnb/seattle

>>> airbnb_df = pd.read_csv("data/airbnb/listings.csv")

This dataset has a bit more missing data. On the plot we can see that all integer features are almost complete, and some object and floating-point columns contain missing values.

>>> na_text_barplot(airbnb_df, fig_width=18, line_height=1.8, font_size=9, remove_empty_bins=True)

6_airbnb_data.png

Feel free to explore other parameters! There are more to help you create a perfect missing values visualization.

Developers section

🚀 Features

Development features

Deployment features

Makefile usage

The Makefile contains many targets for faster development.

1. Download and remove Poetry

To download and install Poetry run:

make poetry-download

To uninstall

make poetry-remove

2. Install all dependencies and pre-commit hooks

Install requirements:

make install

Pre-commit hooks could be installed after git init via

make pre-commit-install

3. Codestyle

Automatic formatting uses pyupgrade, isort and black.

make codestyle

# or use synonym
make formatting

Codestyle checks only, without rewriting files:

make check-codestyle

Note: check-codestyle uses isort, black and the darglint library.

Update all dev libraries to the latest version using one command:

make update-dev-deps

4. Code security

This command launches Poetry integrity checks and identifies security issues with Safety and Bandit:

make check-safety

5. Type checks

Run mypy static type checker

make mypy

6. Tests with coverage badges

Run pytest

make test

7. All linters

Of course, there is a command to run all linters at once:

make lint

This is the same as:

make test && make check-codestyle && make mypy && make check-safety
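The && operator runs each command only if the previous one succeeded, so the chain stops at the first failing check. The Python sketch below mimics that behavior with the standard library; the helper name is hypothetical, and the stand-in commands replace the real make targets so the example runs anywhere.

```python
import subprocess
import sys
from typing import List, Sequence


def run_until_failure(commands: Sequence[Sequence[str]]) -> List[int]:
    """Run commands in order and stop at the first non-zero exit code.

    Returns the exit codes of the commands that were actually run.
    """
    exit_codes: List[int] = []
    for command in commands:
        code = subprocess.run(command).returncode
        exit_codes.append(code)
        if code != 0:
            break
    return exit_codes


# Stand-ins for the make targets; in the project you would pass
# ["make", "test"], ["make", "check-codestyle"], and so on.
ok = [sys.executable, "-c", "raise SystemExit(0)"]
fail = [sys.executable, "-c", "raise SystemExit(1)"]
codes = run_until_failure([ok, fail, ok])  # the third command never runs
```

The practical consequence is that a failing test run hides any later codestyle, type, or safety problems until the tests pass.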

8. Docker

make docker-build

which is equivalent to:

make docker-build VERSION=latest

Remove docker image with

make docker-remove

More information about docker.

9. Cleanup

Delete pycache files

make pycache-remove

Remove package build

make build-remove

Delete .DS_Store files

make dsstore-remove

Remove .mypy_cache

make mypycache-remove

Or, to remove all of the above, run:

make cleanup

📈 Releases

You can see the list of available releases on the GitHub Releases page.

We follow the Semantic Versions specification.

We use Release Drafter. As pull requests are merged, a draft release is kept up-to-date listing the changes, ready to publish when you're ready. With the categories option, you can categorize pull requests in release notes using labels.

List of labels and corresponding titles

Label                            Title in Releases
enhancement, feature             🚀 Features
bug, refactoring, bugfix, fix    🔧 Fixes & Refactoring
build, ci, testing               📦 Build System & CI/CD
breaking                         💥 Breaking Changes
documentation                    📝 Documentation
dependencies                     ⬆️ Dependencies updates

You can update it in release-drafter.yml.

GitHub creates the bug, enhancement, and documentation labels for you. Dependabot creates the dependencies label. Create the remaining labels on the Issues tab of your GitHub repository, when you need them.
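To make the label-to-title mapping concrete, here is a small Python sketch that looks up the release sections a pull request would fall under. The mapping is copied from the table above (emoji omitted); the function name is hypothetical, and how Release Drafter itself treats a pull request with several matching labels is not covered here, so the sketch simply returns every matching title.

```python
from typing import Dict, Iterable, List

# Release title -> labels that select it, as listed in the table above.
CATEGORIES: Dict[str, List[str]] = {
    "Features": ["enhancement", "feature"],
    "Fixes & Refactoring": ["bug", "refactoring", "bugfix", "fix"],
    "Build System & CI/CD": ["build", "ci", "testing"],
    "Breaking Changes": ["breaking"],
    "Documentation": ["documentation"],
    "Dependencies updates": ["dependencies"],
}


def release_titles(pr_labels: Iterable[str]) -> List[str]:
    """Return the release titles whose label list overlaps the PR's labels."""
    labels = set(pr_labels)
    return [title for title, wanted in CATEGORIES.items() if labels & set(wanted)]


print(release_titles(["bug", "ci"]))
print(release_titles(["question"]))  # no matching category
```

A pull request without any of the listed labels matches no category in this sketch, which is a reminder to label pull requests before merging.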

🛡 License

This project is licensed under the terms of the MIT license. See LICENSE for more details.

📃 Citation

@misc{nafig,
  author = {VladimirShitov},
  title = {Package for plotting figures with NA data distribution},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/VladimirShitov/nafig}}
}

Credits

🚀 Your next Python package needs a bleeding-edge project structure.

This project was generated with python-package-template.
