Skip to content

hypper-team/hypper

Repository files navigation

PyPI PyPI - Python Version PyPI - Wheel build PyPI Downloads PyPI - License

Hypper is a data-mining Python library for binary classification. It uses hypergraph-based methods to explore datasets for the purpose of undersampling, feature selection and binary classification.

Hypper provides an easy-to-use interface familiar to well-recognized Scikit-Learn API.

The primary goal of this library is to provide a tool for handling datasets consisting of mainly categorical features. Novel hypergraph-based methods proposed in the Hypper library were benchmarked against the alternative solutions and achieved satisfactory results. More details can be found in scientific papers presented in the section below.

Installation

pip install hypper

Local installations

pip install -e .['documentation'] # documentation
pip install -e .['develop'] # development (with testing)
pip install -e .['benchmarking'] # benchmarking scripts
pip install -e .['all'] # install everything

Tutorials:

1. Introduction to data mining with Hypper

Testing

pytest

Important links

Citation

@ARTICLE{Misiorek2022-ru,
  title     = "Hypergraph-based importance assessment for binary classification
               data",
  author    = "Misiorek, Pawel and Janowski, Szymon",
  abstract  = "AbstractWe present a novel hypergraph-based framework enabling
               an assessment of the importance of binary classification data
               elements. Specifically, we apply the hypergraph model to rate
               data samples' and categorical feature values' relevance to
               classification labels. The proposed Hypergraph-based Importance
               ratings are theoretically grounded on the hypergraph cut
               conductance minimization concept. As a result of using
               hypergraph representation, which is a lossless representation
               from the perspective of higher-order relationships in data, our
               approach allows for more precise exploitation of the information
               on feature and sample coincidences. The solution was tested
               using two scenarios: undersampling for imbalanced classification
               data and feature selection. The experimentation results have
               proven the good quality of the new approach when compared with
               other state-of-the-art and baseline methods for both scenarios
               measured using the average precision evaluation metric.",
  journal   = "Knowl. Inf. Syst.",
  publisher = "Springer Science and Business Media LLC",
  month     =  dec,
  year      =  2022,
  copyright = "https://creativecommons.org/licenses/by/4.0",
  language  = "en"
}