minds is a Python toolkit, which can be used for computing minimum size
decisions sets, i.e. unordered sets of if-then rules [1]. The toolkit
represents a set of pure Python modules, which can be used in a Python script
in the standard way through the provided API. Additionally, it contains an
executable mds.py
, which can be applied for constructing a smallest size
decision set for a training dataset in CSV.
[1] | Here the size can be defined as the number of rules of the decision set, or the total number of literals used. |
Before using minds, make sure you have the following Python packages installed:
Please, follow the installation instructions on these projects' websites to install them properly. (If you spot any other package dependency not listed here, please, let us know.)
Also, running some of the algorithms require CPLEX and/or Gurobi as well as their Python bindings to be installed.
Finally, to install minds, it should suffice to do (alternatively, set it up from this GitHub repository):
$ pip install minds-kit
The mds.py
executable provided by the toolkit serves as a simple example
of how the toolkit can be used. It has a number of command-line options and
their list can be seen by running:
$ mds.py -h
In order to minimize the number of rules used in the target decision set, a user can apply the approaches proposed in the IJCAR'18 paper. For instance, the following command can be used to apply the SAT-based formulation of MinDS2 and exploit the Glucose3 SAT solver (note that any other SAT solver available in PySAT can be used):
$ mds.py -a satr -s glucose3 -v <dataset.csv>
To apply the MP92 model, run:
$ mds.py -a mp92 -s glucose3 -v <dataset.csv>
To apply the MinDS1 model, run:
$ mds.py -a minds1 -s glucose3 -v <dataset.py>
Any of these approaches can be augmenter with the option --opt
, which
enables MaxSAT-based minimization of the number of literals used once the
optimal number of rules is obtained and fixed, e.g.:
$ mds.py -a satr --opt -s glucose3 -v <dataset.csv>
If option '--approx <int>'
is given, the MaxSAT call will be replaced with
a series of <int>
MCSes enumerated to approximate the exact MaxSAT
solution.
Our recent CP'20 paper
proposed a novel SAT- and MaxSAT-based approach to minimizing the total number
of literals used in the target decision set. An example of how this can be
done using the mds.py
script follows:
$ mds.py -a satl -s glucose3 -v <dataset.csv>
Here, one can replace the argument value 'satl'
with values satls
to
split the computation process by classes, or with values mxsatl
and
mxsatls
to achieve the result by exploiting MaxSAT solvers (instead of
iterative SAT solving).
Sparse decision sets can be constructed by running:
$ mds.py -a sparse --lambda <float> -s glucose3 -v <dataset.csv>
Here, the value of '--lambda'
is the regularization cost parameter, which
equals 0.005 by default. It indicates how much adding a literal/rule to the
decision set costs with respect to the overall accuracy increase (see the
paper
for details).
Our recent paper at AAAI'21 proposed an efficient two-stage approach to computing smallest size decision sets, the first stage of which consists in enumerating all possible rules for given training data while the second stage consists of solving the set cover problem for selecting an optimal set of rules for each class. This approach can be used to compute a decision set with either a smallest number of rules or a smallest number of literals. The approach can be invoked by running:
$ mds.py -a 2stage -B -C gurobi --s glucose3 -v <dataset.csv>
Here, option '-B'
enables breaking symmetric rules to enumerate at stage
1; options '-C gurobi'
forces the tool to use Gurobi for solving the set
cover problem of stage 2 (note that one can instead opt to use 'rc2'
or
'cplex'
).
As some of the developed algorithms apply MaxSAT solving, it is sometimes
important to get the best performance of the underlying MaxSAT solver. The
mds.py
tool provides a number of command-line options to tweak the
internal heuristics of the award-winning MaxSAT solver RC2 used in minds:
-1
- to detect AtMost1 constraints-b
- to apply Boolean lexicographic optimization (BLO)-m
- to apply heuristic minimization of unsatisfiable cores-t
- to trim unsatisfiable cores (at most) a given number of times-x
- to exhaust unsatisfiable cores
You may want to use any combination of these. Also, note that none of them are enabled by default. The techniques enabled by these command-line parameters are detailed in the paper describing RC2. Read it if you are interested.
As mentioned above, the toolkit's functionality can be accessed through the
Python API. At this point, minds does not offer an easy-to-use API à la
scikit although this should hopefully be fixed
in the future. The mds.py
tool provided with the toolkit can serve as an
example of how the functionality can be used. Some of the major points are
also outlined below:
from minds.data import Data
from minds.check import ConsistencyChecker
from minds.options import Options
from minds.twostage import TwoStageApproach
# setting the necessary parameters
options = Options()
options.approach = '2stage'
options.solver = 'glucose3'
options.cover = 'gurobi'
options.bsymm = True # applying symmetry breaking
options.verb = 0 # verbosity level
# reading data from a CSV file
data = Data(filename='some-dataset.csv', separator=',')
# data may be inconsistent/contradictory
checker = ConsistencyChecker(data, options)
if checker.status and checker.do() == False:
# if we do not remove inconsistency, our approach will fail
checker.remove_inconsistent()
# creating and calling the solver
ruler = TwoStageApproach(data, options)
covers = ruler.compute()
# printing the result rules for every label/class to stdout
for label in covers:
for rule in covers[label]:
print('rule:', rule)
Note that all the algorithms make use of the Data
class for representing
the data internally. An object of class Data
can be created for a given
CSV file, or alternatively for a pandas' DataFrame object:
import pandas
dframe = pandas.read_csv('some-dataset.csv')
data = Data(dataframe=dframe)
It is also crucial that the data to be given to any of the algorithms must
be consistent, i.e. there must not be two data instances mapping the same
feature values to different classes. This can be checked by the use of
ConsistencyChecker
, which determines the largest portion of the dataset
that is consistent. See the code above for how it can be used.
Alternatively, especially if you work with a pandas DataFrame
, it may be
easier and/or more convenient for you to get rid of inconsistency directly on
your own by traversing and "fixing" the DataFrame
.
If any of the decision set learning algorithms of minds has been significant to a project that leads to an academic publication, please, acknowledge that fact by citing the corresponding paper where it was proposed:
@inproceedings{ipnms-ijcar18, author = {Alexey Ignatiev and Filipe Pereira and Nina Narodytska and Joao Marques{-}Silva}, title = {A SAT-Based Approach to Learn Explainable Decision Sets}, booktitle = {{IJCAR}}, pages = {627--645}, year = {2018}, url = {https://doi.org/10.1007/978-3-319-94205-6\_41}, doi = {10.1007/978-3-319-94205-6\_41} } @inproceedings{yilbs-cp20, author = {Jinqiang Yu and Alexey Ignatiev and Peter J. Stuckey and Pierre {Le Bodic}}, title = {Computing Optimal Decision Sets with {SAT}}, booktitle = {{CP}}, pages = {952--970}, year = {2020}, url = {https://doi.org/10.1007/978-3-030-58475-7\_55}, doi = {10.1007/978-3-030-58475-7\_55} } @inproceedings{ilsms-aaai21, author = {Alexey Ignatiev and Edward Lam and Peter J. Stuckey and Joao Marques{-}Silva}, title = {A Scalable Two Stage Approach to Computing Optimal Decision Sets}, booktitle = {AAAI}, year = {2021} }
minds is licensed under MIT.