Add documentation for the classification feature space partitioning
Signed-off-by: Terence Parr <parrt@antlr.org>
parrt committed Jan 29, 2023
1 parent 808dbf4 commit dec56e6
Showing 1 changed file with 43 additions and 1 deletion.
44 changes: 43 additions & 1 deletion dtreeviz/trees.py
@@ -889,6 +889,48 @@ def ctree_feature_space(self,
features=None,
figsize=None,
ax=None):
"""
Decision trees partition feature space into rectangular regions
through the series of splits (at internal decision nodes) along the
path from the root to a leaf while making a prediction for an input
instance. The complete tessellation of feature space is the collection
of regions inscribed by all paths from the root to a leaf.

This function isolates one or two features of interest according to
the features parameter and generates a plot. For one feature, the
resulting plot has that feature on the X axis and the associated class
targets at different elevations (with some noise) on the Y axis to
separate them. (Use gtype='barstacked' to get a histogram instead.)
For two features, the plot has the two features of
interest on the X and Y axes and plots the 2D coordinate for each
training data instance. Each marker and region has a unique color
according to the classification label.

Decision nodes associated with features not in the features parameter
do not contribute to the tessellation of the feature space. Paths from
the root to leaves in the decision tree do not contribute a region unless
one or more of the features of interest is tested. Given a model
trained with exactly two features, calling this function with those
two features results in disjoint regions. Any coordinate in 2D feature
space would always lead to a unique leaf.

When the model is trained on more than two features, however, the same
coordinate in the 2D feature space of interest could appear in
multiple paths from the root to a leaf. Consequently, it is possible
for regions to overlap because the tree is testing other variables
along those paths (so that each coordinate in d-space for d model
features still reaches a unique leaf). Overlapping regions simply
mean that another feature would disambiguate those regions during model
inference.

:param gtype: {'strip','barstacked'}
:param show: Plot elements to show: {'title', 'legend', 'splits'}
:param features: A list of strings containing one or two features of interest.
If none is specified, the first feature(s) in X_training dataframe are used.
:param figsize: Width and height in inches for the figure; use something like (5,1)
for len(features)==1 and (5,3) for len(features)==2.
"""

# TODO: check if we can find some common functionality between univar and bivar visualisations and refactor
# to a single method.
if features is None:
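The projection described in the docstring above can be sketched in plain Python. This is only an illustration of the idea, not dtreeviz's implementation: the `Node` class, `leaf_regions`, `overlaps`, `has_overlap`, and the toy trees are hypothetical names made up for this example. Splits on features of interest narrow a bounding box, splits on other features leave the box untouched, and a path contributes a region only if it tests at least one feature of interest.

```python
import math
from dataclasses import dataclass
from itertools import combinations
from typing import Dict, List, Optional, Tuple

Box = Dict[str, Tuple[float, float]]  # feature name -> (low, high) bounds


@dataclass
class Node:
    """Hypothetical toy tree node; a leaf has feature=None and a label."""
    feature: Optional[str] = None
    threshold: float = 0.0
    left: Optional["Node"] = None    # branch taken when x[feature] <= threshold
    right: Optional["Node"] = None   # branch taken when x[feature] > threshold
    label: Optional[str] = None


def leaf_regions(node: Node, features: List[str],
                 box: Optional[Box] = None,
                 tested: bool = False) -> List[Tuple[Optional[str], Box]]:
    """Collect one (label, box) pair per root-to-leaf path that tests at
    least one feature of interest; boxes bound only those features."""
    if box is None:
        box = {f: (-math.inf, math.inf) for f in features}
    if node.feature is None:  # leaf
        return [(node.label, box)] if tested else []
    if node.feature in features:
        lo, hi = box[node.feature]
        left_box = {**box, node.feature: (lo, min(hi, node.threshold))}
        right_box = {**box, node.feature: (max(lo, node.threshold), hi)}
        return (leaf_regions(node.left, features, left_box, True)
                + leaf_regions(node.right, features, right_box, True))
    # A split on some other feature does not narrow the projected box.
    return (leaf_regions(node.left, features, box, tested)
            + leaf_regions(node.right, features, box, tested))


def overlaps(a: Box, b: Box) -> bool:
    """True if two boxes share interior area along every feature."""
    return all(a[f][0] < b[f][1] and b[f][0] < a[f][1] for f in a)


def has_overlap(regions: List[Tuple[Optional[str], Box]]) -> bool:
    return any(overlaps(a, b) for (_, a), (_, b) in combinations(regions, 2))


# Model trained on exactly two features: the regions are disjoint.
two_feature_tree = Node("x", 1.0,
                        Node(label="A"),
                        Node("y", 2.0, Node(label="B"), Node(label="C")))

# Model with a third feature z: projecting onto (x, y) can overlap,
# because z decides which subtree an instance follows.
three_feature_tree = Node("z", 0.5,
                          Node("x", 1.0, Node(label="A"), Node(label="B")),
                          Node("y", 2.0, Node(label="B"), Node(label="A")))

for name, tree in [("two features", two_feature_tree),
                   ("three features", three_feature_tree)]:
    regions = leaf_regions(tree, ["x", "y"])
    print(name, "regions:", len(regions), "overlap:", has_overlap(regions))
```

In the three-feature tree, the same (x, y) coordinate falls inside a region from each subtree under the `z` split; only the value of `z` determines which leaf an instance actually reaches.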
@@ -941,7 +983,7 @@ def rtree_feature_space(self, fontsize: int = 10, ticks_fontsize=8, show={'title
:param show: which or all of {'title', 'splits'} to show
:param features: A list of strings containing one or two features of interest.
If none is specified, the first feature(s) in X_training dataframe are used.
-:param figsize: Width and height in inchesFor the figure; use something like (5,1)
+:param figsize: Width and height in inches for the figure; use something like (5,1)
for len(features)==1 and (5,3) for len(features)==2.
"""
if features is None: