This is a naive implementation of the C4.5 algorithm for data classification, written in Python.
The module "decision_tree.py" interprets a data set R as a set of m tuples (1), where each tuple is an example. The ith tuple (2) contains n attribute values denoted by xi,j (where j = 1, 2, ..., n) and a target value denoted by yi. Each index j represents a unique attribute. The target value yi must be at the last index of each tuple. A subset of the data set is denoted by S (3).
Each attribute and each of its distinct values becomes a node in the decision tree. An attribute's unique values become its child nodes, and each of these child nodes in turn becomes the parent of another attribute node, and so on. This continues until a target value is reached, which is a leaf node. The module "decision_tree.py" splits an attribute a into its unique values by creating a new set Xa, which contains the values xi,j of each tuple Si in the subset S such that j = a (4). A set of the unique target values yi in the subset S is also defined (5).
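As a minimal sketch of this split (the helper names here are illustrative and are not taken from "decision_tree.py"), both sets can be built with plain Python set comprehensions over the tuples of a subset S, where attributes are addressed by their index:

def attribute_values(S, a):
    """X_a: the unique values taken by attribute index a within the subset S."""
    return {s[a] for s in S}

def target_values(S):
    """The unique target values y_i in the subset S (the last entry of each tuple)."""
    return {s[-1] for s in S}

# Example with two (Outlook, Humidity, Wind, Play) tuples.
S = {('Sunny', 'High', 'Weak', 'No'), ('Rain', 'High', 'Weak', 'Yes')}
print(attribute_values(S, 0))  # {'Sunny', 'Rain'}
print(target_values(S))        # {'No', 'Yes'}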
Deciding which attribute should be chosen as the root node or as a child node of another attribute's value is determined partly by the information entropy H of the subset S (6), where P(S,y) is the probability of selecting the target value y from the subset S (7), i.e. the number of times the value y occurs in tuples within the subset S divided by the total number of tuples in the subset S. The vertical bars denote the cardinality of the set or sequence. For a binary target the value of the information entropy is between 0 and 1 bit inclusive; with more than two target classes it can exceed 1 bit.
A plot of the information entropy of a set containing two target values, + for positive or yes and - for negative or no, and their probabilities, p+ and p- respectively, over all possible probabilities is shown below. When the number of positive values is the same as the number of negative values in the set, the information entropy is one. When the set contains only positive values or only negative values, the information entropy is zero. Therefore, a set with a lower information entropy is preferred, because it is closer to achieving a final verdict of yes or no. It is more "pure." Note that p+ = 1 - p-.
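As a minimal sketch (again with illustrative names, not the module's internals), the entropy of the target values in a subset of tuples can be computed directly from these probabilities; a perfectly mixed binary subset evaluates to 1 bit:

import math

def entropy(S):
    """H(S): information entropy of the target values in the subset S, in bits."""
    counts = {}
    for s in S:
        counts[s[-1]] = counts.get(s[-1], 0) + 1
    return -sum(c / len(S) * math.log2(c / len(S)) for c in counts.values())

# A subset with equal numbers of 'Yes' and 'No' targets has an entropy of 1 bit.
S = {('Sunny', 'Weak', 'No'), ('Rain', 'Weak', 'Yes')}
print(entropy(S))  # 1.0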
The information entropy provides a measure of the purity of a data set, or how close it is to achieving a final verdict, but on its own it does not indicate which attributes should be prioritized when selecting nodes to construct the decision tree. For this, the purity of the data set after an attribute is split on each of its values should be measured and compared with the purity of the unsplit data set. This calculation is called the information gain IG (8), which is the change in information entropy of a subset S after splitting on an attribute a. The information entropy of the split is given by a weighted sum of the information entropy of each subset Sa(S,x) (9), which contains the tuples Si with the split value x at xi,a.
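Continuing the same illustrative sketch, and reusing the entropy helper above, the split into subsets and the information gain can be written as:

def split_subsets(S, a):
    """S_a(S, x): the subsets of S keyed by the value x of attribute index a."""
    groups = {}
    for s in S:
        groups.setdefault(s[a], set()).add(s)
    return groups

def information_gain(S, a):
    """IG(S, a): entropy of S minus the weighted entropy of the subsets after the split."""
    weighted = sum(len(Sx) / len(S) * entropy(Sx)
                   for Sx in split_subsets(S, a).values())
    return entropy(S) - weighted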
The ID3 algorithm uses the information gain to select nodes when constructing the tree. The C4.5 algorithm instead uses the information gain ratio IGR (10), which is the information gain upon splitting on an attribute a divided by the intrinsic value IV of the split (11). The information gain ratio takes the cardinality of the split into account when choosing an attribute. The larger the portion of data eliminated by the split, the smaller the cardinality of Sa(S,x), the larger the intrinsic value, and the smaller the information gain ratio. This way, attributes that do not contribute much to the decision-making process but split into pure data sets are given lower priority.
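In the same illustrative sketch, the intrinsic value is the entropy of the partition sizes themselves, and the ratio guards against a zero intrinsic value as described in the next paragraph:

def intrinsic_value(S, a):
    """IV(S, a): entropy of the partition sizes produced by splitting S on attribute a."""
    return -sum(len(Sx) / len(S) * math.log2(len(Sx) / len(S))
                for Sx in split_subsets(S, a).values())

def information_gain_ratio(S, a):
    """IGR(S, a): information gain divided by the intrinsic value of the split."""
    iv = intrinsic_value(S, a)
    return 0.0 if iv == 0.0 else information_gain(S, a) / iv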
Note that when the intrinsic value is zero, the data set cannot be split any further and a final verdict must be reached by a majority vote. The module "decision_tree.py" returns an information gain ratio of zero if an intrinsic value of zero is encountered. This is reasonable because the information gain should also be zero whenever the intrinsic value is zero. However, a majority vote is taken just in case the information gain is not zero.
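A majority vote over the target values of a subset can be sketched in a few lines with the standard library (again an illustration, not the module's implementation):

from collections import Counter

def majority_vote(S):
    """Fallback verdict: the most common target value among the tuples in S."""
    return Counter(s[-1] for s in S).most_common(1)[0][0]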
The module "decision_tree.py" uses the learning algorithm described in pseudocode below. This is a naive C4.5 algorithm. I decided to use a nested hash map as the tree structure, in which the leaf nodes are target value keys pointing to NULL values. The algorithm is recursive and creates a nested structure. Note that this algorithm will produce a decision tree, but does not guarantee the optimal tree structure.
Importing the "decision_tree.py" into a Python environment:
from decision_tree import DecisionTree
The "DecisionTree" class can read csv files and automatically convert them into a set of tuples. The file should have the target values as the last column in the data set and headers for each column. The file "tennis.csv" contains data on when a golfer named Peter decided to play golf during various weather conditions.
model = DecisionTree()
model.importcsv( 'tennis.csv' )
The headers of the file are stored in the "label" variable, and the rest of the data is stored in the "data" variable. The attributes "Outlook," "Humidity," and "Wind" correspond to columns 1, 2, and 3 of the data, which contain their respective values. The last label and the last column of data hold the target value, "Play," with values "Yes" or "No." This data will therefore lead to a binary classifier; however, the "DecisionTree" class and its algorithm can be used with any number of target classes.
model.label
['Outlook', 'Humidity', 'Wind', 'Play']
model.data
{('Overcast', 'High', 'Strong', 'Yes'),
('Overcast', 'High', 'Weak', 'Yes'),
('Overcast', 'Normal', 'Strong', 'Yes'),
('Overcast', 'Normal', 'Weak', 'Yes'),
('Rain', 'High', 'Strong', 'No'),
('Rain', 'High', 'Weak', 'Yes'),
('Rain', 'Normal', 'Strong', 'No'),
('Rain', 'Normal', 'Weak', 'Yes'),
('Sunny', 'High', 'Strong', 'No'),
('Sunny', 'High', 'Weak', 'No'),
('Sunny', 'Normal', 'Strong', 'Yes'),
('Sunny', 'Normal', 'Weak', 'Yes')}
The learn method applies the aforementioned learning algorithm to a data set; here, the data set passed to it is all of the data found in the "tennis.csv" file. Data passed to the learn method must be a set of tuples. After the learning algorithm has run, the resulting tree can be plotted with the plot method and given a title, as shown below. This tree correctly classifies all of the data, which can easily be verified; a sketch of such a check follows the plot call.
model.learn( model.data )
model.plot( 'Will Peter Play Golf?' )
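Purely as an illustration of the nested-dictionary structure described earlier (the exact layout produced by the learn method may differ), a tree consistent with the tennis data, together with a small walker that classifies a tuple by descending it, could look like this:

# Illustrative only: a nested-dict tree consistent with the tennis data above.
# Leaf nodes are target-value keys pointing to None, as described earlier.
tree = {'Outlook': {'Overcast': {'Yes': None},
                    'Rain':     {'Wind':     {'Strong': {'No': None},
                                              'Weak':   {'Yes': None}}},
                    'Sunny':    {'Humidity': {'High':   {'No': None},
                                              'Normal': {'Yes': None}}}}}

def classify(tree, example, labels):
    """Walk the nested dict until a leaf (a key mapping to None) is reached."""
    node = tree
    while True:
        (key, child), = node.items()  # each level holds exactly one key
        if child is None:             # leaf: the key is the target value
            return key
        node = child[example[labels.index(key)]]

# Every tuple in the tennis data is classified correctly by this tree.
for example in model.data:
    assert classify(tree, example, model.label) == example[-1]

Next, let's try a more complicated example.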
The file "mushrooms.csv" contains 8124 examples of data on the toxicity of mushrooms based on various characteristics.
model = DecisionTree()
model.importcsv( 'mushrooms.csv' )
len( model.data )
8124
The data contains 22 attributes and a target class, shown below. The target class is either poisonous or edible. The "DecisionTree" class will be used to construct a decision tree from the data that will classify a mushroom as poisonous or edible based on the 22 attributes below.
model.label
['cap shape',
'cap surface',
'cap color',
'bruises',
'odor',
'gill attachment',
'gill spacing',
'gill size',
'gill color',
'stalk shape',
'stalk root',
'stalk surface above ring',
'stalk surface below ring',
'stalk color above ring',
'stalk color below ring',
'veil type',
'veil color',
'ring number',
'ring type',
'spore print color',
'population',
'habitat',
'class']
The "testAndTrain" method takes a ratio, which is the ratio of data that will be sampled to train or construct the decision tree model. The sampling is done randomly without replacement. The remainder of the data is separated from the training sample and used to test the accuracy of the model. The code below samples 25% of the total number of examples in the data to contruct the tree, tests the contructed tree on the remaining 75% of the data and prints the results. An accuracy of 99.61 % is achieved after sampling just 25% of the data. However, since eating a poisonous mushroom may be a life or death situation, a higher accuracy will be preferred.
model.testAndTrain( ratio = 0.25 )
Samples in training set: 2031
Samples tested : 6093
Total samples : 8124
Model accuracy : 99.61 %
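A random split without replacement at a given ratio over a set of tuples can be sketched as follows; the names are illustrative and this is not the internals of "testAndTrain":

import random

def train_test_split(data, ratio):
    """Randomly sample a fraction of the examples (without replacement) for training;
    the held-out remainder is returned for testing."""
    examples = list(data)
    train = set(random.sample(examples, round(ratio * len(examples))))
    return train, data - train

# train, test = train_test_split(model.data, 0.25)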
Due to the size of this data set and the number of attributes it contains, there is no way to completely visualize the decision tree model learned by the algorithm. I edited the plot method to at least provide a visualization of the size of the tree, shown below. This tree contains over 400 nodes.
model.plot()