Codecademy Team

Decision Trees for Classification and Regression

Learn about decision trees, how they work, and how they can be used for classification and regression tasks.

Introduction

Decision trees are a common model type used for binary classification tasks. The natural structure of a binary tree, which is traversed sequentially by evaluating the truth of each logical statement until the final prediction outcome is reached, lends itself well to predicting a “yes” or “no” target. Examples include predicting whether a student will pass or fail an exam, whether an email is spam or not, or whether a transaction is fraudulent or legitimate.

Decision trees can also be used for regression tasks! Predicting the grade of a student on an exam, the number of spam emails per day, the amount of fraudulent transactions on a platform, etc. are all possible using decision trees. The algorithm works in much the same way, with modifications only to the splitting criterion and how the final output is computed. In this article, we will explore both a binary classification and a regression model using decision trees with the Indian Graduate Admissions dataset.

Dataset

The data contains features commonly used in determining admission to master’s degree programs, such as GRE scores, GPA, and letters of recommendation. The complete list of features is summarized below:

  • GRE Scores (out of 340)
  • TOEFL Scores (out of 120)
  • University Rating (out of 5)
  • Statement of Purpose and Letter of Recommendation Strength (out of 5)
  • Undergraduate GPA (out of 10)
  • Research Experience (either 0 or 1)
  • Chance of Admit (ranging from 0 to 1)

We’re going to begin by loading the dataset as a pandas DataFrame. Feel free to open up a Jupyter notebook on the side to implement the code in the article!

import pandas as pd
df = pd.read_csv("Admission_Predict.csv")
df.columns = df.columns.str.strip().str.replace(' ','_').str.lower()
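
The second line normalizes the column names (whitespace stripped, spaces replaced with underscores, lowercased) so that columns like gre_score and chance_of_admit can be referenced easily. A quick, optional check of the result:

print(df.columns.tolist())
print(df.shape)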

Decision Trees for Classification: A Recap

As a first step, we will create a binary class (1 = admission likely, 0 = admission unlikely) from the chance of admit column: a chance of 80% or greater will be considered likely. The remaining data columns will be used as predictors.

X = df.loc[:,'gre_score':'research']
y = df['chance_of_admit']>=.8
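
If you’d like to see how many samples fall on each side of this cutoff, a quick optional check (the exact counts depend on the version of the dataset you downloaded):

print(y.value_counts())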

Fitting and Predicting

We will use scikit-learn’s tree module to create, train, predict with, and visualize a decision tree classifier. The syntax is the same as for other models in scikit-learn: once an instance of the model class is instantiated with dt = DecisionTreeClassifier(), .fit() can be used to fit the model on the training set. After fitting, .predict() (and .predict_proba()) and .score() can be called to generate predictions and score the model on the test data.

As with other scikit-learn models, only numeric data can be used (categorical variables and nulls must be handled prior to model fitting). In this case, our categorical features have already been transformed and no missing values are present in the data set.
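
If your own data does contain categorical columns or missing values, a minimal preprocessing sketch might look like the following (df_raw and department are hypothetical placeholders, not part of this dataset):

# Hypothetical preprocessing sketch -- not needed for the admissions data.
# One-hot encode a categorical column and fill remaining numeric gaps.
df_model = pd.get_dummies(df_raw, columns=['department'])
df_model = df_model.fillna(df_model.median(numeric_only=True))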

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.2)
dt = DecisionTreeClassifier(max_depth=2, ccp_alpha=0.01, criterion='gini')
dt.fit(x_train, y_train)
y_pred = dt.predict(x_test)
print(dt.score(x_test, y_test))
print(accuracy_score(y_test, y_pred))

Output:

0.925
0.925

Two methods are available to visualize the tree within the tree module: the first, plot_tree, graphically represents the decision tree; the second, export_text, lists the rules behind the splits in the decision tree as text. There are many other packages available for more visualization options, such as graphviz, but these may require additional installations and will not be covered here.

from sklearn import tree

tree.plot_tree(dt, feature_names=x_train.columns,
               max_depth=3, class_names=['unlikely admit', 'likely admit'],
               label='root', filled=True)
print(tree.export_text(dt, feature_names=X.columns.tolist()))

Output:

|--- cgpa <= 8.85
|   |--- class: False
|--- cgpa >  8.85
|   |--- gre_score <= 313.50
|   |   |--- class: False
|   |--- gre_score >  313.50
|   |   |--- class: True

Split Criteria

For a classification task, the default split criterion is Gini impurity, which gives us a measure of how “impure” the groups are. At the root node, the first split is chosen as the one that maximizes the information gain, i.e. decreases the Gini impurity the most. Our tree has already been built for us, but how was the split cgpa <= 8.845 (shown rounded to 8.85 in the text output above) determined? cgpa is a continuous variable, which adds an extra complication, as the split can occur at ANY value of cgpa.
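
As a quick illustration of the formula (the class proportions here are made up, not taken from the dataset):

# Gini impurity = 1 - sum of squared class proportions.
# A node with a 70/30 class mix:
print(1 - (0.7**2 + 0.3**2))   # 0.42
# A perfectly pure node (a 100/0 mix) has impurity 1 - 1**2 = 0.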

To verify this, we will use two helper functions, gini and info_gain. Running gini(y_train) gives the same Gini impurity value as shown in the tree at the root node, 0.443.

def gini(data):
    """Calculate the Gini impurity score."""
    data = pd.Series(data)
    return 1 - sum(data.value_counts(normalize=True)**2)

gi = gini(y_train)
print(f'Gini impurity at root: {round(gi,3)}')

Output:

Gini impurity at root: 0.443

Next, we are going to verify how the split on cgpa was determined, i.e. where the 8.845 value came from. We will use info_gain over ALL unique values of cgpa in the training set to determine the information gain when splitting on each value. The results are stored in a table and sorted, and voila, the top split value is cgpa <= 8.84 (scikit-learn places the actual threshold at the midpoint between this value and the next candidate, 8.85, which is where the 8.845 in the tree comes from). The same search is carried out for every other feature (and, for continuous features, every value) to find the best split overall.
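
The info_gain helper is not defined in the article itself; a minimal sketch, assuming it mirrors the mse_gain function used in the regression section below, is:

def info_gain(left, right, current_impurity):
    """Information gain of a split: the parent's impurity minus the
    weighted Gini impurity of the left and right branches."""
    w = float(len(left)) / (len(left) + len(right))
    return current_impurity - w * gini(left) - (1 - w) * gini(right)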

info_gain_list = []
for i in x_train.cgpa.unique():
    left = y_train[x_train.cgpa <= i]
    right = y_train[x_train.cgpa > i]
    info_gain_list.append([i, info_gain(left, right, gi)])

ig_table = pd.DataFrame(info_gain_list, columns=['split_value', 'info_gain']).sort_values('info_gain', ascending=False)
ig_table.head(10)

Output:


|     | split_value | info_gain |
|-----|-------------|-----------|
| 10  | 8.84        | 0.296932  |
| 124 | 8.85        | 0.291464  |
| 139 | 8.88        | 0.290704  |
| 18  | 8.90        | 0.290054  |
| 98  | 8.83        | 0.287810  |
| 110 | 8.87        | 0.286050  |
| 152 | 8.94        | 0.284714  |
| 57  | 8.96        | 0.284210  |
| 96  | 8.80        | 0.283371  |
| 21  | 9.00        | 0.283364  |

We can also plot the information gain for each candidate split value, highlighting the winning split:

import matplotlib.pyplot as plt

plt.plot(ig_table['split_value'], ig_table['info_gain'], 'o')
plt.plot(ig_table['split_value'].iloc[0], ig_table['info_gain'].iloc[0], 'r*')
plt.xlabel('cgpa split value')
plt.ylabel('info gain')

This process is repeated until splitting yields no further information gain (or a stopping condition such as max_depth is reached), and the tree is complete. Finally, to evaluate a sample, it traverses the tree through the appropriate splits until it reaches a leaf node, and is then assigned the majority class of that leaf (or the weighted majority).
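
To make the evaluation step concrete, here is a quick check (a sketch; any row of the test set works) that a single sample's prediction comes from the rules printed above:

sample = x_test.iloc[[0]]               # one test sample, kept as a DataFrame
print(sample[['cgpa', 'gre_score']])    # the two features the tree splits on
print(dt.predict(sample))               # majority class of the leaf it reaches
print(dt.predict_proba(sample))         # class proportions in that leaf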

Regression

For the regression problem, we will use the unaltered chance_of_admit target, which is a floating point value between 0 and 1.

X = df.loc[:,'gre_score':'research']
y = df['chance_of_admit']

Fitting and Predicting

The syntax is identical to that of the decision tree classifier, except the target, y, must be real-valued and the model used must be DecisionTreeRegressor(). As far as model hyperparameters go, almost all are the same, except the criterion must be one suited to a regression task; the default is MSE (mean squared error), which we will investigate below:

from sklearn.tree import DecisionTreeRegressor

x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.2)
dt = DecisionTreeRegressor(max_depth=3, ccp_alpha=0.001)
dt.fit(x_train, y_train)
y_pred = dt.predict(x_test)
print(dt.score(x_test, y_test))

Similarly, the tree can be visualized using tree.plot_tree, keeping in mind that the splitting criterion is MSE and the value shown in each node is the average chance_of_admit of the samples in that node.

plt.figure(figsize=(10,10))
tree.plot_tree(dt, feature_names = x_train.columns,
max_depth=2, filled=True);

Split Criteria

Unlike the classification problem, there are no longer classes to split on. Instead, at each node, the value is the average target of all samples that satisfy the logical criteria leading to that node, and the default method for evaluating a split is MSE. For example, at the root node the average target value is 0.727 (verify with y_train.mean()). The MSE, if we were to use 0.727 as the prediction for every training sample, would then be:

import numpy as np
print(np.mean((y_train - y_train.mean())**2))   # 0.02029

Now, to determine the split: for each value of cgpa, the information gain (here, the decrease in MSE after the split) is calculated, and the values are sorted. As before, we can modify our functions for the regression version and see that the best split is again cgpa <= 8.84.

The code challenge below walks you through the details: in the regression version, MSE is used instead of Gini impurity, and the information gain function is modified to mse_gain.

# Code Challenge
def mse(data):
    """Calculate the MSE of a data set."""
    return np.mean((data - data.mean())**2)

def mse_gain(left, right, current_mse):
    """Information gain (in MSE) associated with creating a node/split.
    Input: left, right are the data in the left branch and right branch, respectively;
    current_mse is the MSE of the data before splitting into left and right branches.
    """
    # weight given to the MSE of the left branch
    w = float(len(left)) / (len(left) + len(right))
    return current_mse - w * mse(left) - (1 - w) * mse(right)

m = mse(y_train)
print(f'MSE at root: {round(m,3)}')

mse_gain_list = []
for i in x_train.cgpa.unique():
    left = y_train[x_train.cgpa <= i]
    right = y_train[x_train.cgpa > i]
    mse_gain_list.append([i, mse_gain(left, right, m)])

mse_table = pd.DataFrame(mse_gain_list, columns=['split_value', 'info_gain']).sort_values('info_gain', ascending=False)
print(mse_table.head(10))
print(f'Split with highest information gain is: {mse_table["split_value"].iloc[0]}')

plt.plot(mse_table['split_value'], mse_table['info_gain'], 'o')
plt.plot(mse_table['split_value'].iloc[0], mse_table['info_gain'].iloc[0], 'r*')
plt.xlabel('cgpa split value')
plt.ylabel('info gain')

Again, the process continues until there is no further increase in information gain from splitting. Once the tree has been built, evaluation occurs in much the same way: a sample traverses the tree until it reaches a leaf node and is then assigned the average value of the training samples in that leaf. Depending on the depth of the tree, the set of possible predictions can be quite limited; in this example, only a few unique predicted values are possible (six appear in the output below), which we can verify. This is something to be aware of when using a decision tree regressor: unlike linear or logistic regression, not every output value is attainable.

np.unique(dt.predict(x_train))

Output:

array([0.        , 0.00588235, 0.28571429, 0.32      , 0.63636364,
       0.96428571])
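
As a final sanity check (a sketch using the fitted regressor dt from above; get_n_leaves is available in recent scikit-learn versions), the number of distinct predictions can never exceed the number of leaves in the tree:

print(dt.get_n_leaves())                    # number of leaf nodes in the fitted tree
print(len(np.unique(dt.predict(x_train))))  # distinct predictions on the training set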