We know that we want to end up with leaves with a low Gini Impurity, but we still need to figure out which features to split on in order to achieve this. For example, is it better if we split our dataset of students based on how much sleep they got or how much time they spent studying?
To answer this question, we can calculate the information gain of splitting the data on a certain feature. Information gain measures the difference in the impurity of the data before and after the split. For example, let's say you had a dataset with an impurity of 0.5. After splitting the data based on a feature, you end up with three groups with impurities 0, 0.375, and 0. The information gain of splitting the data in that way is 0.5 - 0 - 0.375 - 0 = 0.125.
Not bad! By splitting the data in that way, we've gained some information about how the data is structured — the datasets after the split are purer than they were before the split. The higher the information gain the better; if information gain is 0, then splitting the data on that feature was useless!
Unfortunately, right now it’s possible for information gain to be negative. In the next exercise, we’ll calculate weighted information gain to fix that problem.
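To make the arithmetic concrete, here is a minimal sketch of both cases. The gini helper and the label values are illustrative stand-ins, not the exercise's own data:

```python
from collections import Counter

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class probabilities.
    impurity = 1
    counts = Counter(labels)
    for label in counts:
        prob = counts[label] / len(labels)
        impurity -= prob ** 2
    return impurity

# A 50/50 dataset has an impurity of 0.5.
unsplit = ["A"] * 5 + ["B"] * 5
print(gini(unsplit))  # 0.5

# Splitting into groups with impurities 0, 0.375, and 0
# gives an information gain of 0.5 - 0 - 0.375 - 0 = 0.125.
split = [["A", "A"], ["A", "A", "A", "B"], ["B", "B", "B", "B"]]
gain = gini(unsplit) - sum(gini(subset) for subset in split)
print(gain)  # 0.125

# An unhelpful split can make this number negative: each half of
# ["A", "A", "B", "B"] is just as impure as the whole dataset.
bad_unsplit = ["A", "A", "B", "B"]
bad_split = [["A", "B"], ["A", "B"]]
print(gini(bad_unsplit) - sum(gini(s) for s in bad_split))  # -0.5
```

The last case is exactly the negative-information-gain problem described above.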
We’ve given you a set of labels named unsplit_labels and two different ways of splitting those labels into smaller subsets. Let’s calculate the information gain of splitting the labels in this way.

At the bottom of your code, begin by creating a variable named info_gain. info_gain should start at the Gini impurity of unsplit_labels.
We now want to subtract the impurity of each subset in split_labels_1 from info_gain. Loop through every subset in split_labels_1. We want to change the value of info_gain: for every subset, calculate the Gini impurity and subtract it from info_gain.
Outside of your loop, print info_gain.

We’ve given you a second way to split the data. Instead of looping through the subsets in split_labels_1, loop through the subsets in split_labels_2. Which split resulted in more information gain?
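The steps above can be sketched as follows. The gini helper mirrors the impurity calculation from earlier in the lesson, and the label lists are stand-ins — the exercise supplies the real unsplit_labels, split_labels_1, and split_labels_2:

```python
from collections import Counter

def gini(labels):
    # Gini impurity, as computed earlier in the lesson.
    impurity = 1
    counts = Counter(labels)
    for label in counts:
        impurity -= (counts[label] / len(labels)) ** 2
    return impurity

# Stand-in data for illustration only.
unsplit_labels = ["good", "good", "bad", "bad", "bad", "good"]
split_labels_1 = [["good", "good"], ["bad", "bad", "bad", "good"]]
split_labels_2 = [["good", "bad"], ["good", "bad", "bad", "good"]]

# Start at the impurity of the unsplit data...
info_gain = gini(unsplit_labels)
# ...then subtract the impurity of each subset.
for subset in split_labels_1:
    info_gain -= gini(subset)
print(info_gain)  # 0.125

# Repeat with the second split to compare.
info_gain_2 = gini(unsplit_labels)
for subset in split_labels_2:
    info_gain_2 -= gini(subset)
print(info_gain_2)  # -0.5
```

With this stand-in data, the first split gains information while the second actually loses it, so the first split is the better choice.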
Once again, in the next exercise, we’ll put the code you wrote into a function named information_gain that takes unsplit_labels and split_labels as parameters.