We’re not quite done calculating the information gain of a set of objects. The sizes of the subsets that get created after the split are important too! For example, the image below shows two sets with the same impurity. Which set would you rather have in your decision tree?
Both of these sets are perfectly pure, but the purity of the second set is much more meaningful. Because there are so many items in the second set, we can be confident that whatever we did to produce this set wasn’t an accident.
It might be helpful to think about the inverse as well. Consider these two sets with the same impurity:
Both of these sets are completely impure. However, that impurity is much more meaningful in the set with more instances. We know that we are going to have to do a lot more work in order to completely separate the two classes. Meanwhile, the impurity of the set with two items isn’t as important. We know that we’ll only need to split the set one more time in order to make two pure sets.
Let’s modify the formula for information gain to reflect the fact that the size of the set is relevant. Instead of simply subtracting the impurity of each set, we’ll subtract the weighted impurity of each of the split sets. If the data before the split contained 20 items and one of the resulting splits contained 2 items, then the weighted impurity of that subset would be `2/20 * impurity`.
We’re lowering the importance of the impurity of sets with few elements.
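For example, suppose that 2-item subset had an impurity of 0.5 (a made-up value just for illustration). Its weighted impurity would be 2/20 * 0.5 = 0.05, so it barely counts against the information gain at all.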
Now that we can calculate the information gain using weighted impurity, let’s do that for every possible feature. If we do this, we can find the best feature to split the data on.
Instructions
Let’s update the `information_gain` function to make it calculate weighted information gain.

When subtracting the impurity of a subset from `info_gain`, first multiply the impurity by the correct percentage. The percentage should be the number of labels in the subset, `len(subset)`, divided by the number of labels before the split, `len(starting_labels)`.
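Here is a minimal sketch of what the weighted version could look like, assuming a Gini-style impurity helper (your exercise may already define its own impurity function):

```python
from collections import Counter

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class probabilities.
    impurity = 1
    for count in Counter(labels).values():
        impurity -= (count / len(labels)) ** 2
    return impurity

def information_gain(starting_labels, split_labels):
    # Impurity before the split...
    info_gain = gini(starting_labels)
    for subset in split_labels:
        # ...minus each subset's impurity, weighted by its share of the labels.
        info_gain -= gini(subset) * len(subset) / len(starting_labels)
    return info_gain
```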
We’ve given you a `split()` function along with ten `cars` and the `car_labels` associated with those cars.

After your `information_gain()` function, call `split()` using `cars`, `car_labels`, and `3` as parameters. This will split the data based on the third index (that feature was the number of people the car could hold).

`split()` returns two lists. Create two variables named `split_data` and `split_labels` and set them equal to the result of the split function.

We’ll explore what these variables contain in a second!
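The exercise provides `split()`, `cars`, and `car_labels` for you, so you don’t need to write them yourself. As a rough end-to-end sketch of this step (the `split()` body and the sample rows below are assumptions for illustration, not the exercise’s actual code):

```python
def split(dataset, labels, column):
    # Group rows (and their labels) by the value found at the given column index.
    # This is a guess at what the provided split() does, not its actual source.
    data_subsets, label_subsets = [], []
    for value in sorted(set(row[column] for row in dataset)):
        indices = [i for i, row in enumerate(dataset) if row[column] == value]
        data_subsets.append([dataset[i] for i in indices])
        label_subsets.append([labels[i] for i in indices])
    return data_subsets, label_subsets

# Hypothetical car rows: index 3 is the number of people the car can hold.
cars = [
    ['high', 'low', '2', '2', 'small', 'low'],
    ['high', 'med', '4', '4', 'med', 'high'],
    ['low', 'low', '2', '4', 'big', 'med'],
]
car_labels = ['unacc', 'acc', 'acc']

# Split on index 3 (the number of people the car can hold).
split_data, split_labels = split(cars, car_labels, 3)
```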
Take a look at what these variables are. Begin by printing `split_data`. It’s kind of hard to tell what’s going on there! There are so many lists of lists!

Try printing the length of `split_data`. What do you think this is telling you?

Also try printing `split_data[0]`. What do you notice about the items at index `3` of all these lists? (Remember, when we called `split`, we used `3` as the split index.)

Try printing `split_data[1]`. What do you notice about the items at index `3` of these lists?
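Continuing with the `split_data` from the previous sketch, the exploration might look like this (the exact output depends on the real dataset):

```python
print(len(split_data))   # one subset per distinct value at index 3
print(split_data[0])     # every row here shares the same value at index 3
print(split_data[1])     # a different shared value at index 3
```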
We now know that `split_data` contains the cars split into different subsets, and `split_labels` contains the labels of those cars split into different subsets.

Use those split labels to find the information gain of splitting on index `3`! Remember, the `information_gain()` function takes a list of the labels before the split (`car_labels`) and a list of the subsets of labels after the split (`split_labels`).

Call this function and print the result! How did we do when we split the data on index `3`?
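The call itself is just the original labels followed by the label subsets; a sketch, assuming the `information_gain()` function and the variables from the sketches above:

```python
# How much does splitting on index 3 reduce the (weighted) impurity?
print(information_gain(car_labels, split_labels))
```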
We found the information gain when splitting on feature `3`. Let’s do the same for every possible feature.

Loop through all of the features of our data to find the best one to split on! Each car has six features, so we want to loop through the indices `0` through `5`.

Inside your for loop, call `split()` using the unsplit data, the unsplit labels, and the index that you’re looping through.

Call `information_gain()` using `car_labels` and the resulting split labels, and print the results. Which feature produces the most information gain?
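A sketch of that loop, assuming the same `split()`, `cars`, `car_labels`, and `information_gain()` as in the earlier sketches:

```python
# Try every feature and print the information gain of splitting on it.
for feature_index in range(6):
    split_data, split_labels = split(cars, car_labels, feature_index)
    print(feature_index, information_gain(car_labels, split_labels))
```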