We’re not quite done calculating the information gain of a set of objects. The sizes of the subsets that get created after the split are important too! For example, the image below shows two sets with the same impurity. Which set would you rather have in your decision tree?

Both of these sets are perfectly pure, but the purity of the second set is much more meaningful. Because there are so many items in the second set, we can be confident that whatever we did to produce this set wasn’t an accident.

It might be helpful to think about the inverse as well. Consider these two sets with the same impurity:

Both of these sets are completely impure. However, that impurity is much more meaningful in the set with more instances. We know that we are going to have to do a lot more work in order to completely separate the two classes. Meanwhile, the impurity of the set with two items isn’t as important. We know that we’ll only need to split the set one more time in order to make two pure sets.

Let’s modify the formula for information gain to reflect the fact that the size of the set is relevant. Instead of simply subtracting the impurity of each set, we’ll subtract the *weighted* impurity of each of the split sets. If the data before the split contained `20` items and one of the resulting splits contained `2` items, then the weighted impurity of that subset would be `2/20 * impurity`. We’re lowering the importance of the impurity of sets with few elements.
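To see that `2/20 * impurity` arithmetic in action, here is a small worked sketch. It assumes the Gini impurity measure; the labels below are made-up stand-ins, not the lesson's data:

```python
def gini(labels):
    # Gini impurity: 1 minus the sum of squared class probabilities
    impurity = 1
    for label in set(labels):
        prob = labels.count(label) / len(labels)
        impurity -= prob ** 2
    return impurity

# Suppose a 20-item set is split, and one subset holds only 2 items
starting_labels = ["unacc"] * 12 + ["acc"] * 8
subset = ["unacc", "acc"]  # maximally impure, but tiny

weight = len(subset) / len(starting_labels)  # 2/20 = 0.1
weighted_impurity = weight * gini(subset)    # 0.1 * 0.5 = 0.05
print(weighted_impurity)
```

Even though the subset is as impure as a two-class set can be, its small size shrinks its contribution from `0.5` down to `0.05`.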

Now that we can calculate the information gain using weighted impurity, let’s do that for every possible feature. If we do this, we can find the best feature to split the data on.

### Instructions

**1.**

Let’s update the `information_gain` function to make it calculate weighted information gain.

When subtracting the impurity of a subset from `info_gain`, first multiply the impurity by the correct percentage.

The percentage should be the number of labels in the subset, `len(subset)`, divided by the number of labels before the split, `len(starting_labels)`.
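For reference, one way the weighted version could look is sketched below. The `gini()` helper is a standard Gini impurity implementation standing in for the one the lesson provides:

```python
def gini(labels):
    # Gini impurity: 1 minus the sum of squared class probabilities
    impurity = 1
    for label in set(labels):
        prob = labels.count(label) / len(labels)
        impurity -= prob ** 2
    return impurity

def information_gain(starting_labels, split_labels):
    # Start with the impurity of the unsplit data...
    info_gain = gini(starting_labels)
    for subset in split_labels:
        # ...and subtract each subset's impurity, weighted by its
        # share of the original set
        info_gain -= gini(subset) * len(subset) / len(starting_labels)
    return info_gain
```

A split that produces two pure halves from a perfectly mixed set recovers the full starting impurity as gain, while a split that leaves both halves just as mixed gains nothing.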

**2.**

We’ve given you a `split()` function along with ten `cars` and the `car_labels` associated with those cars.

After your `information_gain()` function, call `split()` using `cars`, `car_labels`, and `3` as parameters. This will split the data based on the feature at index `3` (the number of people the car can hold).

`split()` returns two lists. Create two variables named `split_data` and `split_labels` and set them equal to the result of the `split()` function.

We’ll explore what these variables contain in a second!
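The exercise supplies `split()` for you, so its internals aren't shown here. A plausible sketch of how such a function could work, using hypothetical stand-in rows rather than the lesson's actual cars, looks like this:

```python
def split(dataset, labels, column):
    # Group rows and their labels by the value found in the given column
    data_subsets = []
    label_subsets = []
    for value in sorted(set(row[column] for row in dataset)):
        data_subsets.append([row for row in dataset if row[column] == value])
        label_subsets.append([label for row, label in zip(dataset, labels)
                              if row[column] == value])
    return data_subsets, label_subsets

# Hypothetical stand-ins for the provided cars/car_labels
cars = [
    ["high", "low", "2", "2", "small", "low"],
    ["low", "low", "2", "4", "big", "high"],
    ["med", "med", "4", "2", "med", "med"],
]
car_labels = ["unacc", "acc", "unacc"]

split_data, split_labels = split(cars, car_labels, 3)
print(len(split_data))  # one subset per distinct value at index 3
```

Every row inside a given subset shares the same value at the split index, which is exactly the pattern the next step asks you to notice.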

**3.**

Take a look at what these variables are. Begin by printing `split_data`. It’s kind of hard to tell what’s going on there! There are so many lists of lists!

Try printing the length of `split_data`. What do you think this is telling you?

Also try printing `split_data[0]`. What do you notice about the items at index `3` of all these lists? (Remember, when we called `split()`, we used `3` as the split index.)

Try printing `split_data[1]`. What do you notice about the items at index `3` of these lists?

**4.**

We now know that `split_data` contains the cars split into different subsets, and `split_labels` contains the labels of those cars split into the same subsets.

Use those split labels to find the information gain of splitting on index `3`! Remember, the `information_gain()` function takes a list of the labels before the split (`car_labels`) and a list of the subsets of labels after the split (`split_labels`).

Call this function and print the result! How did we do when we split the data on index `3`?

**5.**

We found the information gain when splitting on feature `3`. Let’s do the same for every possible feature.

Loop through all of the features of our data to find the best one to split on! Each car has six features, so we want to loop through the indices `0` through `5`.

Inside your for loop, call `split()` using the unsplit data, the unsplit labels, and the index that you’re looping through.

Call `information_gain()` using the resulting split labels and print the results. Which feature produces the most information gain?