You might be wondering how the trees in the random forest get created. After all, right now, our algorithm for creating a decision tree is deterministic — given a training set, the same tree will be made every time.
Random forests create different trees using a process known as bagging. Every time a decision tree is made, it is created using a different subset of the points in the training set. For example, if our training set had 1000 rows in it, we could make a decision tree by picking 100 of those rows at random to build the tree. This way, every tree is different, but all trees will still be created from a portion of the training data.
One thing to note is that when we're randomly selecting these 100 rows, we're doing so with replacement. Picture putting all 1000 rows in a bag and reaching in and grabbing one row at random. After writing down which row we picked, we put that row back in the bag.
This means that when we're picking our 100 random rows, we could pick the same row more than once. In fact, it's very unlikely, but all 100 randomly picked rows could be the same row!
Because we're picking these rows with replacement, there's no need to shrink our bagged training set from 1000 rows to 100. We can pick 1000 rows at random, and because we can get the same row more than once, we'll still end up with a data set different from the original.
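The idea above can be sketched in a few lines of plain Python. A toy list stands in for the 1000-row training set, and the seed is only there to make the sketch reproducible:

```python
import random

random.seed(42)  # seeded only so this sketch is reproducible

# Toy stand-in for a 1000-row training set.
training_set = list(range(1000))
n = len(training_set)

# Bagging: draw n row indices *with replacement*, then gather those rows.
indices = [random.randint(0, n - 1) for _ in range(n)]
bagged_set = [training_set[i] for i in indices]

print(len(bagged_set))           # 1000 -- same size as the original
print(len(set(bagged_set)) < n)  # True -- some rows were drawn more than once
```

Even though the bagged set has 1000 rows, it contains duplicates, so it differs from the original training set. Each tree in the forest would get its own bagged set.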
Let’s implement bagging! We’ll be using the data set of cars that we used in our decision tree lesson.
Start by creating a tree using all of the data we've given you. Create a variable named tree and set it equal to the result of calling the build_tree() function with car_data and car_labels as parameters.

Then print the tree by calling print_tree() with tree as a parameter. Scroll up to the top to see the root of the tree. Which feature is used to split the data at the root?

For now, comment out printing the tree.
Let’s now implement bagging. The original dataset has 1000 items in it. We want to randomly select a subset of those with replacement.
Create a list named indices that contains 1000 random numbers between 0 and 999. We'll use this list to remember the 1000 cars and the 1000 labels that we're going to build a tree with.

You can use either a for loop or a list comprehension to make this list. To get a random number between 0 and 999 (inclusive), you can use random.randint(0, 999).
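A minimal sketch of this step, using the list-comprehension option (note that random.randint includes both endpoints, which is why the upper bound is 999):

```python
import random

# 1000 random indices into the 1000-row dataset, drawn with replacement.
indices = [random.randint(0, 999) for _ in range(1000)]

print(len(indices))                         # 1000
print(all(0 <= i <= 999 for i in indices))  # True
```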
Create two new lists named data_subset and labels_subset. These two lists should contain the cars and labels found at each index in indices.
Once again, you can use either a for loop or list comprehension to make these lists.
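One way this step might look; car_data and car_labels below are small stand-ins, since the lesson's actual dataset isn't shown here:

```python
import random

# Small stand-ins for the lesson's car_data and car_labels.
car_data = [[i, "med", "high"] for i in range(10)]
car_labels = ["acc" if i % 2 == 0 else "unacc" for i in range(10)]

# Indices drawn with replacement, as in the previous step (scaled to 10 rows).
indices = [random.randint(0, len(car_data) - 1) for _ in range(len(car_data))]

# The cars and labels found at each index in indices.
data_subset = [car_data[i] for i in indices]
labels_subset = [car_labels[i] for i in indices]

print(len(data_subset) == len(labels_subset) == len(indices))  # True
```

Because the same index can appear in indices more than once, the same car (and its label) can appear in both subsets more than once; that's exactly the bagging behavior we want.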
Create a tree named subset_tree using the build_tree() function with data_subset and labels_subset as parameters.

Print subset_tree using the print_tree() function.
Which feature is used to split the data at the root? Is it a different feature than the feature that split the tree that was created using all of the data?
You've just created a new tree from the training set! If you drew another 1000 random indices, you'd get yet another different tree. You could now create a random forest by creating multiple different trees!