You might be wondering how the trees in the random forest get created. After all, right now, our algorithm for creating a decision tree is deterministic — given a training set, the same tree will be made every time.
Random forests create different trees using a process known as bagging. Every time a decision tree is made, it is created using a different subset of the points in the training set. For example, if our training set had 1000 rows in it, we could make a decision tree by picking 100 of those rows at random to build the tree. This way, every tree is different, but all trees will still be created from a portion of the training data.
One thing to note is that when we’re randomly selecting these 100 rows, we’re doing so with replacement. Picture putting all 1000 rows in a bag and reaching in and grabbing one row at random. After writing down what row we picked, we put that row back in the bag. This means that when we’re picking our 100 random rows, we could pick the same row more than once. In fact, though it’s extremely unlikely, all 100 randomly picked rows could be the same row!
Because we’re picking these rows with replacement, there’s no need to shrink our bagged training set from 1000 rows to 100. We can pick 1000 rows at random, and because we can get the same row more than once, we’ll still end up with a data set that’s different from the original.
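
To see why, here’s a minimal, self-contained sketch in Python (the variable names are illustrative, not from the lesson) that bags 1000 row indices with replacement and counts how many distinct rows end up in the sample:

    import random

    random.seed(42)  # fixed seed so the output is repeatable
    n_rows = 1000    # size of the original training set

    # Draw n_rows indices *with replacement*: each draw goes back in the bag.
    bagged_indices = [random.randint(0, n_rows - 1) for _ in range(n_rows)]

    # Because of replacement, some rows appear several times and some not at
    # all, so the bagged set differs from the original despite its equal size.
    unique_rows = len(set(bagged_indices))
    print(f"{unique_rows} of {n_rows} rows appear at least once")
    # For large n, roughly 63% (1 - 1/e) of the rows appear at least once.
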
Let’s implement bagging! We’ll be using the data set of cars that we used in our decision tree lesson.
Instructions
Start by creating a tree using all of the data we’ve given you. Create a variable named tree and set it equal to the result of calling the build_tree() function with car_data and car_labels as parameters.
Then call print_tree() using tree as a parameter. Scroll up to the top to see the root of the tree. Which feature is used to split the data at the root?
For now, comment out printing the tree.
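
Put together, this first step might look like the sketch below, assuming car_data, car_labels, build_tree(), and print_tree() are the helpers provided in the lesson’s workspace:

    # car_data, car_labels, build_tree(), and print_tree() are assumed to
    # come from the lesson's workspace.
    tree = build_tree(car_data, car_labels)
    # print_tree(tree)  # commented out after checking the root feature
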
Let’s now implement bagging. The original dataset has 1000 items in it. We want to randomly select a subset of those with replacement.
Create a list named indices that contains 1000 random numbers between 0 and 999. We’ll use this list to remember the 1000 cars and the 1000 labels that we’re going to build a tree with.
You can use either a for loop or a list comprehension to make this list. To get a random number between 0 and 999, use random.randint(0, 999).
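
As a sketch, the list-comprehension version of this step looks like the following (remember that random.randint needs import random):

    import random

    # 1000 random indices from 0 to 999 inclusive; duplicates are expected
    # because each draw is independent.
    indices = [random.randint(0, 999) for _ in range(1000)]
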
Create two new lists named data_subset and labels_subset. These two lists should contain the cars and labels found at each index in indices.
Once again, you can use either a for loop or a list comprehension to make these lists.
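
For example, with list comprehensions (car_data and car_labels are again assumed to come from the lesson’s workspace):

    # Grab the car and the label stored at each bagged index. Repeated
    # indices produce repeated rows, which is exactly what bagging wants.
    data_subset = [car_data[i] for i in indices]
    labels_subset = [car_labels[i] for i in indices]
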
Create a tree named subset_tree using the build_tree() function with data_subset and labels_subset as parameters. Print subset_tree using the print_tree() function.
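
A sketch of these two steps, again assuming the lesson’s build_tree() and print_tree() helpers:

    # Build a tree from the bagged subset instead of the full data set,
    # then print it to inspect its root split.
    subset_tree = build_tree(data_subset, labels_subset)
    print_tree(subset_tree)
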
Which feature is used to split the data at the root? Is it different from the feature that split the tree created using all of the data?
You’ve just created a new tree from the training set! If you picked a different set of 1000 random indices, you’d get yet another different tree. You could now create a random forest by repeating this process to build many different trees!