Random Forests
Bagging

You might be wondering how the trees in the random forest get created. After all, right now, our algorithm for creating a decision tree is deterministic — given a training set, the same tree will be made every time.

Random forests create different trees using a process known as bagging. Every time a decision tree is made, it is created using a different subset of the points in the training set. For example, if our training set had `1000` rows in it, we could make a decision tree by picking `100` of those rows at random to build the tree. This way, every tree is different, but all trees will still be created from a portion of the training data.

One thing to note is that when we’re randomly selecting these `100` rows, we’re doing so with replacement. Picture putting all `1000` rows in a bag and reaching in and grabbing one row at random. After writing down which row we picked, we put that row back in the bag.

This means that when we’re picking our `100` random rows, we could pick the same row more than once. In fact, although it’s extremely unlikely, all `100` randomly picked rows could be the same row!

Because we’re picking these rows with replacement, there’s no need to shrink our bagged training set from `1000` rows to `100`. We can pick `1000` rows at random, and because we can get the same row more than once, we’ll almost certainly end up with a data set that differs from the original.
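As a quick sketch of this idea, here is one way to sample a full-size bagged set with replacement in Python (the `training_set` of plain integers is just a stand-in for real rows; the seed is only there to make the illustration repeatable):

```python
import random

random.seed(42)  # seeded only so this illustration is repeatable

training_set = list(range(1000))  # stand-in for 1000 training rows

# Draw 1000 rows with replacement: duplicates are expected, so even a
# full-size sample differs from the original training set.
bagged = [random.choice(training_set) for _ in range(1000)]

print(len(bagged))       # 1000
print(len(set(bagged)))  # fewer than 1000 — on average about 63% are unique
```

Because duplicates crowd out other rows, a bagged sample of size `1000` contains only about 63% of the distinct original rows on average, which is exactly why every tree sees a different slice of the data.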

Let’s implement bagging! We’ll be using the data set of cars that we used in our decision tree lesson.

### Instructions

1.

Start by creating a tree using all of the data we’ve given you. Create a variable named `tree` and set it equal to the `build_tree()` function using `car_data` and `car_labels` as parameters.

Then call `print_tree()` using `tree` as a parameter. Scroll up to the top to see the root of the tree. Which feature is used to split the data at the root?

2.

For now, comment out printing the tree.

Let’s now implement bagging. The original dataset has 1000 items in it. We want to randomly select a subset of those with replacement.

Create a list named `indices` that contains `1000` random numbers between `0` and `999`. We’ll use this list to pick out the `1000` cars and the `1000` labels that we’re going to build a tree with.

You can use either a for loop or a list comprehension to make this list. To get a random number between `0` and `999` (inclusive on both ends), use `random.randint(0, 999)`.
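A one-line sketch of the list-comprehension approach (the name `indices` matches the instruction; note that `random.randint` includes both endpoints):

```python
import random

# random.randint(0, 999) is inclusive on both ends, so every valid
# row index from 0 through 999 can come up, and repeats are allowed.
indices = [random.randint(0, 999) for _ in range(1000)]
```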

3.

Create two new lists named `data_subset` and `labels_subset`. These two lists should contain the cars and labels found at each `index` in `indices`.

Once again, you can use either a for loop or a list comprehension to make these lists.
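The key detail in this step is that both lists must be indexed with the *same* `indices` list, so each car stays paired with its label. A minimal sketch with tiny stand-in data (the real exercise supplies `car_data` and `car_labels` for you):

```python
import random

# Tiny stand-ins for the exercise's car_data and car_labels.
car_data = [["low", "2", "small"], ["high", "4", "big"], ["med", "4", "med"]]
car_labels = ["unacc", "acc", "good"]

indices = [random.randint(0, len(car_data) - 1) for _ in range(len(car_data))]

# Index cars and labels with the same list so each row keeps its label.
data_subset = [car_data[i] for i in indices]
labels_subset = [car_labels[i] for i in indices]
```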

4.

Create a tree named `subset_tree` using the `build_tree()` function with `data_subset` and `labels_subset` as parameters.

Print `subset_tree` using the `print_tree()` function.

Which feature is used to split the data at the root? Is it a different feature than the feature that split the tree that was created using all of the data?

You’ve just created a new tree from the training set! If you sampled a different set of `1000` indices, you’d get yet another different tree. You could now create a random forest by building many trees this way!
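To see how this scales up to a forest, here is a hedged sketch that just generates ten independent bagged samples; in the exercise, each one would be turned into `data_subset`/`labels_subset` pairs and passed to the lesson's `build_tree()` helper:

```python
import random

def bagged_indices(n_rows):
    """One bagged sample: n_rows indices drawn with replacement."""
    return [random.randint(0, n_rows - 1) for _ in range(n_rows)]

# Ten independent bagged samples — the ingredients for a ten-tree forest.
# In the exercise, each sample would feed one call to build_tree().
forest_samples = [bagged_indices(1000) for _ in range(10)]
```

Because each sample is drawn independently, each of the ten trees would see a different mix of rows, which is the heart of how bagging makes the trees in a random forest differ.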