Now that we have covered two major ways to build trees on a resampled dataset (resampled in both its samples and its features), we are ready to implement random forests! This will be similar to what we covered in the previous exercises, but the random forest algorithm has a slightly different way of randomly choosing features: rather than choosing a single random set at the onset, each split chooses a different random set.
For example, when finding which feature to split the data on the first time, we might randomly choose to consider only the price of the car, the number of doors, and the safety rating. After splitting the data on the best feature from that subset, we'll likely want to split again. For this next split, we'll again randomly select three features to consider. This time those features might be the cost of maintenance, the number of doors, and the size of the trunk. We'll continue this process until the tree is complete.
One question to consider is how to choose the number of features to randomly select. Why did we choose 3 in this example? A good rule of thumb is to randomly select the square root of the total number of features. Our car dataset doesn’t have a lot of features, so in this example, it’s difficult to follow this rule. But if we had a dataset with 25 features, we’d want to randomly select 5 features to consider at every split point.
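The per-split feature selection described above can be sketched in a few lines. This is a minimal illustration, not part of the exercise itself; the feature names are hypothetical stand-ins for the car dataset, and the loop only shows the random draw at each split, not the actual tree building:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature names standing in for the car dataset in the text.
features = ["price", "maintenance_cost", "num_doors", "trunk_size",
            "safety_rating"]

# Rule of thumb: consider about sqrt(total features) candidates per split.
n_candidates = int(np.sqrt(len(features)))  # sqrt(5) rounds down to 2 here

# Each split draws a fresh random subset of features to consider.
for split in range(3):
    subset = rng.choice(features, size=n_candidates, replace=False)
    print(f"split {split}: considering {list(subset)}")
```

With 25 features, `int(np.sqrt(25))` gives the 5 candidates per split mentioned above.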
You now have the ability to make a random forest using your own decision trees. However, scikit-learn has a RandomForestClassifier() class that will do all of this work for you! RandomForestClassifier is in the sklearn.ensemble module. RandomForestClassifier() works almost identically to DecisionTreeClassifier() — the .fit(), .predict(), and .score() methods work in the exact same way.
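To see that the interface really is interchangeable, here is a minimal sketch using a small synthetic dataset (in the exercise, the data would come from the car dataset instead):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the exercise uses the car dataset instead.
X, y = make_classification(n_samples=200, n_features=6, random_state=1)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Both classifiers expose the same .fit(), .predict(), and .score() methods.
for model in (DecisionTreeClassifier(random_state=1),
              RandomForestClassifier(random_state=1)):
    model.fit(x_train, y_train)
    print(type(model).__name__, model.score(x_test, y_test))
```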
1. Create a random forest classification model defined as rf with default parameters. Print the parameters of the model using rf.get_params().
2. Fit rf using the training data set and labels.
3. Predict the classes of the test data set (x_test) and save this as an array y_pred.
4. Print the accuracy of the model on the test set (either using rf.score() or sklearn.metrics.accuracy_score()).
5. Implement additional classification evaluation metrics – print the precision, recall, and confusion matrix on the test set.
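The steps above can be sketched end to end as follows. This is one possible solution under the assumption of a binary classification problem; the synthetic dataset stands in for the exercise's actual training and test data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the exercise supplies its own train/test split.
X, y = make_classification(n_samples=200, n_features=6, random_state=1)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=1)

# 1. Create the model with default parameters and print its parameters.
rf = RandomForestClassifier()
print(rf.get_params())

# 2. Fit on the training data and labels.
rf.fit(x_train, y_train)

# 3. Predict the classes of the test set.
y_pred = rf.predict(x_test)

# 4. Accuracy on the test set.
print(rf.score(x_test, y_test))

# 5. Additional metrics: precision, recall, and the confusion matrix.
print(precision_score(y_test, y_pred))
print(recall_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```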