In this lesson, you'll learn how random forests try to improve on some of the shortcomings of decision trees. 

We've seen that decision trees can be powerful supervised machine learning models. However, they're not without their weaknesses &mdash; decision trees are often prone to overfitting.

We've discussed some strategies to minimize this problem, like pruning, but sometimes that isn't enough. We need to find another way to generalize our trees. This is where the concept of a _random forest_ comes in handy.

A _random forest_ is an ensemble machine learning technique &mdash; a random forest contains many decision trees that all work together to classify new points. When a random forest is asked to classify a new point, the random forest gives that point to each of the decision trees. Each of those trees reports its classification, and the random forest returns the most popular one. It's like every tree gets a vote, and the most popular classification wins.

Some of the trees in the random forest may be overfit, but by making the prediction based on a large number of trees, overfitting will have less of an impact.

In this lesson, we'll learn how the trees in a random forest get created.

Random Forest

You might be wondering how the trees in the random forest get created. After all, right now, our algorithm for creating a decision tree is deterministic &mdash; given a training set, the same tree will be made every time. 

Random forests create different trees using a process known as _bagging_. Every time a decision tree is made, it is created using a different subset of the points in the training set. For example, if our training set had `1000` rows in it, we could make a decision tree by picking `100` of those rows at random to build the tree.  This way, every tree is different, but all trees will still be created from a portion of the training data.

One thing to note is that when we're randomly selecting these `100` rows, we're doing so _with replacement_. Picture putting all `1000` rows in a bag and reaching in and grabbing one row at random. After writing down which row we picked, we put that row back in our bag.

This means that when we're picking our `100` random rows, we could pick the same row more than once. In fact, it's very unlikely, but all `100` randomly picked rows could all be the same row!

Because we're picking these rows with replacement, there's no need to shrink our bagged training set from `1000` rows to `100`. We can pick `1000` rows at random, and because we can get the same row more than once, we'll still end up with a unique data set.
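To make the idea concrete, here is a minimal sketch of drawing a bagged training set using only Python's standard library. The function name `bag_rows` and the tiny example rows are assumptions made for illustration; they are not part of the lesson's own code.

```py
import random

def bag_rows(rows, num_samples=None, seed=None):
    """Return a bagged sample: rows drawn at random with replacement."""
    rng = random.Random(seed)
    if num_samples is None:
        # By default, draw as many rows as the original training set has
        num_samples = len(rows)
    # random.choices samples with replacement, so a row can appear more than once
    return rng.choices(rows, k=num_samples)

# Hypothetical mini training set (three features per car)
training_set = [
    ["low", "med", "4"],
    ["high", "low", "2"],
    ["med", "med", "5more"],
    ["vhigh", "high", "3"],
]
bagged = bag_rows(training_set, seed=1)
print(bagged)
```

Because the rows are drawn with replacement, the bagged set has the same length as the original but will usually contain some duplicates and leave some rows out.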

Let's implement bagging! We'll be using the data set of cars that we used in our decision tree lesson.

Bagging

We're now making trees based on different random subsets of our initial dataset. But we can continue to add variety to the ways our trees are created by changing the features that we use. 

Recall that for our car data set, the original features were the following:
* The price of the car
* The cost of maintenance
* The number of doors
* The number of people the car can hold
* The size of the trunk
* The safety rating

Right now when we create a decision tree, we look at every one of those features and choose to split the data based on the feature that produces the most information gain. We could change how the tree is created by only allowing a subset of those features to be considered at each split.

For example, when finding which feature to split the data on the first time, we might randomly choose to only consider the price of the car, the number of doors, and the safety rating. 

After splitting the data on the best feature from that subset, we'll likely want to split again. For this next split, we'll randomly select three features again to consider. This time those features might be the cost of maintenance, the number of doors, and the size of the trunk. We'll continue this process until the tree is complete.

One question to consider is how to choose the number of features to randomly select. Why did we choose `3` in this example? A good rule of thumb is to randomly select the square root of the total number of features. Our car dataset doesn't have a lot of features, so in this example, it's difficult to follow this rule. But if we had a dataset with `25` features, we'd want to randomly select `5` features to consider at every split point.
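As a rough sketch, picking the candidate features for a single split might look like the following. The function name `pick_candidate_features` is hypothetical, and rounding the square root to the nearest whole number is an assumption of this sketch rather than a fixed rule.

```py
import math
import random

def pick_candidate_features(num_features, seed=None):
    """Pick a random subset of feature indices to consider at one split."""
    rng = random.Random(seed)
    # Rule of thumb: use about sqrt(n) features.
    # Rounding to the nearest whole number is an assumption of this sketch.
    subset_size = max(1, round(math.sqrt(num_features)))
    # random.sample draws without replacement, so no feature is repeated
    return rng.sample(range(num_features), subset_size)

# With 25 features, five distinct feature indices are chosen
print(pick_candidate_features(25, seed=1))
```

A new subset would be drawn every time the tree needs to choose a split, so different splits in the same tree can consider different features.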

Bagging Features

Now that we can make different decision trees, it's time to plant a whole forest! Let's say we make `8` different trees using bagging and feature bagging. We can now take a new unlabeled point, give that point to each tree in the forest, and count the number of times different labels are predicted.

The trees give us their votes and the label that is predicted most often will be our final classification! For example, if we gave our random forest of 8 trees a new data point, we might get the following results:

```py
["vgood", "vgood", "good", "vgood", "acc", "vgood", "good", "vgood"]
```
Since the most commonly predicted classification was `"vgood"`, this would be the random forest's final classification.
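The vote count above can be sketched with a `Counter` from Python's standard library. The helper name `majority_vote` is an assumption for illustration, and this sketch does nothing special to handle ties.

```py
from collections import Counter

def majority_vote(predictions):
    """Return the label that the trees predicted most often."""
    # most_common(1) returns a list holding the single (label, count) pair
    return Counter(predictions).most_common(1)[0][0]

votes = ["vgood", "vgood", "good", "vgood", "acc", "vgood", "good", "vgood"]
print(majority_vote(votes))  # vgood
```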

Let's write some code that can classify an unlabeled point!

Classify

We're now able to create a random forest, but how accurate is it compared to a single decision tree? To answer this question we've split our data into a training set and test set. By building our models using the training set and testing on every data point in the test set, we can calculate the accuracy of both a single decision tree and a random forest.

We've given you code that calculates the accuracy of a single tree. This tree was made without using any of the bagging techniques we just learned. We created the tree by using every row from the training set once and considered every feature when splitting the data rather than a random subset.

Let's also calculate the accuracy of a random forest and see how it compares!

Test Set

You now have the ability to make a random forest using your own decision trees. However, `scikit-learn` has a `RandomForestClassifier` class that will do all of this work for you! `RandomForestClassifier` is in the `sklearn.ensemble` module. 

`RandomForestClassifier` works almost identically to `DecisionTreeClassifier` &mdash; the `.fit()`, `.predict()`, and `.score()` methods work in the exact same way.

When creating a `RandomForestClassifier`, you can choose how many trees to include in the random forest by using the `n_estimators` parameter like this:

```py
classifier = RandomForestClassifier(n_estimators = 100)
```

We now have a very powerful machine learning model that is fairly resistant to overfitting!


Random Forest in Scikit-learn

Nice work! Here are some of the major takeaways about random forests:
* A random forest is an ensemble machine learning model. It makes a classification by aggregating the classifications of many decision trees.
* Random forests are used to avoid overfitting. By aggregating the classification of multiple trees, having overfitted trees in a random forest is less impactful.
* Every decision tree in a random forest is created by using a different subset of data points from the training set. Those data points are chosen at random _with replacement_, which means a single data point can be chosen more than once. This process is known as _bagging_.
* When creating a tree in a random forest, a randomly selected subset of features is considered as candidates for the best splitting feature. If your dataset has `n` features, it is common practice to randomly select the square root of `n` features.


Review

Random Forests

In this course, you will learn how to build and use decision trees - a powerful supervised machine learning model. After gaining an understanding of the strengths and weaknesses of a decision tree, you will learn how random forests are used to solve some of those weaknesses.

In this course, you will learn how to build and use decision trees and random forests - two powerful supervised machine learning models.

Decision Trees

Decision trees are supervised machine learning models. In this course, you'll learn how to create a decision tree and classify new data using the tree.

Decision trees are machine learning models that try to find patterns in the features of data points. Take a look at the tree on this page. This tree tries to predict whether a student will get an A on their next test. 

By asking questions like "What is the student's average grade in the class?" the decision tree tries to get a better understanding of their chances on the next test.

In order to make a classification, this classifier needs a data point with four features:
* The student's average grade in the class.
* The number of hours the student plans on studying for the test.
* The number of hours the student plans on sleeping the night before the test.
* Whether or not the student plans on cheating.

For example, let's say that somebody has a "B" average in the class, studied for more than 3 hours, slept less than 5 hours before the test, and doesn't plan to cheat. If we start at the top of the tree and take the correct path based on that data, we'll arrive at a _leaf node_ that predicts the person will _not_ get an A on the next test.

In this course, you'll learn how to create a tree like this!

If we're given this magic tree, it seems relatively easy to make classifications. But how do these trees get created in the first place? Decision trees are supervised machine learning models, which means that they're created from a training set of labeled data. Creating the tree is where the _learning_ in machine learning happens.

Take a look at the gif on this page. We begin with every point in the training set at the top of the tree. These training points have labels &mdash; the red points represent students that didn't get an A on a test and the green points represent students that did get an A on a test.

We then decide to split the data into smaller groups based on a feature. For example, that feature could be something like their average grade in the class. Students with an A average would go into one subset, students with a B average would go into another subset, and so on.

Once we have these subsets, we repeat the process &mdash; we split the data in each subset again on a different feature. 

Eventually, we reach a point where we decide to stop splitting the data into smaller groups. We've reached a leaf of the tree. We can now count up the labels of the data in that leaf. If an unlabeled point reaches that leaf, it will be classified as the majority label.

We can now make a tree, but how did we know which features to split the data set with? After all, if we started by splitting the data based on the number of hours they slept the night before the test, we'd end up with a very different tree that would produce very different results. How do we know which tree is best? We'll tackle this question soon!


Making Decision Trees

In this lesson, we'll create a decision tree built from a dataset about cars. When considering buying a car, what factors go into making that decision?

Each car falls into one of four classes, which represent how satisfied someone would be with purchasing the car &mdash; `unacc` (unacceptable), `acc` (acceptable), `good`, and `vgood`.

Each car has 6 features:
* The price of the car which can be `"vhigh"`, `"high"`, `"med"`, or `"low"`.
* The cost of maintaining the car which can be `"vhigh"`, `"high"`, `"med"`, or `"low"`.
* The number of doors which can be `"2"`, `"3"`, `"4"`, `"5more"`.
* The number of people the car can hold which can be `"2"`, `"4"`, or `"more"`.
* The size of the trunk which can be `"small"`, `"med"`, or `"big"`.
* The safety rating of the car which can be `"low"`, `"med"`, or `"high"`.

We've imported a dataset of cars behind the scenes and created a decision tree using that data. In this lesson, you'll learn how to build that tree yourself, but for now, let's see what the tree can do!

Cars

Consider the two trees below. Which tree would be more useful as a model that tries to predict whether someone would get an A in a class?

<img src="https://content.codecademy.com/programs/data-science-path/decision-trees/comparison_1.svg" alt="A tree where the leaf nodes have different types of classification">

<img src="https://content.codecademy.com/programs/data-science-path/decision-trees/comparison_2.svg" alt="A tree where the leaf nodes have only one type of classification">

Let's say you use the top tree. You'll end up at a leaf node where the label is up for debate. The training data has labels from both classes! If you use the bottom tree, you'll end up at a leaf where there's only one type of label. There's no debate at all! We'd be much more confident about our classification if we used the bottom tree.

This idea can be quantified by calculating the _Gini impurity_ of a set of data points. To find the Gini impurity, start at `1` and subtract the squared proportion of each label in the set. For example, if a data set had three items of class `A` and one item of class `B`, the Gini impurity of the set would be:

```tex
1 - \bigg(\frac{3}{4}\bigg)^2 - \bigg(\frac{1}{4}\bigg)^2  = 0.375
```

If a data set has only one class, you'd end up with a Gini impurity of `0`. The lower the impurity, the better the decision tree!
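The calculation above can be written as a short function. This is a minimal sketch using only the standard library; the name `gini` and the choice to return `0.0` for an empty set are assumptions made for illustration.

```py
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared label proportions."""
    total = len(labels)
    if total == 0:
        return 0.0  # assumption: treat an empty set as pure
    counts = Counter(labels)
    return 1 - sum((count / total) ** 2 for count in counts.values())

print(gini(["A", "A", "A", "B"]))  # 0.375
print(gini(["A", "A", "A"]))       # 0.0
```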


Gini Impurity

We know that we want to end up with leaves with a low Gini Impurity, but we still need to figure out which features to split on in order to achieve this. For example, is it better if we split our dataset of students based on how much sleep they got or how much time they spent studying?

To answer this question, we can calculate the _information gain_ of splitting the data on a certain feature. Information gain measures the difference in the impurity of the data before and after the split. For example, let's say you had a dataset with an impurity of `0.5`. After splitting the data based on a feature, you end up with three groups with impurities `0`, `0.375`, and `0`. The information gain of splitting the data in that way is `0.5 - 0 - 0.375 - 0 = 0.125`.

<img src="https://content.codecademy.com/programs/data-science-path/decision-trees/info.svg">

Not bad! By splitting the data in that way, we've gained some information about how the data is structured &mdash; the datasets after the split are purer than they were before the split. The higher the information gain the better &mdash; if information gain is `0`, then splitting the data on that feature was useless!
Unfortunately, right now it's possible for information gain to be negative. In the next exercise, we'll calculate _weighted_ information gain to fix that problem.
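Here is a small sketch of the unweighted calculation described in this exercise. The labels below are made up so that the impurities come out to `0.5` before the split and `0`, `0.375`, and `0` after it; the function names are assumptions for illustration.

```py
from collections import Counter

def gini(labels):
    """Gini impurity of a non-empty list of labels."""
    total = len(labels)
    counts = Counter(labels)
    return 1 - sum((count / total) ** 2 for count in counts.values())

def information_gain(parent_labels, subsets):
    """Unweighted gain: parent impurity minus the impurity of every subset."""
    return gini(parent_labels) - sum(gini(subset) for subset in subsets)

parent = ["A"] * 5 + ["B"] * 5
subsets = [["A", "A"], ["A", "A", "A", "B"], ["B", "B", "B", "B"]]
print(information_gain(parent, subsets))  # 0.125

# A split that separates nothing can produce a negative unweighted gain
print(information_gain(["A", "B", "A", "B"], [["A", "B"], ["A", "B"]]))  # -0.5
```

The second call shows the problem mentioned above: both subsets are as impure as the original set, so subtracting each impurity in full pushes the result below zero.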


Information Gain

We're not quite done calculating the information gain of a set of objects. The sizes of the subsets that get created after the split are important too! For example, the image below shows two sets with the same impurity. Which set would you rather have in your decision tree?

<img src="https://content.codecademy.com/programs/data-science-path/decision-trees/impurity-0.svg">

Both of these sets are perfectly pure, but the purity of the second set is much more meaningful. Because there are so many items in the second set, we can be confident that whatever we did to produce this set wasn't an accident.

It might be helpful to think about the inverse as well. Consider these two sets with the same impurity:

<img src="https://content.codecademy.com/programs/data-science-path/decision-trees/impurity-5.svg">

Both of these sets are completely impure. However, that impurity is much more meaningful in the set with more instances. We know that we are going to have to do a lot more work in order to completely separate the two classes. Meanwhile, the impurity of the set with two items isn't as important. We know that we'll only need to split the set one more time in order to make two pure sets.

Let's modify the formula for information gain to reflect the fact that the size of the set is relevant. Instead of simply subtracting the impurity of each set, we'll subtract the _weighted_ impurity of each of the split sets. If the data before the split contained `20` items and one of the resulting splits contained `2` items, then the weighted impurity of that subset would be `2/20 * impurity`. 
We're lowering the importance of the impurity of sets with few elements.

<img src="https://content.codecademy.com/programs/data-science-path/decision-trees/weighted_info.svg">

Now that we can calculate the information gain using weighted impurity, let's do that for every possible feature. If we do this, we can find the best feature to split the data on.
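A minimal sketch of the weighted version follows. Each subset's impurity is multiplied by the fraction of the original items that ended up in that subset. The function names and the example labels are assumptions for illustration.

```py
from collections import Counter

def gini(labels):
    """Gini impurity of a non-empty list of labels."""
    total = len(labels)
    counts = Counter(labels)
    return 1 - sum((count / total) ** 2 for count in counts.values())

def weighted_information_gain(parent_labels, subsets):
    """Parent impurity minus the size-weighted impurity of each subset."""
    total = len(parent_labels)
    weighted_impurity = sum(
        len(subset) / total * gini(subset) for subset in subsets
    )
    return gini(parent_labels) - weighted_impurity

parent = ["A"] * 5 + ["B"] * 5
subsets = [["A", "A"], ["A", "A", "A", "B"], ["B", "B", "B", "B"]]
# 0.5 - (2/10 * 0 + 4/10 * 0.375 + 4/10 * 0), which is about 0.35
print(weighted_information_gain(parent, subsets))

# A split that separates nothing now has a gain of 0 rather than a negative value
print(weighted_information_gain(["A", "B", "A", "B"], [["A", "B"], ["A", "B"]]))  # 0.0
```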


Weighted Information Gain

Now that we can find the best feature to split the dataset, we can repeat this process again and again to create the full tree. This is a recursive algorithm! We start with every data point from the training set, find the best feature to split the data, split the data based on that feature, and then recursively repeat the process again on each subset that was created from the split.

We'll stop the recursion when we can no longer find a feature that results in any information gain. In other words, we want to create a leaf of the tree when we can't find a way to split the data that makes purer subsets. 

The leaf should keep track of the classes of the data points from the training set that ended up in the leaf. In our implementation, we'll use a `Counter` object to keep track of the counts of labels.

We'll use these counts to make predictions about new data that we give the tree.
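The recursive process described above can be sketched as follows. This is not the lesson's own `build_tree()` implementation: the helper names `split` and `weighted_gain`, and the choice to represent an internal node as a `(feature, branches)` tuple, are assumptions made to keep the example short.

```py
from collections import Counter

def gini(labels):
    """Gini impurity of a non-empty list of labels."""
    total = len(labels)
    counts = Counter(labels)
    return 1 - sum((count / total) ** 2 for count in counts.values())

def split(rows, labels, feature):
    """Group rows and labels by the value each row has for `feature`."""
    groups = {}
    for row, label in zip(rows, labels):
        group_rows, group_labels = groups.setdefault(row[feature], ([], []))
        group_rows.append(row)
        group_labels.append(label)
    return groups

def weighted_gain(labels, groups):
    """Weighted information gain of splitting `labels` into `groups`."""
    remainder = sum(
        len(group_labels) / len(labels) * gini(group_labels)
        for _, group_labels in groups.values()
    )
    return gini(labels) - remainder

def build_tree(rows, labels):
    """Return a Counter (leaf) or a (feature, branches) tuple (internal node)."""
    best_gain, best_feature, best_groups = 0, None, None
    for feature in range(len(rows[0])):
        groups = split(rows, labels, feature)
        gain = weighted_gain(labels, groups)
        if gain > best_gain:
            best_gain, best_feature, best_groups = gain, feature, groups
    # No feature gives any information gain: stop and store the label counts
    if best_feature is None:
        return Counter(labels)
    # Otherwise, recurse on every subset created by the best split
    branches = {
        value: build_tree(group_rows, group_labels)
        for value, (group_rows, group_labels) in best_groups.items()
    }
    return (best_feature, branches)

# Hypothetical two-feature data: price and number of doors
rows = [["low", "2"], ["low", "4"], ["high", "2"], ["high", "4"]]
labels = ["acc", "acc", "unacc", "unacc"]
print(build_tree(rows, labels))
```

In this toy data, splitting on the first feature separates the two labels completely, so the tree has one internal node and two pure leaves.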

Recursive Tree Building

We can finally use our tree as a classifier! Given a new data point, we start at the top of the tree and follow the path of the tree until we hit a leaf. Once we get to a leaf, we'll use the classes of the points from the training set to make a classification. 

We've slightly changed the way our `build_tree()` function works. Instead of returning a list of branches or a `Counter` object, the `build_tree()` function now returns a `Leaf` object or an `Internal_Node` object. We'll explain how to use these objects in the instructions!


Let's write a function that will use our tree to classify new points!
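One possible shape for such a function is sketched below. The `Leaf` and `Internal_Node` classes here are simplified, hypothetical stand-ins: the attribute names `labels`, `feature`, and `branches` are assumptions, and the classes provided in the exercise may be organized differently.

```py
from collections import Counter

class Leaf:
    """Hypothetical stand-in: stores the counts of training labels."""
    def __init__(self, labels):
        self.labels = Counter(labels)

class Internal_Node:
    """Hypothetical stand-in: a feature index plus one child per feature value."""
    def __init__(self, feature, branches):
        self.feature = feature
        self.branches = branches  # dict mapping feature value -> child node

def classify(datapoint, node):
    # At a leaf, predict the most common training label stored there
    if isinstance(node, Leaf):
        return node.labels.most_common(1)[0][0]
    # Otherwise, follow the branch that matches this point's feature value
    child = node.branches.get(datapoint[node.feature])
    if child is None:
        return None  # value never seen in training; a fuller version needs a fallback
    return classify(datapoint, child)

tree = Internal_Node(0, {
    "low": Leaf(["acc", "acc", "good"]),
    "high": Leaf(["unacc", "unacc"]),
})
print(classify(["low", "4"], tree))   # acc
print(classify(["high", "2"], tree))  # unacc
```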


Classifying New Data

Nice work! You've written a decision tree from scratch that is able to classify new points. Let's take a look at how the Python library `scikit-learn` implements decision trees.

The `sklearn.tree` module contains the `DecisionTreeClassifier` class. To create a `DecisionTreeClassifier` object, call the constructor:

```py
classifier = DecisionTreeClassifier()
```

Next, we want to create the tree based on our training data. To do this, we'll use the `.fit()` method. 

`.fit()` takes a list of data points followed by a list of the labels associated with that data. Note that when we built our tree from scratch, our data points contained strings like `"vhigh"` or `"5more"`. When creating the tree using `scikit-learn`, it's a good idea to map those strings to numbers. For example, for the first feature representing the price of the car, `"low"` would map to `1`, `"med"` would map to `2`, and so on.

```py
classifier.fit(training_data, training_labels)
```

Finally, once we've made our tree, we can use it to classify new data points. The `.predict()` method takes an array of data points and will return an array of classifications for those data points.

```py
predictions = classifier.predict(test_data)
```
If you've split your data into a test set, you can find the accuracy of the model by calling the `.score()` method using the test data and the test labels as parameters.
```py
print(classifier.score(test_data, test_labels))
```
`.score()` returns the fraction of data points from the test set that it classified correctly, as a number between `0` and `1`.
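As a rough illustration of what `.score()` measures for a classifier, the sketch below computes the share of predictions that match the true labels in plain Python. The function name `accuracy` and the example labels are assumptions for illustration, not part of `scikit-learn`.

```py
def accuracy(predictions, true_labels):
    """Share of predictions that match the true labels (between 0 and 1)."""
    correct = sum(
        1 for predicted, actual in zip(predictions, true_labels)
        if predicted == actual
    )
    return correct / len(true_labels)

predicted = ["acc", "good", "unacc", "acc"]
actual = ["acc", "good", "acc", "acc"]
print(accuracy(predicted, actual))  # 0.75
```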

Decision Trees in scikit-learn

Now that we have an understanding of how decision trees are created and used, let's talk about some of their limitations.

One problem with the way we're currently making our decision trees is that our trees aren't always _globally optimal_. This means that there might be a better tree out there somewhere that produces better results. But wait, why did we go through all that work of finding information gain if it's not producing the best possible tree?

Our current strategy of creating trees is _greedy_. We assume that the best way to create a tree is to find the feature that will result in the largest information gain _right now_ and split on that feature. We never consider the ramifications of that split further down the tree. It's possible that if we split on a suboptimal feature right now, we would find even better splits later on. Unfortunately, finding a globally optimal tree is an extremely difficult task, and finding a tree using our greedy approach is a reasonable substitute.

Another problem with our trees is that they potentially _overfit_ the data. This means that the structure of the tree is too dependent on the training data and doesn't accurately represent what data in the real world looks like. In general, larger trees tend to overfit the data more. As the tree gets bigger, it becomes more tuned to the training data and loses its more generalized understanding of real-world data.

One way to solve this problem is to _prune_ the tree. The goal of pruning is to shrink the size of the tree. There are a few different pruning strategies, and we won't go into the details of them here. `scikit-learn` currently doesn't prune the tree by default; however, we can dig into the code a bit to prune it ourselves.


Decision Tree Limitations

Great work! In this lesson, you learned how to create decision trees and use them to make classifications. Here are some of the major takeaways:

* Good decision trees have pure leaves. A leaf is pure if all of the data points in that leaf have the same label.
* Decision trees are created using a greedy algorithm that prioritizes finding the feature that results in the largest information gain when splitting the data using that feature.
* Creating an optimal decision tree is difficult. The greedy algorithm doesn't always find the globally optimal tree.
* Decision trees often suffer from overfitting. Making the tree small by pruning helps to generalize the tree so it is more accurate on data in the real world.


Use random forests to predict the income of a person based on census data.

Let's begin by investigating the data available to us. Click on the file `income.csv` and take a look. Notice the first row of the dataset contains the names of our columns. What columns do you think might be helpful in predicting a person's income?

You can find more detailed descriptions of the columns on <a href = "https://archive.ics.uci.edu/ml/datasets/census%20income" target = "_blank">UCI's Machine Learning Repository</a>.

Go back to the `income.py` file. We want to get all of that data into a Pandas DataFrame. Use the `pd.read_csv()` function using `"income.csv"` as a parameter and store the result in a variable named `income_data`. Since the first row of our file contains the names of the columns, we also want to add the argument `header = 0`.

Take a look at one of the rows of the data we've imported. Print `income_data.iloc[0]` to see the first row in its entirety. Did this person make more than $50,000? What is the name of the column that contains that information?

There's a small problem with our data that is a little hard to catch &mdash; every string has an extra space at the start. For example, the first row's `native-country` is `" United-States"`, but we want it to be `"United-States"`. This is happening because in `income.csv` there are spaces after the commas. To fix this, we can add the parameter `delimiter = ", "` to our `read_csv()` function.

Now that we have our data imported into a DataFrame, we can begin putting it in a format that our Random Forest can work with. To do this, we need to separate the labels from the rest of the data. 

For this project, the labels are in the column called `"income"`. We want to grab only this column. You can get a single column from a DataFrame using this syntax:

```py
one_column = data_frame_name[["column_name"]]
```
Create a variable named `labels` that contains only the column `"income"` from the `income_data` DataFrame.


We'll also want to pick which columns to use when trying to predict income. For now, let's select `"age"`, `"capital-gain"`, `"capital-loss"`, `"hours-per-week"`, and `"sex"`. Create a new variable named `data` that contains only those columns. The syntax for this is very similar to selecting only one column:

```py
many_columns = data_frame_name[["a", "b", "c"]]
```
In this example, `many_columns` now contains the columns `"a"`, `"b"`, and `"c"` from `data_frame_name`.


Finally, we want to split our data and labels into a training set and a test set. We'll use the training set to build the random forest, and the test set to see how accurate it is. Use the `train_test_split()` function to do this. 

`train_test_split()` returns four values &mdash; name them `train_data`, `test_data`, `train_labels`, and `test_labels`. When calling `train_test_split()`, it should take three arguments &mdash; `data`, `labels` and `random_state = 1`.

We're now ready to use this data to build and test our random forest. First, create a `RandomForestClassifier` and name it `forest`. When you create the random forest, use the parameter `random_state = 1`.

Next, we need to fit the model. We want to use the training data and training labels to train the random forest. 

Call `forest`'s `.fit()` method using `train_data` and `train_labels` as parameters. When you run your code, there should be an error!

There seems to be a problem with using the column `"sex"` when training the random forest. 

In that column, there are values like `"Male"` and `"Female"`. Random forests can't use columns that contain strings &mdash; the values have to be numeric, like integers or floats. We'll solve this problem soon, but for now, let's remove the `"sex"` column when creating `data`.

Now that our training set doesn't have a column containing strings, we have successfully fit our random forest.

We can now test its accuracy. Call `forest`'s `.score()` method using `test_data` and `test_labels` as parameters. Print the result.


Now that we know the random forest works, let's go back and try to add the `"sex"` column. 

Recall that the problem was that this column contained strings. If we transformed those strings into integers, we could use this data!

If we take every row and make every `"Male"` a `0` and every `"Female"` a `1`, we could then use the column in our random forest. Before creating the `data` variable, use this line of code:

```py
income_data["sex-int"] = income_data["sex"].apply(lambda row: 0 if row == "Male" else 1)
```

This creates a new column called `"sex-int"` in the `income_data` DataFrame. Every row in that new column contains a `0` if the row's `"sex"` column contained `"Male"` and a `1` otherwise. 


Add `"sex-int"` to your list of columns included in `data`. What is your accuracy now?

There are a couple of other columns that use strings that might be useful to use. Let's try transforming the values in the `"native-country"` column. 

We should first take a look at the different values that exist in the column. Print `income_data["native-country"].value_counts()`.


Since the majority of the data comes from `"United-States"`, it might make sense to make a column where every row that contains `"United-States"` becomes a `0` and any other country becomes a `1`. Use the syntax from creating the `"sex-int"` column to create a `"country-int"` column.

When mapping strings to numbers like this, it is important to make sure that the ordering of the numbers makes sense. For example, it wouldn't make much sense to map `"United-States"` to `0`, `"Germany"` to `1`, and `"Mexico"` to `2`. If we did this, we'd be saying that Mexico is more similar to Germany than it is to the United States.

However, if you had values in a column like `"low"`, `"medium"`, and `"high"`, mapping those values to `0`, `1`, and `2` would make sense because the strings themselves have a natural order.
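The two kinds of mapping discussed here can be sketched in plain Python. The helper names `encode_ordinal` and `encode_is_other`, and the example values, are assumptions for illustration; in the project itself you would apply the same idea to a DataFrame column.

```py
# Ordered categories: the numbers keep the natural order of the strings
LEVELS = {"low": 0, "medium": 1, "high": 2}

def encode_ordinal(values, order):
    """Map each string to its position in a meaningful ordering."""
    return [order[value] for value in values]

def encode_is_other(values, reference):
    """Map the reference value to 0 and every other value to 1."""
    return [0 if value == reference else 1 for value in values]

print(encode_ordinal(["low", "high", "medium"], LEVELS))  # [0, 2, 1]

countries = ["United-States", "Germany", "Mexico", "United-States"]
print(encode_is_other(countries, "United-States"))  # [0, 1, 1, 0]
```

The second function avoids implying any order among countries: it only records whether a row matches the reference value.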

Add `"country-int"` to the columns used when creating `data`. How does this change the accuracy of your model?


Now that you've gotten the hang of transforming, adding, and removing columns from your training data, it's time to explore on your own to try to make the best classifier possible. 

As you play around with the data, here are some ideas that you might want to try:

* Create a `tree.DecisionTreeClassifier`, train it, test it using the same data, and compare the results to the random forest. When does the random forest do better than the single tree? When does a single tree do just as well as the forest?
* After calling `.fit()` on the forest, print `forest.feature_importances_`. This will show you a list of numbers where each number corresponds to the relevance of a column from the training data. Which features tend to be more relevant?
* Use some of the other columns that use continuous variables, or transform columns that use strings!

Predicting Income with Random Forests

A random forest makes a classification by aggregating the classification of multiple decision trees.

A random forest makes a classification by randomly picking one tree from a group of trees to make the classification.

A random forest is meant to be used with other machine learning models to validate their accuracy.

A random forest isn't an ensemble machine learning model - it is a supervised machine learning model.

When randomly selecting rows to be in our training set, the same row can be selected more than once.

When randomly selecting rows to be in our training set, a row that has already been selected will be replaced by a different row.

If there are `n` rows in our training set, we want to randomly select `sqrt(n)` unique rows.

When randomly selecting rows to be in our training set, we randomly replace values in rows to create new data points.

When constructing the tree, at every point a split needs to be made, a different subset of features is considered.

When constructing the tree, at every point a split needs to be made, the same subset of features is considered.

When constructing the tree, at every point a split needs to be made, all features are considered.

When constructing the tree, at every point a split needs to be made, the first `sqrt(n)` features are considered.

A random forest uses the most common classification from its decision trees as the final classification.

A random forest uses the least common classification from its decision trees as the final classification.

A random forest uses the average classification from its decision trees as the final classification.

A random forest uses a random classification from its decision trees as the final classification. 

The number of features to consider when implementing feature bagging.

The number of rows to randomly select when implementing bagging.

The maximum depth of the trees in the forest.

Take this quiz to test your knowledge of random forests

Congratulations, you’ve successfully completed the Machine Learning: Random Forests & Decision Trees course! You've learned how to implement decision trees, and make them more effective by using them within random forests.

Your learning journey into Machine Learning isn't over yet! Here is our roadmap to mastering Machine Learning:

* [Machine Learning: Random Forests & Decision Trees](https://www.codecademy.com/machine-learning-random-forests-decision-trees) <-- Completed!
* [Machine Learning: Clustering with K-Means](https://www.codecademy.com/machine-learning-clustering-with-k-means) <-- Up next!
* [Machine Learning: Perceptrons](https://www.codecademy.com/machine-learning-perceptrons)

Once again, congratulations on finishing the Machine Learning: Random Forests & Decision Trees course! We are excited to see what you accomplish next.

Next Steps

Use decision trees to predict what continent a flag comes from based on features like color and shape.

Let's start by seeing what the data looks like. Begin by loading the data into a variable named `flags` using pandas' `pd.read_csv()` function. The function should take the name of the CSV file you want to load. In this case, our file is named `"flags.csv"`.

We also want row `0` to be used as the header, so include the parameter `header = 0`.

Take a look at the names of the columns in our DataFrame. These are the features we have available to us. Print `flags.columns`.

Let's also take a look at the first few rows of the dataset. Print `flags.head()`.
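
As a rough sketch of these first steps, the snippet below loads a CSV with `header = 0` and prints the columns and first rows. Because `flags.csv` isn't available here, it reads from an in-memory stand-in with three made-up rows and only a few columns (an assumption for illustration, not the real dataset).

```python
import io

import pandas as pd

# Stand-in for "flags.csv": three made-up rows, NOT the real dataset.
csv_text = """Name,Landmass,Red,Green,Blue
Country A,5,1,1,0
Country B,3,1,0,0
Country C,4,1,1,0
"""

# header=0 tells pandas that row 0 holds the column names.
# In the project you would pass the filename "flags.csv" instead of the buffer.
flags = pd.read_csv(io.StringIO(csv_text), header=0)

print(flags.columns)
print(flags.head())
```

With the real file, the only change should be swapping the `io.StringIO(...)` argument for `"flags.csv"`.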

Many columns contain numbers that don't make a lot of sense. For example, the third row, which represents Algeria, has a `Language` of `8`. What exactly does that mean?

Take a look at the Attribute Information for this dataset from  <a href = "http://archive.ics.uci.edu/ml/datasets/Flags" target="_blank">UCI's Machine Learning Repository</a>.

Using that information along with the printout of `flags.head()`, can you figure out what landmass Andorra is on?

We're eventually going to create a decision tree to classify what `Landmass` a country is on.

Create a variable named `labels` and set it equal to only the `"Landmass"` column from `flags`. 

You can grab specific columns from a DataFrame using this syntax:

```py
one_column = df[["A"]]
two_columns = df[["B", "C"]]
```
In this example, `one_column` will be a DataFrame of only `df`'s `"A"` column. `two_columns` will be a DataFrame of the `"B"` and `"C"` columns from `df`.

We have our labels. Now we want to choose which columns will help our decision tree correctly classify those labels.

You could spend a lot of time playing with groups of columns to find the ones that work best. But for now, let's see if we can predict where a country is based only on the colors of its flag.

Create a variable named `data` and set it equal to a DataFrame containing the following columns from `flags`:
* `"Red"`
* `"Green"`
* `"Blue"`
* `"Gold"`
* `"White"`
* `"Black"`
* `"Orange"`
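
One way to write the two selections is sketched below. The tiny `flags` DataFrame is a made-up placeholder so the snippet runs on its own, and `color_columns` is a helper name introduced here, not something the project requires.

```python
import pandas as pd

# Made-up stand-in for the flags DataFrame (values are illustrative only).
flags = pd.DataFrame({
    "Name": ["Country A", "Country B"],
    "Landmass": [5, 3],
    "Red": [1, 1], "Green": [1, 0], "Blue": [0, 0], "Gold": [1, 1],
    "White": [1, 0], "Black": [1, 1], "Orange": [0, 0],
})

# Double brackets return a DataFrame rather than a Series.
labels = flags[["Landmass"]]

# Hypothetical helper list holding the seven color columns.
color_columns = ["Red", "Green", "Blue", "Gold", "White", "Black", "Orange"]
data = flags[color_columns]

print(labels.shape)
print(data.shape)
```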

Finally, let's split these DataFrames into a training set and test set using the `train_test_split()` function. This function should take `data` and `labels` as parameters. Also include the parameter `random_state = 1`.

This function returns four values. Name those values `train_data`, `test_data`, `train_labels`, and `test_labels` in that order.
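
A minimal sketch of the split, assuming `train_test_split` is imported from `sklearn.model_selection`. The eight-row `data` and `labels` frames are invented placeholders so the example is self-contained.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented placeholder data: 8 rows, 2 color columns, landmass codes 1-6.
data = pd.DataFrame({
    "Red":   [1, 1, 0, 1, 0, 1, 1, 0],
    "Green": [0, 1, 1, 0, 0, 1, 0, 1],
})
labels = pd.DataFrame({"Landmass": [5, 3, 4, 6, 3, 4, 1, 2]})

# The return order is: training data, test data, training labels, test labels.
train_data, test_data, train_labels, test_labels = train_test_split(
    data, labels, random_state=1
)

print(len(train_data), len(test_data))
```

By default, scikit-learn holds out 25% of the rows for the test set, and `random_state = 1` makes the shuffle repeatable.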

Create a `DecisionTreeClassifier` and name it `tree`. When you create the tree, give it the parameter `random_state = 1`.

Call `tree`'s `.fit()` method using `train_data` and `train_labels` to fit the tree to the training data.

Call `.score()` using `test_data` and `test_labels`. Print the result. 
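
Put together, creating, fitting, and scoring the tree might look like the sketch below. It runs on synthetic data (seven binary columns and labels `1` through `6` derived from them) rather than the flags dataset, so the printed score says nothing about the real project.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the flags dataset): 7 binary columns, labels 1-6.
rng = np.random.default_rng(1)
data = rng.integers(0, 2, size=(200, 7))
labels = 1 + (data @ np.array([1, 2, 3, 1, 2, 3, 1])) % 6

train_data, test_data, train_labels, test_labels = train_test_split(
    data, labels, random_state=1
)

tree = DecisionTreeClassifier(random_state=1)
tree.fit(train_data, train_labels)           # learn from the training set

score = tree.score(test_data, test_labels)   # accuracy on held-out rows
print(score)
```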

Since there are six possible landmasses, if we guessed uniformly at random, we'd expect to be right about one time in six, or roughly 17% of the time. Did our decision tree beat random guessing?
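
To sanity-check that baseline, here is a small simulation of uniform random guessing over six labels. The labels are randomly generated placeholders, and the variable names are chosen for this illustration only.

```python
import random

random.seed(1)
landmasses = [1, 2, 3, 4, 5, 6]
n_trials = 60_000

# Placeholder "true" labels and independent uniform guesses.
true_labels = [random.choice(landmasses) for _ in range(n_trials)]
guesses = [random.choice(landmasses) for _ in range(n_trials)]

hits = sum(t == g for t, g in zip(true_labels, guesses))
simulated_accuracy = hits / n_trials
expected_accuracy = 1 / len(landmasses)

print(round(expected_accuracy, 3))
print(round(simulated_accuracy, 3))
```

The simulated accuracy should land close to one in six, which is the bar the decision tree needs to clear.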

We now have a good baseline of how our model performs with these features. Let's see if we can prune the tree to make it better!

Put your code that creates, trains, and tests the tree inside a for loop that has a variable named `i` that increases from `1` to `20`.

Inside your for loop, when you create `tree`, give it the parameter `max_depth = i`.

We'll now see a printout of how the accuracy changes depending on how large we allow the tree to be.
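
The loop could be sketched as follows, again on synthetic stand-in data rather than the flags dataset. Note that `range(1, 21)` is what makes `i` run from `1` through `20`.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the flags dataset).
rng = np.random.default_rng(1)
data = rng.integers(0, 2, size=(200, 7))
labels = 1 + (data @ np.array([1, 2, 3, 1, 2, 3, 1])) % 6

train_data, test_data, train_labels, test_labels = train_test_split(
    data, labels, random_state=1
)

for i in range(1, 21):
    # max_depth caps how many levels of splits the tree may use.
    tree = DecisionTreeClassifier(random_state=1, max_depth=i)
    tree.fit(train_data, train_labels)
    print(i, tree.score(test_data, test_labels))
```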

Rather than printing the score of each tree, let's graph it! We want the x-axis to show the depth of the tree and the y-axis to show the tree's score.

To do this, we'll need to create a list containing all of the scores. Before the for loop, create an empty list named `scores`. Inside the loop, instead of printing the tree's score, use `.append()` to add it to `scores`.

Let's now plot our points. Call `plt.plot()` using two parameters. The first should be the points on the x-axis. In this case, that is `range(1, 21)`. The second should be `scores`.

Then call `plt.show()`.
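
A sketch of the full graphing step is below. It uses synthetic stand-in data, and it selects Matplotlib's non-interactive `Agg` backend so it can run without a display; both are assumptions of this example. The axis labels are an optional extra.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line to see a window
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the flags dataset).
rng = np.random.default_rng(1)
data = rng.integers(0, 2, size=(200, 7))
labels = 1 + (data @ np.array([1, 2, 3, 1, 2, 3, 1])) % 6

train_data, test_data, train_labels, test_labels = train_test_split(
    data, labels, random_state=1
)

scores = []
for i in range(1, 21):
    tree = DecisionTreeClassifier(random_state=1, max_depth=i)
    tree.fit(train_data, train_labels)
    scores.append(tree.score(test_data, test_labels))

# x-axis: tree depth, y-axis: accuracy on the test set.
plt.plot(range(1, 21), scores)
plt.xlabel("max_depth")
plt.ylabel("accuracy")
plt.show()
```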

Our graph doesn't really look like we would expect it to. It seems like the depth of the tree isn't really having an impact on its performance. This might be a good indication that we're not using enough features. 

Let's add all the features that have to do with shapes to our data. `data` should now be set equal to:

```py
flags[["Red", "Green", "Blue", "Gold",
       "White", "Black", "Orange",
       "Circles", "Crosses", "Saltires", "Quarters",
       "Sunstars", "Crescent", "Triangle"]]
```

What does your graph look like after making this change?

Nice work! That graph looks more like what we'd expect. If the tree is too short, we're underfitting and not accurately representing the training data. If the tree is too big, we're getting too specific and relying too heavily on the training data.

There are a few different ways to extend this project:
* Try to classify something else! Rather than predicting the `"Landmass"` feature, could you predict something like the `"Language"`?
* Find a subset of features that works better than what we're currently using. One important note: a column that stores categorical data as arbitrary numeric codes won't work very well as a feature. For example, we don't want a decision node to split nodes based on whether the value for `"Language"` is above or below `5`.
* Tune more parameters of the model. You can find a description of all the parameters you can tune in the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier" target="_blank">Decision Tree Classifier documentation</a>. For example, see what happens if you tune `max_leaf_nodes`. Think about whether you would be overfitting or underfitting the data based on how many leaf nodes you allow.
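
As a starting point for the last suggestion, the sketch below tries a handful of `max_leaf_nodes` values. The candidate values, the `results` dictionary, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the flags dataset).
rng = np.random.default_rng(1)
data = rng.integers(0, 2, size=(200, 7))
labels = 1 + (data @ np.array([1, 2, 3, 1, 2, 3, 1])) % 6

train_data, test_data, train_labels, test_labels = train_test_split(
    data, labels, random_state=1
)

results = {}
for n in [2, 5, 10, 20, 50]:
    # max_leaf_nodes limits how many leaves the finished tree may have.
    tree = DecisionTreeClassifier(random_state=1, max_leaf_nodes=n)
    tree.fit(train_data, train_labels)
    results[n] = tree.score(test_data, test_labels)

for n, score in results.items():
    print(n, score)
```

Very few leaves force the tree to lump many points together, while a large limit lets it grow close to an unrestricted tree.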

Find the Flag

The algorithm doesn't look ahead. It chooses to split the data based on the best possible feature given the current dataset.

Because the algorithm is recursive, it requires more processing power.

The algorithm takes as much time as it needs to find the globally optimal solution.

A set of 20 objects &mdash; `5` have label `A`, `5` have label `B`, `5` have label `C`, and `5` have label `D`. 

A set of 100 objects that all have the same label.

A set of 10 objects that all have the same label.

A set of 20 objects &mdash; `10` have label `A` and `10` have label `B`.

An internal node represents which feature to split the data on.

An internal node represents the predicted class of an unlabeled point that reaches that node.

Decision trees don't have internal nodes &mdash; they only have the root and leaves.

An internal node represents the Gini impurity of a set of data.

The `.fit()` method creates the tree according to the training data and training labels.

The `.fit()` method returns the accuracy of the tree according to the testing data and testing labels.

The `.fit()` method creates a `DecisionTreeClassifier` object.

The `.fit()` method creates a random forest of `DecisionTreeClassifier`s.

The relative sizes of each subset after the split.

How many times you have already split the data.

The index of the feature that was used to split the data.

We've reached the base case when splitting on every feature results in no information gain.

We've reached the base case when splitting on one feature results in no information gain.

We've reached the base case when splitting on every feature results in information gain greater than `0`.

We've reached the base case when splitting on one feature results in information gain greater than `0`.

Test your understanding of decision trees by taking this quiz

### About this course
Continue your Machine Learning journey with Machine Learning: Random Forests and Decision Trees. Find patterns in data with decision trees, learn about the weaknesses of those trees, and see how they can be improved with random forests.

### Skills you'll gain
* Prepare data for Decision Tree and Random Forest Classifiers
* Implement and assess decision trees and random forests
* Explain the limitations of decision trees


Learn how to build decision trees and then build those trees into random forests.