Learn about how random forests work and how to implement them in Python.

In this lesson, you’ll learn what random forests are, how the random forest algorithm works and how to implement it in `scikit-learn`.

We’ve seen that [decision trees](https://www.codecademy.com/paths/data-science/tracks/dsml-supervised-learning-i/modules/mle-decision-trees-8b27e5b3-0352-4887-a8e5-e4a507597ad5/lessons/mlfun-decision-trees/exercises/what-is-a-decision-tree)** can be powerful supervised machine learning models. However, they’re not without their weaknesses — decision trees are often prone to overfitting. We’ve discussed some strategies to minimize this problem, like pruning, but sometimes that isn’t enough. We need to find another way to generalize our trees. This is where the concept of a random forest comes in handy.

A random forest is an _ensemble machine learning technique_. A random forest contains many decision trees that all work together to classify new points. When a random forest is asked to classify a new point, the random forest gives that point to each of the decision trees. Each of those trees reports its classification and the random forest returns the most popular classification. It’s like every tree gets a vote, and the most popular classification wins. Some of the trees in the random forest may be overfit, but by making the prediction based on a large number of trees, overfitting will have less of an impact.

** _A prerequisite for this lesson is the understanding of decision trees. Be sure to revisit the decision trees module if needed!_ 



Basics of a Random Forest

You might be wondering how the trees in the random forest get created. After all, right now, our algorithm for creating a decision tree is deterministic: given a training set, the same tree will be made every time. To make a random forest, we use a technique called _bagging_, which is short for _bootstrap aggregating_. This exercise will explain bootstrapping, which is a type of sampling method done with replacement.

How it works is as follows: every time a decision tree is made, it is created using a different subset of the points in the training set. For example, if our training set had `1000` rows in it, we could make a decision tree by picking `100` of those rows at random to build the tree.  This way, every tree is different, but all trees will still be created from a portion of the training data.

In bootstrapping, we're doing this process _with replacement_. Picture putting all `1000` rows in a bag and reaching in and grabbing one row at random. After writing down what row we picked, we put that row back in our bag. This means that when we're picking our `100` random rows, we could pick the same row more than once. In fact, it's very unlikely, but all `100` randomly picked rows could be the same row! Because we're picking these rows with replacement, there's no need to shrink our bagged training set from `1000` rows to `100`. We can pick `1000` rows at random, and because we can get the same row more than once, we'll still end up with a unique data set.
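A minimal sketch of the sampling step described above, using NumPy. The array of row ids and the seed are made up for illustration; they stand in for the rows of a real training set.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is repeatable

n_rows = 1000                        # size of the hypothetical training set
row_ids = np.arange(n_rows)          # stand-in for the rows of the training set

# Draw 1000 rows *with replacement*: some rows repeat, others are left out.
bootstrap_ids = rng.choice(row_ids, size=n_rows, replace=True)

n_unique = np.unique(bootstrap_ids).size
print(f"rows drawn: {bootstrap_ids.size}, distinct rows: {n_unique}")
```

On average, a bootstrapped sample of the same size as the original contains roughly 63% of the distinct original rows, which is why each sample ends up different from the others.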

We've loaded a dataset about cars here. An important field within the dataset is the safety rating, which tells us how crash/rollover resistant a car is, in other words, how safe the car is. The `safety` variable can be either “low”, “med”, or “high.” We're going to implement bootstrapping and estimate the average safety rating across the different bootstrapped samples. 

Here is a list of the variables in the car evaluation dataset.

Variable | Description
--- | ---
safety | estimated safety of the car (low, med, or high)
buying | buying price
maint | price of the maintenance
doors | number of doors
persons | capacity in terms of persons to carry
lug_boot | the size of luggage boot
accep | evaluation level (unacceptable, acceptable, good, very good)


Bootstrapping

Random forests create different trees using a process known as bagging, which is short for bootstrap aggregating. As we already covered bootstrapping, the process starts with creating a single decision tree on a bootstrapped sample of data points in the training set. Then, after many trees have been made, the results are "aggregated" together. In the case of a classification task, the aggregation is often the majority vote of the individual classifiers. For regression tasks, the aggregation is often the average of the individual regressors.
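The bootstrap-then-aggregate process can be sketched by hand as follows. This is only an illustration under stated assumptions: the data is synthetic (generated with `make_classification`, not the cars dataset), and the number of trees, the depth, and all variable names are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data standing in for the lesson's dataset (an assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25          # odd, so a 0/1 majority vote can never tie
all_preds = []

for _ in range(n_trees):
    # Bootstrap: sample row positions with replacement, same size as the training set.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(max_depth=5, random_state=0)
    tree.fit(X_train[idx], y_train[idx])
    all_preds.append(tree.predict(X_test))

# Aggregate: with 0/1 labels, the majority vote is "more than half the trees said 1".
votes = np.array(all_preds)                     # shape: (n_trees, n_test_rows)
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)

bagged_accuracy = (bagged_pred == y_test).mean()
print(f"bagged accuracy: {bagged_accuracy:.3f}")
```

For a regression task, the last step would take `votes.mean(axis=0)` directly instead of thresholding it.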

We will dive into this process for the cars dataset we used in the previous exercise. The dataset has six features:

- `buying`: car price as a categorical variable: “vhigh”, “high”, “med”, or “low”
- `maint`: cost of maintaining the car; can be “vhigh”, “high”, “med”, or “low”.
- `doors`: number of doors; can be “2”, “3”, “4”, “5more”.
- `persons`: number of people the car can hold; can be “2”, “4”, or “more”.
- `lugboot`: size of the trunk; can be “small”, “med”, or “big”.
- `safety`: safety rating of the car; can be “low”, “med”, or “high”

We've already loaded the dataset and done the train-test split. Our target variable for prediction is an acceptability rating, `accep`, that's either `True` or `False`.

Bagging

In addition to using bootstrapped samples of our dataset, we can continue to add variety to the ways our trees are created by randomly selecting the features that are used.

Recall that for our car data set, the original features were the following:

* The price of the car which can be "vhigh", "high", "med", or "low".
* The cost of maintaining the car which can be "vhigh", "high", "med", or "low".
* The number of doors which can be "2", "3", "4", "5more".
* The number of people the car can hold which can be "2", "4", or "more".
* The size of the trunk which can be "small", "med", or "big".
* The safety rating of the car which can be "low", "med", or "high"

Our target variable for prediction is an acceptability rating, `accep`, that’s either `True` or `False`. For our final feature sets, `x_train` and `x_test`, the categorical features have been dummy encoded, giving us 15 features in total.

When we use a decision tree, all the features are used and the split is chosen as the one that increases the information gain the most. While it may seem counter-intuitive, selecting a random subset of features can help in the performance of an ensemble model. In the following example, we will use a random selection of features prior to model building to add additional variance to the individual trees.  While an individual tree may perform worse, sometimes the increases in variance can help model performance of the ensemble model as a whole.
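A small sketch of fitting one tree on a random subset of features. The data is synthetic with 15 columns to mirror the 15 dummy-encoded features, and the choice of keeping 10 of them is arbitrary; both are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 15 features, mirroring the 15 dummy-encoded columns (an assumption).
X, y = make_classification(n_samples=300, n_features=15, random_state=0)

rng = np.random.default_rng(0)
n_keep = 10  # arbitrary number of features to keep for this tree

# Pick distinct column positions at random (features are drawn without replacement).
feature_idx = rng.choice(X.shape[1], size=n_keep, replace=False)

tree = DecisionTreeClassifier(max_depth=5, random_state=0)
tree.fit(X[:, feature_idx], y)

# At prediction time the same columns must be selected.
preds = tree.predict(X[:, feature_idx])
```

Keeping `feature_idx` alongside each tree matters: a tree trained on one set of columns cannot score data arranged in a different set.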

Random Feature Selection

The two steps we walked through above created trees on bootstrapped samples and on randomly selected features. These can be combined and implemented at the same time! Combining them adds additional variation to the base learners of the ensemble model. This in turn increases the ability of the model to generalize to new and unseen data: the individual trees become less correlated, so aggregating them reduces the variance of the ensemble's predictions. Rather than redoing this process manually, we will use `scikit-learn`'s bagging implementation, `BaggingClassifier()`, to do so.
 
Much like other models we have used in `scikit-learn`, we instantiate an instance of `BaggingClassifier()` and specify the parameters. The first parameter, `base_estimator` (renamed `estimator` in scikit-learn 1.2 and later), refers to the machine learning model _that is being bagged_. In the case of random forests, the *base estimator* would be a *decision tree*. We are going to use a decision tree classifier with a `max_depth` of 5; this will be instantiated with `BaggingClassifier(DecisionTreeClassifier(max_depth=5))`.

After the model has been defined, methods `.fit(), .predict(), .score()` can be used as expected. Additional hyperparameters specific to bagging include the number of estimators (`n_estimators`) we want to use and the maximum number of features we'd like to keep (`max_features`).
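The following sketch shows how these pieces might fit together. It assumes synthetic data with 15 features in place of the encoded cars data, and the values chosen for `n_estimators` and `max_features` are illustrative rather than recommended.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded cars data (an assumption).
X, y = make_classification(n_samples=500, n_features=15, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The base model is passed positionally, so the sketch does not depend on
# whether the keyword is called `base_estimator` or `estimator`.
bag = BaggingClassifier(
    DecisionTreeClassifier(max_depth=5),
    n_estimators=10,   # number of bagged trees
    max_features=10,   # each tree sees 10 randomly chosen features
    random_state=0,
)
bag.fit(x_train, y_train)
bag_accuracy = bag.score(x_test, y_test)
print(f"test accuracy: {bag_accuracy:.3f}")
```

After fitting, `bag.estimators_` holds the individual trees and `bag.estimators_features_` records which columns each tree was given.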
 
_Note_: While we have focused on decision tree classifiers (as this is the base learner for a random forest classifier), this procedure of bagging is not specific to decision trees, and in fact can be used for any base classifier or regression model. The `scikit-learn` implementation is generalizable and can be used for other base models!

Bagging in `scikit-learn`

Now that we have covered two major ways to combine trees, both in terms of samples and features, we are ready to get to the implementation of random forests! This will be similar to what we covered in the previous exercises, but the random forest algorithm has a slightly different way of randomly choosing features.  Rather than choosing a single random set at the onset, each split chooses a different random set.   

For example, when finding which feature to split the data on the first time, we might randomly choose to only consider the price of the car, the number of doors, and the safety rating. After splitting the data on the best feature from that subset, we’ll likely want to split again. For this next split, we’ll randomly select three features again to consider. This time those features might be the cost of maintenance, the number of doors, and the size of the trunk. We’ll continue this process until the tree is complete.

One question to consider is how to choose the number of features to randomly select. Why did we choose 3 in this example? A good rule of thumb is to select as many features as _the square root of the total number of features_. Our car dataset doesn’t have a lot of features, so in this example, it’s difficult to follow this rule. But if we had a dataset with 25 features, we’d want to randomly select 5 features to consider at every split point.

You now have the ability to make a random forest using your own decision trees. However, `scikit-learn` has a `RandomForestClassifier()` class that will do all of this work for you! `RandomForestClassifier` is in the `sklearn.ensemble` module.

`RandomForestClassifier()` works almost identically to `DecisionTreeClassifier()` — the `.fit()`, `.predict()`, and `.score()` methods work in the exact same way.
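As a hedged illustration, the sketch below fits a `RandomForestClassifier` on synthetic data with 25 features (an assumption chosen to match the square-root example, not the cars dataset) and sets `max_features="sqrt"` so that 5 features are considered at each split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data with 25 features (an assumption for illustration).
X, y = make_classification(n_samples=500, n_features=25, n_informative=8, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(
    n_estimators=100,
    max_features="sqrt",  # consider sqrt(25) = 5 features at each split
    random_state=0,
)
rf.fit(x_train, y_train)
y_pred = rf.predict(x_test)
rf_accuracy = rf.score(x_test, y_test)
print(f"test accuracy: {rf_accuracy:.3f}")
```

Unlike the bagging example, the feature subset here is redrawn at every split inside every tree rather than fixed once per tree.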

Train and Predict using `scikit-learn`

Just like in decision trees, we can use random forests for regression as well! It is important to know when to use regression or classification; this usually comes down to what type of variable your target is. Previously, we were using a binary categorical variable (acceptable versus not), so a classification model was used.

We will now consider a hypothetical new target variable, price, for this data set, which is a continuous variable. We've generated some fake prices in the dataset so that we have numerical values instead of the previous categorical variables. (Please note that these are not reflective of the previous categories of high and low prices - we just wanted some numeric values so we can perform regression! :) ) 

Now, instead of a classification task, we will use `scikit-learn`'s `RandomForestRegressor()` to carry out a regression task. 

_Note_: Recall that the [default evaluation score](https://scikit-learn.org/stable/modules/model_evaluation.html) for regressors in `scikit-learn` is the [R-squared score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html#sklearn.metrics.r2_score).
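A short sketch of the regression case, assuming synthetic data from `make_regression` in place of the generated car prices. It also checks that `.score()` on a regressor matches `r2_score`.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic continuous target standing in for the fake prices (an assumption).
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

rfr = RandomForestRegressor(n_estimators=100, random_state=0)
rfr.fit(x_train, y_train)

# .score() on a regressor returns R-squared, the same value r2_score computes.
r2_from_score = rfr.score(x_test, y_test)
r2_from_metric = r2_score(y_test, rfr.predict(x_test))
print(f"R-squared: {r2_from_score:.3f}")
```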



Random Forest Regressor

Nice work! Here are some of the major takeaways about random forests:

- A random forest is an ensemble machine learning model. It makes a classification by aggregating the classifications of many decision trees.
- Random forests are used to avoid overfitting. By aggregating the classification of multiple trees, having overfitted trees in a random forest is less impactful.
- Every decision tree in a random forest is created by using a different subset of data points from the training set. Those data points are chosen at random with replacement, which means a single data point can be chosen more than once. This process is known as bagging.
- When creating a tree in a random forest, a randomly selected subset of features is considered as candidates for the best splitting feature. If your dataset has n features, it is common practice to randomly select the square root of n features.

Review

Random Forests

Learn about bagging, random forests and how to implement them using `scikit-learn`!

### Introduction

Why use one machine learning model to solve a problem when you can use many at the same time? **Ensemble learning** is the process of building a complex machine learning model by combining multiple base estimators as building blocks. The main goals for employing ensemble methods are to end up with a resultant machine learning model that is more performant and more robust than its components. 

### Base estimators

In ensembling, often the base models that are chosen tend to underachieve on their own and are typically referred to as **weak learners**. The reason this is preferable is to have the model implementation be computationally efficient. Combining strong learners doesn't necessarily make the resultant ensembled model more performant so one might as well choose learners that cost less computationally.

What makes a model a "weak learner"? There are many ways in which a base model can be considered weak. For example, it can have high bias or high variance. The nature of the weakness of the base model is typically taken into consideration and is a design choice when determining the best ensembling method to use.

The basic components of ensemble learning are:
- Base models that are weak learners
- An ensembling method that combines the base models to improve performance and robustness

It is possible to have an ensemble model that performs worse than any one of its contributing base estimators. To circumvent this outcome it is important that the base estimators are uncorrelated and independent. Having a higher diversity among the trained base estimators leads to a stronger ensemble model. 

Three common ensembling methods in machine learning that will be covered briefly in this article and later on in separate modules are Bagging, Boosting, and Stacking.

![Image](https://static-assets.codecademy.com/Paths/machine-learning-engineer-career-path/ensemble_methods/1866.jpg)

### Bagging
As the name suggests, **B**ootstrap **AGG**regat**ING** (also known as **Bagging**) is an ensemble method that combines the concepts of bootstrapping and aggregation. Bagging can be used for both classification and regression problems. Bagging methods use weak learners as base models that are complex and tend to suffer from high variance. Their weakness as models is due to the fact that they are built with only a subset of the available features and on a subset of the training data due to bootstrapping.

If the full dataset is represented by the larger cookie in the figure above, each of the candies atop the cookie represents an individual training data instance. The smaller cookies in the Bagging panel represent a **bootstrapped** sample of the full training dataset. Bootstrapping refers to the method of sampling data with replacement.

Bagging is a learning technique that is done in parallel. Each of the base models is trained independently of the others. Additionally, each base model is trained using only a subset of the original features. Even though the base models tend to be complex, they overfit to both a subset of the available training data and a subset of the available features. This allows them to be diverse from one another, often leading to a very strong ensemble model when aggregated. In the Bagging panel of the figure, the base models are decision trees that are relatively large and overfit to the bootstrapped subset of data provided to each of them.

Once each of the base models is trained, the method for ensembling tends to be a simple aggregation technique over each of the models; a majority vote for classification problems and averaging for regression problems.

A common implementation of a bagging algorithm that uses decision trees as its base model is the Random Forest. You can learn more about Bagging and Random Forests in the [Random Forests module](https://www.codecademy.com/content-items/8673d2edb2f34e03a2a976adf30d0805/exercises/basics-of-a-random-forest).

### Boosting
**Boosting** is an ensemble learning technique where the weak learners are too simple and tend to suffer from high bias. In the Boosting panel of the figure above, the base models are decision trees with only one level, a decision stump. Decision stumps can only make a decision based off of one feature at a time, causing them to underfit the data substantially.

Boosting is a sequential learning technique where each of the base models builds off the previous model. Each subsequent model aims to improve the performance of the final ensembled model by attempting to fix the errors in the previous stage. 

In the Boosting panel of the figure, you may notice that some of the candies atop the cookies are larger than the others. These particular training instances were misclassified by the previous decision stump and are therefore given more weight by the next decision stump. This is one method in which boosting methods may learn from their mistakes.

Two common implementations of the boosting algorithm are Adaptive Boosting and Gradient Boosting. You can learn more about Boosting, including these two implementations, in the [Boosting module](https://www.codecademy.com/content-items/4803fa74f7484f368527f9f4092a1fff/exercises/boosting?_gl=1%2A1ukr3rq%2A_ga%2AMzQxNTg2NDY4MC4xNjU1MTQ5NTg2%2A_ga_3LRZM6TM9L%2AMTY1NTMyNzEyMS40LjEuMTY1NTMyNzEyMS42MA..&bypass_cache=true&draft=true).
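To make the idea of boosting decision stumps concrete, here is a rough sketch using `scikit-learn`'s `AdaBoostClassifier`. The data is synthetic and the number of estimators is an arbitrary choice; it is meant as an illustration of the setup, not a tuned model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data (an assumption for illustration).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A single decision stump: a tree with only one level.
stump = DecisionTreeClassifier(max_depth=1)
stump_accuracy = stump.fit(X_train, y_train).score(X_test, y_test)

# Up to 50 stumps trained sequentially; each round reweights the training
# rows so that previously misclassified rows count for more.
boosted = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=50, random_state=0)
boosted_accuracy = boosted.fit(X_train, y_train).score(X_test, y_test)

print(f"single stump: {stump_accuracy:.3f}, boosted stumps: {boosted_accuracy:.3f}")
```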

### Stacking
**Stacking** is an extremely flexible ensembling technique where a final model is trained to learn how to best combine a set of base models to make strong predictions. In contrast to bagging and boosting, the base models in stacking do not need to be the same type of learning algorithm.

While bagging and boosting are built with base models that are weak learners, that does not necessarily have to be the case for stacking. A stacking algorithm can be used to combine decently performing learners as well.

You can learn more about Stacking in the [Stacking article](https://author.codecademy.com/content-items/30d9f5c1493148d6a0962e79cf6795af).

An introduction to Ensembling Methods in Machine Learning.

Introduction to Ensembling Methods

## Welcome

Ever wondered if combining machine learning models increases their predictive power? This process is called ensembling, and there are many ways of doing it, such as boosting, bagging and stacking.

## What Will We Cover in This Course?

In this course you'll learn about the different ensembling techniques like:

- Bagging decision trees to make Random Forests 
- Boosting algorithms like AdaBoost and GradientBoost
- Stacking ML models

## Prerequisites

This course requires prior knowledge of machine learning basics. Check out our [Machine Learning Fundamentals Skill Path](https://www.codecademy.com/learn/paths/machine-learning-fundamentals) to brush up on the same.

## Learning is Social

Whatever you're working on, be sure to connect with the Codecademy community in the [forums](https://discuss.codecademy.com/). Remember to check in with the community regularly, including for things like asking for code reviews on your project work and providing code reviews to others in the [projects category](https://discuss.codecademy.com/c/project/1833), which can help to reinforce what you've learned.


Learn how to combine machine learning models

Welcome to Ensemble Methods in Machine Learning

A random forest is not related to a decision tree model.

A random forest is an ensemble model that makes a classification by aggregating the results of multiple decision trees.

A random forest is a single decision tree that has been trained on a bootstrapped sample of data.

A random forest makes a classification by randomly picking one tree from a group of decision trees to make the classification.

A random forest classifier uses the most common classification from its decision trees as the final classification.

A random forest classifier uses the least common classification from its decision trees as the final classification.

A random forest classifier uses the average classification from its decision trees as the final classification.

A random forest classifier uses a random classification from its decision trees as the final classification.

A random forest regressor uses the most common result from its decision trees as the final prediction.

A random forest regressor uses the least common result from its decision trees as the final prediction.

A random forest regressor uses the average prediction from its decision trees as the final predictions.

A random forest regressor uses a random result from its decision trees as the final predictions.

Bootstrapped samples are obtained by taking samples with replacement. Each decision tree is built on a bootstrapped sample.

Bootstrapped samples are obtained by taking samples without replacement. Each decision tree is built on a bootstrapped sample.

Bootstrapped samples are obtained by taking samples with replacement. Each decision tree is built on the full dataset.

Bootstrapped samples are obtained by taking samples without replacement. Each decision tree is built on the full dataset.

The number of trees in the forest, and the default value is `100`.

The number of features to consider when implementing feature bagging.

The number of rows to randomly select when implementing bagging.

The maximum depth of the trees in the forest.

The `.fit()` method creates the model according to the training data and training labels.

The `.fit()` method returns the accuracy of the model according to the testing data and testing labels.

The `.fit()` method creates a `RandomForestClassifier` object.

The `.fit()` method creates a single `DecisionTreeClassifier`.

Random Forests Quiz

Use random forests to predict the income of a person based on census data.

We will build a random forest classifier to predict the income category.  First, take a look at the distribution of income values -- what percentage of samples have incomes less than 50k and greater than 50k?

There’s a small problem with our data that is a little hard to catch — every string has an extra space at the start. For example, the first row’s native-country is " United-States", but we want it to be "United-States". One way to fix this is to select all columns of type `object` and use the string method `.str.strip()`. 
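One possible way to apply this fix is sketched below on a tiny made-up frame. The column values are invented for illustration and the real census data will have more columns; the string columns are created with an explicit `object` dtype to match the situation described.

```python
import pandas as pd

# Tiny made-up frame imitating the leading-space problem.
df = pd.DataFrame({
    "native-country": pd.Series([" United-States", " Cuba"], dtype="object"),
    "education": pd.Series([" Bachelors", " HS-grad"], dtype="object"),
    "age": [39, 50],
})

# Select the string (object) columns and strip whitespace from each one.
object_cols = df.select_dtypes(include="object").columns
for col in object_cols:
    df[col] = df[col].str.strip()

print(df)
```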

Create a features dataframe `X`.  This should include only features in the list `feature_cols` and convert categorical features to dummy variables using `pd.get_dummies()`.  Include the parameter `drop_first=True` to eliminate redundant features.

Create the output variable `y`, which is binary.  It should be 0 when income is less than 50k and 1 when it is greater than 50k.

Split the data into a train and test set with a test size of `20%`.

Instantiate an instance of a `RandomForestClassifier()` (with default parameters).  Fit the model on the train data and print the score (accuracy) on the test data.  This will act as a baseline to compare other model performances.

We will explore tuning the random forest classifier model by testing the performance over a range of `max_depth` values.  Fit a random forest classifier for `max_depth` values from 1-25. Save the accuracy score for the train and test sets in the lists `accuracy_train, accuracy_test`.

Find the largest test accuracy and the `max_depth` at which it occurs.
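The tuning loop and the search for the best depth could look roughly like this. The sketch uses synthetic data instead of the census data, and `n_estimators=20` is a deliberately small value chosen only so the loop runs quickly; both are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the census features and labels (an assumption).
X, y = make_classification(n_samples=400, n_features=12, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

depths = range(1, 26)          # max_depth values 1 through 25
accuracy_train, accuracy_test = [], []

for depth in depths:
    rf = RandomForestClassifier(n_estimators=20, max_depth=depth, random_state=0)
    rf.fit(x_train, y_train)
    accuracy_train.append(rf.score(x_train, y_train))
    accuracy_test.append(rf.score(x_test, y_test))

# Depth with the highest test accuracy (the first one if there are ties).
best_index = int(np.argmax(accuracy_test))
best_depth = depths[best_index]
best_accuracy = accuracy_test[best_index]
print(f"best max_depth: {best_depth}, test accuracy: {best_accuracy:.3f}")
```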

Plot the training and test accuracy of the models versus the `max_depth`.

Refit the random forest model using the `max_depth` from above; save the feature importances in a dataframe.  Sort the results and print the top five features.
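A sketch of collecting and sorting feature importances, assuming synthetic data with made-up column names and an arbitrary `max_depth` of 5 in place of the tuned value.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic features with made-up names (an assumption for illustration).
X_arr, y = make_classification(n_samples=300, n_features=8, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(8)])

rf = RandomForestClassifier(max_depth=5, random_state=0).fit(X, y)

# Pair each column name with its importance, then sort from most to least important.
importances = pd.DataFrame({
    "feature": X.columns,
    "importance": rf.feature_importances_,
}).sort_values("importance", ascending=False)

top_five = importances.head(5)
print(top_five)
```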

Looking at the education feature, there are 16 unique values -- from preschool to professional school.  Rather than adding dummy variables for each value, it makes sense to bin some of these values together.  While there are many ways to do this, we will take the approach of combining the values into 3 groups: `High school and less`, `College to Bachelors` and `Masters and more`.  Create a new column in `df` for this new feature called `education_bin`.

Like we did previously, we will now add this new feature into our feature list and recreate `X`.

As we did before, we will tune the random forest classifier model by testing the performance over a range of `max_depth` values.  Fit a random forest classifier for `max_depth` values from 1-25. Save the accuracy score for the train and test sets in the lists `accuracy_train, accuracy_test`.

Find the largest test accuracy and the `max_depth` at which it occurs. Compare the results with those of the previously tuned model.

Plot the training and test accuracy of the models versus the `max_depth`. Compare the results with those of the previously tuned model.

Refit the random forest model using the `max_depth` from above; save the feature importances in a dataframe.  Sort the results and print the top five features. Compare the results with those of the previously tuned model.

Nice work! Note that the accuracy of our final model increased and one of our added features is now in the top 5 based on importance!

There are a few different ways to extend this project:
* Are there other features that may lead to an even better performance?  Consider creating new ones or adding additional features not part of the original feature list.
* Consider tuning hyperparameters based on a different evaluation metric -- because our classes are fairly imbalanced, AUC or F1 may lead to a different result.
* Tune more parameters of the model. You can find a description of all the parameters you can tune in the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html" target="_blank">Random Forest Classifier documentation</a>. For example, see what happens if you tune `max_features` or `n_estimators`. 

Random Forests Project

## Introduction

We’ve seen how ensembling methods such as bagging or boosting can be used to manage bias and variance of a machine learning model and improve performance. In this article, we’re going to learn about another ensemble method: Stacking! Stacking fundamentally operates differently from bagging and boosting in two ways:

* The base estimators used in the ensemble need not be weak learners. This is in contrast to boosting where we use a sequence of weak learners that improve performance each iteration.
* Unlike bagging, there are no subsampling processes used. The stacking model effectively uses a full training set.

Stacking is a proven technique for improving prediction accuracy of standalone models. How does this work? Let’s take a look!


## How does stacking work?
Consider the scenario in which we are handling a classification problem where we are exploring different models. When using a decision tree we find that our model has poor generalization performance as a result of overfitting. Additionally, when using a logistic regression classification model, we discover our model parameters are hard to tune due to highly-correlated features. We might discover that while each model has a unique advantage over the others, each may also have a distinct drawback such as poor generalization accuracy or the inability to predict a specific class. As it turns out, there’s no rule stating we can’t take the best of both worlds!

Stacking can be thought of as a democracy of machine learning models, where different models are trained and subsequently cast their vote through their predictions. A majority-rules approach can be used for determining the final model prediction if we weighted each estimator equally. In practice, each base estimator may need to be weighted differently, so we have a later-stage model to learn how to appropriately weigh the predictions of all the prior base estimators.

Let’s talk about creating a stacking model and how to set up our training data for it.

#### Training Base Estimators
We can select from combinations of different base estimators, such as a logistic regression model in combination with a decision tree. We could additionally select models of the same learning algorithms, but with different parameters, such as multiple decision trees with varying depths. The number of estimators is arbitrary, so it’s good practice to explore how different combinations behave.

This introduces a problem, however. The estimators would be making predictions on data used in training. This puts our model at risk of overfitting. To avoid this, we use K-Fold Cross Validation as described next.

![image](https://static-assets.codecademy.com/Paths/machine-learning-engineer-career-path/ensemble_methods/stacking/stacking_cv.jpg)

#### K-Fold Cross Validation 

Consider 10 segments (or folds) and a stacking model that uses a logistic regression model and a decision tree model. Each estimator can be trained using data from 9 of the segments, and make predictions on the excluded 10th segment. We then append the predictions as new features to that 10th segment. Now 1/10th of the training data has two new features: one is the prediction made by the logistic regression model and the other is the prediction made by the decision tree model.                 

We want to do the same with the other 9 segments, so we rotate the excluded segment and repeat this process until all training data points are augmented with new features. The end result is a prediction made on each training sample, without having seen the sample during the training process.
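The fold rotation described above can be sketched with `scikit-learn`'s `cross_val_predict`, which returns, for every row, a prediction from a model that did not see that row during fitting. The data, column names, and estimator settings below are assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Synthetic training data with made-up column names (an assumption).
X_arr, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train = pd.DataFrame(X_arr, columns=[f"x{i}" for i in range(5)])

base_estimators = {
    "logreg": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}

X_augmented = X_train.copy()
for name, estimator in base_estimators.items():
    # Each row's prediction comes from a model fit on the other 9 folds.
    X_augmented[f"{name}_prediction"] = cross_val_predict(estimator, X_train, y, cv=10)

print(X_augmented.head())
```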

##### Feature Augmentation
In our stacking setup, the base estimators need to be trained to make predictions on our training data. The prediction of each estimator will be appended to the corresponding data sample as a new feature. We thus augment the training data set with this additional information. The augmented training set is used by our later-stage stacking model to make the final prediction. 

![image](https://static-assets.codecademy.com/Paths/machine-learning-engineer-career-path/ensemble_methods/stacking/stacking.gif)

So, in summary, say our training dataset has 10,000 samples, 10 features, and we select 3 base estimators. We would train each base estimator on the training set and make predictions on the training set. Each estimator would make a prediction on each sample, therefore each sample will have 3 predictions. These 3 predictions are appended to the pre-existing 10 features. This leaves us with 10,000 training samples with 13 features each.  

#### Training the stacking model
With our augmented training set, we’re ready to prepare the final model that will make the official prediction from our stacking setup.

This later-stage model is a learning algorithm that we select just like the base estimators. We may even reuse an algorithm from an earlier stage here. The purpose of this model is to learn the proper weighting of the earlier estimators, given that our training samples now include a prediction from each estimator. Some estimators may perform better than others, so our overall model should account for this.

The only difference in the training process is that the model will be designed to accept samples of the augmented size rather than the given size. This means if the given data set has n features and m base estimators, this model will require n + m features on each sample.

And that’s it! Once the later-stage model is trained, we feed testing data samples into the base estimators, which will append their predictions to the data sample. This sample is then given to the later-stage model for our final prediction.
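As one hedged illustration of the whole procedure, `scikit-learn` provides a `StackingClassifier` that handles the cross-validated predictions and the later-stage model internally. The sketch below assumes synthetic data with n = 5 features and m = 2 base estimators; the choice of estimators and settings is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with n = 5 features (an assumption for illustration).
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ],
    final_estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    cv=10,                   # out-of-fold predictions for the training rows
    stack_method="predict",  # one predicted label per base estimator
    passthrough=True,        # keep the original n features alongside the m predictions
)
stack.fit(X_train, y_train)
stack_accuracy = stack.score(X_test, y_test)

# The later-stage model was trained on n + m = 5 + 2 columns.
n_final_features = stack.final_estimator_.n_features_in_
print(f"final model features: {n_final_features}, test accuracy: {stack_accuracy:.3f}")
```

With `passthrough=False` (the default), the later-stage model would see only the m prediction columns, which is a different design from the augmented n + m setup described here.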

### Example 
In the associated code for this article, we’ll be referencing a water quality dataset. The features of this dataset are ones that can be used to determine how safe a supply of water is to drink or its “potability” represented by a 1 for drinkable and a 0 for not drinkable. For simplicity, let’s take a look at a subset of the data with three of its features:


Sample | pH | Conductivity | Turbidity | Potability
--- | --- | --- | --- | ---
1 | 8.316766 | 363.266516 | 4.628771 | 1
2 | 9.092223 | 398.410813 | 4.075075 | 0
3 | 5.584087 | 280.467916 | 2.559708 | 1

For this example, we’ll select a Logistic Regression model and a Decision Tree model as our base estimators. The table below shows how the training dataset expands with new features provided by the predictions of the trained base estimators.

Sample | Logistic Regression prediction | Decision Tree prediction | pH | Conductivity | Turbidity | Potability
--- | --- | --- | --- | --- | --- | ---
1 | 0 | 1 | 8.316766 | 363.266516 | 4.628771 | 1
2 | 0 | 0 | 9.092223 | 398.410813 | 4.075075 | 0
3 | 1 | 1 | 5.584087 | 280.467916 | 2.559708 | 1



## Implementing stacking with `scikit-learn`

```python
import pandas as pd
df = pd.read_csv('water_potability')
print(df.columns, df.shape)
```
Output:  
```
Index(['ph', 'Hardness', 'Solids', 'Chloramines', 'Sulfate', 'Conductivity',
       'Organic_carbon', 'Trihalomethanes', 'Turbidity', 'Potability'],
      dtype='object') (2011, 10)
```

We'll take a train-test split of our data to handle the training process.

```python
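from sklearn.model_selection import train_test_split

# rand_state is an integer seed, assumed to be defined earlier for reproducibility
X = df.drop(['Potability'], axis=1)
y = df['Potability']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=rand_state)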
```

To assemble our ensemble, we'll make a dictionary of base estimators. This will be contained within the ```level_0_estimators``` dict. Also, our final estimator will be a Random Forest as represented with ```level_1_estimator```. Notice also how we prepare to add new features to our training dataset (as columns) in ```level_0_columns```.

```python
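from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

level_0_estimators = dict()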
level_0_estimators["logreg"] = LogisticRegression(random_state=rand_state)
level_0_estimators["forest"] = RandomForestClassifier(random_state=rand_state)

level_0_columns = [f"{name}_prediction" for name in level_0_estimators.keys()]

level_1_estimator = RandomForestClassifier(random_state=rand_state)
```

Handling our k-fold cross-validation is fairly straightforward using ```sklearn.model_selection.StratifiedKFold``` from the ```scikit-learn``` library. The kfold is then given to the instantiated ```StackingClassifier```.

```python
from sklearn.ensemble import StackingClassifier
from sklearn.model_selection import StratifiedKFold

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=rand_state)
stacking_clf = StackingClassifier(estimators=list(level_0_estimators.items()),
                                  final_estimator=level_1_estimator,
                                  passthrough=True, cv=kfold, stack_method="predict_proba")
```

Calling ```fit_transform``` on our classifier handles a lot of the heavy lifting. It trains our ```level_0``` base estimators along with the ```level_1_estimator```, uses cross-validated predictions on the training set to fit the final estimator, and returns the training set augmented with a prediction column from each base estimator. Because we set ```stack_method="predict_proba"```, those new columns hold predicted probabilities rather than hard class labels. Let's see how the resulting training dataset looks:

```python
df = pd.DataFrame(stacking_clf.fit_transform(X_train, y_train), columns=level_0_columns + list(X_train.columns))
```

![water_dataset_dataframe](https://static-assets.codecademy.com/Paths/machine-learning-engineer-career-path/ensemble_methods/stacking/stacking_dataframe.png)

Finally, with our full model trained, we can make predictions. Let's compare how our stacking classifier performs against a lone logistic regression model and a lone random forest model!

```python
from sklearn.metrics import accuracy_score

y_val_pred = stacking_clf.predict(X_test)
stacking_accuracy = accuracy_score(y_test, y_val_pred)

vanilla_logistic_regression = LogisticRegression(random_state=rand_state).fit(X_train, y_train)
lr_accuracy = accuracy_score(y_test, vanilla_logistic_regression.predict(X_test))

# This baseline is a random forest, matching the "forest" base estimator used above
vanilla_random_forest = RandomForestClassifier(random_state=rand_state).fit(X_train, y_train)
rf_accuracy = accuracy_score(y_test, vanilla_random_forest.predict(X_test))

print(f'Stacking accuracy: {stacking_accuracy:.4f}')
print(f'Logistic Regression accuracy: {lr_accuracy:.4f}')
print(f'Random Forest accuracy: {rf_accuracy:.4f}')
```
Output: 
```
Stacking accuracy: 0.6576
Logistic Regression accuracy: 0.5732
Random Forest accuracy: 0.6526
```

## Limitations of stacking

Stacking is very powerful in that we remove the occasionally difficult choice of which learning algorithm to use for our problem. Depending on the use case, this benefit does come with some limitations worth noting:

- Because we have an arbitrary number of learning algorithms in use, training an entire stacking model is computationally expensive. This is also true for deployed inference models.

- Such a large model with many parameters means that a plethora of data is needed for proper training. Small datasets won’t see significant gains with stacking. Stacking models typically yield marginal gains over the best single estimator used for the same problem. When successful, a stacked model may reduce error by 2% or less.

#### Additional Considerations
How many layers can a stacking model have?

Stacking offers some creativity and open-endedness in how we want to build our model. A vanilla stacking model may have one tier of models to contribute to the final prediction. We could alternatively construct a multi-tier approach in which one level of models feeds into a later stage of models before making a final prediction.
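One way to sketch a multi-tier setup in `scikit-learn` is to pass a second `StackingClassifier` as the `final_estimator` of the first. The example below is only an illustration under assumptions: the synthetic data, the particular estimators in each tier, and the names `second_tier` and `multi_tier_clf` are chosen for the example and are not part of the article's code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for this sketch)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Second tier: itself a stacking model that consumes the first tier's predictions
second_tier = StackingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
        ("knn", KNeighborsClassifier()),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=3,
)

# First tier: its predictions become the input features of the second tier
multi_tier_clf = StackingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("forest", RandomForestClassifier(n_estimators=25, random_state=0)),
    ],
    final_estimator=second_tier,
    cv=3,
)

multi_tier_clf.fit(X, y)
first_five_predictions = multi_tier_clf.predict(X[:5])
print(first_five_predictions)
```

Each extra tier multiplies the number of models to train, so the computational cost noted in the limitations above grows quickly.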

As with other machine learning models, we also have many parameters to tune and select from. We can enhance model diversity by using different training algorithms, different training sets, different feature subsets, and different hyperparameters. In the implementation above, we used k-fold cross-validation with k=10 when training the stacking model. In reality, the value of k is a choice we make, and a different number of folds may improve or degrade performance.

## Conclusion
We have covered in depth how stacking can combine the predictive power of various machine learning models. This property distinguishes stacking from other ensemble methods like bagging or boosting, while not being mutually exclusive in application. Further, we discussed how a k-fold cross-validation setup of the training data can help our stacking model avoid overfitting and augment the data to leverage all of its predictive strengths. In a coded example, we saw this in action using the `scikit-learn` library, where we assembled a stacking model with a logistic regression estimator and a random forest estimator. We found that combining these estimators yielded higher prediction accuracy than using either of them independently.

### Practice

<Assessment id="0f0ab3f3dda642eeb0c0c294b3eee41c" />
<Assessment id="a4bbb97b92334f49a0339b5410f660cf" />

Learn about the stacking ensemble learning method. 

Stacking

Learn about the different boosting techniques!

In this module we will cover a powerful [ensemble method](https://www.codecademy.com/content-items/a872071d9d294467bb9441fb4392236b) called Boosting. Boosted ensemble methods use weak learners as base models that are simple and tend to suffer from high bias. The weak learners underfit the data.

Boosting is a sequential learning technique where each of the base models builds off of the previous model. Each subsequent model aims to improve the performance of the final ensemble model by attempting to fix the errors in the previous stage. 

There are two important decisions that need to be made to perform boosted ensembling:
1. Sequential Fitting Method
2. Aggregation Method

Two boosting algorithms that will be covered in detail in this module are **Adaptive Boosting** and **Gradient Boosting**. 

While boosting can be applied to any base machine learning algorithm, we will demonstrate with an extremely popular choice as a base estimator, the decision tree. Recall that [Decision Trees](https://www.codecademy.com/article/mlfun-decision-trees-article) are a commonly used and powerful machine learning algorithm because they are easy to interpret. Additionally, the training data requires very little manipulation (no need for standardization, removal of collinearity, etc.). 

The major limitation to decision trees is that they tend to suffer from high variance and are therefore prone to overfitting. They are good at making a series of decisions which cause them to memorize the training data, so they do not generalize well to unseen data. In the following exercises we will explore how to work past these limitations while using decision trees for boosting.

Boosting

**Adaptive Boosting** (or AdaBoost) is a sequential ensembling method that can be used for both classification and regression. It can use any base machine learning model, though it is most commonly used with decision trees.

For AdaBoost, the **Sequential Fitting Method** is accomplished by updating the weight attached to each of the training dataset observations as we proceed from one base model to the next. The **Aggregation Method** is a weighted sum of those base models where the model weight is dependent on the error of that particular estimator.

The training of an AdaBoost model is the process of determining the training dataset observation weights at each step as well as the final weight for each base model for aggregation. 

In the next exercise we will dive into the details of AdaBoost!

Adaptive Boosting Overview

Let's take a deeper look at how AdaBoost works! AdaBoost can be used for both regression and classification, but in this example we will be solving a classification problem. We begin with the full Training Dataset. You will see that it consists of green circles and red triangles. The goal of our AdaBoost classifier will be to form a decision boundary that separates these two classes. Initially, the training data instances are all given the same weight. This is indicated by the size of shapes all being the same.

Our first step is to fit an estimator, the 1st Base Model. While boosting can be applied to any base machine learning model, we will use decision trees. But aren't decision trees prone to overfitting? We already said that the base models for boosting are supposed to be very simple and tend to underfit. That is correct, and for this reason we use the simplest version of a decision tree, known as a decision stump. A decision stump only makes a single decision, so the resultant estimator only has two leaf nodes.  

Taking a look at the Result of the 1st Base Model, we see that the decision boundary, that is, the border between the lighter green and lighter red regions, does a decent job of separating the green circles from the red triangles. However, we do notice that there are two red triangles in the light green region. This indicates that they have been classified incorrectly by the decision stump.

Each of the base models will contribute a different amount to the final ensemble model. The influence that a particular base model contributes is going to be dependent on the number of errors it makes, or for regression, the magnitude of the errors it makes. We do not want a decision stump that does a terrible job of classifying the data to have the same say as a decision stump that does a great job. Once we are able to evaluate the Result of the 1st Base Model, we can Weight the Model and assign it a value, here indicated by `alpha_1`.

To prepare for the next stage of the sequential learning process, we need to Reweight the Data. The instances of the training data that were classified incorrectly by the 1st Base Model, the two red triangles in the middle right, are given a larger weight than the other data instances, indicated by their larger size. By assigning those misclassified points a larger weight, we are asking the 2nd Base Model to give them preferential treatment during the Model Fitting.

Taking a look at the Result of the 2nd Base Model, we see that this is exactly what happens. The two larger red triangles are classified correctly by the 2nd Base Model. Once again we assign the base model a weight, `alpha_2`, based on the errors it makes (the fewer the errors, the larger the weight), and prepare for the next stage of the sequential learning by reweighting the training data. The instances that were incorrectly classified by the 2nd Base Model, the two green circles in the center right, are given a larger weight.

Once we have reached the predefined number of estimators for our AdaBoost model, the base models are ready to aggregate. In this example we have chosen `n_estimators = 3`. The influence of each base model in the final ensemble model will be proportional to the `alpha` it was assigned during the training process.
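The walkthrough above can be sketched from scratch with decision stumps. This is a simplified illustration under stated assumptions, not the implementation used by any library: it uses synthetic data, encodes the two classes as -1 and +1, and computes each model weight with the classic discrete AdaBoost formula `alpha = 0.5 * ln((1 - err) / err)`, which this lesson does not state explicitly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data (an assumption for this sketch)
X, y01 = make_classification(n_samples=300, n_features=4, random_state=0)
y = np.where(y01 == 1, 1, -1)  # labels in {-1, +1} keep the update rule compact

n_estimators = 3
weights = np.full(len(y), 1 / len(y))  # every instance starts with the same weight
stumps, alphas = [], []

for _ in range(n_estimators):
    # Fit a decision stump that respects the current instance weights
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X, y, sample_weight=weights)
    pred = stump.predict(X)

    # Weighted error rate, clipped to avoid division by zero or log(0)
    err = np.clip(np.sum(weights[pred != y]), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # fewer errors -> larger model weight

    # Reweight the data: misclassified instances grow, correct ones shrink
    weights = weights * np.exp(-alpha * y * pred)
    weights = weights / weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Aggregate: alpha-weighted vote of the stumps
scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
ensemble_pred = np.where(scores >= 0, 1, -1)
train_accuracy = np.mean(ensemble_pred == y)
print(alphas, train_accuracy)
```

Each entry of `alphas` plays the role of `alpha_1`, `alpha_2`, and so on in the description above.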

Adaptive Boosting

Wow, that was a lot to take in! Let's take this opportunity to implement AdaBoost on a real dataset and solve a classification problem.

We will be using a dataset from [UCI’s Machine Learning Repository](https://archive.ics.uci.edu/dataset/19/car+evaluation) to evaluate the acceptability of a car based on a set of features that encompasses their price and technical characteristics.
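Before working with the car dataset, here is a rough sketch of what the `scikit-learn` workflow for AdaBoost can look like. It uses synthetic data from `make_classification` as a stand-in, so the features, sample counts, and resulting accuracy are assumptions of the example and say nothing about the car evaluation data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the car evaluation dataset)
X, y = make_classification(n_samples=1000, n_features=6, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# The default base estimator is a decision stump (a tree with max_depth=1)
ada_clf = AdaBoostClassifier(n_estimators=50, random_state=0)
ada_clf.fit(X_train, y_train)

ada_accuracy = accuracy_score(y_test, ada_clf.predict(X_test))
print(f"AdaBoost accuracy: {ada_accuracy:.4f}")
```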



Adaptive Boosting Implementation

**Gradient Boosting** is a sequential ensembling method that can be used for both classification and regression. It can use any base machine learning model, though it is most commonly used with decision trees, known as Gradient Boosted Trees.

For Gradient Boost, the **Sequential Fitting Method** is accomplished by fitting a base model to the negative gradient of the error in the previous stage. The **Aggregation Method** is a weighted sum of those base models where the model weight is constant.

The training of a Gradient Boosted model is the process of determining the base model error at each step and using those to determine how to best formulate the subsequent base model.

In the next exercise we will dive into the details of Gradient Boosting!

Gradient Boosting Overview

Now we will take a deeper look at how Gradient Boosting works. While Gradient Boosting can be applied to any base machine learning model, decision trees are commonly used in practice. In this example we will be focusing on a Gradient Boosted Trees model.

Our first step is to fit an estimator, the 1st Base Model. Recall that the base estimators for boosting algorithms tend to be simple and high bias. In contrast to AdaBoost which leveraged the simplest form of decision trees, the decision stump with only 1 level, gradient boosted trees can and actually do tend to include a few more decision branches. Often gradient boosted trees will have up to 32 leaf nodes, which corresponds to a tree depth of 5 levels. In this example, we are limiting the depth of the base estimators to 2, corresponding to 4 leaf nodes.

Once the 1st Base Model is trained, the residual errors (`h_1`) of the model on the training data are determined. The residual error is the difference between the actual and predicted values for each of the training data instances.
```tex
h_{1} = y_{\text{actual}} - y_{1\text{(predicted)}}
```
The errors will be greater for the training data instances where the model did not do as good of a job with its prediction and will be lower on training data instances where the model fit the data well.

In the next stage of the sequential learning process, we fit the 2nd Base Model. Here is where the interesting part comes in. Instead of fitting the model to the target values `y_actual` as we are typically used to doing in machine learning, we actually fit the model on the errors of the previous stage, in this case `h_1`. The 2nd Base Model is literally learning from the mistakes of the 1st Base Model through those residuals that were calculated.

The results of the 2nd Base Model, which was tasked with fitting the errors of the 1st Base Model, are multiplied by a constant learning rate, `alpha`, and added to the results of the 1st Base Model to give us a set of updated predictions, `y_2(predicted)`.

The residual errors of the 2nd stage are calculated using the updated predictions to get,
```tex
h_{2} = y_{\text{actual}} - y_{2\text{(predicted)}}
```

The subsequent stages repeat the same steps. At stage `N`, the base model is fit on the errors calculated at the previous stage `h_(N-1)`. The new model that is fit is multiplied by the constant learning rate `alpha` and added to the predictions of the previous stage.

Once we have reached the predefined number of estimators for our Gradient Boosting model or the residual errors are not changing between iterations, the model will stop training and we end up with the resultant ensemble model.
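The stages described above can be sketched from scratch for a regression problem, where the residuals are easiest to see. This is an illustrative sketch under assumptions: the noisy sine data, the values of `alpha` and `n_estimators`, and the variable names are invented for the example. Library implementations differ in their details, for example in how the initial prediction is made.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data: a noisy sine curve (an assumption for this sketch)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y_actual = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

alpha = 0.1        # constant learning rate
n_estimators = 50  # predefined number of base models

# 1st Base Model is fit directly on the target values
first_model = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y_actual)
y_predicted = first_model.predict(X)
models = [first_model]
mse_history = [np.mean((y_actual - y_predicted) ** 2)]

for _ in range(n_estimators - 1):
    # Residual errors of the current ensemble: h = y_actual - y_predicted
    h = y_actual - y_predicted

    # The next base model is fit on the residuals, not on the targets
    model = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, h)

    # Its output is scaled by alpha and added to the running predictions
    y_predicted = y_predicted + alpha * model.predict(X)

    models.append(model)
    mse_history.append(np.mean((y_actual - y_predicted) ** 2))

print(f"MSE after 1 model: {mse_history[0]:.4f}, after {n_estimators}: {mse_history[-1]:.4f}")
```

Tracking `mse_history` shows the training error shrinking as each new tree corrects part of what the previous stages got wrong.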

Gradient Boosting

Now that we have taken a look at what is going on under the hood, we are ready to implement Gradient Boosting on a real dataset and solve a classification problem.

We will be using a dataset from [UCI’s Machine Learning Repository](https://archive.ics.uci.edu/dataset/19/car+evaluation) to evaluate the acceptability of a car based on a set of features that encompasses their price and technical characteristics.
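As a rough preview of the `scikit-learn` workflow, the sketch below fits a `GradientBoostingClassifier` on synthetic data. The dataset is a stand-in rather than the car evaluation data, and the hyperparameter values are assumptions chosen to mirror the walkthrough (base trees of depth 2 and a constant learning rate).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the car evaluation dataset)
X, y = make_classification(n_samples=1000, n_features=6, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# max_depth=2 limits each base tree to at most 4 leaf nodes
gb_clf = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=2, random_state=0
)
gb_clf.fit(X_train, y_train)

gb_accuracy = accuracy_score(y_test, gb_clf.predict(X_test))
print(f"Gradient Boosting accuracy: {gb_accuracy:.4f}")
```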

Gradient Boosting Implementation

Boosting Machine Learning Models

Congratulations, you completed the Ensemble Methods in Machine Learning (ML) course! You learned how to combine ML models to make them more powerful.

Having completed this unit, you are now able to:

- Leverage ensembling techniques like bagging, boosting and stacking, and explain how they differ from each other
- Build and tune Random Forests using `scikit-learn`
- Implement boosting techniques like adaptive and gradient boosting with `scikit-learn`
- Stack machine learning models to optimize performance

Your learning journey into Machine Learning isn’t over yet! There are plenty of other topics that you can dive into to continue learning.

## Here are our recommendations for the next steps:

### Build Machine Learning Pipelines
The [Build Machine Learning Pipelines](https://www.codecademy.com/learn/build-ml-pipelines-course) course is an excellent next step. This will set the foundation for combining machine learning and software engineering principles to build reproducible and efficient ML pipelines. 
 

### Introduction to Big Data with PySpark
The [Big Data with PySpark](https://www.codecademy.com/learn/big-data-pyspark) course is an excellent next step. This will set the foundation for scaling machine learning pipelines using PySpark, a key part of productionalizing machine learning models in industry. 
 
Once again, congratulations on finishing the Ensemble Methods in Machine Learning course! We are excited to see what you accomplish next. Happy coding!

You’ve completed the Ensemble Methods in Machine Learning! What’s next?

Next Steps

### Why Ensemble Methods in Machine Learning?
Models are great on their own but you can make them better by combining them together! Ensemble methods are techniques in machine learning that help you do this. 

### Take-Away Skills:
Learn how to bag models to build random forests, boost models using adaptive and gradient boosting, and stack models for improved performance!

Learn about ensembling methods in machine learning like bagging, boosting and stacking!

Introduction to Ensemble Methods in Machine Learning

Learning about how to boost machine learning models!

Learning about boosting machine learning models!

Boosting & Stacking Machine Learning Models

Explore bagging, boosting, stacking, and more in this introduction to ensemble methods in machine learning.

Ensemble Methods in Machine Learning
