Learn about how to build machine learning pipelines using `scikit-learn`!

In this lesson we're going to learn how to turn a machine learning (ML) workflow into a pipeline using `scikit-learn`. An ML pipeline is a modular sequence of objects that codifies and automates an ML workflow to make it efficient, reproducible and generalizable. While there is no single way to build pipelines, some tools are used almost universally. The most accessible of these is `scikit-learn`'s `Pipeline` object, which allows us to chain together the different steps that go into an ML workflow.

Turning a workflow into a pipeline has many other advantages too. Pipelines provide consistency: the same steps will always be applied in the same order under the same conditions. They are also very concise and can streamline your code. The `Pipeline` object within `scikit-learn` exposes the same consistent methods as the many other estimators and transformers we have already covered in our ML curriculum. It is usually the starting point for a Machine Learning Engineer before turning to more sophisticated tools for scaling pipelines (such as PySpark), and we will delve deeper into it in this lesson.

What can go into a pipeline?  For any of the intermediate steps, it must have both the `.fit`  and `.transform` methods.  This includes preprocessing, imputation, feature selection and dimensionality reduction.  The final step must have the `.fit` method.  Examples of tasks we've seen already that could benefit from a pipeline include:
* scaling data then applying principal component analysis
* filling in missing values then fitting a regression model
* one-hot-encoding categorical variables and scaling numerical variables

In the following exercises, we will walk through various chained functions and how to incorporate these into a `Pipeline`.

From Workflows to Pipelines

To introduce pipelines, let's look at a common set of data cleaning/EDA tasks &mdash; dealing with missing values and scaling numeric variables. We're going to convert an existing code base that performs these tasks to more concise code that uses `scikit-learn`'s `Pipeline` using the following steps.

1. First, to define a pipeline, we pass a list of tuples of the form `(name, transform/estimator)` into a `Pipeline` object.  For example, if we wanted to perform imputation with a `SimpleImputer` first, and scale our numerical variables with a `StandardScaler` next, the code would look as follows:
```py
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
pipeline = Pipeline([("imputer", SimpleImputer()), ("scale", StandardScaler())])
```
2. Once a `Pipeline` object has been instantiated, the methods `.fit` and `.transform` can be called like we would with any data transformation object in `scikit-learn`. 
```py
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X,y, random_state=0, test_size=0.25)
pipeline.fit(x_train)
pipeline.transform(x_test)
``` 
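As a minimal, self-contained sketch of the two steps above: the toy array `X_demo` and the `*_demo` names are made up for illustration and are not the abalone data used in the exercise.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 8 rows, 2 numeric columns, a few missing values
X_demo = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [3.0, 30.0],
    [np.nan, 40.0],
    [5.0, 50.0],
    [6.0, 60.0],
    [7.0, np.nan],
    [8.0, 80.0],
])

x_train_demo, x_test_demo = train_test_split(X_demo, random_state=0, test_size=0.25)

demo_pipeline = Pipeline([("imputer", SimpleImputer()), ("scale", StandardScaler())])

# Learn the imputation means and the scaling statistics from the training rows only
demo_pipeline.fit(x_train_demo)

# Apply the already-fitted steps, in order, to both splits
train_scaled = demo_pipeline.transform(x_train_demo)
test_scaled = demo_pipeline.transform(x_test_demo)
```

Because the pipeline was fit on the training rows, the training output has zero mean per column, while the test rows are simply scaled with those same training statistics.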

If the pipeline includes a machine learning model as well, `.predict` can also be called down the line. Each step in the pipeline will be fit in the order provided. Further parameters can be passed to each step as well. For example, if we want to pass the parameter `with_mean=False` to the `StandardScaler`, we'd use:
```py
Pipeline([("imputer",SimpleImputer()), ("scale",StandardScaler(with_mean=False))])
```

In the code editor, we've loaded a [dataset](https://archive.ics.uci.edu/ml/datasets/abalone) from the UCI Machine Learning repository containing information about different attributes of [abalones](https://en.wikipedia.org/wiki/Abalone), typically used to predict the age of the abalone. We've defined the predictor and target variables, identified the numerical and categorical columns, added some missing values to impute, and performed a train-test split.





Data Cleaning (Numeric)

We're now going to implement a task similar to the previous exercise with `pipeline.Pipeline()`, but with categorical variables now. Specifically, we'll be dealing with missing values in categorical data and one-hot-encoding categorical variables. We will convert an existing codebase to a pipeline like in the previous exercise. The two steps in detail are:

1. `SimpleImputer()` will be used again to fill missing values in the pipeline, but this time, the strategy parameter will need to be updated to `most_frequent`.
2. [`OneHotEncoder()`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder) will be used as the second step in the pipeline. By default, `scikit-learn`'s `OneHotEncoder()` returns a sparse array from this transform, so we will use `sparse=False` (the boolean, not the string `'False'`) to return a full array. In `scikit-learn` 1.2 and later, this parameter is named `sparse_output`.
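
The two steps above can be sketched on a tiny example. The `cat_demo` frame and its values are hypothetical. Because the name of the dense-output parameter depends on the installed `scikit-learn` version, this sketch leaves the encoder at its default and converts the result to a full array afterwards, which is only one possible way to handle it.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column with one missing entry
cat_demo = pd.DataFrame({"sex": ["M", "F", np.nan, "I", "M"]})

cat_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),  # fills NaN with the mode, "M"
    ("ohe", OneHotEncoder()),                              # one column per category
])

encoded = cat_pipeline.fit_transform(cat_demo)

# Convert to a full (dense) array regardless of what the encoder returned
encoded_dense = encoded.toarray() if sparse.issparse(encoded) else np.asarray(encoded)
```

The encoder sorts the categories, so the columns here correspond to `F`, `I` and `M`, and the imputed third row is encoded as `M`.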

Data Cleaning (Categorical)

Oftentimes, you may not want to apply every function to all columns. If our columns are of different types, we may only want to apply certain parts of the pipeline to a subset of columns. This is what we saw in the two previous exercises: one set of transformations is applied to the numeric columns and another set to the categorical ones. We can use `scikit-learn`'s `ColumnTransformer` as one way of combining these processes.

`ColumnTransformer` takes in a list of tuples of the form `(name, pipeline, columns)`:
```
example_column_transformer = ColumnTransformer(
    transformers=[ ("name_1", pipeline_1, columns_1),
                   ("name_2", pipeline_2, columns_2)])
``` 


The transformer can be anything with a `.fit` and `.transform` method like we used previously (like `SimpleImputer` or `StandardScaler`), but can also itself be a pipeline, as we will use in the exercise.  
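
A small sketch of this pattern, assuming a made-up frame `toy` with two numeric columns and one categorical column. All names and values here are illustrative, not the exercise's data.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame with missing values in every column
toy = pd.DataFrame({
    "length": [0.5, np.nan, 0.3, 0.7],
    "weight": [1.0, 2.0, np.nan, 4.0],
    "sex": ["M", np.nan, "F", "M"],
})
toy_num_cols = ["length", "weight"]
toy_cat_cols = ["sex"]

num_pipeline = Pipeline([("imputer", SimpleImputer(strategy="mean")),
                         ("scale", StandardScaler())])
cat_pipeline = Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                         ("ohe", OneHotEncoder())])

# Each tuple is (name, transformer or pipeline, columns it applies to)
toy_transformer = ColumnTransformer(transformers=[
    ("num_preprocess", num_pipeline, toy_num_cols),
    ("cat_preprocess", cat_pipeline, toy_cat_cols),
])

toy_out = toy_transformer.fit_transform(toy)
# Make sure we end up with a dense array whichever format was returned
toy_out = toy_out.toarray() if sparse.issparse(toy_out) else np.asarray(toy_out)
```

The output stacks the results side by side in the order the transformers were listed: two scaled numeric columns followed by the one-hot columns.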


Column Transformer

Great!  Now that we have all the preprocessing done and coded succinctly using `ColumnTransformer` and `Pipeline`, we can add a model.  We will take the result at the end of the previous exercise, and now create a final pipeline with the `ColumnTransformer` as the first step, and a `LinearRegression()` model as the second step.

By adding a model as the final step, the last step no longer has a `.transform` method. This is the only step in a pipeline that can be a non-transformer. But now the final step also has a `.predict` method, which can be called on the entire pipeline! Additionally, the `.score()` method, which returns the default evaluation score of any `scikit-learn` model, can also be used to evaluate the performance of the pipeline.
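
Here is a hedged sketch of such a final pipeline on synthetic data. The frame `demo_X`, the target `demo_y` and the step names `preprocess` and `regr` are assumptions for illustration; the exercise uses the abalone data and the preprocessing built earlier.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data: the target depends linearly on "length" plus a little noise
rng = np.random.default_rng(0)
n = 80
demo_X = pd.DataFrame({
    "length": rng.normal(size=n),
    "sex": rng.choice(["M", "F", "I"], size=n),
})
demo_y = 3.0 * demo_X["length"] + rng.normal(scale=0.1, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(demo_X, demo_y, random_state=0, test_size=0.25)

demo_preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["length"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex"]),
])

# Preprocessing first, model last
model_pipeline = Pipeline([("preprocess", demo_preprocess), ("regr", LinearRegression())])

model_pipeline.fit(X_tr, y_tr)
demo_preds = model_pipeline.predict(X_te)      # preprocessing is applied automatically
demo_r2 = model_pipeline.score(X_te, y_te)     # R-squared for a regressor
```

Calling `.predict` or `.score` on the pipeline runs the test data through the fitted preprocessing steps before it reaches the model, so no separate transform call is needed.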

Adding a Model

Great, we have a very condensed bit of code that does all our data cleaning, preprocessing, and modeling in a reusable fashion!  What now?  Well, we can tune some of the parameters of the model by applying a grid search over a range of hyperparameter values.

A linear regression model has very few hyperparameters, and here we'll be using the one that controls whether we include an intercept or not. As we've seen, the pipeline created in the previous exercise is an estimator, and we can call the `.fit()` and `.predict()` methods on it. So in fact, the whole pipeline can be passed as an estimator to `GridSearchCV`. This will then refit the pipeline for each combination of parameter values in the grid and each fold in the cross-validation split.

That's a lot -- but the code is again very short.  One last thing to keep in mind while referencing hyperparameters in a pipeline is the following: any hyperparameter can be called using `pipeline_step_name + '__' + hyperparameter`.  For example, `regr__fit_intercept` corresponds to a pipeline step named "regr" and the hyperparameter "fit_intercept". Let's see how this all works, shall we?

_Note_: If you'd like a refresher on how hyperparameters can be tuned check out our [lesson](https://www.codecademy.com/content-items/b9ddf95b9ffd49cfb3d2c361cec4a7da/exercises/introduction) on the same.
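
A minimal sketch of this idea on synthetic data, assuming a two-step pipeline with steps named `scale` and `regr`. The data and the `*_gs` names are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data with a clearly non-zero intercept (5.0)
rng = np.random.default_rng(0)
X_gs = rng.normal(size=(100, 2))
y_gs = 2.0 * X_gs[:, 0] - 1.0 * X_gs[:, 1] + 5.0 + rng.normal(scale=0.1, size=100)

gs_pipeline = Pipeline([("scale", StandardScaler()), ("regr", LinearRegression())])

# "<step name>__<hyperparameter>" addresses a parameter of one pipeline step
param_grid = {"regr__fit_intercept": [True, False]}

gs = GridSearchCV(gs_pipeline, param_grid, cv=5)
gs.fit(X_gs, y_gs)

best_params = gs.best_params_
```

Since the target has a large offset and the features are centered by the scaler, the model without an intercept cannot fit the data, so the search settles on `fit_intercept=True`.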

Hyperparameter Tuning

Way to go! Now that we are getting the hang of pipelines, we're going to take things up a notch. We will now be searching over different types of models, each having its own set of hyperparameters! In the original `pipeline`, we defined `regr` to be an instance of `LinearRegression()`. Then in defining the parameter grid to search over, we used the dictionary `{"regr__fit_intercept": [True,False]}` to define the values of the `fit_intercept` term. We can equivalently do this by passing both the estimator AND parameters in a single dictionary as
```
{'regr': [LinearRegression()], "regr__fit_intercept": [True,False]}
```
We can add more models to it as follows. Suppose we wanted to add a Ridge regression model and also perform hyperparameter tuning using `GridSearchCV` to find the best regularization parameter `alpha`. We would add it to the previous dictionary to create a list of dictionaries as follows:
```
search_space = [{'regr': [LinearRegression()], 'regr__fit_intercept': [True, False]},
                {'regr': [Ridge()], 'regr__alpha': [0, 0.1, 1, 10, 100]}]
```
_Note_: If you'd like a refresher on regularization using hyperparameter tuning in regression models, check out our [article](https://author.codecademy.com/content-items/1a6ba4626608600468467b86bdc03181) and [lesson](https://author.codecademy.com/content-items/b9ddf95b9ffd49cfb3d2c361cec4a7da) on the same.


The goal of this process is to find the best estimator for our dataset and problem in the most efficient manner possible. The pipeline module allows us to do exactly that! In a couple of lines of code, we're able to preprocess the data and search an entire model _and_ hyperparameter space. The final step is to access the pipeline elements to find out which estimator and hyperparameter set gives us the best score. We do this with the `.named_steps` attribute, using the step names we've used in the dictionary. For instance, the regression model can be accessed using the string `'regr'` as follows:
- Get the best estimator using `GridSearchCV`'s `.best_estimator_` attribute
- Use `.named_steps['regr'].get_params()` on the best estimator to get its hyperparameters!
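
Putting the model search and the lookup together, a sketch on synthetic data might look like the following. The data, the `*_ms` names and the particular `alpha` values are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data
rng = np.random.default_rng(1)
X_ms = rng.normal(size=(120, 3))
y_ms = X_ms @ np.array([1.5, -2.0, 0.5]) + 4.0 + rng.normal(scale=0.2, size=120)

ms_pipeline = Pipeline([("scale", StandardScaler()), ("regr", LinearRegression())])

# One dictionary per model type; the "regr" key swaps out the whole step
search_space = [
    {"regr": [LinearRegression()], "regr__fit_intercept": [True, False]},
    {"regr": [Ridge()], "regr__alpha": [0.1, 1, 10, 100]},
]

ms = GridSearchCV(ms_pipeline, search_space, cv=5)
ms.fit(X_ms, y_ms)

best_pipeline = ms.best_estimator_               # the refitted winning pipeline
best_regr = best_pipeline.named_steps["regr"]    # the model inside it
best_regr_params = best_regr.get_params()        # its hyperparameters as a dict
```

The grid contains two `LinearRegression` candidates and four `Ridge` candidates, and `type(best_regr)` tells us which kind of model won.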


Final Pipeline

While scikit-learn contains many existing transformers and classes that can be used in pipelines, you may need at some point to create your own.  This is simpler than you may think, as a step in the pipeline needs to have only a few methods implemented.  If it is an intermediate step, it will need fit and transform methods. We will implement all of this in the exercise below! 
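
As a rough sketch of what such a class can look like: `Log1pTransformer` is a hypothetical example, not a class from `scikit-learn` or from the exercise. Inheriting from `TransformerMixin` supplies `fit_transform` for free, and `BaseEstimator` supplies `get_params` and `set_params`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class Log1pTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical custom step: applies log(1 + x) to every value."""

    def fit(self, X, y=None):
        # Nothing is learned from the data, but fit must still return self
        return self

    def transform(self, X):
        return np.log1p(np.asarray(X, dtype=float))


# Made-up, heavily skewed data
X_custom = np.array([[0.0, 1.0], [9.0, 99.0], [99.0, 999.0]])

# The custom class slots into a pipeline like any built-in transformer
custom_pipeline = Pipeline([("log", Log1pTransformer()), ("scale", StandardScaler())])
X_custom_out = custom_pipeline.fit_transform(X_custom)
```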

Here are some of the major takeaways on building pipelines in `scikit-learn`:

- Pipelines help make concise, reproducible, code by combining steps of transformers and/or a final estimator.

- Intermediate steps of a pipeline must have both the `.fit()` and `.transform()` methods. This includes preprocessing, imputation, feature selection, dimension reduction. 

- The final step of a pipeline must have the `.fit()` method -- this can include a transformer or an estimator/model.

- If the pipeline is meant to only transform your data by combining preprocessing and data cleaning steps, then each step in the pipeline will be a transformer.  If your pipeline will also include a model (a final estimation or prediction step), then the last step must be an estimator.  

- Once the steps of a pipeline are defined, it can be used like any other transformer/estimator by calling the `.fit()`, `.transform()`, and/or `.predict()` methods. Similarly, it can be used in place of an estimator in a hyperparameter grid search.

Writing Custom Classes & Summary

Machine Learning Pipelines

Learn how to create machine learning workflows and build machine learning pipelines!

Use pipelines to create reusable, succinct code to predict patient survival for the bone marrow transplant dataset.

The dataset has been loaded for you in **script.py** and saved as a dataframe named `df`. Some of the input and output features of interest are:

* donor_age - Age of the donor at the time of hematopoietic stem cells apheresis
* donor_age_below_35 - Is donor age less than 35 (yes, no)
* donor_ABO - ABO blood group of the donor of hematopoietic stem cells (0, A, B, AB)
* donor_CMV - Presence of cytomegalovirus infection in the donor of hematopoietic stem cells prior to transplantation (present, absent)
* recipient_age - Age of the recipient of hematopoietic stem cells at the time of transplantation
* recipient_age_below_10 - Is recipient age below 10 (yes, no)
* recipient_age_int - Age of the recipient discretized to intervals (0,5], (5, 10], (10, 20]
* recipient_gender - Gender of the recipient (female, male)
* recipient_body_mass - Body mass of the recipient of hematopoietic stem cells at the time of the transplantation
* ...
* survival_status - Survival status (0 - alive, 1 - dead)

We will build a classification pipeline to predict the survival status. As part of the pipeline, we will process the numerical and categorical columns separately. First, we will define lists for each of these column types. However, one thing to note is that the imported data is ALL numeric: binary columns, like `donor_age_below_35`, are encoded as numerical values (0 and 1). Similarly, columns like `donor_ABO`, with four categories, are encoded as -1, 0, 1, and 2. So we cannot just look for columns of numeric types or object types to define these lists.

First, calculate the number of unique values for each column -- how might we use this to determine categorical variables?

Save the target, `survival_status`, as `y` and the remaining columns (minus `survival_time`) as `X`.

After exploring the unique value counts in each column, we will use 7 as a threshold: columns with more than 7 unique values will be treated as numeric variables, and columns with 7 or fewer unique values will be treated as categorical variables. Define `num_cols` as a list of columns in `X` with more than 7 unique values and `cat_cols` as a list of columns in `X` with 7 or fewer unique values.

Check to see what, if any, columns in `X` have missing values and print them.
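
One possible way to approach the unique-value and missing-value checks above, sketched on a made-up stand-in for `X`. The column names are borrowed from the feature list, but the values and the `toy_*` names are purely illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for X: one wide-ranging column and two low-cardinality columns
toy_X = pd.DataFrame({
    "donor_age": [22.1, 35.4, 41.0, 28.7, 30.2, 39.9, 25.5, 33.3, 45.8, 27.0],
    "donor_ABO": [-1, 0, 1, 2, 0, 1, -1, 2, 0, np.nan],
    "recipient_gender": [0, 1, 1, 0, 1, 0, 0, 1, 1, 0],
})

# Number of distinct values per column (missing values are not counted)
unique_counts = toy_X.nunique()

# More than 7 distinct values: treat as numeric; 7 or fewer: treat as categorical
toy_num_cols = toy_X.columns[unique_counts > 7].tolist()
toy_cat_cols = toy_X.columns[unique_counts <= 7].tolist()

# Columns that contain at least one missing value
missing_counts = toy_X.isnull().sum()
cols_with_missing = toy_X.columns[missing_counts > 0].tolist()
```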

Split the data into a train and test set with a test size of `20%`.

Create a categorical preprocessing pipeline called `cat_vals` that contains two steps -- the first will fill in missing values using the mode and the second will one-hot-encode the variables.  Make sure to use the parameters `sparse=False, drop='first', handle_unknown='ignore'` for the `OneHotEncoder` (in `scikit-learn` 1.2 and later, `sparse` is named `sparse_output`).

Create a numerical preprocessing pipeline called `num_vals` that contains two steps -- the first will fill in missing values using the mean and the second will scale features using `StandardScaler()`.

Create a column transformer named `preprocess` that will preprocess the numerical and categorical features separately.  This will contain the previous two pipelines, `cat_vals` and `num_vals`, as transformers on their respective columns, `cat_cols` and `num_cols`.

Now that we have finished all the preprocessing, let's create a classification pipeline called `pipeline` -- it will contain three steps.  The first is the `preprocess` created above.  The second is `PCA()` -- this will select a number of principal components from the processed data features.  The third and last step will be the classifier, a logistic regression classifier. 

Fit `pipeline` on the training data and compute the accuracy score on the test data.

Update the search space of hyperparameters (in the dictionary `search_space`) to contain the number of PCA dimensions for each classifier.  Use `np.linspace(30,37,3).astype(int)` for the values of the dimensions.

Search over the hyperparameters in `search_space` for the pipeline using `GridSearchCV`.  Fit on the training set.

Save the best estimator from the grid search as `best_model`.

Print attributes of `best_model` -- the type of classifier it is, the hyperparameters of the classifier, and the number of components selected from PCA.
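
The search-and-inspect steps above can be sketched on synthetic data. Everything here is an assumption for illustration: the data comes from `make_classification`, the step names `pca` and `clf` are hypothetical, and only one classifier is searched, whereas the project's `search_space` may list several.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification data with 40 features, so 30-37 components is valid
X_pca, y_pca = make_classification(n_samples=200, n_features=40,
                                   n_informative=10, random_state=0)

# Three evenly spaced values between 30 and 37, truncated to ints: 30, 33, 37
pca_dims = np.linspace(30, 37, 3).astype(int)

pca_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# The PCA dimension is searched together with the classifier's own hyperparameter
pca_search_space = [{
    "clf": [LogisticRegression(max_iter=1000)],
    "clf__C": [0.1, 1.0, 10.0],
    "pca__n_components": pca_dims,
}]

pca_gs = GridSearchCV(pca_pipeline, pca_search_space, cv=5)
pca_gs.fit(X_pca, y_pca)

best_model = pca_gs.best_estimator_
best_clf = best_model.named_steps["clf"]                         # classifier type
best_clf_params = best_clf.get_params()                          # its hyperparameters
best_n_components = best_model.named_steps["pca"].n_components   # PCA dimension chosen
```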

Evaluate the best estimator on the test set and print the final accuracy score.  How does this compare to the initial model?

Nice work! Note that the accuracy of our final model increased from 79% to 89% -- and we used pipelines to write all the preprocessing, feature selection, and modeling steps in our code!

There are a few different ways to extend this project:
* Are there other classification models that may lead to an even better performance?  Consider adding additional types to the grid search and their respective hyperparameters.
* Tune more parameters of the model. You can find a description of all the parameters you can tune in the scikit-learn documentation -- for example, the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html" target="_blank">Random Forest Classifier documentation</a>. See what happens if you tune `max_features` or `n_estimators`.
* Consider other types of feature selection/creation steps other than PCA -- substitute these in place of PCA in the final pipeline and see how this impacts the model performance.
* And many others -- pipelines help you tune parameters across all steps in your model and change steps easily, so the options are unlimited!

Building ML Pipelines

Congratulations, you’ve completed the Building Machine Learning (ML) Pipelines course! Now you know what machine learning pipelines are and how to build pipelines using `scikit-learn`. More specifically, you learned:

- What an ML workflow is
- How to build an end-to-end ML workflow
- How to build pipelines using `scikit-learn`'s `Pipeline` module
- How to convert an ML workflow to a pipeline in a few lines of code!

Your learning journey into Machine Learning isn’t over yet! There are plenty of other topics that you can dive into to continue learning.

## Here are our recommendations for the next steps:


### Learn Big Data with PySpark
The [Big Data with PySpark](https://www.codecademy.com/learn/big-data-pyspark) course is an excellent next step. This will set the foundation for scaling machine learning pipelines using PySpark, a key part of productionalizing machine learning models in industry. 
 
Once again, congratulations on finishing the Building ML Pipelines course! We are excited to see what you accomplish next. Happy coding!

You've completed the Building Machine Learning Pipelines course! What's next?

Next Steps

## Welcome

Machine learning applications are powered by robust end-to-end pipelines. Building these pipelines involves building and tuning high-performing machine learning models and writing production-quality code.

## What Will We Cover in This Course?

You'll learn about the different components of a machine learning workflow and how to turn your workflows into end-to-end pipelines. Along the way you'll learn some nifty `scikit-learn` tools that'll help modularize your machine learning code to make it reproducible and effective!

## Prerequisites

This course requires prior knowledge of machine learning as well as basic software engineering in Python. Check out our [Intermediate Machine Learning Skill Path](https://www.codecademy.com/learn/paths/intermediate-machine-learning-skill-path) and our [Software Engineering for Data Scientists Skill Path](https://www.codecademy.com/learn/paths/software-engineering-for-data-scientists) to brush up on either!

## Learning is Social

Whatever you're working on, be sure to connect with the Codecademy community in the [forums](https://discuss.codecademy.com/). Remember to check in with the community regularly, including for things like asking for code reviews on your project work and providing code reviews to others in the [projects category](https://discuss.codecademy.com/c/project/1833), which can help to reinforce what you've learned.


Learn how to build machine learning pipelines with `scikit-learn`!

Welcome to Building Machine Learning Pipelines!

### Introduction

There are many steps in the process of creating, implementing and iterating over a machine learning model for a specific data-driven problem. While there is no single universal way of sequencing the different steps that go into a workflow, there are some general principles that are good to follow for optimal performance of a machine learning algorithm. This article will lay out the steps in an ML workflow and how to implement them for any dataset.

_A note to the learner_: This article and module require intermediate-level knowledge of machine learning. Check out our [Intermediate Machine Learning Skill Path](https://www.codecademy.com/learn/paths/intermediate-machine-learning-skill-path) if you'd like to brush up on that. Throughout the article we will also provide links to specific content items if you want to just review some of those concepts.


### Machine Learning Workflow

![image](https://static-assets.codecademy.com/Paths/machine-learning-engineer-career-path/pipelines/workflow_schematic.jpg)

A machine learning workflow has the following steps. Depending on the dataset, the question we're trying to answer and the tech stack we're working with, one or more of these steps can be omitted or combined with another. The steps are:

1. ETL (Extract, Transform and Load) data
2. Data Cleaning
3. Train-Test-Validation Split
4. EDA (Exploratory Data Analysis)
5. Feature Engineering (normalization, removing autocorrelations, discretization, etc.)
6. Model Selection and Implementation
7. Model Evaluation
8. Hyperparameter Tuning
9. Model Validation
10. Build ML pipeline!

We're now going to go through each of these steps in detail.

#### 1. Extract, Transform and Load (ETL) data

This process can look different depending on the data sources and the tech stack you may be working with at a specific company. It is often the case that data is stored in a SQL database with a cloud service provider like AWS, Digital Ocean, etc. Depending on the volume of data, an engineer would use a tool like PySpark to extract this data, transform it and load it into a local database. Sometimes this falls within the purview of a Machine Learning Engineer (MLE), but depending on the technical sophistication of the stack, there might be a Data Engineer or a Pipeline Engineer performing this task instead.


#### 2. Data Cleaning and Aggregation

This step is often combined with the previous one. This can involve a range of tasks depending on the form and type of data as well as the problem that the machine learning pipeline is being designed to solve. Some examples include: dealing with null or missing entries, conforming timestamps to a standard, carrying out aggregations like grouping events based on timestamps by the hour or day, grouping IP addresses by location, etc. Since Spark is well suited to performing such tasks on big data, this task might very well be the "Transform" part of the aforementioned ETL step. Alternatively, raw data might get handed over to an MLE who then does the same after the ETL step.

#### 3. Train-Test-Validation Split

The next step, before "editing" the data in any manner, is to perform the train-test-validation split. You are likely very familiar with the idea of training and test data: `scikit-learn`'s `model_selection` module has a `train_test_split` function that's used to create them. If you have ever implemented a machine learning model using `scikit-learn`, you've likely written the following piece of code:

```python
from sklearn.model_selection import train_test_split
# For feature matrix X and target variable y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
```

In the example above, we see that the test data makes up a third of the original dataset. The `random_state` argument is set to ensure reproducibility: if we reused 42, it would split a given dataset in exactly the same manner. The training data is used to learn a model's parameters and the test data is used to test its performance. For models in production, a third portion of the dataset is set aside, known as a holdout or validation dataset, which is used to tune hyperparameters and/or to perform model validation later on. If you would like to revisit the train-test-validation split, check out this [article](https://www.codecademy.com/paths/machine-learning-engineer/tracks/mle-machine-learning-fundamentals/modules/mlecp-supervised-learning-i-regressors-classifiers-and-trees/articles/mlfun-training-test-validation) on the same.
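
`train_test_split` only produces two parts per call, so one common way to get a third, validation portion is to call it twice. The following is a sketch under that assumption: the 60/20/20 proportions and the variable names are illustrative choices, not a prescription.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 2 features, with a distinct label per row
X_all = np.arange(200).reshape(100, 2)
y_all = np.arange(100)

# First carve off 20% of the rows as the test set
X_rest, X_test3, y_rest, y_test3 = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42)

# Then take 25% of the remaining 80 rows as the validation set,
# which leaves a 60/20/20 train/validation/test split
X_train3, X_val3, y_train3, y_val3 = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)
```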

![image](https://static-assets.codecademy.com/skillpaths/ml-fundamentals/evaluation-metrics/train%20test%20figure.png)

The reason we don't want to perform data manipulations before splitting the dataset into training, test and validation datasets is that we don't want data points from one of these to influence the others. Suppose we needed a machine learning model to predict housing prices and wanted to standardize a feature like the size of an apartment. The average of this feature would differ between the entire dataset and each of the individual datasets. Scaling the entire column with the overall average ignores the differences between these subsets and lets the test and validation data influence the training data, which makes the model evaluation and validation process less objective.
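
To make this concrete, here is a small sketch with made-up apartment sizes (the numbers are synthetic, not from a real dataset). The scaler learns its mean and standard deviation from the training rows only and then reuses them on the test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical apartment sizes (one feature column)
rng = np.random.default_rng(0)
sizes = rng.normal(loc=80.0, scale=20.0, size=(100, 1))

sizes_train, sizes_test = train_test_split(sizes, test_size=0.33, random_state=42)

scaler = StandardScaler()

# fit_transform: statistics come from the training rows only
sizes_train_scaled = scaler.fit_transform(sizes_train)

# transform: the test rows are scaled with the training statistics, not their own
sizes_test_scaled = scaler.transform(sizes_test)
```

The scaled test data will generally not have a mean of exactly zero, and that is expected: it has been treated like unseen data.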

#### 4. Exploratory Data Analysis

Exploratory Data Analysis, or EDA, in the context of a machine learning workflow is the step of inspecting, analyzing and altering your data to get it ready for machine learning modeling. Often this is the step where decisions on how to deal with outliers and how to transform variables are made. Our [EDA in Python course](https://www.codecademy.com/learn/eda-exploratory-data-analysis-python) covers this topic exhaustively, but the articles on performing EDA before supervised ([classification](https://www.codecademy.com/courses/eda-exploratory-data-analysis-python/articles/eda-prior-to-fitting-a-classification-model) or [regression](https://www.codecademy.com/courses/eda-exploratory-data-analysis-python/articles/eda-prior-to-fitting-a-regression-model)) or [unsupervised](https://www.codecademy.com/courses/eda-exploratory-data-analysis-python/articles/eda-prior-to-unsupervised-clustering) learning models are especially worth checking out if you'd like a refresher.


#### 5. Feature Engineering 

Feature engineering refers to an umbrella of methods to prep, select and reduce features in a machine learning problem. This can involve methods that overlap with EDA such as normalization, removing autocorrelations, discretization, etc. Feature engineering can also involve using machine learning algorithms like PCA to reduce dimensionality or methods that are implemented during the model fitting step like regularization. If you'd like to brush up on feature engineering methods, the [Feature Engineering Skill Path](https://www.codecademy.com/learn/paths/fe-path-feature-engineering) is for you!

#### 6. Model Selection and Implementation

Now we're ready to test out different machine learning models. The choice of the model depends on the attributes of the data one's working with as well as the type of question we're trying to answer. `scikit-learn` has a nifty [cheatsheet](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html) to choose the right estimator. 

#### 7. Model Evaluation

We're now getting into the iterative part of the workflow. Whatever model is built, it must be evaluated on the test data. For classification problems, metrics like accuracy, precision, recall, F1 score and AUROC indicate how performant the model is, and for regression problems, RMSE and R-squared are some commonly used metrics. If you'd like to review these ideas, check out [this lesson](https://www.codecademy.com/paths/machine-learning-engineer/tracks/mle-machine-learning-fundamentals/modules/mlecp-supervised-learning-i-regressors-classifiers-and-trees/lessons/mlfun-evaluation-metrics-classification/exercises/confusion-matrix) on evaluation metrics. Machine learning engineers iterate over different types of models to figure out the best model for the problem at hand.
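
As a quick illustration of the classification metrics mentioned above, here is a sketch with made-up labels and predictions (3 true positives, 1 false positive, 1 false negative and 3 true negatives).

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical true labels and model predictions for 8 test points
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / all   = 6 / 8
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)    = 3 / 4
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)    = 3 / 4
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
```

In this particular toy case all four metrics happen to equal 0.75; on real data they usually differ, which is why it helps to look at more than one.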

#### 8. Hyperparameter Tuning

Once a model has been decided upon, it can be tuned for better performance. Hyperparameter tuning is essential in making sure that the model does not overfit or underfit the data. If you would like a refresher on what hyperparameters are and methods of tuning hyperparameters, the hyperparameter tuning [module](https://www.codecademy.com/paths/machine-learning-engineer/tracks/mle-int-ml/modules/mle-reg-hyptune/articles/mle-hyperparameter-tuning-article) might be just for you!

This is key to how well the model is fitting known data and how well it's able to generalize to new data as well. Hence hyperparameter tuning might be done on the validation or holdout dataset.
 
#### 9. Model Validation

Model validation is the process of making sure that the model is still performant on data that it hasn't seen at all, neither in the training phase nor in the test phase. This can be done either during the hyperparameter tuning step or after. Typically, the same metrics used during the model evaluation phase need to be used here as well, so as to make a reasonable comparison with the former.

![image](https://static-assets.codecademy.com/Paths/machine-learning-engineer-career-path/intro_module/pipes.gif)

#### 10. Build ML pipeline!

When a machine learning workflow is part of a production cycle, it is often the case that a model is tuned and updated based on incoming information. In other words the model that worked well on last month's data might not be applicable for this month. It is the job of a Machine Learning Engineer or a Pipeline Engineer to make sure that the model deployed into production is thus flexible and alterable without affecting the rest of the codebase. ML pipelines allow one to do the same!

An ML pipeline is a modular sequence of objects that codifies and automates an ML workflow to make it efficient, reproducible and generalizable.

### Summary

We've gone through each of the steps within a machine learning workflow in detail. We're now ready to build ML workflows and turn them into pipelines!




Machine Learning Workflows

Feature engineering is an essential step in the machine learning workflow that can be performed before, during or after model implementation depending on the technique used.

The train-test-validation split is performed after feature engineering.

ETL stands for "Extract, Transform and Load"

Hyperparameter tuning is an iterative process that happens alongside model evaluation.

Quiz about machine learning workflows and pipelines!

### Why Build Machine Learning (ML) Pipelines? 
Pipelines help turn bulky and unwieldy machine learning workflows into shorter, interpretable, and reproducible processes that can be deployed to users. This course walks you through the major stages of building a pipeline for your machine learning project. 

### Take-Away Skills: 
Learn how to build production-grade ML pipelines using `scikit-learn`!

Learn how to build machine learning pipelines that automate your workflow and keep everything consistent.

Build a Machine Learning Pipeline
