In this lesson we’re going to learn how to turn a machine learning (ML) workflow into a pipeline using scikit-learn. An ML pipeline is a modular sequence of objects that codifies and automates an ML workflow to make it efficient, reproducible and generalizable. There is no single way to build a pipeline, but some tools are widely used for the job. The most accessible of these is scikit-learn’s Pipeline object, which allows us to chain together the different steps that go into an ML workflow.

Turning a workflow into a pipeline has other advantages too. Pipelines provide consistency: the same steps will always be applied in the same order under the same conditions. They are also concise and can streamline your code. The Pipeline object within scikit-learn has consistent methods for working with the many estimators and transformers we have already covered in our ML curriculum. It is usually the starting point for a Machine Learning Engineer before turning to more sophisticated tools for scaling pipelines (such as PySpark), and we will delve deeper into it in this lesson.
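
To make the consistency point concrete, here is a minimal sketch. It assumes a synthetic classification dataset and a scaler followed by a logistic regression; the data, the step names and the choice of model are illustrative assumptions, not part of the lesson. Fitting the pipeline once on the training data means the same fitted steps are applied, in the same order, to any new data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used here only for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Each step is a (name, object) pair; the names are arbitrary labels
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])

# fit() learns the scaling parameters and the model from the training data only
pipeline.fit(X_train, y_train)

# predict() and score() reuse the fitted scaler before calling the classifier
test_predictions = pipeline.predict(X_test)
test_accuracy = pipeline.score(X_test, y_test)
```

Because the scaler is fitted inside the pipeline, there is no separate scaling call to forget or to apply in the wrong order when new data arrives.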

What can go into a pipeline? Every intermediate step must have both the .fit and .transform methods. This includes preprocessing, imputation, feature selection and dimensionality reduction. The final step must have at least the .fit method. Examples of tasks we’ve seen already that could benefit from a pipeline include:

  • scaling data then applying principal component analysis
  • filling in missing values then fitting a regression model
  • one-hot-encoding categorical variables and scaling numerical variables
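
The first two tasks in this list could be sketched roughly as follows. The data is randomly generated and the step names, the mean imputation strategy and the number of components are assumptions chosen for illustration. The first pipeline ends in a transformer, so it is used through fit_transform; the second ends in a model, so it is used through fit and predict.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used here only for illustration: 100 rows, 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Task 1: scale the data, then apply principal component analysis.
# Both steps have .fit and .transform, so the pipeline itself can transform.
pca_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2)),
])
X_reduced = pca_pipeline.fit_transform(X)

# Task 2: fill in missing values, then fit a regression model.
# Blank out every tenth value of one column to simulate missing data.
X_missing = X.copy()
X_missing[::10, 1] = np.nan

regression_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("regressor", LinearRegression()),
])
regression_pipeline.fit(X_missing, y)
predictions = regression_pipeline.predict(X_missing)
```

The third task applies different transformations to different columns, which a plain sequential chain like the ones above does not express on its own.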

In the following exercises, we will walk through various chained functions and how to incorporate these into a Pipeline.
