In this lesson we’re going to learn how to turn a machine learning (ML) workflow into a pipeline using `scikit-learn`. An ML pipeline is a modular sequence of objects that codifies and automates an ML workflow to make it efficient, reproducible, and generalizable. While there is no single way to build a pipeline, some tools are used almost universally to do so. The most accessible of these is `scikit-learn`’s `Pipeline` object, which allows us to chain together the different steps that go into an ML workflow.
Turning a workflow into a pipeline has many other advantages too. Pipelines provide consistency: the same steps will always be applied in the same order under the same conditions. They are also concise and can streamline your code. The `Pipeline` object within `scikit-learn` has consistent methods for working with the many other estimators and transformers we have already covered in our ML curriculum. It is usually the starting point for a Machine Learning Engineer before turning to more sophisticated tools for scaling pipelines (such as PySpark), and we will delve deeper into it in this lesson.
What can go into a pipeline? Every intermediate step must have both the `.fit` and `.transform` methods; this includes preprocessing, imputation, feature selection, and dimensionality reduction. The final step only needs the `.fit` method. Examples of tasks we’ve seen already that could benefit from a pipeline include (one of these is sketched in code after the list):
- scaling data then applying principal component analysis
- filling in missing values then fitting a regression model
- one-hot encoding categorical variables and scaling numerical variables
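As an illustration of the second item, here is a hedged sketch in which missing values are filled by a `SimpleImputer` (an intermediate step with `.fit` and `.transform`) before a `LinearRegression` model (the final step, which only needs `.fit`) is fit. The tiny array and step names are invented for this example:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

# Toy data with a missing value (illustrative numbers only)
X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0], [4.0, 8.0]])
y = np.array([3.0, 5.0, 13.0, 12.0])

# Intermediate step: imputer (fit/transform); final step: regressor (fit)
pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("regressor", LinearRegression()),
])

# Fitting the pipeline imputes the missing value, then fits the regression
pipe.fit(X, y)
print(pipe.predict(X))
```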
In the following exercises, we will walk through various chained functions and how to incorporate these into a `Pipeline`.