To introduce pipelines, let’s look at a common task: handling missing values and scaling numeric variables. We will convert an existing code base to a pipeline and describe these two steps in detail.

To define a pipeline, pass a list of tuples of the form (name, transformer/estimator) to the Pipeline constructor. For example, to use a SimpleImputer first, named “imputer”, and a StandardScaler second, named “scale”, write Pipeline([("imputer", SimpleImputer()), ("scale", StandardScaler())]). Once the pipeline has been instantiated, the .fit and .transform methods can be called as before. If the last step of the pipeline is a model (i.e. has a .predict method), then .predict can also be called on the pipeline.
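As a minimal sketch of the definition above, the following builds the two-step pipeline and calls .fit and .transform on it. The array x_train is toy data invented for illustration and is not the lesson's dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy numeric data with one missing value (illustrative assumption)
x_train = np.array([[1.0, 10.0],
                    [2.0, np.nan],
                    [3.0, 30.0]])

# Each step is a (name, transformer) tuple; steps run in list order
pipeline = Pipeline([("imputer", SimpleImputer()),
                     ("scale", StandardScaler())])

# Fit both steps on the training data, then transform it
pipeline.fit(x_train)
x_train_scaled = pipeline.transform(x_train)
print(x_train_scaled)
```

After the transform, the missing entry has been filled with its column mean and every column has been standardized.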

Steps are fit in the order provided, and the transformed output of each step is passed to the next. Further parameters can be passed to each step as well. For example, to pass the parameter with_mean=False to the StandardScaler, use Pipeline([("imputer", SimpleImputer()), ("scale", StandardScaler(with_mean=False))]).
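The effect of passing a parameter to a step can be checked with a small sketch like the one below. The data is again invented for illustration. With with_mean=False the scaler divides by the standard deviation without centering, so the column means are no longer zero.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy numeric data with one missing value (illustrative assumption)
x_train = np.array([[1.0, 10.0],
                    [2.0, np.nan],
                    [3.0, 30.0]])

# Pass with_mean=False directly to the StandardScaler step
pipeline_no_center = Pipeline([("imputer", SimpleImputer()),
                               ("scale", StandardScaler(with_mean=False))])
x_no_center = pipeline_no_center.fit_transform(x_train)

# Steps can be looked up by the names given in the tuples
scale_step = pipeline_no_center.named_steps["scale"]
print(scale_step.with_mean)
print(x_no_center.mean(axis=0))
```

The named_steps attribute is a convenient way to inspect a fitted step after the pipeline has been trained.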



Examine the existing code that fills in missing values with the mean value (SimpleImputer) and then scales the data (StandardScaler). Update the pipeline with the correct two steps and fit it on the numeric columns of the training set.


Transform the test data (numeric columns only) using the fitted pipeline. Confirm the results are the same as x_test_fill_missing_scale by printing the sum of absolute differences.
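One way this comparison might look is sketched below. The arrays x_train and x_test are toy stand-ins for the lesson's numeric columns, and the step-by-step code is an assumed reconstruction of the "existing code", not the lesson's actual file. Only the name x_test_fill_missing_scale comes from the exercise.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for the numeric columns (illustrative assumption)
x_train = np.array([[1.0, 5.0], [2.0, 6.0], [10.0, 7.0], [np.nan, 8.0]])
x_test = np.array([[3.0, 6.0], [np.nan, 7.0]])

# Step-by-step version: impute, then scale, fitting only on training data
imputer = SimpleImputer()
scaler = StandardScaler()
x_train_fill_missing = imputer.fit_transform(x_train)
scaler.fit(x_train_fill_missing)
x_test_fill_missing_scale = scaler.transform(imputer.transform(x_test))

# Pipeline version of the same two steps
pipeline = Pipeline([("imputer", SimpleImputer()),
                     ("scale", StandardScaler())])
pipeline.fit(x_train)
x_test_pipeline = pipeline.transform(x_test)

# Sum of absolute differences; should be zero up to floating-point error
diff = np.abs(x_test_fill_missing_scale - x_test_pipeline).sum()
print(diff)
```

Both versions learn the imputation value and the scaling statistics from the training data only, which is why the test-set results should agree.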


Change the imputer strategy to median. Confirm the results of the two pipelines are different by printing the sum of absolute differences.
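A sketch of this last check, under the same assumptions as before: the data is invented, and the first column is deliberately skewed so that its mean and median differ. The variable names pipeline_mean and pipeline_median are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data; the first column is skewed so mean != median (illustrative assumption)
x_train = np.array([[1.0, 5.0], [2.0, 6.0], [10.0, 7.0], [np.nan, 8.0]])
x_test = np.array([[3.0, 6.0], [np.nan, 7.0]])

# Default strategy is "mean"
pipeline_mean = Pipeline([("imputer", SimpleImputer(strategy="mean")),
                          ("scale", StandardScaler())])
# Same pipeline, but missing values are filled with the column median
pipeline_median = Pipeline([("imputer", SimpleImputer(strategy="median")),
                            ("scale", StandardScaler())])

x_test_mean = pipeline_mean.fit(x_train).transform(x_test)
x_test_median = pipeline_median.fit(x_train).transform(x_test)

# A nonzero sum of absolute differences shows the pipelines disagree
diff = np.abs(x_test_mean - x_test_median).sum()
print(diff)
```

The difference comes from two places: the fill value itself changes, and the scaler's mean and standard deviation are computed on differently imputed training data.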
