To introduce pipelines, let’s look at a common task – dealing with missing values and scaling numeric variables. We will convert an existing code base to a pipeline, describing these two steps in detail.
To define a pipeline, pass a list of tuples of the form (name, transform/estimator) into the Pipeline object. For example, to use a SimpleImputer first, named “imputer”, and a StandardScaler second, named “scale”, pass these as Pipeline([("imputer", SimpleImputer()), ("scale", StandardScaler())]). Once the pipeline has been instantiated, the .fit and .transform methods can be called as before. If the last step of the pipeline is a model (i.e. has a .predict method), then .predict can also be called on the pipeline.
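As a minimal sketch of the two-step pipeline described above, the example below fits on a small training array and transforms a test array. The arrays x_train and x_test are hypothetical toy data made up for illustration, not the exercise's data set.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data (assumption): two numeric columns with missing values.
x_train = np.array([[1.0, 10.0], [2.0, np.nan], [np.nan, 30.0], [4.0, 40.0]])
x_test = np.array([[np.nan, 20.0], [3.0, np.nan]])

# Steps are (name, transformer) tuples and run in the order listed.
pipe = Pipeline([("imputer", SimpleImputer()), ("scale", StandardScaler())])

# Fit both steps on the training data only, then apply them to the test data.
pipe.fit(x_train)
x_test_transformed = pipe.transform(x_test)
print(x_test_transformed)
```

Because the imputer fills gaps with the training mean and the scaler then centers on that same mean, a test value that was missing ends up at zero after the transform.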
Each step in the pipeline will be fit in the order provided. Further parameters can be passed to each step as well. For example, if we want to pass the parameter with_mean=False to the StandardScaler, use Pipeline([("imputer", SimpleImputer()), ("scale", StandardScaler(with_mean=False))]).
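The following sketch shows one way to pass with_mean=False to the scaling step and then check that the setting took effect. The toy array is an assumption for illustration; named_steps and get_params are standard Pipeline accessors in scikit-learn.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical non-negative toy data (assumption) with one missing value.
x_train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 40.0]])

# Parameters are passed to each step's constructor as usual.
pipe = Pipeline([
    ("imputer", SimpleImputer()),
    ("scale", StandardScaler(with_mean=False)),
])
x_scaled = pipe.fit_transform(x_train)

# Each step can be looked up by the name given in its tuple.
scaler_step = pipe.named_steps["scale"]
print(scaler_step.with_mean)

# Step parameters are also exposed as "<step name>__<parameter>".
print(pipe.get_params()["scale__with_mean"])
```

With with_mean=False the scaler only divides by the standard deviation and does not center, so non-negative inputs stay non-negative.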
Instructions
Examine the existing code that fills in missing values with the mean (SimpleImputer) and then scales the data (StandardScaler). Update the pipeline with the correct two steps and fit it on the training set (numeric columns only).
Transform the test data (numeric columns only) using the fitted pipeline. Confirm the results are the same as x_test_fill_missing_scale by printing the sum of absolute differences.
Change the imputer strategy to median. Confirm the results of the two pipelines are different by printing the sum of absolute differences.
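The steps above might look roughly like the following sketch. The exercise's existing code is not shown here, so the manual impute-then-scale section that produces x_test_fill_missing_scale is an assumption about what it does, and the toy arrays stand in for the numeric columns of the real training and test sets.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data (assumption) standing in for the numeric columns.
x_train = np.array([[1.0, 10.0], [2.0, np.nan], [np.nan, 30.0], [10.0, 100.0]])
x_test = np.array([[np.nan, 20.0], [3.0, np.nan]])

# Assumed shape of the existing code: impute with the mean, then scale.
imputer = SimpleImputer(strategy="mean")
scaler = StandardScaler()
scaler.fit(imputer.fit_transform(x_train))
x_test_fill_missing_scale = scaler.transform(imputer.transform(x_test))

# The same two steps expressed as a pipeline.
mean_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
mean_pipe.fit(x_train)
x_test_mean = mean_pipe.transform(x_test)
diff_mean = np.abs(x_test_mean - x_test_fill_missing_scale).sum()
print(diff_mean)

# Same pipeline, but imputing with the median instead of the mean.
median_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
median_pipe.fit(x_train)
x_test_median = median_pipe.transform(x_test)
diff_median = np.abs(x_test_median - x_test_mean).sum()
print(diff_median)
```

The first difference should be zero, or within floating-point error of it, since both routes apply the same operations. The second should be positive because the toy columns are skewed, so their mean and median differ.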