To introduce pipelines, let’s look at a common task – dealing with missing values and scaling numeric variables. We will convert an existing code base to a pipeline, describing these two steps in detail.
To define a pipeline, pass a list of tuples of the form
(name, transform/estimator) into the
Pipeline object. For example, to use a
SimpleImputer first, named “imputer”, and a
StandardScaler second, named “scale”, pass these as
Pipeline([("imputer", SimpleImputer()), ("scale", StandardScaler())]). Once the pipeline has been instantiated, methods such as
.fit and .transform can be called as before. If the last step of the pipeline is a model (i.e. has a
.predict method), then this can also be called.
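The description above can be sketched as follows. This is a minimal illustration, not the lesson's own code: the toy arrays, the step name "regress", and the choice of LinearRegression as the final model are assumptions made only for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (made up for illustration) with one missing value per column.
X_train = np.array([[1.0, 10.0], [2.0, np.nan], [np.nan, 30.0], [4.0, 40.0]])
y_train = np.array([1.0, 2.0, 3.0, 4.0])

# Transformer-only pipeline: impute first, then scale.
prep = Pipeline([("imputer", SimpleImputer()), ("scale", StandardScaler())])
prep.fit(X_train)
X_scaled = prep.transform(X_train)

# With a model as the last step, .predict becomes available on the pipeline.
# LinearRegression and the name "regress" are hypothetical choices.
model = Pipeline([
    ("imputer", SimpleImputer()),
    ("scale", StandardScaler()),
    ("regress", LinearRegression()),
])
model.fit(X_train, y_train)
predictions = model.predict(X_train)
```

Here the raw data, missing values included, is handed to the pipeline directly; the imputing and scaling happen inside it before the model sees anything.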
Each step in the pipeline will be fit in the order provided. Further parameters can be passed to each step as well. For example, if we want to pass the parameter
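with_mean=False to the
StandardScaler step, we can set it when constructing that step:
("scale", StandardScaler(with_mean=False)).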
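As a hedged sketch of how a parameter such as with_mean=False can reach the scaling step, the example below shows two routes that scikit-learn supports: setting the parameter when the step is constructed, and setting it afterwards with set_params using the "step name, double underscore, parameter" convention. The toy array is made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (made up for illustration).
X = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0]])

# Option 1: pass the parameter when constructing the step.
pipe = Pipeline([
    ("imputer", SimpleImputer()),
    ("scale", StandardScaler(with_mean=False)),
])

# Option 2: set it afterwards via "<step name>__<parameter>".
pipe2 = Pipeline([("imputer", SimpleImputer()), ("scale", StandardScaler())])
pipe2.set_params(scale__with_mean=False)

out1 = pipe.fit_transform(X)
out2 = pipe2.fit_transform(X)
```

Both pipelines end up with the same configuration, so their outputs match. The second form is convenient when the parameter is tuned later, for example in a grid search.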
Examine the existing code that fills in missing values with the mean value (
SimpleImputer) and then scales the data (
StandardScaler). Update the pipeline with the correct two steps and fit it on the training set (numeric columns only).
Transform the test data (numeric columns only) using the fitted pipeline. Confirm the results are the same as
x_test_fill_missing_scale by printing the sum of absolute differences.
Change the imputer strategy to
median. Confirm the results of the two pipelines are different by printing the sum of absolute differences.
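One possible way to work through the steps above is sketched here. The lesson's actual data and existing code are not shown, so x_train and x_test are randomly generated stand-ins, and the step-by-step "existing code" is an assumed reconstruction; only the name x_test_fill_missing_scale comes from the exercise text.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up numeric data standing in for the lesson's train/test columns.
rng = np.random.default_rng(0)
x_train = rng.exponential(scale=2.0, size=(50, 3))
x_test = rng.exponential(scale=2.0, size=(20, 3))
x_train[::7, 0] = np.nan  # introduce missing values
x_test[::5, 0] = np.nan

# Assumed "existing code": impute with the mean, then scale, step by step.
imputer = SimpleImputer(strategy="mean")
scaler = StandardScaler()
x_train_filled = imputer.fit_transform(x_train)
scaler.fit(x_train_filled)
x_test_fill_missing_scale = scaler.transform(imputer.transform(x_test))

# The same two steps as a pipeline, fit on the training set only.
pipe_mean = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
pipe_mean.fit(x_train)
x_test_pipe = pipe_mean.transform(x_test)

# Sum of absolute differences: should be (numerically) zero.
diff_same = np.abs(x_test_pipe - x_test_fill_missing_scale).sum()
print(diff_same)

# Switch the imputer strategy to the median and compare again.
pipe_median = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
pipe_median.fit(x_train)
diff_median = np.abs(pipe_median.transform(x_test) - x_test_pipe).sum()
print(diff_median)
```

The skewed (exponential) toy data is a deliberate choice: its mean and median differ noticeably, so the second difference comes out clearly non-zero.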