While scikit-learn contains many existing transformers and classes that can be used in pipelines, you may need at some point to create your own. This is simpler than you may think, as a step in the pipeline needs to have only a few methods implemented. If it is an intermediate step, it will need fit and transform methods, which we will demonstrate in the exercise below.
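As a rough illustration of how little is required, here is a minimal sketch of a custom intermediate step. The class name `ColumnCenterer`, the attribute `means_`, and the toy array `X_demo` are hypothetical and chosen only for this example. Inheriting from scikit-learn's `BaseEstimator` and `TransformerMixin` is one common convention (it supplies `get_params` and `fit_transform`), not a strict requirement.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class ColumnCenterer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: subtracts each column's training mean."""

    def fit(self, X, y=None):
        # Learn the per-column means from the data passed to fit.
        self.means_ = np.asarray(X, dtype=float).mean(axis=0)
        return self  # returning self lets calls such as fit(...).transform(...) chain

    def transform(self, X):
        # Apply the means learned in fit; nothing is re-estimated here.
        return np.asarray(X, dtype=float) - self.means_


# Toy data, assumed only for this illustration.
X_demo = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])

# The custom class slots into a pipeline like any built-in transformer.
pipe = Pipeline([("center", ColumnCenterer()), ("scale", StandardScaler())])
X_out = pipe.fit_transform(X_demo)
```

Because `fit` stores what it learns and `transform` only applies it, the same fitted step can later be applied to unseen data without leaking information from that data.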
Here are some of the major takeaways on pipelines:
Pipelines help make concise, reproducible code by combining transformer steps and/or a final estimator.
Intermediate steps of a pipeline must have both the .fit() and .transform() methods. This includes steps for preprocessing, imputation, feature selection, and dimension reduction.
The final step of a pipeline must have the .fit() method; this step can be a transformer or an estimator/model.
If the pipeline is meant to only transform your data by combining preprocessing and data cleaning steps, then each step in the pipeline will be a transformer. If your pipeline will also include a model (a final estimation or prediction step), then the last step must be an estimator.
Once the steps of a pipeline are defined, it can be used like any other transformer/estimator by calling its fit, transform, and/or predict methods. Similarly, it can be used in place of an estimator in a hyperparameter grid search.
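The takeaways above can be sketched in code. The example below is an illustration under assumptions, not the workshop's own code: the synthetic data from `make_classification`, the step names (`"impute"`, `"scale"`, `"clf"`), the choice of `LogisticRegression`, and the grid of `C` values are all made up for this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, assumed only for this illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 1) Every step is a transformer, so the pipeline itself acts as a transformer.
prep = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
X_test_prepped = prep.fit(X_train).transform(X_test)

# 2) The last step is an estimator, so the pipeline supports predict.
model = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)
preds = model.predict(X_test)

# 3) The pipeline stands in for an estimator in a grid search.
#    Step parameters are addressed as "<step name>__<parameter>".
grid = GridSearchCV(model, param_grid={"clf__C": [0.1, 1.0, 10.0]}, cv=3)
grid.fit(X_train, y_train)
best_C = grid.best_params_["clf__C"]
```

Searching over the whole pipeline means the preprocessing steps are refit inside each cross-validation fold, which keeps the held-out fold from influencing the imputation or scaling.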
Examine the code written for the class MyImputer. This replicates the SimpleImputer using the mean strategy. Notice both fit and transform methods are defined. Use this new class as the first step in new_pipeline and second step
Fit the new pipeline on the training data, numeric columns only. This will be identical to the pipeline created in exercise 2. Verify the results of the transform on the test set are the same by printing the sum of absolute differences between the two data sets.
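For orientation, the comparison might look roughly like the sketch below. This is not the exercise's actual code: the `MyImputer` body is one plausible way to write a mean imputer, the random arrays stand in for the numeric training and test columns, and `StandardScaler` is used as a placeholder second step because the text above does not name it. The pipeline called `reference` plays the role of the pipeline from exercise 2, whose real contents are assumed here.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class MyImputer(BaseEstimator, TransformerMixin):
    """Sketch of a mean imputer; an assumption, not the exercise's class."""

    def fit(self, X, y=None):
        # Column means computed while ignoring missing values.
        self.means_ = np.nanmean(np.asarray(X, dtype=float), axis=0)
        return self

    def transform(self, X):
        X = np.array(X, dtype=float)  # np.array copies, so the input is not modified
        rows, cols = np.where(np.isnan(X))
        X[rows, cols] = self.means_[cols]  # fill each gap with its column's training mean
        return X


# Stand-in numeric data with roughly 10% missing values (assumed for illustration).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
X_test = rng.normal(size=(40, 3))
X_train[rng.random(X_train.shape) < 0.1] = np.nan
X_test[rng.random(X_test.shape) < 0.1] = np.nan

# Placeholder for the exercise 2 pipeline; its second step is an assumption.
reference = Pipeline([("impute", SimpleImputer(strategy="mean")), ("scale", StandardScaler())])
new_pipeline = Pipeline([("impute", MyImputer()), ("scale", StandardScaler())])

ref_out = reference.fit(X_train).transform(X_test)
new_out = new_pipeline.fit(X_train).transform(X_test)

# Sum of absolute differences between the two transformed test sets.
abs_diff = np.abs(ref_out - new_out).sum()
print(abs_diff)
```

If the two imputers behave the same way, the printed value should be zero or within floating-point rounding of zero.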