We’ll now implement a task similar to the previous exercise using pipeline.Pipeline(), this time with categorical variables. Specifically, we’ll handle missing values in categorical data and one-hot encode the categorical variables. As in the previous exercise, we will convert an existing codebase to a pipeline. The two steps are described in detail here:

  1. SimpleImputer() will be used again to fill missing values in the pipeline, but this time, the strategy parameter will need to be updated to most_frequent.
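  2. OneHotEncoder() will be used as the second step in the pipeline. By default, scikit-learn’s OneHotEncoder() returns a sparse matrix from its transform, so we will pass sparse=False to return a dense array. Note that this must be the Boolean False, not the string 'False' (a non-empty string is truthy), and that the parameter is named sparse_output in scikit-learn 1.2 and later.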
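To make the two steps concrete, here is a minimal sketch that runs them one after the other outside of a pipeline. The DataFrame toy and its column names are hypothetical and not part of the exercise data. The try/except is only there because the name of the dense-output parameter depends on the installed scikit-learn version.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: two categorical columns, each with one missing entry.
toy = pd.DataFrame(
    {
        "color": ["red", "blue", np.nan, "red"],
        "size": ["S", "M", "M", np.nan],
    },
    dtype=object,
)

# Step 1: replace each missing value with the column's most frequent category.
imputer = SimpleImputer(strategy="most_frequent")
toy_filled = imputer.fit_transform(toy)

# Step 2: one-hot encode, asking for a dense array instead of a sparse matrix.
try:
    encoder = OneHotEncoder(sparse_output=False)  # scikit-learn 1.2 and later
except TypeError:
    encoder = OneHotEncoder(sparse=False)  # older releases
toy_ohe = encoder.fit_transform(toy_filled)

print(toy_filled)           # missing entries filled with "red" and "M"
print(encoder.categories_)  # categories learned for each column
print(toy_ohe)              # dense array with one column per category
```

Each original column expands into one indicator column per category, so two columns with two categories each produce a four-column array.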



The existing code fills in missing values with the mode of each categorical variable and then creates dummy variables (with OneHotEncoder). Create a Pipeline() object named pipeline that performs the same tasks by making:

  1. The first step in the pipeline a SimpleImputer() that employs the most_frequent strategy
  2. The second step in the pipeline a OneHotEncoder()

Fit this pipeline to the categorical columns of the training data only. Then transform the test data (categorical columns only!) and call the resulting array x_transform.
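One way the pipeline could be built, fitted, and applied is sketched below. The data frames x_train and x_test, the list cat_cols, and the step names "imputer" and "ohe" are assumptions made for illustration; the exercise supplies its own data and may use different names.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: two categorical columns plus one numeric column.
x_train = pd.DataFrame({
    "color": pd.Series(["red", "blue", np.nan, "red"], dtype=object),
    "size": pd.Series(["S", "M", "M", np.nan], dtype=object),
    "price": [1.0, 2.5, 3.0, 4.0],
})
x_test = pd.DataFrame({
    "color": pd.Series([np.nan, "blue"], dtype=object),
    "size": pd.Series(["S", np.nan], dtype=object),
    "price": [2.0, 3.5],
})
cat_cols = ["color", "size"]  # assumed names of the categorical columns

# The dense-output parameter is sparse_output in scikit-learn 1.2 and later,
# and sparse in older releases.
try:
    encoder = OneHotEncoder(sparse_output=False)
except TypeError:
    encoder = OneHotEncoder(sparse=False)

pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("ohe", encoder),
])

# Fit on the training data's categorical columns only.
pipeline.fit(x_train[cat_cols])

# Transform the test data using the modes and categories learned in training.
x_transform = pipeline.transform(x_test[cat_cols])
print(x_transform)
```

Because the pipeline is fitted on the training data, the missing test values are filled with the training modes rather than with modes computed from the test set.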


Sum the absolute differences between the x_transform array and the x_test_fill_missing_ohe array. Call this variable array_diff and print it; a value of 0 confirms that the pipeline output is the same as x_test_fill_missing_ohe.
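The comparison can be sketched as follows. The exercise's existing code is not shown here, so the manual version below is an assumption about what it does: it fills missing values with the training modes and encodes with an OneHotEncoder fitted on the filled training data. The toy data and the helper make_dense_encoder are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


def make_dense_encoder():
    """Hypothetical helper: dense OneHotEncoder across scikit-learn versions."""
    try:
        return OneHotEncoder(sparse_output=False)  # scikit-learn 1.2 and later
    except TypeError:
        return OneHotEncoder(sparse=False)  # older releases


# Hypothetical categorical-only data.
x_train = pd.DataFrame(
    {"color": ["red", "blue", np.nan, "red"], "size": ["S", "M", "M", np.nan]},
    dtype=object,
)
x_test = pd.DataFrame(
    {"color": [np.nan, "blue"], "size": ["S", np.nan]},
    dtype=object,
)

# Manual approach (assumed): fill with the training modes, then encode.
train_modes = x_train.mode().iloc[0]
x_train_fill_missing = x_train.fillna(train_modes)
x_test_fill_missing = x_test.fillna(train_modes)
ohe = make_dense_encoder().fit(x_train_fill_missing)
x_test_fill_missing_ohe = ohe.transform(x_test_fill_missing)

# Pipeline approach: the same two steps chained together.
pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("ohe", make_dense_encoder()),
])
pipeline.fit(x_train)
x_transform = pipeline.transform(x_test)

# Sum of absolute element-wise differences; 0 means the arrays are identical.
array_diff = np.abs(x_transform - x_test_fill_missing_ohe).sum()
print(array_diff)
```

Both arrays must be dense for the subtraction to work element-wise, which is one reason the encoder is configured to return a full array.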
