For the categorical variables, let’s look at another common task: dealing with missing values and one-hot encoding. We will convert an existing codebase to a pipeline, describing the two steps in detail.
As in the previous exercise, SimpleImputer will be used again to fill missing values in the pipeline, but this time the strategy parameter will need to be updated so that missing values are filled with the mode (the most frequent value), matching the existing code.
OneHotEncoder will be used as the second step in the pipeline. Note that by default this transform returns a sparse array, so we will use sparse=False to return a full (dense) array.
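The two steps described above can be sketched roughly as follows. This is a minimal illustration, not the exercise's solution: the toy DataFrame `x_train_cat`, the step names `impute` and `ohe`, and the helper `make_dense_encoder` are all assumptions made for this example. The helper exists only because newer scikit-learn releases (1.2 and later) renamed the `sparse` argument to `sparse_output`; the mode-based imputation is assumed to use `strategy="most_frequent"`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


def make_dense_encoder():
    """Hypothetical helper: build a OneHotEncoder that returns a dense array."""
    try:
        # scikit-learn >= 1.2 uses the name `sparse_output`
        return OneHotEncoder(sparse_output=False)
    except TypeError:
        # older releases only know `sparse`; note the boolean False, not 'False'
        return OneHotEncoder(sparse=False)


# Hypothetical toy training data (categorical columns only)
x_train_cat = pd.DataFrame({
    "color": ["red", np.nan, "red", "blue"],
    "size": ["S", "M", np.nan, "M"],
})

# Step 1 fills missing values with the mode; step 2 one-hot encodes
cat_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("ohe", make_dense_encoder()),
])

train_encoded = cat_pipeline.fit_transform(x_train_cat)
print(type(train_encoded), train_encoded.shape)
```

Passing the string 'False' instead of the boolean would not request a dense array, since a non-empty string is truthy in Python, which is why the boolean form is used here.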
Examine the existing code that fills in missing values with the mode value and then creates dummy variables (with OneHotEncoder). Update the pipeline with the correct two steps and fit it on the training set (categorical columns only).
Transform the test data (categorical columns only) using the fitted pipeline. Confirm the results are the same as x_test_fill_missing_ohe by printing the sum of absolute differences.
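The comparison step might look something like the sketch below. The data is a hypothetical toy set, and the "existing code" is stood in for by a manual fill-then-encode sequence whose result is stored in `x_test_fill_missing_ohe`; the real exercise's variables and code may differ. The helper `make_dense_encoder` is likewise an assumption, included only to cope with the `sparse` / `sparse_output` rename across scikit-learn versions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


def make_dense_encoder():
    """Hypothetical helper: dense OneHotEncoder across scikit-learn versions."""
    try:
        return OneHotEncoder(sparse_output=False)
    except TypeError:
        return OneHotEncoder(sparse=False)


# Hypothetical toy data (categorical columns only)
x_train_cat = pd.DataFrame({
    "color": ["red", np.nan, "red", "blue"],
    "size": ["S", "M", np.nan, "M"],
})
x_test_cat = pd.DataFrame({
    "color": [np.nan, "blue"],
    "size": ["S", np.nan],
})

# Stand-in for the existing code: fill with the training modes, then encode
train_modes = x_train_cat.mode().iloc[0]
manual_encoder = make_dense_encoder().fit(x_train_cat.fillna(train_modes))
x_test_fill_missing_ohe = manual_encoder.transform(x_test_cat.fillna(train_modes))

# Pipeline version: fit on the training set, transform the test set
cat_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("ohe", make_dense_encoder()),
])
cat_pipeline.fit(x_train_cat)
x_test_pipeline = cat_pipeline.transform(x_test_cat)

# Sum of absolute differences between the two results
abs_diff = np.abs(x_test_pipeline - x_test_fill_missing_ohe).sum()
print(abs_diff)
```

A sum of zero indicates that the pipeline reproduces the manual result element by element.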