What is One Hot Encoding and How to Implement it in Python?
When working with real-world datasets, attributes like gender, color, names of places, and product type contain vital information that can help build efficient machine learning (ML) models. As ML models understand only numeric values, we must convert the categorical values into numeric format. For this task, we can use one-hot encoding as it helps preserve the categorical nature of the data without introducing unintended relationships. In this article, we will discuss one-hot encoding, why it’s useful, and how to implement it in Python using popular libraries like pandas and scikit-learn.
What is one-hot encoding?
One-hot encoding is a data encoding technique used to convert categorical features of a dataset into numeric features. If a categorical feature has N unique values, we create N new binary columns in the dataset for one-hot encoding. Each new column represents a unique value in the existing categorical feature. If the existing categorical column contains the value represented by a new column, the value in the new column is set to 1. Values in the rest of the new columns are set to 0.
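The idea can be sketched in a few lines of plain Python (an illustrative helper, not library code; the function and value names are made up for this example):

```python
def one_hot(value, categories):
    """Return a binary vector with a single 1 at the position of `value`."""
    return [1 if value == category else 0 for category in categories]

# A feature with N = 3 unique values yields binary vectors of length 3
categories = ["Blue", "Green", "Red"]
print(one_hot("Green", categories))  # [0, 1, 0]
print(one_hot("Red", categories))    # [0, 0, 1]
```

Each vector contains exactly one `1`, which is where the name "one-hot" comes from.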
One-hot encoding example
To understand how one-hot encoding works, consider the following table.
ID | Product_ID | Price | Color |
---|---|---|---|
1 | P1001 | 1099 | Red |
2 | P1002 | 999 | Blue |
3 | P1003 | 499 | Green |
4 | P1004 | 999 | Red |
5 | P1005 | 1499 | Blue |
In this table, the `Color` column contains categorical values with three unique values: `Blue`, `Green`, and `Red`. To convert this categorical column into numeric values using one-hot encoding, we will create three new columns in the dataset, where each new column represents a unique color value.
ID | Product_ID | Price | Color | Color_Blue | Color_Green | Color_Red |
---|---|---|---|---|---|---|
1 | P1001 | 1099 | Red | | | |
2 | P1002 | 999 | Blue | | | |
3 | P1003 | 499 | Green | | | |
4 | P1004 | 999 | Red | | | |
5 | P1005 | 1499 | Blue | | | |
Next, we will scan each row in the `Color` column and set the values in the new columns.
- For the `Color` value `Blue`, we will set the value in the `Color_Blue` column to `1`. The rest of the new columns in the row will be set to `0`.
- For the `Color` value `Green`, we will set the value in the `Color_Green` column to `1`. The rest of the new columns in the row will be set to `0`.
- For the `Color` value `Red`, we will set the value in the `Color_Red` column to `1`. The rest of the new columns in the row will be set to `0`.
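The scan-and-set steps above can be sketched in plain Python (a simplified illustration of what the encoding does, not library code):

```python
rows = ["Red", "Blue", "Green", "Red", "Blue"]
unique_values = ["Blue", "Green", "Red"]

# One new binary column per unique value, initially empty
new_columns = {f"Color_{value}": [] for value in unique_values}

# Scan each row: append 1 to the matching column, 0 to the rest
for color in rows:
    for value in unique_values:
        new_columns[f"Color_{value}"].append(1 if color == value else 0)

print(new_columns["Color_Red"])  # [1, 0, 0, 1, 0]
```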
After the above transformation, we get the following table:
ID | Product_ID | Price | Color | Color_Blue | Color_Green | Color_Red |
---|---|---|---|---|---|---|
1 | P1001 | 1099 | Red | 0 | 0 | 1 |
2 | P1002 | 999 | Blue | 1 | 0 | 0 |
3 | P1003 | 499 | Green | 0 | 1 | 0 |
4 | P1004 | 999 | Red | 0 | 0 | 1 |
5 | P1005 | 1499 | Blue | 1 | 0 | 0 |
In this table, we can obtain the color information of the products from the `Color_Red`, `Color_Blue`, and `Color_Green` columns. Hence, the `Color` column is no longer required, and we delete it, as shown in the following table:
ID | Product_ID | Price | Color_Blue | Color_Green | Color_Red |
---|---|---|---|---|---|
1 | P1001 | 1099 | 0 | 0 | 1 |
2 | P1002 | 999 | 1 | 0 | 0 |
3 | P1003 | 499 | 0 | 1 | 0 |
4 | P1004 | 999 | 0 | 0 | 1 |
5 | P1005 | 1499 | 1 | 0 | 0 |
As you can see, we have successfully converted the categorical values in the `Color` column to numeric attributes using one-hot encoding. Now, let's discuss how we can implement one-hot encoding using different approaches in Python.
Implementing one-hot encoding in Python
In Python, we can implement one-hot encoding using the `get_dummies()` function in the pandas module and the `OneHotEncoder` class in the sklearn module. Let's discuss both of these approaches.
Implement one-hot encoding using pandas in Python
The `get_dummies()` function gives us a straightforward implementation of one-hot encoding in Python. To discuss how it works, let's first create a dataframe using the data provided in the example table.
```python
import pandas as pd

# Define dataset
data = [['P1001', 1099, 'Red'],
        ['P1002', 999, 'Blue'],
        ['P1003', 499, 'Green'],
        ['P1004', 999, 'Red'],
        ['P1005', 1499, 'Blue']]

# Create dataframe
product_data = pd.DataFrame(data, columns=["Product_ID", "Price", "Color"])
print("The dataset is:")
print(product_data)
```
Output:
```
The dataset is:
  Product_ID  Price  Color
0      P1001   1099    Red
1      P1002    999   Blue
2      P1003    499  Green
3      P1004    999    Red
4      P1005   1499   Blue
```
The `get_dummies()` function takes a dataframe as its first input argument and a list of column names as input to the `columns` parameter. After execution, it returns a dataframe in which the columns given to the `columns` parameter are transformed using one-hot encoding, as shown in the following example:
```python
# One-hot encoding using the pandas get_dummies() function
product_data_encoded = pd.get_dummies(product_data, columns=["Color"])
print("The encoded data is:")
print(product_data_encoded)
```
Output:
```
The encoded data is:
  Product_ID  Price  Color_Blue  Color_Green  Color_Red
0      P1001   1099       False        False       True
1      P1002    999        True        False      False
2      P1003    499       False         True      False
3      P1004    999       False        False       True
4      P1005   1499        True        False      False
```
In the output, you can see that the original `Color` column is dropped and that the one-hot encoded columns contain boolean values. To get one-hot encoded columns with integers, you can set the `dtype` parameter to `int` in the `get_dummies()` function. You can also set the `dtype` parameter to `float` for floating-point output.
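For instance, passing `dtype=float` produces `1.0`/`0.0` columns (a quick sketch, rebuilding the same example dataframe so the snippet is self-contained):

```python
import pandas as pd

data = [['P1001', 1099, 'Red'], ['P1002', 999, 'Blue'],
        ['P1003', 499, 'Green'], ['P1004', 999, 'Red'],
        ['P1005', 1499, 'Blue']]
product_data = pd.DataFrame(data, columns=["Product_ID", "Price", "Color"])

# dtype=float gives 1.0/0.0 columns instead of True/False
encoded = pd.get_dummies(product_data, columns=["Color"], dtype=float)
print(encoded["Color_Red"].dtype)  # float64
```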
```python
# Get 1/0 encoding instead of True/False
product_data_encoded = pd.get_dummies(product_data, columns=["Color"], dtype=int)
print("The encoded data is:")
print(product_data_encoded)
```
Output:
```
The encoded data is:
  Product_ID  Price  Color_Blue  Color_Green  Color_Red
0      P1001   1099           0            0          1
1      P1002    999           1            0          0
2      P1003    499           0            1          0
3      P1004    999           0            0          1
4      P1005   1499           1            0          0
```
In this output, you can see that the one-hot encoded columns contain 0s and 1s because we set the `dtype` parameter to `int` in the `get_dummies()` function.
We can also perform one-hot encoding on multiple columns at once using the `get_dummies()` function in Python. For this, we pass all the column names we want to encode as a list to the `columns` parameter. To understand this, let's add a `Segment` column to the dataset and perform one-hot encoding on the `Color` and `Segment` columns.
```python
import pandas as pd

# Define data with two categorical columns
data = [['P1001', 1099, 'Red', 'Men'],
        ['P1002', 999, 'Blue', 'Women'],
        ['P1003', 499, 'Green', 'Men'],
        ['P1004', 999, 'Red', 'Kids'],
        ['P1005', 1499, 'Blue', 'Boys']]
product_data_with_segment = pd.DataFrame(data, columns=["Product_ID", "Price", "Color", "Segment"])
print("The dataset is:")
print(product_data_with_segment)

product_data_encoded = pd.get_dummies(product_data_with_segment, columns=["Color", "Segment"], dtype=int)
print("The encoded data is:")
print(product_data_encoded)
```
Output:
```
The dataset is:
  Product_ID  Price  Color Segment
0      P1001   1099    Red     Men
1      P1002    999   Blue   Women
2      P1003    499  Green     Men
3      P1004    999    Red    Kids
4      P1005   1499   Blue    Boys
The encoded data is:
  Product_ID  Price  Color_Blue  Color_Green  Color_Red  Segment_Boys  Segment_Kids  Segment_Men  Segment_Women
0      P1001   1099           0            0          1             0             0            1              0
1      P1002    999           1            0          0             0             0            0              1
2      P1003    499           0            1          0             0             0            1              0
3      P1004    999           0            0          1             0             1            0              0
4      P1005   1499           1            0          0             1             0            0              0
```
In this example, we transformed the `Color` and `Segment` columns using one-hot encoding by passing the list `["Color", "Segment"]` to the `columns` parameter in the `get_dummies()` function. To encode more categorical columns, add their names to the list.
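As an optional refinement, `get_dummies()` also accepts a `drop_first` parameter that drops the first encoded column per feature, since its value can be inferred from the others (useful for linear models that are sensitive to collinearity). A quick sketch:

```python
import pandas as pd

data = [['P1001', 1099, 'Red'], ['P1002', 999, 'Blue'],
        ['P1003', 499, 'Green'], ['P1004', 999, 'Red'],
        ['P1005', 1499, 'Blue']]
product_data = pd.DataFrame(data, columns=["Product_ID", "Price", "Color"])

# drop_first=True removes Color_Blue; a row of all zeros then means Blue
encoded = pd.get_dummies(product_data, columns=["Color"], dtype=int, drop_first=True)
print(list(encoded.columns))
# ['Product_ID', 'Price', 'Color_Green', 'Color_Red']
```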
In addition to pandas, Python has the feature-rich sklearn library, which provides multiple functions for data analysis and machine learning tasks. Let’s discuss how to perform one-hot encoding using the sklearn module in Python.
Implement one-hot encoding using OneHotEncoder in Python
The sklearn module provides the `OneHotEncoder` class, which we can use to train, save, and reuse encoders for one-hot encoding. To implement one-hot encoding using the `OneHotEncoder` class, we will first create an encoder, train it, and then use it for one-hot encoding. We can also save and reload the encoder, as discussed in the following subsections.
Create an encoder using the OneHotEncoder class
First, we will create a one-hot encoder object using the `OneHotEncoder()` constructor. By default, `OneHotEncoder` returns a sparse matrix after encoding. To get one-hot encoded columns instead of a sparse matrix, we will set the `sparse_output` parameter to `False`. Also, `OneHotEncoder` raises an error if the input data contains values that weren't present in the training data. To ignore any such unseen values instead, we will set the `handle_unknown` parameter to `'ignore'`.
```python
from sklearn.preprocessing import OneHotEncoder

# Create an encoder
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
```
Train the encoder using the categorical data
After creating the one-hot encoder, we will train it using the categorical values. To train the encoder for a single column, we first create a dataframe containing only that column. Next, we invoke the `fit()` method on the `OneHotEncoder` object and pass the dataframe containing the categorical column to it. After executing the `fit()` method, we get a trained one-hot encoder.
- To get the names of the one-hot encoded features, you can invoke the `get_feature_names_out()` method on the trained encoder.
- To perform one-hot encoding on new data, you can invoke the `transform()` method on the trained one-hot encoder and pass the data as an input argument. The `transform()` method returns a 2D array containing one-hot encoded values.
You can perform one-hot encoding on the `Color` column using `OneHotEncoder` as shown in the following example:
```python
# Train the encoder using the Color column
color_df = product_data[['Color']]
encoder.fit(color_df)

# Get encoded column names
encoded_columns = encoder.get_feature_names_out()

# Transform data using the trained encoder
encoded_colors = encoder.transform(color_df)
print("The encoded data is:")
print(encoded_colors)
print("The encoded columns are:")
print(encoded_columns)
```
Output:
```
The encoded data is:
[[0. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [1. 0. 0.]]
The encoded columns are:
['Color_Blue' 'Color_Green' 'Color_Red']
```
After performing one-hot encoding on the `Color` column, we can merge the encoded data into the original dataframe. For this, we first create a dataframe of encoded values using the transformed data and the encoded column names. Next, we merge the new dataframe with the original dataframe using the `pd.concat()` function. Finally, we drop the original `Color` column, which contains the categorical values.
```python
# Create a dataframe using the encoded data
encoded_df = pd.DataFrame(encoded_colors, columns=encoded_columns)
print("Encoded dataframe is:")
print(encoded_df)

# Merge encoded data into the original dataframe and drop the Color column
product_data_encoded = pd.concat([product_data, encoded_df], axis=1).drop(["Color"], axis=1)
print("Merged data is:")
print(product_data_encoded)
```
Output:
```
Encoded dataframe is:
   Color_Blue  Color_Green  Color_Red
0         0.0          0.0        1.0
1         1.0          0.0        0.0
2         0.0          1.0        0.0
3         0.0          0.0        1.0
4         1.0          0.0        0.0
Merged data is:
  Product_ID  Price  Color_Blue  Color_Green  Color_Red
0      P1001   1099         0.0          0.0        1.0
1      P1002    999         1.0          0.0        0.0
2      P1003    499         0.0          1.0        0.0
3      P1004    999         0.0          0.0        1.0
4      P1005   1499         1.0          0.0        0.0
```
We can also train a one-hot encoder in sklearn to transform multiple columns. For this, we pass a dataframe containing all the categorical columns to the `fit()` method, as shown in the following example:
```python
# Create an encoder
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

# Train the encoder using two columns
color_segment_data = product_data_with_segment[['Color', 'Segment']]
encoder.fit(color_segment_data)

# Get encoded column names
encoded_columns = encoder.get_feature_names_out()

# Transform data using the trained encoder
encoded_data = encoder.transform(color_segment_data)

# Create a dataframe using the encoded data
encoded_df = pd.DataFrame(encoded_data, columns=encoded_columns)

# Merge encoded data into the original dataset and drop the Color and Segment columns
product_data_encoded = pd.concat([product_data_with_segment, encoded_df], axis=1).drop(["Color", "Segment"], axis=1)
print("Encoded dataframe is:")
print(encoded_df)
print("Final dataframe is:")
print(product_data_encoded)
```
Output:
```
Encoded dataframe is:
   Color_Blue  Color_Green  Color_Red  Segment_Boys  Segment_Kids  Segment_Men  Segment_Women
0         0.0          0.0        1.0           0.0           0.0          1.0            0.0
1         1.0          0.0        0.0           0.0           0.0          0.0            1.0
2         0.0          1.0        0.0           0.0           0.0          1.0            0.0
3         0.0          0.0        1.0           0.0           1.0          0.0            0.0
4         1.0          0.0        0.0           1.0           0.0          0.0            0.0
Final dataframe is:
  Product_ID  Price  Color_Blue  Color_Green  Color_Red  Segment_Boys  Segment_Kids  Segment_Men  Segment_Women
0      P1001   1099         0.0          0.0        1.0           0.0           0.0          1.0            0.0
1      P1002    999         1.0          0.0        0.0           0.0           0.0          0.0            1.0
2      P1003    499         0.0          1.0        0.0           0.0           0.0          1.0            0.0
3      P1004    999         0.0          0.0        1.0           0.0           1.0          0.0            0.0
4      P1005   1499         1.0          0.0        0.0           1.0           0.0          0.0            0.0
```
Scikit-learn allows us to save a trained encoder and reuse it for one-hot encoding later. You can save the trained one-hot encoder in pickle format using the `dump()` function in the `joblib` module.
```python
# Save the trained encoder
import joblib

encoder_file = "trained_color_segment_encoder.pkl"

# Save the trained encoder into a file
joblib.dump(encoder, encoder_file)
```
After saving, you can reload the encoder from the file and perform one-hot encoding, as shown in the following example:
```python
encoder_file = "trained_color_segment_encoder.pkl"

# Load the trained encoder from the file
encoder = joblib.load(encoder_file)

# Transform new data using the loaded encoder
color_segment_data = product_data_with_segment[['Color', 'Segment']]
encoded_data = encoder.transform(color_segment_data)
```
As you can see, the loaded encoder is used in exactly the same way as a newly trained encoder.
You might have observed that implementing one-hot encoding with the `get_dummies()` function is simpler than with the `OneHotEncoder` class. However, is it the better approach? Let's discuss whether you should use `OneHotEncoder` or the `get_dummies()` function.
OneHotEncoder vs get_dummies: what should you use?
Suppose you train an ML model on data whose `Color` column has three unique values: Blue, Green, and Red. For any dataset with the same three unique color values, the `get_dummies()` function produces the same encoded columns:
```
  Product_ID  Price  Color_Blue  Color_Green  Color_Red
0      P1001   1099           0            0          1
1      P1011    222           1            0          0
2      P1003    499           0            1          0
3      P1009    499           0            0          1
4      P1007   1299           0            0          0
```
Here, you can see that we have new product IDs and prices, but the numeric representation of the color values doesn't change. Now suppose we have the following dataframe containing a new value, `Orange`, in the `Color` column, which we didn't see earlier:
```
  Product_ID  Price   Color
0      P1001   1099     Red
1      P1002    999    Blue
2      P1003    499   Green
3      P1004    999     Red
4      P1006   1299  Orange
```
When we transform this data using the `get_dummies()` function, the output is as follows:
```
  Product_ID  Price  Color_Blue  Color_Green  Color_Orange  Color_Red
0      P1001   1099           0            0             0          1
1      P1002    999           1            0             0          0
2      P1003    499           0            1             0          0
3      P1004    999           0            0             0          1
4      P1006   1299           0            0             1          0
```
In this output, you can see that we get a new column, `Color_Orange`. On the contrary, if you process the new data using the trained `OneHotEncoder`, we get the following output, which has no extra columns:
```
  Product_ID  Price  Color_Blue  Color_Green  Color_Red
0      P1001   1099         0.0          0.0        1.0
1      P1002    999         1.0          0.0        0.0
2      P1003    499         0.0          1.0        0.0
3      P1004    999         0.0          0.0        1.0
4      P1006   1299         0.0          0.0        0.0
```
In this output, we didn't get any new columns. All the color column values are `0` for the product ID `P1006` because the `Orange` color wasn't present in the training data; the encoder simply ignores the value `Orange`.
One-hot encoding with the `get_dummies()` function is easy to use, but it cannot handle unseen categories: it creates a new column for every new value, and if a value from the training data isn't present in the new data, the column for that value is removed as well.
The `get_dummies()` function works consistently only when the set of unique values in the categorical columns never changes. Because of this, we cannot integrate it into a machine learning pipeline, as it will keep adding or removing columns whenever a new category value arrives or a training-time value is absent.
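If you do need `get_dummies()` on new data, one common workaround (sketched here with made-up data; it is not a full replacement for a trained encoder) is to record the columns produced on the training data and realign new output with `DataFrame.reindex`, filling missing columns with 0 and dropping unseen ones:

```python
import pandas as pd

train = pd.DataFrame({"Color": ["Red", "Blue", "Green"]})
new = pd.DataFrame({"Color": ["Red", "Orange"]})  # Orange is unseen

# Remember the columns produced on the training data
train_columns = pd.get_dummies(train, columns=["Color"], dtype=int).columns

# Realign new data: add missing training columns as 0, drop extras
new_encoded = pd.get_dummies(new, columns=["Color"], dtype=int)
new_encoded = new_encoded.reindex(columns=train_columns, fill_value=0)
print(list(new_encoded.columns))
# ['Color_Blue', 'Color_Green', 'Color_Red']
```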
`OneHotEncoder` is flexible and can ignore new values, which is essential for proper preprocessing during inference. It always creates the same number of columns as there are unique values in the training data. Thus, you should use the `OneHotEncoder` class to implement one-hot encoding in machine learning pipelines.
Why use one-hot encoding?
To use categorical data in a machine learning model, we must convert it to a numerical format. There are various techniques for this, such as label encoding, one-hot encoding, frequency encoding, target encoding, binary encoding, and ordinal encoding. Among these, one-hot encoding is the most popular choice for categorical values with no inherent order. Let's discuss why we use one-hot encoding for nominal values.
- Techniques like label encoding or frequency encoding introduce ordinality in categorical values. One-hot encoding uses a separate binary column for each categorical value and sets it to 1 only if the value is present in a particular row. Due to this, the distance between each categorical value stays the same and doesn’t introduce any order, which helps in better numeric representation of the categorical values.
- One-hot encoding creates a new column for each unique value in a categorical column, and only one column in a row can be set to 1, increasing the dataset’s sparsity. Algorithms like linear regression, logistic regression, naive Bayes, and support vector machines (SVMs) work well with sparse data when used with correct regularization techniques.
Despite the above advantages, one-hot encoding can create many columns if the original categorical feature has many unique values. When the number of features increases significantly, the data becomes sparse and harder for models to learn from. This can lead to overfitting, increased model training time, and poor performance on unseen data. Hence, we need to analyze the data and carefully decide when to use one-hot encoding.
One-hot encoding vs label encoding in Python
One-hot and label encoding are the most popular techniques for transforming categorical data. However, the two methods differ in how they process the data and in when each is appropriate.
Label encoding converts each unique categorical value into a unique integer. On the contrary, one-hot encoding creates a new binary column for each category. The category present in a particular row is set to 1, and other columns are set to 0. Due to this, both these techniques perform differently in different use cases.
- Type of categorical data: Label encoding works well for categorical data with an inherent order, such as course grades, education levels, military ranks, or clothing sizes. It also works well for categorical values that can be represented using the Likert scale. We just need to make sure that higher integers are assigned to the values that are higher in the order. One-hot encoding is best for nominal data that has no inherent order.
- Dimensionality of data: Label encoding is simple and space-efficient, as it doesn’t add extra columns to the data. One-hot encoding adds new columns to the data, increasing dimensionality and sparsity in the dataset. One-hot encoding can lead to high dimensionality if the categorical feature has many unique values.
- Compatibility with tree-based algorithms: Label encoding often works well with decision trees and ensemble algorithms like random forests and XGBoost. As tree-based algorithms don't assume any numeric relationship between the encoded integers, we get efficient models. On the other hand, one-hot encoding increases the dimensionality of the data, which might result in overfitting and increased complexity.
- Compatibility with distance-based algorithms: Algorithms like linear regression, logistic regression, and K-nearest neighbors assign a ranking or distance between two data points. Suppose we use label encoding to encode categorical values. For instance, when you substitute the colors Blue with 1, Green with 2, and Red with 3, the ML models understand this data as 3>2>1. The model might also deduce that Red is closer to Green and farther from Blue. These inferences are incorrect and can impact the model’s performance. One-hot encoding ensures that all the category values are at the same distance and equally different. There is no unintended hierarchy. Therefore, we should use one-hot encoding when distance calculations are required to process the data.
Conclusion
One-hot encoding is a fundamental technique for preparing categorical data for ML models. Transforming categories into binary vectors ensures that models treat each category equally without assuming any implicit order. Whether working with simple datasets or building complex ML pipelines, understanding when and how to apply one-hot encoding can make a real difference in your model’s performance and accuracy. In this article, we discussed the basics and implementation of one-hot encoding. We also discussed how one-hot encoding differs from label encoding and how we can use these techniques in different scenarios. You can take a sample dataset with different categorical columns and experiment with it to understand how one-hot encoding works in various scenarios.
To learn more about data preprocessing and analysis, you can take the course Learn Data Analysis with Pandas. You might also like this skill path on how to build a machine-learning model that discusses building different ML models using the scikit-learn module.
FAQs
1. What is the purpose of OneHotEncoder?
`OneHotEncoder` converts categorical variables into a numerical format that machine learning models can understand. It converts categorical features into binary vectors in which each unique value is represented as a separate feature.
2. How to handle unseen data in one-hot encoding?
You can build an encoder with the `OneHotEncoder` class in the sklearn module that ignores unseen values. For that, set the `handle_unknown` parameter to `"ignore"` while creating the encoder.
3. Should I scale or normalize one-hot encoded features?
Scaling one-hot encoded features is unnecessary because they’re already in binary format (0s and 1s).
4. What is sparse vs. dense one-hot encoding?
Sparse one-hot encoding stores only the positions of the 1s to save memory, which is helpful for large datasets. Dense one-hot encoding keeps the full vectors, which are easier to inspect but use more memory. You can set the `sparse_output` parameter to `False` for dense one-hot encoding and `True` otherwise.