Data Science One Hot Encoding
In data science, one hot encoding is a method of encoding categorical variables as binary vectors that can be more readily used by machine learning algorithms. Most such algorithms operate on numeric input, not on categorical values like "Texas" or "bicycle". However, the data these algorithms are asked to analyze often contains categorical values. One hot encoding translates those categorical values into numbers that the algorithms can interpret.
Encoding Categorical Values
The process of one hot encoding is as follows:
- Each possible value in the data being encoded is assigned a unique sequential integer value.
- Each of those values is represented by a binary vector with a position for each integer value.
- Each vector has a value of `1` in the position for its corresponding integer value, and a `0` elsewhere.
- The categorical values in the data are replaced by the corresponding vector.
Depending on the implementation, the encoded values may be represented by actual vector data types, or they may be expanded as additional columns in the data.
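The four steps can be sketched in plain Python. This is a minimal illustration rather than a library implementation: the function name `one_hot_encode` is hypothetical, and assigning integers in order of first appearance is an assumption, since any consistent assignment works.

```python
def one_hot_encode(values):
    """Return (index, encoded) for a list of categorical values."""
    # Step 1: assign each distinct value a sequential integer,
    # here in order of first appearance (an assumption).
    index = {}
    for value in values:
        if value not in index:
            index[value] = len(index)

    # Steps 2 and 3: one vector per category, with a 1 at the
    # category's integer position and 0 everywhere else.
    size = len(index)
    vectors = {
        category: [1 if pos == i else 0 for pos in range(size)]
        for category, i in index.items()
    }

    # Step 4: replace each original value with its vector.
    encoded = [vectors[value] for value in values]
    return index, encoded


index, encoded = one_hot_encode(["A", "A", "B", "C"])
print(index)    # {'A': 0, 'B': 1, 'C': 2}
print(encoded)  # [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
```

This version returns actual vectors. Expanding them into extra columns is the other representation mentioned above.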
Example 1: Categories
Original data:
| Category | Value |
|---|---|
| A | 24 |
| A | 90 |
| A | 75 |
| A | 36 |
| B | 45 |
| B | 28 |
| B | 62 |
| C | 65 |
| C | 97 |
Assigning integer values to category values:
| Category | Integer Value |
|---|---|
| A | 0 |
| B | 1 |
| C | 2 |
Assigning vectors to values:
| Category | Integer Value | Vector |
|---|---|---|
| A | 0 | [1,0,0] |
| B | 1 | [0,1,0] |
| C | 2 | [0,0,1] |
Applying the vector values as columns to the original data:
| Category | Value | A | B | C |
|---|---|---|---|---|
| A | 24 | 1 | 0 | 0 |
| A | 90 | 1 | 0 | 0 |
| A | 75 | 1 | 0 | 0 |
| A | 36 | 1 | 0 | 0 |
| B | 45 | 0 | 1 | 0 |
| B | 28 | 0 | 1 | 0 |
| B | 62 | 0 | 1 | 0 |
| C | 65 | 0 | 0 | 1 |
| C | 97 | 0 | 0 | 1 |
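The column expansion in the last table might be reproduced as follows. This is a sketch using only the standard library: rows are modeled as dictionaries, and `add_one_hot_columns` is a hypothetical helper, not part of any package.

```python
# The original data from Example 1 as (Category, Value) pairs.
data = [
    ("A", 24), ("A", 90), ("A", 75), ("A", 36),
    ("B", 45), ("B", 28), ("B", 62),
    ("C", 65), ("C", 97),
]
rows = [{"Category": c, "Value": v} for c, v in data]


def add_one_hot_columns(rows, key):
    """Return copies of rows with one 0/1 column per distinct value of key."""
    # Sorted order gives A=0, B=1, C=2, matching the tables above.
    categories = sorted({row[key] for row in rows})
    return [
        {**row, **{cat: int(row[key] == cat) for cat in categories}}
        for row in rows
    ]


expanded = add_one_hot_columns(rows, "Category")
print(expanded[0])  # {'Category': 'A', 'Value': 24, 'A': 1, 'B': 0, 'C': 0}
print(expanded[8])  # {'Category': 'C', 'Value': 97, 'A': 0, 'B': 0, 'C': 1}
```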
Example 2: Colors
Original data:
| Color | Quantity |
|---|---|
| Red | 15 |
| Blue | 22 |
| Green | 13 |
| Blue | 30 |
| Red | 10 |
Assigning integer values:
| Color | Integer Value |
|---|---|
| Red | 0 |
| Blue | 1 |
| Green | 2 |
Assigning vectors:
| Color | Integer Value | Vector |
|---|---|---|
| Red | 0 | [1,0,0] |
| Blue | 1 | [0,1,0] |
| Green | 2 | [0,0,1] |
Final encoded data:
| Color | Quantity | Red | Blue | Green |
|---|---|---|---|---|
| Red | 15 | 1 | 0 | 0 |
| Blue | 22 | 0 | 1 | 0 |
| Green | 13 | 0 | 0 | 1 |
| Blue | 30 | 0 | 1 | 0 |
| Red | 10 | 1 | 0 | 0 |
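In this example the integers follow the order Red, Blue, Green rather than alphabetical order, so the column order depends on how the integers were assigned. One way to make that choice explicit is to pass the category order in. The sketch below assumes this approach, and `encode_with_order` is a hypothetical name.

```python
def encode_with_order(values, categories):
    """One-hot encode values using an explicit category order."""
    # Integer value of each category is its position in the given list.
    position = {cat: i for i, cat in enumerate(categories)}
    return [
        [1 if position[value] == i else 0 for i in range(len(categories))]
        for value in values
    ]


colors = ["Red", "Blue", "Green", "Blue", "Red"]
color_vectors = encode_with_order(colors, ["Red", "Blue", "Green"])
print(color_vectors)
# [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]
```

A value missing from `categories` raises a `KeyError` here. Whether to fail or to ignore unknown values is a design decision this sketch leaves open.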
Frequently Asked Questions
1. What do you mean by one hot encoding?
One hot encoding is a useful technique to convert categorical variables into numerical representations by assigning each unique category a binary vector, where one element is 1 and the rest of the elements are 0.
2. What is one-hot and one-cold encoding?
- One-hot: Binary vector with `1` for the category, `0` elsewhere; no implied order.
- One-cold: Binary vector with `0` for the category, `1` elsewhere; indicates absence.
- Example (A, B, C):
  - A → One-hot: `[1,0,0]`, One-cold: `[0,1,1]`
  - B → One-hot: `[0,1,0]`, One-cold: `[1,0,1]`
  - C → One-hot: `[0,0,1]`, One-cold: `[1,1,0]`
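A one-cold vector is the bitwise complement of the one-hot vector, which a short sketch can show. The helper names `one_hot` and `one_cold` are hypothetical.

```python
def one_hot(index, size):
    """Vector of length size with a single 1 at position index."""
    return [1 if i == index else 0 for i in range(size)]


def one_cold(index, size):
    """Complement of the one-hot vector: a single 0 at position index."""
    return [1 - bit for bit in one_hot(index, size)]


for i, name in enumerate(["A", "B", "C"]):
    print(name, one_hot(i, 3), one_cold(i, 3))
# A [1, 0, 0] [0, 1, 1]
# B [0, 1, 0] [1, 0, 1]
# C [0, 0, 1] [1, 1, 0]
```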
3. What is the difference between binary and one-hot encoding?
- Binary encoding: Represents categories as binary numbers; may imply order.
- One-hot encoding: Represents categories as independent binary vectors with a single `1`; no order implied.
- Example (A, B, C):
  - Binary: A → `00`, B → `01`, C → `10`
  - One-hot: A → `[1,0,0]`, B → `[0,1,0]`, C → `[0,0,1]`
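The binary codes in the example are the integer values written in base 2 and padded to a fixed width. A minimal sketch follows, with `binary_encode` as a hypothetical helper that assumes categories are numbered from 0.

```python
def binary_encode(index, n_categories):
    """Return index as a fixed-width binary string."""
    # Bits needed to represent the largest index (n_categories - 1);
    # at least one bit even when there is a single category.
    width = max(1, (n_categories - 1).bit_length())
    return format(index, f"0{width}b")


codes = {name: binary_encode(i, 3) for i, name in enumerate(["A", "B", "C"])}
print(codes)  # {'A': '00', 'B': '01', 'C': '10'}
```

For three categories the binary code needs two bits, while the one-hot vector needs three positions. The gap widens as the number of categories grows.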