If you’re an aspiring Cybersecurity Specialist, Back-End Developer, or Data Scientist, then you’ll need to know about hashing. Hashing is an important tool used in data authentication and security, as well as database management.
Never heard of hashing before? Don’t worry, we’ll break it down for you. Ahead, we’ll walk you through everything you need to know about hashing, including what it is, how it works, why people use it, and popular hashing algorithms.
What is hashing?
Hashing is the process of converting data — text, numbers, files, or anything, really — into a fixed-length string of letters and numbers. Data is converted into these fixed-length strings, or hash values, by using a special algorithm called a hash function.
For example, a hash function that creates 32-character hash values will always turn its input into a 32-character code. Whether you want to generate a hash value for the word “Codecademy” or for the entire works of Shakespeare, the hash value will always be 32 characters long.
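Here's a minimal sketch of that fixed-length behavior in Python, using the standard library's hashlib module and MD5, whose hexadecimal output happens to be 32 characters long. The helper name md5_hex and the sample inputs are our own, chosen only for illustration.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Return the MD5 hash value of text as a hexadecimal string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# One short input and one very long input (illustrative sample text).
short_hash = md5_hex("Codecademy")
long_hash = md5_hex("To be, or not to be, that is the question. " * 10_000)

# Both hash values have the same length, no matter how long the input was.
print(len(short_hash), len(long_hash))
```

Both lengths come out as 32, even though the second input is hundreds of thousands of characters long.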
How hashing works
Input data is key
The input data to be hashed is also known as the key. A key can be in just about any format. It could be a string of text, a list of numbers, an image, or even an application file.
The hash function
The central part of any hashing process is the hashing function, which takes the key and converts it to a fixed-length string of characters. But how exactly does that work? How is it that your name and a novel can both be condensed into the same number of characters?
The secret is dividing the data into equal-sized blocks.
The most popular hashing algorithms process data in blocks of 512 or 1,024 bits and produce hash values between 128 and 512 bits long. A bit is the basic unit of computer information and corresponds to either a binary 1 or 0.
Let’s suppose that we’re working with SHA-1, a popular hash function that works with block sizes of 512 bits. 512 bits is 64 bytes, or roughly 64 characters of plain text, so if we have a short message to hash, then the SHA-1 function only needs to run once to generate a final hash value.
Of course, most files and datasets are much larger than that, so the algorithm divides all of the data into 512-bit chunks. If our data fills two chunks (setting aside the padding that SHA-1 appends), then the algorithm first runs the SHA-1 function on the first chunk, generates an intermediate hash value, and then combines this intermediate value with the second chunk to generate a final hash value.
For larger inputs, the process repeats until all the 512-bit chunks have been processed by the hashing algorithm. A hash function might repeat this step thousands or even hundreds of thousands of times for a large dataset or file before it generates the final hash value. This is why hashing algorithms need to be efficient in order to be effective.
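To see the chunk-by-chunk idea in action, here's a small sketch using Python's hashlib. It doesn't reimplement SHA-1; the library still handles padding and the internal state for us. It only shows that feeding the data in 64-byte (512-bit) pieces gives the same result as hashing it all at once. The function name sha1_in_chunks and the sample data are our own assumptions.

```python
import hashlib


def sha1_in_chunks(data: bytes, chunk_size: int = 64) -> str:
    """Feed data to SHA-1 one chunk at a time and return the final hash value."""
    hasher = hashlib.sha1()
    for start in range(0, len(data), chunk_size):
        # Each update folds the next chunk into the running internal state.
        hasher.update(data[start:start + chunk_size])
    return hasher.hexdigest()


data = b"x" * 1000  # illustrative input spanning many 64-byte chunks

chunked = sha1_in_chunks(data)
all_at_once = hashlib.sha1(data).hexdigest()
print(chunked == all_at_once)

# hashlib reports sizes in bytes; multiply by 8 to get bits.
print(hashlib.sha1().block_size * 8, hashlib.sha1().digest_size * 8)
```

The same pattern is handy for hashing large files, since you can read and hash one piece at a time instead of loading the whole file into memory.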
The hash value
The final output of the hash function is the hash value, which ideally should be unique to each input. Hash values may only need to be used once for data authentication or digital signatures, or they may be stored for easy lookup in a hash table.
Properties of hashing algorithms
There are several different approaches hashing algorithms and functions use to convert data into hash values, but they all share a few common characteristics.
If you use the same data input and the same hashing algorithm, then the final hash value should be the same. This is a key aspect of using hash functions to authenticate data.
Ideally, no two inputs in a hashing algorithm should yield the same output hash value. When two different inputs do produce the same hash value, it’s known as a collision, and the best hashing algorithms have the fewest instances of collisions.
Remember that hashing algorithms process data in small chunks to generate a final hash value. An effective hashing algorithm quickly processes any data type into a unique hash value.
Hash functions are designed to make it practically impossible to regenerate a message or file from the hash value. In fact, if the hashing function can be reversed to recreate the original input, it’s considered to be compromised. This is one thing that distinguishes hashing from encryption, which is designed to be reversible.
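To make the first two properties concrete, here's a deliberately weak toy hash function written in Python. It is not a real hashing algorithm, just an illustration we made up: it adds up the byte values of the input and keeps the remainder after dividing by 256. It's deterministic, but it collides easily.

```python
def toy_hash(text: str) -> int:
    """Deliberately weak hash: sum of the input's byte values, modulo 256."""
    return sum(text.encode("utf-8")) % 256


# Deterministic: the same input always gives the same hash value.
print(toy_hash("listen") == toy_hash("listen"))

# Collision: two different inputs with the same letters give the same sum.
print(toy_hash("listen") == toy_hash("silent"))
```

Both comparisons print True. The second one is the problem: any rearrangement of the same letters collides, which is exactly what well-designed hashing algorithms work hard to avoid.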
Why is hashing used?
Hashing is used in a variety of applications, from cybersecurity to blockchain to data privacy. Here are a few ways that people use hashing every day.
Message and data authentication
Hashing helps Cybersecurity Specialists ensure that data isn’t altered between the sender and the recipient.
Let’s pretend that we’re both spies. We’re working in different cities, doing spy stuff, and since we don’t have a secure internet connection, the only trusted way for us to communicate with each other is by sending hand-delivered notes.
But if I send a note to you, how can we make sure that no one’s tampered with my message? I can encrypt the message, but if someone else has figured out the encryption, how do we know they haven’t intercepted my message and replaced it with another one?
One solution is to use a hashing algorithm to turn the contents of my message into a series of characters. If we can both turn my message into the same string of characters with the hashing algorithm, we’ll know no one tampered with my message while on its way to you.
In other words, hashing is a way to authenticate data or show that the data received wasn’t somehow changed along the way. When you’re downloading software, hashing helps ensure that what you’re installing is the original code and not compromised by a virus, trojan, or other malware.
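Here's a rough sketch of how the two spies could check a note in Python. It assumes SHA-256 as the hashing algorithm and assumes the hash value reaches the recipient through a channel they trust; the function names fingerprint and is_untampered and the sample messages are hypothetical.

```python
import hashlib
import hmac


def fingerprint(message: bytes) -> str:
    """Return the SHA-256 hash value of a message as a hexadecimal string."""
    return hashlib.sha256(message).hexdigest()


def is_untampered(received: bytes, expected_hash: str) -> bool:
    """Re-hash the received message and compare it with the expected hash value."""
    # compare_digest is a constant-time comparison from the standard library.
    return hmac.compare_digest(fingerprint(received), expected_hash)


original = b"Meet at the usual place at noon"
sent_hash = fingerprint(original)

print(is_untampered(original, sent_hash))
print(is_untampered(b"Meet at the usual place at nine", sent_hash))
```

The first check prints True and the second prints False. The same idea applies when a software download page publishes a hash value that you compare against the file you received.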
Detecting changes in data
Even when you’re working with non-sensitive data, hashing is an effective way to compare two sets of data and see if they’re different.
Let’s look at two messages and their hash values using a popular hash function called MD5:
“Meet me at Codecademy HQ on Broadway”
“Meet me at Codecadamy HQ on Broadway”
What did you notice first: the different hash values or the differences between the messages?
If you read through each message, you’ll notice that the word “Codecademy” is spelled differently in the second message. Now, imagine that you want to find out whether two versions of a novel are different. Even a computer would need some time to go through thousands of words, but with hashing, all it takes is comparing a few characters to determine whether two sets of data are different.
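If you'd like to generate the hash values for these two messages yourself, here's a short Python sketch using hashlib's MD5. The variable names are our own.

```python
import hashlib

messages = [
    "Meet me at Codecademy HQ on Broadway",
    "Meet me at Codecadamy HQ on Broadway",
]

# Compute a 32-character MD5 hash value for each message.
digests = [hashlib.md5(m.encode("utf-8")).hexdigest() for m in messages]

for message, digest in zip(messages, digests):
    print(digest, message)

# A single changed letter is enough to produce a different hash value.
print(digests[0] == digests[1])
```

The final line prints False: comparing two short hash values tells you the messages differ without reading either message in full.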
Data privacy
Sometimes, you want to be able to store and retrieve sensitive information. For example, many websites don’t store your actual password in a database but rather your password’s hash value instead. That way, if someone does hack into the database, all they’ll find are hash values that can’t be directly used by themselves.
This is especially useful for the health and financial industries. Instead of storing directly identifiable information such as name or social security number, a health or bank database can store the hash value of this information instead.
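As a sketch of the password example, here's one way a site might store and check a password hash in Python. Note that this goes a step beyond plain hashing: it uses PBKDF2 with a random salt (hashlib.pbkdf2_hmac), a technique we're swapping in because it's a common choice for passwords, not something described above. The function names, the sample password, and the iteration count are all illustrative assumptions.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # illustrative value, an assumption rather than a recommendation


def hash_password(password: str, salt: bytes) -> bytes:
    """Derive a hash value from a password and salt using PBKDF2-HMAC-SHA256."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)


def check_password(attempt: str, salt: bytes, stored_hash: bytes) -> bool:
    """Hash the login attempt the same way and compare it with the stored value."""
    return hmac.compare_digest(hash_password(attempt, salt), stored_hash)


# At sign-up: store the salt and the hash value, never the password itself.
salt = os.urandom(16)
stored_hash = hash_password("correct horse battery staple", salt)

# At login: only the hash values are compared.
print(check_password("correct horse battery staple", salt, stored_hash))
print(check_password("wrong guess", salt, stored_hash))
```

The first check prints True and the second prints False, and the database never needs to hold the original password.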
Blockchain
One of the most publicized applications of hashing is blockchain, specifically with cryptocurrencies like Bitcoin. As a quick review, a blockchain is a digital collection of transactions or records about something. Since everyone has access to the same data in a blockchain, how can everyone be sure that no one has altered any of the past transactions? That’s where hashing comes in.
In a blockchain, every new record or transaction is known as a block. In the case of Bitcoin, a block includes the transfer amount, the timestamp, and the hash value for the data in the previous block. So if someone tried to alter the transaction history for a unit of Bitcoin, the hash values would change, and the transaction would be invalid. Hashing is how blockchains can ensure data authenticity and integrity when everyone has access to the same data.
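Here's a toy sketch of that chaining idea in Python. It is not how Bitcoin is actually implemented; it's a simplified model in which each block is a dictionary holding an amount, a timestamp, and the SHA-256 hash value of the previous block. All of the names and values are hypothetical.

```python
import hashlib
import json


def block_hash(block: dict) -> str:
    """Hash a block's contents (serialized with sorted keys for a stable result)."""
    encoded = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def add_block(chain: list, amount: float, timestamp: int) -> None:
    """Append a block that records the hash value of the block before it."""
    previous_hash = block_hash(chain[-1]) if chain else "0" * 64
    chain.append(
        {"amount": amount, "timestamp": timestamp, "previous_hash": previous_hash}
    )


def is_valid(chain: list) -> bool:
    """Check that every block still matches the hash stored in the next block."""
    return all(
        chain[i]["previous_hash"] == block_hash(chain[i - 1])
        for i in range(1, len(chain))
    )


chain = []
add_block(chain, 5.0, 1)
add_block(chain, 2.5, 2)
add_block(chain, 1.0, 3)
print(is_valid(chain))

# Tamper with an old transaction: the stored hash values no longer match.
chain[0]["amount"] = 500.0
print(is_valid(chain))
```

The first check prints True and the second prints False, because changing an old block changes its hash value, which no longer matches what the next block recorded.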
Database management
When you’re working with large databases, combing through all the different entries to find the data you need can be exhausting, but hashing can make it easier. Instead of relying on an index structure, hashing allows you to search for a data record using a search key and hash function.
Hash files store data in buckets, and each bucket can hold multiple records. Hash functions are used to map search keys to the location of a record within a bucket.
There are two hashing methods you can use in a database management system (DBMS): Static hashing and dynamic hashing.
- With static hashing, a search key and hash function always lead to the same address, and the number of buckets remains fixed. When you need to add a new record to the database, you can use a hash key to automatically generate an address for the record and store it. Then, you can use the same key to locate the record when you need to access, update, or delete it.
- Unlike static hashing, dynamic hashing (AKA extendible hashing) allows for data buckets to be created and removed as needed. This helps prevent bucket overflow, which occurs when a bucket doesn’t have enough space for a new record.
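Static hashing can be sketched in a few lines of Python. This is a simplified in-memory model, not how any particular DBMS does it: we assume four buckets and use zlib.crc32 as the hash function so that the same search key maps to the same bucket on every run. The names bucket_address, insert, and lookup, and the sample records, are hypothetical.

```python
import zlib

BUCKET_COUNT = 4  # fixed number of buckets, as in static hashing

buckets = [[] for _ in range(BUCKET_COUNT)]


def bucket_address(search_key: str) -> int:
    """Map a search key to a bucket number; the same key always maps the same way."""
    return zlib.crc32(search_key.encode("utf-8")) % BUCKET_COUNT


def insert(search_key: str, record: dict) -> None:
    """Store a record in the bucket that its search key hashes to."""
    buckets[bucket_address(search_key)].append((search_key, record))


def lookup(search_key: str):
    """Search only the one bucket the key hashes to, not the whole table."""
    for key, record in buckets[bucket_address(search_key)]:
        if key == search_key:
            return record
    return None


insert("user-1001", {"name": "Ada"})
insert("user-1002", {"name": "Grace"})
print(lookup("user-1002"))
print(lookup("user-9999"))
```

The first lookup returns Grace's record and the second returns None. Since the bucket count never changes here, a bucket can keep growing as records are added, which is the overflow problem dynamic hashing is meant to address.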
Popular hashing algorithms
Since the early days of digital computing, various hashing algorithms have been developed, each with its own methods, advantages, and disadvantages. Here are three popular hashing algorithms used today.
Message Digest 5 (MD5)
MD5 was the most common hashing algorithm in the early days of computer cryptography. But researchers found practical ways to produce collisions, and while MD5 is still widely used around the world, it’s no longer used for sensitive or confidential data.
MD5’s weakness isn’t speed but reliability: compared with the newer SHA algorithms described below, there’s a much greater chance that someone can produce the same hash value for two different inputs.
Secure Hash Algorithm (SHA)
SHA actually refers to a group of hashing algorithms. The SHA-2 family of hash functions was originally developed by the US National Security Agency (NSA) and is a widely used standard for hashing private data. It’s currently used by cryptocurrencies (Bitcoin, for example, relies on SHA-256) and the CIA.
Cyclic Redundancy Check (CRC32)
When the primary purpose of hashing is simply to detect errors and changes in data, then most people work with a cyclic redundancy check (CRC) code. Hashing with CRC32 is also a quick and easy way to check file integrity, particularly with ZIP files and files downloaded from FTP servers.
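Here's a quick sketch of a CRC32 check in Python using the standard library's zlib.crc32. The helper name crc32_hex and the sample data are our own, and the one-letter change stands in for an accidental transmission error.

```python
import zlib


def crc32_hex(data: bytes) -> str:
    """Return the CRC32 checksum of data as an 8-character hexadecimal string."""
    return format(zlib.crc32(data), "08x")


original = b"The quick brown fox jumps over the lazy dog"
corrupted = b"The quick brown fox jumps over the lazy cog"  # one byte changed

print(crc32_hex(original))
print(crc32_hex(corrupted))
print(crc32_hex(original) == crc32_hex(corrupted))
```

The last line prints False, so the change is detected. Keep in mind that CRC32 is meant for catching accidental errors; it isn't designed to resist someone deliberately crafting a matching checksum.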
Hashing is a valuable tool in cybersecurity and database management
Still, you’ll need more than hashing to excel in any of the careers listed above. Try taking online programming courses to keep building your skills — for example, learning how to implement User Authentication and Authorization in Express.