Introduction to Tries
In this article, we will discuss the data structure used to store strings for computer science applications known as a prefix tree or trie for short.
This article is structured as follows:
- Introduction to tries
- The trie node
- Inserting a string
- Searching in a trie
- Deleting a string
- Applications of a trie
Introduction to Tries
The English alphabet contains 26 letters which compose every English word possible (this idea, of course, extends to other languages as well). In computer science, we exploit this fact to create a data structure that allows us to store strings in a memory-efficient fashion.
This data structure is called a prefix tree or trie for short. It is a tree-like data structure in which each node contains links to nodes representing characters. The number of nodes depends of which language, character set, or symbols we are dealing with.
The efficiency in this data structure results from the fact that two or more strings can share a prefix, or one string can be a prefix of another string.
Consider the following set of strings:
We can store these in a trie like so:
We mark the end of a word with a red circle. Notice the word PASS is a prefix of the word PASSIVE.
The Trie Node
A node in the trie is a data structure that represents a character. It is similar to a linked list in that it contains relevant data and pointers to other trie nodes. In Python, we represent a node by defining a class that contains an array (or dictionary) that maps a character to another trie node. The dictionary in a trie node represents the set of next characters that follow the current character. For example, from the picture, the trie node for the character G will contain a dictionary that contains a mapping for the characters A and L to their respective trie nodes. We will learn how to implement in the lesson accompanying this article.
The trie node stores the relevant information regarding a word in a trie. If you are representing a dictionary as a trie, it suffices to know that a string of characters exists in the dictionary; therefore, we include an end-of-key marker in the class (the end-of-key in the diagram is denoted by the red circles).
Inserting a String
To insert a string into the trie, we have to traverse the trie starting from the root and proceeding according to the characters in our string. Remember that a trie is a tree-like data structure. Trees are typically traversed using a depth-first search (DFS); however, since the string defines the path you need to take, you can simply traverse the path iteratively much like you would a linked list.
The algorithm to insert a string into a trie runs in O(L) time, where L represents the length of the string that is being inserted. This is because when we insert a string we either follow the path of an existing prefix, or we create new nodes. For example, if we were to insert the string GLOBAL into the above trie, we would traverse through GLOB, which already exists in the trie. Then, we would add new nodes A and L linking off of GLOB, adding an end-of-key marker at the L.
Searching for a String
Searching for a string in a trie is done in a similar way to inserting a string with some slight differences. To search for a string you start with a pointer to the root and a pointer to the first character in the string. You traverse the tree along the path defined by the string you are searching for. At every node, you inspect the dictionary to see if a mapping exists between the current character and another trie node. If one exists, you proceed; if a mapping does not exist, that means the string does not exist in the trie. If you manage to reach the end of your string, you must inspect the end-of-key marker to see if this string has been entered in the trie as a full word. It could be the case that the string you are searching for is a prefix to a longer string. In this case, your search function should indicate that the string does not exist in the tree as a full entry.
Consider GLOBES. If you enter the word GLOBES into a trie and search for the word GLOBE, your function should return false as GLOBE was not entered as a stand-alone word.
Deleting a Word
Deleting a word from a trie can be a complicated process as it depends on the use case of the trie. Since a word in a trie can be the prefix to another word, care has to be taken such that we only delete the intended word without accidentally deleting any word that depends on it.
If the trie does not consume a lot of memory, a delete operation can be implemented by searching for the word and simply setting the end-of-key marker back to false. In this scenario, the word is deleted from the trie as the search operation will now return false. This does not release the memory of the nodes occupied by the word. The tradeoff for this operation is memory for simplicity. If you want to remove the nodes associated with a word, you must traverse the trie and remove every node from the word with a prefix count of one. This however requires that your trie nodes maintain a prefix count. If prefix count is not relevant in your trie application, you will not be able to delete a word this way as you would not be able to tell if this word shares a prefix with another word in the trie.
For example, let’s say we want to delete the word JOKE from our original trie:
We can keep the word JOKE in the trie and set the end-of-key character to false, but this would not reduce the memory footprint of the trie. If we were to delete the nodes from JOKE, we can only delete nodes OKE. The node J must be retained as it is a prefix of the words JANE and JAZZ.
If the trie contains other relevant data, care has to be taken to see where the new data should be placed if it is to be retained after a delete operation.
As we said, the delete operation is application-specific!
The trie has many applications in software engineering for text storage and retrieval. When you use an application with a text box, you may notice that words are suggested to you as you type (such as during a text message on a phone or a query in a search engine). The base data structure used to implement this functionality is the trie! In a real-world application, the trie that is used is more complicated than our implementation in this article as it is required to store more data than a simple prefix count and word ending, but it is the same general idea.
Another application is spell-checking! When using a word processor, all the words in a specific language are stored in a trie. When a word is typed that can’t be found in the trie, that means the word is a misspelling. An algorithm is implemented to search the trie for the most likely intended word and it is suggested to the user.
There are, of course, many other applications.
The trie is a very powerful and efficient tree-like data structure used to store strings and their relevant data. The efficiency arises due to many strings sharing prefixes. In this article, we have seen how a trie is implemented as well as how we can modify its contents.