Learn how to implement the Rabin-Karp Algorithm in Python!

In this lesson, we will implement the Rabin-Karp algorithm discussed in the previous article using Python. We assume that you are familiar with most of the common data structures and algorithms by now, so this lesson will be a step towards delving into somewhat advanced algorithmic territory!

We can use the Rabin-Karp algorithm to find all occurrences of a pattern in a given text. We can implement this in several ways, but the core idea of the algorithm remains the same: instead of matching the pattern against a substring in the text character-by-character, we will use string hashing to compare them with a single comparison of hash values. To bring the algorithm to life, here is what we will do in the exercises that follow:

* Quickly recap naive pattern-matching by implementing a brute-force algorithm to locate a pattern in a given string.
* Refresh our understanding of hashing by performing some hashing operations on strings.
* Explore different hash functions to increase efficiency in pattern-matching.
* Optimize the Rabin-Karp algorithm for peak performance.

Let's get started with tackling each of these steps.

Intro

To get a handle on a more advanced string-search algorithm like Rabin-Karp, it is a good idea to be comfortable with naive pattern-matching on a given string. In this exercise, we will refresh our understanding of the straightforward brute-force approach so that we can really appreciate what motivates the speed improvements of the Rabin-Karp algorithm!
 
To quickly recap, this is how naive pattern-matching on a given string works:
* Iterate over the entire string to find all substrings equal in length to the pattern that we are trying to match.
* Iterate over each substring, and check to see if this matches the pattern.
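
The two steps above can be sketched as follows. This is a minimal illustration rather than the exercise's reference solution; the function name `naive_pattern_matching()` is borrowed from later in this lesson, and the sample pattern and text are our own choice.

```python
def naive_pattern_matching(pattern, text):
    """Count occurrences of pattern in text by checking every alignment."""
    occurrences = 0
    m, n = len(pattern), len(text)
    # Step 1: visit every substring of text that is as long as the pattern.
    for start in range(n - m + 1):
        # Step 2: compare that substring with the pattern character by character.
        match = True
        for offset in range(m):
            if text[start + offset] != pattern[offset]:
                match = False
                break
        if match:
            occurrences += 1
    return occurrences


print(naive_pattern_matching("ABCD", "ABABCABCD"))
```

Because of the nested loops, the work grows with the product of the text length and the pattern length, which is what the rest of the lesson tries to avoid.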

Revisiting Naive Pattern-Matching

Great job so far! You are certainly an expert when it comes to searching for patterns using brute force. But judging from how long it took the last program to run, naive pattern-matching seems to be approaching the limits of its abilities. Fortunately, we can do better with a bit of help from an old friend of ours &mdash; hashing.

Let’s create a simple hash function to calculate the hash values of strings involving only the uppercase alphabet. Of course, not all strings where we want to find patterns will consist of well-arranged characters from `A-Z`, but this is a good way to get some practice before dealing with all kinds of strings.

To calculate unique hash values for a given string of uppercase letters, we must first calculate a unique value for each character somehow. How can we do this?

One idea is to map each character in the string to its corresponding ASCII value and take the product of the ASCII values. We will call this the ASCII hash. Let’s calculate the ASCII hash of some strings using this approach.
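
A minimal sketch of this ASCII hash might look like the following. The helper name `ascii_hash()` and the sample strings are our own assumptions for illustration.

```python
def ascii_hash(string):
    """Multiply together the ASCII values of all characters in string."""
    product = 1
    for character in string:
        product *= ord(character)  # ord('A') is 65, ord('B') is 66, and so on
    return product


print(ascii_hash("AB"))                    # 65 * 66
print(ascii_hash("HM"), ascii_hash("BT"))  # two different strings to compare
```

Comparing the last two values is a quick way to check whether two different strings can end up with the same hash.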



Hashing Review

It seemed like we were making decent progress, but our hash function produced the same value for two different strings! Indeed, there is only a limited set of values that we can generate using ASCII values. This is a problem because our pattern-matching algorithm can erroneously produce a match between a pattern and a substring even when they are different from one another! In other words, our hash function is not that great at producing unique values for unique strings. Can we do better?

The collision occurred because even though the two products were built from different ASCII values, they turned out to have the same prime factors, which resulted in the same value. What if we use prime numbers instead of ASCII values? Then every distinct collection of letters would have its own unique prime factorization.

In other words, instead of assigning a unique ASCII value to each character, we assign a unique prime number in increasing order as follows:

```python
{'A': 2, 'B': 3, 'C': 5, ..., 'X': 89, 'Y': 97, 'Z': 101}
```
 Since `'A'` is the first character, we assign it the first prime number: `2`. Similarly, `'B'` takes on the next prime number, `3`, and so on. The complete list of the first twenty-six prime numbers is as follows:

```python
[2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]
```

Just like how we multiplied ASCII values for each character in a substring to get its ASCII Hash, we will multiply the prime-number values (or just prime values) for each character in a substring to get its prime hash. We've already defined the first twenty-six prime numbers in the code, so let's get started!
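
As a rough sketch, assuming the primes are stored in a list that we call `PRIMES` and the function is named `prime_hash()` as in the snippets later in this lesson, the prime hash could be computed like this:

```python
# The first twenty-six primes, one per uppercase letter ('A' -> 2, ..., 'Z' -> 101).
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
          43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]


def prime_hash(string):
    """Multiply together the primes assigned to each uppercase letter."""
    product = 1
    for character in string:
        # ord(character) - ord('A') is 0 for 'A', 1 for 'B', ..., 25 for 'Z'.
        product *= PRIMES[ord(character) - ord('A')]
    return product


print(prime_hash("ABCD"))  # 2 * 3 * 5 * 7 = 210
```

This sketch only handles the uppercase letters `A-Z`; any other character would index outside the intended range.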

Hashing Collision

In trying to compute the hash values of all substrings, we can still bypass some of the calculations. That's right! We don't have to perform multiplication after multiplication to compute the hash of every substring.

You might have noticed that after calculating the hash of "ABCD" by multiplying the prime numbers, we don't necessarily have to calculate the hash of "BCDE" by yet again multiplying individual prime numbers. Instead, to recover the hash of "BCDE" from that of "ABCD", we can do something as follows:

* remove the hash of "A" from the hash of "ABCD" by dividing it out, which gives us the hash of "BCD"
* multiply the hash of "E" into the hash of "BCD" which gives us the hash of "BCDE"!

```python
prime_hash_ABCD = prime_hash('ABCD')  # prime hash of 'ABCD'
prime_hash_BCD = prime_hash_ABCD // prime_hash('A')  # divide out the prime hash of 'A'
prime_hash_BCDE = prime_hash('E') * prime_hash_BCD  # multiply in the prime hash of 'E'
```
Notice how we performed only a single division operation and a single multiplication operation to calculate the hash of `'BCDE'` instead of performing three multiplication operations as follows:

```python
prime_hash_BCDE = prime_hash('B') * prime_hash('C') * prime_hash('D') * prime_hash('E') #three multiplication operations
```
We saved some computation time in the process! It might not seem like much, but the update always costs one division and one multiplication, no matter how long the substring is, so the saving grows with the length of the substring we are searching for.
 
This is the idea of rolling hashing.

Let's calculate the hash values of all substrings of length 4 in `uppercase` but this time using the idea of rolling hashing. We will now recover the hash value of each subsequent substring of length `4` from the previous one using the divide + multiply trick we showed above. 

But to begin this domino effect, we still have to start with the hash value of the first substring of length `4` that was calculated using multiplication via `prime_hash()`. We already have access to this value via `prime_hash('ABCD')`.
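
Here is one way the divide + multiply trick could be turned into a loop. It is a sketch under a few assumptions: `uppercase` is taken to be the string `'ABC...XYZ'`, and `rolling_prime_hashes()` is a hypothetical helper name that does not come from the exercise.

```python
import string

PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
          43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]


def prime_hash(s):
    """Product of the primes assigned to each uppercase letter of s."""
    product = 1
    for character in s:
        product *= PRIMES[ord(character) - ord('A')]
    return product


def rolling_prime_hashes(text, length):
    """Return the prime hash of every substring of the given length."""
    if length <= 0 or length > len(text):
        return []
    current = prime_hash(text[:length])  # the first hash uses plain multiplication
    hashes = [current]
    for start in range(1, len(text) - length + 1):
        outgoing = prime_hash(text[start - 1])           # character leaving the window
        incoming = prime_hash(text[start + length - 1])  # character entering the window
        current = current // outgoing * incoming         # divide out, then multiply in
        hashes.append(current)
    return hashes


uppercase = string.ascii_uppercase  # assumed value of `uppercase`
hashes = rolling_prime_hashes(uppercase, 4)
print(hashes[0], hashes[1])  # hashes of 'ABCD' and 'BCDE'
```

The integer division is exact here because the outgoing character's prime is always a factor of the current hash.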

Rolling Hash

We saw a neat way to recover hashes using rolling hash, but unfortunately, we still ended up with a hashing collision for strings that contain the same letters in a different order, such as a string and its reverse. Simply multiplying primes ignores the order of the characters, so it will not make a good hash function. For this reason, the Rabin-Karp algorithm uses a Polynomial Hash Function. This is a combination of sums and products (as opposed to only products) that takes the position of each character into account, which makes collisions between different strings far less likely.

In Python, a polynomial hash of a string `'ABCD'` can be calculated as follows:

```python
ord('A') * 26**3 + ord('B') * 26**2 + ord('C') * 26**1 + ord('D') * 26**0
```
In this exercise, we will implement the same idea but automate the process using `for` loops.
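
As a sketch of what that loop might look like, assuming the base `26` from the formula above and a function named `polynomial_hash()` (the name used in the snippets later in this lesson):

```python
def polynomial_hash(string, base=26):
    """Sum of ord(character) * base**(distance from the end of the string)."""
    hash_value = 0
    length = len(string)
    for index in range(length):
        exponent = length - 1 - index  # the first character gets the highest power
        hash_value += ord(string[index]) * base ** exponent
    return hash_value


print(polynomial_hash("ABCD"))
print(polynomial_hash("AB"), polynomial_hash("BA"))  # order now matters
```

Unlike the product of primes, this hash changes when the same characters appear in a different order.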


Polynomial Hash Function

You are probably wondering why we didn't make use of any "rolling" property of the polynomial hash function and just calculated the hash values for all substrings by adding up the individual contributions of all the characters each time. And you are correct! As we saw in an earlier exercise, we can exploit this "rolling" property here as well to reduce the number of operations and speed up our computation time!

Instead of calculating the hash values of all substrings using the polynomial hash function definition every time, let's try to recover the hash values of subsequent substrings just like we did with prime rolling hashing.

For example, we could recover the same polynomial hash of `'BCDE'` using fewer computations from the polynomial hash of `'ABCD'` as follows:

```python
polynomial_hash_ABCD = polynomial_hash('ABCD')
rolling_hash_BCDE = polynomial_hash_ABCD - polynomial_hash('A') + polynomial_hash('E')
```
This seems plausible. After all, we used a similar divide + multiply trick for prime rolling hashing and are now using a subtract + add trick for polynomial rolling hashing. What could go wrong? Let's find out!
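
One way to find out is to compare the result against a hash computed from scratch. The sketch below assumes a `polynomial_hash()` helper with base `26`, like the formula shown earlier. It also tries a commonly used update (subtract the leading term, shift everything by the base, then add the new character); treating that as the fix is our assumption about where the exercise is heading.

```python
BASE = 26


def polynomial_hash(string, base=BASE):
    """Polynomial hash: the first character carries the highest power of base."""
    hash_value = 0
    for character in string:
        hash_value = hash_value * base + ord(character)
    return hash_value


direct = polynomial_hash("BCDE")  # computed from scratch, for reference

# Plain subtract + add, as in the snippet above.
naive = polynomial_hash("ABCD") - polynomial_hash("A") + polynomial_hash("E")

# Remove the leading term, shift the remaining terms up by one power, add 'E'.
rolled = (polynomial_hash("ABCD") - ord("A") * BASE ** 3) * BASE + ord("E")

print(naive == direct)   # False
print(rolled == direct)  # True
```

The plain subtract + add version ignores that `'A'` was weighted by `26**3` and that every remaining character moves up by one power when the window slides.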


Exploiting the “Rolling” Property Again!

We now have all the tools in place to complete the full implementation of the Rabin-Karp algorithm! All that is left is to compute the hash of the pattern and compare it with substrings of the same length in the text. If they hash to the same value, we treat it as a match. (Because two different strings can occasionally share a hash value, a careful implementation also compares the characters directly before counting the match.) Here is what the pseudocode might look like:

* Find the first substring in `text` equal in length to `pattern`.
* Compare its polynomial hash with that of `pattern`.
* If they are equal, increase `occurrences` by `1`.
* Iterate through the remaining substrings in order.
* If the polynomial hash of any substring matches that of `pattern`, increase `occurrences` by `1`.
* Return `occurrences`.
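
The pseudocode above could be sketched as follows. This is one possible implementation, not the lesson's reference solution: the base `26`, the helper name `polynomial_hash()`, and the direct string comparison that confirms each hash match are our own assumptions. Many implementations also reduce the hash modulo a large prime, which this sketch leaves out because Python integers can grow as large as needed.

```python
def polynomial_hash(string, base=26):
    """Polynomial hash: the first character carries the highest power of base."""
    hash_value = 0
    for character in string:
        hash_value = hash_value * base + ord(character)
    return hash_value


def rabin_karp_algorithm(pattern, text, base=26):
    """Count occurrences of pattern in text using a polynomial rolling hash."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return 0
    pattern_hash = polynomial_hash(pattern, base)
    window_hash = polynomial_hash(text[:m], base)  # hash of the first substring
    high_power = base ** (m - 1)                   # weight of the leading character
    occurrences = 0
    for start in range(n - m + 1):
        # Equal hashes suggest a match; the direct comparison rules out collisions.
        if window_hash == pattern_hash and text[start:start + m] == pattern:
            occurrences += 1
        if start + m < n:
            # Roll the window: drop text[start], shift, bring in text[start + m].
            window_hash = (window_hash - ord(text[start]) * high_power) * base
            window_hash += ord(text[start + m])
    return occurrences


print(rabin_karp_algorithm("ABCD", "ABABCABCD"))
```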

Rabin-Karp Algorithm

Congratulations! You have reached a significant milestone in your programming career &mdash; implementing an advanced string-matching algorithm from scratch! Here is a quick recap of what we accomplished together throughout this lesson:
 
* Recapped naive pattern-matching and its limits: a brute-force sequential approach that checks for a pattern match against all substrings performed reasonably well until we ran into a rather large input.
* Looked for ways to speed up pattern-matching by revisiting hashing: assigning a somewhat random and unique value to a pattern to find it in a given text quickly.
* Explored various ideas involving hashing, starting off with assigning ASCII values to characters, which quickly led to problems when different strings produced the same hash value.
* Improved on the previous idea so that characters are assigned prime numbers instead of ASCII values, which gave a unique prime factorization that led to a unique hash value for most strings (except strings that contain the same letters in a different order).
* Implemented the key idea of rolling hashing where we recovered the hash value of the next substring from that of the previous substring by dividing out the hash of the old character and multiplying in the hash of the new character, thereby performing way fewer calculations in the process!
* Implemented the idea of a Polynomial Hash which takes the best of both addition and multiplication (including higher powers of a number) to create a much better hash function for pattern-matching.
* Combined the ideas of rolling hashing and polynomial hashing to implement the Polynomial Rolling Hash, which allowed us to quickly calculate the hash values of subsequent substrings from that of previous substrings, thereby allowing faster pattern-matching.
* Brought all these ideas together to implement the Rabin-Karp Algorithm and tested it on large hidden inputs where naive pattern-matching would have simply timed out.
* Counted occurrences of a large pattern in an even larger text significantly faster using the Rabin-Karp algorithm on the same input that Naive Pattern-Matching wasn't able to handle so well, and also explored the limits of Rabin-Karp.
 
We hope you learned a lot from this lesson and have deepened your understanding of an advanced string algorithm while also improving your ability to think algorithmically. This isn't the end of the road though! Next up is the Knuth-Morris-Pratt algorithm, an even more advanced pattern-matching algorithm that can find a pattern even faster!


Summary

Implementing the Rabin-Karp Algorithm in Python

Learn about two powerful string searching methodologies: the Rabin-Karp algorithm and the Knuth-Morris-Pratt algorithm!

String Searching Algorithms

## Introduction
In this article, we will cover the **Knuth-Morris-Pratt algorithm** (KMP algorithm). Like the Rabin-Karp algorithm, the Knuth-Morris-Pratt algorithm is an efficient string searching algorithm. It has a worst-case running time of `O(n + m)` for a text of length `n` and a pattern of length `m`, making it significantly faster than brute force. To understand how the algorithm works, we will walk through a few examples.

## Speeding Up Brute Force Approach

When solving a pattern-matching problem using brute force, we perform many unnecessary checks. In other words, we are not solving the problem efficiently.

For example, consider the pattern `'ABCD'` that we are trying to find in the text `'ABABCABCD'`. Both strings are lined up in parallel from index `0`, and the search proceeds as follows:

```tex
\begin{array}{c}
0&1&2&3&4&5&6&7&8 \\
\color{green} \text{A} & \color{green} \text{B} & \color{red} \text{A} & \text{B} &\text{C} & \text{A} &\text{B} & \text{C} &\text{D} \\
\color{green} \text{A} & \color{green} \text{B} & \color{red} \text{C} & \text{D}
\end {array}
```

The first two characters matched correctly but there was a mismatch in the third character. The brute-force algorithm would now proceed as follows:

```tex
\begin{array}{c}
0&1&2&3&4&5&6&7&8 \\
 \text{A} & \color{red} \text{B} & \text{A} & \text{B} &\text{C} & \text{A} &\text{B} & \text{C} &\text{D} \\
\ & \color{red} \text{A} & \text{B} & \text{C} & \text{D}
\end {array}
```

However, having already matched `'AB'` in the previous iteration, we already knew that the second character was a `'B'`. So it wasn't efficient to shift by `1` since we already knew that would produce a mismatch. A more efficient shift would have been to directly jump to index `2` as follows:

```tex
\begin{array}{c}
0&1&2&3&4&5&6&7&8 \\
 \text{A} & \text{B} & \color{green} \text{A} & \color{green} \text{B} & \color{green} \text{C} & \color{red} \text{A} &\text{B} & \text{C} &\text{D} \\
\ & \ & \color{green} \text{A} & \color{green} \text{B} & \color{green} \text{C} & \color{red} \text{D}
\end {array}
```

The first three characters `'ABC'` matched correctly, but the fourth character `'D'` produced a mismatch. Now, instead of shifting by `1`, we can be more clever and use information from the three matched characters. Since we know for sure that we won't be finding an `'A'` before index `5`, a more efficient shift would be to shift by `3` and jump directly to index `5`:

```tex
\begin{array}{c}
0&1&2&3&4&5&6&7&8 \\
 \text{A} & \text{B} & \text{A} & \text{B} & \text{C} & \color{green} \text{A} & \color{green} \text{B} & \color{green} \text{C} & \color{green} \text{D} \\
\ & \ & \ & \ & \ & \color{green} \text{A} & \color{green} \text{B} & \color{green} \text{C} & \color{green} \text{D}
\end {array}
```

## Knuth-Morris-Pratt Algorithm

This is the key idea of the KMP algorithm. If we use information about the pattern to perform a more efficient shift, we can cut down on the running time of the algorithm.

So we need to use information about the pattern to calculate more optimal shifts by skipping characters that will guarantee a mismatch. But how do we go about finding the exact next index that we should jump to? 

In order to make the process easier to formalize, we will deliberately handpick a pattern-text pair with specific characteristics, which will also help us generalize towards the complete algorithm.

Consider the pattern `'ABABAC'` that we are looking to match in the text `'ABABABAC'`. As usual, the search proceeds as follows:

```tex
\begin{array}{c}
0&1&2&3&4&5&6&7\\
\color{green} \text{A} & \color{green} \text{B} & \color{green} \text{A} & \color{green} \text{B} & \color{green} \text{A} & \color{red} \text{B} & \text{A} & \text{C} \\
\color{green} \text{A} & \color{green} \text{B} & \color{green} \text{A} & \color{green} \text{B} & \color{green} \text{A} & \color{red} \text{C}
\end {array}
```

In addition to the first five characters `'ABABA'` matching correctly, notice that the segment `'ABA'` repeats once at the front and once at the back in `'ABABA'` as follows:

```tex
\begin{array}{c}
\color{green} \fbox{\color{green} \text{A} \color{green} \text{B} \color{green} \text{A}} \color{green} \text{B} \color{green} \text{A} \color{red} \text{C}\\
\color{green} \text{A} \color{green} \text{B} \fbox{\color{green} \text{A} \color{green} \text{B} \color{green} \text{A}} \color{red} \text{C}
\end {array}
```

We call the segment `'ABA'` that matches at the front a **prefix** of the substring `'ABABA'`. Similarly, the segment `'ABA'` that matches at the back is a **suffix** of the substring `'ABABA'`. Both segments are also **proper prefixes and suffixes**, respectively, since they don't equal the substring `'ABABA'` itself.
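
To make these definitions concrete, here is a small sketch that lists the proper prefixes and proper suffixes of `'ABABA'` and picks out the segments that appear in both lists. The helper names `proper_prefixes()` and `proper_suffixes()` are hypothetical, and the empty string is left out for simplicity.

```python
def proper_prefixes(string):
    """All non-empty prefixes that are shorter than the string itself."""
    return [string[:length] for length in range(1, len(string))]


def proper_suffixes(string):
    """All non-empty suffixes that are shorter than the string itself."""
    return [string[-length:] for length in range(1, len(string))]


matched = "ABABA"
print(proper_prefixes(matched))  # ['A', 'AB', 'ABA', 'ABAB']
print(proper_suffixes(matched))  # ['A', 'BA', 'ABA', 'BABA']

# Segments that are both a proper prefix and a proper suffix.
suffixes = proper_suffixes(matched)
shared = [prefix for prefix in proper_prefixes(matched) if prefix in suffixes]
print(shared)  # ['A', 'ABA']
```

The longest shared segment, `'ABA'`, is the one highlighted in the diagram above.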

Why is this useful? If we can find a **proper suffix** that is also a **proper prefix** (or vice-versa) in the pattern (up to the point where matches have occurred as above), we can now shift to the **location of the proper suffix in the text** directly. This is because all other characters leading up to the proper suffix are guaranteed to mismatch, so we can just skip over them.

```tex
\begin{array}{c}
0&1&2&3&4&5&6&7\\
 \text{A} & \text{B} & \color{green} \text{A} & \color{green} \text{B} & \color{green} \text{A} & \color{green} \text{B} & \color{green} \text{A} & \color{green} \text{C} \\
\ & \ & \color{green} \text{A} & \color{green} \text{B} & \color{green} \text{A} & \color{green} \text{B} & \color{green} \text{A} & \color{green} \text{C}
\end {array}
```

So we need to find the longest proper suffix of the partially matched pattern that is also a proper prefix of it. How can we go about finding it? This is exactly the job of the **prefix function**. Before we implement it in the next lesson, let's first see how to build one by hand.

A prefix function maps the pattern to an array of numbers: for each index, the length of the longest proper suffix ending at that index that is also a proper prefix of the pattern. Let’s unpack that with an example.

Consider the same pattern we used earlier &mdash; `'ABABAC'`. Here is how we can compute the values of the prefix function associated with this pattern:

* We start with the substring ending at index `0` &mdash; `'A'`. A proper suffix of a substring cannot equal the substring itself, so it must be shorter in length than `'A'`. For this reason, the length of the longest proper suffix ending at index `0` is `0`. We don't even have to check if there's an equal corresponding proper prefix. Thus, `pi[0] = 0`.


```tex
\begin{array}{c}
\text{i}&0&1&2&3&4&5\\
 \text{pattern} & \color{blue} \text{A} & \text{B} & \text{A} & \text{B} & \text{A} & \text{C} \\
\text{pi[i]} & 0
\end {array}
```

* The substring ending at index `1` is `'AB'`. The longest proper suffix ending at index `1` is `'B'`. However, there isn't a proper prefix that also equals `'B'`. Thus, the length of the longest proper suffix ending at index `1` that is also a proper prefix is `0`, so `pi[1] = 0`.

```tex
\begin{array}{c}
\text{i}&0&1&2&3&4&5\\
 \text{pattern} & \color{blue} \text{A} &  \color{blue} \text{B} & \text{A} & \text{B} & \text{A} & \text{C} \\
\text{pi[i]} & 0 & 0
\end {array}
```

* The substring ending at index `2` is `'ABA'`. This is a more interesting case. The longest proper suffix of this substring is `'BA'`. There isn't a proper prefix that also equals `'BA'`. However, notice that the next longest proper suffix ending at index `2` is `'A'`, which is also a proper prefix. Thus, `pi[2] = 1`.

```tex
\begin{array}{c}
\text{i}&0&1&2&3&4&5\\
 \text{pattern} & \color{blue} \text{A} &  \color{blue} \text{B} & \color{blue} \text{A} & \text{B} & \text{A} & \text{C} \\
\text{pi[i]} & 0 & 0 & 1
\end {array}
```

* The substring ending at index `3` is `'ABAB'`. The longest proper suffix that is also a proper prefix in this substring is `'AB'`, whose length is `2`. Thus, `pi[3] = 2`.

```tex
\begin{array}{c}
\text{i}&0&1&2&3&4&5\\
 \text{pattern} & \color{blue} \text{A} & \color{blue} \text{B} & \color{blue} \text{A} & \color{blue} \text{B} & \text{A} & \text{C} \\
\text{pi[i]} & 0 & 0 & 1 & 2
\end {array}
```

* The substring ending at index `4` is `'ABABA'`. The longest proper suffix that is also a proper prefix in this substring is `'ABA'`, whose length is `3`. Thus, `pi[4] = 3`.

```tex
\begin{array}{c}
\text{i}&0&1&2&3&4&5\\
 \text{pattern} & \color{blue} \text{A} & \color{blue} \text{B} & \color{blue} \text{A} & \color{blue} \text{B} & \color{blue} \text{A} & \text{C} \\
\text{pi[i]} & 0 & 0 & 1 & 2 & 3
\end {array}
```

* The substring ending at index `5` is `'ABABAC'`. There is no proper suffix that is also a proper prefix in this substring. Thus, `pi[5] = 0`.

```tex
\begin{array}{c}
\text{i}&0&1&2&3&4&5\\
 \text{pattern} & \color{blue} \text{A} & \color{blue} \text{B} & \color{blue} \text{A} & \color{blue} \text{B} & \color{blue} \text{A} & \color{blue} \text{C} \\
\text{pi[i]} & 0 & 0 & 1 & 2 & 3 & 0
\end {array}
```


## Conclusion
We have now seen how the KMP algorithm works through the use of a prefix function. In the next lesson, we are going to write our own powerful algorithm from scratch and see how it compares to the Rabin-Karp and brute force algorithms. Happy Coding!

Learn about the Knuth-Morris-Pratt algorithm, a powerful string-searching approach!

Introduction to the Knuth-Morris-Pratt Algorithm

Learn how to implement the Knuth-Morris-Pratt algorithm in Python!

In this lesson, we will implement the Knuth-Morris-Pratt (KMP) algorithm discussed in the previous article.

To recap, we use the KMP algorithm to find patterns in a given string, much faster than a typical brute-force approach. You might also remember that the prefix function lies at the heart of the KMP algorithm. In fact, the ideas involved in computing the prefix function are so central to the KMP algorithm that their corresponding implementations are strikingly similar!

If we showed you the elegant implementations of the KMP algorithm and the prefix function right now, the whole thing would almost seem to work like magic! However, all magic in programming arises from bits and pieces of code working together in a logical and inter-related fashion. Together, we will demystify them both by uncovering the inner workings, but it will also require grappling with some non-trivial ideas.

We hope you will have an intuitive grasp of what is going on behind the scenes by the end of this lesson.

On that note, here is an overview of what we will be covering in the following exercises to implement the Knuth-Morris-Pratt Algorithm:

* Revisiting the simplest way to find a pattern in a text using a brute-force algorithm.
* Improving over the brute-force algorithm by noticing that we can use information about the pattern.
* Accomplishing this goal by computing the prefix function for the pattern.
* Identifying some clever manipulations to reduce the running time of computing the prefix function.
* Using the pre-computed values in the prefix function to perform faster pattern-matching - the Knuth-Morris-Pratt Algorithm!

Let’s get started!

In the article, we saw how to compute the prefix function by hand; let’s implement it in Python!

The prefix function is typically implemented as a Python list. It is best understood (and appreciated) when implemented in increasing levels of efficiency. We will start off with the simplest way to do this, which would be to translate the following information from the previous exercise into code:

* A prefix function maps a pattern to an array of numbers.
* At each index, the array contains the length of the longest proper suffix up to that index in the pattern.
* The additional requirement is that the proper suffix must also be a proper prefix in the pattern.

Here is what the pseudocode would look like:

* Iterate through each index of the entire pattern whose prefix function we would like to build.
* For each index, iterate through all possible lengths of proper prefixes up to that index.
* For each proper prefix, check if there is a proper suffix that is equal to it.
* If there is one, set the value of the prefix function at the current index equal to the length of the proper prefix/suffix.
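
A direct translation of this pseudocode might look like the sketch below. The function name `prefix_function_brute_force()` is our own and is not taken from the exercise.

```python
def prefix_function_brute_force(pattern):
    """For each index i, the length of the longest proper prefix of
    pattern[:i + 1] that is also a suffix of it."""
    pi = [0] * len(pattern)
    for i in range(len(pattern)):
        # Proper prefixes of pattern[:i + 1] have lengths 1 through i.
        for length in range(1, i + 1):
            prefix = pattern[:length]
            suffix = pattern[i - length + 1:i + 1]  # suffix of that length ending at i
            if prefix == suffix:
                pi[i] = length  # later (longer) matches overwrite shorter ones
    return pi


print(prefix_function_brute_force("ABABAC"))  # [0, 0, 1, 2, 3, 0]
```

The two nested loops, together with the slice comparison inside them, are what make this version slow on long patterns.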



Implementing the Prefix Function

As we saw in the previous exercise, the brute-force algorithm to compute the prefix function requires double `for` loops resulting in a rather poor running time. However, it turns out that we can exploit the structure of a prefix function by employing some nifty tricks in order to speed up the computation of the same values!
 
An important step in optimizing algorithms is to make clever observations about the most crucial elements of the problem. And indeed, we start by asking - what can we say about the sequence of consecutive values of the prefix function?

```tex
\begin{array}{c}
\text{i}&0&1&2&3&4&5\\
 \text{pattern} & \color{blue} \text{A} & \color{blue} \text{B} & \color{blue} \text{A} & \color{blue} \text{B} & \color{blue} \text{A} & \color{blue} \text{C} \\
\text{pi[i]} & 0 & 0 & 1 & 2 & 3 & 0
\end {array}
```

You might have already noticed that each consecutive value can increase by at most `1`. This is because moving to the next index adds only one character, so the length of a valid suffix can grow by no more than `1`.

What are the implications of this subtle observation? If the next value cannot increase by more than `1`, it means we are only left with three cases:

* The value can increase by `1`.
* The value can stay the same.
* The value can actually decrease by a certain amount.

Let's handle these three cases separately and see how we can use this logic to speed up our code!
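
One way to use this observation, sketched here under our own naming (`prefix_function_quadratic()` is hypothetical), is to start each search at the largest length the new value could possibly have, `pi[i - 1] + 1`, and work downwards. The exercise may organize the three cases differently.

```python
def prefix_function_quadratic(pattern):
    """Prefix function that only tries lengths up to pi[i - 1] + 1."""
    pi = [0] * len(pattern)
    for i in range(1, len(pattern)):
        # Try the largest candidate first: increase by 1, then stay the same,
        # then every smaller length (a decrease).
        for length in range(pi[i - 1] + 1, 0, -1):
            if pattern[:length] == pattern[i - length + 1:i + 1]:
                pi[i] = length
                break  # the first hit is the longest valid length
    return pi


print(prefix_function_quadratic("ABABAC"))  # [0, 0, 1, 2, 3, 0]
```

If no candidate length matches, `pi[i]` keeps its initial value of `0`.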


Computing the Prefix Function Faster - Part 1

In the last exercise, we brought down the running time of computing the prefix function to O(n^2). But the question remains - can we do better?

Believe it or not, the prefix function can be computed even faster - in just linear time! It does require grappling with some non-obvious ideas that can be a bit tricky to wrap your head around. But since we’ve already laid out important pieces of the puzzle and gradually progressed this far, it should only be a matter of time before we get it running.

Given that we know the length of each longest proper suffix up to an index that is also a proper prefix, the key question that we need to ask is: how can we use this information to derive the length of the longest proper suffix that ends at the next index and is also a proper prefix?

```tex
\begin{array}{c}
\text{i}&0&1&2&3&4&5\\
 \text{pattern} & \text{A} & \text{B} &  \text{A} & \text{B} & \text{A} & \text{C} \\
\text{pi[i]} & 0 & 0 & 1 & 2 & ? & ?
\end {array}
```

Or, to put it in programming terms, the question we are asking is the following:

_Is `pattern[i]` equal to `pattern[pi[i - 1]]`?_

Why is it important for these two to be equal? Think about it: `pi[i - 1]` can also be interpreted as the length of the longest proper prefix that is also a suffix ending at `i - 1`. Because indices start at `0`, if we index `pi[i - 1]` into the pattern as `pattern[pi[i - 1]]`, we are actually accessing the character right after that longest proper prefix, as follows:

```tex
\begin{array}{c}
\text{i}&0&1&2&3&4&5\\
 \text{pattern} & \color{red} \fbox{\text{A}} & \color{green} \fbox{\text{B}} & \color{blue} \fbox{\text{A}} & \text{B} & \text{A} & \text{C} \\
\text{pi[i]} & 0 & 0 & 1 & 2 & ? & ?\\
\text{pi[i-1]} & 0 & 0 & 0 & 1 & 2\\
\text{pattern[pi[i-1]]} & \text{A} & \text {A} & \color{red} \fbox{\text{A}} & \color{green} \fbox{\text{B}} & \color{blue} \fbox{\text{A}} 
\end {array}
```

If the next character in the pattern, `pattern[i]`, matches `pattern[pi[i-1]]`, this means `pi[i]` is larger than `pi[i-1]` by exactly `1`. This is because we will be adding the same character `pattern[i]` to extend the valid suffix ending at `i-1` (which has already been discovered) to form the next valid suffix ending at `i` as follows:

```tex
\begin{array}{c}
\text{i}&0&1&2&3&4&5\\
 \text{pattern} & \text{A} & \text{B} & \text{A} & \text{B} & \fbox{\text{A}} & \text{C} \\
\text{pi[i]} & 0 & 0 & 1 & 2 & \color{green} 3 & ?\\
\text{pi[i-1]} & 0 & 0 & 0 & 1 & 2 & 3\\
\text{pattern[pi[i-1]]} & \text{A} & \text {A} & \text{A} & \text{B} & \fbox{\text{A}} & \text{B} 
\end {array}
```

In the opposite case that `pattern[i] != pattern[pi[i - 1]]`:

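* Look for a shorter valid suffix. Let `j = pi[i - 1] - 1`, the index of the last character of the current valid prefix. If `pi[i - 1]` is already `0`, there is nothing shorter to try and `pi[i] = 0`.
* Index into the prefix function with this index as `pi[j]`. This gives the length of the next shorter valid suffix, the one ending at index `j`.
* Index into `pattern` with this length as `pattern[pi[j]]`. This gives the character next to that shorter valid prefix.
* Check to see if `pattern[i] == pattern[pi[j]]`. If it does not, we can loop the entire process above from this point onwards with the shorter length.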
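
Putting both cases together gives the sketch below. Note one difference in bookkeeping: here `j` holds a *length* rather than an index, so the fallback step reads `pi[j - 1]`, which plays the same role as `pi[j]` in the steps above. The function name `prefix_function()` is our own assumption.

```python
def prefix_function(pattern):
    """Compute the prefix function of pattern in linear time."""
    pi = [0] * len(pattern)
    for i in range(1, len(pattern)):
        j = pi[i - 1]  # length of the valid prefix for the substring ending at i - 1
        # Mismatch: fall back to the next shorter valid prefix until one
        # can be extended or none is left.
        while j > 0 and pattern[i] != pattern[j]:
            j = pi[j - 1]
        # Match: the valid prefix grows by one character.
        if pattern[i] == pattern[j]:
            j += 1
        pi[i] = j
    return pi


print(prefix_function("ABABAC"))  # [0, 0, 1, 2, 3, 0]
```

Each pass of the `while` loop shrinks `j`, and `j` grows by at most `1` per index, which is why the total work stays proportional to the length of the pattern.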

Computing the Prefix Function Faster - Part 2

Let's bring everything together to create the Knuth-Morris-Pratt algorithm!

The pseudocode for the KMP algorithm is extremely similar to that of the computation of the prefix function. In both cases, we are matching strings against each other using the same idea &mdash; the prefix function is a result of matching the pattern against itself, whereas the KMP algorithm uses the values of the prefix function to match the pattern against the text.

During the computation of the prefix function, we were interested in finding out whether the next character forms a new valid suffix. 

In the KMP algorithm, we are now interested in finding out the same for each character in the text. Whenever possible, we align a valid prefix of the pattern with its corresponding valid suffix in the text. This allows us to skip over characters we know cannot produce a match.

The pseudocode breakdown looks something like this:

* We start the algorithm by matching the pattern against the text character-by-character.
* If there is a mismatch, we look for a shorter valid prefix that can potentially align with an existing valid suffix among the characters that have already matched (similar to how we computed the prefix function).
* However, unlike last time, we don’t have to look for the length of a valid prefix all over again &mdash; we have already computed it and stored the corresponding length in the prefix function!
* Finally, if at any point the number of matched characters equals the length of the pattern, we will have found an occurrence of the pattern in the text.
* We can repeat the algorithm to find the next occurrence by looking for a shorter valid prefix length.
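
The breakdown above could be sketched as follows. The function name `kmp_algorithm()` comes from this lesson, while `prefix_function()` and the sample inputs are assumptions made so that the example runs on its own.

```python
def prefix_function(pattern):
    """Length of the longest proper prefix that is also a suffix, per index."""
    pi = [0] * len(pattern)
    for i in range(1, len(pattern)):
        j = pi[i - 1]
        while j > 0 and pattern[i] != pattern[j]:
            j = pi[j - 1]
        if pattern[i] == pattern[j]:
            j += 1
        pi[i] = j
    return pi


def kmp_algorithm(pattern, text):
    """Count occurrences of pattern in text using the prefix function."""
    if not pattern:
        return 0
    pi = prefix_function(pattern)
    occurrences = 0
    matched = 0  # number of pattern characters currently matched
    for character in text:
        # Mismatch: reuse the precomputed lengths instead of starting over.
        while matched > 0 and character != pattern[matched]:
            matched = pi[matched - 1]
        if character == pattern[matched]:
            matched += 1
        if matched == len(pattern):
            occurrences += 1
            matched = pi[matched - 1]  # keep going to find the next occurrence
    return occurrences


print(kmp_algorithm("ABABAC", "ABABABAC"))
```

Notice that the text is scanned once from left to right and never revisited, which is the main difference from the brute-force approach.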


The Knuth-Morris-Pratt (KMP) Algorithm

Great job with getting `kmp_algorithm()` to work! Now we can come full circle and put it to the test against the same pattern and text that took `naive_pattern_matching()` a really long time to figure out back in Exercise two. 

Then, we will put it to the test against an even bigger input that took `rabin_karp_algorithm()` a really long time to run.




Rabin-Karp vs KMP

In this lesson, we implemented the Knuth-Morris-Pratt Algorithm in Python to count the number of occurrences of a pattern in a given text. Even though we started out with a couple of brute-force approaches that resulted in large running times, we were able to employ some clever tricks and observations that allowed us to condense the underlying rudimentary details into something quite short and elegant! Here is a quick recap of what we did in this lesson:

* Reviewed the simplest way to find a pattern in a text using a brute-force algorithm.
* Explored ideas to improve over the brute-force algorithm by noticing that we can use information about the pattern &mdash; specifically, the prefixes and suffixes of the pattern &mdash; to skip characters in the text where the pattern cannot match.
* Accomplished this goal by computing the prefix function: for each index, the length of the longest proper suffix ending there that is also a proper prefix of the pattern.
* Identified some clever manipulations to reduce the running time of computing the prefix function, all the way from O(n^3) down to O(n).
* Used the pre-computed values in the prefix function to perform faster pattern-matching via the Knuth-Morris-Pratt algorithm and observed its faster performance in comparison to Naive Pattern-Matching and the Rabin-Karp Algorithm.

We hope you learned a lot in this lesson and also enjoyed the process of algorithmic thinking! It is no small feat to be able to grasp the inner workings of complicated algorithms like Knuth-Morris-Pratt. Some of the greatest minds in history came up with these algorithms after decades of work! 

If something feels like it hasn't fully clicked yet, you are not alone! Feel free to come back to this lesson and go through it as many times as you need before things start making sense. In the meantime, you can head over to work on Projects to test and apply your understanding.

Review

Implementing the Knuth-Morris-Pratt Algorithm in Python

It is a string-search algorithm that searches for the occurrence of a specific string in a given text.

It is a tree-like structure that contains information on various strings.

Searching for a keyword across all websites on the internet

Searching for multiple keywords across all websites on the internet

Searching for a single word across all articles on Wikipedia

Searching for multiple words across all articles on Wikipedia

It would struggle to handle all of the options.

The hash values of &mdash; the current substring, its leading character, and each of the leading characters of the remaining substrings

The hash values of &mdash; the current substring and its leading character

The hash values of &mdash; the current substring, its leading character, and the leading character of the next substring

The hash values of &mdash; the current substring and the leading character of the next substring

It has a best-case running time of O(n + m).

It has an average-case running time of O(n + m).

It has a worst-case running time of O(nm).

Assess your Rabin-Karp algorithm knowledge.

It requires pre-computing the values of the prefix function which are looked up later to save computation time.

Pre-computing the prefix function in advance is equivalent to computing the same values of the prefix function as the algorithm is running.

It is an example of dynamic programming i.e. computing and storing important results in advance to avoid re-computing them later.

The algorithm itself is analogous to computing the prefix function.

It should always produce the correct result, so it is as optimal as the KMP Algorithm.

Even though it should always produce the correct result, it is often too slow to compute these results.

It is too simple to serve as a starting point for building toward the KMP algorithm.

It is much more efficient than the Knuth-Morris-Pratt algorithm in most cases.

A prefix is a substring that starts at the first character of the string.

A suffix is a substring that ends at the last character of the string.

A proper prefix is a prefix that is not equal to the string itself.


A proper suffix is a suffix that is not equal to the string itself.


A proper prefix can have the same length as a proper suffix.


A proper prefix can have the same range of indexes as a proper suffix.


For each index, it records the length of the longest proper prefix (up to that index) which is also a proper suffix of the string.


For each index, it records the length of the longest proper prefix (up to that index) which is also a suffix of the string.


For each index, it records the length of the longest prefix (up to that index) which is also a proper suffix of the string.


For each index, it records the length of the longest prefix (up to that index) which is also a suffix of the string.

It can be computed in O(n^2) running time.

It can be computed in O(n^3) running time.

It can be computed in all three of these levels of time complexity.

It has a worst-case running time of O(m + n).

It has a worst-case running time of O(n).

It has a worst-case running time of O(m).

It has a worst-case running time of O(mn).

Assess your knowledge of the Knuth-Morris-Pratt algorithm.

Knuth-Morris-Pratt Algorithm

In this project, you will build on the Rabin-Karp Algorithm to handle multiple patterns of varying lengths. Once this has been done, you will extend the functionality of the algorithm to find two-dimensional patterns in a two-dimensional text!


To kick things off, you will continue right where you left off with the final code of the Rabin-Karp Algorithm from the implementation lesson. However, instead of taking in a single pattern as one of the inputs, modify the parameter to take in a list of multiple patterns.


In order to handle matching multiple patterns, you will have to extend the algorithm to compute the hash value of not just a single pattern but of all the patterns in the list. In the simplest implementation of the Rabin-Karp algorithm, you did the following:

```python
pattern_hash = polynomial_hash(pattern)
``` 
This was because there was only a single pattern. This time, however, you will have to compute the hash values of all the patterns and store them in an appropriate data structure.

For simplicity, you can assume that each pattern will be unique and is guaranteed to have a unique polynomial hash that results in no collisions.
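
One possible data structure is a dictionary that maps each hash value to its pattern. The sketch below is only an illustration: `polynomial_hash` here is a stand-in for the function from the lesson, and `ALPHABET_SIZE`, `MODULUS`, and `hash_patterns` are assumptions made for this example.

```python
ALPHABET_SIZE = 256        # assumed: one value per extended-ASCII character
MODULUS = 1_000_000_007    # assumed: a large prime to keep hash values small


def polynomial_hash(s):
    """Stand-in polynomial hash (Horner's rule); the lesson's version may differ."""
    h = 0
    for ch in s:
        h = (h * ALPHABET_SIZE + ord(ch)) % MODULUS
    return h


def hash_patterns(patterns):
    """Map each pattern's hash to the pattern (hashes assumed collision-free)."""
    return {polynomial_hash(p): p for p in patterns}


pattern_hashes = hash_patterns(["AB", "C"])
```

A dictionary gives constant-time lookups on average, so checking whether a substring hash belongs to any pattern stays cheap no matter how many patterns there are.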

Next, you will continue this idea of extending the code to store information about all the patterns. Similar to how you kept track of the total number of occurrences of a single pattern by initializing a variable `occurrences = 0`, this time you will create an appropriate data structure to keep track of the occurrences of each of the patterns.

Remember that the key idea of the Rabin-Karp algorithm remains the same. You want to compute the hash values of all the patterns and try to see if the same hash value is generated by some substring in the text, in which case the two can be compared to verify a match.

Previously, there was only a single pattern with a fixed length, and so you only had to compute the hash values of all substrings in the text of the same length. With multiple patterns, however, there can be multiple lengths, and these need to be handled separately.

This means you need to keep track of all possible pattern lengths in the input and compute the hash values of all substrings in the text with each of these lengths. Think about how you might be able to approach this problem before implementing it.

In the simplest implementation of the Rabin-Karp algorithm, you did the following:

```python
pattern_length = len(pattern)
# Hash the first window of the text directly; later windows are rolled from it.
substring_hash = polynomial_hash(text[:pattern_length])
if substring_hash == pattern_hash:
    occurrences += 1
```

This was the polynomial hash of the initial substring, which was later used to compute the remaining hash values using rolling hashing. However, you will now have to repeat this process for each pattern length (but not necessarily for each pattern, since patterns of the same length can share the same substring hashes).

In the simplest implementation of the Rabin-Karp algorithm, you did the following:

```python
# Each iteration rolls the window one position: drop text[i], take in text[i + pattern_length].
for i in range(text_length - pattern_length):
    previous_hash = substring_hash
    substring_hash = polynomial_rolling_hash(previous_hash, text[i], text[i + pattern_length], pattern_length)
    if substring_hash == pattern_hash:
        occurrences += 1
```

Continuing in the same iteration of the `for` loop from earlier, repeat this process for each pattern length. 
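
Putting the previous steps together, one way the multi-pattern version might look is sketched below. It is not the official project solution: the hash functions are stand-ins for the ones from the lesson, and `ALPHABET_SIZE`, `MODULUS`, and `rabin_karp_multiple` are assumed names. Unlike the simplified setting above, this sketch keeps a list of patterns per hash and verifies each candidate match, so it still works if two patterns collide.

```python
from collections import defaultdict

ALPHABET_SIZE = 256        # assumed alphabet size
MODULUS = 1_000_000_007    # assumed large prime modulus


def polynomial_hash(s):
    """Stand-in polynomial hash computed with Horner's rule."""
    h = 0
    for ch in s:
        h = (h * ALPHABET_SIZE + ord(ch)) % MODULUS
    return h


def polynomial_rolling_hash(previous_hash, outgoing, incoming, length):
    """Slide the window one step: remove `outgoing`, append `incoming`."""
    high = pow(ALPHABET_SIZE, length - 1, MODULUS)
    h = (previous_hash - ord(outgoing) * high) % MODULUS
    return (h * ALPHABET_SIZE + ord(incoming)) % MODULUS


def rabin_karp_multiple(patterns, text):
    """Return a dict mapping each pattern to its number of occurrences."""
    occurrences = {p: 0 for p in patterns}
    # Group patterns by length: length -> {hash value: [patterns]}.
    by_length = defaultdict(lambda: defaultdict(list))
    for p in patterns:
        by_length[len(p)][polynomial_hash(p)].append(p)

    text_length = len(text)
    for length, hashes in by_length.items():
        if length == 0 or length > text_length:
            continue
        substring_hash = polynomial_hash(text[:length])
        for start in range(text_length - length + 1):
            if start > 0:
                substring_hash = polynomial_rolling_hash(
                    substring_hash, text[start - 1],
                    text[start + length - 1], length)
            # Verify candidates character by character to rule out collisions.
            for p in hashes.get(substring_hash, ()):
                if text[start:start + length] == p:
                    occurrences[p] += 1
    return occurrences
```

The text is scanned once per distinct pattern length rather than once per pattern, which is the main saving over running the single-pattern algorithm repeatedly.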

Well done! If you have gotten this far, you have successfully extended the basic form of the Rabin-Karp algorithm to find multiple patterns in a given string. 

Using these same ideas, can you extend the functionality of the algorithm to find multiple patterns in multiple pieces of text as well (as opposed to a single string of text)? Think about possible ways you can be efficient in your implementation instead of simply running Rabin-Karp over and over!

In the second half of this project, you will modify the Rabin-Karp Algorithm to find a two-dimensional pattern in a two-dimensional text. That's right! Instead of finding a pattern of `m` characters in a text of `n` characters, you will now have to redesign the Rabin-Karp algorithm to find a pattern of `m1` x `m2` characters in a text of `n1` x `n2` characters. 

In other words, the pattern will have `m1` rows of `m2` characters and `m2` columns of `m1` characters. Similarly, the text will have `n1` rows of `n2` characters and `n2` columns of `n1` characters.

Here is an example of a 2x3 pattern:

```tex
\begin{array}{cccc}
 & 0 & 1 & 2 \\
0 & \text{A} & \text{B} & \text{C} \\
1 & \text{G} & \text{H} & \text{I} \\
\end{array}
```

And here is an example of all occurrences of the 2x3 pattern in a 7x6 text:

```tex
\begin{array}{ccccccc}
 & 0 & 1 & 2 & 3 & 4 & 5 \\
0 & \color{red}\text{A} & \color{red}\text{B} & \color{red}\text{C} & \text{D} & \text{E} & \text{F} \\
1 & \color{red}\text{G} & \color{red}\text{H} & \color{red}\text{I} & \text{J} & \text{K} & \text{L} \\
2 & \text{M} & \text{N} & \text{O} & \text{P} & \text{Q} & \text{R} \\
3 & \text{S} & \text{T} & \text{U} & \text{V} & \text{W} & \text{X} \\
4 & \text{Y} & \text{Z} & \color{red}\text{A} & \color{red}\text{B} & \color{red}\text{C} & \text{D} \\
5 & \text{E} & \text{F} & \color{red}\text{G} & \color{red}\text{H} & \color{red}\text{I} & \text{J} \\
6 & \text{K} & \text{L} & \text{M} & \text{N} & \text{O} & \text{P} \\
\end{array}
```

The text and pattern in the given example can be represented in Python as a list of strings as follows:

```python
pattern = ['ABC', 'GHI']
text = ['ABCDEF', 'GHIJKL', 'MNOPQR', 'STUVWX', 'YZABCD', 'EFGHIJ', 'KLMNOP']
```
While it is possible to solve this problem using brute force, that would be missing the point. Instead, we encourage you to think about this problem as an extension of the Rabin-Karp algorithm from 1-D to 2-D! 



In the 1-D case, all the characters in the pattern were along a single row, and its polynomial hash value was computed by adding up the individual hash values along this row. 

For the 2-D case, however, the pattern has characters in extra rows as well. This means that you will have to repeat the same 1-D process `m1` times since there are now `m1` rows.

Treat each of the `m1` rows as a separate 1-D case, and compute the polynomial hash of each row of the 2-D pattern separately. This will give you `m1` different hash values.

Next, you need to find a way to somehow combine these individual 1-D polynomial hash values to come up with a single 2-D polynomial hash value for the 2-D pattern. 

There are many ways to do this, but the main thing to keep in mind is that the resulting hash value should be as distinctive as possible to reduce the chances of a hashing collision, i.e. two different 2-D patterns hashing to the same value.

For example, simply adding the two hashes won't work since the resulting sum of hash values would collide in the case of two patterns with flipped rows, such as `['ABC', 'DEF'], ['DEF', 'ABC']`. 

Similar to what you did in the lesson, is there instead a polynomial representation you can come up with that makes such collisions less likely?
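
As one possible answer, the row hashes can themselves be treated as the "characters" of a second polynomial, so that the order of the rows matters. The sketch below assumes a second base `COLUMN_BASE` different from the row base; all names and constants here are illustrative choices, not part of the project's starter code.

```python
ROW_BASE = 256             # assumed base for hashing characters within a row
COLUMN_BASE = 257          # assumed second base for combining row hashes
MODULUS = 1_000_000_007    # assumed large prime modulus


def polynomial_hash(s):
    """Stand-in 1-D polynomial hash of a single row."""
    h = 0
    for ch in s:
        h = (h * ROW_BASE + ord(ch)) % MODULUS
    return h


def pattern_hash_2d(pattern):
    """Combine the row hashes with a second polynomial so row order matters."""
    h = 0
    for row in pattern:
        h = (h * COLUMN_BASE + polynomial_hash(row)) % MODULUS
    return h
```

With this construction, `['ABC', 'DEF']` and `['DEF', 'ABC']` no longer hash to the same value, because each row hash is multiplied by a different power of `COLUMN_BASE`.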



Let's now extend the same idea to compute polynomial hash values for the `n1` x `n2` 2-D text of characters. 

In each of the `n1` rows of the text, there are `n2 - m2 + 1` substrings of length `m2`. The idea is to compute the polynomial hash of each of these substrings and store all the hash values in a `n1` x `(n2 - m2 + 1)` 2-D array.

In other words, you will have to repeat the same problem you have already solved in the 1-D case for each of the `n1` rows.

Treat each of the `n1` rows as a separate 1-D case, and compute the polynomial hash values (using rolling hashing) of all substrings of length `m2` in each row of the 2-D text separately.
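
A minimal sketch of this step is shown below, assuming every row of the text has at least `m2` characters. The function name `row_hashes` and the constants are assumptions made for this example.

```python
BASE = 256                 # assumed alphabet size
MODULUS = 1_000_000_007    # assumed large prime modulus


def polynomial_hash(s):
    """Stand-in 1-D polynomial hash."""
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MODULUS
    return h


def row_hashes(text, m2):
    """Return an n1 x (n2 - m2 + 1) table where entry [r][c] is the hash of
    the length-m2 substring of row r starting at column c."""
    high = pow(BASE, m2 - 1, MODULUS)  # weight of the outgoing character
    table = []
    for row in text:
        h = polynomial_hash(row[:m2])
        hashes = [h]
        for j in range(len(row) - m2):
            # Roll the window: drop row[j], take in row[j + m2].
            h = ((h - ord(row[j]) * high) * BASE + ord(row[j + m2])) % MODULUS
            hashes.append(h)
        table.append(hashes)
    return table
```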

Almost there! You might have realized that by this point you have essentially boiled down the problem into a 1-D pattern-matching problem along the columns of the array containing the polynomial hash values.

Previously, you performed pattern-matching on a 1-D string of characters. This time, you are trying to perform pattern-matching on each of the columns where the characters are now replaced by hash values.

In other words, you are now looking to match the list of hash values in each column to that of the pattern's hash you computed earlier.

Start by combining the independent 1-D hashes in each column, just as you did for the pattern, so that they represent the 2-D hash corresponding to a potential 2-D pattern match.

The idea is very similar to how you computed the polynomial hash using rolling hashing in the 1-D case. The only difference is the choice of the polynomial function to produce reasonably unique hashes.
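
For reference, here is one way the whole 2-D search could fit together. Treat it as a sketch under stated assumptions, not the intended solution: it assumes rectangular inputs, uses two illustrative bases (`ROW_BASE`, `COLUMN_BASE`) and a prime `MODULUS`, and the function names are made up for this example. Candidate matches are verified directly to rule out collisions.

```python
ROW_BASE = 256             # assumed base within a row
COLUMN_BASE = 257          # assumed base across rows
MODULUS = 1_000_000_007    # assumed large prime modulus


def polynomial_hash(s):
    h = 0
    for ch in s:
        h = (h * ROW_BASE + ord(ch)) % MODULUS
    return h


def row_hashes(text, m2):
    """Rolling hashes of every length-m2 substring in every row."""
    high = pow(ROW_BASE, m2 - 1, MODULUS)
    table = []
    for row in text:
        h = polynomial_hash(row[:m2])
        hashes = [h]
        for j in range(len(row) - m2):
            h = ((h - ord(row[j]) * high) * ROW_BASE + ord(row[j + m2])) % MODULUS
            hashes.append(h)
        table.append(hashes)
    return table


def find_2d(pattern, text):
    """Return (row, column) positions of the top-left corner of each match."""
    m1, m2 = len(pattern), len(pattern[0])
    n1, n2 = len(text), len(text[0])
    if m1 > n1 or m2 > n2:
        return []
    # 2-D hash of the pattern: a polynomial over its row hashes.
    target = 0
    for row in pattern:
        target = (target * COLUMN_BASE + polynomial_hash(row)) % MODULUS
    table = row_hashes(text, m2)
    high = pow(COLUMN_BASE, m1 - 1, MODULUS)
    matches = []
    for col in range(n2 - m2 + 1):
        # Hash of the first m1 row hashes in this column.
        h = 0
        for r in range(m1):
            h = (h * COLUMN_BASE + table[r][col]) % MODULUS
        for top in range(n1 - m1 + 1):
            if top > 0:
                # Roll down the column by one row.
                h = ((h - table[top - 1][col] * high) * COLUMN_BASE
                     + table[top + m1 - 1][col]) % MODULUS
            if h == target and all(
                text[top + r][col:col + m2] == pattern[r] for r in range(m1)
            ):
                matches.append((top, col))
    return matches
```

On the example above, `find_2d(pattern, text)` reports the two highlighted occurrences at `(0, 0)` and `(4, 2)`.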



If you have reached this far, you have most likely managed to modify the Rabin-Karp algorithm to solve the 2-D version of the pattern-matching problem! To make sure your code is indeed correct, try coming up with a couple more test cases by hand and see if it still produces the correct output.

Once you are confident about the correctness of your code, try generating some random large inputs to stress test your implementation for efficiency. Does it run relatively quickly or does it seem to take a rather long time?



Great job pushing the limits of the Rabin-Karp algorithm! Continue to think of new ways in which you can further extend its functionalities. For example, you might want to try extending the 1-D problem of matching multiple different patterns to the 2-D case. 

If you are still looking to go above and beyond to challenge yourself, think about how you might be able to extend the 2-D case even further to match multiple patterns of varied dimensions in 3-D texts!

It can turn out to be a really cool project and a challenging programming workout. Most importantly, make sure to enjoy the process of discovery, and it will only be a matter of time before you find a solution!

Rabin-Karp Algorithm Project

Congratulations, you’ve successfully completed the Learn Advanced Algorithms with Python: String Searching Algorithms course! You've learned about two powerful string searching methodologies: the Rabin-Karp algorithm and the Knuth-Morris-Pratt algorithm.

Your learning journey into Advanced Algorithms and Data Structures with Python isn't over yet! Here is our roadmap to mastering Advanced Algorithms and Data Structures with Python:

* [Learn Advanced Data Structures with Python: Deques](https://www.codecademy.com/learn/learn-advanced-data-structures-with-python-deques) <-- Completed!
* [Learn Advanced Algorithms with Python: String Searching Algorithms](https://www.codecademy.com/learn/learn-advanced-algorithms-with-python-string-searching-algorithms) <-- Completed!
* [Learn Advanced Data Structures with Python: Trees](https://www.codecademy.com/learn/learn-advanced-data-structures-with-python-trees) <-- Up next!
* [Learn Advanced Algorithms with Python: Hamiltonian Algorithms](https://www.codecademy.com/learn/learn-advanced-algorithms-with-python-hamiltonian-algorithms) <-- Up next!

Once again, congratulations on finishing the Learn Advanced Algorithms with Python: String Searching Algorithms course! We are excited to see what you accomplish next.

Next Steps

## Introduction

The world is full of data, and one of the most commonly used data types we encounter as programmers is strings. In fact, this very article you are reading right now is made up of hundreds of strings! Out of all those strings, you might be interested in finding a particular keyword. How would you go about finding it? 

Chances are, the browser in which you are reading this article is already equipped with a string-search algorithm. You can use the search function on the browser you are using, which would then run that algorithm in the background to locate the keyword and highlight it for you. What are the step-by-step procedures executed by the algorithm to correctly find the keyword?

You are likely familiar with brute-force algorithms, which consider all possible candidate solutions and go through each one to find the correct solution. While it is possible to write such a brute-force algorithm to find strings in an article, it is not considered good practice. The real world consists of large datasets that make brute-force algorithms extremely inefficient!

## A Faster Approach

Since brute-force algorithms can be inefficient, this is where faster string-search algorithms come into play. The **Rabin-Karp algorithm** is the first one we will explore.

Before we discuss the Rabin-Karp algorithm, let’s lay out the groundwork to ensure we are on the same page. There are different terminologies used to talk about string algorithms, but for consistency, let’s use the terms `text` to refer to a very long string where we would like to find a shorter string and `pattern` to refer to this shorter string that we are looking for. Let’s also say that `pattern` has a length `m` and `text` has a length `n`. The `text` is always at least as long as the `pattern`. In other words, `n >= m`.

A brute-force algorithm first retrieves the length of the `pattern` we are looking for, `m`. Then, it finds all substrings in `text` with the same length `m` as the `pattern`. There are `n - m + 1` such substrings, so this step alone contributes `O(n)` to the running time. The brute-force algorithm then has to go through each substring and match it against `pattern`, which takes an additional `O(m)` running time for each of the `O(n)` substrings. This produces a worst-case running time of `O(nm)`, which performs very badly when `n` and `m` are large.
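
To make the two nested costs concrete, here is a minimal Python sketch of the brute-force approach just described. The function name `brute_force_count` is an assumption made for this illustration.

```python
def brute_force_count(pattern, text):
    """Count occurrences of pattern in text by checking every position."""
    m, n = len(pattern), len(text)
    occurrences = 0
    for start in range(n - m + 1):        # O(n) candidate substrings
        matched = True
        for offset in range(m):           # up to O(m) comparisons per candidate
            if text[start + offset] != pattern[offset]:
                matched = False
                break
        if matched:
            occurrences += 1
    return occurrences
```

The outer loop visits each candidate substring and the inner loop compares it character by character, which is exactly where the `O(nm)` worst case comes from.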

This is where we find the need for more efficient algorithms like the Rabin-Karp algorithm, which has a linear running time on average. This means that, while the Rabin-Karp algorithm can still run slowly on some worst-case inputs, it runs very fast for most.

### Rolling Hash

The Rabin-Karp algorithm achieves this speedup through a hashing technique called **rolling hashing**. Instead of matching the characters in the `pattern` one at a time, we compute the hash value of the `pattern` once and compare it to the hash of each substring of the same length in `text`. We still have to consider every substring of length `m` in the `text`. However, a rolling hash lets us derive the hash of each substring from the hash of the previous one in `O(1)` (constant) time, so we no longer need the extra `O(m)` time to compare each character. This brings down the running time by a factor of `m` to produce a linear running time of `O(n)` on average.

The way rolling hash works is similar to the idea of running averages. Let's say you want to find a running average of numbers as they arrive. Instead of calculating a new average of all the numbers seen so far from scratch, what if we could use the average we already have to compute the new average? 

The same idea is used in rolling hashing, except that we generate the hash of the next substring in line from the hash of the previous substring that we have already calculated. This saves us quite a bit of computation time. We will learn how to implement this in the upcoming lesson.
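
As a small preview, the sketch below shows the update step for one common choice of hash. The names `polynomial_hash` and `roll`, the alphabet size, and the modulus are all assumptions made for this illustration. The upcoming lesson may define things differently.

```python
ALPHABET_SIZE = 256        # assumed alphabet size
MODULUS = 1_000_000_007    # assumed large prime modulus


def polynomial_hash(s):
    """Hash a string from scratch in O(len(s)) time."""
    h = 0
    for ch in s:
        h = (h * ALPHABET_SIZE + ord(ch)) % MODULUS
    return h


def roll(previous_hash, outgoing, incoming, length):
    """Derive the next window's hash from the previous one.

    Remove the contribution of the outgoing (leading) character, shift the
    remaining characters up by one power, and add the incoming character.
    """
    high = pow(ALPHABET_SIZE, length - 1, MODULUS)
    h = (previous_hash - ord(outgoing) * high) % MODULUS
    return (h * ALPHABET_SIZE + ord(incoming)) % MODULUS


text = "abcd"
first = polynomial_hash(text[:3])          # hash of "abc", computed from scratch
second = roll(first, text[0], text[3], 3)  # hash of "bcd", derived from "abc"
```

Here `second` equals `polynomial_hash("bcd")`, but it was obtained with a handful of arithmetic operations instead of re-reading all three characters.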

### Polynomial Hash

Another key idea used in the Rabin-Karp algorithm is the idea of a **polynomial hash function**. Besides being easy to "roll", a polynomial hash function is chosen because it reduces the number of hashing collisions that can occur while comparing the `pattern` to the substrings in `text`. It can be constructed in many ways, but a common way to define it is as follows:

```
Polynomial Hash of pattern = (ASCII value of first character) * (Size of the alphabet)^(m - 1)
                           + (ASCII value of second character) * (Size of the alphabet)^(m - 2)
                           + …
                           + (ASCII value of last character) * (Size of the alphabet)^0

where m is the length of the pattern
```

Here, the ASCII value is just the character encoding value of a given character. The size of the alphabet is the total number of unique characters that can be encountered in a given `text`. Because the hash is a sum of character values multiplied by powers of the alphabet size, it has the form of a polynomial, which is where the function gets its name. As we will find out, it will be essential to implementing the Rabin-Karp algorithm effectively.
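
The definition above translates almost directly into Python. The sketch below assumes an alphabet size of 256 and leaves out the modulus that practical implementations usually add to keep the numbers small. The function name and default value are assumptions made for this illustration.

```python
def polynomial_hash(pattern, alphabet_size=256):
    """Sum of (character code) * alphabet_size^(position from the end)."""
    m = len(pattern)
    return sum(
        ord(ch) * alphabet_size ** (m - 1 - i)
        for i, ch in enumerate(pattern)
    )


# "AB" -> 65 * 256^1 + 66 * 256^0 = 16706
example = polynomial_hash("AB")
```

Because each position is weighted by a different power of the alphabet size, strings that contain the same characters in a different order, such as `"AB"` and `"BA"`, produce different hash values.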

## Conclusion

This concludes our introduction to the Rabin-Karp algorithm. We have learned:
* How to increase efficiency in string searching algorithms using Rabin-Karp
* How rolling hash functions and polynomial hash functions play a role in the Rabin-Karp algorithm

In the upcoming lesson, we will put everything together using Python. Happy coding!
 



Learn about the Rabin-Karp algorithm, a powerful tool when working with string data!

Introduction to the Rabin-Karp Algorithm

### About this course
Continue your Python 3 learning journey with Learn Advanced Algorithms with Python: String Searching Algorithms. Learn how to circumvent ineffective and inefficient brute-force algorithms by using optimal string-search algorithms like the Rabin-Karp algorithm and the Knuth-Morris-Pratt algorithm.

### Skills you'll gain
- Use rolling hashes
- Implement the prefix function
- Resolve hashing collisions

### Notes on Prerequisites
We recommend that you complete [Learn Advanced Data Structures with Python: Deques](https://www.codecademy.com/learn/learn-advanced-data-structures-with-python-deques) before completing this course.

Learn about two powerful string searching methodologies: the Rabin-Karp algorithm and the Knuth-Morris-Pratt algorithm.