Let’s bring everything together to create the Knuth-Morris-Pratt algorithm!
The pseudocode for the KMP algorithm is extremely similar to that of the computation of the prefix function. In both cases, we are matching strings against each other using the same idea — the prefix function is a result of matching the pattern against itself, whereas the KMP algorithm uses the values of the prefix function to match the pattern against the text.
During the computation of the prefix function, we were interested in finding out whether the next character forms a new valid suffix.
In the KMP algorithm, we are now interested in finding out the same for each character in the text. Whenever possible, we align a valid prefix of the pattern with its corresponding valid suffix in the text. This allows us to skip over characters we know cannot produce a match.
The pseudocode breakdown looks something like this:
- We start the algorithm by matching the pattern against the text character-by-character.
- If there is a mismatch, we look for a shorter valid prefix that can potentially align with an existing valid suffix among the characters that have already matched (similar to how we computed the prefix function).
- However, unlike last time, we don’t have to look for the length of a valid prefix all over again — we have already computed it and stored the corresponding length in the prefix function!
- Finally, if at any point the number of matched characters equals the length of the pattern, we will have found an occurrence of the pattern in the text.
- We can repeat the algorithm to find the next occurrence by looking for a shorter valid prefix length.
Instructions
Create a function called kmp_algorithm()
that takes in two parameters pattern
and text
. Inside the function, create the following variables:
m
to store the length of the pattern,n
to store the length of text,- a list
pi
to hold the prefix function computed onpattern
, - a variable
j
to store the number of characters matched so far, - and a variable
count
to store the number of occurrences ofpattern
intext
.
Create a for
loop to iterate through each index i
of text
. Inside this loop, create another while
loop that runs as long as the following two conditions are True
:
j > 0
text[i] != pattern[j]
Inside this while
loop, make sure to reset the number of matched characters to the length of a shorter valid prefix by retrieving this value from pi
i.e. j = pi[j - 1]
.
Upon exiting the while
loop, create an if
condition to check whether text[i]
equals pattern[j]
. If it does, increase the number of matched characters by 1
i.e. j += 1
.
If the number of matched characters equals the length of the pattern:
- Increase
count
by 1. - Reset the number of matched characters to the length of a shorter valid suffix.
Return the number of occurrences of pattern
in text
.
Great job! You’ve successfully implemented one of the most difficult algorithms in Python! In the next exercise, we will revisit our old friend Rabin-Karp to see how the two algorithms perform against each other.