Let’s bring everything together to create the Knuth-Morris-Pratt algorithm!
The pseudocode for the KMP algorithm is extremely similar to that of the computation of the prefix function. In both cases, we are matching strings against each other using the same idea — the prefix function is a result of matching the pattern against itself, whereas the KMP algorithm uses the values of the prefix function to match the pattern against the text.
During the computation of the prefix function, we were interested in finding out whether the next character forms a new valid suffix.
In the KMP algorithm, we are now interested in finding out the same for each character in the text. Whenever possible, we align a valid prefix of the pattern with its corresponding valid suffix in the text. This allows us to skip over characters we know cannot produce a match.
The pseudocode breakdown looks something like this:
- We start the algorithm by matching the pattern against the text character-by-character.
- If there is a mismatch, we look for a shorter valid prefix that can potentially align with an existing valid suffix among the characters that have already matched (similar to how we computed the prefix function).
- However, unlike last time, we don’t have to look for the length of a valid prefix all over again — we have already computed it and stored the corresponding length in the prefix function!
- Finally, if at any point the number of matched characters equals the length of the pattern, we will have found an occurrence of the pattern in the text.
- We can repeat the algorithm to find the next occurrence by looking for a shorter valid prefix length.
Create a function called
kmp_algorithm() that takes in two parameters
text. Inside the function, create the following variables:
mto store the length of the pattern,
nto store the length of text,
- a list
pito hold the prefix function computed on
- a variable
jto store the number of characters matched so far,
- and a variable
countto store the number of occurrences of
for loop to iterate through each index
text. Inside this loop, create another
while loop that runs as long as the following two conditions are
j > 0
text[i] != pattern[j]
while loop, make sure to reset the number of matched characters to the length of a shorter valid prefix by retrieving this value from
j = pi[j - 1].
Upon exiting the
while loop, create an
if condition to check whether
pattern[j]. If it does, increase the number of matched characters by
j += 1.
If the number of matched characters equals the length of the pattern:
- Reset the number of matched characters to the length of a shorter valid suffix.
Return the number of occurrences of
Great job! You’ve successfully implemented one of the most difficult algorithms in Python! In the next exercise, we will revisit our old friend Rabin-Karp to see how the two algorithms perform against each other.