Knuth-Morris-Pratt(KMP) for Pattern Matching

#Day 8

KMP algorithm is used in finding a substring in a given string. Or we can say it searches for the occurrences of a “word” in a “text string”.

KMP based on dynamic programming(memoize the last part). In the normal scenario with brute forcing, we can get the searched string in O(n*m) where n is the text length and m is the searched word’s length. For example-

Text String -> ABAAABAA

Search String -> AABA

If we take indexes as i=0 and j=0  from both the strings then the comparison will go something like

text[i] compare string[j]=> A compare A       (i=0,j=0)

text[i] compare string[j]=> B compare A       (i=1,j=1)

text[i] compare string[j]=> B compare A       (i=1,j=0)

text[i] compare string[j]=> A compare A       (i=2,j=0)

text[i] compare string[j]=> A compare A       (i=3,j=1)

text[i] compare string[j]=> A compare B       (i=4,j=2)

text[i] compare string[j]=> A compare A       (i=3,j=0)

……. and so on untill we find the match . In this method we have to start from the start of searched string. Hence it is slower and O(n*m) . This same problem can be solved in O(n+m) using KMP algorithm.

In this algorithm, we need to pre-calculate for searched string. If there exist a suffix which is also a prefix at any given index.

For example

ABCAB at index 4(count from 0)  has AB as the suffix which is also a prefix of the string. So if we are matching

ABCABD against ABCABCABD when i =5 and j=5 we get C compare with D

kmp2
Image: 1

Then instead of going back to j=0 in the searched string we go back to j =2 as we have a suffix of length 2 (AB ) which is also a prefix that means AB is already matched in the text string. And we don’t change i.

kmp7
Image: 2

Then we move on with the same algorithm

kmp4
Image: 3
kmp5
Image: 4
kmp6
Image: 5

So we get that the searched string exists in the text string at index 3.

 

So to precalculate the prefix-suffix table for the searched string we use the following algorithm-

initialize an array of size n (length of searched string) with 0.

we take 2 counters i=0 and j=1

if searched[i]==searched[j]

array[j]=i+1

i ++

j++

else if i =0

j++

else

i = array[i-1]

array = [0]*length_of_string
first_index = 0
second_index = 1
while second_index<length_of_string:
    if s[first_index]==s[second_index]:
        array[second_index] = first_index+1
        first_index += 1
        second_index += 1
    else:
        if first_index==0:
            second_index += 1
        else:
            first_index = array[first_index-1]

Using above  code we get the following working on searched string

kmp8
Image: 6
kmp9
Image: 7
kmp11
Image: 8
kmp12
Image: 9
kmp13
Image: 10

From above table we get that at index 3 there is a suffix of length 1 which is the prefix as well. At index 4 we have that of length 2 and rest we have as 0.

Time Compexity

The time complexity of this algorithm is O(n+m) as we traverse the text string only once and the searched string is traversed in linear time.

Source Code

Source code for this algorithm can be found here.

 

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s