KMP Algorithm: Pattern Searching in Text

Agenda:

Quick Intro
Input
Output
Complexity
Algo Logic
Code
Real-Life Use Cases
Conclusion
Sources for further reading

1. Quick Intro

The Knuth-Morris-Pratt (KMP) algorithm is a classic string-searching algorithm used to find occurrences of a "pattern" string within a "text" string.
It was devised by Donald Knuth, James H. Morris, and Vaughan Pratt, hence the name KMP.
The key feature of the KMP algorithm is its ability to perform the search efficiently by avoiding unnecessary comparisons.

2. Input

The KMP algorithm takes two inputs:

Text (T): The string in which we are searching for a pattern.
Pattern (P): The string we are searching for within the text.

3. Output

Indices (I): Starting indices of all occurrences of the pattern in the text. If the pattern is not found, it returns an empty list.

4. Complexity

Time Complexity: O(n + m)

n ---> Length of the text.
m ---> Length of the pattern.

Space Complexity: O(m)

m ---> Length of the pattern.
primarily due to the preprocessing of the pattern to create the LPS array.

5. Algo Logic

The KMP algorithm consists of two phases:

Pre-processing
Searching

Here’s how it works:

A) Pre-processing the Pattern

Construct the LPS (Longest Prefix Suffix) array.
This array is used to skip unnecessary comparisons in the main search algorithm.
Analogy: Imagine you're trying to find a specific tune in a song. As you listen to the song, you can skip certain parts that you've already heard before, because you know they don't match the tune. The LPS table is like a shortcut list that tells you how much of the song you can skip.
Logic:

Match: When a character matches, it's like finding a note that fits the tune. You move forward in the song.
Mismatch: When a character doesn't match, it's like hitting a wrong note. You backtrack to the last known good note (stored in the LPS table) and try again.

B) Searching

Use the LPS array to skip characters while matching the pattern with the text, thereby reducing the overall number of comparisons.

Steps to Construct the LPS Array

Initialize the LPS array with zeros.
Use two pointers, len (length of the previous longest prefix suffix), and i (current position in the pattern).
Compare the characters of the pattern at these pointers and update the LPS array accordingly.

Searching Process

Use two pointers, i (for text), and j (for pattern).
If characters match, increment both pointers.
If characters do not match and j is not at the start, update j using the LPS array.
If characters do not match and j is at the start, increment i.

6. Code

Here’s a Python implementation of the KMP algorithm:

def compute_lps_array(pattern):
    m = len(pattern)
    lps = [0] * m
    length = 0
    i = 1

    while i < m:
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        else:
            if length != 0:
                length = lps[length - 1]
            else:
                lps[i] = 0
                i += 1
    return lps

def kmp_search(text, pattern):
    n = len(text)
    m = len(pattern)
    lps = compute_lps_array(pattern)
    
    i = 0
    j = 0
    result = []
    
    while i < n:
        if pattern[j] == text[i]:
            i += 1
            j += 1
        
        if j == m:
            result.append(i - j)
            j = lps[j - 1]
        elif i < n and pattern[j] != text[i]:
            if j != 0:
                j = lps[j - 1]
            else:
                i += 1
    return result

# Example usage:
text = "ABABDABACDABABCABAB"
pattern = "ABABCABAB"
print(kmp_search(text, pattern))  # Output: [10]

7. Real-Life Use Cases

The KMP algorithm is widely used in various applications, such as:

Text Editors: Implementing the "find" feature to locate substrings within a document.
Biological Sequence Analysis: Searching for specific DNA or protein sequences within larger genomic datasets.
Plagiarism Detection: Identifying copied content by finding matching substrings in large text databases.
Information Retrieval: Enhancing search engines to efficiently locate keywords within documents.

8. Conclusion

A powerful tool for string matching.
Efficient performance.
Its applications span across various fields.
Especially helpful in competitive programming.

9. Sources for further reading

PDF: KMP_PDF
Coursera: Knuth–Morris–Pratt | Coursera -- Explained effectively with Finite Automata Diagrams (DFA & NFA).

Search This Blog

SBS