KMP Algorithm: Pattern Searching in Text

Agenda:

  1. Quick Intro
  2. Input
  3. Output
  4. Complexity
  5. Algo Logic
  6. Code
  7. Real-Life Use Cases
  8. Conclusion
  9. Sources for further reading

1. Quick Intro

  • The Knuth-Morris-Pratt (KMP) algorithm is a classic string-searching algorithm used to find occurrences of a "pattern" string within a "text" string. 
  • It was devised by Donald Knuth, James H. Morris, and Vaughan Pratt, hence the name KMP
  • The key feature of the KMP algorithm is its ability to perform the search efficiently by avoiding unnecessary comparisons.

2. Input

The KMP algorithm takes two inputs:

  • Text (T): The string in which we are searching for a pattern.
  • Pattern (P): The string we are searching for within the text.

3. Output

  • Indices (I): Starting indices of all occurrences of the pattern in the text. If the pattern is not found, it returns an empty list.

4. Complexity

  • Time ComplexityO(n + m) 
    • n ---> Length of the text.
    • m ---> Length of the pattern.
  • Space Complexity: O(m)
    • m ---> Length of the pattern.
    • primarily due to the preprocessing of the pattern to create the LPS array.

5. Algo Logic

The KMP algorithm consists of two phases:

  1. Pre-processing
  2. Searching

Here’s how it works:

A) Pre-processing the Pattern

  • Construct the LPS (Longest Prefix Suffix) array.
  • This array is used to skip unnecessary comparisons in the main search algorithm.
  • Analogy: Imagine you're trying to find a specific tune in a song. As you listen to the song, you can skip certain parts that you've already heard before, because you know they don't match the tune. The LPS table is like a shortcut list that tells you how much of the song you can skip.
  • Logic:
    • Match: When a character matches, it's like finding a note that fits the tune. You move forward in the song.
    • Mismatch: When a character doesn't match, it's like hitting a wrong note. You backtrack to the last known good note (stored in the LPS table) and try again.

B) Searching

  • Use the LPS array to skip characters while matching the pattern with the text, thereby reducing the overall number of comparisons.

Steps to Construct the LPS Array

  • Initialize the LPS array with zeros.
  • Use two pointers, len (length of the previous longest prefix suffix), and i (current position in the pattern).
  • Compare the characters of the pattern at these pointers and update the LPS array accordingly.

Searching Process

  • Use two pointers, i (for text), and j (for pattern).
  • If characters match, increment both pointers.
  • If characters do not match and is not at the start, update using the LPS array.
  • If characters do not match and is at the start, increment i.

6. Code

Here’s a Python implementation of the KMP algorithm:

def compute_lps_array(pattern):
    m = len(pattern)
    lps = [0] * m
    length = 0
    i = 1

    while i < m:
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        else:
            if length != 0:
                length = lps[length - 1]
            else:
                lps[i] = 0
                i += 1
    return lps

def kmp_search(text, pattern):
    n = len(text)
    m = len(pattern)
    lps = compute_lps_array(pattern)
    
    i = 0
    j = 0
    result = []
    
    while i < n:
        if pattern[j] == text[i]:
            i += 1
            j += 1
        
        if j == m:
            result.append(i - j)
            j = lps[j - 1]
        elif i < n and pattern[j] != text[i]:
            if j != 0:
                j = lps[j - 1]
            else:
                i += 1
    return result

# Example usage:
text = "ABABDABACDABABCABAB"
pattern = "ABABCABAB"
print(kmp_search(text, pattern))  # Output: [10]

7. Real-Life Use Cases

The KMP algorithm is widely used in various applications, such as:

  • Text Editors: Implementing the "find" feature to locate substrings within a document.
  • Biological Sequence Analysis: Searching for specific DNA or protein sequences within larger genomic datasets.
  • Plagiarism Detection: Identifying copied content by finding matching substrings in large text databases.
  • Information Retrieval: Enhancing search engines to efficiently locate keywords within documents.

8. Conclusion

  1. A powerful tool for string matching. 
  2. Efficient performance.
  3. Its applications span across various fields.
  4. Especially helpful in competitive programming.

9. Sources for further reading

  1. PDF: KMP_PDF
  2. Coursera: Knuth–Morris–Pratt | Coursera -- Explained effectively with Finite Automata Diagrams (DFA & NFA).

Comments

Popular posts from this blog

Z-Function Algorithm: Substring Search

Back of the envelope estimations