KMP Algorithm: Pattern Searching in Text
Agenda:
- Quick Intro
- Input
- Output
- Complexity
- Algo Logic
- Code
- Real-Life Use Cases
- Conclusion
- Sources for further reading
1. Quick Intro
- The Knuth-Morris-Pratt (KMP) algorithm is a classic string-searching algorithm used to find occurrences of a "pattern" string within a "text" string.
- It was devised by Donald Knuth, James H. Morris, and Vaughan Pratt, hence the name KMP.
- The key feature of the KMP algorithm is its ability to perform the search efficiently by avoiding unnecessary comparisons.
2. Input
The KMP algorithm takes two inputs:
- Text (T): The string in which we are searching for a pattern.
- Pattern (P): The string we are searching for within the text.
3. Output
- Indices (I): Starting indices of all occurrences of the pattern in the text. If the pattern is not found, it returns an empty list.
4. Complexity
- Time Complexity: O(n + m)
- n ---> Length of the text.
- m ---> Length of the pattern.
- Space Complexity: O(m)
- m ---> Length of the pattern.
- primarily due to the preprocessing of the pattern to create the LPS array.
5. Algo Logic
The KMP algorithm consists of two phases:
- Pre-processing
- Searching
Here’s how it works:
A) Pre-processing the Pattern
- Construct the LPS (Longest Prefix Suffix) array.
- This array is used to skip unnecessary comparisons in the main search algorithm.
- Analogy: Imagine you're trying to find a specific tune in a song. As you listen to the song, you can skip certain parts that you've already heard before, because you know they don't match the tune. The LPS table is like a shortcut list that tells you how much of the song you can skip.
- Logic:
- Match: When a character matches, it's like finding a note that fits the tune. You move forward in the song.
- Mismatch: When a character doesn't match, it's like hitting a wrong note. You backtrack to the last known good note (stored in the LPS table) and try again.
B) Searching
- Use the LPS array to skip characters while matching the pattern with the text, thereby reducing the overall number of comparisons.
Steps to Construct the LPS Array
- Initialize the LPS array with zeros.
- Use two pointers, len (length of the previous longest prefix suffix), and i (current position in the pattern).
- Compare the characters of the pattern at these pointers and update the LPS array accordingly.
Searching Process
- Use two pointers, i (for text), and j (for pattern).
- If characters match, increment both pointers.
- If characters do not match and j is not at the start, update j using the LPS array.
- If characters do not match and j is at the start, increment i.
6. Code
Here’s a Python implementation of the KMP algorithm:
def compute_lps_array(pattern):
m = len(pattern)
lps = [0] * m
length = 0
i = 1
while i < m:
if pattern[i] == pattern[length]:
length += 1
lps[i] = length
i += 1
else:
if length != 0:
length = lps[length - 1]
else:
lps[i] = 0
i += 1
return lps
def kmp_search(text, pattern):
n = len(text)
m = len(pattern)
lps = compute_lps_array(pattern)
i = 0
j = 0
result = []
while i < n:
if pattern[j] == text[i]:
i += 1
j += 1
if j == m:
result.append(i - j)
j = lps[j - 1]
elif i < n and pattern[j] != text[i]:
if j != 0:
j = lps[j - 1]
else:
i += 1
return result
# Example usage:
text = "ABABDABACDABABCABAB"
pattern = "ABABCABAB"
print(kmp_search(text, pattern)) # Output: [10]
7. Real-Life Use Cases
The KMP algorithm is widely used in various applications, such as:
- Text Editors: Implementing the "find" feature to locate substrings within a document.
- Biological Sequence Analysis: Searching for specific DNA or protein sequences within larger genomic datasets.
- Plagiarism Detection: Identifying copied content by finding matching substrings in large text databases.
- Information Retrieval: Enhancing search engines to efficiently locate keywords within documents.
8. Conclusion
- A powerful tool for string matching.
- Efficient performance.
- Its applications span across various fields.
- Especially helpful in competitive programming.
9. Sources for further reading
- PDF: KMP_PDF
- Coursera: Knuth–Morris–Pratt | Coursera -- Explained effectively with Finite Automata Diagrams (DFA & NFA).
Comments
Post a Comment