28. Find the Index of the First Occurrence in a String
Master the KMP (Knuth-Morris-Pratt) string matching algorithm with this comprehensive guide. Learn how the prefix table optimizes pattern matching, understand the algorithm's core concepts, and implement the solution for LeetCode 28. Perfect for coding interviews and string manipulation tasks.
Problem Definition
LeetCode Link: 28. Find the Index of the First Occurrence in a String
Given two strings needle
and haystack
, return the index of the first occurrence of needle
in haystack
, or -1
if needle
is not part of haystack
.
Example 1:
Input: haystack = "sadbutsad", needle = "sad"
Output: 0
Explanation: "sad" occurs at index 0 and 6.
The first occurrence is at index 0, so we return 0.
Example 2:
Input: haystack = "leetcode", needle = "leeto"
Output: -1
Explanation: "leeto" did not occur in "leetcode", so we return -1.
Constraints:
1 <= haystack.length, needle.length <= 104
haystack
andneedle
consist of only lowercase English characters.
Thought Process
The key idea behind KMP (Knuth-Morris-Pratt) algorithm is: When a character mismatch occurs during string matching, we can utilize information about the previously matched text to avoid starting the matching process from the beginning.
What is KMP?
First, let's understand where the name "KMP" comes from. It's named after its three inventors: Knuth, Morris, and Pratt, taking the first letter of each surname.
Applications of KMP
KMP is primarily used in string pattern matching. Its main innovation is that when a character mismatch occurs, it utilizes information about previously matched text to avoid restarting the matching process from scratch.
The key to KMP's efficiency lies in how it records previously matched content, which is handled by the next array (also known as the failure function).
Many programmers memorize the KMP implementation without truly understanding it. This approach is problematic because:
- You might struggle to write it in interviews
- You might not be able to explain why the next array contains specific values
Let's dive deep into understanding the essence of KMP and demystify the next array.
Prefix Table
When studying the KMP algorithm, you'll often encounter something called the "next array." This next array is actually what we call a prefix table, and understanding it is key to mastering KMP.
The Purpose of the Prefix Table
Think of the prefix table as a clever shortcut finder. Its main job is to help us backtrack efficiently when characters don't match. Instead of starting over from the beginning, it tells us exactly where in the pattern we should continue our search. This is one of the key innovations that makes KMP so efficient.
Let's understand this with a practical example that we'll reference throughout our discussion:
text = "aabaabaafa" # The text we're searching in
pattern = "aabaaf" # The pattern we're looking for
I want you to really internalize this example, as it will help us understand the power of the prefix table. Let's visualize what happens during matching:
- First, characters start matching:
aabaa
- Then we hit a problem: text has
b
but pattern hasf
- Without KMP, we'd start all over
- But with our prefix table, something magical happens...
The Magic of the Prefix Table
When we reach that mismatch at position 6 (where we see b
instead of f
), the prefix table tells us something very interesting. Instead of going back to the very beginning, it lets us jump to the third position in our pattern and continue matching from there. But how does it know to do this?
The answer lies in how the prefix table is structured: For any position i
in our pattern, the prefix table records the length of the longest substring that appears both at the start and end of our pattern up to position i
.
To understand the magic that happens with our prefix table, let's examine our pattern aabaaf
more closely. At the point where we found our mismatch, we had already matched aabaa
in both the text and pattern. This is a crucial moment because it reveals why the prefix table is so powerful.
Consider what we know at this point:
Text: a a b a a b a a f a
| | | | | ×
Pattern: a a b a a f
| | | | | ×
Matching position by position:
1. First 'a' matches
2. Second 'a' matches
3. 'b' matches
4. Third 'a' matches
5. Fourth 'a' matches
6. Mismatch! ('b' ≠ 'f')
At this point, we've successfully matched five characters (aabaa
), but we've hit a snag. This is where the cleverness of KMP comes into play.
The prefix table tells us something remarkable about the substring we've matched so far (aabaa
). Within this substring, we have a pattern that repeats at both its beginning and end - specifically, the sequence aa
. This is like finding a hidden shortcut in our search process.
Here's how we use this information to make an efficient move:
Step 1: At the mismatch
Text: a a b a a b a a f a
| | | | | ×
Pattern: a a b a a f
| | | | | ×
Step 2: Identify the overlap ("aa")
Text: a a b a a b a a f a
a a ← Beginning pattern
a a ← Same pattern at end
Step 3: Make the intelligent jump
Text: a a b a a b a a f a
|
Pattern: a a b a a f
|
Let me explain how we calculate the prefix table, which is a crucial part of the KMP algorithm. I'll break this down step by step to build a deep understanding.
How to Calculate the Prefix Table
We calculate the prefix table (next array) for the pattern aabaaf
. The prefix table records the length of the longest prefix which is also a suffix for each position.
Prefix Table Construction:
Position | Substring (up to i) | Prefixes | Suffixes | Longest Prefix-Suffix Match | Prefix Table Value |
---|---|---|---|---|---|
0 | a | - | - | 0 | 0 |
1 | aa | a | a | 1 | 1 |
2 | aab | a , aa | b , ab | 0 | 0 |
3 | aaba | a , aa , aab | a , ba , aba | 1 | 1 |
4 | aabaa | a , aa , aab , aaba | a , aa , baa , abaa | 2 | 2 |
5 | aabaaf | a , aa , aab , aaba , aabaa | f , af , aaf , baaf , abaaf | 0 | 0 |
The Complete Prefix Table
Pattern: a a b a a f
Position: 0 1 2 3 4 5
Values: 0 1 0 1 2 0
Match the Pattern in the Text
We now use the prefix table to efficiently search for the pattern aabaaf
in the text aabaabaafa
.
Matching Process
Step | Text Index (i) | Pattern Index (j) | Current Character Comparison | Action |
---|---|---|---|---|
1 | 0 | 0 | T[0] == P[0] (a == a) | Match → Move i=1, j=1 |
2 | 1 | 1 | T[1] == P[1] (a == a) | Match → Move i=2, j=2 |
3 | 2 | 2 | T[2] == P[2] (b == b) | Match → Move i=3, j=3 |
4 | 3 | 3 | T[3] == P[3] (a == a) | Match → Move i=4, j=4 |
5 | 4 | 4 | T[4] == P[4] (a == a) | Match → Move i=5, j=5 |
6 | 5 | 5 | T[5] != P[5] (b ≠ f) | Mismatch → Use LPS[4]=2 → Set j=2, i remains 5 |
7 | 5 | 2 | T[5] == P[2] (b == b) | Match → Move i=6, j=3 |
8 | 6 | 3 | T[6] == P[3] (a == a) | Match → Move i=7, j=4 |
9 | 7 | 4 | T[7] == P[4] (a == a) | Match → Move i=8, j=5 |
10 | 8 | 5 | T[8] == P[5] (f == f) | Match → Move i=9, j=6. Pattern found! |
Solution - Prefix Table
class Solution:
def strStr(self, haystack: str, needle: str) -> int:
"""
Finds the first occurrence of needle in haystack using KMP algorithm.
Args:
haystack: The string to search in
needle: The string to search for
Returns:
The starting index of the first match, or -1 if not found
"""
def build_pattern_table(pattern: str) -> list[int]:
"""
Builds the pattern matching table (also known as the prefix function or failure function)
for KMP algorithm.
Args:
pattern: The string pattern to search for
Returns:
A list of integers representing the longest proper prefix that is also a suffix
"""
pattern_length = len(pattern)
pattern_table = [0] * pattern_length
prev_prefix_length = 0
current_pos = 1
while current_pos < pattern_length:
if pattern[current_pos] == pattern[prev_prefix_length]:
prev_prefix_length += 1
pattern_table[current_pos] = prev_prefix_length
current_pos += 1
else:
if prev_prefix_length != 0:
prev_prefix_length = pattern_table[prev_prefix_length - 1]
else:
pattern_table[current_pos] = 0
current_pos += 1
return pattern_table
# Handle edge cases
if not needle:
return 0
if not haystack:
return -1
pattern_table = build_pattern_table(needle)
haystack_pos = 0 # Current position in haystack
needle_pos = 0 # Current position in needle
while haystack_pos < len(haystack):
if haystack[haystack_pos] == needle[needle_pos]:
haystack_pos += 1
needle_pos += 1
if needle_pos == len(needle):
# Found complete match
return haystack_pos - needle_pos
else:
if needle_pos != 0:
# Partial match failed, try shorter prefix
needle_pos = pattern_table[needle_pos - 1]
else:
# No match at current position
haystack_pos += 1
return -1
Key Takeaways and Conclusion
The KMP (Knuth-Morris-Pratt) algorithm represents a significant advancement in string pattern matching, offering several crucial insights:
The heart of KMP's efficiency lies in its intelligent use of the prefix table, which eliminates unnecessary character comparisons by leveraging previously matched information. Rather than starting from scratch when a mismatch occurs, the algorithm uses the prefix table to jump to the most promising next position.
Key points to remember:
- The prefix table stores information about the longest proper prefix that is also a suffix, enabling smart jumps during pattern matching
- KMP achieves O(n + m) time complexity, where n is the text length and m is the pattern length
- Understanding the prefix table construction is crucial for mastering the algorithm
- The algorithm is particularly effective for patterns with repeating substrings
The true power of KMP isn't just in its implementation, but in its fundamental insight: previous comparisons contain valuable information that can be used to optimize future matching attempts. This principle extends beyond string matching and can be applied to other algorithmic problems where pattern recognition is important.
For developers and computer science students, mastering KMP provides not just a solution to string matching problems, but a valuable lesson in algorithmic thinking and optimization techniques. It demonstrates how clever preprocessing can lead to significant performance improvements in seemingly simple tasks.
Discussion