Problem Definition

LeetCode Link: 28. Find the Index of the First Occurrence in a String

Given two strings needle and haystack, return the index of the first occurrence of needle in haystack, or -1 if needle is not part of haystack.

Example 1:

Input: haystack = "sadbutsad", needle = "sad"
Output: 0
Explanation: "sad" occurs at index 0 and 6.
The first occurrence is at index 0, so we return 0.

Example 2:

Input: haystack = "leetcode", needle = "leeto"
Output: -1
Explanation: "leeto" did not occur in "leetcode", so we return -1.

Constraints:

  • 1 <= haystack.length, needle.length <= 104
  • haystack and needle consist of only lowercase English characters.

Thought Process

The key idea behind KMP (Knuth-Morris-Pratt) algorithm is: When a character mismatch occurs during string matching, we can utilize information about the previously matched text to avoid starting the matching process from the beginning.

What is KMP?

First, let's understand where the name "KMP" comes from. It's named after its three inventors: Knuth, Morris, and Pratt, taking the first letter of each surname.

Applications of KMP

KMP is primarily used in string pattern matching. Its main innovation is that when a character mismatch occurs, it utilizes information about previously matched text to avoid restarting the matching process from scratch.

The key to KMP's efficiency lies in how it records previously matched content, which is handled by the next array (also known as the failure function).

Many programmers memorize the KMP implementation without truly understanding it. This approach is problematic because:

  1. You might struggle to write it in interviews
  2. You might not be able to explain why the next array contains specific values

Let's dive deep into understanding the essence of KMP and demystify the next array.

Prefix Table

When studying the KMP algorithm, you'll often encounter something called the "next array." This next array is actually what we call a prefix table, and understanding it is key to mastering KMP.

The Purpose of the Prefix Table

Think of the prefix table as a clever shortcut finder. Its main job is to help us backtrack efficiently when characters don't match. Instead of starting over from the beginning, it tells us exactly where in the pattern we should continue our search. This is one of the key innovations that makes KMP so efficient.

Let's understand this with a practical example that we'll reference throughout our discussion:

text = "aabaabaafa"    # The text we're searching in
pattern = "aabaaf"     # The pattern we're looking for

I want you to really internalize this example, as it will help us understand the power of the prefix table. Let's visualize what happens during matching:

  1. First, characters start matching: aabaa
  2. Then we hit a problem: text has b but pattern has f
  3. Without KMP, we'd start all over
  4. But with our prefix table, something magical happens...

The Magic of the Prefix Table

When we reach that mismatch at position 6 (where we see b instead of f), the prefix table tells us something very interesting. Instead of going back to the very beginning, it lets us jump to the third position in our pattern and continue matching from there. But how does it know to do this?

The answer lies in how the prefix table is structured: For any position i in our pattern, the prefix table records the length of the longest substring that appears both at the start and end of our pattern up to position i.

To understand the magic that happens with our prefix table, let's examine our pattern aabaaf more closely. At the point where we found our mismatch, we had already matched aabaa in both the text and pattern. This is a crucial moment because it reveals why the prefix table is so powerful.

Consider what we know at this point:

Text:    a a b a a b a a f a
         | | | | | ×
Pattern: a a b a a f
         | | | | | ×
         
Matching position by position:
1. First 'a' matches
2. Second 'a' matches
3. 'b' matches
4. Third 'a' matches
5. Fourth 'a' matches
6. Mismatch! ('b' ≠ 'f')

At this point, we've successfully matched five characters (aabaa), but we've hit a snag. This is where the cleverness of KMP comes into play.

The prefix table tells us something remarkable about the substring we've matched so far (aabaa). Within this substring, we have a pattern that repeats at both its beginning and end - specifically, the sequence aa. This is like finding a hidden shortcut in our search process.

Here's how we use this information to make an efficient move:

Step 1: At the mismatch
Text:    a a b a a b a a f a
         | | | | | ×
Pattern: a a b a a f
         | | | | | ×

Step 2: Identify the overlap ("aa")
Text:    a a b a a b a a f a
         a a                 ← Beginning pattern
               a a           ← Same pattern at end

Step 3: Make the intelligent jump
Text:    a a b a a b a a f a
                   |
Pattern:       a a b a a f
                   |

Let me explain how we calculate the prefix table, which is a crucial part of the KMP algorithm. I'll break this down step by step to build a deep understanding.

How to Calculate the Prefix Table

We calculate the prefix table (next array) for the pattern aabaaf. The prefix table records the length of the longest prefix which is also a suffix for each position.

Prefix Table Construction:

PositionSubstring (up to i)PrefixesSuffixesLongest Prefix-Suffix MatchPrefix Table Value
0a--00
1aaaa11
2aaba, aab, ab00
3aabaa, aa, aaba, ba, aba11
4aabaaa, aa, aab, aabaa, aa, baa, abaa22
5aabaafa, aa, aab, aaba, aabaaf, af, aaf, baaf, abaaf00

The Complete Prefix Table

Pattern:    a  a  b  a  a  f
Position:   0  1  2  3  4  5
Values:     0  1  0  1  2  0

Match the Pattern in the Text

We now use the prefix table to efficiently search for the pattern aabaaf in the text aabaabaafa.

Matching Process

Step Text Index (i) Pattern Index (j) Current Character Comparison Action
1 0 0 T[0] == P[0] (a == a) Match → Move i=1, j=1
2 1 1 T[1] == P[1] (a == a) Match → Move i=2, j=2
3 2 2 T[2] == P[2] (b == b) Match → Move i=3, j=3
4 3 3 T[3] == P[3] (a == a) Match → Move i=4, j=4
5 4 4 T[4] == P[4] (a == a) Match → Move i=5, j=5
6 5 5 T[5] != P[5] (b ≠ f) Mismatch → Use LPS[4]=2 → Set j=2, i remains 5
7 5 2 T[5] == P[2] (b == b) Match → Move i=6, j=3
8 6 3 T[6] == P[3] (a == a) Match → Move i=7, j=4
9 7 4 T[7] == P[4] (a == a) Match → Move i=8, j=5
10 8 5 T[8] == P[5] (f == f) Match → Move i=9, j=6. Pattern found!

Solution - Prefix Table

class Solution:
    def strStr(self, haystack: str, needle: str) -> int:
        """
        Finds the first occurrence of needle in haystack using KMP algorithm.
        
        Args:
            haystack: The string to search in
            needle: The string to search for
            
        Returns:
            The starting index of the first match, or -1 if not found
        """
        def build_pattern_table(pattern: str) -> list[int]:
            """
            Builds the pattern matching table (also known as the prefix function or failure function)
            for KMP algorithm.
            
            Args:
                pattern: The string pattern to search for
                
            Returns:
                A list of integers representing the longest proper prefix that is also a suffix
            """
            pattern_length = len(pattern)
            pattern_table = [0] * pattern_length
            prev_prefix_length = 0
            current_pos = 1
            
            while current_pos < pattern_length:
                if pattern[current_pos] == pattern[prev_prefix_length]:
                    prev_prefix_length += 1
                    pattern_table[current_pos] = prev_prefix_length
                    current_pos += 1
                else:
                    if prev_prefix_length != 0:
                        prev_prefix_length = pattern_table[prev_prefix_length - 1]
                    else:
                        pattern_table[current_pos] = 0
                        current_pos += 1
                        
            return pattern_table

        # Handle edge cases
        if not needle:
            return 0
            
        if not haystack:
            return -1
            
        pattern_table = build_pattern_table(needle)
        
        haystack_pos = 0  # Current position in haystack
        needle_pos = 0    # Current position in needle
        
        while haystack_pos < len(haystack):
            if haystack[haystack_pos] == needle[needle_pos]:
                haystack_pos += 1
                needle_pos += 1
                
                if needle_pos == len(needle):
                    # Found complete match
                    return haystack_pos - needle_pos
            else:
                if needle_pos != 0:
                    # Partial match failed, try shorter prefix
                    needle_pos = pattern_table[needle_pos - 1]
                else:
                    # No match at current position
                    haystack_pos += 1
                    
        return -1

Key Takeaways and Conclusion

The KMP (Knuth-Morris-Pratt) algorithm represents a significant advancement in string pattern matching, offering several crucial insights:

The heart of KMP's efficiency lies in its intelligent use of the prefix table, which eliminates unnecessary character comparisons by leveraging previously matched information. Rather than starting from scratch when a mismatch occurs, the algorithm uses the prefix table to jump to the most promising next position.

Key points to remember:

  1. The prefix table stores information about the longest proper prefix that is also a suffix, enabling smart jumps during pattern matching
  2. KMP achieves O(n + m) time complexity, where n is the text length and m is the pattern length
  3. Understanding the prefix table construction is crucial for mastering the algorithm
  4. The algorithm is particularly effective for patterns with repeating substrings

The true power of KMP isn't just in its implementation, but in its fundamental insight: previous comparisons contain valuable information that can be used to optimize future matching attempts. This principle extends beyond string matching and can be applied to other algorithmic problems where pattern recognition is important.

For developers and computer science students, mastering KMP provides not just a solution to string matching problems, but a valuable lesson in algorithmic thinking and optimization techniques. It demonstrates how clever preprocessing can lead to significant performance improvements in seemingly simple tasks.