Pattern MatchingCS240E: Data Structure and Data management (Enriched)David Duan, 2019 Winter, Prof. BiedlMotivationPattern MatchingDefinitionsBrute-Force AlgorithmImprovementKarp-Rabin AlgorithmKarp-Rabin Fingerprint Algorithm (Naive)Karp-Rabin Fingerprint Algorithm (Rolling Hash)Boyer-Moore AlgorithmBad Character HeuristicGood Suffix HeuristicConclusionKnuth-Morris-Pratt AlgorithmMotivationHow the Automaton WorksFailure Array KMP AlgorithmSuffix TreesMotivationSuffix TreePattern MatchingAnalysisSuffix ArrayMotivationPattern Matching in Suffix ArraysBuild Suffix ArrayAnalysis
Given a text of length and a pattern of length , we return the first such that for , which is the index of the first occurrence of in , or FAIL
, if does not occur in .
We check every possible guess.
The worst case performance is , as there are valid guesses and checking for each guess (string comparison) takes time.
The runtime for brute-force algorithm is , which is quadratic if is large!
To improve the performance, we can break the problem into two parts:
The Range Query problem in our previous module can be seen an example of prepocessing.
For Pattern Matching, we can either do preprocessing on the pattern , where we eliminate guesses based on completed matches and mistakes, or on text , where we create a data structure to make finding matches faster.
Key Idea We use hashing to eliminate guesses.
Strategy
The initial hashes are called fingerprints. We compute the fingerprint/hash value (using the ordinary flatten-then-modular approach) for each substring of length in . If this fingerprint equals the fingerprint for the pattern, we them to check if it really is a match; otherwise, we discard the current guess.
Pseudocode
Correctness
We will never miss a match, as .
Complexity
The overall runtime is still worst case as computation (line ) depends on characters.
We didn't really improve anything!
Strategy
Suppose in a previous iteration, we have computed , and now we want . Observe . Thus:
Given the hash value of the previous guess, we can compute the hash value of the next guess in constant time!
Pseudocode
Correctness
Correctness for rolling hash comes from modular arithmetic. The rest are identical to before.
Complexity
Worst case is still but this is very unlikely. The expected time is , where we spend doing the initial computations and and during each guess (roughly of them) we perform constant operations.
This is easy and already faster than brute-force, but we can do even better!
Key Idea We compare the characters of the pattern and the substring from text from right to left, skipping the unnecessary comparisons by using the good-suffix and bad-character heuristics.
We compare the end characters of the pattern with the text. If they do not match, then the pattern can be moved on further. If the characters do not match in the end, there is no need to do further comparisons. In addition, we can also see what portion of the pattern has matched (matched suffix), so we could utilize this information and align the text and pattern by skipping any unnecessary comparisons. At the time of a mismatch, each heuristic suggests possible shifts, and the BM algorithm shifts the pattern by considering the maximum shift possible due to two heuristics.
Notation In the following examples, let !
denote the mismatched character and #
denote the starting position of after our shift .
This heuristic is based on the mismatched character.
If there is a mismatch between the character of the pattern and the text, then we check if the mismatched character of the text occurs in the pattern or not. If this mismatched character does not appear in the pattern, then the pattern will be shifted next to this character; if the character appears somewhere in the pattern, we shift the pattern to align with the occurrence of that character with the bad character of the text string.
Example I
The mismatched character is at , where we see . Since , we can shift to , i.e., our next guess is , since substrings all contain but does not. This saves us many unnecessary comparisons.
#
i 0 1 2 3 4 5 6 7 8 9 0 1
=====================================================
T | a | c | a | c | a | b | a | c | a | b | a | b | a |
=====================================================
P | a | c | a | c | a | c |
=====================================================
| a | c | a | c | a | c |
=====================================================
✗
Example II
We have a matched suffix and mismatched character at . Since , we shift to .
xxxxxxxxxx
#
i 0 1 2 3 4 5 6 7 8 9 0 1
=====================================================
T | a | c | a | d | a | c | a | c | a | b | a | b | a |
=====================================================
P | a | c | a | c | a | c |
=====================================================
| a | c | a | c | a | c |
=====================================================
✗ ✓ ✓
Example III
We have a matched suffix and mismatched character at . This time and the only occurrence of is at . Thus, we shift to to align and .
xxxxxxxxxx
# !
i 0 1 2 3 4 5 6 7 8 9 0 1
=====================================================
T | a | c | a | d | a | c | a | c | a | b | a | b | a |
=====================================================
P | a | d | a | c | a | c |
=====================================================
| a | d | a | c | a | c |
=====================================================
✗ ✓ ✓
Example IV
We have a matched suffix and a mismatched character at . This time however, there are two occurrences of in , each at and . We always shift the minimal amount to avoid missing any possible match. In this case, we prefer , because we only shift character, where shifts (and missed the matched string at ! I came up with this example to show the importance of align the bad character to its last occurrence in the pattern).
xxxxxxxxxx
# !
i 0 1 2 3 4 5 6 7 8 9 0 1
=====================================================
T | a | a | c | a | c | c | c | c | a | b | a | b | a |
=====================================================
P | a | c | a | c | c | c |
=====================================================
P1 | a | c | a | c | c | c |
=====================================================
P2 | a | c | a | c | c | c |
=====================================================
✗ ✓ ✓
Last Occurrence Function
In Example IV, we saw the importance of aligning the bad character to its last occurrence in the pattern. Therefore, we need a last-occurrence function , where
For Example IV, , , and for all . When we see the mismatched at , check and shift character.
Boyer-Moore Pseudocode (Bad Character Heuristic)
Remark. Sometimes produces a negative value:
xxxxxxxxxx
i 0 1 2 3 4 5 6 7 8 9 0 1
=====================================================
T | a | b | a | a | b | a | c | c | a | b | a | b | a |
=====================================================
P | c | a | b | a | b
=====================================================
✗ ✓ ✓
If we were to align with the last occurrence of in , we would need to shift . To avoid negative shifting, we take , which guarantees that is incremented by at least . This also motivates the Good Suffix heuristic as we could do better than only shifting in the worst case.
This heuristic is based on the matched suffix.
We shift the pattern to the right in such a way that the matched suffix subpattern is aligned with another occurrence of the same suffix in the pattern.
For this heuristic, an array is used. Each entry contains the shift distance of the pattern if a mismatched at position occurs, i.e., if the suffix of the pattern starting at position has matched. In order to determine the shift distance, two cases have to be considered.
Example I: Align Suffix
xxxxxxxxxx
!,#
i 0 1 2 3 4 5 6 7 8 9 0 1
=====================================================
T | l | m | n | n | p | a | ...
=====================================================
P | s | p | a | a | p | a |
=====================================================
| s | p | a | a | p | a |
=====================================================
✗ ✓ ✓
We matched suffix and found a mismatch at , where but . Therefore, we want to shift forward. The intuition is that, after shifting, two characters aligned with and must be , otherwise it is guaranteed to fail. Since occurs in at (recall this is the index where the substring starts), we align with for our next guess.
Example II: Minimal Shifting
xxxxxxxxxx
# !
i 0 1 2 3 4 5 6 7 8 9 0 1
=====================================================
T | n | p | a | l | m | n | n | p | a | a | p | a |
=====================================================
P | n | p | a | n | p | a | a | p | a |
=====================================================
| n | p | a | n | p | a | a | p | a |
=====================================================
✗ ✓ ✓
In this example, there are two occurrences of in . We always prefer the minimal amount of shifting to avoid missing possible matches.
Example III: Character Before Suffix
xxxxxxxxxx
#,!
i 0 1 2 3 4 5 6 7 8 9 0 1
=====================================================
T | a | b | c | d | e | f | a | a | a |
=====================================================
P | n | a | a | p | a | a | p | a | a |
=====================================================
P1 | n | a | a | p | a | a | ...
=====================================================
P2 | n | a | a | p | a | a | p | a | a |
=====================================================
✗ ✓ ✓
In this example, we matched suffix and the mismatch occurs at , which tells us that in front of the matched suffix doesn't work. Therefore, shifting to make align with will make us compare with , which we already know from the first iteration that is bad. Thus, we shift to make align with to avoid unnecessary comparisons.
Example IV: Bad Character & Good Suffix
xxxxxxxxxx
#,!
i 0 1 2 3 4 5 6 7 8 9 0 1
=====================================================
T | a | b | c | d | e | f | x | a | a |
=====================================================
P | n | a | a | p | a | a | p | a | a |
=====================================================
| n | a | a | p | a | a | ...
=====================================================
✗ ✓ ✓
If we were only to use good-suffix heuristic, we would shift the same way as in Example III. However, the bad-character heuristic tells us does not occur in at all, so we shift to one character after . The previous three examples are designed in the way that we can ignore the bad-character heuristic.
Example V: Suffix of Suffix Matches Prefix of Pattern
xxxxxxxxxx
! #
i 0 1 2 3 4 5 6 7 8 9 0 1
=====================================================
T | a | b | c | a | a | b | b | a | b | b | a |
=====================================================
P | a | c | b | b | a | b | b | a |
=====================================================
P1 | a | c | b | b | a | b | b | a |
=====================================================
P2 | a | c | b | b | ...
=====================================================
✗
The matched suffix is , but is not occurred elsewhere in our pattern . We then search for partial suffix, or suffix of this suffix (which is still a suffix of ), see if any part of it is a prefix for . Note that we only accept when a partial suffix is a prefix of ; anywhere else the partial suffix occurs in are not accepted. Consider the above example. We know is not occurred elsewhere in , so we start searching if a partial suffix is a prefix of . Although does occur, we know the character preceding is a mismatch (otherwise, would occur), so we don't accept this occurrence of in the middle of . Thus, is wrong. The correct shift is , where we see is both a suffix and a prefix of .
Example VI: Together
xxxxxxxxxx
! #
i 0 1 2 3 4 5 6 7 8 9 0 1
=====================================================
T | c | h | u | a | c | h | u | p | i | k | a | p | ...
=====================================================
P | p | i | k | a | c | h | u |
=====================================================
P1 | p | i | k | a | c | ...
=====================================================
✗
By bad-character, we would shift forward (recall the argument "" to the function in the previous algorithm). By good-suffix, since there is no suffix of that is also a prefix of , we shift the entire thing past the current location, because we know non-of the substrings would work.
Suffix Skip Array
We can precompute the suffix skip array of amount to shift, where
Then, we can update guess by . This is computable in time (similar to KMP failure function, see below).
Recall from CS241 (yikes), we could match strings using DFAs. We could embrace the idea of using states to represent our matching progres. In addition, we have a new type of transition (failure):
With these rules, computations of the automaton are deterministic (but not formally a valid DFA).
Recall if we are at state , then we have the first characters matched at this moment.
The following automaton helps us match the pattern :
The normal transitions and states are intuitive (same as in CS241). To understand the failure transitions, consider what happens when we feed it the following inputs:
Following the failure arc is equivalent to shifting our pattern forward.
We can compute and store these failure arcs using an array , where the failure arc from state leads to . The indices differ by because we don't need to store the failure arc for state , as it has a loop if the character for valid transition doesn't match. Thus, stores the destination of failure arc originated from state 1 (which is always 0) and stores the destination of failure arc orginated from state .
Brute Force
Since we are basically trying to find the longest proper prefix of that is also a proper suffix of , we could do the following in time (suppose ):
We consider because we want proper suffixes of . is the length of the longest prefix of that is a suffix of .
Example: An Intuitive Way to Compute Failure Array
New Character | ||||||
Longest Valid Proper Prefix & Suffix of | ||||||
Current State and Valid Character Transition |
Let our starting state be and process the first character . There is no valid proper prefix of that is also a proper suffix, so the substring is . Intuitively, if we fail transition for next character, we would go back to state .
New Character | ||||||
Longest Valid Proper Prefix & Suffix of | ||||||
Current State and Valid Character Transition |
We are at state and the new character is . The valid transition character from current state (we look at the previous column!) does not match our new character, so we can't extend the substring. Thus, we stay at with the substring being .
New Character | ||||||
Longest Valid Proper Prefix & Suffix of | ||||||
Current State and Valid Character Transition |
We are still at state and the new character is . This time, the valid transition character (again, last column) matches our new character so we know we can extend our substring. In other words, previous substring plus new character, i.e., is our current longest valid proper prefix and suffix after seeing . We increment our state, so that the destination of the failure arc from state is state .
New Character | ||||||
Longest Valid Proper Prefix & Suffix of | ||||||
Current State and Valid Character Transition |
We are at state and the new character is , which matches the valid transistion character at current state. We add to current substring and increment our state. Note that at any time, the second row could serve as a sanity check to see if your algorithm is wrong.
New Character | ||||||
Longest Valid Proper Prefix & Suffix of | ||||||
Current State and Valid Character Transition |
Following the same logic, characters match so we extend the substring and increment state.
New Character | ||||||
Longest Valid Proper Prefix & Suffix of | ||||||
Current State and Valid Character Transition |
This time, , so we follow the failure arc at state to go back to ; (comparison at state ) so we follow the failure arc at state to go back to . At this point, the substring is empty and our stay at . Note that we don't need to compute for the last state, because if we got there then our text is accepted; we can happily write and move on.
Pseudocode