# Pattern Matching

## CS240E: Data Structure and Data management (Enriched)

### David Duan, 2019 Winter, Prof. Biedl

#### Motivation

##### Pattern Matching

Given a text $T = T[0 \cdots n-1]$ of length $n$ and a pattern $P = P[0 \cdots m-1]$ of length $m$, we return the first $i$ such that $P[j] = T[i+j]$ for $0 \leq j \leq m-1$, which is the index of the first occurrence of $P$ in $T$, or FAIL, if $P$ does not occur in $T$.

##### Definitions
• A substring $T[i\cdots j]$, $0 \leq i \leq j < n$, is a string of length $j-i+1$ which consists of characters $T[i], \ldots, T[j]$ in order.
• A prefix of $T$ is a substring $T[0 \cdots i]$ of $T$ for some $0 \leq i \leq n-1$.
• A suffix of $T$ is a substring $T[i \cdots n-1]$ of $T$ for some $0 \leq i \leq n-1$.
• A border of $T$ is a substring $r$ with $r = T[0\cdots b-1] = T[n-b \cdots n-1]$ where $b \in \{0, n-1\}$. In other words, a border of $T$ is a substring that is both proper prefix and proper suffix of $x$. We call its length $b$ the width of the border.
• A guess or shift is a position $i$ such that $P$ might start at $T[i]$. Note that the valid guesses initially are $0 \leq i \leq n-m$.
• A check of a guess is a single position $j$ with $0 \leq j < m$ where we compare $T[i+j]$ to$P[j]$. Note that we must preform $m$ checks of a single correct guess, but may make fewer checks of an incorrect guess.
##### Brute-Force Algorithm

We check every possible guess.

The worst case performance is $\Theta((n-m+1)m)$, as there are $n-m+1$ valid guesses and checking for each guess (string comparison) takes $\Theta(m)$ time.

The runtime for brute-force algorithm is $\Theta(mn)$, which is quadratic if $m$ is large!

##### Improvement

To improve the performance, we can break the problem into two parts:

1. Preprocessing: we build data structures or extract information that makes query easier, which may take a long time,
2. Query: we carry out the original task, which should be very fast as we have already paid the price during the first stage.

The Range Query problem in our previous module can be seen an example of prepocessing.

For Pattern Matching, we can either do preprocessing on the pattern $P$, where we eliminate guesses based on completed matches and mistakes, or on text $T$, where we create a data structure to make finding matches faster.

#### Karp-Rabin Algorithm

Key Idea We use hashing to eliminate guesses.

• $h(k_1) \ne h(k_2) \implies k_1 \ne k_2$, i.e., there is no need for $strcmp$ if hash values differ.
• $h([x_0\cdots x_4]) = (x_0x_1x_2x_3x_4)_{10} \mod M$ where $M$ is a prime.
##### Karp-Rabin Fingerprint Algorithm (Naive)

Strategy

The initial hashes are called fingerprints. We compute the fingerprint/hash value (using the ordinary flatten-then-modular approach) for each substring of length $m$ in $T$. If this fingerprint equals the fingerprint for the pattern, we $strcmp$ them to check if it really is a match; otherwise, we discard the current guess.

Pseudocode

Correctness

We will never miss a match, as $h(T[i\cdots i+m-1]) \ne h(P) \implies T[i\cdots i+m-1] \ne P$.

Complexity

The overall runtime is still $\Theta(mn)$ worst case as computation (line $3$) depends on $m$ characters.

We didn't really improve anything!

##### Karp-Rabin Fingerprint Algorithm (Rolling Hash)

Strategy

Suppose in a previous iteration, we have computed $h(41592) = 76$, and now we want $h(15926) = 15926 \mod 97$. Observe $15926 = (41592 - 4 \cdot 10000) \cdot 10 + 6$. Thus:

Given the hash value of the previous guess, we can compute the hash value of the next guess in constant time!

Pseudocode

Correctness

Correctness for rolling hash comes from modular arithmetic. The rest are identical to before.

Complexity

Worst case is still $\Theta(mn)$ but this is very unlikely. The expected time is $O(m+n)$, where we spend $O(m)$ doing the initial computations and $strcmp$ and during each guess (roughly $O(n)$ of them) we perform constant operations.

This is easy and already faster than brute-force, but we can do even better!

#### Boyer-Moore Algorithm

Key Idea We compare the characters of the pattern and the substring from text from right to left, skipping the unnecessary comparisons by using the good-suffix and bad-character heuristics.

We compare the end characters of the pattern with the text. If they do not match, then the pattern can be moved on further. If the characters do not match in the end, there is no need to do further comparisons. In addition, we can also see what portion of the pattern has matched (matched suffix), so we could utilize this information and align the text and pattern by skipping any unnecessary comparisons. At the time of a mismatch, each heuristic suggests possible shifts, and the BM algorithm shifts the pattern by considering the maximum shift possible due to two heuristics.

Notation In the following examples, let ! denote the mismatched character and # denote the starting position of $P$ after our shift $P$.

This heuristic is based on the mismatched character.

If there is a mismatch between the character of the pattern and the text, then we check if the mismatched character of the text occurs in the pattern or not. If this mismatched character does not appear in the pattern, then the pattern will be shifted next to this character; if the character appears somewhere in the pattern, we shift the pattern to align with the occurrence of that character with the bad character of the text string.

Example I

The mismatched character is at $i = 5$, where we see $b \ne c$. Since $b \notin P$, we can shift $P$ to $i + 1 = 6$, i.e., our next guess is $T[6\cdots 11]$, since substrings $T[0\cdots5], T[1\cdots 6], \ldots, T[5\cdots 10]$ all contain $b$ but $P$ does not. This saves us many unnecessary comparisons.

                                          #              i     0   1   2   3   4   5   6   7   8   9   0   1                =====================================================            T   | a | c | a | c | a | b | a | c | a | b | a | b | a |                =====================================================            P   | a | c | a | c | a | c |                =====================================================                                        | a | c | a | c | a | c |                =====================================================                                      ✗

Example II

We have a matched suffix $ac$ and mismatched character $d$ at $i = 3$. Since $d \notin P$, we shift $P$ to $i + 1 = 4$.

xxxxxxxxxx                                  #            i     0   1   2   3   4   5   6   7   8   9   0   1                =====================================================            T   | a | c | a | d | a | c | a | c | a | b | a | b | a |                =====================================================            P   | a | c | a | c | a | c |                =====================================================                                | a | c | a | c | a | c |                =====================================================                              ✗  ✓  ✓

Example III

We have a matched suffix $ac$ and mismatched character $d$ at $i = 3$. This time $d \in P$ and the only occurrence of $d$ is at $j = 1$. Thus, we shift $P$ to $2$ to align $T[i]$ and $P[j]$.

xxxxxxxxxx                          #   !            i     0   1   2   3   4   5   6   7   8   9   0   1                =====================================================            T   | a | c | a | d | a | c | a | c | a | b | a | b | a |                =====================================================            P   | a | d | a | c | a | c |                =====================================================                        | a | d | a | c | a | c |                        =====================================================                              ✗  ✓  ✓

Example IV

We have a matched suffix $cc$ and a mismatched character $a$ at $i = 3$. This time however, there are two occurrences of $a$ in $P$, each at $j = 0$ and $j = 2$. We always shift the minimal amount to avoid missing any possible match. In this case, we prefer $P_1$, because we only shift $1$ character, where $P_2$ shifts $3$ (and missed the matched string at $T[1\cdots 6]$! I came up with this example to show the importance of align the bad character to its last occurrence in the pattern).

xxxxxxxxxx                      #       !                 i     0   1   2   3   4   5   6   7   8   9   0   1                =====================================================            T   | a | a | c | a | c | c | c | c | a | b | a | b | a |                =====================================================            P   | a | c | a | c | c | c |                =====================================================            P1      | a | c | a | c | c | c |                  =====================================================            P2              | a | c | a | c | c | c |                =====================================================                              ✗  ✓  ✓

Last Occurrence Function

In Example IV, we saw the importance of aligning the bad character to its last occurrence in the pattern. Therefore, we need a last-occurrence function $L:\Sigma \to \N_0$, where

For Example IV, $L(a) = 2$, $L(c) = 5$, and $L(x) = -1$ for all $x \in \Sigma \setminus \{a,c\}$. When we see the mismatched $a$ at $i = 3$, check $L(a) = 2$ and shift $3 - 2 = 1$ character.

Remark. Sometimes $j - L[T[i+j]]$ produces a negative value:

xxxxxxxxxx            i     0   1   2   3   4   5   6   7   8   9   0   1                =====================================================            T   | a | b | a | a | b | a | c | c | a | b | a | b | a |                =====================================================            P   | c | a | b | a | b                =====================================================                          ✗  ✓  ✓

If we were to align $T$ with the last occurrence of $a$ in $P$, we would need to shift $-1$. To avoid negative shifting, we take $i + \max\{i, j-L[T[i+j]]\}$, which guarantees that $i$ is incremented by at least $1$. This also motivates the Good Suffix heuristic as we could do better than only shifting $1$ in the worst case.

##### Good Suffix Heuristic

This heuristic is based on the matched suffix.

We shift the pattern to the right in such a way that the matched suffix subpattern is aligned with another occurrence of the same suffix in the pattern.

For this heuristic, an array $S$ is used. Each entry contains the shift distance of the pattern if a mismatched at position $i-1$ occurs, i.e., if the suffix of the pattern starting at position $i$ has matched. In order to determine the shift distance, two cases have to be considered.

1. The matching suffix occurs somewhere else in the pattern.
2. The suffix of the matching suffix occurs as a prefix of the pattern.

Example I: Align Suffix

xxxxxxxxxx                             !,#            i     0   1   2   3   4   5   6   7   8   9   0   1                =====================================================            T   | l | m | n | n | p | a | ...                =====================================================            P   | s | p | a | a | p | a |                =====================================================                            | s | p | a | a | p | a |                      =====================================================                              ✗  ✓  ✓

We matched suffix $pa$ and found a mismatch at $3$, where $P = a$ but $T = n$. Therefore, we want to shift forward. The intuition is that, after shifting, two characters aligned with $T$ and $T$ must be $pa$, otherwise it is guaranteed to fail. Since $pa$ occurs in $P$ at $1$ (recall this is the index where the substring $pa$ starts), we align $P$ with $T$ for our next guess.

Example II: Minimal Shifting

xxxxxxxxxx                              #           !            i     0   1   2   3   4   5   6   7   8   9   0   1                =====================================================            T   | n | p | a | l | m | n | n | p | a | a | p | a |                =====================================================            P   | n | p | a | n | p | a | a | p | a |                =====================================================                            | n | p | a | n | p | a | a | p | a |                =====================================================                                          ✗  ✓  ✓

In this example, there are two occurrences of $pa$ in $P$. We always prefer the minimal amount of shifting to avoid missing possible matches.

Example III: Character Before Suffix

xxxxxxxxxx                                         #,!            i     0   1   2   3   4   5   6   7   8   9   0   1                =====================================================            T   | a | b | c | d | e | f | a | a | a |                =====================================================            P   | n | a | a | p | a | a | p | a | a |                =====================================================            P1                          | n | a | a | p | a | a | ...                =====================================================            P2              | n | a | a | p | a | a | p | a | a |                =====================================================                                          ✗  ✓  ✓

In this example, we matched suffix $aa$ and the mismatch occurs at $i=6$, which tells us that $p$ in front of the matched suffix $aa$ doesn't work. Therefore, shifting to make $P$ align with $T$ will make us compare $P_2 = p$ with $T = a$, which we already know from the first iteration that $paa$ is bad. Thus, we shift to make $P$ align with $T$ to avoid unnecessary comparisons.

Example IV: Bad Character & Good Suffix

xxxxxxxxxx                                         #,!            i     0   1   2   3   4   5   6   7   8   9   0   1                =====================================================            T   | a | b | c | d | e | f | x | a | a |                =====================================================            P   | n | a | a | p | a | a | p | a | a |                =====================================================                                            | n | a | a | p | a | a | ...                =====================================================                                          ✗  ✓  ✓

If we were only to use good-suffix heuristic, we would shift the same way as in Example III. However, the bad-character heuristic tells us $x$ does not occur in $P$ at all, so we shift $P$ to one character after $x$. The previous three examples are designed in the way that we can ignore the bad-character heuristic.

Example V: Suffix of Suffix Matches Prefix of Pattern

xxxxxxxxxx                              !               #            i     0   1   2   3   4   5   6   7   8   9   0   1                =====================================================            T   | a | b | c | a | a | b | b | a | b | b | a |                =====================================================            P   | a | c | b | b | a | b | b | a |                =====================================================            P1              | a | c | b | b | a | b | b | a |                =====================================================            P2                              | a | c | b | b | ...                =====================================================                             ✗

The matched suffix is $abba$, but $abab$ is not occurred elsewhere in our pattern $P$. We then search for partial suffix, or suffix of this suffix (which is still a suffix of $P$), see if any part of it is a prefix for $P$. Note that we only accept when a partial suffix is a prefix of $P$; anywhere else the partial suffix occurs in $P$ are not accepted. Consider the above example. We know $abba$ is not occurred elsewhere in $P$, so we start searching if a partial suffix is a prefix of $P$. Although $bba$ does occur, we know the character preceding $bba$ is a mismatch (otherwise, $abba$ would occur), so we don't accept this occurrence of $bba$ in the middle of $P$. Thus, $P_1$ is wrong. The correct shift is $P_2$, where we see $a$ is both a suffix and a prefix of $P$.

Example VI: Together

xxxxxxxxxx                              !               #            i     0   1   2   3   4   5   6   7   8   9   0   1                =====================================================            T   | c | h | u | a | c | h | u | p | i | k | a | p | ...                =====================================================            P   | p | i | k | a | c | h | u |                =====================================================            P1                              | p | i | k | a | c | ...                =====================================================                         ✗

By bad-character, we would shift $1$ forward (recall the argument "$1$" to the $\max$ function in the previous $BoyerMooreBCH$ algorithm). By good-suffix, since there is no suffix of $achu$ that is also a prefix of $pikachu$, we shift the entire thing past the current location, because we know non-of the substrings $T[1\cdots 7], \cdots, T[6 \cdots 12]$ would work.

Suffix Skip Array

We can precompute the suffix skip array of amount to shift, where Then, we can update guess by $i \leftarrow i + (m - i - S[j])$. This is computable in $\Theta(m)$ time (similar to KMP failure function, see below).

##### Conclusion
• BM works well in practice, even bad-character heuristic alone is often good enough.
• Preprocessing takes $\Theta(m+|\Sigma|)$: $\Theta(|\Sigma|)$ for last-occurrences and $\Theta(m)$ for suffix-skip.
• Search takes $\Theta(n)$ time (often better in practice).
• Auxiliary space $\Theta(m+|\Sigma|)$: $\Theta(|\Sigma|)$ for last-occurrences and $\Theta(m)$ for suffix-skip.

#### Knuth-Morris-Pratt Algorithm

##### Motivation

Recall from CS241 (yikes), we could match strings using DFAs. We could embrace the idea of using states to represent our matching progres. In addition, we have a new type of transition $\times$ (failure):

1. Use this transition only if no other fits.
2. Does NOT consume a character (like an $\varepsilon$-transition).

With these rules, computations of the automaton are deterministic (but not formally a valid DFA).

##### How the Automaton Works

Recall if we are at state $j$, then we have the first $j$ characters matched at this moment.

The following automaton helps us match the pattern $ababaca$: The normal transitions and states are intuitive (same as in CS241). To understand the failure transitions, consider what happens when we feed it the following inputs:

• $ababaca$: Trivial; we get to the accepted state 7.
• $abcX$: $ab$ matched so we get to state 2. However, the next character is a mismatch (we see this by comparing the next character in the input string, i.e., $c$, with the character for the valid transition at the current state, i.e., $a$), so we follow the failure arc from state 2, which sends us back to state 0. We then compare the next character $c$ with the character for a valid transition character $a$, which is a mismatch. Thus, we follow the loop and stay in state 0. Intuitively, this means that we need to start from the beginning when we match $X$.
• $aaX$: $a$ matched so we get to state 1. However, the next one is a mismatch, so we don't consume anything and follow the failure arc back to 0. Since next character is $a$ and the character for valid transition is also $a$, we can go to state 1. Intuitively, this means that when we start matching $X$, we have already matched an $a$.
• $ababeX$: The first four letters matched, so we are at state 4. Next character is a mismatch, so we follow the failure arc of state 4 to go back to state 2. The transition fails at state 2 again, so we follow the failure arc back to 0. Since $e \ne a$, we follow the loop and stay at state 0. This means the same as $abcX$.

Following the failure arc is equivalent to shifting our pattern forward.

##### Failure Array

We can compute and store these failure arcs using an array $F[0\cdots m-1]$, where the failure arc from state $j$ leads to $F[j-1]$. The indices differ by $1$ because we don't need to store the failure arc for state $0$, as it has a loop if the character for valid transition doesn't match. Thus, $F$ stores the destination of failure arc originated from state 1 (which is always 0) and $F[j]$ stores the destination of failure arc orginated from state $j+1$.

Brute Force

Since we are basically trying to find the longest proper prefix of $P$ that is also a proper suffix of $P$, we could do the following in $O(m^3)$ time (suppose $P = ababaca$): We consider $P[1\cdots j]$ because we want proper suffixes of $P$. $F[j]$ is the length of the longest prefix of $P$ that is a suffix of $P[i \cdots j]$.

Example: An Intuitive Way to Compute Failure Array

Let our starting state be $0$ and process the first character $p = P = a$. There is no valid proper prefix of $p$ that is also a proper suffix, so the substring is $\varepsilon$. Intuitively, if we fail transition for next character, we would go back to state $0$.

We are at state $0$ and the new character is $b$. The valid transition character from current state (we look at the previous column!) does not match our new character, so we can't extend the substring. Thus, we stay at $0$ with the substring being $\varepsilon$.

We are still at state $0$ and the new character is $a$. This time, the valid transition character (again, last column) matches our new character so we know we can extend our substring. In other words, previous substring plus new character, i.e., $\varepsilon + a = a$ is our current longest valid proper prefix and suffix after seeing $aba$. We increment our state, so that the destination of the failure arc from state $3$ is state $1$.

We are at state $1$ and the new character is $b$, which matches the valid transistion character at current state. We add $b$ to current substring and increment our state. Note that at any time, the second row could serve as a sanity check to see if your algorithm is wrong.

Following the same logic, characters match so we extend the substring and increment state.

This time, $c != b$, so we follow the failure arc at state $3$ to go back to $1$; $c != b$ (comparison at state $1$) so we follow the failure arc at state $1$ to go back to $0$. At this point, the substring is empty and our stay at $0$. Note that we don't need to compute for the last state, because if we got there then our text is accepted; we can happily write $\square$ and move on.

Pseudocode