Define problem & terminology Fixed-string ("pattern") matching in "text" string "Substring matching" Small text, large pattern Alphabet Find all matches vs. find first match Real-world uses & variations of problem? -- discussion Uses fgrep-like uses Spell-checking (sort of) Early search engines (later, mostly moved to hashing or probabilistic techniques, and now mostly to inverted index tables) DNA analysis Variations Regexp matching -- grep/egrep special case: subsequence matching -- DNA analysis Approximate matching -- e.g., edit distance Stemming -- i.e., stripping & adding affixes Tokenization Allowing false positive matches (due to hash collisions) 2D pattern matching -- 2D pattern & text (The symbols don't have to be letters. E.g., bitmaps!) Note: Algorithmically, only similar for bitmap-like data. Many 2D and 3D pattern matching applications assume more compact data structures. Focus on the original problem Several algorithms Concentrate on high-level issues, leaving some details to reading from textbook Will mention extensions when applicable to these algorithms Naive/brute force For each index, left to right, try to match. Example Worst-case? Text is all same character Pattern is all same character, except at end Asymptotic analysis? Rabin-Karp, 1981 Intuition -- hashing is the solution to all algorithmic problems While we haven't used it in 482, it gets used everywhere in practice Intuition (not the alg!) Compute hash of pattern For each index of text, compute hash of |pattern| substring Attempt match only at indices where hashes are equal Computing like this would take Theta(mn) time Pick a hash function with a specific property: hash(xb) can be computed from hash(ax) and b As always, it should be a _good_ hash function, mapping uniformly to many different values E.g., interpret the |pattern|-length string as a base |Sigma| number, modulo some q. Example pattern = ababaca text = abcabababacababaca Asymptotic analysis? Asymptotically, no benefit If good hash function, expect factor q improvement Also requires Theta(m) preprocessing Finite automaton Simple version From pattern, build NFA with initial Sigma*-loop-transition -- easy From NFA, build DFA -- see COMP 481 Run DFA on text Example pattern = ababaca text = abcabababacababaca Matching looks at each text symbol exactly once Standard general algs to build DFA from NFA are too slow Can directly build DFA from fixed string in linear time O(m |Sigma|) See book The pattern->NFA->DFA construction is also straigtforward for regexps But construction is in general exponential in size of pattern Knuth-Morris-Pratt, 1977 Intuition: To improve naive algorithm, when match fails, could often shift pattern by more than one. pattern = ababaca text = abcabababacababaca X Shift pattern to make its prefix "aba" align with suffix of matched substring "ababa": pattern = ababaca text = abcabababacababaca Define prefix function pi according to this intuition: pi[q] = length of the longest prefix of P that is a proper suffix of P[0..q] pattern = ababaca text = abcabababacababaca pi = 0012301 Confirm a few values of this pi. How used by matching? When match fails, retry match at P[pi[index of last matching pattern char]]. Corresponds to example. Except, when match fails at first pattern char, just shift. pattern = ababaca pattern = ababaca text = abcabababacababaca When match succeeds, continue matching at P[pi[index of last matching pattern char]]. pattern = ababaca pattern = ababaca text = abcabababacababaca Similar to DFA DFA records how much to shift, based on where match failed, and what the missed text char was. pi doesn't distinguish based on missed text char. pi can be computed in linear time O(m) -- see book def'n can be improved -- see book exercise Boyer-Moore, 1974 Fairly easy to describe by example, but complicated to implement. Numerous variations Idea 1 Like naive algorithm, but test each potential match right-to-left. Idea 2 -- occurrence heuristic Example rum rum conundrum --> conundrum Match at "m/n" fails. Move right 3 (= length of pattern) positions, since pattern doesn't have an "n". Example drum drum conundrum --> conundrum Match at "m/u" fails. Move right 1 position, to line up "u"s. Based upon previous examination of the pattern, we know where the right-most occurrence of each letter is. Example natu natu conundrum --> conundrum Match at "u/u" succeeds, but then match at "t/n" fails. Move right 2 positions, to line up "n"s. We know about this posible match because (1) we've previously examined the pattern to known it starts with an "n", (2) we just discovered there's an "n" there in the text, and (3) based upon pre-examining the pattern we know that moving only 1 position would result in the non-"u" in the pattern lined up with the "u" in the text. Example date date detective --> detective Match at "e/a" fails. Move to line up "e"s (as with previous "drum" example). Moving left?!? No! Fix heuristic by moving right one position instead of moving left. date date detective --> detective Idea 3 -- match heuristic Same idea as KMP. Example banana banana a banana --> a banana Use knowledge of where else "ana" occurs in pattern. Idea 4 -- how to combine heuristics Choose shift as the maximum given by the two heuristics. Can potentially add in other heuristics.