Tries Define by example Draw for {a,and,apple,apply,bad,badly,bat} Nodes = true/false end of word? Links labeled by letter Each node has an array or list of links Note that a set has a unique trie (unless child list is unsorted) How to search? Just a tree traversal. How to insert? Search If necessary, add chain of nodes for suffix Mark last node true How to delete? Search, remembering last node which was true or had multiple branches If final node doesn't have children, delete chain after the remembered node Slightly generalization Dictionary, keyed by strings Nodes store value, if any, associated with string A bit of history Fredkin (1960) Names comes from "reTRIEval", since used to store and lookup strings Pronounced "try" (usually) or "tree" Also known as "prefix tree" Each node represents a string prefix Numerous variants -- we'll explore some Often use "trie" to refer to the whole class of variants Compare to some other data structures Compare to balanced BSTs and hash tables (open addressing or chaining) m = string length M = sum of all string lengths n = #strings Asymptotic time to search/insert/delete BST -- O(m log n) Hash -- best: Theta(m) -- average: Theta(m) assuming good hash function for this set of strings -- worst, open: Theta(mn) -- worst, chaining: whatever is used for chaining data structure Trie -- Theta(m) "Exact" space for data structure Assume characters and pointers each one word BST -- A node for each string Each node has string pointer, length or EOS, 2 children = M + 4n Plus whatever info per node used for balancing Hash, open -- Assume dynamic table Each table entry is string pointer (NULL = empty) Each string has length or EOS M + n + table size = M + (1+c)n Where c is load factor Trie, array -- Each node has boolean, |Sigma| children <= M+1 nodes -- at most no overlap, plus root <= (|Sigma|+1)(M+1) Trie, list -- Each node has boolean, 2 pointers per child <= M+1 nodes -- at most no overlap, plus root Lists have symbol, next ptr, & ptr to each non-root node <= 2(M+1) + 3M = 5M+2 Summary =M+kn vs. <=k'M Tightness of latter bound depends on amount of prefix sharing. M,n are not independent. Tries allow other uses, e.g., A traversal outputs the strings in sorted order. Data structure for Burstsort (cache-efficient string sort) Useful in searching for patterns and near-matches Will explore this some in a few minutes Variations Patricia trie / radix tree Condense each chain into a single link labeled with a string Draw for original example Alternate data structures for a node's children Sibling list As in Fibonacci heaps Draw for original example Dynamic array, sorted or unsorted BST (for Patricia tries, not plain tries) Hash table (for Patricia tries, not plain tries) Ternary search tree Bentley, Sedgwick (1997) Each node contains a letter or end-of-string and has three children Left child -- Strings for which the current node's letter is too large Middle child -- String containing the current node's letter Right child -- Strings for which the current node's letter is too small Which letter is in each child node is dependent upon insertion order Thus, a set of strings doesn't have a unique TST Draw for original example Suffix tries / Suffix trees / Position Trees The Patricia trie of allnon-empty suffixes of a string (padded with a EOS marker) EOS marker ensures that only leaves can end suffixes Draw example -- abaababa Asymptotic space? Define |S| = n #leaves = n #internal non-root nodes < n-1, since each is branching #roots = 1 total #nodes < 2n space per node = O(1), if using sibling lists #edges = #non-root nodes < 2n space per edge = O(1), since label can be an index into string O(n) The example string is the "5th Fibonnaci string", a sequence which illustrates the worst-case behavior. Also linear time to build -- Ukkonen (1995) Useful for a variety of problems, e.g., Finding longest common substrings of two strings Lempel-Ziv data compression Searching for a regular expression String matching with suffix tries Pretty easy -- suggestions? Build suffix trie of TEXT -- O(n) Search for pattern starting from trie root -- O(m)