On Tuesday:
    binary trees (mutating vs. applicative)
    deletion via tree rotation
    some big-O analysis
    FYI: what I call "applicative" the book calls "persistent"

Today:
    treaps (getting better worst-case bounds through randomization)
    skip lists (a non-traditional way to get log N performance)
    Huffman coding (an interesting application of trees)
    B-trees (getting better typical performance)
    brief mention of others (AVL, R-B, etc.)
    (Python sketches of each of these appear at the end of the notes.)

treaps (designed by Seidel and Aragon)
- randomized data structure (no possible worst-case input!)
- easy to implement (vs. AVL or R-B trees)
- fundamental idea: every node gets a random number ("priority"),
  assigned when the node is created
- maintain the *tree* property on keys
- then maintain the *heap* property on priorities via rotation
  (smaller priorities go higher in the treap)

Insert: first find the "normal" leaf location
- then, recursing back up, rotate wherever the heap property is violated

Deletion is still easy
- rotate down to a leaf, as before, then remove

skip lists: linked lists with "skip" pointers
- exponential decay distribution of node "height"
  (half have height 1, 1/4 have height 2, 1/8 have height 3, etc.)
- can be generalized so every node has "fingers" that go various
  distances around the list (used in p2p systems like Chord)

Huffman coding: basic data compression
- goal: transmit the input using a small number of bits
- input: a sequence of tokens from a fixed set
  (words in a dictionary, pixel levels in an image, etc.)
- insight: represent common tokens with fewer bits than uncommon tokens

Step 1: compute the probability of every token occurring
Step 2: until the set has only one element left:
    find the two elements in the set with the lowest probabilities
    remove them from the set
    create a new tree node with both elements & their summed probability
    add it back to the set
The encoding can then be read off the tree.
(a "greedy" algorithm -- it ignores many other possibilities)

Why is the coding unambiguous? Because no code is a *prefix* of
another code.

B-trees
Goal: put lots more data into every node, increasing the branching
factor and reducing the tree height

Significant efficiency gains
- typically a fixed size for each block (e.g., 4KB)
- each node might be a disk block -- huge cost to fetch, so get the
  most bang for your buck
- similar issues for main memory vs. cache

Your file system looks like this
- directory "i-nodes" hold names and pointers to contents
- if there are too many files in a directory, the directory i-nodes
  need to branch

We still want to insert (key, value) tuples, just like in a normal tree
Now nodes hold a maximum number of keys (2t-1), plus pointers to
their (2t) children
- children hold keys *between* the parent's keys
- if you have to insert into a full node, compute the median and push
  it up to the parent
- that leaves two children with (t-1) elements each
- if and only if the root is full, allocate a new node on top
- this is the only way a B-tree ever gets taller

Deletion gets ugly
- same general idea as before: rotate to a leaf, then remove
- merge nodes when possible

Big-O analysis
Tree operations that were O(log n) are O(t log_t n) here
- higher t radically reduces the tree height, but adds to the linear
  search within each node
- constant factors make t almost "free" once you've paid for expensive
  disk reads and/or cache fetches
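Python sketches

These are illustrative sketches, not reference implementations; the
names in them (TreapNode, SkipList, huffman_codes, split_child, etc.)
are mine, and error handling is omitted throughout.

First, treap insertion: a normal BST insert down to a leaf, then
rotations on the way back up whenever a child's randomly assigned
priority is smaller than its parent's.

```python
import random

class TreapNode:
    def __init__(self, key, value):
        self.key = key
        self.value = value
        self.priority = random.random()  # assigned once, at creation
        self.left = None
        self.right = None

def rotate_right(node):
    # Lift node.left above node; key (BST) order is preserved.
    child = node.left
    node.left = child.right
    child.right = node
    return child

def rotate_left(node):
    # Mirror image: lift node.right above node.
    child = node.right
    node.right = child.left
    child.left = node
    return child

def treap_insert(node, key, value):
    # Normal BST insert down to a leaf location...
    if node is None:
        return TreapNode(key, value)
    if key < node.key:
        node.left = treap_insert(node.left, key, value)
        # ...then, unwinding the recursion, restore the heap property
        # (smaller priorities go higher) with a rotation.
        if node.left.priority < node.priority:
            node = rotate_right(node)
    else:
        node.right = treap_insert(node.right, key, value)
        if node.right.priority < node.priority:
            node = rotate_left(node)
    return node
```

Because the priorities are random, the treap's shape is independent of
the insertion order -- that is where the "no possible worst-case input"
claim comes from.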
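A skip list sketch under the same caveats. Repeated coin flips in
random_height produce exactly the height distribution from the notes
(half height 1, a quarter height 2, and so on); MAX_HEIGHT is just a
cap I picked.

```python
import random

MAX_HEIGHT = 16  # arbitrary cap; plenty for lists up to ~2^16 nodes

class SkipNode:
    def __init__(self, key, value, height):
        self.key = key
        self.value = value
        self.next = [None] * height  # next[i] is the level-i skip pointer

def random_height():
    # Coin flips give the exponential decay: half the nodes get
    # height 1, a quarter height 2, an eighth height 3, etc.
    height = 1
    while height < MAX_HEIGHT and random.random() < 0.5:
        height += 1
    return height

class SkipList:
    def __init__(self):
        self.head = SkipNode(None, None, MAX_HEIGHT)  # keyless sentinel

    def search(self, key):
        node = self.head
        # Walk each level as far as possible, dropping down a level
        # whenever the next node would overshoot the key.
        for level in reversed(range(MAX_HEIGHT)):
            while node.next[level] and node.next[level].key < key:
                node = node.next[level]
        node = node.next[0]
        return node.value if node and node.key == key else None

    def insert(self, key, value):
        # update[i] = rightmost node at level i that precedes the key
        update = [self.head] * MAX_HEIGHT
        node = self.head
        for level in reversed(range(MAX_HEIGHT)):
            while node.next[level] and node.next[level].key < key:
                node = node.next[level]
            update[level] = node
        new = SkipNode(key, value, random_height())
        for level in range(len(new.next)):
            new.next[level] = update[level].next[level]
            update[level].next[level] = new
```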
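The Huffman construction, using a binary heap for the "set" so that the
two lowest-probability elements are cheap to find. The
(probability, tiebreaker, tree) triple is an implementation detail: the
counter keeps heapq from ever trying to compare two trees when
probabilities tie.

```python
import heapq

def huffman_codes(probabilities):
    # probabilities: dict mapping token -> probability (raw counts work too)
    # A "tree" is either a bare token (leaf) or a (left, right) pair.
    heap = [(p, i, token) for i, (token, p) in enumerate(probabilities.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        # Greedy step: pull the two lowest-probability elements...
        p1, _, t1 = heapq.heappop(heap)
        p2, _, t2 = heapq.heappop(heap)
        # ...and merge them under a new node with the summed probability.
        heapq.heappush(heap, (p1 + p2, tiebreak, (t1, t2)))
        tiebreak += 1
    # Read the codes off the tree: left edge = "0", right edge = "1".
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"  # lone-token edge case
    walk(heap[0][2], "")
    return codes
```

For example, huffman_codes({"e": 0.4, "t": 0.3, "a": 0.2, "q": 0.1})
gives "e" a 1-bit code and "q" a 3-bit one. Prefix-freeness falls out
of the construction: tokens only ever sit at the leaves, so no token's
code can be an internal point on the path to another's.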
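Finally, B-tree insertion in the style of CLRS, following the split
rule above: a full node's median key is pushed up to its parent,
leaving two children of (t-1) keys each, and a new node appears on top
only when the root itself is full. For brevity this sketch stores bare
keys rather than (key, value) tuples and fixes t = 3.

```python
T = 3  # minimum degree t: nodes hold at most 2t-1 keys and 2t children

class BTreeNode:
    def __init__(self, leaf=True):
        self.keys = []      # at most 2t-1 keys, kept sorted
        self.children = []  # at most 2t child pointers (empty for leaves)
        self.leaf = leaf

def split_child(parent, i):
    # Split the full (2t-1 keys) child parent.children[i]: the median
    # key moves up into the parent, leaving two t-1 key children.
    full = parent.children[i]
    right = BTreeNode(leaf=full.leaf)
    median = full.keys[T - 1]
    right.keys = full.keys[T:]        # upper t-1 keys
    full.keys = full.keys[:T - 1]     # lower t-1 keys
    if not full.leaf:
        right.children = full.children[T:]
        full.children = full.children[:T]
    parent.keys.insert(i, median)
    parent.children.insert(i + 1, right)

def btree_insert(root, key):
    # Returns the (possibly new) root.
    if len(root.keys) == 2 * T - 1:
        # The only case where the tree gets taller: a full root gets
        # a new node allocated on top, then splits into it.
        new_root = BTreeNode(leaf=False)
        new_root.children.append(root)
        split_child(new_root, 0)
        root = new_root
    _insert_nonfull(root, key)
    return root

def _insert_nonfull(node, key):
    i = len(node.keys) - 1
    if node.leaf:
        # Shift larger keys right and drop the new key into place.
        node.keys.append(None)
        while i >= 0 and key < node.keys[i]:
            node.keys[i + 1] = node.keys[i]
            i -= 1
        node.keys[i + 1] = key
    else:
        # Find the child whose key range contains the new key.
        while i >= 0 and key < node.keys[i]:
            i -= 1
        i += 1
        if len(node.children[i].keys) == 2 * T - 1:
            split_child(node, i)
            if key > node.keys[i]:
                i += 1
        _insert_nonfull(node.children[i], key)
```

Splitting full children on the way down means a parent always has room
when a median gets pushed up into it.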