Indexing - fast searching - complex queries - document similarity - document ranking Traditional searching: grep - Build a regular expression - Running time proportional to size of input Searching - Goal: do a lot of work up front (build the index), less work per query Standard technique: inverted index index[word] = document1, document2, ... Query is a list of words. Get the documents for each word, then apply boolean AND (or other boolean functions). What about `NEAR' operator (in AltaVista / Google)? - add meta-data index[word] = (document1, location1), (document2, location2), ... Stop words, stemming, etc. - some words occur far too often - variations on the same word Customized input engines - Google has special case handlers for PDF and images Similarity - Goal: given a document, find *similar* documents - useful for Web searches - also useful for spam identification - Word vectors document[word] <-- frequency of word (sometimes multiplied by a `relevance' criteria -- less frequent words overall might be more relevant) document similarity == dot product of word vectors (normalized) Efficient storage? - Buckets of similar documents N-dimensional array, where N << number of words bucket[i0][i1][i2]...[iN] <-- set of word vectors where each index is how many words in a given range occur in that vector - To find "similar" documents, start searching where i0..iN matches your document, then start looking +/- 1, then +/- 2, etc. - 1-D hash tables can efficiently store N-D sparse arrays like this (so long as you don't want to do much iteration) - Higher dimensionality makes it harder to find "adjacent" buckets - Smaller dimensionality increases false collisions Document ranking / popularity - problem: porn vendors were gaming search engine metrics for list ordering (e.g., put word 100 times in a page) - solution? popularity metric - option 1: observe outgoing links that users click (still easy to fake, plus users often don't go where they want) - option 2: use web links to indicate popularity (Google PageRank) - very simple calculation - more incoming links == more popular - popular pages' outgoing links are more important as well - system of linear equations Popularity[page] == real number (sum over pages = 1) Popularity[p0] = sum p over all pages: link from p to p0? true: Popularity[p] / NumOutgoingLinks[p] false: 0 How to solve a large system of linear equations? Iteratively. (This problem occurs throughout computer science.) Treat this as a large matrix-vector equation LinkMatrix * Popularity = Popularity Solution 0: Do a matrix multiply, O(N^2), get approximate answer. Repeat often. Solution 1: Shooting (a variant on scatter-gather) Find biggest popularity: O(N) (or log N with a priority queue) Recompute Popularity for all outgoing links: O(NumOutgoingLinks) Normalization is critical (sum of all popularity is always 1) Other cool tricks - Associate text from the link I like foo "I like foo" is added to the list of words for http://foo.com - Keep popularity numbers around for documents you haven't loaded yet - Index / reindex popular documents more often than unpopular ones Current problems with Google - Where do you get the initial weights? - Buy thousands of dot-com addresses and have them link to you (try searching for car rentals or hotels in some city...) - Is "associated text" legit? - "Miserable failure" links George W Bush, Jimmy Carter, Michael Moore, Hillary Clinton - distinction between *relevance* and *popularity* Various future directions - Understanding different linguistic senses of a word - Does "apple" refer to a computer or a fruit? - What about foreign languages (e.g., "pomme")? - Combine remote indexing with local knowledge - data in your local files or your surfing history says something about what you find interesting - privacy concerns - Collaborative filtering (Amazon: "people who bought X tend to also buy Y") - Complex advertising options - Google AdWords: you bid for keywords (pay per click) - higher bids --> higher probability of appearing - clickthrough rates also figure in (pay less for more relevance) - Google AdSense: matching ads to 3rd-party web pages - how to match ads to pages? see if keyword search would find page? - how to filter inappropriate ads (competitors, porn, etc.)?