Myeongjae Jeon, Reducing latency of web search queries

Web search engines must search a massive number of web documents to produce high quality search results. To constantly provide fast responses to user queries on this massive-scale data, today's web search relies on a highly parallel workflow -- an entire web index for the documents is sharded and distributed among hundreds or thousands of servers, and a search query is parallelized by distributing the query processing across the servers. Within each of these servers, the query is, however, processed sequentially. Although each server may be processing multiple queries concurrently, with modern multicore servers, parallelizing the processing of an individual query within the server may nonetheless improve the users experience by reducing the response time.

In this talk, I present new architectures and mechanisms for parallel query execution in web search environment. There are two fundamental techniques to support key elements of architecture design. First, intra-query parallelization of index searching parallelizes each individual query with small speculative execution and good load balancing. The key idea is for a parallel search to mimic the sequential order of execution that almost never scans the entire index. Second, prediction framework predicts query execution time with high accuracy and allows identifying a majority of long-running queries. This makes the query parallelization extremely effective because parallelization substantially reduces the execution time of long-running queries with low overhead and high parallelization efficiency. In turn, I propose several approaches to reducing the latency of web search, based on these two techniques. These approaches have been or are to be prototyped in Microsoft Bing, and evaluated experimentally with production workloads.