Last month I took the online course Text Retrieval & Search Engines on Coursera, and I’m using this blog to write down the main takeaways of every week. In the fourth and final week of the course, we shifted our gaze to Web Search and Recommender Systems. In this blog post I share my notes on web search.
Scaling up a text retrieval system from a limited set of texts to a web search engine brings with it a number of challenges and opportunities. Obviously the first challenge is scalability: how can we handle the size of the web and ensure completeness of coverage? How do we serve many users quickly? The answer to these questions lies in parallel indexing and searching, for example on the basis of MapReduce. The second challenge is the low quality of a lot of information on the web. Spam detection and robust ranking methods must guard our search engine against low-quality data. Third, we must take the dynamics of the web into account: new pages are created constantly, while existing pages are updated frequently. The news isn’t all bad, however. The web also brings a few opportunities: we can leverage many additional heuristics, such as the information available in links, to improve the search accuracy of our system.
Three components are essential in building a web search engine: the crawler, indexer and retriever. First, the crawler (also often referred to as a spider or robot) needs to fetch all pages we would like to index. While building a toy crawler is easy, a real-world crawler must be able to deal with the many intricacies of the world wide web, including server failures, crawling courtesy, the variation in file types, etc. Crawling is usually done in a parallel fashion, following a breadth-first strategy to avoid overloading servers. The process is repeated many times, but not every web page is crawled equally often. Efficient crawlers target pages that are updated and accessed frequently.
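The breadth-first strategy can be sketched with a toy crawler. This is only a sketch of the traversal logic: the in-memory `web` dictionary and the `get_links` callback stand in for real HTTP fetching and HTML parsing, which a production crawler would of course need.

```python
from collections import deque

def crawl(seed_pages, get_links, max_pages=100):
    """Breadth-first crawl: visit pages level by level, which spreads
    requests across many servers instead of hammering a single one."""
    frontier = deque(seed_pages)       # pages discovered but not yet fetched
    visited = set()
    order = []                         # the order in which pages were crawled
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        for link in get_links(url):    # in a real crawler: fetch + parse HTML
            if link not in visited:
                frontier.append(link)
    return order

# Toy web: a dict standing in for real HTTP fetching (an assumption here).
web = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["d.com"],
    "c.com": ["d.com"],
    "d.com": [],
}
print(crawl(["a.com"], lambda u: web.get(u, [])))
# → ['a.com', 'b.com', 'c.com', 'd.com']
```

Note how the queue-based frontier visits everything one hop from the seed before anything two hops away, which is exactly the breadth-first behaviour described above.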
The second essential component in a web search engine is the index. Although web indexes are still based on standard IR techniques, the system must be able to deal with index files that do not fit into a single machine. Solutions come in the form of distributed file systems on a cluster of machines, such as the Google File System, and software frameworks for parallel computing, such as MapReduce (of which Hadoop is an open-source implementation). Because it hides many low-level details, MapReduce makes parallel programming relatively easy. At its core are two main functions: the map function that defines the tasks that are executed in parallel, and the reduce function that processes the output of all map functions. In the context of inverted indexing, the map function counts how many times each word occurs in each document, and the reduce function concatenates this information and sorts it.
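The inverted-indexing example can be sketched in a few lines of single-machine Python. This is a simulation, not real MapReduce: the map function emits per-document word counts, a dictionary plays the role of the shuffle phase that groups pairs by key, and the reduce function sorts each word's postings list.

```python
from collections import Counter, defaultdict

def map_fn(doc_id, text):
    # Map: count how often each word occurs in one document,
    # emitting (word, (doc_id, count)) pairs.
    for word, count in Counter(text.lower().split()).items():
        yield word, (doc_id, count)

def reduce_fn(word, postings):
    # Reduce: gather all (doc_id, count) pairs for one word
    # into a single sorted postings list.
    return word, sorted(postings)

def mapreduce_index(docs):
    grouped = defaultdict(list)        # stands in for the shuffle phase
    for doc_id, text in docs.items():
        for word, posting in map_fn(doc_id, text):
            grouped[word].append(posting)
    return dict(reduce_fn(w, p) for w, p in grouped.items())

docs = {1: "web search engines", 2: "search the web"}
index = mapreduce_index(docs)
print(index["search"])   # → [(1, 1), (2, 1)]
```

In a real framework the map calls would run in parallel on different machines and the shuffle would move data over the network; the point of MapReduce is that the programmer only writes the two functions above.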
Finally, the retriever in a web search engine needs to go beyond standard IR models. Generally, it uses link information to improve scoring, exploits clickthrough data as a source of implicit feedback and relies on machine learning to combine various types of features. Links are crucial to the web: they indicate how useful a web page is, and their anchor text provides an additional summary for a document. The most well-known algorithm for capturing page popularity through links is probably PageRank. Essentially, PageRank treats links like citations in scientific literature. It assumes that pages that are “cited” more often will be more useful, and also considers “indirect citation”: a link to a page is given more weight if it appears on another page that is often linked to. A similar algorithm, HITS (Hypertext-Induced Topic Search), relies on the notion of authorities (pages with many in-links) and hubs (pages with many out-links). It iteratively applies the idea that good authorities are linked to by good hubs, and that good hubs point to good authorities. Both PageRank and HITS have many applications beyond web search, for example in network analysis.
Given that there is so much useful information available for ranking web pages, a web search engine needs to figure out how it can combine all these features into a single ranking function. The general idea is that the relevance of a page is a weighted sum of all its features. The values of the weights can be learned on the basis of training data whose relevance is known, with an approach such as logistic regression. Other, more advanced learning algorithms attempt to directly optimise a retrieval measure such as MAP or nDCG. Details can be found in Tie-Yan Liu’s “Learning to Rank for Information Retrieval” (2009), or Hang Li’s “A Short Introduction to Learning to Rank” (2011).
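The weighted-sum idea can be sketched as follows. The feature names and weight values here are made up for illustration; in a real system the weights would be learned from labelled training data, e.g. by logistic regression as mentioned above.

```python
import math

def relevance_score(features, weights):
    # Weighted sum of ranking features, squashed through a logistic
    # function so the score can be read as a probability of relevance.
    z = sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative, hand-picked weights (these numbers are an assumption,
# not learned values): text match matters most, then link popularity.
weights = {"bm25": 1.2, "pagerank": 0.8, "clicks": 0.5}

page_a = {"bm25": 2.0, "pagerank": 0.1, "clicks": 0.3}
page_b = {"bm25": 0.5, "pagerank": 0.9, "clicks": 0.1}

print(relevance_score(page_a, weights) > relevance_score(page_b, weights))
# → True: page_a's strong text-match score outweighs page_b's popularity
```

Learning-to-rank methods go further than this pointwise picture by optimising the weights against list-level measures such as MAP or nDCG directly.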