Yves Peirsman bio photo

Yves Peirsman

Natural Language Processing Enthusiast at NLP Town.

Twitter LinkedIn

This month I’m taking the online course Text Retrieval & Search Engines on Coursera, and I’m using this blog to write down the main takeaways every week. In this second week of the course, we focused on two main topics: the implementation details of text retrieval systems and the evaluation of their output.

A text retrieval system typically consists of three parts. First there is the indexer, where documents are tokenized, transformed to their desired document representation, and indexed. Second we have the scorer, which takes the representation of a query and the documents in the collection, and comes up with a list of documents relevant to the query. Third there is an optional feedback loop, where users give feedback about the initial results, and the ranked list is adjusted. Whereas the implementation of the indexer and scorer are fairly standard, this feedback loop has many possible variations.

Let us start with the indexer. The most popular indexing method for basic search algorithms is the so-called inverted index, which contains all the crucial information about the terms and documents in the collection. It has a dictionary with basic statistics for each term (such as its frequency and the number of documents in which it occurs), and the postings list with the documents that match a term, together with the frequency of that term in every document. The index may also contain position information, so that the system can check whether two matched terms are adjacent, for example. Since most words occur very rarely, consulting this inverted index is much more efficient than scanning all documents sequentially.

However, one problem with indexes is their sheer size. Because computers may run out of memory during indexing, text retrieval systems often construct their indexes through a sort-based method. As they scan all documents in a collection, they collect (term id, doc id, term frequency) tuples. Every time they run out of memory, they sort these tuples on term id and write the result to the disk as a temporary file. At the end of the indexing process, they merge and sort all temporary files to build the final inverted index. Of course, this final index will still be very large, and a lot of disk space can be saved by compressing it. This is very effective because many of its values have very skewed distributions. Term frequencies are the clearest example of this (they tend to obey Zipf’s law), but document information, too, can be compressed very effectively. This is done by replacing the consecutive doc ids for a term by the difference between them, and then compressing the resulting values. Popular compression methods are gamma coding or delta encoding.

The scorer, too, can be optimized for fast searching. For example, statistics such as document lengths are generally precomputed. Scoring works by keeping a score accumulator for each document in the collection. When the system fetches the inverted list of documents and frequencies for a term, it updates the score accumulators for each document incrementally. There are several tricks to improve the efficiency of this process. It is a good idea, for example, to process rare terms in a query first, in order to quickly identify the most promising document accumulators. Efficient text retrieval systems make use of caching (they store the results of popular queries and keep inverted index lists for popular terms in memory), pruning (they keep only the most promising accumulators), and parallel processing.

With all the variation in the implementation of text retrieval systems, it is important to be able to assess how good individual systems are. This is generally done with the Cranfield Evaluation Methodology. This methodology relies on a sample document collection, a sample set of queries, a set of relevance judgements, and measures to quantify how well the results of the system match the ideal ranked list. The two most popular measures are precision (what percentage of the retrieved results are relevant?), and recall (what percentage of the relevant documents are retrieved?). These are often combined in the F-measure, the harmonic mean of precision and recall.

These concepts of precision and recall need to be translated to ranked lists, however. One way to get an idea of the performance at different points in the list is to plot its precision-recall curve. This curve can also be expressed in one number, the average precision. This is the sum of the precision values at every point where a new document is retrieved, divided by the total number of relevant documents in the collection. The arithmetic and geometric means of this precision over a set of queries are called Mean Average Precision (MAP) and Geometric Mean Average Precision (gMAP).

An additional difficulty arises with multi-level relevance judgements. In this case, we can use the Normalized Discounted Cumulative Gain (nDCG@10 documents, for example). This measure sums the gains (the utility of a document from a user’s perspective) of all documents in the ranked list. It divides them by the log of their rank (hence discounted) and normalizes them by the ideal DCG at the same cut-off (hence normalized).

Additionally, there are some practical issues we must take into account when evaluating a text retrieval system. There must be a sufficient number of queries and documents in our test set, and they must be representative of the task. Documents are often selected by pooling the top K results from a diverse set of ranking methods. These documents are then judged by human assessors, and it is assumed that other documents are not relevant. This is acceptable for comparing systems that contributed to the pool, but it is problematic for evaluating new systems. Finally, our measures must capture the perceived usefulness of a system by its users, and we need to perform a statistical significance test such as the sign test or wilcoxon test to determine if the observed difference between two systems doesn’t simply result from the particular queries we chose for our test set.