Last week, Text Retrieval & Search Engines by the University of Illinois kicked off on Coursera. I’ve signed up to the online course, and on this blog I intend to write down the main takeaways every week.

The first week was very much an introduction to the field of text retrieval, with a high-level overview of the general concepts and an exploration of some effective ranking functions that are used in search engines today. I was happy to see the first lecture started with a gentle introduction to NLP. After all, text must be processed in some way to prepare it for retrieval. We learnt that only shallow language processing, such as part-of-speech tagging, is robust and efficient enough to be applied to large-scale text retrieval. Luckily, most search tasks can be answered effectively with bag-of-words techniques, and only more complex applications, such as Google’s Knowledge Graph, require deeper NLP.
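
As a concrete illustration of what such shallow processing looks like in practice, here is a minimal sketch using NLTK. This is my own example, not code from the course, and it assumes the relevant NLTK resources have been downloaded.

```python
# A minimal sketch of shallow NLP: tokenisation and part-of-speech
# tagging with NLTK (my choice of library, not the course's).
import nltk

# One-time downloads; resource names may differ in very recent NLTK versions.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "Search engines rank documents by relevance."
tokens = nltk.word_tokenize(sentence)  # split the text into tokens
tagged = nltk.pos_tag(tokens)          # assign a part-of-speech tag to each

print(tagged)
# e.g. [('Search', 'NN'), ('engines', 'NNS'), ('rank', 'VBP'), ...]
```

No parsing, no semantics: this level of analysis is cheap enough to run over web-scale document collections.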

Next, we explored some fundamental approaches to text retrieval. In the push mode, the system takes the initiative and presents the user with relevant documents. This requires the user to have a stable information need, and it is the approach taken by recommender systems. In the pull mode, users take the initiative and search for documents that satisfy their information need. This is the approach taken by classic search engines. Here we can further distinguish between two types of behaviour: querying and browsing.

There are two general strategies to solve the text retrieval problem. The first is document selection, a binary classification problem where the system decides whether a document is relevant or not. The second is document ranking, where the system decides whether one document is more likely to be relevant than another. In general, the latter option is preferred, because the classifier is probably inaccurate and relevance is not a binary concept. However, the so-called probability ranking principle reminds us that returning a ranked list of documents is only the optimal strategy under two assumptions: that the relevance of any document is independent of the relevance of any other, and that users browse the results sequentially.
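
To make the contrast concrete, here is a toy sketch of my own (the scores are made up): selection applies a hard threshold to a relevance score, while ranking simply sorts by it.

```python
# Toy illustration with invented relevance scores.
scores = {"doc1": 0.9, "doc2": 0.4, "doc3": 0.7}

# Document selection: a binary decision per document.
selected = [doc for doc, score in scores.items() if score >= 0.5]

# Document ranking: order every document by its estimated relevance.
ranked = sorted(scores, key=scores.get, reverse=True)

print(selected)  # ['doc1', 'doc3']
print(ranked)    # ['doc1', 'doc3', 'doc2']
```

The threshold forces an error-prone yes/no decision, whereas the ranked list lets users decide for themselves how far down to read.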

Next, we spent a good deal of time designing a good ranking function. Although different types of retrieval models exist (similarity-based models, probabilistic models, axiomatic models, etc.), these tend to result in similar ranking functions with similar variables. Indeed, most state-of-the-art retrieval models have the same backbone, the bag-of-words approach, and work on the basis of term frequency, document length and document frequency. Some good examples of such models are BM25, pivoted length normalisation, query likelihood and PL2.
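
To give a flavour of what such a ranking function looks like, here is a minimal, self-contained sketch of the classic BM25 formula. It is my own simplified implementation, not code from the course, but it uses exactly the three variables mentioned above: term frequency, document length and document frequency.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one document for a query with the classic BM25 formula.

    corpus is a list of tokenised documents, used to compute document
    frequencies and the average document length.
    """
    n_docs = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / n_docs
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(1 for doc in corpus if term in doc)  # document frequency
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        length_norm = 1 - b + b * len(doc_terms) / avgdl
        score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
    return score

corpus = [["text", "retrieval", "basics"],
          ["search", "engine", "ranking"],
          ["text", "search", "and", "ranking"]]
print(bm25_score(["text", "search"], corpus[2], corpus))
```

The parameters k1 and b control the saturation of term frequency and the strength of the length penalty; the values above are common defaults, not anything prescribed by the course.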

The vector space model is actually a very simple concept: it represents documents and queries as term vectors, and assumes that the relevance of a document to a query is proportional to the similarity between the two. The main differences between individual models lie in the precise definition of the dimensions in the vector space, the choice of similarity measure, and so on. In the simplest instantiation, the dimensions in the vector space simply correspond to the words in the vocabulary, queries and documents are represented by bit vectors, and the similarity between them is computed as the dot product. If implemented in this way, the relevance of a document is simply the number of query words that appear in the document.
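
Here is what that simplest instantiation looks like in code, as a toy sketch with an invented five-word vocabulary:

```python
# Toy bit-vector vector space model: the dot product of two bit
# vectors counts the query words that appear in the document.
vocabulary = ["text", "retrieval", "search", "engine", "ranking"]

def bit_vector(terms):
    return [1 if word in terms else 0 for word in vocabulary]

query = bit_vector({"text", "search"})
document = bit_vector({"search", "engine", "ranking"})

relevance = sum(q * d for q, d in zip(query, document))
print(relevance)  # 1: of the query words, only 'search' is in the document
```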

This simple approach has some problems, however. For example, because we want to give a higher weight to terms that appear more often in a document, it is better to use a term frequency vector than a simple bit vector. This approach is best combined with a sublinear transformation of the term frequency, because higher term frequencies in a document give diminishing returns. Next, in order to discriminate between informative and non-informative words, we should multiply the frequency of a word with its inverse document frequency, which downweights terms that appear in many documents. Finally, we should penalise long documents, because they have a better chance of matching any query. The videos discussed the most popular methods for each of these corrections.
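
Putting the three corrections together gives a weighting along these lines. The sketch below is my own formulation (the course discusses several variants): a sublinear, logarithmic transformation of term frequency, multiplied by inverse document frequency, divided by a pivoted length normaliser.

```python
import math

def term_weight(tf, df, n_docs, doc_len, avg_doc_len, b=0.75):
    """Sublinear TF x IDF with pivoted length normalisation.

    The exact formulas are illustrative; many variants exist.
    """
    if tf == 0:
        return 0.0
    sublinear_tf = 1 + math.log(tf)            # diminishing returns on TF
    idf = math.log((n_docs + 1) / df)          # downweight widespread terms
    pivot = 1 - b + b * doc_len / avg_doc_len  # penalise long documents
    return sublinear_tf * idf / pivot

# A term occurring 3 times in a somewhat long document, and
# appearing in 10 of 1000 documents in the collection:
print(term_weight(tf=3, df=10, n_docs=1000, doc_len=120, avg_doc_len=100))
```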

By refining the ranking function step by step, this week’s lectures built up nicely to the IR models behind state-of-the-art search engines today. The pace was relatively slow, as tends to be the case with most Coursera courses, but in the end some fairly advanced ranking functions, such as BM25F and BM25+, were discussed. This suggests the course should be interesting for beginners as well as more advanced students. I’m looking forward to seeing where next week’s lectures will take us.