This month I’m taking the online course Text Retrieval & Search Engines on Coursera, and I’m using this blog to write down the main takeaways every week. In this third week of the course, we again focused on two main topics: probabilistic models for information retrieval and methods for user feedback. In this blog post I share my notes on probabilistic models.

Probabilistic models for IR rank documents according to the probability that they are relevant to a query. There are several possible implementations, but in this course we focused on the so-called Query Likelihood Model. This approach makes the assumption that we can model $p(R=1|q,d)$, the probability that a document is relevant for a given query, as $p(q|d,R=1)$, the probability that the query is observed if a given document is relevant. In other words, if a user is interested in document $d$, how likely are they to enter query $q$? To answer this question, the Query Likelihood Model makes use of language models.

Language models are basically probability distributions over word sequences: they give us the probability of observing a certain word or word sequence in a text. Information Retrieval generally relies on *unigram* language models, which only have probabilities for single words. They assume that texts are created by drawing words at random, independently of one another. The presence of one word therefore does not tell us anything about the probability of observing another word. This assumption is clearly false, but it allows us to simplify our approach drastically. For example, we can now compute the probability of a text as the product of the probabilities of its words.

Now, to rank all documents for a query, we use each document in the collection to estimate a language model, and compute the probability of the query on the basis of the resulting word probabilities. The simplest way of building such a language model is with Maximum Likelihood Estimation, where we set the parameters in such a way that our observed data (the document) receives the highest possible probability. In practice, this means that we will use the relative frequencies of the words observed in the document as their probabilities: $$p(w|d) = \frac{c(w,d)}{|d|}$$

This Maximum Likelihood Estimation has one downside, however: because it assigns zero probability to words that were not observed in the document, any query with such an unseen word will receive a probability of zero. To solve this method, we need to *smooth* our language models: we take away some probability mass from the words that were observed in the document, and assign it to the unseen words. One of the most popular smoothing methods is Jelinek-Mercer Smoothing, which applies linear interpolation between the relative frequencies of the words in the document and their probability according to a language model on the basis of the full text collection: $$p(w|d)=(1-\lambda)\frac{c(w,d)}{|d|} + \lambda p(w|C)$$ Another effective method is Dirichlet Prior Smoothing. Here we add pseudo word counts to the observed text, based on the collection language model. As a result, the smoothing coefficient is dynamic, depending on the length of the document, and longer documents will receive less smoothing: $$p(w|d) = \frac{c(w,d) + \mu p(w|C)}{|d| + \mu} = \frac{|d|}{|d|+\mu}\frac{c(w,d)}{|d|} + \frac{\mu}{|d|+\mu} p(w|C)$$

One interesting result of these smoothing methods is that they lead to a ranking function that is very similar to the ranking methods we obtained with the vector space model. All the familiar heuristics, such as term frequency, inverse document frequency and document length normalization, are there, in one form or another. However, this time we obtained them in a very natural way: our new ranking functions followed quite automatically from thinking about information retrieval as a probabilistic problem.