
Yves Peirsman

Natural Language Processing Enthusiast at NLP Town.


This month I’m taking the online course Text Retrieval & Search Engines on Coursera, and I’m using this blog to write down the main takeaways every week. In this third week of the course, we again focused on two main topics: probabilistic models for information retrieval and methods for user feedback. In this blog post I share my notes on user feedback.

There are several ways in which a search engine can get feedback from its users. It can simply ask them to make explicit relevance judgements about its initial result list, for example. The system can subsequently update the query to give a better ranking, by adding new terms (so-called query expansion) or by adjusting the weights of existing terms. However, because users are generally reluctant to make this effort, developers of IR systems have come up with more creative ways of improving the results. In pseudo-relevance feedback, they assume the top-ranked documents are relevant and improve the query on the basis of these documents. For example, the system can find new words that are related to the query words by comparing the word frequencies in these documents to those in the full document collection. In implicit feedback, the system tracks user behaviour: if a user clicks on a document, it assumes this document is relevant. It can then use the snippet that was shown to the user to improve the query.
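To make the pseudo-relevance feedback idea concrete, here is a minimal Python sketch: it compares word frequencies in the top-ranked documents with those in the whole collection and proposes the words that stand out as expansion terms. The tokenised input, the frequency-ratio score and the add-one smoothing are my own simplifications, not the exact method from the course.

```python
from collections import Counter

def expansion_terms(top_docs, collection, k=5):
    """Propose expansion terms from the top-ranked documents.

    Words that are much more frequent in the top documents than in the
    collection as a whole are assumed to be related to the query.
    top_docs and collection are lists of token lists.
    """
    top_counts = Counter(w for doc in top_docs for w in doc)
    coll_counts = Counter(w for doc in collection for w in doc)
    top_total = sum(top_counts.values())
    coll_total = sum(coll_counts.values())

    def ratio(word):
        p_top = top_counts[word] / top_total
        # Add-one smoothing avoids division by zero for rare words.
        p_coll = (coll_counts[word] + 1) / (coll_total + len(coll_counts))
        return p_top / p_coll

    return sorted(top_counts, key=ratio, reverse=True)[:k]
```

In practice you would also filter out stop words and the original query terms before adding anything to the query.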

The most effective method for feedback in the vector space model is called Rocchio Feedback. Basically, the search engine adapts the query by moving its vector $\overrightarrow{q}$ closer to the centroid of the relevant documents $D_r$, and away from the centroid of the non-relevant documents $D_n$. Three parameters $\alpha$, $\beta$ and $\gamma$ control the amount of movement: $$ \overrightarrow{q_m} = \alpha \overrightarrow{q} + \frac{\beta}{|D_r|}\sum_{\overrightarrow{d_j} \in D_r} \overrightarrow{d_j} - \frac{\gamma}{|D_n|}\sum_{\overrightarrow{d_j} \in D_n}\overrightarrow{d_j}$$ Some tricks of the trade help make the Rocchio method more effective. For example, it is customary to truncate the updated query vector and keep only the highest-weighted terms, since after expansion far too many words have a non-zero weight. Also, because the non-relevant documents are typically not clustered together, their centroid tends to be fairly uninformative and is often ignored. Finally, you can prevent overfitting by setting $\alpha$ fairly high, so that the original query keeps dominating. Particularly in the case of pseudo-relevance feedback, $\beta$ should be kept relatively small, in order to avoid adjusting the query vector too much to documents whose relevance is not certain.
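As a quick illustration, this is what the Rocchio update could look like in numpy, assuming the query and the documents are already represented as term-weight vectors over a shared vocabulary. The parameter values and the truncation heuristic are illustrative, not prescribed by the course.

```python
import numpy as np

def rocchio_update(query, relevant, non_relevant,
                   alpha=1.0, beta=0.75, gamma=0.15, keep=50):
    """Move the query vector towards the centroid of the relevant documents
    and away from the centroid of the non-relevant ones.

    query is a 1-D array of term weights; relevant and non_relevant are
    2-D arrays with one document vector per row.
    """
    q_new = alpha * query
    if len(relevant):
        q_new = q_new + beta * relevant.mean(axis=0)
    if len(non_relevant):
        q_new = q_new - gamma * non_relevant.mean(axis=0)

    # Truncate the expanded query: clip negative weights and keep only
    # the `keep` highest-weighted terms.
    q_new = np.clip(q_new, 0, None)
    if keep < np.count_nonzero(q_new):
        cutoff = np.sort(q_new)[-keep]
        q_new[q_new < cutoff] = 0.0
    return q_new
```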

Things are slightly more complex in the Query Likelihood Model. Because this model treats the query as a sample from a language model, adding terms or adjusting weights is rather awkward (Zhai 2008). Therefore it is often replaced by the more general Kullback-Leibler Divergence retrieval model. This approach computes the score $s$ of a document $D$ with respect to a query $Q$ as the negative Kullback-Leibler divergence between the query language model $\theta_Q$ and the document language model $\theta_D$: $$\begin{align*}s(D,Q)&=-\sum_{w \in V}p(w|\theta_Q) \log \frac{p(w|\theta_Q)}{p(w|\theta_D)} \\ &= \sum_{w \in V}p(w|\theta_Q) \log p(w|\theta_D) - \sum_{w \in V}p(w|\theta_Q) \log p(w|\theta_Q)\end{align*}$$ Because the last term in this equation depends on the query only, documents can be ranked on the basis of just the first term, the so-called negative cross-entropy. It is now easy to see that the Query Likelihood Model is a special case of this approach: the relative frequency of a word in the query can be interpreted as the maximum likelihood estimate of its probability in the query language model, $p(w|\theta_Q)$, and the resulting score of a document is then proportional to the sum of the log probabilities of the query words in the document language model, $p(w|\theta_D)$, i.e. to the log of the query likelihood.
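Here is a small sketch of the resulting ranking function, assuming the query and document language models are plain dictionaries from words to probabilities, and that the document model has already been smoothed so that every query word has non-zero probability.

```python
import math

def kl_score(query_lm, doc_lm):
    """Rank score of a document: the negative cross-entropy,
    i.e. the sum over w of p(w|theta_Q) * log p(w|theta_D)."""
    return sum(p_q * math.log(doc_lm[w]) for w, p_q in query_lm.items())
```

With a maximum likelihood query model, where each query word simply gets its relative frequency in the query, this score is proportional to the log of the query likelihood, which is exactly the correspondence described above.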

The advantage of this KL Divergence model is that it becomes much easier to take user feedback into account. We can do this by interpolating the original query language model with a third model: the feedback language model. Of course, the question is how we estimate this feedback language model. The answer takes the form of a generative mixture model, where we assume the relevant documents are generated from a mixture of a topic language model and a background language model. The background model is the maximum likelihood estimate over all documents in the collection, while the topic model should maximise the probability of seeing the relevant documents, which is typically done with the EM algorithm. A parameter $\lambda$ controls the weight that is given to either of these models.
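Below is a sketch of both steps under my own simplifying assumptions: the feedback language model is estimated from the feedback (i.e. assumed relevant) documents with a short EM loop, and then interpolated with the original query model. The function names, the interpolation weight `mix` and the convention that `lam` is the weight of the background model are illustrative choices, not the course's notation.

```python
from collections import Counter

def estimate_feedback_lm(feedback_docs, background_lm, lam=0.5, iterations=20):
    """Estimate the topic (feedback) language model with EM.

    Each word occurrence in the feedback documents is assumed to be generated
    either by the topic model (probability 1 - lam) or by the background
    model (probability lam). feedback_docs is a list of token lists,
    background_lm a dict of word probabilities for the whole collection.
    """
    counts = Counter(w for doc in feedback_docs for w in doc)
    total = sum(counts.values())
    topic_lm = {w: c / total for w, c in counts.items()}  # initial guess

    for _ in range(iterations):
        # E-step: probability that each word was generated by the topic model.
        z = {w: (1 - lam) * topic_lm[w]
                / ((1 - lam) * topic_lm[w] + lam * background_lm.get(w, 1e-12))
             for w in counts}
        # M-step: re-estimate the topic model from the expected counts.
        norm = sum(counts[w] * z[w] for w in counts)
        topic_lm = {w: counts[w] * z[w] / norm for w in counts}
    return topic_lm

def interpolate(query_lm, feedback_lm, mix=0.5):
    """Linearly interpolate the original query model with the feedback model."""
    words = set(query_lm) | set(feedback_lm)
    return {w: (1 - mix) * query_lm.get(w, 0.0) + mix * feedback_lm.get(w, 0.0)
            for w in words}
```

Words that the background model already explains well are discounted during the E-step, so the topic model ends up concentrating on the words that make the feedback documents distinctive.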

In contrast to the other parts of a typical IR system, there is no real standard for the feedback component, and many competing approaches exist. Some of these are discussed in ChengXiang Zhai’s 2008 paper Statistical Language Models for Information Retrieval, specifically in the context of language models. Don’t miss it if you’re interested in a much more detailed overview of the challenges of user feedback and the solutions that have been proposed.