In NLP research, models of word meaning are making a comeback. While attention previously seemed to shift from words to phrases and sentences, word meaning is again at the forefront of research. Perhaps people have figured out that, before we can capture phrases and sentences correctly, we need better models of word meaning first. One particularly interesting contribution last year came from Stanford, with Jeffrey Pennington, Richard Socher and Christopher D. Manning presenting GloVe: Global Vectors for Word Representation.
In recent years, two main types of models of word meaning have emerged: matrix factorization methods such as Latent Semantic Analysis, which work on the basis of global co-occurrence counts of words in a corpus, and local context window methods such as Mikolov et al.’s skip-gram model, which are trained on local co-occurrences. Pennington et al. position themselves in the first group, as GloVe, short for Global Vectors, relies on global corpus statistics rather than local context.
One thing I like about Pennington et al.’s approach is their commitment to finding word vectors with semantically meaningful dimensions. They do this by building a model that is biased towards dimensions of meaning that potentially distinguish between two words, such as gender. They start by defining a learning function that gives a value for a combination of two target words and a context word. The ideal outcome of such a function, they claim, is the ratio between the co-occurrence probabilities of those target words with that context word. When a context word is highly correlated with one of the target words, this ratio will be either very high or very low; when it does not distinguish between the words, this ratio will lie around 1. That sounds logical enough.
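To make this concrete, here is a toy illustration of how such probability ratios behave. The words and counts below are made up for illustration (the paper itself contrasts real corpus statistics for target words like "ice" and "steam"):

```python
import numpy as np

# Toy co-occurrence counts (rows: target words, columns: context words).
# All numbers here are invented for illustration.
targets = ["ice", "steam"]
contexts = ["solid", "gas", "water", "fashion"]
counts = np.array([
    [19.0, 0.7, 30.0, 0.20],   # co-occurrences of "ice" with each context word
    [2.2,  7.8, 22.0, 0.13],   # co-occurrences of "steam"
])

# P(context | target): normalize each row by the target word's total count.
probs = counts / counts.sum(axis=1, keepdims=True)

# Ratio P(k | ice) / P(k | steam) for each context word k.
ratios = probs[0] / probs[1]
for word, r in zip(contexts, ratios):
    print(f"{word}: {r:.2f}")
```

With these counts, "solid" (typical of ice) gets a ratio well above 1, "gas" (typical of steam) a ratio well below 1, and non-discriminative words like "water" and "fashion" land close to 1.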
In addition, Pennington et al. continue, this function F should have some other desirable characteristics:
- Because vector spaces are inherently linear, F should depend only on the difference between the vectors of the two target words.
- F should not mix the dimensions of the word vectors in undesirable ways.
- Because the distinction between a target word and a context word is irrelevant, F should be invariant when these words are swapped.
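Roughly, the paper shows that these constraints pin F down. A sketch of the derivation, in the paper's notation (X_ik is the co-occurrence count of words i and k, X_i the total count for word i, and P_ik = X_ik / X_i):

```latex
% Start from the desired outcome, applied to the vector difference:
F\big((w_i - w_j)^\top \tilde{w}_k\big) = \frac{P_{ik}}{P_{jk}}
% F must map sums to products (a homomorphism from (R,+) to (R_{>0},\times)),
% which forces F = exp, hence:
w_i^\top \tilde{w}_k = \log P_{ik} = \log X_{ik} - \log X_i
% Absorbing \log X_i into bias terms restores target/context symmetry:
w_i^\top \tilde{w}_k + b_i + \tilde{b}_k = \log X_{ik}
```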
After identifying a suitable candidate, Pennington et al. use it to factorize an initial global co-occurrence matrix into a matrix with fewer dimensions, which are more semantically meaningful. In particular, they cast the factorization of the co-occurrence matrix as a least-squares problem that involves the logarithm of the original co-occurrence values and a weighting function that manages the impact of rare and frequent co-occurrences.
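In code, that objective can be sketched as follows. This is a minimal reading of the paper's least-squares formulation, with the weighting function f(x) = (x/x_max)^α and the reported settings x_max = 100 and α = 0.75; it is an illustration, not the reference implementation:

```python
import numpy as np

def weight(x, x_max=100.0, alpha=0.75):
    """Down-weight rare co-occurrences and cap the influence of very
    frequent ones. x_max and alpha follow the values reported in the paper."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(W, W_ctx, b, b_ctx, X):
    """Weighted least-squares objective over all nonzero co-occurrence cells.

    W, W_ctx : target and context word vectors (V x d)
    b, b_ctx : target and context biases (V,)
    X        : co-occurrence matrix (V x V)
    """
    i, j = np.nonzero(X)
    pred = (W[i] * W_ctx[j]).sum(axis=1) + b[i] + b_ctx[j]
    err = pred - np.log(X[i, j])
    return (weight(X[i, j]) * err ** 2).sum()

# Tiny demo on a made-up 2-word co-occurrence matrix.
X = np.array([[0.0, 2.0],
              [3.0, 0.0]])
rng = np.random.default_rng(0)
W, W_ctx = rng.normal(size=(2, 2)), rng.normal(size=(2, 2))
b, b_ctx = np.zeros(2), np.zeros(2)
print(glove_loss(W, W_ctx, b, b_ctx, X))
```

In the full model, W, W_ctx and the biases are then trained by gradient descent to minimize this loss.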
Through a number of experiments, they then show that the resulting word vectors perform very well across a variety of tasks, including a word analogy task that requires them to answer questions like “Athens is to Greece as Berlin is to ___”. As with all models of this kind, performance depends heavily on a number of parameter settings: the context window for the co-occurrences has an optimal size of around 10 words, and the word vectors ideally have around 300 dimensions.
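Analogy questions of this kind are typically answered with simple vector arithmetic: take the offset b − a + c and return the vocabulary word whose vector is closest by cosine similarity. A toy sketch with hand-crafted three-dimensional vectors (real GloVe vectors have around 300 dimensions):

```python
import numpy as np

# Hand-crafted toy vectors, chosen so the analogy works out;
# real word vectors are learned from corpus statistics.
vecs = {
    "athens":  np.array([1.0, 0.2, 0.1]),
    "greece":  np.array([1.0, 0.9, 0.1]),
    "berlin":  np.array([0.1, 0.2, 1.0]),
    "germany": np.array([0.1, 0.9, 1.0]),
    "paris":   np.array([0.5, 0.2, 0.5]),
}

def analogy(a, b, c, vecs):
    """Answer 'a is to b as c is to ?' via the offset b - a + c,
    ranking the remaining vocabulary by cosine similarity."""
    target = vecs[b] - vecs[a] + vecs[c]
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    candidates = [w for w in vecs if w not in (a, b, c)]
    return max(candidates, key=lambda w: cos(vecs[w], target))

print(analogy("athens", "greece", "berlin", vecs))  # → germany
```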
In addition to the paper, the accompanying website shows that the final word matrix captures a number of semantic distinctions very nicely. For example, word pairs like man-woman, brother-sister and king-queen differ along the same dimensions. There’s only a small number of examples, but it’s really fascinating stuff.
Unsurprisingly, GloVe has generated considerable interest. The code is freely available from the Stanford website, as are several sets of pre-trained word vectors. Someone wrote a helpful tutorial for coding GloVe in Python, and there’s another toy implementation on GitHub. While most people are impressed by GloVe’s performance, some reactions have been more lukewarm, arguing that the GloVe paper makes an unfair comparison with competing methods.
Lately there has been something of an arms race for the best word vectors, with several types of models competing. While this is surely a good thing, we shouldn’t forget that word vectors can only give an abstract approximation of word meaning, no matter how good they are. The meaning of a word is not just some average of all its uses in a corpus: for each occurrence, it crucially depends on the local context in which the word is used. Amid all the attention that word meaning is now getting, modelling the local meaning of a single occurrence of a word is a challenge I would love to see addressed more often.