Yves Peirsman bio photo

Yves Peirsman

Natural Language Processing Enthusiast at NLP Town.

Twitter LinkedIn

N-grams have long been part of the arsenal of every NLPer. These fixed-length word sequences are not only ubiquitous as features in NLP tasks such as text classification, but also formed the basis of the language models underlying machine translation and speech recognition. However, with the advent of recurrent neural networks and the diminishing role of feature selection in deep learning, their omnipresence could quickly become a thing of the past. Are we witnessing the end of n-grams in Natural Language Processing?

N-grams are sequences of n linguistic items, usually words. Depending on the length of these sequences, we call them unigrams, bigrams, trigrams, four-grams, five-grams, etc. N-grams have often been essential in developing high-quality NLP applications. Traditional models for sentiment analysis, for example, benefit immensely from n-gram features. To classify a sentence such as I did not like this movie as negative, it’s not sufficient to know that it contains the words not and like — after all, so does the positive sentence What’s not to like about this movie? A sentiment model will have a much better chance of guessing the positive or negative nature of a sentence correctly when it relies on the bigrams or trigrams, such as did not like, in addition to the individual words.

This benefit of n-grams is apparent in many other examples of text classification as well. To illustrate this, let’s do a fun experiment. Let’s take a look at the television series A Game of Thrones, and investigate whether we can use the dialogues in the individual episodes to determine which book from George R. R. Martin’s book series A Song of Ice and Fire they are based on. According to Wikipedia, the episodes in series 1 are all based on the first novel, A Game of Thrones, series 2 covers the events in A Clash of Kings, series 3 and 4 are adapted from A Storm of Swords, and series 5 and 6 mix events from A Feast for Crows and A Dance of Dragons. It turns out we can induce this mapping pretty well on the basis of the dialogues in the episodes.

For each episode and book, I collected the n-grams in a feature vector. I used Tf-Idf to weight the n-gram frequencies and applied L2-normalization. For each episode, I then picked the book whose feature vector had the highest cosine similarity with the episode’s vector. If we look at just the unigrams, 46 of the 60 episodes (76.67%) are assigned to the correct book, according to the Wikipedia mapping. If we also take bigrams into account, this figure climbs to 50 (83.33%). Adding trigrams takes us to 52 correct episodes (86.67%). This indicates that the n-grams capture the similarities between the books and the episodes very well: random guessing would have only given us a success ratio of 22%. The diagram above, which plots the mappings of the model with unigrams, bigrams and trigrams, also suggests another interesting pattern: the later the television series, the more difficult it becomes to connect its episodes to the correct book. While this is obviously just a toy example, this trend is in line with observations that with each Game of Thrones series, the screen adaptation has become less faithful to the books.

Apart from text classification, n-grams are most often used in the context of language models. Language models are models that can assign a probability to a sequence of words, such as a sentence. This ability underpins many successful NLP applications, such as speech recognition, machine translation and text generation. In speech recognition, say, it is useful to know that the sentence there is a hair in my soup is much more probable than there is a air in my soup. Traditionally, language models estimated these probabilities on the basis of the frequencies of the n-grams they had observed in a large corpus. A bigram model would look up the frequencies of the relevant bigrams, such as a hair and a air, in its training corpus, while a trigram model would rely on the frequencies of the three-word sequences, such as is a hair. As the figures from the Google Books N-gram corpus below illustrate, these frequencies should normally point towards a large preference for there is a hair in my soup. Even when the speaker does not pronounce the first letter in hair, the language model will help ensure that the speech recognition system gets the sentence right.

The predictions of language models are usually expressed in terms of perplexity. The perplexity of a particular model on a sentence is the inverse of the probability it assigns to that sentence, normalized for sentence length. Intuitively, the perplexity expresses how confused the model is by the words in the sentence: a perplexity of 50 means that the model behaves as if it were randomly picking a word from 50 candidates at each step. The lower the perplexity, the better the language model is at predicting the text.

Let’s return for a minute to the increasing difficulty of linking the Game of Thrones episodes in the later series to the books. If these newer episodes are indeed less faithful to their source material than earlier ones, this should be reflected in a rising perplexity of a language model trained on the books. To verify this, I trained a bunch of language models on all books in the Song of Ice and Fire series, using the KenLM toolkit. KenLM implements n-gram language models that are equipped with some advanced smoothing techniques to deal with word sequences they have never seen in the training corpus. The results in the figure below confirm our expectations: the median perplexity of the four-gram language model on the sentences in the episodes climbs with every series. While on average, the median sentence perplexity in the first series is 38, it increases to 48 in the second and third series, 52 in the fourth series, 56 in the fifth and 57 in the sixth. This pattern is repeated by the other language models I trained (from bigram to 5-gram models). Clearly, n-gram language models have a harder time predicting the dialogues in the later episodes from the language in the books.

As with so many traditional NLP approaches, in recent years n-gram based language models have been outshone by their neural counterparts. Because neural networks rely on distributed representations that capture the meaning of words (the so-called word embeddings), they are much better at dealing with unseen word sequences. For example, an n-gram model may not be able to predict that pitcher of beer is more likely than pitcher of cats if it has never seen these trigrams in the training corpus. This is true even when the training corpus contains many examples of glass of beer or bottle of beer, but no examples of glass of cats or bottle of cats. A neural language model, by contrast, will typically have learnt that pitcher, glass and bottle have similar meanings, and it will use this information to make the right prediction. These dense word embeddings also have a convenient side effect: neural language models take up much less memory than n-gram models.

The first neural language models were still heavily indebted to their predecessors, as they applied neural networks to n-grams. Bengio et al.’s (2003) seminal approach learnt word embeddings for all words in the n-gram, along with the probability of the word sequence. Later models, such as those by Mikolov et al. (2010) and Sundermeyer et al. (2013) did away with the n-grams altogether and introduced recurrent neural networks (RNNs) to the task of language modelling. These RNNs, and Long Short-Term Memories in particular, not only learn word embeddings, but are also able to model dependencies that exceed the boundaries of traditional n-grams. Even more recently, other neural architectures have been applied to language modelling, such as convolutional networks and RNNs with attention. Teams at Google, Facebook and other places seem to have got caught up in a race for ever-lower perplexities.

RNN-based language models learn word embeddings to make predictions. Adapted from the Stanford CS224n course slides.

There are quite a few toolkits for training and testing neural language models. NPLM implements Bengio’s original feed-forward neural language models, RWTHLM adds the later RNN and LSTM architectures, while TheanoLM lets you customize the networks more easily. It’s also fairly straightforward to build your own language model in more generic deep learning libraries such as Tensorflow, which is what I did for this blog post. There’s a simple example of an LSTM language model in the Github Tensorflow repo.

I repeated the experiment above with an LSTM-based language model that I trained for ten epochs, using a dimensionality of 256 for the word embeddings and the LSTM cell. Ideally, we’d like to have more data than just the sentences in George R. R. Martin’s books to train such a model, but the results are nevertheless interesting. First of all, the median sentence perplexities of each episode confirm our earlier findings: with each series, the dialogues in Game of Thrones get more difficult to predict. Second, the perplexities of our neural language model are generally lower than the ones of the four-gram language model. This is true in particular for the later series: while the median sentence perplexities of both models average around 38 in the first series, the last series shows more diverging results: the four-gram language model has a median sentence perplexity of 58 on average, while the LSTM model has only 49. This difference showcases the strength of neural language models — a result of the semantic knowledge they store in their word embeddings and their ability to look beyond n-gram boundaries.

N-grams will not disappear from NLP any time soon. For simple applications, it’s often more convenient to train an n-gram model rather than a neural network. Doing so will also save you a lot of time: training an n-gram model on the data above is a matter of seconds, while an LSTM can easily need a few hours. Still, when you have the data and the time, neural network models offer some compelling advantages over their n-gram competitors, such as their ability to model semantic similarity and long-distance dependencies. These days both types of model should be in your toolkit when you’re doing text classification, need to predict the most likely word in a sentence, or are just having a bit of fun with books and television series.