Word embeddings are one of the main drivers behind the success of deep learning in Natural Language Processing. Even technical people outside of NLP have often heard of word2vec and its uncanny ability to model the semantic relationship between a noun and its gender or the names of countries and their capitals. But the success of word2vec extends far beyond the word level. Inspired by this list of word2vec-like models, I set out to explore embedding methods for a broad variety of linguistic units — from sentences to tweets and medical concepts.
Although the idea of representing words as continuous vectors has been around for a long time, none of the previous approaches has been as successful as word2vec. Popularized in a series of papers by Mikolov and colleagues, word2vec offers two ways of training word embeddings: in the continuous bag-of-words (CBOW) model, the context words are used to predict the current word; in the skip-gram model, the current word is used to predict its context words. Because semantically similar words occur in similar contexts, the resulting embeddings successfully capture semantic properties of their words. Most famously, high-quality embeddings can (sometimes) answer analogy questions, such as “man is to king as woman is to _”. Indeed, the semantic (and syntactic) information that is captured in pre-trained word2vec embeddings has helped many deep learning models generalize beyond their small data sets. It is not surprising, then, that this framework has been so influential, both at the word level and beyond. While competitors such as GloVe offer alternative ways of training word vectors, other models have tried to extend the word2vec approach to other linguistic units.
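To make the difference between the two training modes concrete, here is a minimal sketch of how training examples could be extracted from a tokenized sentence. The function names and the `window` parameter are illustrative assumptions, not part of the original word2vec code, and details of the real training procedure such as subsampling and negative sampling are left out.

```python
def context_of(tokens, i, window):
    """Return the words within `window` positions of index i, excluding tokens[i]."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def cbow_examples(tokens, window=2):
    """CBOW: the context words (input) are used to predict the current word (target)."""
    return [(context_of(tokens, i, window), tokens[i]) for i in range(len(tokens))]


def skipgram_examples(tokens, window=2):
    """Skip-gram: the current word (input) is used to predict each context word (target)."""
    return [
        (tokens[i], context_word)
        for i in range(len(tokens))
        for context_word in context_of(tokens, i, window)
    ]


sentence = "the cat sat on the mat".split()
cbow = cbow_examples(sentence, window=1)          # one example per token
skipgram = skipgram_examples(sentence, window=1)  # one example per (word, context word) pair
```

With a window of one word on each side, CBOW produces a single example per position, while skip-gram produces one example for every context word of every position.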
One problem with the original word2vec model is that it maps every word to a single embedding. If a word has several senses, these are all encoded in the same vector. To address this problem, Trask and colleagues developed sense2vec, an adaptation of word2vec that uses supervised labels to distinguish between senses. For example, in a corpus that has been labelled with parts of speech, bank/noun and bank/verb are treated as distinct tokens. Trask et al. show that downstream NLP tasks such as dependency parsing can benefit when word2vec operates on this “sense” level rather than the word level. Of course, depending on the type of information you choose to model, “sense2vec” may or may not be a fitting name for this approach. Senses are more than just combinations of words and their parts of speech, but in the absence of large sense-tagged corpora, POS-tagged tokens can be a valid approximation. Either way, it is clear that performance on certain NLP tasks can get a boost from working with units that are more relevant and finer-grained than simple words.
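The preprocessing step behind this idea can be sketched in a few lines. The example assumes the corpus has already been tagged as (word, tag) pairs; the `word|TAG` token format, the tag names, and the function `to_sense_tokens` are illustrative assumptions rather than the format used by Trask et al.

```python
def to_sense_tokens(tagged_sentence, sep="|"):
    """Turn (word, pos_tag) pairs into single 'word|TAG' tokens."""
    return [f"{word.lower()}{sep}{tag}" for word, tag in tagged_sentence]


# Hypothetical, hand-tagged sentences used only for illustration.
tagged_corpus = [
    [("I", "PRON"), ("bank", "VERB"), ("online", "ADV")],
    [("The", "DET"), ("bank", "NOUN"), ("closed", "VERB")],
]

sense_corpus = [to_sense_tokens(sentence) for sentence in tagged_corpus]

# The two uses of "bank" now occupy separate vocabulary entries,
# so a word2vec-style trainer would learn a separate vector for each.
vocabulary = sorted({token for sentence in sense_corpus for token in sentence})
```

The resulting token lists can then be passed to any word2vec-style trainer in place of the plain words.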
Some tasks require even more specific information about a word. Part-of-speech tagging, for instance, can benefit enormously from intra-word information that is encoded in smaller units such as morphemes. The adverbial suffix -ly in English is a good example. For this reason, many neural-network approaches operate at least partially on the character level. A good example is Dos Santos and Zadrozny’s model for part-of-speech tagging, which uses a convolutional network to extract character-level features. However, character embeddings are usually trained in a supervised way. This sets them apart from word2vec embeddings, whose training procedure is fully unsupervised.
Still, not all NLP tasks need such detailed information, and many of them focus on units larger than single words. Kiros et al. present a direct way of translating the word2vec model to the sentence level: instead of having a word predict its context, they have a sentence predict the sentences around it. Their so-called skip-thought vectors follow the encoder-decoder pattern that is so pervasive in Machine Translation nowadays. First a recurrent neural network (RNN) encoder maps the input sentence to a sentence vector, and then another RNN decodes this vector to produce the next (or previous) sentence. Through a series of eight tasks such as paraphrase detection and text classification, this skip-thought framework proves to be a robust way of modelling the semantic content of sentences.
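The training data for such a model can be pictured as follows. This sketch only builds the (sentence, previous sentence, next sentence) examples from an ordered list of sentences; the encoder and decoder networks themselves are omitted, and the function and key names are assumptions made for illustration.

```python
def skip_thought_examples(sentences):
    """Pair every sentence that has two neighbours with its previous and next sentence."""
    examples = []
    for i in range(1, len(sentences) - 1):
        examples.append({
            "encoder_input": sentences[i],        # sentence that is mapped to a vector
            "decode_previous": sentences[i - 1],  # target for the decoder of the previous sentence
            "decode_next": sentences[i + 1],      # target for the decoder of the next sentence
        })
    return examples


# Hypothetical three-sentence document.
document = [
    "I got back home.",
    "I could see the cat on the steps.",
    "This was strange.",
]
examples = skip_thought_examples(document)
```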
Unfortunately, sentences are not always as well-behaved as those modelled by Kiros et al. Social media content in particular is a hard beast to tame. Tweets, for example, do not combine into paragraphs and are riddled with slang, misspellings and special characters. That’s why tweet2vec, a model developed by Dhingra and colleagues, uses a character-based rather than a word-based network. First, a bidirectional Gated Recurrent Unit does a forward and backward pass over all the characters in the tweet. Then the two final states are combined into a tweet embedding, and this embedding is projected to a softmax output layer. Dhingra et al. train their network to predict hashtags. The idea is that tweets with the same hashtag should have more or less semantically similar content. Their results show that tweet2vec indeed outperforms a word-based model significantly on hashtag prediction, particularly when the tweets in question contain many rare words.
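The forward pass described above can be sketched with NumPy. All weights below are random and untrained, and the character set, the dimensions, the number of hashtags and every name are assumptions chosen for illustration. In particular, combining the two final states through two projection matrices is one plausible reading of the description, not a confirmed detail of the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def init_gru(input_dim, hidden_dim):
    """Random (untrained) parameters for the update gate z, reset gate r and candidate state h."""
    def mat(rows, cols):
        return rng.normal(scale=0.1, size=(rows, cols))
    return {name: (mat(hidden_dim, input_dim), mat(hidden_dim, hidden_dim), np.zeros(hidden_dim))
            for name in ("z", "r", "h")}


def gru_final_state(params, inputs):
    """Run a GRU over a sequence of input vectors and return the last hidden state."""
    (Wz, Uz, bz), (Wr, Ur, br), (Wh, Uh, bh) = params["z"], params["r"], params["h"]
    h = np.zeros(bz.shape[0])
    for x in inputs:
        z = sigmoid(Wz @ x + Uz @ h + bz)              # update gate
        r = sigmoid(Wr @ x + Ur @ h + br)              # reset gate
        h_candidate = np.tanh(Wh @ x + Uh @ (r * h) + bh)
        h = (1 - z) * h + z * h_candidate
    return h


# Assumed character inventory and sizes.
chars = sorted(set("abcdefghijklmnopqrstuvwxyz0123456789 #@!"))
char_index = {c: i for i, c in enumerate(chars)}
CHAR_DIM, HIDDEN_DIM, EMBED_DIM, N_HASHTAGS = 8, 16, 12, 5

char_vectors = rng.normal(scale=0.1, size=(len(chars), CHAR_DIM))
forward_gru = init_gru(CHAR_DIM, HIDDEN_DIM)
backward_gru = init_gru(CHAR_DIM, HIDDEN_DIM)
W_forward = rng.normal(scale=0.1, size=(EMBED_DIM, HIDDEN_DIM))
W_backward = rng.normal(scale=0.1, size=(EMBED_DIM, HIDDEN_DIM))
W_out = rng.normal(scale=0.1, size=(N_HASHTAGS, EMBED_DIM))


def embed_tweet(tweet):
    """Read the characters in both directions and combine the two final states."""
    xs = [char_vectors[char_index[c]] for c in tweet.lower() if c in char_index]
    h_forward = gru_final_state(forward_gru, xs)
    h_backward = gru_final_state(backward_gru, xs[::-1])
    return W_forward @ h_forward + W_backward @ h_backward


def hashtag_probabilities(tweet):
    """Project the tweet embedding to a softmax over the (assumed) hashtag inventory."""
    return softmax(W_out @ embed_tweet(tweet))


probs = hashtag_probabilities("gr8 game 2nite!!")
```

Because the network reads characters rather than words, a misspelled or unseen word still produces a usable input sequence, which is the property the paragraph above credits for the gains on rare words.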
Whether they are word-based or character-based, the sentence-modelling methods above require all input sentences to have the same length (in words or characters). This becomes impractical when you’re dealing with longer paragraphs or documents that vary considerably in length. For this type of data, there is doc2vec, an approach by Le and Mikolov that models variable-length text sequences such as sentences, paragraphs, or full documents. It builds embeddings for both documents and words, and concatenates these embeddings to predict the next word in a sliding-window context. For example, in the sliding window “the cat sat on”, the document vector would be concatenated with the word vectors for “the cat sat” to predict the next word, “on”. The document vector is unique to each document; the word vectors are shared between documents. Compared to its competitors, doc2vec has some unique advantages: it takes word order into account, generalizes to longer documents, and can learn from unlabelled data. When the resulting document vectors are used in downstream tasks such as sentiment analysis, they prove very competitive indeed.
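The sliding-window example can be written out as a single forward pass. This is a minimal NumPy sketch with random, untrained weights and a full softmax over a toy vocabulary; the two documents, the dimensions and all names are assumptions, and the training loop that would update the vectors is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy corpus.
documents = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
]
vocab = sorted({word for doc in documents for word in doc})
word_index = {word: i for i, word in enumerate(vocab)}

DIM, WINDOW = 4, 3  # embedding size and number of context words (assumed values)

doc_vectors = rng.normal(scale=0.1, size=(len(documents), DIM))  # one vector per document
word_vectors = rng.normal(scale=0.1, size=(len(vocab), DIM))     # shared between documents
W_out = rng.normal(scale=0.1, size=(len(vocab), DIM * (WINDOW + 1)))


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def next_word_probabilities(doc_id, context_words):
    """Concatenate the document vector with the context word vectors and score every word."""
    parts = [doc_vectors[doc_id]] + [word_vectors[word_index[w]] for w in context_words]
    features = np.concatenate(parts)
    return softmax(W_out @ features)


# Sliding window "the cat sat on": the context is "the cat sat", the target is "on".
probs = next_word_probabilities(0, ["the", "cat", "sat"])
loss = -np.log(probs[word_index["on"]])  # quantity that training would minimize
```

Concatenating rather than averaging the vectors keeps the position of each context word distinct, which is how word order enters the model.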
But why stop at paragraphs or documents? The next victim that has fallen prey to the word2vec framework is topic modelling. Traditional topic models such as Latent Dirichlet Allocation do not take advantage of distributed word representations, which could help them model semantic similarity between words. Moody’s lda2vec aims to cure this by embedding word, topic and document vectors into the same space. His method owes a lot to word2vec, but in lda2vec the vector for a context word is obtained by summing the word vector with the vector of the document in which the word occurs. In order to obtain LDA-like topic distributions, these document vectors are defined as a weighted sum of topic vectors, where the weights are sparse, non-negative and sum to one. Moody’s experiments show that lda2vec produces semantically coherent topics, but unfortunately his paper does not offer an explicit comparison with LDA topics.
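The way these vectors are composed can be illustrated with a small NumPy sketch. The values are random, and the dimensions and names are assumptions. A softmax is used here to make the topic weights non-negative and sum to one; softmax alone does not make them sparse, so sparsity would have to be encouraged by an additional term in the training objective, which is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

N_TOPICS, DIM = 3, 4  # assumed sizes

topic_vectors = rng.normal(size=(N_TOPICS, DIM))  # live in the same space as the word vectors
doc_topic_logits = rng.normal(size=N_TOPICS)      # unconstrained per-document parameters
word_vector = rng.normal(size=DIM)                # vector of one word in the document


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


# Topic proportions for the document: non-negative and summing to one.
topic_weights = softmax(doc_topic_logits)

# The document vector is a weighted sum of the topic vectors.
doc_vector = topic_weights @ topic_vectors

# The context vector adds the document vector to the word vector.
context_vector = word_vector + doc_vector
```

Reading off `topic_weights` for a document gives the LDA-like topic distribution mentioned above.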
Niu and Dai’s topic2vec is even more similar to word2vec. In the CBOW setting, the context words are used to predict both a word vector and a topic vector; in the skip-gram setting, these two vectors themselves predict the context words. Niu and Dai argue that their topics are more semantically coherent than those produced by LDA, but because they only give a few examples, their argument feels rather anecdotal. Moreover, topic2vec still depends on LDA, as the topic assignments for each word in the corpus are required to train the topic vectors. When it comes to word2vec-like topic embeddings, I have yet to be convinced.
After their conquest of classic NLP tasks, it’s no surprise embedding methods have also found their way to specialized disciplines where large volumes of text abound. One good example is med2vec, an adaptation of word2vec to the medical domain. Choi and colleagues present a neural network that learns embeddings for medical codes (diagnoses, medications and procedures) and patient visits. Their method differs from word2vec in two crucial respects: it is explicitly modified to produce interpretable dimensions, and it is sensitive to the order of the patient visits. While the quality of the code embeddings in the evaluation is mixed, the embedding dimensions are clearly correlated with specific medical conditions. The visit embeddings, in turn, prove quite effective at predicting the severity of the clinical risk and future medical codes. Med2vec is not the only example of embeddings in the medical domain: Deep Patient learns unsupervised representations of patients to predict their medical future on the basis of their electronic health records.
And the list doesn’t end there. In the domain of scientific literature, author2vec learns representations for authors by capturing both paper content and co-authorship, while citation2vec embeds papers by looking at their citations. And if we leave language for a moment, word2vec-like approaches have been applied to people in a social network, playlists on Spotify, video games and Major League Baseball players. Not all of these applications are equally serious, but they all attest to the success of the word2vec paradigm.
It’s clear word2vec embeddings are here to stay. Whether the framework is used to embed words or other linguistic units, the resulting vectors have played a huge role in the success of deep learning methods. At the same time, they still have clear limitations. Simple tokens may be relatively easy to embed, but word senses and topics are a different matter. How do we make the embedding dimensions more interpretable? And what can we gain from adding explicit linguistic information beyond word order? In these respects, embeddings are still in their infancy.