In this series of blog posts, I present some of the most influential papers from the recent history of Natural Language Processing. First up is Dekang Lin's Automatic Retrieval and Clustering of Similar Words, published in 1998. Lin's paper, which has been cited more than 1500 times according to Google, explains how computers can automatically find words with a similar meaning. Search engines like Google can exploit this similarity, for example, to find more relevant web pages for a given search term.
Last month I learnt a new word: echidna. Unless you know Greek, the word itself gives you no clue to its meaning. For all you know, its meaning might be similar to influenza, geyser, hedgehog or oak. Now suppose I gave you these four sentences, which describe my first encounter with an echidna:
I looked up at the sound, and suddenly I saw an echidna next to the trail.
The echidna was eating ants with its long nose.
I tried to take a picture, but the echidna hid in the bushes.
Eventually the echidna crawled out and looked me in the eye.
Although none of these sentences defines echidna, together they give you a vague idea of what the word means: it must be some kind of small, ant-eating animal with a long nose. Confronted with the list of words above, you can easily pick the most similar one: it must be hedgehog. Now imagine having hundreds of sentences like these. Your understanding of echidna would become very precise indeed.
Just like you did, computers guess the similarity in meaning between two words on the basis of their use in example sentences. One of their strategies is described in Lin's classic paper Automatic Retrieval and Clustering of Similar Words. Its central idea is to exploit the syntactic behaviour of the words in a text to determine their similarity in meaning. For example, in the sentences above, echidna occurs as the subject of eat, hide and crawl, and as the object of see. This information can be automatically extracted by a so-called parser, a piece of software that uncovers the syntactic structure of a sentence. The general idea behind Lin's approach is that the more often two words occur in the same syntactic relationship (for example, as the subject of eat), the more similar their meanings must be.
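To make this idea concrete, here is a minimal sketch in Python. The dependency triples are invented for illustration (a real system would extract millions of them with a parser), and I use a simple Dice overlap between context counts rather than Lin's actual information-theoretic similarity measure:

```python
from collections import Counter

# Toy dependency triples of the form (word, relation, other-word),
# as a parser might extract them. Invented for illustration.
triples = [
    ("echidna", "subj-of", "eat"), ("echidna", "subj-of", "hide"),
    ("echidna", "subj-of", "crawl"), ("echidna", "obj-of", "see"),
    ("hedgehog", "subj-of", "eat"), ("hedgehog", "subj-of", "hide"),
    ("hedgehog", "subj-of", "crawl"),
    ("geyser", "subj-of", "erupt"), ("geyser", "obj-of", "see"),
]

def contexts(word):
    """Represent a word by the counts of its syntactic contexts."""
    return Counter((rel, other) for w, rel, other in triples if w == word)

def similarity(w1, w2):
    """Dice coefficient over shared contexts: 1.0 for identical
    behaviour, 0.0 for no overlap. (Lin's paper instead weights
    each context by mutual information.)"""
    c1, c2 = contexts(w1), contexts(w2)
    shared = sum((c1 & c2).values())  # multiset intersection
    return 2 * shared / (sum(c1.values()) + sum(c2.values()))

print(similarity("echidna", "hedgehog"))  # high: shared subjects of eat, hide, crawl
print(similarity("echidna", "geyser"))    # low: only object-of-see in common
```

On these toy counts, echidna comes out far closer to hedgehog than to geyser, which is exactly the intuition from the four sentences above.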
Lin tested this hypothesis on a large collection of newspaper articles, totalling 64 million words. For all nouns, verbs and adjectives/adverbs in that collection, he found the words from the same word class with the most similar syntactic behaviour. This proved to be a very reliable technique for extracting words with a similar meaning. Here are some of the word pairs that Lin identified:
Nouns: earnings-profit, plan-proposal, employee-worker, battle-fight, actor-actress, oil-petroleum, gallery-museum, pub-tavern, waiter-waitress, …
Verbs: fall-rise, injure-kill, concern-worry, convict-sentence, limit-restrict, narrow-widen, hit-strike, …
Adjectives/Adverbs: high-low, bad-good, extremely-very, alleged-suspected, stormy-turbulent, communist-leftist, sad-tragic, enormously-tremendously, …
Although the words in these pairs tend to have a very similar meaning, they are not always substitutable. In fact, many word pairs contain polar opposites, such as fall-rise, narrow-widen, high-low, bad-good, etc. Other words differ in one specific semantic dimension, such as gender (actor-actress, waiter-waitress). Similarity in meaning can take on many forms.
In general, however, Lin’s results indicate that the syntactic similarity between two words gives a pretty good idea of their semantic similarity. It’s no surprise that Dekang Lin works at Google these days: search engines often use strategies like the one he developed to expand people’s queries. For example, if you Google lawyers in Belgium, Google also looks for websites with the word attorney, in order to give you more relevant hits. Lin’s technique offers one way to do this automatically.
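A query expander built on such similarity lists can be sketched in a few lines. The similarity dictionary below is hypothetical; in practice it would be filled with the output of a Lin-style method:

```python
# Hypothetical near-synonym lists, e.g. as produced by a Lin-style method.
similar_words = {
    "lawyer": ["attorney", "solicitor"],
    "pub": ["tavern", "bar"],
}

def expand_query(query):
    """Add near-synonyms of each query term, so that pages using a
    different word for the same concept can still match."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        expanded.extend(similar_words.get(term, []))
    return expanded

print(expand_query("lawyer Belgium"))
# ['lawyer', 'belgium', 'attorney', 'solicitor']
```

Real search engines are of course far more sophisticated about weighting the added terms, but the principle is the same: similarity lists turn one query into a family of related queries.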
P.S.: By the way, if you still want to know what an echidna looks like, here’s my picture:
However close your guess was, I bet you didn’t expect it to be so cute.