The most recent issue of the International Journal of Corpus Linguistics features an article on word variation by my two PhD advisors and myself. In this paper we explore how computational models of meaning can be used to find synonyms across two varieties of the same language, such as the Dutch used in Belgium and the Netherlands, or the German spoken in Austria and Germany. Here’s my story about a meeting between Natural Language Processing and linguistics, and peer review gone wrong.
It’s a well-known fact that speakers of the same language often use different words to refer to the same thing. This is not only the case in dialects, but also in languages where different standard varieties have developed. Two such pluricentric languages are Dutch, with different standard varieties in Belgium and the Netherlands, and German, with different standard varieties in Germany and Austria, among other places. For example, while the Dutch like to call their uncle oom, the Flemish like to say nonkel, and while the Germans refer to the first month of the year as Januar, many Austrians say Jänner.
As part of my PhD research at the University of Leuven, I investigated whether we can identify such synonyms fully automatically. I compiled a long list of words typical of Belgian Dutch and Austrian German, on the basis of the Referentiebestand Belgisch-Nederlands and the Variantenwörterbuch des Deutschen. Because I wanted to find the synonyms of these words automatically, I needed large collections of texts in the relevant standard varieties. For Dutch, I used the Twente Nieuws Corpus, together with a similar collection of Belgian newspaper articles that we compiled at the University of Leuven. For German, I used the Huge German Corpus for German German, and the Deutsches Referenzkorpus for Austrian German.
To detect the synonyms, I applied an NLP method that I discussed earlier on this blog, which assumes that we can find synonyms by looking for words that often occur in the same contexts. For example, because Belgian-Dutch nonkel (uncle) often occurs together with the word tante (aunt), we can expect its Netherlandic-Dutch counterpart oom to do the same. I wrote a piece of software that, for each of the words typical of Belgian Dutch or Austrian German, looks up all its contexts in the Belgian-Dutch or Austrian-German corpus, and then identifies the words with the most similar contextual behaviour in the Netherlandic-Dutch or German-German corpus, respectively.
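To make the idea concrete, here is a minimal sketch in Python. It is not the software used for the paper: the toy sentences, the window size, the raw co-occurrence counts and the function names (`context_vectors`, `cosine`, `nearest_neighbour`) are all assumptions made for illustration, and a real system would need large corpora and most likely some weighting of the counts.

```python
from collections import Counter
from math import sqrt


def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = {}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            vectors.setdefault(word, Counter()).update(left + right)
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[w] for w, count in a.items() if w in b)
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_neighbour(word, source_vectors, target_vectors):
    """Return the target-corpus word whose contexts are most similar to `word`'s."""
    source = source_vectors[word]
    scored = [(cosine(source, vec), cand) for cand, vec in target_vectors.items()]
    return max(scored)[1]


# Toy corpora (invented for this example, not taken from the real data).
belgian = [
    "mijn nonkel en tante komen morgen op bezoek",
    "de nonkel en tante van jan wonen in gent",
]
netherlandic = [
    "mijn oom en tante komen morgen op bezoek",
    "de oom en tante van jan wonen in utrecht",
    "de fiets staat in de schuur",
]

best = nearest_neighbour(
    "nonkel", context_vectors(belgian), context_vectors(netherlandic)
)
print(best)  # oom
```

The sketch relies on the same assumption as the method itself: most context words, such as tante, are shared between the two varieties, so the context counts from one corpus can be compared directly with those from the other.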
The results showed that this so-called distributional-semantic method can indeed be used to detect synonyms across different standard varieties of the same language. It automatically detected jam (jam) as the Netherlandic-Dutch synonym for Belgian confituur, woonkamer (living room) for living, schoon (clean) for proper, and many more examples. The same was true for German, where it correctly identified Fleischer (butcher) as the German-German synonym for Austrian Fleischhauer, or Krankenhaus (hospital) for Spital. Although the method struggled with infrequent words and words with several meanings, overall its results were very convincing.
While I’m happy to see this article published, I’m pretty dissatisfied with the publication process. It’s hard to believe, but I submitted this article in October 2011, three years and six months ago. I’ve reviewed enough papers myself to know that reviewing can take some time. Articles for the International Journal of Corpus Linguistics are reviewed by three anonymous specialists, who doubtless have very busy lives themselves. But it took the journal two years and five months to get the reviews for our article, after which it kindly asked me to revise the paper in less than three months. That’s just unacceptable. Fast forward another year, and our article is finally available for everyone, except it’s not, because John Benjamins is one of those publishers that keep academic articles behind a paywall. Oh well. If I ever need reminding why I left academia, I’ll know where to go.