2009.04.09

thai wikipedia dictionary

An article at reganmian.net describes an interesting use for multilingual Wikipedia data. By processing the links between articles on the same subject in different languages, a unique dictionary can be created. It's not a general-purpose dictionary containing everyday words like 'apple', 'go', or 'city'. Instead, it contains the titles of movies and songs, names of famous people, terms for historic events, and other words or phrases that wouldn't be found in a general-purpose dictionary. Each entry in the dictionary is the title of a Wikipedia article.

The linked article contains an English-Chinese dictionary created by using the Chinese version of Wikipedia. Fortunately, the scripts used are also available. It was quite easy to modify them to produce an English-Thai dictionary, using the Thai version of Wikipedia.

The Thai Wikipedia is significantly smaller than the Chinese Wikipedia. The front page currently claims 44,892 Thai articles, compared with 243,361 Chinese articles. Interestingly, the Thai Wikipedia is similar in size to the Chinese version of two years ago, in 2007. The XML dumps of these sites contain more than just article entries. They also include template pages, special pages, category pages, help pages, and so on. The uncompressed data from the Thai Wikipedia is 419 MB.
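To make the idea concrete, here is a minimal sketch of the extraction step. It is written in Python rather than the Ruby of the original scripts, and it is not those scripts. It assumes the dump follows the usual page / title / revision / text layout, that interlanguage links appear in the article source as [[en:Title]], and it uses a tiny made-up sample without the XML namespace that a real dump carries. The names SAMPLE_DUMP and extract_pairs are invented for the illustration.

```python
import io
import re
import xml.etree.ElementTree as ET

# Assumption: interlanguage links are written in the page source as [[en:Title]].
EN_LINK = re.compile(r"\[\[en:([^\]|]+)\]\]")

# Hypothetical, simplified stand-in for a dump (a real dump declares an XML namespace).
SAMPLE_DUMP = """<mediawiki>
  <page>
    <title>กรุงเทพมหานคร</title>
    <revision><text>article text [[en:Bangkok]] [[zh:曼谷]]</text></revision>
  </page>
  <page>
    <title>หน้าทดสอบ</title>
    <revision><text>no English counterpart [[zh:测试]]</text></revision>
  </page>
</mediawiki>""".encode("utf-8")


def extract_pairs(stream):
    """Yield (thai_title, english_title) for each page that links to an English article."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag != "page":
            continue
        title = elem.findtext("title")
        text = elem.findtext("revision/text") or ""
        match = EN_LINK.search(text)
        if title and match:
            yield title, match.group(1).strip()
        # Drop the page's contents so a large dump is not held in memory all at once.
        elem.clear()


pairs = list(extract_pairs(io.BytesIO(SAMPLE_DUMP)))
```

Streaming with iterparse matters more than it looks here: a dump of several hundred megabytes is awkward to load as a single tree, while a page at a time is cheap.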

I modified the Ruby script that was used for the Chinese-English dictionary to be slightly more strict when filtering links. The modified script attempts to exclude templates, help pages, standalone numbers, years, and other titles that don't produce interesting words or phrases. It produces a dictionary with 24,468 entries. By contrast, the Chinese Wikipedia produces a dictionary with 123,300 entries. In some cases the filter may have been too aggressive. A portion of the discrepancy between 44,892 and 24,468 can be attributed to Thai Wikipedia articles which don't have English-language counterparts.
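The kind of filtering described above might look roughly like the following. This is a Python sketch under stated assumptions, not the modified Ruby script: the prefix list, the number/year pattern, and the name is_interesting are all hypothetical, and the check is applied to the English title only.

```python
import re

# Assumed list of English namespace prefixes to drop; the real script's list may differ.
NAMESPACE_PREFIXES = ("Template:", "Help:", "Category:", "Wikipedia:", "Image:")

# Titles that are only digits, optionally followed by BC or AD (e.g. "2007", "44 BC").
NUMBER_OR_YEAR = re.compile(r"\d+(\s*(BC|AD))?")


def is_interesting(english_title):
    """Return True if the title looks like a dictionary-worthy word or phrase."""
    if english_title.startswith(NAMESPACE_PREFIXES):
        return False
    if NUMBER_OR_YEAR.fullmatch(english_title):
        return False
    return True


candidates = ["Bangkok", "Template:Infobox", "2007", "44 BC", "2001: A Space Odyssey"]
kept = [title for title in candidates if is_interesting(title)]
```

Using a full match rather than a prefix match keeps titles such as "2001: A Space Odyssey", which start with a number but are exactly the kind of entry the dictionary is for.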

I put up a web-based interface to search the dictionary. It's searchable in both Thai and English. Each Thai entry is a link to its associated Wikipedia article. This lets the dictionary double as a quick and easy English-language search of the Thai Wikipedia.
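A search like this needs very little machinery. The sketch below is in Python and is only a guess at the shape of such a lookup, not the code behind the actual interface: the sample entries and the names search and article_url are invented, and the link format assumes the standard th.wikipedia.org/wiki/ URL scheme.

```python
from urllib.parse import quote

# Hypothetical sample of (thai_title, english_title) dictionary entries.
ENTRIES = [
    ("กรุงเทพมหานคร", "Bangkok"),
    ("อินเทอร์เน็ต", "Internet"),
]


def search(entries, query):
    """Case-insensitive substring search over both the Thai and English sides."""
    needle = query.casefold()
    return [
        (thai, english)
        for thai, english in entries
        if needle in thai.casefold() or needle in english.casefold()
    ]


def article_url(thai_title):
    """Build a link to the Thai Wikipedia article for a title."""
    return "https://th.wikipedia.org/wiki/" + quote(thai_title.replace(" ", "_"))


english_hits = search(ENTRIES, "bang")
thai_hits = search(ENTRIES, "กรุงเทพ")
```

Because the same function matches against either side of each pair, a query typed in English returns the Thai titles, and those titles link straight to the Thai articles.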

Many of the entries are transliterations rather than translations. I find it interesting to see which instances of the same word are transliterated and which are translated, and in some cases how they have been transliterated. Many borrowed words in Thai are transliterated several different ways. The word 'Internet', for example, appears in seven different forms.

The various language-specific versions of Wikipedia don't constitute parallel corpora. However, because some of the data is structured, there are some opportunities for automated extraction of translations. The titles, which function as implicit links between articles in different languages, are a case in point. Manual inspection turns up more interesting things still. Where else would you go to find out how to talk about super powers in Thai?