Thai Wikipedia Dictionary
2009.04.09/19:18:20

An article at reganmian.net describes an interesting use for multilingual Wikipedia data. By processing the links between articles on the same subject in different languages, a unique dictionary can be created. It's not a general-purpose dictionary that contains everyday words like 'apple', 'go', or 'city'. Instead, it contains the titles of movies and songs, names of famous people, terms for historic events, and other words or phrases which wouldn't be found in a general-purpose dictionary. Each entry in the dictionary is the title of a Wikipedia article.
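
As a rough illustration of the idea (not the script from the linked article), here is a minimal Python sketch. It assumes the dump stores interlanguage links inline in the article text, in the form `[[en:Title]]`. The names `EN_LINK`, `english_title`, and `build_dictionary` are made up for this example.

```python
import re

# Matches an inline interlanguage link to the English Wikipedia,
# for example [[en:Bangkok]]. Assumes links are stored in the wikitext.
EN_LINK = re.compile(r"\[\[en:([^\]\|]+)\]\]")


def english_title(wikitext):
    """Return the English counterpart title found in wikitext, or None."""
    match = EN_LINK.search(wikitext)
    return match.group(1).strip() if match else None


def build_dictionary(pages):
    """Build an English -> Thai mapping.

    pages is an iterable of (thai_title, wikitext) pairs.
    Pages without an English link are skipped.
    """
    entries = {}
    for thai_title, wikitext in pages:
        en = english_title(wikitext)
        if en:
            entries[en] = thai_title
    return entries
```

Each Thai page with an English link contributes one entry, so the size of the dictionary is bounded by the number of pages that have an English counterpart.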

The linked article contains an English-Chinese dictionary created by using the Chinese version of Wikipedia. Fortunately, the scripts used are also available. It was quite easy to modify them to produce an English-Thai dictionary, using the Thai version of Wikipedia.

The Thai Wikipedia is significantly smaller than the Chinese Wikipedia. The front page currently claims 44,892 Thai articles, compared with 243,361 Chinese articles. Interestingly, the Thai Wikipedia is similar in size to the Chinese version of two years ago, in 2007. The XML dumps of these sites contain more than article entries: they also include template pages, special pages, category pages, help pages, and so on. The uncompressed data from the Thai Wikipedia is 419 MB.
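
A file of that size is easier to read as a stream than to load whole. The sketch below is one way to do that in Python, assuming the usual MediaWiki export layout of `<page>` elements containing a `<title>` and a revision `<text>`. It is not the original Ruby script, and `iter_pages` and `local` are names invented for this example.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the XML namespace from a tag like '{namespace}page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, text) for every <page> in a MediaWiki XML dump.

    source is a filename or a binary file object. Every kind of page is
    yielded, including templates, categories, and help pages.
    """
    title, text = None, None
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text or ""
            title, text = None, None
            elem.clear()  # drop the finished page's children to save memory
```

For a quick check without a real dump, `iter_pages(io.BytesIO(b"<mediawiki>...</mediawiki>"))` works on an in-memory sample.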

I modified the Ruby script that was used for the Chinese-English dictionary to be slightly stricter when filtering links. The modified script attempts to exclude templates, help pages, standalone numbers, years, and other titles that don't produce interesting words or phrases. It produces a dictionary with 24,468 entries. By contrast, the Chinese Wikipedia produces a dictionary with 123,300 entries. In some cases the filter may have been too aggressive. A portion of the discrepancy between 44,892 and 24,468 can be attributed to Thai Wikipedia articles which don't have English-language counterparts.
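
The kind of filter described above could look roughly like this in Python. The exact rules in the modified Ruby script aren't listed here, so the prefix list and the numeric test are assumptions made for illustration, as are the names `SKIPPED_PREFIXES`, `NUMERIC`, and `keep_title`.

```python
import re

# Hypothetical list of English namespace prefixes to drop; the real
# script's rules are not given in this post.
SKIPPED_PREFIXES = (
    "Template:", "Help:", "Category:", "Wikipedia:",
    "Special:", "File:", "Image:",
)

# A title made only of digits covers both standalone numbers and years.
NUMERIC = re.compile(r"^\d+$")


def keep_title(en_title):
    """Return True if an English title looks like a useful dictionary entry."""
    title = en_title.strip()
    if title.startswith(SKIPPED_PREFIXES):
        return False
    if NUMERIC.match(title):
        return False
    return True
```

A rule like `NUMERIC` drops '2009' but keeps '1984 (novel)'. Stricter patterns would remove more noise, at the risk of being too aggressive.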

I put up a web-based interface to search the dictionary. It's searchable in both Thai and English. Each Thai entry is a link to its associated Wikipedia article. This lets the dictionary double as a quick and easy English-language search of the Thai Wikipedia.
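
A search over both languages can be sketched as follows. This is not the code behind the actual interface. It assumes the dictionary is held in memory as an English to Thai mapping and does a plain substring match. The function names and the URL pattern used in `article_url` are assumptions for this example.

```python
from urllib.parse import quote


def search(entries, query):
    """Return (english, thai) pairs where either side contains query.

    entries maps English titles to Thai titles. Matching ignores case,
    which only matters for the English side.
    """
    q = query.casefold()
    return [
        (en, th)
        for en, th in sorted(entries.items())
        if q in en.casefold() or q in th.casefold()
    ]


def article_url(thai_title):
    """Build a link to the Thai Wikipedia article for a title."""
    # Wikipedia URLs use underscores in place of spaces.
    return "https://th.wikipedia.org/wiki/" + quote(thai_title.replace(" ", "_"))
```

Because the same function handles queries in either language, typing an English word returns the Thai titles that link straight into the Thai Wikipedia.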

Many of the entries are transliterations rather than translations. I find it interesting to see which instances of the same word are transliterated and which are translated, or, in some cases, how they have been transliterated. Many borrowed words in Thai are transliterated several different ways. There are seven different ways for the word 'Internet', for example.

The various language-specific versions of Wikipedia don't constitute parallel corpora. However, because some of the data is structured, there are some opportunities for automated extraction of translations. The titles, which function as implicit links between articles in different languages, are a case in point. By manual inspection there are more interesting things to be had. Where else would you go to find out how to talk about super powers in Thai?

Comments

Stian Haklev : 2009.04.14/01:13:38

Hi, came here through my web log. Very happy to see other people reusing the script and building on it! Stian

sandy : 2009.10.05/22:01:00

it's a bad wikipedia
