Kanji Lookup Tool

October 28, 2008

Recently I’ve been interested in Django, a web application framework that’s often praised for allowing developers to come up with rich database-driven web apps in a very short time, using very little code. I went through the official tutorial a few weeks ago, but the best way to learn a new technology is of course to use it on a real project. In that spirit, I took some time at the weekend to write a kanji dictionary for my mobile phone.

First, some background on what kanji are and how they work. Japanese has three alphabets: hiragana, katakana and kanji; each is used for slightly different purposes. The first two have roughly 50 characters each, and each character has only one reading, whereas kanji has well over 10,000 characters, most of which have multiple readings. A kanji character’s readings are split into onyomi (the sounds used when combining the character with another kanji to make a word) and kunyomi (the sounds used when combining the character with hiragana characters to make a word). Don’t worry – it’s not as complicated as it sounds!

[Note to Japanese speakers: there were a number of over-simplifications in the previous paragraph. Please don’t cringe too hard.]

Because there are so many kanji with so many readings, people need a way to look them up. Common ways to index the character are by onyomi, kunyomi, stroke count (how many strokes it takes to write the character) and radical (character sub-parts that are common to many characters). I often find that I know some readings for a character but I need to know the rest, e.g. I know the onyomi but need the kunyomi or vice versa. So I made a simple web app that I could view on my phone, allowing me to quickly and easily search for kanji based on:

  • onyomi
  • kunyomi
  • stroke count
  • finally, the character itself

The search is designed such that you just input as much information as you know, and a list of all matching kanji is returned. You choose the character you’re looking for and view information about it.
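The "input as much as you know" behaviour can be sketched in plain Python. This is only an illustration of the matching logic, not the code from the project: the `KanjiEntry` record, the `search` function and the tiny sample data set are all hypothetical, and in the real app the same idea would presumably be expressed as filters on Django models. Every field the user fills in narrows the result; fields left empty are ignored.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class KanjiEntry:
    """Hypothetical in-memory record standing in for the app's database rows."""
    literal: str
    stroke_count: int
    onyomi: List[str] = field(default_factory=list)   # readings in katakana
    kunyomi: List[str] = field(default_factory=list)  # readings in hiragana


# Tiny illustrative data set with simplified readings.
KANJI = [
    KanjiEntry("山", 3, ["サン"], ["やま"]),
    KanjiEntry("川", 3, ["セン"], ["かわ"]),
    KanjiEntry("三", 3, ["サン"], ["み"]),
    KanjiEntry("日", 4, ["ニチ", "ジツ"], ["ひ", "か"]),
]


def search(entries: List[KanjiEntry],
           literal: Optional[str] = None,
           onyomi: Optional[str] = None,
           kunyomi: Optional[str] = None,
           stroke_count: Optional[int] = None) -> List[KanjiEntry]:
    """Return entries matching every criterion that was actually supplied."""
    matches = []
    for entry in entries:
        if literal is not None and entry.literal != literal:
            continue
        if onyomi is not None and onyomi not in entry.onyomi:
            continue
        if kunyomi is not None and kunyomi not in entry.kunyomi:
            continue
        if stroke_count is not None and entry.stroke_count != stroke_count:
            continue
        matches.append(entry)
    return matches


if __name__ == "__main__":
    # Knowing only the onyomi gives a short list to choose from.
    print([e.literal for e in search(KANJI, onyomi="サン")])
    # Adding the stroke count or a kunyomi narrows it further.
    print([e.literal for e in search(KANJI, onyomi="サン", kunyomi="やま")])
```

In this sketch, calling `search` with no criteria returns everything; a real search form would probably want to reject an empty query instead.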

I got the kanji information from Kanjidic2, a humungous XML file containing very detailed info on over 13,000 characters. I also used this Japanese Wiktionary page to get a list of the 214 radicals, so that I could display radicals on the information page for each kanji.

Using Django for this reasonably simple project didn’t turn out to be that fast, as I was learning Django as I went along, but I was very surprised by how little code I ended up writing to get the job done. I was also impressed with how well the models and views are separated. A lot of frameworks claim to enforce MVC and in fact end up cutting corners, but Django really does enforce the separation whilst still making the development process easy and natural.

What I did

1. Define 4 simple models: Kanji, OnYomi, KunYomi, Radical. Django automatically sets up the database tables to fit those models.

2. Write parsers to import all the data from Kanjidic2 and the radicals data (distilled from the Wiktionary page using grep and sed)

3. Set up the admin pages. Django does pretty much all of this for you.

Admin interface screenshot

4. Write the views (the Python code that does the actual work of searching, etc.)

5. Write the templates (HTML pages with Django extensions)
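To give a feel for step 2, here is a minimal sketch of pulling the relevant fields out of a Kanjidic2-style entry with Python's standard `xml.etree.ElementTree`. It is not the project's actual parser. The element and attribute names (`character`, `literal`, `stroke_count`, `rad_value`, `reading` with `r_type` of `ja_on` or `ja_kun`) follow the Kanjidic2 format as I understand it, and the function name `parse_characters` and the inline sample entry are made up for illustration.

```python
import xml.etree.ElementTree as ET

# A single hand-written entry in the style of Kanjidic2 (illustrative only).
SAMPLE = """<kanjidic2>
  <character>
    <literal>山</literal>
    <radical>
      <rad_value rad_type="classical">46</rad_value>
    </radical>
    <misc>
      <stroke_count>3</stroke_count>
    </misc>
    <reading_meaning>
      <rmgroup>
        <reading r_type="ja_on">サン</reading>
        <reading r_type="ja_on">セン</reading>
        <reading r_type="ja_kun">やま</reading>
        <meaning>mountain</meaning>
      </rmgroup>
    </reading_meaning>
  </character>
</kanjidic2>"""


def parse_characters(xml_text):
    """Yield one dict per <character> element with the fields the app needs."""
    root = ET.fromstring(xml_text)
    for char in root.iter("character"):
        readings = char.findall("./reading_meaning/rmgroup/reading")
        radical = char.find("./radical/rad_value[@rad_type='classical']")
        strokes = char.findtext("./misc/stroke_count")
        yield {
            "literal": char.findtext("literal"),
            "stroke_count": int(strokes) if strokes else None,
            "radical": int(radical.text) if radical is not None else None,
            "onyomi": [r.text for r in readings if r.get("r_type") == "ja_on"],
            "kunyomi": [r.text for r in readings if r.get("r_type") == "ja_kun"],
        }


if __name__ == "__main__":
    for entry in parse_characters(SAMPLE):
        print(entry)
```

Each dict maps naturally onto the four models from step 1: the literal and stroke count belong to the kanji, each reading becomes an onyomi or kunyomi row, and the radical number points at one of the 214 radicals. For the full file, which is very large, `ET.iterparse` would be a more memory-friendly choice than loading everything with `ET.fromstring`.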

And voilà, we’re done! You can see the application in action at http://67.207.135.42/kanji/. It’s designed to be viewed on a mobile phone, but it works perfectly well in a desktop browser too, as it’s just standard XHTML. I plan to leave it running there for the foreseeable future, so feel free to use it. I’ve already used it in earnest a few times in the last few days.

Note: On the search form you have to input onyomi in katakana and kunyomi in hiragana.
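One way this restriction could be relaxed is to normalise the input before searching, so that either script is accepted in either field. The sketch below is an assumption about how that might be done, not something the app currently does, and the function names are hypothetical. It relies on the fact that the hiragana block (U+3041 to U+3096) and the corresponding katakana (U+30A1 to U+30F6) sit exactly 0x60 code points apart in Unicode.

```python
KANA_OFFSET = 0x60                      # distance between hiragana and katakana
HIRAGANA_FIRST, HIRAGANA_LAST = 0x3041, 0x3096
KATAKANA_FIRST, KATAKANA_LAST = 0x30A1, 0x30F6


def to_katakana(text):
    """Convert hiragana to katakana; all other characters pass through."""
    return "".join(
        chr(ord(ch) + KANA_OFFSET)
        if HIRAGANA_FIRST <= ord(ch) <= HIRAGANA_LAST else ch
        for ch in text
    )


def to_hiragana(text):
    """Convert katakana to hiragana; all other characters pass through."""
    return "".join(
        chr(ord(ch) - KANA_OFFSET)
        if KATAKANA_FIRST <= ord(ch) <= KATAKANA_LAST else ch
        for ch in text
    )


if __name__ == "__main__":
    print(to_katakana("さん"))   # normalise an onyomi query typed in hiragana
    print(to_hiragana("ヤマ"))   # normalise a kunyomi query typed in katakana
```

Characters outside those ranges, such as the long-vowel mark or the dots that separate okurigana in dictionary readings, are left untouched.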

Still to do

  • Improve the Kanjidic parser. While Kanjidic is an amazing resource, it contains some kanji which are useless for my purposes, e.g. very rare or historical kanji that show up as completely blank in the app. The cause is unlikely to be UTF-8 itself, which (like UTF-16) can encode every Unicode character, so switching encodings wouldn’t help; more likely these are characters that the database or the device’s fonts can’t handle. Either way there’s very little need to support them, and my phone can’t display every character properly anyway.
  • Paginate the search results. This should be reasonably simple using Django’s Paginator object.
  • Make it pretty. I could add stylesheets to pretty up the pages a little bit, and maybe even separate stylesheets for handheld devices and desktop PCs.
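On the encoding point in the first item above, a quick check with the standard library shows that UTF-8 and UTF-16 are two encodings of the same Unicode character set, so a kanji that exists in one exists in the other. What does differ is the size: characters outside the Basic Multilingual Plane take 4 bytes in UTF-8, and some storage layers (MySQL's legacy 3-byte `utf8` charset is one example) cannot hold them. The two sample characters are chosen purely for illustration; whether the blank kanji in the app are of this kind is an assumption.

```python
# A common kanji inside the Basic Multilingual Plane (U+6F22).
COMMON_KANJI = "\u6f22"
# A rare kanji outside the BMP (U+2000B), written as an escape so the
# source displays correctly even without a font that covers it.
RARE_KANJI = "\U0002000b"


def encoded_sizes(ch):
    """Report the code point and the byte length in UTF-8 and UTF-16."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "utf8_bytes": len(ch.encode("utf-8")),
        # Big-endian variant so no byte-order mark is added to the count.
        "utf16_bytes": len(ch.encode("utf-16-be")),
    }


if __name__ == "__main__":
    for ch in (COMMON_KANJI, RARE_KANJI):
        print(encoded_sizes(ch))
        # Both encodings round-trip the character without loss.
        assert ch.encode("utf-8").decode("utf-8") == ch
        assert ch.encode("utf-16-be").decode("utf-16-be") == ch
```

If the 4-byte characters turn out to be the culprits, the parser could simply skip any entry whose literal has a code point above U+FFFF.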

You can get a copy of the code here. There’s no documentation, so if you want some help setting it up, leave a comment and I’ll get back to you.
