Motivation for Technical Choices

The problem of storing documents written in English (and most other European languages) so that they can be served to users with no specialized software is a no-brainer. This is not so for Bengali (and other Indian languages), because of the complex nature of its written script. A few words of explanation about the technical choices we made are thus in order.

There are two important questions to be settled:

  1. How the documents would be stored
  2. How the documents would be viewed

Hypothetically speaking, a Bengali `document' can be stored in several forms: as an image, for example, or as a PDF file. However, we feel that text should be represented simply as text, without any further complications (retaining, most importantly, the ability to freely edit that text). We also wish to use standards that are open, cross-platform, and widely recognized. A few years ago, it might have been difficult to meet these criteria (which led to the development of several unrelated, mutually incompatible, and often proprietary, platform-specific protocols to deal with this problem). But with the widespread acceptance of Unicode, the problem of unambiguously representing Bengali text is mostly solved. What remains are the tools to properly render this text, and we have reached the point where such tools are available (although sometimes unstable) on all major platforms.

Format of the documents

Our documents will be HTML files or plain text files, with Bengali characters represented by their Unicode code points. The encoding used is UTF-8. This part is well implemented on almost all modern platforms.
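To make the storage format concrete, here is a minimal Python sketch of what encoding a Bengali string as UTF-8 involves. The sample word and the helper name utf8_roundtrip are our own illustrative choices, not part of any tool used for these documents. Bengali code points lie in the range U+0980 to U+09FF, so each one occupies three bytes in UTF-8.

```python
def utf8_roundtrip(text: str) -> bytes:
    """Encode text as UTF-8 and check that decoding restores it exactly."""
    data = text.encode("utf-8")
    if data.decode("utf-8") != text:
        raise ValueError("UTF-8 round trip did not restore the text")
    return data


# Illustrative sample: the word "Bangla" written with explicit escapes
# (BA, vowel sign AA, anusvara, LA, vowel sign AA).
sample = "\u09ac\u09be\u0982\u09b2\u09be"
data = utf8_roundtrip(sample)

print(len(sample), "code points stored as", len(data), "bytes")
print(data.hex(" "))
```

Because the stored bytes decode back to exactly the same characters, the text stays freely editable, which is the property we wanted to retain.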

Viewing the documents

Unicode represents Bengali text as a sequence of Bengali characters. Unlike most European scripts, just rendering these characters one after another is not enough for Bengali (and other Indic scripts); it is necessary to form new glyphs by combining several characters. In the picture below, the characters on the left are an example of what might be in the UTF-8 encoded HTML file, while on the right is what we would expect to actually see on the screen.
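The stored sequence can also be inspected directly. The following Python sketch lists the code points behind a single conjunct; the conjunct chosen (KA + VIRAMA + SSA) and the helper name describe are our own illustration, not taken from the picture. A renderer with working Indic support is expected to draw these three stored characters as one combined glyph.

```python
import unicodedata


def describe(text: str) -> list[tuple[str, str]]:
    """Return (code point, Unicode character name) for each stored character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


# Illustrative conjunct: three stored characters, one glyph on screen.
conjunct = "\u0995\u09cd\u09b7"

for code_point, name in describe(conjunct):
    print(code_point, name)
```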

The rules for converting from the first form to the second are not that difficult; however, until recently, there was no accepted standard that described them. This has been addressed in the extension to the TrueType font format known as OpenType (along with some rules in Unicode for reordering the characters before combining them). A description of the parts of the specification relevant to Indic scripts is available through Microsoft's typography site (you will have to look around a bit; the exact links seem to move from time to time).
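As a rough illustration of what reordering means, the toy Python sketch below handles just one case: the vowel signs I, E, and AI are stored after the consonant they belong to but drawn to its left. It is a deliberate simplification under our own assumptions (the function name visual_order is hypothetical, and two-part vowel signs, reph, and glyph substitution are all ignored); the actual rules are those in the specification and are applied by the rendering engine, not by the document.

```python
# Vowel signs that are stored after a consonant but drawn before it.
PRE_BASE_SIGNS = {"\u09bf", "\u09c7", "\u09c8"}  # signs I, E, AI
VIRAMA = "\u09cd"  # joins consonants into a conjunct


def visual_order(text: str) -> str:
    """Toy reordering: move pre-base vowel signs in front of their cluster."""
    out: list[str] = []
    cluster_start = 0  # index in `out` where the current consonant cluster begins
    for ch in text:
        if ch in PRE_BASE_SIGNS and out:
            # Place the sign before the whole cluster it follows in storage.
            out.insert(cluster_start, ch)
            cluster_start = len(out)
        else:
            # A character starts a new cluster unless it is a virama
            # or directly follows one (i.e. continues a conjunct).
            if ch != VIRAMA and not (out and out[-1] == VIRAMA):
                cluster_start = len(out)
            out.append(ch)
    return "".join(out)


stored = "\u0995\u09bf"  # KA followed by vowel sign I
print([f"U+{ord(c):04X}" for c in visual_order(stored)])
```

Note that the stored text is never changed; the reordering exists only in the sequence handed on for glyph selection.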

Support in various operating systems

Although Unicode is well supported, OpenType is an evolving technology and may not be supported on all platforms. This page should tell you more about the current state of affairs on the platform of your choice.

It should be noted that OpenType is not the only technology that could potentially deal with complex scripts like Bengali (i.e., with the problem of correctly rendering text supplied as Unicode). However, it is currently the only technology that we are familiar with, so all discussion here will be restricted to it.

Last modified: Sun Mar 14 21:43:18 CST 2004 by Deepayan Sarkar <deepayan at stat.wisc.edu>