Reading Devanagari

Andy Dibble


Abstract

Here I build upon the algorithm for reading Devanagari advanced in the Ingalls and Ingalls paper "The Mahabharata: Stylistic Study, Computer Analysis, and Concordance."1 While they developed an application capable of accurately reading hand-written Devangari, I will use a single standard font, the Devanagari subset of Arial Unicode MS, although the algorithm presented here is capable of learning a wide range of Devanagari fonts. Also because of time constraints and because hundreds of templates are necessary to cover the full breadth of Devanagari, I will be dealing with a representative subset of the script. The primary contribution of the approach presented here is removal of the need to shift segmented characters over templates looking for the best match. The algorithm here simply involves resizing the segmented character to the size of the template.

My paper

The Devanagari Script

If you don't know Devanagari see:

Problem Statement

Optical character recognition of Devanagari presents several difficulties. The most obvious is separation of compound consonants, like the ones below taken from the Omniglot site, into their constituent consonants. I will not be tacking this particular problem in this project because the more straightforward way to solve it involves simply creating many more templates, a fairly tedious and programmatically uninteresting task. The more difficult way—separating out the constituents—involves machine learning techniques not reasonably implemented in a three week project. A more reasonable difficulty is the integration of characters above the matra without those below it. This issue presents it self in the case of most vowels following consonants (e.g. कौ, की). कि (ki) has its vowel written before it, but the vowel is actually after it sequentially. A similar issue is the handling of characters with internal whitespace, the consonants ण, श, ग, ङ and vowels आ, ओ, औ. These integration issues cover all the cases I will consider in the following, and constitute most, but not all, cases that cannot simply be handled by augmenting the “alphabet” with most character templates. So, in summary, I will be dealing with a subset of Devanagari consisting of all non-composite consonants, vowels, and consonant-vowel pairs (except those written beneath consonants).

Method

Overview:

The method of Devanagari OCR presented here involves segmenting an image in four ways: into lines, into words, into words below and above the matra, and into characters. These segmented characters are then applied either in the creation of templates or the recognition of Devanagari characters in conjunction with previously created templates.

Preprocessing:

1.Because very few allowances for noise have been added to this application, input images are obtained with the print-screen function and cropped so that only whitespace surrounds the MS Arial Unicode Devanagari text. This also removes any need to rotate the text. 2.Convert images to binary form using the Matlab graythresh() method. This function uses the OTSU algorithm to compute the threshold between black and white pixels

Segmentation:

3.Recover lines of text by looking for breaks between rows of pixels with maximal whitespace and those with less than maximal whitespace. A vertically aligned histogram of the image’s rows is used to perform this computation. 4.Recover words in a similar fashion by looking for breaks by analysis of the horizontal histogram of image columns. 5.Sequential breaks which are less than a certain number of pixels apart are both removed. a.Removing line breaks very near to each other performs some noise correction without reducing accuracy. b.Removing word breaks very near to each other corrects for those characters which are actually divided by whitespace, which occur with characters that break the matra (e.g. ध (dha), थ (tha), भ (bha), श (sha), अ (a) ).

Line and Word Segmentation:

6.The rows which designate the matra are recovered by taking all those pixels with a vertical histogram value less than 1.5 times the minimum histogram value (a black pixel has a value of 0). a.This constant of 1.5 is empirically derived and used because matra rows have far more black pixels than other rows, but not all the same histogram value because of characters that break the matra. A similar constant would work equally well. 7.To recover characters below and above the matra, breaks between maximal and sub-maximal horizontal histogram values are again used. This process is performed separately for the part of a word above and below the matra. No breaks are removed in this case.

Matra and character segmentation:

8.Segmented characters can be applied either in creation of templates or character recognition. Template Creation: 1.Crop whitespace on the bottom of characters below the matra; crop whitespace on the top of characters above the matra. 2.Save the character images to a designated output directory. 3.Manually identify the output images by renaming the output files a.File names are the same as the mappings above under Problem Statement (plus .jpg extension) for characters below the matra, except: i.There is no “a” after the consonant (e.g. k.jpg, not ka.jpg). ii.Consonants with internal whitespace (श,ग,ण, ङ) only have a template for their first part (i.e. without the following vertical bar). iii.The vertical bar is A.jpg. b.Files names for characters above the matra have their vowel name with a suffix of “^” (e.g. ai^.jpg). i.The only exception is the top of ई () which has a name of I.jpg (without the ^ suffix). 4.Duplicate templates are deleted. 5.Characters above the matra go in sub-directory ./top . Characters below the matra go in a sub-directory ./bottom . a.This arrangement of templates improves efficiency.

Character Recognition:

1.Crop whitespace on the bottom of characters below the matra; crop whitespace on the top of characters above the matra. 2.For characters below the matra perform SSD with all templates in /bottom. a.Resize each character to fit the dimensions of each candidate template. 3.The template with the lowest match indicates the next character. The character itself is retrieved from the template filename. Integration: Once the steps for Character Recognition above are performed, they must be synthesized into the final string of text. This is necessary because vowels written above the matra have associations to consonants below the matra, some consonants have internal whitespace, and many vowels above the matra also create a vertical bar (designated as “A”) below the matra. 4.Create a map between characters below the matra and those above it. A character below has a character above as its value if the character above’s right-hand side is closest in the word horizontally to the character below horizontally. a.Tom Benson’s implementation of a findNearest() algorithm was used to create these mappings. 5.For each character in the bottom: a.If it is a consonant and not a consonant with internal whitespace (g (ग), sh (श), N (ण)) or potentially internal whitespace (D (ड) vs. n- (ङ)) i.Output it ii.If it has a vowel association (i.e. mapping with vowel above matra), output the vowel iii.If the next character is “A” (vertical bar), output the vowel associated with the “A.” b.If it is a consonant with internal whitespace: i.Output the consonant ii.If the “A” after it has a vowel association, output the vowel iii.If the second “A” after it has a vowel association, out the vowel. c.If it is a vowel: i.If it is vowel other than “e” (ए), “i” (इ), or “a” (अ), output the vowel. ii.If the vowel is “i” and has an vowel association, output “I” (ई) iii.If the vowel is “e” and has an vowel association, output “ai” (ऐ) iv.If the next character is “A,” output the vowel associated with the “A.” v.Otherwise output “a”

Results

The OCR program was run on two images not used to create template. Both covered the full range of consonants and vowels. While the second, larger image had many more consonant- vowel combination characters, both had at least several instances of each combined vowel.

On Test Image 1, the program preformed flawlessly.

Output from running Test Image 1:
Akakhagagha icachajajha ITaThaiDaDha auyiyI utathodaudha UpaphAbabha Rn-an~aNanama Lyiralava eshaSesaha aiyayai oyoyau auyiyI mIgasedhaNabha cagakaDa yalashathaucho jaTighaThoya okhagINaDa dhabhAyamoralashajhi chAcaSahilana RRditauDhaDaja Uthashavamabhayapo phadhan-achaitaNa Ajain~araucaTabhayaphadhavItaDha DisahavApabho Tajhajauchagakhakoga

On Test Image 2, the program performed recognition flawlessly on 90% of the lines. For the fourth line from the bottom, beginning with “ahakhi…,” the program outputted only an extra space and “ri.” I have not had the time to uncover why this difficulty occurred. There were also some issues with vowels followed by the characters for “n-“ or “D.” How I was outputting these characters caused Matlab to group characters together, causing the dimensions of the output character array to be incorrect. I was not able to remedy this, but why debugging it appeared the program identified the characters correctly nonetheless, even if the output could not be property represented.

Output from running Test Image 2:
akAkhaugI Agoghain-a cachAjaujhan~aiTa ItIthodadhanapiphabibhama ucachAja UpIpha R LNatetha RRshaSai ethida aiseha on~aTauTha aumiyerAla kacan-oghAga AphipInAdhodai An~IjhAghegaikha ishivaira IhasiSashauva ubhIbAphepau UshadathotaNaDha Rcin-agha Lmobhaba eponadha airAyAmI RRtiNauDhiDa RRghashan-avocAchoyai ri Acarichauyaijimaijha niDhadhakeha iyojAmojhabhen~a aiDhadhaNadata ITaphaThepan-ina Ln-ivaicarIcha un~AbaTiphITha Un-aivAshacara Rhakhesoga

Both Test Image 1 and Test Image 2 are 24 point font. When I tried a more standard 12 point font with the same templates as those applied to the two Test Images, the program performed similarly but some characters that are similar in Devanagari, such as “p” and “sh,” “n-“ and “D” were confused.