Here I build upon the algorithm for reading Devanagari advanced in the Ingalls and Ingalls paper "The Mahabharata: Stylistic Study, Computer Analysis, and Concordance."1 While they developed an application capable of accurately reading hand-written Devangari, I will use a single standard font, the Devanagari subset of Arial Unicode MS, although the algorithm presented here is capable of learning a wide range of Devanagari fonts. Also because of time constraints and because hundreds of templates are necessary to cover the full breadth of Devanagari, I will be dealing with a representative subset of the script. The primary contribution of the approach presented here is removal of the need to shift segmented characters over templates looking for the best match. The algorithm here simply involves resizing the segmented character to the size of the template.
My paperIf you don't know Devanagari see:
Optical character recognition of Devanagari presents several difficulties. The most obvious is separation of compound consonants, like the ones below taken from the Omniglot site, into their constituent consonants. I will not be tacking this particular problem in this project because the more straightforward way to solve it involves simply creating many more templates, a fairly tedious and programmatically uninteresting task. The more difficult way—separating out the constituents—involves machine learning techniques not reasonably implemented in a three week project. A more reasonable difficulty is the integration of characters above the matra without those below it. This issue presents it self in the case of most vowels following consonants (e.g. कौ, की). कि (ki) has its vowel written before it, but the vowel is actually after it sequentially. A similar issue is the handling of characters with internal whitespace, the consonants ण, श, ग, ङ and vowels आ, ओ, औ. These integration issues cover all the cases I will consider in the following, and constitute most, but not all, cases that cannot simply be handled by augmenting the “alphabet” with most character templates. So, in summary, I will be dealing with a subset of Devanagari consisting of all non-composite consonants, vowels, and consonant-vowel pairs (except those written beneath consonants).
The method of Devanagari OCR presented here involves segmenting an image in four ways: into lines, into words, into words below and above the matra, and into characters. These segmented characters are then applied either in the creation of templates or the recognition of Devanagari characters in conjunction with previously created templates.
1.Because very few allowances for noise have been added to this application, input images are obtained with the print-screen function and cropped so that only whitespace surrounds the MS Arial Unicode Devanagari text. This also removes any need to rotate the text. 2.Convert images to binary form using the Matlab graythresh() method. This function uses the OTSU algorithm to compute the threshold between black and white pixels
3.Recover lines of text by looking for breaks between rows of pixels with maximal whitespace and those with less than maximal whitespace. A vertically aligned histogram of the image’s rows is used to perform this computation. 4.Recover words in a similar fashion by looking for breaks by analysis of the horizontal histogram of image columns. 5.Sequential breaks which are less than a certain number of pixels apart are both removed. a.Removing line breaks very near to each other performs some noise correction without reducing accuracy. b.Removing word breaks very near to each other corrects for those characters which are actually divided by whitespace, which occur with characters that break the matra (e.g. ध (dha), थ (tha), भ (bha), श (sha), अ (a) ).
6.The rows which designate the matra are recovered by taking all those pixels with a vertical histogram value less than 1.5 times the minimum histogram value (a black pixel has a value of 0). a.This constant of 1.5 is empirically derived and used because matra rows have far more black pixels than other rows, but not all the same histogram value because of characters that break the matra. A similar constant would work equally well. 7.To recover characters below and above the matra, breaks between maximal and sub-maximal horizontal histogram values are again used. This process is performed separately for the part of a word above and below the matra. No breaks are removed in this case.
8.Segmented characters can be applied either in creation of templates or character recognition. Template Creation: 1.Crop whitespace on the bottom of characters below the matra; crop whitespace on the top of characters above the matra. 2.Save the character images to a designated output directory. 3.Manually identify the output images by renaming the output files a.File names are the same as the mappings above under Problem Statement (plus .jpg extension) for characters below the matra, except: i.There is no “a” after the consonant (e.g. k.jpg, not ka.jpg). ii.Consonants with internal whitespace (श,ग,ण, ङ) only have a template for their first part (i.e. without the following vertical bar). iii.The vertical bar is A.jpg. b.Files names for characters above the matra have their vowel name with a suffix of “^” (e.g. ai^.jpg). i.The only exception is the top of ई () which has a name of I.jpg (without the ^ suffix). 4.Duplicate templates are deleted. 5.Characters above the matra go in sub-directory ./top . Characters below the matra go in a sub-directory ./bottom . a.This arrangement of templates improves efficiency.
1.Crop whitespace on the bottom of characters below the matra; crop whitespace on the top of characters above the matra.
2.For characters below the matra perform SSD with all templates in
The OCR program was run on two images not used to create template. Both covered the full range of consonants and vowels. While the second, larger image had many more consonant- vowel combination characters, both had at least several instances of each combined vowel.
On Test Image 1, the program preformed flawlessly.
Output from running Test Image 1:
Akakhagagha icachajajha ITaThaiDaDha auyiyI utathodaudha UpaphAbabha Rn-an~aNanama Lyiralava eshaSesaha aiyayai oyoyau auyiyI mIgasedhaNabha cagakaDa yalashathaucho jaTighaThoya okhagINaDa dhabhAyamoralashajhi chAcaSahilana RRditauDhaDaja Uthashavamabhayapo phadhan-achaitaNa Ajain~araucaTabhayaphadhavItaDha DisahavApabho Tajhajauchagakhakoga
On Test Image 2, the program performed recognition flawlessly on 90% of the lines. For the fourth line from the bottom, beginning with “ahakhi…,” the program outputted only an extra space and “ri.” I have not had the time to uncover why this difficulty occurred. There were also some issues with vowels followed by the characters for “n-“ or “D.” How I was outputting these characters caused Matlab to group characters together, causing the dimensions of the output character array to be incorrect. I was not able to remedy this, but why debugging it appeared the program identified the characters correctly nonetheless, even if the output could not be property represented.
Output from running Test Image 2:
akAkhaugI Agoghain-a cachAjaujhan~aiTa ItIthodadhanapiphabibhama ucachAja UpIpha R LNatetha RRshaSai ethida aiseha on~aTauTha aumiyerAla kacan-oghAga AphipInAdhodai An~IjhAghegaikha ishivaira IhasiSashauva ubhIbAphepau UshadathotaNaDha Rcin-agha Lmobhaba eponadha airAyAmI RRtiNauDhiDa RRghashan-avocAchoyai ri Acarichauyaijimaijha niDhadhakeha iyojAmojhabhen~a aiDhadhaNadata ITaphaThepan-ina Ln-ivaicarIcha un~AbaTiphITha Un-aivAshacara Rhakhesoga
Both Test Image 1 and Test Image 2 are 24 point font. When I tried a more standard 12 point font with the same templates as those applied to the two Test Images, the program performed similarly but some characters that are similar in Devanagari, such as “p” and “sh,” “n-“ and “D” were confused.