“Doing research, demonstrating and making a full-scale prototype is the work of the academics” 

Dr. Narayana Murthy, a professor at Hyderabad Central University, is an academic and researcher in computer software, focusing particularly on OCR, a technology with high industry potential. RNDIndia.info sought his perspective on how academia interacts with industry and what industry expects from academic R&D, apart from the trials and travails of product and process development in the academic world. He says, “Doing research, demonstrating and making a full-scale prototype is the work of the academics. Companies have to understand and absorb the technology at that level and convert it into a full-scale product by testing and evaluating.”

--------------------------------------------------------------------------

Content development in regional languages has become a matter of urgency. What are your efforts in this direction? 

Indian languages have many peculiar features that are not well understood even today. English and other European languages are alphabetic. At the other extreme are scripts such as Japanese and Chinese, which are hieroglyphs and pictures. The script in our languages is entirely syllabic in nature. These features are not problems in themselves, but they seem so because we approach our scripts through standards developed for English. Our effort at Hyderabad Central University is to clarify the basic features and explain their importance. Much of the software developed so far for our local needs is just good enough for using the computer as a kind of typewriter, not for creating a knowledge base that can be processed, searched and memorized. This happens because of a lack of understanding of our writing, our sound system, the way text is stored and so on. It is mainly related to the standardization of characters. We do have a national standard called ISCII for Indian language text encoding, and there is now the emerging international standard, Unicode. There is a basic confusion at this level in regional language use. Most people go by what appears on the screen or what is stored in the file. In the end that is not what is important. Indian language text that we see on the computer often does not survive even when we highlight it and change the font, because what is being done at present for Indian scripts uses a non-standard format. We are working on standardization and on making these concepts clear to everybody, including standards institutions, right from operating systems and applications up to the content level. This is the reason we started working on the Akshara multilingual text processing software. These problems are peculiar to us and we have to solve them.

We need large-scale data to work with. This was never a problem for English. In our languages we have to start from this level. We are now developing a corpus and basic statistics that will be useful. We are working on OCR for Indian languages. We have also done some work on machine translation, populating databases in our languages, automatic summarization of documents, automatic categorization and automatic extraction of relevant pages. Our aim should be to provide all computer functions in Indian languages. Someone who does not know English at all must also be in a position to use them.

Regional language computing and content creation are more challenging than a layman perceives. Can you elaborate on the other issues in this context?

When we type A in English, what is typed is A, what is stored is A and what is displayed is A. In regional languages we work at the level of syllables, not alphabets. In swa - tran - tyam there are three syllables. We cannot have keys and fonts for all of them. So we talk of consonants and vowels and a means of combining them. Matra and otthu need to be introduced, things other languages need not bother about. We have a character-encoding scheme based on the elements or components that are required to make the syllables, but they are not syllables in themselves. There is a grammar for syllables. We have a large number of syllables and a method of composing them. We have to store them and use this computational grammar at all levels. We cannot display a large number of syllables directly. The components used for display are different from the ones used for storing. For display we need graphical units. So we have fonts, which have small graphical units good for composing and printing, and a character encoding, which deals in vowels, consonants and the like, from which we have to construct the syllables. If we have character-based encoding, it is independent of fonts and operating systems. Standard software can then work on it and perform any kind of operation. But it requires a mapping from that encoding to a font encoding. We have several commercial fonts that are not standardized, so each font is different. We need a mapping system from character encoding to font encoding. Most of the efforts in regional language text try to bypass this mapping, so the web pages created are font-specific. If the client has the same font, the contents are readable. But in the real sense of the term, font-encoded text is not text at all. It is not a knowledge base. We cannot perform any other operation, such as sorting, on such text.
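The difference between character encoding and font encoding can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses Unicode for the stored text, and the two glyph tables (FONT_A, FONT_B), their code values and the function names are invented for the example. It does not describe how Akshara, wbilio or any real font works.

```python
import unicodedata

# One syllable stored as character codes from the Unicode Telugu block:
# consonant SA + virama (the joining sign) + consonant VA + vowel sign AA.
SWAA = "\u0c38\u0c4d\u0c35\u0c3e"

def components(syllable):
    """Return the Unicode names of the stored components of a syllable."""
    return [unicodedata.name(ch) for ch in syllable]

# Hypothetical glyph tables for two made-up, non-standard fonts.
# Font A has a single glyph for the whole conjunct; font B builds it from parts.
FONT_A = {"\u0c38\u0c4d\u0c35": [0x41], "\u0c3e": [0x7A]}
FONT_B = {"\u0c38": [0x90], "\u0c4d\u0c35": [0xB3], "\u0c3e": [0x22]}

def to_font_codes(text, table):
    """Map character codes to font glyph codes by greedy longest match."""
    out, i = [], 0
    longest = max(len(key) for key in table)
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in table:
                out.extend(table[piece])
                i += size
                break
        else:
            raise KeyError(f"no glyph for {text[i]!r}")
    return out

if __name__ == "__main__":
    print(components(SWAA))
    print(to_font_codes(SWAA, FONT_A))
    print(to_font_codes(SWAA, FONT_B))
```

The stored text is the same in both cases and can be searched or sorted as text; only the glyph codes sent to the display differ from font to font.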

We are working on the technology of standard character-encoded pages, which will work across platforms. For this we have a Java applet, which maps the text onto the font at the client side. This is not plug-in software that needs to be downloaded independently. Plug-ins have their own problems; some of them are browser-specific. Now we have a new technology - wbilio - web based Indian languages input and output. For the first time we have platform-independent, standard text input for Indian languages that is truly international. So we can type in Telugu and get the Hindi meanings on the webpage, irrespective of local settings. This technology has been developed and demonstrated. Now we can create webpages using Akshara and edit them at any time without violating standards.

These are the technologies that in a way benefit the IT hardware and software firms. What is their contribution to these efforts? 

We are working closely with software and hardware companies at various levels. We transfer technology to these companies. The Telugu spell checker is one recent example. Error detection and correction is quite challenging. We have an MOU with Modular Infotech, a Pune-based company, which made the Telugu spell checker.

Are other big names like Wipro and Infosys in the picture?

No. Except for HP Ltd., most of the big companies are not seriously working on Indian language computing. HP Labs in Bangalore is working on Indian fonts. Our developments have to be operating-system independent. That imposes different constraints on us. Language processing is more challenging than the stereotyped processing in MS Office and such packages.

Do you get the required qualified human resources for the research work of this nature, considering the industry out there is economically more rewarding? 

This is a problem area. Just one outstanding person cannot make all the changes. We need motivated researchers as Ph.D. students. We do not normally get people from the industry, due to the absence of pay parity and various other issues. So software engineers from the industry do not take these projects seriously.

Do industry interests overlap with the work you are doing?

There are at least a dozen organizations like ours working on basic research in regional languages. [refer to site] There are 13 resource centers for language technology assisted by the Ministry of IT, GOI. But the sad thing is that none of our industry is ready for technology transfer in this area. They want a salable product. Technology transfer is different from a salable product. If I have a product I can as well sell it.

Companies have to understand and absorb the technology at that level and convert it into a full-scale product by testing and evaluating. Doing research, demonstrating and making a full-scale prototype is the work of the academics. This is not really what is happening today. For the OCR system we had to develop the benchmarking standards. We had to develop the test data, which is a tedious process in itself and certainly not academic work.

So far we have not been successful in getting industry attention, except for getting some Ph.D. fellowships.

Do you collaborate with the 13 centres working on this specialization? 

Yes, the OCR standardization has been done jointly by ISI, Calcutta and Hyderabad Central University. We are working closely with IIT, Kanpur for standardization issues. Our OCR system is to be used for Kannada and Malayalam. We meet frequently to exchange notes and tools.  

We have developed web technology, and at the Central Institute of Indian Languages, Mysore, Indian language teaching material is to be put on the web using our technology. We are linking up and spreading as far as possible.

Content creation in regional languages has become a challenge. What are our efforts in making automatic translation possible? 

In a joint project with IIT, Kanpur, we have developed a product called Anusaraka, which translates from one Indian language to another, particularly Telugu to Hindi. Over the last two years researchers from IIT Kanpur have come to work here every other month. This project is at the most advanced stage. The heart and soul of the system is the Telugu-Hindi dictionary, which has now been made more portable.

You have also attempted automatic translation. What is the success rate in automatic translation? 

Our software achieves 85-90 percent, but the output is not necessarily in an acceptable form. Small features that could add value were never implemented. Our effort now is to correct this.

Are there any practical experiments in automatic translation? 

Yes, a project I did for the Finance Department of the Government of Karnataka, translating the budget speech from English to Kannada. I feel translation between English and Indian languages, in both directions, has greater relevance than translation between Indian languages.

The budget speech keeps getting updated till the last minute, and then it has to be translated in an hour or so. Translation is the most difficult thing. It was cautioned that high-quality automatic translation is not to be expected. At best it could be 60 percent correct. This is true of any pair of languages in the world, except when the domain of focus is small and an exact fit can be made between the words in the different languages. It is possible in highly constrained domains.

Interactive translation is also possible. Here, whenever the computer finds a word whose meaning or grammatical syntax it is not sure of, it is made to stop. To the extent we can help the computer, it can provide a better translation. English to Kannada is 40-60 percent successful. Here the translation is done at the sentence level: the sentence is analyzed structurally and a translation is arrived at. Parsing itself is difficult. So wherever the system is not sure, it will not attempt a translation.

The flow of operation is somewhat like this: the software takes the English sentence and parses it; it looks at the structure and decides the corresponding structure of the Kannada sentence; it goes from the sentence to individual words and picks up the words from the English-Kannada dictionary; and it does Kannada morphological generation to get the correct forms of the words, since gender and other appropriate forms have to be considered. All this is done in less than five minutes on the computer. Then comes the post-processing level. Here the translated version is presented, and where the application has not attempted a translation, the English version is shown as it is. The system has a very powerful post-processor, with which we can very quickly reconstruct the text using suggestions from the system. There is also a thesaurus for picking alternate words if need be, and for changing root words into the appropriate forms.
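The flow described above can be sketched as a toy Python pipeline. Everything here is an assumption made for illustration: the three-word "subject verb object" parser, the dictionary whose "kn:" entries are placeholders rather than real Kannada words, the "+NOM"-style tags standing in for morphological generation, and all function names. It only shows the order of the steps and the fallback to English when the parser is not confident, not the actual system.

```python
# Toy bilingual dictionary; the "kn:" values are placeholders, not real Kannada.
DICTIONARY = {
    "minister": "kn:minister",
    "presents": "kn:present",
    "budget": "kn:budget",
}

# Stand-in for morphological generation: one tag per grammatical role.
MORPH_TAG = {"subject": "+NOM", "object": "+ACC", "verb": "+FIN"}

def parse(sentence):
    """Toy parser: accepts only 'subject verb object' made of known words."""
    words = sentence.lower().rstrip(".").split()
    if len(words) != 3 or any(w not in DICTIONARY for w in words):
        return None  # not confident, so no translation is attempted
    subject, verb, obj = words
    return {"subject": subject, "verb": verb, "object": obj}

def transfer(tree):
    """Reorder English subject-verb-object into subject-object-verb."""
    return [
        ("subject", tree["subject"]),
        ("object", tree["object"]),
        ("verb", tree["verb"]),
    ]

def generate(role, word):
    """Dictionary lookup followed by the stand-in morphological step."""
    return DICTIONARY[word] + MORPH_TAG[role]

def translate(sentence):
    """Return (text, translated); untranslated sentences stay in English."""
    tree = parse(sentence)
    if tree is None:
        return sentence, False  # shown as it is for the human post-editor
    return " ".join(generate(role, word) for role, word in transfer(tree)), True

if __name__ == "__main__":
    print(translate("Minister presents budget."))
    print(translate("The minister presents the budget."))
```

Returning a flag alongside the text lets a post-processing step show which sentences were left in English for the human editor.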

There would be no typing mistakes, yet an expert has to look at the final version for coherence, contextual shades of meaning and so on. The translation has been tried experimentally but not put to a real-life test in a budget session, more because of logistics and operational problems. 40-60 percent of the time it gives the correct sentence and sense. There is a small chance that it could be incorrect, either because the word or the context is wrong. In the remaining cases the rendering remains in English, as the parser was not confident enough to translate or suggest a translation. We are improving this software with additional features, including statistical processing.

We have enough data to test now. New tools are available. We are trying to combine a statistical component with human judgement: something that adapts and learns in context as it goes on. We have a good dictionary. We have frequency information on word usage and so on.

Next best thing to content creation in regional languages is instant translation that happens on the web. What is the success rate for translation on the web? 

There is a distinction between high-quality output and quick-and-dirty translation. In the case of high-quality translation the expectation is, for some reason, very high. On the web people don’t expect much when they seek translation of web documents, because anything is better than nothing. There is a difference in the level of acceptance, as this is a new requirement and people are getting used to what they get. We haven't done enough experiments with Indian languages. We don’t know what the acceptance level would be or whether it is achievable.

How does translation for other oriental languages fare? How do we compare there?

Nothing is really comparable, as the structures of the sentences are completely different. Unless there is a good syntactic mechanism we can't do good translation. At least for English there is a computational grammar, but nothing like that exists for Indian languages. We have to work within the limitations.

OCR 

There are about seven centres working on OCR technology in various Indian schools. Among the languages, Telugu, Bangla, Devanagari, Tamil and Punjabi are the ones where considerable work has been done. For these languages the tools give over 96 per cent accuracy. In Telugu we get 97-98 per cent accuracy without post-processing for normal printed books. A dictionary or a spell checker could add to the accuracy: it can reject the doubtful words and pick up a second-best alternative from the dictionary. With these tools we can improve the performance by 1 per cent. We have the interface for these things in the Akshara system. But there is no Indian language OCR for handwritten material or even for manuscripts. Yet what we have achieved through OCR is of practical use if we look at the other applications and their utility. Preserving the heritage is fine; manuscripts can be preserved even as images. When working with manuscripts, researchers have greater patience and put up with some difficulties.
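The dictionary-based post-processing mentioned above, rejecting a doubtful word and picking the next alternative found in the dictionary, can be sketched as follows. This is a minimal sketch under assumptions: the recognizer is assumed to return a ranked list of alternatives for each word, and the function name, the data layout and the Latin-letter sample words are invented for the example; it is not the Akshara interface.

```python
def post_process(candidates, dictionary):
    """Pick, for each word, the highest-ranked alternative found in the dictionary.

    candidates: one list per word, alternatives ordered best first.
    Returns a list of (word, confirmed) pairs; confirmed is False when no
    alternative is in the dictionary and the top guess is kept as it is.
    """
    result = []
    for ranked in candidates:
        match = next((word for word in ranked if word in dictionary), None)
        if match is None:
            result.append((ranked[0], False))  # keep top guess, flag as doubtful
        else:
            result.append((match, True))
    return result

if __name__ == "__main__":
    dictionary = {"akshara", "text"}
    ocr_output = [
        ["aksbara", "akshara"],  # top guess is wrong, second one is a word
        ["text", "test"],        # top guess is already a dictionary word
        ["xyz", "xyq"],          # nothing matches, so the word stays doubtful
    ]
    print(post_process(ocr_output, dictionary))
```

Flagging the unmatched words, rather than silently keeping them, leaves room for a spell checker or a human reader to look at them later.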

Which are the other universities working on these issues?

The main centres specializing in this technology include IIT Kanpur, Bombay, Chennai and Kolkata. None of the mainstream universities is doing this research yet, except MS University in Baroda, where the work has just begun, and Anna University in Chennai.

Which are the international agencies in this field of activity?  

There is an Indo-UK group called LESAL - Language Engineering in South Asian Languages. We had a workshop under these auspices last year. There is also an Indo-French Network on the topic. We have built a network of people and share expertise in these groups. We are now charting out the possible collaboration, joint projects, exchanges, etc.  

Are the Universities in other countries much ahead of us? 

There are things to learn on both sides. The features of our languages are not there in others. As we have made things work with these nuances, there are things for them to learn at the basic level itself. It is the same with us: we have a lot to learn from them. Tools can be exchanged. France, for instance, has similar problems. They were very secretive; for 15 years nothing was known about their experience. The situation was the same in India. It is only now that we realize the need for collaboration.

Speech synthesis software 

Speech synthesis technology as such is available in the public domain. The Telugu text-to-speech system was a student project. This technology is not limited to a set of words. Developing higher-quality software calls for some linguistic understanding of where to put the stress, the intonation and so on. But speech recognition is a different problem, and it will take many years for us to have quality software. In India we don’t even have a speech database, a spoken corpus. There is no software that can tell which language a sound file is in, not to speak of the dialect.

We had discussions on this. We have a vision document for Indian language technology. We have set a target for speech-to-speech translation: I speak in one language, and the system has to recognize that, translate it and regenerate it in another language. It is happening in a limited sense from English to French and so on. Air traffic systems do this. In this limited domain they have a large spoken database. For us this is more an ambition.

What is the commercial proposition going with the work? Has this been patented, for instance? 

Right now there are no patents, but there are a variety of possibilities. Certain things we simply want to give away free, though not just by putting them up as freeware for download; this may not be the best option within India. We are thinking about how to distribute them in the best way. There are other things we want to make available to a limited circle of researchers. We have taken many things from IIT Kanpur and vice versa. There are certain products we want to transfer to industry - OCR is a fit case for this. There is also a research angle, for which publication in journals is the right approach. And there are certain critical things that we can look at for patenting. We are doing a project for the Government Examiner of Questioned Documents. This is a department that deals with doubtful cases of document validity - erasing, defacing etc. We have developed procedures to deal with these cases and establish the validity of the document. Software does this job. We want to patent these things. We want patents to protect the IPRs. We may still distribute the tool free while holding the patent. We are not interested in someone monopolising the technique and not putting it to use for general benefit. We are also exploring industry-university collaborative projects, as only industry can tell us what the public is looking for; industry is in touch with the public.

What are the other engaging researches in software development apart from language? 

There are many areas: new models of computation and software engineering, and testing and verifying at a theoretical level. Data security is a big issue now. Online examination, for instance, needs validation. Biometrics is coming up now. Bioinformatics is the in thing now. Search, retrieval and compression are all being looked at with new vigor.

Then there are semantics and syntax, web mining, and rich media documents - uniformly storing a variety of digital knowledge bases such as databases, embedded objects and ppts, so that they can be queried across formats.

These are some of the cutting-edge research areas as I see it.
