“Doing research, demonstrating and making a full-scale prototype is the work of the academics”
Dr. Narayana Murthy, a professor at Hyderabad Central University, is an academic
and researcher in computer software, focusing particularly on OCR, a technology
with high industry potential. RNDIndia.info sought his perspective on how
academia interacts with industry and what industry expects from academic R&D,
as well as on the trials and travails of product and process development in an
academic world. He says, “Doing research, demonstrating and making a full-scale
prototype is the work of the academics. Companies have to understand and absorb
the technology at that level and convert it into a full-scale product by
testing and evaluating.”
--------------------------------------------------------------------------
Content development in regional languages has become a matter of urgency. What
are your efforts in this direction?
Indian languages have many peculiar features that are not well understood even
today. English and other European languages are alphabetic languages. At the
other extreme are the scripts of Japanese, Chinese etc., which are hieroglyphs
and pictures. Script in our languages is entirely syllabic in nature. These are
not problems in themselves, but they seem so because we approach our script
through the standards developed for English. Our effort at Hyderabad Central
University is to clarify the basic features and explain their importance. Much
of the software developed so far for our local needs is just good enough for
using computers as some kind of typewriter, not for creating a knowledge base
that can be processed, searched and memorized. This happens because of a lack
of understanding of our writing, our sound system, the way text is stored etc.
This is mainly related to the standardization of characters. We do have a
national standard called ISCII for Indian language text encoding, and there is
now the emerging international standard - Unicode. There is a basic confusion
at this level in regional language use. Most go by what appears on the screen
or what is stored in the file. In the end that is not what is important. The
Indian language text we see on the computer does not survive even when we
highlight the text and change the font, because what is being done at present
for Indian scripts is in a non-standard format. We are working on
standardization and on making these concepts clear to everybody, including
standards institutions, right from operating systems and applications up to
the contents level. This is the reason we started working on the Akshara
multilingual text processing software. These problems are peculiar to us and we
have to solve them.
We need large-scale data to work with. This was never a problem with English.
In our languages we have to start from this level. We are now developing a
corpus and basic statistics that will be useful. We are working on OCR for
Indian languages. We have also done some work on machine translation and
populating databases in our languages, automatic summarization of documents,
automatic categorization and automatic extraction of relevant pages. Our aim
should be to make every computer function available in Indian languages.
Someone who does not know English at all must also be in a position to use a
computer.
Regional language computing and content creation are more challenging than a
layman perceives. Can you elaborate on other issues in this context?
When we type A in English, what is typed is A, what is stored is A and what is
displayed is A. In regional languages we work at the level of syllables, not
alphabets. In swa - tan - tryam there are three syllables. We cannot have keys
and fonts for all of them. So we talk of consonants and vowels and a means of
combining them. Matras and otthus need to be introduced, things other languages
need not bother about. We have a character-encoding scheme based on the
elements or components that are required to make the syllables, but they are
not syllables in themselves. There is a grammar for syllables. We have a large
number of syllables and a method of composing them. We have to store them and
use this computational grammar at all levels. We cannot display a large number
of syllables directly. The components used for display are different from the
ones used for storing. For display we need graphical units. So we have fonts,
which have small graphical units good for composing and printing, and a
character encoding, which goes into vowels, consonants and things like that,
from which we have to construct the syllables. If we have character-based
encoding, it is independent of fonts and operating systems. Then you can have
standard software working on it, with which you can do any kind of operation.
But it requires mapping from that encoding to a font encoding. We have several
commercial fonts that are not standardized, so each font is different. We need
a mapping system from character encoding to font encoding. Most efforts in
regional language text try to bypass this mapping and work directly in the font
encoding, so the web pages created are font specific. If the client has the
same font, the contents are readable. But in the real sense of the term,
font-encoded text is not text at all. That is not a knowledge base. We cannot
do any other operation - sorting etc. - with such text.
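The idea of a "grammar for syllables" over stored components can be illustrated
with a short sketch. It is not the Akshara implementation; it assumes text
stored in Unicode, uses approximate code point ranges from the Unicode Telugu
block, and applies one simplified rule (consonants joined by the virama, an
optional matra, an optional modifier such as the anusvara). The function name
split_syllables and the helper names are made up for this illustration.

```python
# Approximate ranges from the Unicode Telugu block (assumption: Unicode storage).
VIRAMA = "\u0c4d"  # joins consonants into a conjunct (the otthu mechanism)

def is_consonant(ch):
    return "\u0c15" <= ch <= "\u0c39"

def is_independent_vowel(ch):
    return "\u0c05" <= ch <= "\u0c14"

def is_vowel_sign(ch):  # matra
    return "\u0c3e" <= ch <= "\u0c4c"

def is_modifier(ch):  # candrabindu, anusvara, visarga
    return "\u0c01" <= ch <= "\u0c03"

def split_syllables(text):
    """Simplified syllable grammar:
    (consonant virama)* consonant [virama | matra] [modifier]
    or independent-vowel [modifier]; anything else stands alone."""
    syllables = []
    i, n = 0, len(text)
    while i < n:
        start = i
        if is_consonant(text[i]):
            i += 1
            # Absorb each "virama + consonant" pair into the same syllable.
            while i + 1 < n and text[i] == VIRAMA and is_consonant(text[i + 1]):
                i += 2
            if i < n and (text[i] == VIRAMA or is_vowel_sign(text[i])):
                i += 1
        elif is_independent_vowel(text[i]):
            i += 1
        else:
            i += 1
            syllables.append(text[start:i])
            continue
        if i < n and is_modifier(text[i]):
            i += 1
        syllables.append(text[start:i])
    return syllables

# The Telugu word swatantryam, stored as twelve component code points.
WORD = "\u0c38\u0c4d\u0c35\u0c3e\u0c24\u0c02\u0c24\u0c4d\u0c30\u0c4d\u0c2f\u0c02"
SYLLABLES = split_syllables(WORD)
```

Here twelve stored components group into three syllables of four, two and six
components. Because the stored form is made of standard components rather than
font glyphs, the same rule works whatever font is later used for display.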
We are working on the technology of standard character-encoded pages, which
will work across platforms. For this we have a Java applet, which maps the text
onto the font at the client side. This is not plug-in software that needs to be
downloaded independently. Plug-ins have their own problems; some of them are
browser specific. Now we have a new technology - wbilio - web based Indian
languages input and output. For the first time we have platform-independent,
standard text input for Indian languages that is truly international. So we can
type in Telugu and get the Hindi meanings on the webpage, irrespective of local
settings. This technology has been developed and demonstrated. Now we can
create webpages using Akshara and edit them any time we want without violating
standards.
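A client-side mapping from character encoding to font encoding can be pictured
as a table lookup. The sketch below is only an illustration of that idea: the
two glyph tables FONT_A and FONT_B and their byte codes are invented, and real
font mappings also have to handle glyph reordering and conjunct shapes, which
are left out here.

```python
# Hypothetical glyph tables, invented for illustration: each font assigns its
# own byte code to the same character-encoded components.
FONT_A = {"\u0c15": 0x41, "\u0c3e": 0x61, "\u0c32": 0x4C}
FONT_B = {"\u0c15": 0xC1, "\u0c3e": 0xE1, "\u0c32": 0xD4}

def to_font_codes(text, font_table):
    """Map character-encoded text to the byte codes one particular font expects."""
    return bytes(font_table[ch] for ch in text)

def from_font_codes(data, font_table):
    """Recover character-encoded text from font codes, given the same table."""
    reverse = {code: ch for ch, code in font_table.items()}
    return "".join(reverse[b] for b in data)

# Stored once in character encoding: ka + aa sign + la.
WORD = "\u0c15\u0c3e\u0c32"

codes_a = to_font_codes(WORD, FONT_A)  # what a page tied to font A would carry
codes_b = to_font_codes(WORD, FONT_B)  # the same word tied to font B
```

The two byte strings differ although they stand for the same word, which is why
font-encoded pages are readable only with the matching font, while the
character-encoded form can be mapped to whichever font the client has.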
These are the technologies that in a way benefit the IT hardware and software
firms. What is their contribution to these efforts?
We are working closely with software and hardware companies at various levels.
We transfer technology to these companies. The Telugu spell checker is one
recent example. Error detection and correction is quite challenging. We have an
MOU with Modular Infotech, a Pune-based company, which made the Telugu spell
checker.
Are other big names like Wipro and Infosys in the picture?
No. Except for HP Ltd., most of the big companies are not seriously working on
Indian language computing. HP Labs in Bangalore is working on Indian fonts. Our
developments have to be operating system independent. That imposes different
constraints on us. Language processing is more challenging than the stereotyped
processing in MS Office and such packages.
Do you get the qualified human resources required for research work of this
nature, considering that the industry out there is economically more rewarding?
This is a problem area. Just one outstanding person cannot make all the
changes. We need motivated researchers as Ph.D. students. We do not normally
get people from the industry, due to the absence of pay parity and various
other issues. So software engineers from the industry do not take these
projects seriously.
Do industry interests overlap with the work you are doing?
There are at least a dozen organizations like ours working on basic research in
regional languages. [refer to site] There are 13 resource centers for language
technology assisted by the Ministry of IT, GOI. But the sad thing is that none
of our industry is ready for technology transfer in this area. They want a
salable product. Technology transfer is different from a salable product. If I
have a product I can as well sell it.
Companies have to understand and absorb the technology at that level and
convert it into a full-scale product by testing and evaluating. Doing research,
demonstrating and making a full-scale prototype is the work of the academics.
This is not what is really happening today. For the OCR system we had to
develop the benchmarking standards. We had to develop the test data, which is a
tedious process in itself and certainly not academic work.
So far we have not been successful in getting industry attention, except for
getting some Ph.D. fellowships.
Do you collaborate with the 13 centres working on this specialization?
Yes, the OCR standardization has been done jointly by ISI, Calcutta and
Hyderabad Central University. We are working closely with IIT, Kanpur on
standardization issues. Our OCR system is to be used for Kannada and Malayalam.
We meet frequently to exchange notes and tools.
We have developed web technology, and at the Central Institute of Indian
Languages, Mysore, Indian language teaching material is to be put on the web
using our technology. We are linking up and spreading as far as possible.
Content creation in regional languages has become a challenge. What are your
efforts in making automatic translation possible?
In a joint project with IIT, Kanpur, we have developed a product called
Anusaraka, which is for translating from one Indian language to another,
particularly Telugu to Hindi. In the last two years researchers from IIT Kanpur
would come and work here every other month. This project is in the most
advanced stage. The heart and soul of the system is the Telugu-Hindi
dictionary, which has now been made more portable.
You have also attempted automatic translation. What is the success rate in
automatic translation?
Our software achieves 85-90 percent, but the output is not necessarily in an
acceptable form. Small features that can add value were never done. Our effort
now is to correct this.
Are there any practical experiments in automatic translation?
Yes, a project I did for the Finance department of the government of Karnataka,
translating the budget speech from English to Kannada. I feel translation from
English to Indian languages and vice versa has greater relevance than
translation between the Indian languages.
The budget speech keeps getting updated till the last minute, and then it has
to be translated in an hour or so. The most difficult thing is translation. It
was cautioned that high quality automatic translation is not to be expected. At
best it could be 60 percent correct. This is true of any pair of languages in
the world, except when the domain of focus is small and an exact fit can be
made between the words in the different languages. High quality is possible
only in highly constrained domains.
Interactive translation is also made possible. Here, as and when the computer
finds a word for which it is not sure of the meaning or the grammatical syntax,
it is made to stop. To the extent we can help the computer, it can provide
better translation. English to Kannada is 40-60 percent successful. Here the
translation is done at sentence level - the sentence is analyzed structurally
and a translation is arrived at. Parsing itself is difficult. So wherever the
system is not sure, it will not attempt translation.
The flow of operation is somewhat like this: the software takes the English
sentence and parses it; looks at the structure and decides the corresponding
structure of the Kannada sentence; goes from sentences to individual words and
picks up the words from the English-Kannada dictionary; does Kannada
morphological generation to get the correct forms of the words, for we have
gender and other appropriate forms to be considered. These steps are done in
less than five minutes on the computer. Then comes the post-processing level.
Here the translated version is presented, and where translation was not
attempted by the application, the English version is shown as it is. The system
has a very powerful post-processor, using which we can very quickly reconstruct
the text with suggestions from the system. There is also a thesaurus for
picking alternate words, if need be, and for changing root words into the
appropriate forms.
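The flow described above (parse, decide the target structure, look up words,
generate word forms, leave the sentence in English when unsure) can be sketched
in a few lines. This is a toy under stated assumptions, not the system
described in the interview: the parser accepts a single sentence pattern, and
the lexicon and suffix table are rough romanized illustrations rather than
vetted linguistic data. All names (LEXICON, SUFFIX, parse, generate, translate)
are made up for the example.

```python
# Toy English-Kannada lexicon of root forms (romanized, illustrative only).
LEXICON = {"boy": "huduga", "book": "pustaka", "reads": "odu"}

# Toy morphological generation: one suffix per grammatical role (illustrative).
SUFFIX = {"subject": "nu", "object": "vannu", "verb": "ttane"}

def parse(sentence):
    """Toy parser: accepts only 'the <subject> <verb> the <object>'.
    Returns None when it is not confident about the structure."""
    words = sentence.lower().rstrip(".").split()
    if len(words) == 5 and words[0] == "the" and words[3] == "the":
        return {"subject": words[1], "verb": words[2], "object": words[4]}
    return None

def generate(root, role):
    """Attach the form required by the word's role in the sentence."""
    return root + SUFFIX[role]

def translate(sentence):
    tree = parse(sentence)
    # If parsing or dictionary lookup fails, show the English as it is.
    if tree is None or any(word not in LEXICON for word in tree.values()):
        return sentence
    # Structure transfer: English subject-verb-object to subject-object-verb.
    target_order = ("subject", "object", "verb")
    return " ".join(generate(LEXICON[tree[role]], role) for role in target_order)

translated = translate("The boy reads the book.")
untouched = translate("Budget estimates were revised.")
```

The second sentence does not fit the toy parser's pattern, so it comes back
unchanged, mirroring the behaviour in which untranslated sentences are passed
on to the post-processing stage in English.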
There would be no typing mistakes, yet an expert has to look at the final
version for coherence, contextual shades of meaning etc. The translation has
been tried experimentally but not put to a real-life test in a budget session,
mostly because of logistics and operational problems. 40-60 percent of the time
the system gives the correct sentence and sense. There is a small percentage of
cases where it could be incorrect, either because the word or the context is
wrong. In the remaining cases the rendering would remain in English, as the
parser was not confident enough to translate or suggest a translation. We are
improving this software with additional features based on statistical
processing.
We have enough data to test now. New tools are available. We are trying to
combine a statistical component with human judgement: something that adapts and
learns as it goes along in context. We have a good dictionary. We have
frequency information on word usage etc.
The next best thing to content creation in regional languages is instant
translation that happens on the web. What is the success rate for translation
on the web?
There is a distinction between high quality output and quick and dirty
translation. In the case of high quality translation, for some reason the
expectation is very high. On the web people don’t expect much when they seek
translation of web documents, because anything is better than nothing. There is
a difference in the level of acceptance. This is a new requirement, and people
are getting used to what they get. We haven’t done enough experiments with
Indian languages. We don’t know what the acceptance level would be and whether
it is achievable or not.
How does translation for the other oriental languages fare? How do we compare
there?
Nothing is really comparable, as the structures of the sentences are completely
different. Unless there is a good syntactic mechanism we can’t do good
translation. At least for English there is a computational grammar, but none of
that exists for Indian languages. We have to work within the limitations.
OCR
There are about seven centres working on OCR technology in various Indian
schools. Telugu, Bangla, Devanagari, Tamil and Punjabi are the languages where
considerable work has been done. For these languages the tools give over 96 per
cent accuracy. In Telugu we get 97-98 per cent accuracy without post-processing
for normal printed books. A dictionary or a spell checker could add to the
accuracy. It can reject the doubtful words and pick up a second best
alternative from the dictionary. With these tools we can improve the
performance by 1 per cent. We have the interface for those things in the
Akshara system. But there is no Indian language OCR for handwritten material or
even for manuscripts. Yet what we have achieved through OCR is of practical use
if we look at the other applications and their utility. Preserving the heritage
is fine. Manuscripts can be preserved even as images. When working with
manuscripts, researchers have greater patience and put up with some
difficulties.
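The dictionary-based post-processing mentioned above, rejecting a doubtful word
and picking up the next best alternative, can be sketched as follows. This is a
minimal illustration, assuming the OCR engine returns its alternatives for each
word ranked best first; the function name choose_word and the romanized sample
words are hypothetical and do not come from the Akshara system.

```python
def choose_word(candidates, dictionary):
    """Pick the highest-ranked OCR alternative that the dictionary accepts.

    candidates: alternatives for one word, best first (assumed OCR output).
    Returns (word, needs_review); needs_review is True when no alternative
    is in the dictionary and the top guess is kept unchanged."""
    for word in candidates:
        if word in dictionary:
            return word, False
    return candidates[0], True

# Hypothetical romanized sample data, for illustration only.
DICTIONARY = {"pustaka", "huduga"}

corrected = choose_word(["pustoka", "pustaka"], DICTIONARY)
flagged = choose_word(["hudvga", "hudoga"], DICTIONARY)
```

In the first call the top guess is rejected and the second alternative is
accepted; in the second call nothing matches, so the top guess is kept and
flagged for a human to check.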
Which are the other universities working on these issues?
The main centres specializing in this technology include IIT Kanpur, Bombay,
Chennai and Kolkata. None of the mainstream universities are doing this
research yet, except MS University in Baroda, where the work has just begun,
and Anna University in Chennai.
Which are the international agencies in this field of activity?
There is an Indo-UK group called LESAL - Language Engineering in South Asian
Languages. We had a workshop under these auspices last year. There is also an
Indo-French Network on the topic. We have built a network of people and share
expertise in these groups. We are now charting out the possible collaboration,
joint projects, exchanges, etc.
Are the universities in other countries much ahead of us?
There are things to learn on both sides. The features of our languages are not
there in others. As we have made things work with these nuances, there are
things for them to learn at the basic level itself. It is the same with us: we
have a lot to learn from them. Tools can be exchanged. France, for instance,
has similar problems. They were very secretive; for 15 years nothing was known
about their experience. The situation was the same in India. It is only now
that we realize the need for collaboration.
Speech synthesis software
Speech synthesis technology as such is available in the public domain. Our
Telugu text-to-speech system was a student project. This technology is not
limited to a set of words. Developing higher quality software for this calls
for some linguistic understanding, such as where to put the stress, intonation
etc. But speech recognition is a different problem, and it will take many years
for us to have quality software. In India we don’t even have a speech database,
a spoken corpus. There is no software that can tell which language a sound file
is in, not to speak of the dialect.
We had discussions on this. We have a vision document for Indian language
technology. We have set a target for speech-to-speech translation - I speak in
one language, and the system has to recognize that, translate it and regenerate
it in another language. It is happening in a limited sense from English to
French etc. The air traffic system does this. In this limited domain, they have
a large spoken database. For us this is more an ambition.
What is the commercial proposition going with the work? Has this been patented,
for instance?
Right now there are no patents, but there are a variety of possibilities.
Certain things we simply want to give away free, and not just by putting them
up as freeware for download, which may not be the best option within India. We
are thinking about how to distribute them in the best way. There are other
things we want to make available to a limited circle of researchers. We have
taken many things from IIT Kanpur and vice versa. There are certain products we
want to transfer to industry - OCR is a fit case for this. Then there is the
research angle, for which publication in journals is the right approach. There
are certain critical things that we can look at for patenting. We are doing a
project for the government examiner of questioned documents. This is a
department which deals with doubtful cases of document validity - erasing,
defacing etc. We have developed procedures to deal with these cases and
establish the validity of the document. Software does this job. We want to
patent these things. We want patents to protect the IPRs. We may still
distribute the tool free while holding the patent. We do not want someone
monopolising the technique and not putting it to use for general benefit. We
are also exploring industry-university collaborative projects, as only industry
can tell us what the public is looking for. They are in touch with them.
What are the other engaging areas of research in software development apart
from language?
There are many areas - there are new models of computation and software
engineering, testing and verifying at a theoretical level. Data security is a
big issue now. Online examination, for instance, needs validation. Biometrics
is coming up now. Bioinformatics is the in thing now. Search, retrieval and
compression are all being looked at with new vigor.
Semantics and syntax, web mining. Rich media documents - uniformly storing a
variety of digital knowledge bases: databases, embedded objects, ppts, etc., so
that they can be queried cutting across the formats. These are some of the
cutting edge research areas as I see it.
--