Machine Aided Translation (MAT) Systems
Machine aided translation is an important application with immense potential in the Indian
market. With eighteen languages in the country, translation from one language to another
yields a large number of language pairs. Considering which language pairs carry the most
correspondence, and the need to translate official correspondence from English to Hindi,
this pair has been identified as the priority area for MAT systems. Because Indian languages
are similar to one another, translation among them is easier than translation from English
to Hindi. Two areas have therefore been identified as potential areas for research: MAT
systems for translation among Indian languages, and MAT systems for translation from English
to Hindi. Because of the complexity of the area, it is feasible to develop only
domain-specific systems for narrow domains.
Fully automatic, general purpose, high quality machine translation systems
are extremely difficult to build. The difficulty arises for the following reasons:
- In any natural language text, only part of the information to be conveyed is explicitly
expressed. It is the human mind that fills in and supplements the details, using
contextual understanding and world knowledge.
- Different natural languages adopt different conventions about the type and amount of
information to be communicated.
In spite of these difficulties, it is possible to employ computers for machine translation,
although this sounds paradoxical. The solution lies in separating language-based analysis
of texts from knowledge- and inference-based analysis. The former is left to the machine
and the latter is taken care of by the human reader. Thus the aspects that are difficult
for the human being are handled by the machine, and the easier aspects are left to the
human being. The aim is to minimise the human effort and thereby increase the translator's
productivity; hence we refer to it as Machine Aided Translation (MAT), not machine translation.
Internationally, it is widely accepted that developing general purpose machine translation
systems is impossible, but that a reasonable degree of success is practical for
domain-specific machine translation support systems, which greatly help translators do
their job faster. Indian languages are phonetic in nature and close to each other, whereas
English stands apart linguistically. From the development point of view, the work has
therefore been grouped into two major categories:
Machine Aided Translation among Indian Languages:
A demo system for the Kannada to Hindi language pair was initially developed at IIT Kanpur
and demonstrated at various forums under the name ANUSARAKA. The technology has since been
extended to translation from Telugu, Marathi, Bengali and Punjabi into Hindi and is
available for trial through e-mail. This work has been carried out jointly by IITK and the
University of Hyderabad, Hyderabad.
Machine Aided Translation from English to Hindi:
The specific domains identified are English news stories and standard documents used for
Public Health Campaigns.
MAT system for Translation of English News Stories to Hindi:
Most international and national wire service agencies send news items in English. Manual
translation is slow and tedious. The inflow of news items is not evenly distributed, so a
burst of translation is required just before the newspaper goes to press. Because of these
problems, many regional-language newspapers publish news that is already old. The project
aims at human-aided machine translation of English news stories to Hindi. The news stories
will be taken from a wire service agency, simplified and translated, with human
intervention as appropriate. The output of the system is expected to be post-edited by
humans. A demo system for translation of English news stories to Hindi has been developed.
MAT system for Translation of Standard Documents from English to Hindi:
The documents and reports used for Public Health campaigns are mostly in English.
Translating these documents into Hindi will go a long way towards achieving the objectives
of the respective campaigns. The system uses the Anglabharti approach developed at IIT
Kanpur. A demo system for translation of Public Health Campaign documents has been
developed. In summary, two projects for the English-Hindi (E-H) pair and one project for
other Indian languages to Hindi were initiated for specific domains.
Anusaraka technology aims at giving a person who knows Hindi access to any other Indian
language. It will become particularly important as content in Indian languages becomes
available on the web or in digital form. It is being developed jointly by IITK and the
University of Hyderabad, Hyderabad. Anglabharti technology aims at machine-aided
translation from English to Hindi for specific domains. It has been developed at IITK and
adapted for the PC platform at ER&DCI, Noida. MAT technology is also being developed at
NCST, Bombay, for translating English news stories to Hindi and supporting this on the web.
ANGLABHARTI: A MULTILINGUAL MACHINE AIDED TRANSLATION METHODOLOGY FOR TRANSLATION FROM
ENGLISH TO INDIAN LANGUAGES
- Prof. R.M.K. Sinha, Indian Institute of Technology, Kanpur 208016, India
ANGLABHARTI represents a machine-aided translation methodology specifically designed for
translating English to Indian languages. English is an SVO (subject-verb-object) language,
while Indian languages are SOV (verb-final) and have relatively free word order. Instead
of designing a separate translator from English to each Indian language, Anglabharti uses
a pseudo-interlingua approach. It analyses English only once and creates an intermediate
structure with most of the disambiguation already performed. The intermediate structure is
then converted to each Indian language through a process of text generation. Analysing the
English sentences accounts for about 70% of the effort and text generation for the
remaining 30%. Thus a new English to Indian language translator can be built with only
about 30% additional effort. Some of the major design considerations in the design of
Anglabharti have been aimed at:
- providing a practical aid for translation, wherein an attempt is made to get 90% of the
task done by the machine with 10% left to human post-editing;
- a system which could grow incrementally to handle more complex situations;
- a uniform mechanism by which translation from English to the majority of Indian
languages can be obtained by attaching appropriate text generator modules; and
- a human-engineered man-machine interface to facilitate both its usage and augmentation.
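
The 70%/30% split quoted before the list of design considerations implies a simple cost comparison. The sketch below treats one complete translator as 100 effort units and assumes the split holds uniformly for every target language; that is an idealisation for illustration, not a measured result, and the function names are hypothetical.

```python
# Assumption: one full English-to-X translator costs 100 effort units,
# split 70/30 between English analysis and target text generation.
ANALYSIS_SHARE = 70    # analysing English, done once
GENERATION_SHARE = 30  # one text generator per target language


def shared_analysis_effort(num_languages: int) -> int:
    """Effort when the English analysis is built once and reused."""
    if num_languages < 1:
        raise ValueError("need at least one target language")
    return ANALYSIS_SHARE + GENERATION_SHARE * num_languages


def separate_translators_effort(num_languages: int) -> int:
    """Effort when every language pair gets its own complete translator."""
    if num_languages < 1:
        raise ValueError("need at least one target language")
    return (ANALYSIS_SHARE + GENERATION_SHARE) * num_languages


if __name__ == "__main__":
    for n in (1, 2, 5):
        print(n, shared_analysis_effort(n), separate_translators_effort(n))
```

Under these assumptions the two designs cost the same for a single language, and the shared design pulls ahead with every target language added.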
Anglabharti is a pattern-directed, rule-based system with a context-free-grammar-like
structure for English (the source language), which generates a 'pseudo-target' applicable
to a group of Indian languages (the target languages). A set of rules obtained through
corpus analysis is used to identify plausible constituents, with respect to which movement
rules for the 'pseudo-target' are constructed. The idea of using a 'pseudo-target' is
primarily to exploit structural similarity and so obtain advantages similar to those of an
interlingua approach. The system also uses an example-base to identify noun and verb
phrasals and resolve their ambiguities.
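
As a rough illustration of what a movement rule might look like, the sketch below reorders already-identified English constituents into a verb-final order. The constituent labels, the rule table and the function name are all hypothetical; the actual Anglabharti rule format is not described in this text.

```python
from typing import Dict, List, Tuple

Constituent = Tuple[str, str]  # (label, English phrase); labels are hypothetical

# Toy movement rules: English constituent pattern -> verb-final pseudo-target order.
MOVEMENT_RULES: Dict[Tuple[str, ...], Tuple[str, ...]] = {
    ("SUBJ", "VERB", "OBJ"): ("SUBJ", "OBJ", "VERB"),
    ("SUBJ", "AUX", "VERB", "OBJ"): ("SUBJ", "OBJ", "VERB", "AUX"),
}


def to_pseudo_target(constituents: List[Constituent]) -> List[Constituent]:
    """Reorder constituents according to the first matching movement rule."""
    pattern = tuple(label for label, _ in constituents)
    target_order = MOVEMENT_RULES.get(pattern)
    if target_order is None:
        # No rule matches: keep the source order and leave it to later stages.
        return list(constituents)
    by_label = dict(constituents)
    return [(label, by_label[label]) for label in target_order]


if __name__ == "__main__":
    sentence = [("SUBJ", "the doctor"), ("VERB", "examined"), ("OBJ", "the patient")]
    print(to_pseudo_target(sentence))
```

The phrases stay in English at this stage; only their order changes, which is what allows one reordered structure to serve several verb-final target languages.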
Indian languages are verb-final languages with free word-group order and a lot of
structural similarity. They can be classified into four broad groups according to their
origin: the Indo-Aryan family (Hindi, Bangla, Asamiya, Punjabi, Marathi, Oriya, Gujarati
etc.); the Dravidian family (Tamil, Telugu, Kannada & Malayalam); the Austro-Asiatic
family; and the Tibeto-Burman family. Within each group the languages exhibit a high
degree of structural homogeneity. The methodology exploits this similarity to a great
extent in its design. The Paninian framework, based on Sanskrit grammar and using Karak
(similar to 'case') relationships, provides a uniform way of designing the Indian language
text generators using selectional constraints and preferences.
The lexical database is the fuel for the translation engine. A number of
ontological/semantic tags are used to resolve sense ambiguity in the source language. We
use semantics to resolve most of the intra-sentence anaphora/pronoun references.
Alternative meanings for the unresolved ambiguities are retained in the pseudo target
language. A text generator module for each of the target languages transforms the pseudo
target language into the target language. These transformations may produce ill-formed
sentences, so a corrector for ill-formed sentences is used for each of the target
languages. Finally, a human-engineered post-editing package is used to make the final
corrections. The post-editor needs to know only the target language.
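
The idea of resolving senses with semantic tags, while retaining every alternative when the context cannot decide, can be sketched as follows. The lexicon entries, the tag names and the overlap test are assumptions made for illustration; they are not taken from the Anglabharti lexical database.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical lexicon: word -> list of (sense label, semantic tags).
LEXICON: Dict[str, List[Tuple[str, Set[str]]]] = {
    "bank": [
        ("bank/financial", {"institution"}),
        ("bank/river", {"location", "natural"}),
    ],
}


def resolve_senses(word: str, context_tags: Set[str]) -> List[str]:
    """Return the senses compatible with the context.

    If no sense shares a tag with the context, all senses are returned so
    that the ambiguity is carried forward instead of being guessed away.
    """
    senses = LEXICON.get(word, [])
    matching = [label for label, tags in senses if tags & context_tags]
    if matching:
        return matching
    return [label for label, _ in senses]


if __name__ == "__main__":
    print(resolve_senses("bank", {"institution"}))  # context decides
    print(resolve_senses("bank", set()))            # alternatives retained
```

Carrying the unresolved alternatives forward is what lets the post-editor, who knows only the target language, make the final choice.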
The ANGLABHARTI methodology was used to design a functional prototype for English to Hindi
on a Sun system. The feasibility of extending this to English to Telugu/Tamil was also
demonstrated. Thereafter, during 1995-97, the DOE/MIT TDIL programme funded a project for
porting the English to Hindi translation software to a PC platform under Linux for
translating English health slogans into Hindi. Dr. Ajai Jain joined the group of
researchers at IIT Kanpur, and ER&DCI Lucknow/Noida was associated with the project for
field testing and packaging the software. In 2000 the project received further funding to
make it more comprehensive. The outcome has been the release of the first version of the
software, named AnglaHindi (an English to Hindi version based on the Anglabharti
approach), which accepts English text with almost no constraints on its form. AnglaHindi
has also been web-enabled and is available for on-line translation
(URL: http://anglahindi.iitk.ac.in). AnglaHindi software technology has been transferred
to two organizations and is being made available on both the Linux and Windows platforms.
The domain of news stories is highly context sensitive, so the standard approaches to
translation (Direct Translation, Transfer Approach, Interlingua) are not adequate. A
hybrid system, Vaakya, has therefore been developed at NCST, Bombay. The input text is
simplified using a pre-processor. Using world knowledge and heuristics, the topic of the
news story is identified. The processed text is analysed and tagged with parts of speech.
Lengthy sentences are simplified using simplification rules. The text is then transformed
into case-frame-like structures using the infitization rules. The target language is then
generated from the case-frame structures and the bilingual lexicon using parameterized
templates. The major components of the system are (Block Schematic Diagram):
- Topic Identification
- Parts-of-Speech Tagger
- Heuristic Simplification
- Knowledge based Phrase Recognition
- Parser
- Lexicon
- Infitization
- Translation & Generation
The prototype Vaakya system is now being enhanced and adapted to provide a web translation
service to news agencies.
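
The text does not say which heuristics Vaakya uses for topic identification. As a stand-in, the sketch below counts topic keywords in a news story and picks the topic with the most hits; the keyword lists, topic names and function name are invented for illustration.

```python
import re
from collections import Counter
from typing import Dict, Optional, Set

# Hypothetical keyword lists; a real system would draw on richer world knowledge.
TOPIC_KEYWORDS: Dict[str, Set[str]] = {
    "sports": {"match", "team", "score", "tournament"},
    "economy": {"market", "budget", "inflation", "trade"},
}


def identify_topic(text: str) -> Optional[str]:
    """Return the topic whose keywords occur most often, or None if none occur."""
    words = re.findall(r"[a-z]+", text.lower())
    counts: Counter = Counter()
    for word in words:
        for topic, keywords in TOPIC_KEYWORDS.items():
            if word in keywords:
                counts[topic] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    print(identify_topic("The team won the match after a late score."))
```

Knowing the topic early is useful because later stages, such as phrase recognition and lexicon lookup, can then prefer senses that fit the topic.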
This project has been funded by the Department of Official Languages for the specific
domain of Government of India appointment letters. The system is currently being tested at
five ministries. The system uses Tree Adjoining Grammar (TAG), proposed by Shri Aravind
Joshi in 1983 at the University of Pennsylvania, USA. TAG is a tree re-writing system. The
system uses a TAG-based parser called VYAKARTA. This parser uses the sub-language concept
for its definition and is capable of parsing about 250 tree families in English, Hindi,
Gujarati and Sanskrit. A comprehensive lexicon is built using TAG for various complex
phrases. A transfer lexicon, which contains the lexical structures/trees for both the
source language and the target language, is then used to obtain the target-language TAG
representation equivalent to the input sentence.
The major components of the system are:
- Lexicalised TAGs
- Synchronous TAGs
- Vyakarta, TAG Parser
- Transfer Lexicon
- Generation of Target Language
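
To make "tree re-writing" concrete, the sketch below implements adjunction, one of the two composition operations of TAG (the other is substitution). An auxiliary tree whose root and foot node carry the same label is spliced into a matching node of another tree. The `Node` class and function names are illustrative and are not taken from VYAKARTA.

```python
import copy
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    is_foot: bool = False  # marks the foot node of an auxiliary tree


def find_foot(node: Node) -> Optional[Node]:
    """Depth-first search for the foot node of an auxiliary tree."""
    if node.is_foot:
        return node
    for child in node.children:
        found = find_foot(child)
        if found is not None:
            return found
    return None


def adjoin(target: Node, auxiliary: Node) -> None:
    """Adjoin a copy of `auxiliary` at `target`, modifying `target` in place."""
    aux = copy.deepcopy(auxiliary)  # leave the grammar's auxiliary tree untouched
    foot = find_foot(aux)
    if foot is None or aux.label != target.label or foot.label != target.label:
        raise ValueError("auxiliary root and foot must match the target label")
    # The subtree that hung below the target moves under the foot node.
    foot.children = target.children
    foot.is_foot = False
    # The target node takes over the structure of the auxiliary tree.
    target.children = aux.children


def leaves(node: Node) -> List[str]:
    """Read the yield of a tree from left to right."""
    if not node.children:
        return [node.label]
    return [leaf for child in node.children for leaf in leaves(child)]


if __name__ == "__main__":
    vp = Node("VP", [Node("V", [Node("runs")])])
    sentence = Node("S", [Node("NP", [Node("John")]), vp])
    adverb = Node("VP", [Node("VP", is_foot=True), Node("Adv", [Node("quickly")])])
    adjoin(vp, adverb)
    print(" ".join(leaves(sentence)))
```

Here the adverb tree wraps the existing verb phrase without disturbing the rest of the sentence, which is how TAG adds modifiers to an already-built structure.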
Anusaraka is a Language Accessor rather than a machine translation system in the true
sense. It helps overcome the language barrier by assisting the reader in accessing
information from another language. Anusaraka analyses the source language text and
presents exactly the same information in a language close to the target language. It tries
to preserve information from the input to the output text. It is a domain-free system and
has been adapted from Paninian Grammar. It has been developed for translation from Telugu,
Tamil, Marathi, Bengali and Punjabi to Hindi. The major components of the system are:
- Morphological Analyser
- Local Word Grouper
- Bi-lingual Dictionaries
- Mapper from Source Language to Target Language
- Word Synthesizer
- Post-editing interface
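
How some of these components might fit together can be sketched word by word: analyse each source word into a root and a feature, map the root through a bilingual dictionary, and synthesise the target word. Every dictionary entry and suffix below is an invented placeholder, not real Telugu or Hindi data, and the local word grouper and post-editing interface are left out.

```python
from typing import Dict, Tuple

# All entries are invented placeholders for illustration only.
SOURCE_SUFFIXES: Dict[str, str] = {"xa": "PLURAL"}
BILINGUAL_DICTIONARY: Dict[str, str] = {"mor": "tal", "pen": "dur"}
TARGET_SUFFIXES: Dict[str, str] = {"PLURAL": "ez", "SINGULAR": ""}


def analyse(word: str) -> Tuple[str, str]:
    """Morphological analyser: split a source word into (root, feature)."""
    for suffix, feature in SOURCE_SUFFIXES.items():
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)], feature
    return word, "SINGULAR"


def map_root(root: str) -> str:
    """Mapper: look up the root; unknown roots stay visible in angle brackets."""
    return BILINGUAL_DICTIONARY.get(root, f"<{root}>")


def synthesise(root: str, feature: str) -> str:
    """Word synthesizer: attach the target-language suffix for the feature."""
    return root + TARGET_SUFFIXES[feature]


def access(sentence: str) -> str:
    """Run the toy pipeline over a sentence, preserving the source word order."""
    output = []
    for word in sentence.split():
        root, feature = analyse(word)
        output.append(synthesise(map_root(root), feature))
    return " ".join(output)


if __name__ == "__main__":
    print(access("morxa pen"))
```

Keeping unknown roots visible instead of dropping them fits the stated aim of preserving information from the input to the output text.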
Anusaraka has been made available in the public domain as an e-mail server for translation
from Telugu, Kannada, Marathi, Bengali and Punjabi to Hindi. To run Anusaraka on a given
text, send the text by e-mail to nandi@anu.uohyd.ernet.in with the language name as the
subject, for example ‘telugu’ for translation from Telugu to Hindi. This automatically
runs the Telugu to Hindi Anusaraka, and the output is sent back to the sender. A copy is
kept by the machine for later study. The text should be in 7-bit ISCII coding. Help is
similarly available by mail if a message is sent with the subject ‘Help’.
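
A request of the kind described above could be assembled with Python's standard `email` package, as in the sketch below. It only builds the message and does not send it. The helper name and the sender address are hypothetical, the input is assumed to be already encoded in 7-bit ISCII (no conversion from Unicode is attempted), and nothing here assumes the server is still reachable.

```python
from email.message import EmailMessage

ANUSARAKA_ADDRESS = "nandi@anu.uohyd.ernet.in"  # address given in the text


def build_request(sender: str, language: str, iscii_text: bytes) -> EmailMessage:
    """Build, but do not send, a translation request message.

    `iscii_text` must already be 7-bit ISCII; this function only checks
    that every byte fits in 7 bits.
    """
    if any(byte > 0x7F for byte in iscii_text):
        raise ValueError("text is not 7-bit clean")
    message = EmailMessage()
    message["From"] = sender
    message["To"] = ANUSARAKA_ADDRESS
    message["Subject"] = language  # e.g. 'telugu', or 'Help' for instructions
    # 7-bit data can travel as a plain 7bit body without further encoding.
    message.set_content(iscii_text.decode("ascii"), charset="us-ascii", cte="7bit")
    return message


if __name__ == "__main__":
    request = build_request("reader@example.org", "telugu", b"placeholder text")
    print(request["Subject"], request["To"])
```

Sending the message would additionally need an SMTP client such as `smtplib`, configured for the sender's own mail server.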