TDIL

Machine Aided Translation (MAT) Systems

Aim and Scope
Machine Translation Support Systems
R&D Activities
Technologies Developed by the respective Institutes/Organizations

Aim and Scope  

      Machine Aided Translation (MAT) is an important application with immense potential in the Indian market. With eighteen languages in the country, translation from each language to every other yields a large number of language pairs. Considering where correspondence between language pairs is greatest, and the need to translate official correspondence from English to Hindi, the English to Hindi pair has been identified as the priority area for MAT systems. Because Indian languages are similar to one another, translation among them is easier than translation from English to Hindi. Two areas have therefore been identified as potential areas for research: MAT systems for translation among Indian languages, and MAT systems for translation from English to Hindi. Because of the complexity of the area, it is feasible to develop only domain-specific systems for narrow domains.
      Fully automatic, general-purpose, high-quality machine translation systems are extremely difficult to build. The difficulty arises for the following reasons:

In any natural language text, only part of the information to be conveyed is explicitly expressed. It is the human mind which fills in and supplements the details, using contextual understanding and world knowledge.
Different natural languages adopt different conventions about the type and amount of information to be communicated.

        In spite of these difficulties, it is possible to employ computers for machine translation, paradoxical as that may sound. The solution lies in separating language-based analysis of texts from knowledge-based and inference-based analysis. The former is left to the machine and the latter is taken care of by the human reader. Thus the aspects which are difficult for the human being are handled by the machine, and the aspects which are easier for the human being are left to the human being. The aim is to minimise human effort and thereby increase the translator's productivity; hence we refer to it as Machine Aided Translation (MAT), not machine translation.


Machine Translation Support Systems  

      Internationally, it is well accepted that developing general-purpose machine translation systems is impossible. It is, however, practical to achieve a reasonable degree of success in developing domain-specific machine translation support systems, which help translators do their job faster. Indian languages are phonetic in nature and are close to each other, whereas English stands apart on linguistic grounds. Hence, from the development point of view, the work has been grouped into two major categories:

Machine Aided Translation among Indian Languages:

  
       A demo system for the language pair Kannada to Hindi was developed initially at IIT Kanpur. This technology, termed ANUSARAKA, was demonstrated at various forums. It has now been extended to translation from Telugu, Marathi, Bengali and Punjabi into Hindi and is available for trial through e-mail. This work has been carried out jointly by IITK and the University of Hyderabad, Hyderabad.

Machine Aided Translation from English to Hindi:

 
      The specific domains identified are English news stories and standard documents used for public health campaigns.

MAT system for Translation of English News stories to Hindi:

       Most international and national wire service agencies send news items in English. Manual translation is slow and tedious. The inflow of news items is not evenly distributed, so a burst of translation is required just before the newspaper goes to press. Because of these problems, many regional-language newspapers publish old news. The project aims at human-aided machine translation of English news stories to Hindi. The news stories will be taken from a wire service agency, simplified and translated, using human intervention as appropriate. The output of the system is expected to be post-edited by humans. A demo system for translation of English news stories to Hindi has been developed.
      
MAT system for Translation of Standard documents from English to Hindi:

       The documents and reports used for public health campaigns are mostly in English. Translating these documents into Hindi will go a long way towards achieving the objectives of the respective campaigns. The system uses the Anglabharti approach developed at IIT Kanpur. A demo system for translation of public health campaign documents has been developed. Keeping the above in mind, two projects for the English to Hindi pair and one project for other Indian languages to Hindi were initiated for specific domains.

R&D Activities  

       Anusaraka technology aims at giving a person who knows Hindi access to text in any other Indian language. It will become particularly important as content in Indian languages becomes available on the web or in digital form. It is being developed jointly by IITK and the University of Hyderabad, Hyderabad. Anglabharti technology aims at machine-aided translation from English to Hindi for specific domains. It has been developed at IITK and adapted for the PC platform at ER&DCI, Noida. MAT technology is also being developed at NCST, Bombay, for translating English news stories to Hindi and supporting this on the web.
 
Technologies Developed by the respective Institutes/Organizations  

ANGLABHARTI: A MULTILINGUAL MACHINE AIDED TRANSLATION METHODOLOGY FOR TRANSLATION FROM ENGLISH TO INDIAN LANGUAGES
- Prof. R.M.K. Sinha, Indian Institute of Technology, Kanpur 208016 India
 

       ANGLABHARTI represents a machine-aided translation methodology specifically designed for translating English to Indian languages. English is an SVO (subject-verb-object) language, while Indian languages are SOV (verb-final) and have relatively free word order. Instead of designing a separate translator from English to each Indian language, Anglabharti uses a pseudo-interlingua approach. It analyses English only once and creates an intermediate structure with most of the disambiguation performed. The intermediate structure is then converted to each Indian language through a process of text generation. Analyzing the English sentences accounts for about 70% of the effort and text generation for the remaining 30%. Thus, with only an additional 30% effort, a new English to Indian language translator can be built. The major design considerations of Anglabharti have been aimed at:

       - providing a practical aid for translation, in which an attempt is made to get 90% of the task done by the machine, with 10% left to human post-editing;
       - a system which can grow incrementally to handle more complex situations;
       - a uniform mechanism for translation from English to the majority of Indian languages through the attachment of appropriate text generator modules; and
       - a human-engineered man-machine interface to facilitate both its usage and its augmentation.
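
       The effort figures quoted above (about 70% for English analysis, about 30% for text generation) can be turned into a rough comparison between building every translator separately and sharing one English analysis. The sketch below is only illustrative arithmetic: the function names and the unit of 100 per complete translator are assumptions, not part of Anglabharti itself.

```python
# Illustrative arithmetic only; the 70/30 split is the figure quoted in the text.
# Effort is measured in arbitrary units, with 100 = one complete translator.

def direct_effort(num_targets: int, per_system: float = 100.0) -> float:
    """Effort if each English-to-target translator is built from scratch."""
    return num_targets * per_system


def shared_analysis_effort(num_targets: int,
                           analysis: float = 70.0,
                           generation: float = 30.0) -> float:
    """Effort if English is analysed once and one generator is added per target."""
    if num_targets == 0:
        return 0.0
    return analysis + num_targets * generation


if __name__ == "__main__":
    for n in (1, 2, 5):
        print(n, direct_effort(n), shared_analysis_effort(n))
```

       Under these assumptions the two approaches cost the same for a single target language, and the saving grows by 70 units with every further language added.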

         Anglabharti is a pattern-directed, rule-based system with a context-free-grammar-like structure for English (the source language), which generates a 'pseudo-target' applicable to a group of Indian languages (the target languages). A set of rules obtained through corpus analysis is used to identify plausible constituents, with respect to which movement rules for the 'pseudo-target' are constructed. The idea of using a 'pseudo-target' is primarily to exploit structural similarity and so obtain advantages similar to those of an interlingua approach. The system also uses an example-base to identify noun and verb phrasals and resolve their ambiguities.
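
         A movement rule of the kind described above can be pictured with a toy example that moves the verb of an English clause to the end, giving a verb-final pseudo-target order. This is a minimal sketch under strong assumptions: the role labels ("S", "V", "O"), the function name and the single rule are hypothetical, and the actual Anglabharti rule base is far richer.

```python
from typing import List, Tuple

# A chunk is (role, text); the role labels are hypothetical.
Chunk = Tuple[str, str]


def to_pseudo_target(chunks: List[Chunk]) -> List[Chunk]:
    """Toy movement rule: keep the other constituents in order, move verbs last."""
    verbs = [c for c in chunks if c[0] == "V"]
    others = [c for c in chunks if c[0] != "V"]
    return others + verbs


if __name__ == "__main__":
    sentence = [("S", "the doctor"), ("V", "examined"), ("O", "the patient")]
    print([text for _, text in to_pseudo_target(sentence)])
    # ['the doctor', 'the patient', 'examined']
```

         The reordered chunks still carry English words; in the methodology described here, a separate text generator for each target language would turn such a pseudo-target into the final sentence.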

      Indian languages are verb-final languages with free word-group order and a great deal of structural similarity. They can be classified into four broad groups according to their origin: the Indo-Aryan family (Hindi, Bangla, Asamiya, Punjabi, Marathi, Oriya, Gujarati etc.); the Dravidian family (Tamil, Telugu, Kannada and Malayalam); the Austro-Asiatic family; and the Tibeto-Burman family. Within each group the languages exhibit a high degree of structural homogeneity. The methodology exploits this similarity to a great extent in its design. The Paninian framework, based on Sanskrit grammar and using Karak (similar to 'case') relationships, provides a uniform way of designing the Indian language text generators using selectional constraints and preferences.

         The lexical database is the fuel of the translation engine. A number of ontological/semantic tags are used to resolve sense ambiguity in the source language. Semantics is used to resolve most intra-sentence anaphora and pronoun references. Alternative meanings for the unresolved ambiguities are retained in the pseudo-target language. A text generator module for each target language transforms the pseudo-target language into that target language. These transformations may produce ill-formed sentences, so a corrector for ill-formed sentences is used for each target language. Finally, a human-engineered post-editing package is used to make the final corrections. The post-editor needs to know only the target language.
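
         The idea of resolving sense ambiguity with semantic tags, while retaining the alternatives that cannot be resolved, can be sketched as follows. Everything in this example is hypothetical: the lexicon entries, the tag names and the function are illustrations of the principle, not Anglabharti's actual lexical database or tag set.

```python
from typing import Dict, List, Set

# Hypothetical lexicon: each English word lists its senses with a semantic tag.
LEXICON: Dict[str, List[Dict[str, str]]] = {
    "bank": [
        {"gloss": "bank(financial)", "tag": "institution"},
        {"gloss": "bank(river)", "tag": "place"},
    ],
}


def resolve(word: str, allowed_tags: Set[str]) -> List[str]:
    """Return the senses compatible with the context's semantic tags.

    If no sense matches, every sense is returned, so the ambiguity is
    carried forward for the post-editor instead of being guessed away.
    """
    senses = LEXICON.get(word, [])
    if not senses:
        return [word]  # unknown word: pass it through unchanged
    matching = [s["gloss"] for s in senses if s["tag"] in allowed_tags]
    return matching or [s["gloss"] for s in senses]


if __name__ == "__main__":
    print(resolve("bank", {"institution"}))  # ['bank(financial)']
    print(resolve("bank", set()))            # both senses retained
```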


       The ANGLABHARTI methodology was used to design a functional prototype for English to Hindi on a Sun system. The feasibility of extending this to English to Telugu/Tamil was also demonstrated.
      Thereafter, during 1995-97, the DOE/MIT TDIL programme funded a project to port the English to Hindi translation software to a PC platform running Linux, for translating English health slogans into Hindi. Dr. Ajai Jain joined the group of researchers at IIT Kanpur, and ER&DCI Lucknow/Noida was associated with the project for field testing and packaging the software. In 2000 the project received further funding to make the system more comprehensive. The outcome has been the release of the first version of the software, named AnglaHindi (an English to Hindi version based on the Anglabharti approach), which accepts English text with almost no constraints on its form. AnglaHindi has also been web-enabled and is available for on-line translation at http://anglahindi.iitk.ac.in. AnglaHindi software technology has been transferred to two organizations and is being made available on both the Linux and Windows platforms.
Web based translation service for English news stories to Hindi - NCST, Bombay

       The domain of news stories is highly context sensitive, so the standard approaches to translation (direct translation, the transfer approach and interlingua) are not adequate. A hybrid-approach system, Vaakya, has therefore been developed at NCST, Bombay. The input text is first simplified by a pre-processor. Using world knowledge and heuristics, the topic of the news story is identified. The processed text is analysed and part-of-speech tagging is done. Lengthy sentences are simplified using simplification rules. The text is then transformed into case-frame-like structures using the infitization rules. The target language is then generated from the case-frame structures and the bilingual lexicon using parameterized templates. The major components of the system (Block Schematic Diagram) are:


Topic Identification
Parts-of-Speech Tagger
Heuristic Simplification
Knowledge based Phrase Recognition
Parser
Lexicon
Infitization
Translation & Generation
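
       The first component listed above, topic identification by heuristics, can be illustrated with a simple keyword-counting sketch. The source does not describe how Vaakya actually identifies topics; the keyword lists, topic names and scoring rule below are all assumptions made for illustration.

```python
import re
from collections import Counter
from typing import Dict, Set

# Hypothetical keyword lists; a real system would use far richer world knowledge.
TOPIC_KEYWORDS: Dict[str, Set[str]] = {
    "sports": {"match", "team", "score", "tournament"},
    "politics": {"election", "minister", "parliament", "vote"},
}


def identify_topic(text: str) -> str:
    """Pick the topic whose keywords occur most often; 'unknown' if none occur."""
    words = re.findall(r"[a-z]+", text.lower())
    counts: Counter = Counter()
    for topic, keywords in TOPIC_KEYWORDS.items():
        counts[topic] = sum(1 for w in words if w in keywords)
    topic, hits = counts.most_common(1)[0]
    return topic if hits > 0 else "unknown"


if __name__ == "__main__":
    print(identify_topic("The minister addressed parliament before the election."))
    # politics
```

       Knowing the topic early lets later stages, such as phrase recognition and lexicon lookup, prefer the readings that are typical for that kind of story.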

       The prototype Vaakya system is now being enhanced and adapted to provide a web translation service to news agencies.
MANTRA Machine Translation System for Officialese Domain - C-DAC, Pune

        This project has been funded by the Department of Official Languages for the specific domain of Government of India appointment letters. The system is currently being tested at five ministries. The system uses Tree Adjoining Grammar (TAG), proposed by Shri Aravind Joshi in 1983 at the University of Pennsylvania, USA. TAG is a tree re-writing system. The system uses a TAG-based parser called VYAKARTA. This parser uses the sub-language concept for its definition and is capable of parsing about 250 tree families in English, Hindi, Gujarati and Sanskrit. A comprehensive lexicon is built using TAG for various complex phrases. A transfer lexicon, which contains the lexical structures/trees for both the source language and the target language, is then used to obtain the TAG structure in the target language that is equivalent to the input sentence.
       The major components of the system are:
Lexicalised TAGs
Synchronous TAGs
Vyakarta, TAG Parser
Transfer Lexicon
Generation of Target Language
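
       Since TAG is a tree re-writing system, its characteristic operation, adjunction, can be shown on a small example: an auxiliary tree whose root and foot node carry the same label is spliced into another tree at a node with that label. The sketch below is a generic textbook-style illustration; the class, function names and example trees are assumptions and do not reflect the internals of VYAKARTA or MANTRA.

```python
import copy
from typing import List, Optional


class Node:
    """A tree node; `foot` marks the foot node of an auxiliary tree."""

    def __init__(self, label: str, children: Optional[List["Node"]] = None,
                 foot: bool = False) -> None:
        self.label = label
        self.children = children if children is not None else []
        self.foot = foot


def find_foot(node: Node) -> Optional[Node]:
    """Depth-first search for the foot node."""
    if node.foot:
        return node
    for child in node.children:
        found = find_foot(child)
        if found is not None:
            return found
    return None


def adjoin(target: Node, auxiliary: Node) -> None:
    """Adjoin a copy of `auxiliary` at `target`, modifying the tree in place."""
    aux = copy.deepcopy(auxiliary)
    foot = find_foot(aux)
    if foot is None or aux.label != target.label or foot.label != target.label:
        raise ValueError("auxiliary root and foot must match the target label")
    foot.children = target.children  # the foot takes over the old subtree
    foot.foot = False
    target.children = aux.children   # the target now heads the auxiliary tree


def leaves(node: Node) -> List[str]:
    """Read the words off the tree from left to right."""
    if not node.children:
        return [node.label]
    return [w for child in node.children for w in leaves(child)]


if __name__ == "__main__":
    vp = Node("VP", [Node("V", [Node("sleeps")])])
    initial = Node("S", [Node("NP", [Node("Ram")]), vp])
    aux = Node("VP", [Node("VP", foot=True), Node("Adv", [Node("soundly")])])
    adjoin(vp, aux)
    print(leaves(initial))  # ['Ram', 'sleeps', 'soundly']
```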

ANUSARAKA Machine Translation System - Dr. Rajeev Sangal, IIIT, Hyderabad & University of Hyderabad

       Anusaraka is a language accessor rather than a machine translation system in the true sense. It helps overcome the language barrier by assisting the reader in accessing information from another language. Anusaraka analyses the source language text and presents exactly the same information in a language close to the target language. It tries to preserve information from the input in the output text. It is a domain-free system and has been adapted from Paninian Grammar. It has been developed for translation from Telugu, Tamil, Marathi, Bengali and Punjabi to Hindi. The major components of the system are:
Morphological Analyser
Local Word Grouper
Bi-lingual Dictionaries
Mapper from Source Language to Target Language
Word Synthesizer
Post-editing interface

       Anusaraka has been made available in the public domain as an e-mail server providing translation from Telugu, Kannada, Marathi, Bengali and Punjabi to Hindi. To run Anusaraka on a given text, send the text by e-mail to nandi@anu.uohyd.ernet.in with the language name as the subject, for example 'telugu' to get a translation from Telugu to Hindi. This automatically runs the Telugu to Hindi Anusaraka, and the output produced is sent back to the sender. A copy is kept by the machine for later study. The text should be in 7-bit ISCII coding. Help is similarly available by sending a mail with the subject 'Help'.
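
       The request format described above (server address, language name as the subject, 7-bit text as the body) can be sketched with Python's standard email library. This only builds the message and does not send it. The function name and the sender address are hypothetical, the way the 7-bit ISCII payload is carried in the body is an assumption, and whether the server still answers is not something this page establishes.

```python
from email.message import EmailMessage

# Address as given in the text above.
SERVER_ADDRESS = "nandi@anu.uohyd.ernet.in"


def build_request(sender: str, language: str, text_7bit: bytes) -> EmailMessage:
    """Build (but do not send) a translation request for the e-mail server.

    `text_7bit` is assumed to be already encoded in 7-bit ISCII, so every
    byte must be below 0x80; this function does no ISCII conversion.
    """
    if any(byte > 0x7F for byte in text_7bit):
        raise ValueError("text must be in a 7-bit coding")
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = SERVER_ADDRESS
    msg["Subject"] = language  # e.g. 'telugu', or 'Help' for instructions
    # Assumption: the 7-bit bytes are passed through unchanged as the body.
    msg.set_content(text_7bit.decode("ascii"))
    return msg


if __name__ == "__main__":
    request = build_request("reader@example.org", "telugu", b"sample text")
    print(request["To"], request["Subject"])
```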