TDIL

Indian Language Processing (ILP) Resources

Aim and Scope
Applications of Corpora
Machine Readable Corpora
Sample Corpora
Maintenance and Distribution of Corpora
R&D Activities
Technologies Developed by the respective Institutes/Organizations
             
Aim and Scope

The development of corpora of texts in machine readable form was envisaged as a basic research facility for linguists and computer scientists. Accordingly, the primary objective of the project was to put together a collection of machine readable texts in all the constitutionally recognised Indian languages. The simultaneous development of software tools for word level tagging of grammatical categories, word count, frequency count, spell checking etc. was also envisaged.

 
Applications of Corpora

Machine readable corpora of text in Indian languages are useful in a wide variety of applications. They provide authentic data on contemporary use of Indian languages to both computer scientists and linguists for their academic, research and developmental activities. A corpus also provides a representative sample of a language's style, the usage of particular words etc. Linguists and computer experts can use the corpora for the following activities and areas:


      For Linguists:

Work in the areas of language standardisation, computational linguistics, lexicography, translation etc.
Linguistic analysis such as frequency of usage of certain characters or words, morphological analysis, syntactic-semantic analysis etc.

        For Computer Scientists:

Development of Machine Translation systems, for which the corpora provide a test bed for morphological analysers, parsers, language generators etc.
Development of utility software such as electronic lexicons, sentence analysers, language generators, spell checkers etc.

Machine Readable Corpora 

About thirty lakh words of machine readable corpora have been developed in each of Hindi, English, Tamil, Telugu, Kannada, Malayalam, Marathi, Gujarati, Oriya, Bengali, Sanskrit, Urdu, Assamese, Punjabi and Kashmiri. Software tools for word level tagging of grammatical categories, word count and frequency count have also been developed.
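The word count and frequency count tools mentioned above are not described further on this page. As a rough illustration only, the following Python sketch shows what such counts might look like for a small piece of text. The function names, the punctuation list and the sample sentence are assumptions made for this example; they are not taken from the CIIL tools.

```python
from collections import Counter

# Punctuation stripped from token edges; includes the Devanagari danda.
# This list is an assumption for the sketch, not a standard.
PUNCTUATION = "।,.?!;:\"'()"

def tokenize(text):
    """Split on whitespace and strip edge punctuation from each token."""
    tokens = (raw.strip(PUNCTUATION) for raw in text.split())
    return [t for t in tokens if t]

def frequency_report(text):
    """Return (word count, word frequencies, letter frequencies)."""
    words = tokenize(text)
    word_freq = Counter(words)
    # Letters are counted per Unicode code point, so a Devanagari
    # consonant and its vowel sign are counted separately.
    letter_freq = Counter(ch for w in words for ch in w)
    return len(words), word_freq, letter_freq

# Made-up sample sentence, not drawn from the actual corpora.
sample = "राम घर गया। सीता घर आई।"
total, word_freq, letter_freq = frequency_report(sample)
print(total)
print(word_freq.most_common(3))
```

Whitespace tokenisation is the simplest possible choice; a real tool for Indian languages would likely need script-aware handling of punctuation and conjunct characters.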


Sample Corpora

Devanagari fonts must be installed to view the sample corpora. Select the FONTS button to download the fonts and the VIEW button to access the sample Hindi corpora.


Maintenance and Distribution of Corpora

The corpora developed in all these Indian languages are being centrally maintained at the Central Institute of Indian Languages (CIIL), Ministry of Human Resource Development, Department of Education, Manasagangotri, Mysore (Karnataka). These corpora can be used for education and research purposes.


R&D Activities

Corpora for Sindhi, Manipuri, Nepali and Konkani, and Lexical Resources in Telugu, Tamil, Marathi, Bengali and Hindi, are under development at CIIL, Mysore.

Technologies Developed by the respective Institutes/Organizations

Machine Readable Corpora : Central Institute of Indian Languages, (CIIL) Mysore

Corpora is the plural of corpus. A corpus of a language is an assorted collection of words drawn from written texts. A machine readable corpus is therefore a compilation of such texts that can be stored, manipulated and retrieved as and when required with the help of a computer. The steps involved in building a corpus are selection of texts, data entry, data validation and the provision of a set of tools for management and retrieval of data. Considering the richness of Indian languages, it would be impractical to build a corpus that covers every source exhaustively; hence, as a start, 30 lakh words in each of the fifteen constitutionally recognised languages were targeted in 1991. Corpora can be used in a wide variety of applications, since they provide authentic data on contemporary use of Indian languages to the following categories of users:

Linguists working in the areas of standardisation, pedagogy, lexicography, translation and linguistic analysis such as morphological analysis, syntactic/semantic analysis, sentence generation etc.
Computer scientists working in the areas of machine translation and utility software development such as the building of electronic dictionaries, computational lexicons, sentence analysis and generation tools, spell checkers etc.
Developers who need a test bed for most ILP applications, tools and solutions.

The sources of the corpora are printed books, journals, magazines, newspapers and Government documents published during 1981-1990. The material has been categorised into six main categories, viz. Aesthetics; Social Sciences; Natural, Physical & Professional Sciences; Commerce; Official and Media Languages; and Translated Material.

Software tools for word level tagging, word count, letter count and frequency count have also been developed. The Tag Set consists of Finite Verb (FV), Non-Finite Verb (NV), Noun (NN), Pronoun (PN), Adjective (AJ), Adverb (AV) and Indeclinables (ID). Corpus Manager and KWIC Concordance software have also been developed.

Corpora of about 30 lakh words in each of the Indian languages, viz. Hindi, Punjabi, English, Telugu, Malayalam, Tamil, Kannada, Sanskrit, Urdu, Kashmiri, Marathi, Gujarati, Oriya, Assamese and Bengali, have been developed at various centres and are now being centrally maintained at CIIL, Mysore. They are being distributed for educational and research purposes. Three more languages, viz. Konkani, Manipuri and Nepali, were later added to the Eighth Schedule of the Constitution, hence corpora development for these languages was also taken up.
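The tag set and the KWIC (key word in context) concordance mentioned above can be illustrated with a short Python sketch. Only the seven tag codes come from this page; the function names, the context width and the English sample data are assumptions for the example and do not describe the actual CIIL software.

```python
# The seven tags listed on this page.
TAG_SET = {
    "FV": "Finite Verb",
    "NV": "Non-Finite Verb",
    "NN": "Noun",
    "PN": "Pronoun",
    "AJ": "Adjective",
    "AV": "Adverb",
    "ID": "Indeclinable",
}

def invalid_tags(tagged_words):
    """Return the (word, tag) pairs whose tag is not in TAG_SET."""
    return [(w, t) for w, t in tagged_words if t not in TAG_SET]

def kwic(tokens, keyword, width=2):
    """Return (left context, keyword, right context) per occurrence."""
    rows = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows

# Made-up sample data for illustration.
tagged = [("she", "PN"), ("reads", "FV"), ("quickly", "AV"), ("books", "XX")]
bad = invalid_tags(tagged)

tokens = "the corpus is large and the corpus is tagged".split()
concordance = kwic(tokens, "corpus", width=2)
for left, key, right in concordance:
    print(f"{left:>10} [{key}] {right}")
```

A concordance of this kind lets a linguist see every occurrence of a word together with its neighbours, which is the usual starting point for studying usage.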

Corpora of Konkani Language
Development has been completed at Asmitai Pratishtan, Goa. Thirty lakh words of Konkani corpora in machine readable form, along with software for tagging the corpora, word count and frequency count, have been developed. A spell checker for use in conjunction with the corpora has also been developed. These corpora will also be maintained at CIIL, Mysore and made available for distribution.

Corpora of Nepali Language
Development is in progress at the Centre for Computers and Communications Technology, Gangtok. Nepali corpora of 1.2 lakh words in machine readable form, along with software for tagging the corpora, word count and frequency count, have been developed.

Corpora of Manipuri Language
Development has been undertaken at the University of Manipur, Manipur. Data collection has already been completed for 25 lakh words and data entry is in progress.

Lexical Resources in machine readable form : Central Institute of Indian Languages, Mysore

The lexical resources of a language contain information such as head word, stem alternants, stem type, detailed grammatical information, syntactic information, all types of meanings, a citation for each meaning, paradigms, derived words, cross references for the derived words, compound words, synonyms, antonyms, idioms, encyclopaedic information, etymological information and statistical information. The lexical resource database will be useful to linguists and computer scientists who are working in linguistic research, machine translation, expert systems and Artificial Intelligence. It can be used for the generation of a learner's dictionary, historical dictionary, machine readable grammatical dictionary, electronic dictionary, computational lexicon etc.

Lexical Resources in five Indian languages, viz. Bengali, Hindi, Marathi, Tamil and Telugu, are at an advanced stage of development. Lexical Resources provide lexical information on the basis of concepts, more grammatical information, syntactic and semantic conditioning for the usage of lexical items, synonym sets and their usages, and compound forms and idioms. The categories for which Lexical Resources are being developed are Verb, Noun, Adjective, Adverb and Function Word. The development steps are:

Collection and selection of Headwords 
Labelling of grammatical categories 
Syntactic Information 
Sense discrimination and suitable citation
Designing a structure and creating a database
Retrieval system for different purposes

These resources can be used for research in areas such as Machine Translation systems, where lexical resources serve the lexical transfer phase, the Lexical Resource of the source language serves the analysis phase and the Lexical Resource of the target language serves the synthesis phase.
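The steps "designing a structure and creating a database" and "retrieval system for different purposes" can be pictured with a small Python sketch. The field names follow a few of the information types listed above, but the class names, the layout and the sample entry are hypothetical; the page does not describe the actual database design used at CIIL.

```python
from dataclasses import dataclass, field

# Categories named on this page.
CATEGORIES = ("Verb", "Noun", "Adjective", "Adverb", "Function Word")

@dataclass
class Sense:
    meaning: str
    citation: str = ""  # example sentence supporting this meaning

@dataclass
class LexicalEntry:
    headword: str
    category: str
    senses: list = field(default_factory=list)
    synonyms: list = field(default_factory=list)
    idioms: list = field(default_factory=list)

class Lexicon:
    """A minimal in-memory store keyed by headword (hypothetical design)."""

    def __init__(self):
        self._entries = {}

    def add(self, entry):
        if entry.category not in CATEGORIES:
            raise ValueError(f"unknown category: {entry.category}")
        self._entries.setdefault(entry.headword, []).append(entry)

    def lookup(self, headword, category=None):
        """Return entries for a headword, optionally filtered by category."""
        found = self._entries.get(headword, [])
        if category is None:
            return list(found)
        return [e for e in found if e.category == category]

# Made-up sample entry for illustration.
lexicon = Lexicon()
lexicon.add(LexicalEntry(
    headword="run",
    category="Verb",
    senses=[Sense("move quickly on foot", "She runs every morning.")],
    synonyms=["sprint"],
))
verbs = lexicon.lookup("run", category="Verb")
print(verbs[0].senses[0].meaning)
```

Keeping several entries per headword allows the same word form to appear under more than one grammatical category, which sense discrimination requires.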

Computer Courseware in Hindi : Banasthali Vidyapith, Banasthali

DOEACC 'O' level courseware in machine readable form has been developed in Hindi. DOEACC is also contributing financially to this project. Once completed, this material will be published in book form; with incremental effort, it can also be published on CD-ROM and made available on the web. The modules covered in the syllabus are Information Technology, Cobol, PC Software, Programming in 'C' and Business System. The manuscripts have been reviewed by experts, and modifications based on their advice are being incorporated.