Indian Language Processing (ILP)
Resources
The development of
corpora of texts in machine readable form was envisaged
as a basic research facility for linguists and computer
scientists. Accordingly, the primary objective of the
project was to put together a collection of machine readable
texts in all the constitutionally recognised Indian languages.
The development of software tools for word level tagging
of grammatical categories, word count, frequency count,
spell checking etc. was envisaged at the same time.
Machine readable
corpora of text in Indian languages are useful in a wide
variety of applications. They provide authentic data
on the contemporary use of Indian languages to both computer
scientists and linguists for their academic, research
and developmental activities. A corpus also provides
a representative sample of language style, the usage
of particular words etc. Linguists and computer experts can
use the corpora for the following activities/areas:
For Linguists:
 |
Work in the areas of language standardisation,
computational linguistics, lexicography, translation
etc. |
 |
Linguistic analysis such as frequency of usage
of certain characters/words, morphological analysis,
syntactic-semantic analysis etc. |
For Computer Scientists:
 |
For the development of Machine Translation systems,
the corpora provide a test bed for morphological
analysers, parsers, language generators etc. |
 |
Utility software development such as electronic
lexicons, sentence analysers/language generators, spell
checkers etc. |
About thirty
lakh words of machine readable corpora have been developed
in each of Hindi, English, Tamil, Telugu, Kannada, Malayalam,
Marathi, Gujarati, Oriya, Bengali, Sanskrit, Urdu, Assamese,
Punjabi and Kashmiri. Software tools for word level
tagging of grammatical categories, word count and frequency
count have also been developed.
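The word count and frequency count tools mentioned above can be illustrated with a minimal Python sketch. This is not the project's software: the function names, the whitespace-based tokenisation and the punctuation list are assumptions made for illustration, and the sample sentence in the usage note is invented.

```python
from collections import Counter

# Characters trimmed from the edges of each token (an assumed list).
# The danda and double danda are the Devanagari sentence-ending marks.
PUNCTUATION = ".,;:!?\"'()[]{}-\u0964\u0965"


def tokenize(text):
    """Split on whitespace and trim punctuation from both ends of each token.

    Whitespace splitting is used instead of a regex such as \\w+ because
    Python's \\w does not match Devanagari vowel signs, which would break
    words apart in the middle.
    """
    trimmed = (raw.strip(PUNCTUATION) for raw in text.split())
    return [token for token in trimmed if token]


def word_count(text):
    """Total number of word tokens in the text."""
    return len(tokenize(text))


def word_frequencies(text):
    """Map each distinct word to the number of times it occurs."""
    return Counter(tokenize(text))
```

For example, `word_frequencies("राम घर गया । सीता घर आई ।")` counts घर twice and every other word once, and `word_count` on the same sentence gives 6, since the stand-alone danda marks are dropped.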
You will need
to install Devanagari fonts to view the sample
corpora. Select the FONTS button to download the fonts
and VIEW to access the sample Hindi corpora.
The corpora developed
in all these Indian languages are being centrally
maintained at the Central Institute of Indian Languages (CIIL),
Ministry of Human Resource Development, Department
of Education, Manasagangotri, Mysore (Karnataka).
These corpora can be used for educational and research purposes.
Corpora for
Sindhi, Manipuri, Nepali and Konkani, and
lexical resources in Telugu, Tamil, Marathi, Bengali and
Hindi, are under development at CIIL, Mysore.
 |
Machine Readable Corpora : Central
Institute of Indian Languages (CIIL), Mysore |
Corpora
is the plural of corpus. A corpus of a language is an assorted
collection of written texts. A machine readable
corpus is therefore a compilation of such texts that
can be stored, manipulated and retrieved as and when required
with the help of a computer. Building a corpus involves
selection of texts, data entry, data validation
and a set of tools for management and retrieval of data.
Considering the richness of Indian languages, it would
be impractical to develop a corpus covering all the available
source material; hence, to start with, 30 lakh words in each of the
fifteen constitutionally recognised languages was targeted in 1991. Corpora
can be used in a wide variety of applications since they
provide authentic data on the contemporary use of Indian
languages to the following categories of users:
 |
Linguists working
in the areas of standardisation, pedagogy, lexicography,
translation and linguistic analysis such as morphological
analysis, syntactic/semantic analysis, sentence
generation etc. |
 |
Computer scientists
working in the areas of machine translation and utility
software development such as building of electronic
dictionaries, computational lexicons, sentence
analysis and generation, spell checkers etc. |
 |
As a test bed
for most ILP applications, tools, solutions
etc. |
The sources of the corpora
are printed books, journals, magazines, newspapers and
government documents published during 1981-1990. The material has
been categorised into six main categories, viz. Aesthetics;
Social Sciences; Natural, Physical & Professional
Sciences; Commerce; Official and Media Languages; and Translated
Material. Software tools for word level tagging, word
count, letter count and frequency count have also been developed.
The tag set consists of Finite Verb (FV), Non-Finite Verb
(NV), Noun (NN), Pronoun (PN), Adjective (AJ), Adverb
(AV) and Indeclinable (ID). Corpus Manager and KWIC Concordance
software have also been developed. Corpora of about 30 lakh
words in each of the Indian languages, viz. Hindi,
Punjabi, English, Telugu, Malayalam, Tamil, Kannada, Sanskrit,
Urdu, Kashmiri, Marathi, Gujarati, Oriya, Assamese and
Bengali, have been developed at various centres
and are now being centrally maintained at CIIL, Mysore.
They are being distributed for educational and research purposes.
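The seven-tag set described above can be sketched in Python as follows. Only the tag codes and their names come from the text; the `word/TAG` input format, the function names and the transliterated sample in the usage note are assumptions for illustration, not the format used by the project's tagging software.

```python
from collections import Counter
from enum import Enum


class Tag(Enum):
    """Word level grammatical categories, keyed by their two-letter codes."""
    FV = "Finite Verb"
    NV = "Non-Finite Verb"
    NN = "Noun"
    PN = "Pronoun"
    AJ = "Adjective"
    AV = "Adverb"
    ID = "Indeclinable"


def parse_tagged(line, sep="/"):
    """Parse 'word/TAG word/TAG ...' (an assumed format) into (word, Tag) pairs.

    A code outside the tag set raises KeyError, which doubles as a simple
    validation check on tagged data.
    """
    pairs = []
    for item in line.split():
        word, _, code = item.rpartition(sep)
        pairs.append((word, Tag[code]))
    return pairs


def tag_frequencies(pairs):
    """Count how often each grammatical category occurs."""
    return Counter(tag for _, tag in pairs)
```

For instance, `parse_tagged("vah/PN ghar/NN gayaa/FV")` yields one pronoun, one noun and one finite verb, and `tag_frequencies` on that result reports a count of 1 for each of the three tags.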
Three more languages, viz. Konkani, Manipuri and Nepali,
were later added to the Eighth Schedule of the Constitution;
hence corpora development for these languages was also
taken up.
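A KWIC (keyword in context) concordance, of the kind mentioned above among the corpus tools, lists every occurrence of a word together with a few words of context on either side. The following Python sketch shows the general technique only; the function names and the `width` parameter are assumptions, and it does not describe the project's own concordance software.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence.

    `width` is the number of context words kept on each side.
    """
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


def format_kwic(lines):
    """Right-align the left contexts so that the keywords form one column."""
    pad = max((len(left) for left, _, _ in lines), default=0)
    return [f"{left:>{pad}}  {kw}  {right}" for left, kw, right in lines]
```

Calling `kwic("the corpus is a sample and the corpus is tagged".split(), "corpus", width=2)` returns two entries, one per occurrence. The alignment in `format_kwic` counts code points, so for Devanagari text the columns may not line up exactly on screen.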
The Konkani corpus has been completed at
Asmitai Pratishtan, Goa. Thirty lakh words of Konkani
corpora in machine readable form and software for tagging the
corpora, word count and frequency count have been developed.
A spell checker for use in conjunction with the corpora has
also been developed. These will also be maintained at CIIL,
Mysore and made available for distribution.
The Nepali corpus is under development at the
Centre for Computers and Communications Technology, Gangtok.
A Nepali corpus of 1.2 lakh words in machine readable form
and software for tagging the corpora, word count and frequency
count have been developed.
Development of the Manipuri corpus has been undertaken at the
University of Manipur, Manipur. Data collection has already
been completed for 25 lakh words and data entry is in
progress.
 |
Lexical Resources
in machine readable form : Central Institute
of Indian Languages, Mysore |
The lexical resources
of a language contain information such as head word,
stem alternants, stem type, detailed grammatical information,
syntactic information, all types of meanings, a citation
for each meaning, paradigms, derived words, cross references
for the derived words, compound words, synonyms, antonyms,
idioms, encyclopaedic information, etymological information
and statistical information. The lexical resource database
will be useful to linguists and computer scientists who
are working in linguistic research, machine translation,
expert systems and artificial intelligence. It can be
used to generate learner's dictionaries, historical
dictionaries, machine readable grammatical dictionaries, electronic
dictionaries, computational lexicons etc. Lexical resources
in five Indian languages, viz. Bengali, Hindi, Marathi,
Tamil and Telugu, are at an advanced stage of development.
Lexical resources provide lexical information on the basis
of concepts, more grammatical information, syntactic and
semantic conditioning for the usage of lexical items,
synonym sets and their usages, compound forms and idioms.
The categories for which lexical resources are being developed
are Verb, Noun, Adjective, Adverb and Function Word. The
development steps are:
 |
Collection
and selection of Headwords |
 |
Labelling of
grammatical categories |
 |
Syntactic Information |
 |
Sense discrimination
and suitable citation |
 |
Designing a
structure and creating a database |
 |
Retrieval system
for different purposes |
These resources
can be used for research on Machine Translation
systems: in the lexical transfer phase, with the lexical resource
of the source language in the analysis phase, and with the lexical resource
of the target language in the synthesis phase.
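The last two development steps listed above, designing a structure and providing a retrieval system, can be sketched in Python as a small in-memory lexicon. The field names are a subset of the information types named in the text. The class names, the choice of fields and the transliterated sample entry in the usage note are assumptions for illustration and do not reflect the actual database design at CIIL.

```python
from dataclasses import dataclass, field

# The five categories for which lexical resources are being developed.
CATEGORIES = {"Verb", "Noun", "Adjective", "Adverb", "Function Word"}


@dataclass
class Sense:
    """One meaning of a head word with a supporting citation."""
    meaning: str
    citation: str = ""


@dataclass
class LexicalEntry:
    headword: str
    category: str
    senses: list = field(default_factory=list)
    synonyms: list = field(default_factory=list)
    antonyms: list = field(default_factory=list)
    compounds: list = field(default_factory=list)
    idioms: list = field(default_factory=list)


class Lexicon:
    """Stores entries by head word and retrieves them in two ways."""

    def __init__(self):
        self._entries = {}

    def add(self, entry):
        if entry.category not in CATEGORIES:
            raise ValueError(f"unknown category: {entry.category}")
        # A head word may have several entries, e.g. as noun and as verb.
        self._entries.setdefault(entry.headword, []).append(entry)

    def lookup(self, headword):
        """All entries for a head word, or an empty list if none exist."""
        return self._entries.get(headword, [])

    def by_category(self, category):
        """All entries of one grammatical category."""
        return [entry
                for entries in self._entries.values()
                for entry in entries
                if entry.category == category]
```

A hypothetical entry could be added with `lexicon.add(LexicalEntry("ghar", "Noun", senses=[Sense("house")]))` and retrieved with `lexicon.lookup("ghar")`. A machine translation system would hold one such lexicon for the source language and another for the target language.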
 |
Computer Courseware in Hindi :
Banasthali Vidyapith, Banasthali |
DOEACC ‘O’ level
courseware in machine readable form has been developed
in Hindi. DOEACC is also participating financially in
this project. Once completed, this material will be published
in book form and, with incremental effort, it can also be published
on CD-ROM and made available on the web. The
modules covered in the syllabus are Information Technology,
Cobol, PC Software, Programming in 'C' and Business System.
The manuscripts have been reviewed by experts and modifications
based on their advice are being incorporated.
|