Indian Language Processing Tools
 |
Java based Solutions : IIT Kanpur |
Displaying Web
Documents through Negotiation and Dynamic Rendering :
Web authors create documents in a variety of languages,
using a variety of character sets and fonts. A viewer
cannot be expected to have all of those fonts and character
sets on his or her system. The client therefore has to
download and install the fonts, or install some helper
software. A truly portable solution should not require
the client to install any special fonts or software.
A Java-centric solution for displaying Devanagari
documents has been developed. A Java applet using the
public-domain font ‘Bhagwan’ displays Devanagari documents
in a TrueType font. The applet and the font-related
information together come to around 100 KB. The server
Java applet encodes the glyphs and sends them with the
document, so that no Hindi font is required on the client
side for browsing the document. A Hindi search engine has
been developed on the Linux platform.
Automatic Font
Installer : Users generally have to download
or manually install fonts in order to view non-Roman web
sites, which is a cumbersome task. A single executable
has been developed to carry out the process whenever the
user chooses the font installation option. The font
installer program runs on the server and installs the
fonts on the client machine.
URL Addresses in
Devanagari : Current browsers do not allow
URLs or HTML form input to be entered in Hindi. The World
Wide Web therefore needs to be made multilingual. This
involves converting a URL written in the local language
to an internationally agreed format, such as Unicode with
UTF-8 encoding. The user will be able to fill in the forms
of a document fetched in a local language. Further, if
the document is available in several languages, the user
can choose the language in which to view it. The solution
is based on the Swing components in JDK 1.2 and assumes
that the font is present on the local machine.
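
The conversion of a local-language URL to UTF-8 can be illustrated with a short Python sketch. It is not the Swing-based implementation described above; it only shows the encoding step, using the standard library. The function names and the example.org address are assumptions made for illustration, and the host name is left untouched (only the path and query are encoded).

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit


def to_utf8_url(url: str) -> str:
    """Percent-encode the UTF-8 bytes of the path and query of a URL.

    The scheme, host and fragment are passed through unchanged.
    """
    parts = urlsplit(url)
    path = quote(parts.path, safe="/")       # quote() encodes as UTF-8 by default
    query = quote(parts.query, safe="=&")    # keep key=value&key=value separators
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


def from_utf8_url(url: str) -> str:
    """Reverse of to_utf8_url: decode %XX sequences back to readable text."""
    return unquote(url)


if __name__ == "__main__":
    readable = "http://example.org/हिन्दी/index.html?q=राम"
    wire = to_utf8_url(readable)
    print(wire)
    print(from_utf8_url(wire) == readable)
```

The encoded form contains only ASCII characters, so it can travel through software that knows nothing about Devanagari, and it decodes back to the original text.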
Indian Language
Search Engine : The search engine should
allow indexing and searching of Devanagari HTML documents.
The basic components are the gatherer, the indexer and the
search processor. The indexer and the search processor are
being designed, as these two modules deal with the syntax
and semantics of the language of the text. The indexer will
perform processing such as keyword extraction, stop word
removal, stemming (handling different forms of a word),
handling of word synonyms, and term weight calculation.
The search processor looks up the index to find the
documents containing the query keywords, calculates a
relevance score, and ranks the documents according to the
score. The search engine will also find keywords that occur
inside a composite word (combined according to ‘SANDHI’
rules; for example, the software will report a match for
the keyword ‘ram’ if it finds ‘rameshwar’ in the document).
It is assumed that the documents are in ISCII.
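
The indexer and search processor described above can be sketched in a few lines of Python. This is a minimal illustration under several assumptions, not the engine itself: the stop word list is a tiny hypothetical sample, the term weight is a simple smoothed TF-IDF chosen for the example (the source does not name a weighting scheme), plain substring matching stands in for real sandhi-aware matching, stemming and synonym handling are omitted, and the text is held as Unicode strings rather than ISCII.

```python
import math
from collections import Counter, defaultdict

# Hypothetical, deliberately tiny stop word list.
STOP_WORDS = {"का", "की", "के", "है", "और"}


def tokenize(text):
    """Split on whitespace and drop stop words."""
    return [word for word in text.split() if word not in STOP_WORDS]


def build_index(docs):
    """Map each term to {doc_id: term frequency}. docs is {doc_id: text}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def search(index, n_docs, query):
    """Return doc ids ranked by a simple TF-IDF relevance score."""
    scores = Counter()
    for keyword in tokenize(query):
        for term, postings in index.items():
            # Substring test: a crude way to find a keyword inside a
            # composite word such as 'rameshwar'.
            if keyword not in term:
                continue
            idf = math.log(1 + n_docs / len(postings))
            for doc_id, tf in postings.items():
                scores[doc_id] += tf * idf
    return [doc_id for doc_id, _ in scores.most_common()]


if __name__ == "__main__":
    docs = {1: "राम वन गए", 2: "रामेश्वर मंदिर", 3: "सीता वन में"}
    index = build_index(docs)
    print(search(index, len(docs), "राम"))
```

A query for ‘राम’ matches both the document containing the word itself and the one containing ‘रामेश्वर’. Scanning every indexed term is acceptable for a sketch but would be replaced by a proper morphological lookup in a real system.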
Heritage Website :
A website has been developed containing material on
traditional Indian texts, centred around the Upanishads
and the Bhagavad Gita. The functionality provided is:
Font-downloading applications (.exe files), with which users can download and install any Indian language font.
The commentaries include some technical words of Sanskrit origin.
The definitions of such technical words are displayed with a mouse click.
Links between related shlokas within the same Upanishad/Gita for cross-referencing.
Each hyperlinked text has a search mechanism with which the user can locate the occurrences of any word and view the relevant shlokas with their translation and commentary.
User-selected language for viewing the Mool Shlokas.
CD Authoring Tools
for Indian Language Documents : The technologies
used for publishing on the WWW, viz. HTML, XML, Java,
JavaScript etc., are also increasingly being used for
document delivery on CDs. These days computer-related
documentation is accessible through web browsers, and
this trend is expected to extend to non-technical
publishing as well. The development of an Indian Language
CD Publisher's ToolBox, ‘site management’ tools and
searches integrated with a dictionary is underway.
 |
ActiveX based Solutions - C-DAC, Pune |
Web based E-mail
: A Hindi e-mail service has been developed. It uses
the ActiveX technologies available in Internet Explorer
4.0 and later versions to enable keyboarding and fonts
for Indian languages on the client PC. The service lets
the user type the text of an e-mail in Hindi, which is
then converted into HTML format. The converted Hindi text,
in HTML with font codes, is delivered to the e-mail
address of the user, who can place it on any web page
using a standard HTML editor such as Netscape Composer.
The
software components, namely the ActiveX controls and Hindi
fonts, are downloaded and installed on the client’s computer
the first time the user accesses the system. Every time
the user accesses the e-mail server, a check is made for
the installed components.
To
be able to send or view mail, the user must first create
an account on the system by defining a login name and
password. An account holder can send a message to any
other account holder. The user types in the message
using the Inscript keyboard overlay. The message is stored
in a database; the data on the server is stored in ISCII.
When a request to read an e-mail message is received,
the server retrieves the message from the database,
creates an HTML file containing the message and the Hindi
font information on the fly, and delivers it.
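
The store-then-render flow described above can be sketched as follows. This is only an illustration in Python, not the ASP and MS SQL Server implementation: it uses an in-memory SQLite database, stores the message as Unicode text rather than ISCII, and the table layout, function names and the font name "HindiFont" are all hypothetical.

```python
import html
import sqlite3

FONT_FACE = "HindiFont"  # hypothetical font name, for illustration only


def open_store():
    """Create an in-memory message store (stand-in for the server database)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE mail ("
        "id INTEGER PRIMARY KEY, sender TEXT, recipient TEXT, body TEXT)"
    )
    return db


def send(db, sender, recipient, body):
    """Store a message and return its id."""
    cur = db.execute(
        "INSERT INTO mail (sender, recipient, body) VALUES (?, ?, ?)",
        (sender, recipient, body),
    )
    db.commit()
    return cur.lastrowid


def render_message(db, mail_id):
    """Fetch one message and build an HTML page for it on the fly."""
    row = db.execute(
        "SELECT sender, body FROM mail WHERE id = ?", (mail_id,)
    ).fetchone()
    if row is None:
        raise KeyError(mail_id)
    sender, body = row
    # html.escape keeps characters such as '<' in the message from being
    # interpreted as markup.
    return (
        '<html><head><meta charset="utf-8"></head><body>'
        f"<p>From: {html.escape(sender)}</p>"
        f'<p><font face="{FONT_FACE}">{html.escape(body)}</font></p>'
        "</body></html>"
    )


if __name__ == "__main__":
    db = open_store()
    mail_id = send(db, "asha", "ravi", "नमस्ते")
    print(render_message(db, mail_id))
```

Generating the page at read time means the stored text stays independent of any particular font; only the rendering step needs to know which font information to attach.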
Microsoft
Visual InterDev is the IDE used. ASP (Active Server Pages)
builds the web pages and connects to the back-end database
through ODBC drivers for MS SQL Server, and the web pages
query the database through ASP. An ActiveX-based Hindi
e-mail service, search engine and bulletin board system
have been developed. The Hindi e-mail service also stores
documents in ISCII format.
Hindi Bulletin
Board System : It is under development. This
web based application allows users to create topics for
discussion and maintains threads within a topic.
Hindi Search Engine :
Development is underway and involves the following:
Manually surf the net and build indexes for documents in Hindi.
Invite creators of Hindi language documents to submit web page URLs, with a page description and keywords in Hindi, to the search engine, i.e. build a web based application to collect data in Hindi.
Build special search techniques for Hindi based on word morphology, thesaurus, sandhi etc.
Deliver the HTML document index description in Hindi for search results.
Define standards for meta-tags etc. for Indian languages, so that future spiders can retrieve documents for a particular language.
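
The last point, letting a spider pick out documents in a particular language from their meta-tags, can be sketched in Python. The project's own meta-tag standard is not given here, so this example falls back on two existing HTML conventions as a stand-in: the `lang` attribute on the `html` element and the `Content-Language` `http-equiv` meta tag. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class LanguageSniffer(HTMLParser):
    """Record the first language declaration found in a page."""

    def __init__(self):
        super().__init__()
        self.language = None

    def handle_starttag(self, tag, attrs):
        if self.language is not None:
            return  # the first declaration wins
        attrs = dict(attrs)
        value = None
        if tag == "html":
            value = attrs.get("lang")
        elif tag == "meta" and (attrs.get("http-equiv") or "").lower() == "content-language":
            value = attrs.get("content")
        if value:
            # "hi, en" -> "hi"
            self.language = value.split(",")[0].strip().lower()


def detect_language(page):
    """Return the declared language code of an HTML page, or None."""
    sniffer = LanguageSniffer()
    sniffer.feed(page)
    return sniffer.language


def filter_by_language(pages, lang):
    """pages is {url: html}. Return the URLs whose declared language is lang."""
    return [url for url, page in pages.items() if detect_language(page) == lang]


if __name__ == "__main__":
    pages = {
        "a.html": '<html><head><meta http-equiv="Content-Language" content="hi"></head></html>',
        "b.html": '<html lang="en"><body></body></html>',
    }
    print(filter_by_language(pages, "hi"))
```

Pages that carry no declaration at all are skipped, which is exactly the situation an agreed meta-tag standard is meant to prevent.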
Multilingual E-mail
Client : A working prototype has been developed
that lets clients send and receive e-mails in Hindi
without needing an Internet connection, provided that both
sender and receiver have this software. The application
will use technologies such as MAPI, Extended MAPI and COM
to communicate with the interfaces provided by MS Exchange.
The application downloads all the mails received via the
POP3 server and stores them locally on the machine; this
storage is handled by MS Exchange. For the convenience of
the user, the application provides access to folders such
as InBox, Sent Mails, Deleted Mails and OutBox.
 |
DESIKA - Centre for Development of Advanced
Computing (C-DAC), Bangalore |
The software
package DESIKA is a Natural Language Understanding
System for Sanskrit. It incorporates language generation
and analysis modules for plain and accented written
Sanskrit texts, and is based on the principles of ancient
Indian Sciences. DESIKA aims to process all the words of
Sanskrit and covers both generation and analysis (parsing).
It has an exhaustive database based on Amarakosha, the
most popular Sanskrit lexicon, a rule base using the
grammar rules of Panini's Ashtadhyayi, and heuristics based
on the Nyaya & Mimamsa sastras for semantic and contextual
processing. The software can also analyse Vedic (scriptural)
texts.
The highlight
of DESIKA is the analysis module, a general
purpose Sanskrit parser that is currently being extended to
handle the dissolution and identification of compound and
combined word forms. Vedic analysis is also under way,
covering the Rigveda and the Taittiriya branch of the
Krishna Yajur Veda, using the Taittiriya Pratishakhya and
the Vaidika Prakriya of the Ashtadhyayi.
The DESIKA
software helps in understanding a natural language input
(typically an isolated sentence) through paraphrasing,
voice change, query answering or summarising. Its further
aims are to develop a language-independent knowledge
representation scheme based on ancient Indian Sciences,
to develop tools for linguistic analysis, and to assist
in the analysis and presentation of scriptural (accented
text) knowledge, phonetic and language research, teaching
etc. It was developed on the DOS platform and has now
been ported to the Windows platform.
 |
Sanskrit Authoring System - C-DAC, Bangalore
|
A Sanskrit word
processor is under development which will also handle
special Sanskrit conjuncts. The requirements to be
addressed by this software are:
Word processing in Sanskrit
Statistical tools like concordance, thesauri, electronic dictionaries etc.
Transliteration facility
Search/sort algorithms
Word split programs for Sandhi and Samasa
Fonts for various scripts, web access, web hosting, publishing etc.
Poetry analysis (textual/metric/statistical)
Manual content for Amarakosha, grammar rules, derivations, quotes from the Vedas (scriptures)
Epics like the Ramayana and Mahabharata, other Puranas, Shastraic texts in sutra and authentic reference
On-line readers/primers of Indian Shastraic texts
Tools for morphological, syntactic and semantic analysis
Tools for linguistic analysis like tagging, lemmatising, statistical studies etc.
 |
Syntactic and Semantic Analysis of Sanskrit
Sentences – Academy of Sanskrit Research, Melkote
|
Software for the syntactic and semantic
analysis of Sanskrit sentences has been developed on the
DOS platform with a GIST card and is being ported to the
Windows platform. The sentence has been taken as the basic
unit of analysis, since it is the backbone of verbal
communication between human beings; the importance of
individual words becomes clear only when the meaning of
the sentence is known. A systematic classification of words
and a robust grammar can help in deriving knowledge from
Sanskrit and in building a system that supports the
development of Natural Language Processing systems. The
various modules of the system are:
Subanta: It
can handle generation and analysis of all the case inflected
forms of more than 26,000 stems.
Tinanta: It
can handle the conjugational forms of roots, in two voices,
ten lakaras and three modes viz. Kevala Tiganta, Nijanta
and Sannanta.
Krdanta:
It is capable of handling the generation, analysis and
identification of case inflected forms of 11 types of
krdantas of 150 roots.
Databases:
690 Avyayas, 26,000 nominal stems, 600 verbal roots,
krdanta forms of 600 verbal roots, 5 Taddhita suffixes.
The parts of speech handled in analysis are nouns, pronouns,
adjectives, participles, indeclinables, indeclinable
participles and verbs. Sentences with multiple adjectives
and participles can also be analysed. Any sentence
constructed from words in the database can be analysed
syntactically, but semantic analysis is done only within
a limited domain. For the semantic analysis, a matrix has
been prepared consisting of 52 sets of nouns with their
synonyms, amounting to 300 nouns, and 27 actions denoted
by nearly 200 verbs. Syntactic and semantic analysis of a
simple passage of not more than 10 simple sentences has
been carried out successfully.
 |
Computer Assisted Sanskrit Teaching &
Learning Environment (CASTLE) – Jawaharlal
Nehru University, New Delhi |
The CASTLE software has been developed on the DOS
platform with a GIST card as a stand-alone application
for Sanskrit teaching and learning.
Under this project, the synthesis aspect of Sanskrit
phonology and word morphology has been handled. The
various modules developed under this system are:
Pratyahara:
It deals with the sound classes of Paninian grammar.
It may be described as a shorthand notation to refer
to a group of items.
Sandhi: Euphonic
combination relating to sound units is called ‘sandhi’.
It is a common module for various types of word formation.
A sandhi type depends on the final phoneme of the first
word and the initial phoneme of the second word. It
also includes a program for internal sandhi called Natva-satva
vidhana.
Subanta: This module
deals with nominal inflexion. The system inputs
are the noun base with its attributes, and the output is
the 21+3 inflected forms of the noun (7 cases in 3 numbers,
plus 3 vocative forms).
Tiganta: Verbal
conjugation is called tiganta. This module takes the
verb and lakara (tense/mood) as inputs, and generates
9 conjugated forms of the verb in each pada.
Kridanta: The primary
derivatives are called kridanta. The inputs to the system
are the semantic condition, verb root and krit suffix.
The kridanta form is the output.
Taddhita: The
secondary derivatives are called taddhita. System inputs
are the semantic condition, noun base and taddhita suffix.
Taddhita form is the output.
Samasa: Compound
formation is known as samasa. Two or more words are
joined to form a new word. The inputs to the system
are two or more noun bases, which are characterized
by a semantic condition, and the normal suffixes.
Sri-pratyayas:
These suffixes are added to primary verbal roots to
derive secondary verbal roots. The derived verb is again
sent to the tiganta module to generate 9 conjugated
forms of the verb in each pada.
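
The "21+3" output of the Subanta module can be made concrete with a small Python sketch. It is not the CASTLE code: it covers only masculine stems ending in short a, uses the standard textbook endings in IAST transliteration, and does not apply internal sandhi (natva), so a stem such as "rāma" would need a further step to get forms like "rāmeṇa". The function and table names are hypothetical.

```python
NUMBERS = ("singular", "dual", "plural")

# Endings that replace the final 'a' of a masculine a-stem (IAST).
A_STEM_ENDINGS = {
    "nominative":   ("aḥ", "au", "āḥ"),
    "accusative":   ("am", "au", "ān"),
    "instrumental": ("ena", "ābhyām", "aiḥ"),
    "dative":       ("āya", "ābhyām", "ebhyaḥ"),
    "ablative":     ("āt", "ābhyām", "ebhyaḥ"),
    "genitive":     ("asya", "ayoḥ", "ānām"),
    "locative":     ("e", "ayoḥ", "eṣu"),
}
VOCATIVE_ENDINGS = ("a", "au", "āḥ")


def decline(stem):
    """Return {(case, number): form} for a masculine stem ending in 'a'.

    7 cases x 3 numbers give 21 forms; the vocative adds 3 more.
    Internal sandhi (natva) is not applied in this sketch.
    """
    if not stem.endswith("a"):
        raise ValueError("this sketch only handles stems ending in short 'a'")
    base = stem[:-1]
    forms = {}
    for case, endings in A_STEM_ENDINGS.items():
        for number, ending in zip(NUMBERS, endings):
            forms[(case, number)] = base + ending
    for number, ending in zip(NUMBERS, VOCATIVE_ENDINGS):
        forms[("vocative", number)] = base + ending
    return forms


if __name__ == "__main__":
    for (case, number), form in decline("deva").items():
        print(f"{case:13} {number:9} {form}")
```

The same table-driven shape would extend to other stem classes by adding one endings table per class, with the noun's attributes (gender, stem-final sound) selecting the table.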
The following
demonstration modules for the learning and teaching of
Sanskrit have also been developed:
Teaching Varnmala:
This module deals with the teaching of the Sanskrit
alphabet along with the characteristics of its letters.
Exercises for testing knowledge of the Varnmala have
also been prepared.
Sandhi Viccheda:
The system takes a word as input, and returns the constituent
words.
Subanta Viccheda:
The input word is split into the root word and suffix.
Besides, the grammatical attributes associated with
the root word, i.e. the noun-base are also displayed.
Tiganta Viccheda:
The input word is split into the root word and suffix.
Besides, the grammatical attributes associated with
the root, which is a verbal root, are also displayed.
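
The idea behind a Sandhi Viccheda module, taking a combined word and returning its constituents, can be sketched in Python for the simplest case. This is an assumption-laden illustration, not the CASTLE algorithm: it only reverses a handful of vowel sandhi rules (for example a + ī giving e), works on IAST transliteration, and accepts a split only when both parts are found in a small lexicon supplied by the caller. All names and the sample lexicon are hypothetical.

```python
import unicodedata


def nfc(text):
    """Normalise so that letters like 'ā' are always single code points."""
    return unicodedata.normalize("NFC", text)


# Surface vowel -> possible (final of first word, initial of second word).
_RAW_RULES = {
    "ā": [("a", "a"), ("a", "ā"), ("ā", "a"), ("ā", "ā")],
    "e": [("a", "i"), ("a", "ī"), ("ā", "i"), ("ā", "ī")],
    "o": [("a", "u"), ("a", "ū"), ("ā", "u"), ("ā", "ū")],
    "ī": [("i", "i"), ("i", "ī"), ("ī", "i"), ("ī", "ī")],
    "ū": [("u", "u"), ("u", "ū"), ("ū", "u"), ("ū", "ū")],
}
VOWEL_SANDHI = {
    nfc(surface): [(nfc(f), nfc(i)) for f, i in pairs]
    for surface, pairs in _RAW_RULES.items()
}


def split_sandhi(word, lexicon):
    """Return every (first, second) split of word licensed by the rules.

    A candidate is kept only if both parts occur in the lexicon.
    """
    word = nfc(word)
    known = {nfc(entry) for entry in lexicon}
    splits = []
    for i, ch in enumerate(word):
        for final, initial in VOWEL_SANDHI.get(ch, []):
            first = word[:i] + final
            second = initial + word[i + 1:]
            if first in known and second in known:
                splits.append((first, second))
    return splits


if __name__ == "__main__":
    lexicon = {"rāma", "īśvara", "deva", "ālaya"}
    print(split_sandhi("rāmeśvara", lexicon))
    print(split_sandhi("devālaya", lexicon))
```

Checking candidates against a lexicon is what keeps the number of proposed splits small; without it, every long vowel in the word would yield several spurious analyses.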
A Sanskrit Authoring
System, including a Sanskrit word processor for use by
Sanskrit scholars in text processing etc., is being
developed at C-DAC, Bangalore.
DESIKA,
the Natural Language Understanding System for Sanskrit
described above, was developed at the Indian Heritage Group
of the Centre for Development of Advanced Computing (C-DAC),
Bangalore, a Scientific Society of the Ministry of Information
Technology, Government of India, Ramanashree Plaza, 2/1,
Brunton Road, Bangalore (Karnataka).
Shabdhabodha
is an interactive application built to analyse the semantic
and syntactic structure of Sanskrit sentences.
It works on MS-DOS Platform version 6.0 or higher with
GIST shell. It has been developed at ASR, Melkote.
Spell checkers
are useful for word processing and are mostly integrated
with word processing software. Spell checkers are
available for a few Indian languages. The development
of spell checkers is covered within the scope of the current
projects for corpora development.
A Punjabi spell checker has been developed
at CEDTI, Mohali.
| Special requirements for Indian Language Processing (ILP) |
|
India is a large multilingual society with
as many as eighteen constitutionally recognised languages
including English, with Hindi as the national language. There
are multiple scripts for these languages. With the increase
in trade and development across the country, it becomes necessary
for people to communicate in more than one language.
In such circumstances, Information Technology (IT) appears
to be a promising tool for the development of ILP systems
that aim at overcoming the language barrier. These ILP
tools could be designed using many approaches, such as :
|
|