Language technology studies computer systems that
understand and/or synthesize spoken and written human
languages. Included in this area are speech processing
(recognition, understanding, and synthesis), information
extraction, handwriting recognition, machine translation,
text summarization, and language generation.
Computational linguistics (CL) is a discipline between
linguistics and computer science which is concerned
with the computational aspects of the human language
faculty. It belongs to the cognitive sciences and overlaps
with the field of artificial intelligence (AI), a branch
of computer science that aims at computational
models of human cognition. There are two components
of CL: applied and theoretical. The applied component
of CL is more interested in the practical outcome of
modelling human language use. The goal is to create
software products that have some knowledge of human
language.
Natural language interfaces enable the user to communicate
with the computer in German, English or another human
language. Some applications of such interfaces are database
queries, information retrieval from texts and so-called
expert systems. Current advances in recognition of spoken
language improve the usability of many types of natural
language systems.
Much older than communication problems between human
beings and machines are those between people with different
mother tongues. One of the original goals of applied
computational linguistics was fully automatic translation
between human languages. Computational linguists have
created software systems which can simplify the work
of human translators and clearly improve their productivity.
Even though the successful simulation of human language
competence is not to be expected in the near future,
computational linguists have numerous immediate research
goals involving the design, realization and maintenance
of systems which facilitate everyday work, such as grammar
checkers for word processing programs.
Theoretical CL is concerned with formal theories of
the linguistic knowledge
that a human needs for generating and understanding
language. Computational linguists develop formal models
simulating aspects of the human language faculty and
implement them as computer programmes. These programmes
constitute the basis for the evaluation and further
development of the theories. In addition to linguistic
theories, findings from cognitive psychology play a
major role in simulating linguistic competence. Within
psychology, it is mainly the area of psycholinguistics
that examines the cognitive processes constituting human
language use. The special attraction of computational
linguistics lies in the combination of methods and strategies
from the humanities, natural and behavioural sciences,
and engineering.
There is a very comprehensive Linguistic Annotation Tools
web page provided by the Linguistic Data Consortium, at
http://www.ldc.upenn.edu/annotation .
It concentrates on speech but also covers resources for
working with text.
Speech synthesis programs convert written input to spoken
output by automatically generating synthetic speech. Speech
synthesis is often referred to as "Text-to-Speech" (TTS)
conversion.
There are several approaches, and the choice depends on
the task. The simplest is to record the voice of a person
speaking the desired phrases. This is useful if only a
restricted set of phrases and sentences is needed, e.g.
announcements in a train station or schedule information
by phone. The quality depends on how the recording is
done. More sophisticated, but lower in quality, are
algorithms that split the speech into smaller units and
recombine them. The smaller the units, the fewer of them
are needed, but the quality also decreases. A frequently
used unit is the phoneme, the smallest linguistic unit
that distinguishes meaning. Western European languages
have about 35-50 phonemes, depending on the language,
so only 35-50 recordings are needed. The problem is
combining them, because fluent speech requires fluent
transitions between the elements. The intelligibility
is therefore lower, but the memory required is small.
A solution to this dilemma is using diphones. Instead
of splitting at the transitions, the cut is done at
the center of the phonemes, leaving the transitions
themselves intact. This gives about 400 elements (20*20)
and the quality increases. In general, the longer the
units become, the more elements there are, and both the
quality and the memory required increase. Other units
which are widely used are half-syllables, syllables,
words, or combinations of them, e.g. word stems and
inflectional endings. The Museum of Speech Analysis
and Synthesis has pictures of artificial speech systems
going back over 150 years: worth a visit. (http://mambo.ucsc.edu/psl/smus/smus.html)
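The diphone idea described above can be sketched in a few lines of Python. This is only an illustration: the phoneme symbols, the "_" silence marker and the function names are assumptions made for the example and are not taken from any particular synthesizer.

```python
def to_diphones(phonemes, silence="_"):
    """Turn a phoneme sequence into the diphone units needed to speak it.

    Each unit runs from the middle of one phoneme to the middle of the
    next, so every transition between neighbours stays inside a unit.
    """
    padded = [silence, *phonemes, silence]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


def max_diphone_inventory(phoneme_count):
    """Upper bound on distinct diphones: every ordered pair of phonemes."""
    return phoneme_count * phoneme_count


units = to_diphones(["h", "a", "l", "o"])
print(units)                      # one unit per transition, including silence
print(max_diphone_inventory(20))  # the 20*20 figure quoted in the text
```

A sequence of n phonemes needs n+1 diphone units, because the transitions from and to silence are also recorded units.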
The Bureau of Indian Standards has defined a standard known
as ISCII (Indian Script Code for Information Interchange)
for use in all computer and communication media,
which allows the use of 7- or 8-bit characters. In an
8-bit environment, the lower 128 characters are the same
as those defined in IS10315:1982 (ISO 646 IRV), the 7-bit
coded character set for information interchange, also known
as the ASCII character set. The top 128 characters cater
to all the Indian scripts based on the ancient Brahmi
script. In a 7-bit environment, the control code SO can
be used to invoke the ISCII code set and the control
code SI can be used to reselect the ASCII code
set.
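For the 8-bit environment, the split between the two halves of the code table can be illustrated with a small Python sketch. It only separates bytes below 128 (the ASCII half) from bytes at or above 128 (the ISCII half); it does not interpret individual ISCII characters, and the function name and sample bytes are assumptions made for this example.

```python
from itertools import groupby


def split_runs(data: bytes):
    """Group an 8-bit ISCII byte string into ASCII and ISCII runs.

    Bytes 0x00-0x7F belong to the lower (ASCII) half of the table,
    bytes 0x80-0xFF to the upper (Indian script) half.
    """
    runs = []
    for is_upper, group in groupby(data, key=lambda byte: byte >= 0x80):
        runs.append(("iscii" if is_upper else "ascii", bytes(group)))
    return runs


# Sample bytes chosen only to show the split; they are not a real word.
sample = b"ab\xb3\xb4c"
for kind, chunk in split_runs(sample):
    print(kind, chunk.hex())
```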
There are 15 officially recognized languages in India.
Apart from Perso-Arabic scripts, all the other 10 scripts
used for Indian languages have evolved from the ancient
Brahmi script and have a common phonetic structure,
making a common character set possible. An attribute
mechanism has been provided for selection of different
Indian script font and display attributes. An extension
mechanism allows use of more characters along with the
ISCII code. The ISCII Code table is a super set of all
the characters required in the Brahmi based Indian scripts.
For convenience, the alphabet of the official script
Devanagari has been used in the standard. The standard
IS 13194:1991, issued by the Bureau of Indian Standards,
is the latest Indian standard for information interchange
and is widely used for the development of IT products
in Indian languages.
ACII is the Alphabetic Code for Information Interchange
(pronounced "Ae-Kee"). This is an 8-bit code containing
the ASCII character set in the bottom half. The top half
contains the ACII characters. PC-ACII script code is the
version of the ACII script code in which the characters
are split within the upper half for compatibility with
the IBM PC. This splitting is necessary to keep intact
the line-drawing characters, which are located in the
middle of the upper half of the character set.
Following are the entities required for ensuring proper
representation of complex scripts:
ACII - Alphabetic Code for Information Interchange
This is a computer code by which the basic alphabet
of a script is represented. The basic letters and signs
needed in most scripts (leaving aside ideographic
scripts like Chinese) number fewer than 96. All the possible
shapes in a script can be expressed through combinations
of these basic letters. The ACII code can be typed through
an ACII keyboard overlay. The ACII keyboard overlay
fits on a standard English keyboard. Each ASCII character
has a unique position on the keyboard overlay.
ISFOC - Intelligence Based Script Font Code
ISFOC is a coded character set containing all the basic shapes
required for rendering a script. These shapes can be
overlapped linearly to compose any word in the script.
Each of the ISFOC characters is like a piece of a jigsaw
puzzle; it may not be a complete letter by itself. Each
ISFOC set can contain a maximum of 188 characters. This
is adequate for most of the scripts. However, some require
more.
ISFA - Intelligence Based Script to Font Algorithm
A word is always typed in terms of its basic ACII characters.
It has, however, to be displayed using the basic ISFOC
shapes. An algorithm is required for converting the
ACII codes to the appropriate ISFOC code. This is the
ISFA algorithm.
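The division of labour between spelling order and display order can be illustrated with a toy Python sketch. It is not the real ISFA: the actual ACII and ISFOC tables are not given here, so Unicode Devanagari characters stand in for both the typed letters and the displayed shapes, and the function name is hypothetical. The one rule shown is that the Devanagari vowel sign i is pronounced and typed after its consonant but drawn to its left.

```python
# Assumption: Unicode characters stand in for ACII letters and ISFOC shapes.
PRE_BASE_SIGNS = {"\u093f"}  # DEVANAGARI VOWEL SIGN I, drawn before its consonant


def to_display_order(spelling: str) -> list:
    """Reorder a word from spelling order into left-to-right shape order."""
    shapes = []
    for letter in spelling:
        if letter in PRE_BASE_SIGNS and shapes:
            # Place the sign in front of the shape it follows in the spelling.
            shapes.insert(len(shapes) - 1, letter)
        else:
            shapes.append(letter)
    return shapes


typed = "\u0915\u093f"  # KA followed by vowel sign I, as pronounced
print([f"U+{ord(shape):04X}" for shape in to_display_order(typed)])
```

A real conversion would also have to select conjunct and half-letter shapes, which is why the text describes the ISFOC pieces as parts of a jigsaw puzzle.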
The ACII (Alphabetic Code for Information Interchange)
code contains all the basic characters available on
the ACII keyboard. For example, the ACII Indian code
and keyboard accommodate the requirements of the 10
Indian scripts: Assamese, Bengali, Devanagari, Gujarati,
Kannada, Malayalam, Oriya, Punjabi, Tamil and Telugu.
The basic characters are ordered such that direct sorting
by code gives results that are almost the same as the
conventional order of any of the scripts. The ACII codes
have to be converted to ISFOC for display. This is done through an
ISFA algorithm for the selected script. An ACII text
can be displayed in any of the scripts. Transliteration
to another script can be achieved by merely selecting
that script. ACII code is used in communication media,
like telex, for optimal transfer of text. ALP word processor
uses the ACII code internally to allow proper editing
at alphabetic level and unique representation of spellings.
Existing Windows applications are unable to handle
ACII directly, as it requires an intelligent algorithm
for handling the display. They can, however, handle the
ISFOC codes, which were made for this purpose. Thus,
conversion between ACII and ISFOC is necessary whenever
text has to be transferred from ALP to a Windows application.
It is possible to type ISFOC text directly within a
Windows application using the ACII keyboard. This is
done through a custom keyboard driver that performs the
ACII to ISFOC conversion internally.
Script character set
This is the primary character set, containing most of
the language characters and a set of frequently used
symbols and numerals. This set of symbols is common
across all the ISFOC character sets, with a few exceptions.
The matching English character set
This is a companion character set for matching English
fonts, containing ASCII characters in the bottom half
and accent characters for Roman transliteration in the
upper half.
The supplemental character set
This is an extension of the basic script character set,
containing conjuncts and symbols that are not required
for normal usage.
This chapter lists the basic philosophy required for
rendering complex scripts.
Script Rendition Philosophy
· It is intuitive and logical to type in a word in terms
of its spelling.
· The spelling of a word consists of the letters of the
basic alphabet in the order of their pronunciation.
· The basic alphabet of a script, along with the necessary
special symbols and punctuation, constitutes the ACII
(Alphabetic Code for Information Interchange). The letters
in the ACII are arranged according to their alphabetical
sorting order. ACII also contains the ASCII character
set.
· A word can be composed by linearly combining the basic
shapes available in a script.
· ISFOC for a script contains these basic shapes. These
can be too unwieldy for direct typing.
· An intelligent Script to Font Algorithm (ISFA) can
interpret the ACII spelling and generate the ISFOC code
sequence required for displaying the word.
· For simple scripts like that of English, the ASCII
code by itself suffices for both ACII and ISFOC.
· However, most complex non-linear scripts, like the
Indian scripts, require separate codes for ACII and ISFOC,
and an ISFA algorithm.
ISFOC Standards
· Script standards for basic shapes and their composition
facilitate the designing of fonts.
· ISFOC represents the modern rendition style of a
script by defining the necessary basic shapes.
· The basic shapes are chosen such that they can represent
a wide variety of font styles in the script.
· ISFOC for a script is associated with an ISFA, which
defines the standard way for composing a word using
the basic shapes.
· All the fonts developed for a script are mutually
compatible. A user can view a text in a font of his
choice.
· Since ISFOC fonts are linearly composed, they can
be used along with the existing English applications
and printed on existing Laser printers and Typesetters.
· ISFOC provides the code set for inclusion of complex
scripts in graphics-oriented environments like MS-Windows
and Macintosh.
· ISFOC provides the neatest script rendition, while
allowing an intuitive human interface through an ACII
keyboard.
Unicode is increasingly
being accepted as a standard for information interchange
worldwide, as most of the major IT companies have declared
their support for it. Unicode for Indian languages is based
on ISCII-88 and not ISCII-91, which is the latest official
standard. It was felt necessary that the Indian Government
should make representations to the Unicode Consortium for
the necessary modifications to the code pertaining to Indian
language scripts, and hence the Department of Information
Technology became a full member of the Unicode Consortium
with voting rights.
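One visible consequence of the ISCII heritage is that the Unicode blocks for the Brahmi-derived scripts are laid out in parallel, so a letter keeps the same offset in each block. The Python sketch below uses this to re-render a word in another script by shifting code points. It is a rough illustration under stated assumptions: the function name and the small block table are made up for the example, and not every position is assigned in every script, so real transliteration needs per-script exceptions.

```python
# Start of each 128-position Unicode block (values from the Unicode charts).
BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gujarati": 0x0A80,
    "tamil": 0x0B80,
}
BLOCK_SIZE = 0x80


def shift_script(text: str, source: str, target: str) -> str:
    """Move every character of the source block to the same offset in target."""
    low = BLOCK_START[source]
    delta = BLOCK_START[target] - low
    return "".join(
        chr(ord(ch) + delta) if low <= ord(ch) < low + BLOCK_SIZE else ch
        for ch in text
    )


word = "\u0915\u092e\u0932"  # Devanagari letters KA, MA, LA
print(shift_script(word, "devanagari", "bengali"))
```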
16 Bit (2 Byte) UNICODE
The Unicode Standard is the universal character encoding
standard, used for the representation of text for computer
processing. It provides the capacity to encode all of
the characters used for the written languages of the
world, together with information about the characters
and their use. The Unicode Standard is very useful for
computer users who deal with multilingual text: business
people, linguists, researchers, scientists, mathematicians
and technicians. In its basic form, Unicode uses a 16-bit
encoding that provides code points for more than 65,000
characters (65,536). The Unicode Standard assigns each
character a unique numeric value and name. The Unicode
Standard and the ISO 10646 standard provide an extension
mechanism called UTF-16 that allows for the encoding of
as many as a million additional characters. At present
the Unicode Standard provides codes for 49,194 characters.
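The UTF-16 extension mechanism works by representing each character beyond the first 65,536 as a pair of reserved 16-bit values (a "surrogate pair"). The following Python sketch shows the arithmetic; the function name is chosen for this example, and the result is cross-checked against Python's built-in UTF-16 encoder.

```python
def to_surrogate_pair(code_point: int):
    """Split a code point above U+FFFF into its two 16-bit UTF-16 units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need a pair")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")

# Cross-check with the built-in encoder (big-endian, no byte order mark).
print(chr(0x1F600).encode("utf-16-be").hex())

# 1024 high values times 1024 low values give the "million" extra characters.
print(1024 * 1024)
```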
The Unicode Consortium has
laid down a policy on character encoding stability under
which no character may be deleted and no character name
changed; only annotations may be updated:
1. Once a character is encoded, it will not be moved
or removed.
2. Once a character is encoded, its character name will
not be changed.
3. Once a character is encoded, its canonical combining
class and decomposition (either canonical or compatibility)
will not be changed in a way that would affect normalization.
4. Once a character is encoded, its properties may still
be changed, but not in such a way as to change the fundamental
identity of the character.
5. The structure of certain property values in the Unicode
character database will not be changed.
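The values that these policies protect (code point, character name, canonical combining class) can be inspected with Python's standard unicodedata module. The helper name below is an assumption for this sketch; the lookups themselves are standard library calls.

```python
import unicodedata


def describe(ch: str):
    """Return the code point, name and canonical combining class of ch."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.combining(ch))


print(describe("\u0915"))  # a Devanagari consonant
print(describe("\u094d"))  # the Devanagari virama, a combining sign
```

Because names and combining classes are frozen once assigned, software that stores them, or that relies on normalization, keeps working across versions of the standard.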
Unicode uses a 16-bit encoding that provides code
points for more than 65,000 characters (65,536), and
assigns each character a unique numeric value and name.
It provides the capacity to encode all of the characters
used for the written languages of the world.
ISCII, by contrast, uses an 8-bit code that extends the
7-bit ASCII code with the basic alphabet required
for the 10 Indian scripts that originated from
the Brahmi script. Because these scripts share a common
phonetic structure, the ISCII code table is a superset
of all the characters they require.
There are 3 different keyboard
layouts.
1. Romanised Layout:
In the Romanised layout, phonetic English mappings are used
to compose the Hindi text. For example, the key sequence
raamaa (or rAmA) can be used to type 'Rama'.
2. Typewriter Layout: This layout is similar
to the Hindi typewriter layout and is useful for Hindi
typists and other people familiar with the Hindi typewriter
layout.
3. DOE Phonetic: This layout is standardized
by the Department of Electronics (DOE), Govt. of India.
The advantage of this layout is that the layout remains
identical for all Indian Languages. For example, the
key 'k' is used to represent the letter 'ka' in all
Indian Languages. The Keyboard Layout and the Key Sequence
Charts can be used to find the correct key combinations.
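The Romanised layout can be sketched as a greedy lookup from key sequences to Devanagari characters. The tables below are a tiny, made-up subset chosen to cover the raamaa / rAmA example; they are not the mapping of any actual keyboard driver, and word-initial vowels are deliberately left out.

```python
# Hypothetical subset of a Romanised mapping (not a real driver's table).
CONSONANTS = {"k": "\u0915", "m": "\u092e", "r": "\u0930", "l": "\u0932"}
VOWEL_SIGNS = {
    "a": "",         # inherent vowel, nothing is written
    "aa": "\u093e",  # vowel sign AA
    "A": "\u093e",   # alternative key for the same sign
    "i": "\u093f",   # vowel sign I
}
VIRAMA = "\u094d"    # joins two consonants with no vowel between them


def romanised_to_devanagari(keys: str) -> str:
    """Convert a Romanised key sequence to Devanagari text (toy subset)."""
    out = []
    i = 0
    after_consonant = False
    while i < len(keys):
        # Greedy match: prefer two-key tokens such as "aa" over "a".
        for size in (2, 1):
            token = keys[i:i + size]
            if after_consonant and token in VOWEL_SIGNS:
                out.append(VOWEL_SIGNS[token])
                after_consonant = False
                break
            if token in CONSONANTS:
                if after_consonant:
                    out.append(VIRAMA)
                out.append(CONSONANTS[token])
                after_consonant = True
                break
        else:
            raise ValueError(f"unsupported key sequence at {keys[i:]!r}")
        i += len(token)
    return "".join(out)


print(romanised_to_devanagari("raamaa"))
print(romanised_to_devanagari("rAmA"))
```

Both key sequences produce the same four characters, which is the behaviour described for the Romanised layout.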