Language Technology at GIST
C-DAC GIST has always been at the forefront
of the development of new tools and technologies. A
leader in the area, the GIST Labs have built their
expertise in technologies as varied as Natural Language
Processing (NLP), Video, Embedded Systems and Word-processing,
to name only a few. This tradition of cutting-edge work
continues at the GIST Labs, where new tools suited to
the needs and requirements of today's fast-developing
digital world are being developed.
Some of the major technologies, which
underlie the development of new tools and applications,
are showcased below. The areas are varied and have been
classified based on their foci of interest.
Natural Language Processing Technologies
The new Web is based on Natural
Language Processing, which aims to bring humans and
the digital world closer. Moving beyond statistical
tools that could at best emulate the Human Machine Interface
in a narrow manner, Natural Language Processing (NLP)
is the area where the major developments of W3C
will be undertaken. To ensure that Indian languages
are present on this new platform, new technologies
are being developed.
C-DAC GIST MT (Machine Translation) Evaluation Tool
"Machine translation (MT) is the application of computers to the task of translating texts from one natural language to another. One of the very earliest pursuits in computer science, MT has proved to be an elusive goal, but today a number of systems are available which produce output which, if not perfect, is of sufficient quality to be useful in a number of specific domains." This definition comes from the European Association for Machine Translation (EAMT), "an organization that serves the growing community of people interested in MT and translation tools, including users, developers, and researchers of this increasingly viable technology".
GIST Visual Thesaurus
GIST Visual Thesaurus presents a thesaurus-based word map for the entered word in an interactive and easily explorable manner. Its graphical visualization of the word map and word net makes it an easy tool to use and encourages learning. It accepts word input in Unicode for Indian languages; currently Hindi and Gujarati are supported. It is aimed at anyone who wants to find the right word for an idea, as well as at those who want to explore and learn a language.
GIST has to its credit the development of the first
Indian language spell-checkers under both DOS and WINDOWS.
The next generation of spell-checkers uses a new and
dynamic algorithm that permits a faster and more
efficient spell-check. The existing dictionaries
have been upgraded: the new dictionaries contain more
words and have been updated to suit the requirements
of the present-day world, where spell-checkers
are needed for the Web.
Spell-checkers are available as a stand-alone
utility and accept data in 8 Bit ISCII/PASCII as
well as Unicode (Big-Endian and Little-Endian) and UTF8.
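As an illustration of how a tool might accept Unicode input in more than one encoding form, the following is a minimal Python sketch that uses the byte-order mark (BOM) to choose between UTF8 and Big-Endian or Little-Endian UTF-16. The function name `decode_unicode_input` is hypothetical, the fallback to UTF8 when no BOM is present is an assumption, and ISCII/PASCII input is not handled here. It is not a description of how the GIST spell-checker works internally.

```python
import codecs

# Byte-order marks checked in order; each maps to the codec used for the payload.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
)


def decode_unicode_input(data: bytes) -> str:
    """Decode bytes as UTF-8 or UTF-16 (BE/LE), using the BOM when present.

    Assumption: input without a BOM is treated as UTF-8. UTF-32 and
    ISCII/PASCII are outside the scope of this sketch.
    """
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(encoding)
    return data.decode("utf-8")


word = "नमस्ते"
big_endian = codecs.BOM_UTF16_BE + word.encode("utf-16-be")
little_endian = codecs.BOM_UTF16_LE + word.encode("utf-16-le")
print(decode_unicode_input(big_endian) == decode_unicode_input(little_endian))
```

All three encoded forms of the same word decode to the same Python string, so later processing does not need to know how the text was stored.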
Grammar checkers are a must in India. They can be used
not only to detect incorrect grammar within a text but
also, and more importantly, to let the user ensure
that the correct grammatical forms have been used. The
tool can also be used by school children to master the
intricacies of Indian language grammar.
The checker handles the following cases:
1. Concord at the N.P. level
   Variable Adjectives of Quality
2. Concord at the V.P. level
   The Verb Group admits two types of checkers:
   1. An Intra-verb checker, which handles
      relations within the verbal string and is indifferent
      to the nature of the relations between the VP and the NP
   2. A VP-NP checker, which applies concord
      rules between the Verb and the Noun. The VP-NP checker
      handles all relational issues between the Verb and
      the N.P. it governs
3. Concord between the NP (Subject) and the Verb
   The checker validates Number, Person, Case and Gender rules
   between the Subject and the Verb
4. Stylistic features, which try to trap the most common errors committed by
   the native user
5. Fragments and Run-ons
A statistical analysis of readability in terms of the Flesch-Kincaid
index, together with other statistical tools, is also provided.
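For reference, the Flesch-Kincaid grade-level formula can be sketched in a few lines of Python. The constants below are the standard ones calibrated for English text; whether the Grammar checker uses this grade-level form, the related Flesch Reading Ease score, or constants recalibrated for Hindi is not stated here, so this is only an assumption-laden illustration. The function name and the choice to pass pre-computed counts are hypothetical.

```python
def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid grade level from pre-computed counts.

    Uses the standard English constants; counting words, sentences and
    syllables (especially for Indian scripts) is left to the caller.
    """
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# A passage of 100 words in 5 sentences with 150 syllables.
print(round(flesch_kincaid_grade(words=100, sentences=5, syllables=150), 2))
```

Longer sentences and more syllables per word both raise the score, which is why the index is usually reported alongside simple counts.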
A prototype of a first-ever Grammar-checker
for Hindi has been developed. The design of the checker
allows for easy adaptation to other languages.
The Grammar checker accepts data in
8 Bit ISCII/PASCII as well as Unicode (Big-Endian and
Little-Endian) and UTF8
Thesauri, which provide much more semantic information
than dictionaries, are a vital tool for search-engines,
data-mining and information retrieval. GIST has started
work on the creation of a Thesaurus Building
Engine, which will ensure that the structure
of the thesaurus, with its hyponyms and hyperonyms, is correctly indexed,
permitting fast information retrieval.
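To make the idea of an indexed hyponym/hyperonym structure concrete, here is a minimal Python sketch. It is not the design of the Thesaurus Building Engine: the class name `ThesaurusIndex`, its methods and the sample words are all hypothetical. The sketch stores each relation in both directions so that either lookup is a single dictionary access.

```python
from collections import defaultdict


class ThesaurusIndex:
    """Toy index of broader-term (hyperonym) and narrower-term (hyponym) links."""

    def __init__(self):
        self._hyponyms = defaultdict(set)    # broader term -> narrower terms
        self._hyperonyms = defaultdict(set)  # narrower term -> broader terms

    def add_relation(self, hyperonym: str, hyponym: str) -> None:
        # Record the link in both directions so both lookups stay fast.
        self._hyponyms[hyperonym].add(hyponym)
        self._hyperonyms[hyponym].add(hyperonym)

    def hyponyms(self, term: str) -> set:
        return set(self._hyponyms.get(term, ()))

    def hyperonyms(self, term: str) -> set:
        return set(self._hyperonyms.get(term, ()))

    def all_hyperonyms(self, term: str) -> set:
        """Every broader term reachable from `term`, at any depth."""
        seen = set()
        stack = [term]
        while stack:
            for parent in self._hyperonyms.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


index = ThesaurusIndex()
index.add_relation("animal", "bird")
index.add_relation("bird", "sparrow")
print(sorted(index.all_hyperonyms("sparrow")))
```

Keeping the inverse mapping costs extra memory but avoids scanning the whole thesaurus to answer "what is this word a kind of?".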
In a country like India, languages use scripts
belonging to the LATIN (Konkani), PERSO-ARABIC (Sindhi,
Kashmiri, Urdu) and BRAHMI (a majority of Indo-Aryan and
all Dravidian scripts) families. Transfer of content from
one base to another, especially of names, is therefore a
requirement for E-Governance, the Election Commission etc.
The first attempt to convert
names typed in English to Brahmi-based scripts was
NTRANS, in which a dictionary supported by strong
heuristics allowed for transliteration of names from
English. UTRANS permitted conversion of Hindi and Punjabi
to
to Transform English applications and websites into
Indian Language at the click of a button
Using a strong genetic algorithm
and statistical tools, a set of Transliteration Utilities
is under development, especially to bridge the gap
between the Latin, Brahmi and Perso-Arabic platforms, as
shown in the table below:
Name Conversion Utilities
These utilities allow for conversion of names
from one language platform to another.
These converters allow the user to transliterate
a text typed in Urdu to Hindi or to see a text typed
in Roman as Urdu
Dictionaries are a valuable database in a country like
India, where Cross-Lingual Information Querying systems
are urgently needed. They are also needed in areas such
as E-Governance, Teaching Systems and Search-Engines.
GIST has started work on developing dictionaries in
collaboration with the Language Boards and Academies
of the particular linguistic region. The dictionary
database can take the shape of a mono-lingual or bi-lingual
database, or it can be a dictionary of synonyms, antonyms
or idiomatic expressions common to the language.
Since dictionaries are often made by
hand using traditional indexes, a dictionary
validation and building tool has been created
to ensure that the dictionaries are properly indexed
and that the maximum information within the dictionary
The Homophone Engine is a sophisticated tool which searches
for look-alikes in Indian languages as well as in Indian
English. The problems treated here are mainly pertinent
to Indian names as written both in English as well as
in Indian scripts. However they could also be extended
to all alphabets and some examples show lacunae in script
systems other than Indian.
Homophone Engine - Problem Statement
A few of the major lacunae in existing English based
solutions are listed below:
a. Letter to Sound
English-based algorithms work with only the 26 basic
English letters and do not support any characters beyond
them. Extended character sets are not supported, hence
names with unusual letters (like é) may not be retrieved correctly.
Thus the name Barve will yield Barwe but not Barwé.
b. First Character
Algorithms based on English depend on the first letter
of the "tokenized word" to generate the key.
Someone looking for Firoze or Fali will not get Phiroze
or Phali, not to mention names spelled under the
influence of numerology, such as KKarishma.
There would be a lot of False Negatives in these cases.
c. Typos and Noise
Typos and noise are a fact of system data input. If
the operator typed "Katrik" instead of "Kartik",
the Key-based approach will not be able
to fetch the "Kartik" that we are looking for.
d. Name Variants
Existing English-based systems cannot handle
the multiple ways in which a name can be spelled. Thus
Chaudhary is spelled in around 34 different ways; Soundex
can at best trap around 14-15 and fails on the rest.
e. Homophonic names which are not homographs
Soundex and NYSIIS/Metaphone fail for names that use
silent letters and silent sounds. Some examples would
f. False Correct Results
Compare the Soundex code for "Sunil". Over
100 other names will show up. All Soundex derived algorithms
end up with these precision problems.
g. Name Sequence Variation
The British "First Name", "Middle Initial",
"Last Name" style is not followed in the entire
world. Name sequence variation is a cultural phenomenon
and is widespread in India. Some cultures have last
name first and first name last. Others keep only the
geographical name as their name and the "First
name" is stored as an Initial.
h. Diverse Name Databases
A name spelled one way in one state is spelled and pronounced
very differently in the neighbouring state. These problems
exist within different cultures living in the same state.
The problem is compounded by the system user or operator
who already knows a third spelling of the name. Thus
whereas Oriya and a majority of Dravidian Languages
will show the absence of the implicit vowel by a Halanta
sign, Hindi or Gujarati does not use this notation but
prefers that the final consonant has an implicit "a"
which is not pronounced.
i. Abbreviated Name
The Soundex Codes for "Bandopadhyaya" and
"Banerjee" are not the same. Existing English
algorithms fail to retrieve these equivalent names.
Similarly, commonly used nicknames such as Vainu for
Vainateya will not be mapped under a Soundex search.
For example, the name Mohammad can be abbreviated as
Md., Mmd., Mhd. or Mohd. There are numerous such examples.
j. Titles, Qualifiers
Titles and qualifiers such as Dr. or Prof. may occur
at a much higher frequency; in such scenarios
the key-based approach is overwhelmed.
k. Hyphenated name
A Soundex-based algorithmic search for hyphenated names
will not yield exact results.
Thus Abd-al-Razzaq ~ Abdul Razzaq ~ Abd-ur-Razzq will
not be displayed in Soundex as variants of the same
name.
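Several of the lacunae listed above can be reproduced with a short Python implementation of the classic American Soundex code. This is a sketch of the standard algorithm, written here for illustration; it is not code from any existing system, and the name "Sonali" is an example added here to show a false match.

```python
# Consonant groups of classic American Soundex; vowels, Y, H and W get no code.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex key (first letter plus three digits)."""
    # Anything outside A-Z (for example an accented letter) is simply dropped.
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = _CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate two letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code is kept after it
    return (first + "".join(digits) + "000")[:4]


for a, b in (("Firoze", "Phiroze"), ("Kartik", "Katrik"),
             ("Bandopadhyaya", "Banerjee"), ("Sunil", "Sonali")):
    print(a, soundex(a), b, soundex(b))
```

Firoze and Phiroze receive different keys because the key starts with the first letter, a single transposition separates Kartik from Katrik, and Bandopadhyaya and Banerjee do not match, while the unrelated names Sunil and Sonali collide on the same key.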
Homophone Engine - Solution
The solution developed by C-DAC attacks the
problem not only from a homophonic approach but also
from a Context Bound Name Grammar approach. Contextual
rules added to homophonic rules ensure that the
result is neither over-generative nor under-generative
but provides the best fit. This ensures that
Sunil does not map to the possibilities listed above
but maps to Suneel, Soonil, Sooneel, Sunneil and Suneil.
Only exact and correct homophones/homographs, including
abbreviations and name variants, are provided.
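The flavour of a rule-based homophonic key can be sketched in Python as follows. This is emphatically not the C-DAC engine or its rule set: the three rewrite rules and the function name `phonetic_key` are invented for this illustration, cover only the Sunil example, and ignore the contextual name grammar described above.

```python
import re

# Hypothetical rewrite rules: spelling variants of the same vowel sound.
_REWRITES = (("oo", "u"), ("ee", "i"), ("ei", "i"))


def phonetic_key(name: str) -> str:
    """Map spelling variants of a name to one key using toy rewrite rules."""
    key = name.lower()
    for old, new in _REWRITES:
        key = key.replace(old, new)
    # Collapse any run of a repeated letter ("nn" -> "n").
    return re.sub(r"(.)\1+", r"\1", key)


variants = ["Sunil", "Suneel", "Soonil", "Sooneel", "Sunneil", "Suneil"]
print({phonetic_key(v) for v in variants}, phonetic_key("Sonali"))
```

Because the key keeps the vowels instead of discarding them, the listed variants collapse to a single key while a different name such as Sonali stays distinct; a real system would need far richer, context-sensitive rules.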
Below are given examples to showcase
the application which at present is in a beta stage
of testing: We have three options in place: Results
for each are given below for two words: Chaudhury and
The HOMOPHONE ENGINE can be deployed
in a large number of applications including Spell-checkers,
Name Translation Utilities, Data mining applications
(such as Election Commission, Telephone Directory search),
IT databases where homographs need to be detected.
Lemmatisers are a must for higher-level Natural Language
Processing (NLP), especially if the word has to be correctly
tagged as to its categorical class. Lemmatisers have
a wide range of applications in areas as diverse as
Translation, Semantic Web, Data Mining, Natural Query
Systems to name only a few.
In addition coupled with the spell-checker
a typo is corrected and the lemmatized form of the typo
The Lemmatiser is available as a stand-alone utility
and accepts data in 8 Bit ISCII/PASCII as well as Unicode
(Big-Endian and Little-Endian) and UTF8.
Conversion of data to storage and vice-versa has been
a requirement for the complex scripts of India. Even
with the advent of Unicode, the need for Converters
is strong. More so in the area of embedded tools and
technologies where memory and speed are a must, converters
are still needed. GIST has been at the forefront of
this area and has to its credit the creation of ISFOC.
Ongoing research on script grammars
has resulted in converters which are bi-directional
and can move from storage to display and back
with a single DLL. A single generic engine handles all
the converters, which are extremely small in size. This
allows for easy and fast conversion, especially on embedded
devices, where memory is constrained.
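A single generic engine driven by small per-font rule tables might look like the following Python sketch. The class name `RuleConverter`, the greedy longest-match strategy and the two sample rules are all assumptions made for illustration; real storage-to-display conversion for Indian scripts also involves reordering and context, which this sketch leaves out.

```python
class RuleConverter:
    """Toy bi-directional converter driven by a one-to-one rule table."""

    def __init__(self, rules: dict):
        self._forward = dict(rules)                        # storage -> display
        self._backward = {v: k for k, v in rules.items()}  # display -> storage
        if len(self._backward) != len(self._forward):
            raise ValueError("rules must be one-to-one to be reversible")

    @staticmethod
    def _apply(table: dict, text: str) -> str:
        longest = max(map(len, table), default=0)
        out, i = [], 0
        while i < len(text):
            # Try the longest candidate first so multi-character rules win.
            for size in range(min(longest, len(text) - i), 0, -1):
                chunk = text[i:i + size]
                if chunk in table:
                    out.append(table[chunk])
                    i += size
                    break
            else:
                out.append(text[i])  # no rule applies: copy the character
                i += 1
        return "".join(out)

    def to_display(self, stored: str) -> str:
        return self._apply(self._forward, stored)

    def to_storage(self, displayed: str) -> str:
        return self._apply(self._backward, displayed)


# Invented rules: Unicode sequences mapped to made-up legacy font codes.
converter = RuleConverter({"\u0915": "d", "\u0915\u094d\u0937": "{"})
shown = converter.to_display("\u0915\u094d\u0937\u0915")
print(shown, converter.to_storage(shown) == "\u0915\u094d\u0937\u0915")
```

Since the engine itself contains no font-specific logic, supporting another font only means supplying another rule table, which keeps each converter small.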
third-party fonts are also available. The converter
rule file can be written by the font developer and is
extremely easy to build thanks to a user-friendly GUI,
which guides the user through writing the converter.