GIST contributions towards Standardization
in Indian Language Computing - An Overview
Need for standards
- Even today, basic hardware systems and software applications
are designed and developed with only English
in mind. For Indian languages to gain acceptance and
wider use, the Indian language implementation
/ flavour needs to sit on top of existing applications
and hardware frameworks. GIST has a focus on all 22
official Indian languages. Of these, Assamese, Bengali,
Bodo, Dogri, Hindi, Gujarati, Kannada, Konkani, Marathi,
Malayalam, Maithili, Nepali, Oriya, Punjabi, Santhali,
Tamil and Telugu use a left-to-right writing style, while
Urdu, Sindhi and Kashmiri are mostly written right to
left. There are several overlaps: one language
may use multiple scripts (e.g. Konkani may be written
in Devanagari, Kannada or Roman), and one
script, such as Devanagari, may serve multiple languages.
For any application to reach the masses of
India, it is important that Information Technology
supports the various languages of India.
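The left-to-right / right-to-left split described above can be observed from Unicode character properties. The following is a minimal sketch, not a GIST tool: the function name `base_direction` and the sample words are illustrative assumptions, and the heuristic (the first strongly directional character decides) is a simplification of the full Unicode bidirectional algorithm.

```python
import unicodedata


def base_direction(text: str) -> str:
    """Guess the writing direction from the first strongly directional character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):  # right-to-left letter, Arabic letter
            return "rtl"
        if bidi_class == "L":          # left-to-right letter
            return "ltr"
    # Assumption for this sketch: fall back to LTR if no strong character is found.
    return "ltr"


# Illustrative sample words (language names written in their own scripts).
samples = {"Hindi": "हिन्दी", "Tamil": "தமிழ்", "Urdu": "اردو"}

for language, word in samples.items():
    print(language, base_direction(word))
```

An application could use such a check to pick a default text direction for user-entered content, though real layout engines rely on the complete bidirectional algorithm.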
On the Web and Mobile
platforms, GIST has researched various aspects of the
W3C recommendations and submitted its findings for
various languages, including those written in right-to-left scripts.
This activity is especially important for bridging
the digital divide and spreading the use of Indian
languages on modern media including television,
handheld PDAs, information access points, etc.
C-DAC GIST has participated
in various standardization activities pertaining to
language technology. It is also involved in standardization
of heritage scripts of India.
W3C (World Wide Web Consortium)
Under the aegis of DIT, C-DAC GIST has come up with
a draft report on the representation of seven
languages with respect to the various recommendations
of the W3C. Of these, four belong to the Brahmi family
and use Left To Right (LTR) mode to display the characters:
Gujarati, Marathi, Konkani and Dogri. The other three, Sindhi,
Kashmiri and Urdu, are Perso-Arabic and use the Right
To Left (RTL) mode of visual display. C-DAC GIST has
extensively researched the various aspects related
to Localization (l10n) and Internationalization (i18n).
The broad areas of research and recommendations include:
- Representation of Indian Languages on the World Wide Web
- Character encoding issues for Languages such
as Marathi, Konkani, Gujarati, Sindhi, Kashmiri,
Dogri, Urdu, etc. in UNICODE
- Language Names - RFC-3066
- CSS and Text Formatting
- CLDR - Common Locale Data Repository
- Mobile Web Initiative - Recommendations for correct
representation of Indian Language on Mobile PDA
- Internationalization Tag Set - XML Tags, which
help in localization like translate, ruby, direction,
Dynamic CSS Tester
Dynamic CSS Tester is a tool for comparing the effects of various CSS styles as they are applied to UTF-8 data. It lets you preview different style sets side by side on the same text. This case study aims to investigate issues in the rendering or display of Indian language content in UTF-8 and the effects of various CSS styles on it. To use it, enter the text you would like to preview, then modify the styles until you find a style set you want. If you see any problem with the applied CSS, take a screenshot of the problem and mail it to us using the Feedback link given on the same page. If you feel that your data is not rendered correctly in the mail, click the GetSample button on the page, copy the code it generates in the text box next to the button, and paste that into the mail. GIST will verify and consolidate the feedback you send and forward it to the W3C for further action.
Internationalized Domain Names
In this age of Information Technology (IT), with the entire globe integrated into a web-linked village and knowledge as the sole differentiator, the development of convivial access technology has gained prime importance. Especially for India, with its diverse, multilingual heritage and culture, the Internet is expected to play a dominant role in integrating almost all aspects of social and economic endeavour.
GIST undertook research and study of various RFCs and
their applicability vis-à-vis Indian languages under
the guidance of the DIT.
The research focused on domain names in Indian
languages such as Hindi, Gujarati, Urdu, etc. and included:
- NamePrep: a StringPrep profile - RFC-3491
- PunyCode: Bootstring encoding - RFC-3492
- StringPrep - RFC-3454
- Internationalized Resource Identifiers - RFC-3987, path of IDN, etc.
GIST has submitted recommendations and reports on
possible pitfalls, such as phishing (online fraud arising
from similar-looking URLs), when implementing IDN
in Indian languages.
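To make the NamePrep and PunyCode steps concrete, here is a minimal sketch using the `idna` and `punycode` codecs built into Python. It is an illustration only, not GIST's implementation: the helper names `to_ace` and `from_ace` and the sample label are assumptions, and Python's built-in codec follows the older IDNA 2003 rules, so a production registry would need to check the rules it actually applies.

```python
def to_ace(label: str) -> str:
    """Convert one Unicode domain label to its ASCII-compatible form.

    The built-in "idna" codec applies Nameprep (a Stringprep profile)
    and then Punycode, adding the "xn--" prefix.
    """
    return label.encode("idna").decode("ascii")


def from_ace(ace_label: str) -> str:
    """Convert an ASCII-compatible label back to Unicode."""
    return ace_label.encode("ascii").decode("idna")


label = "भारत"  # illustrative Devanagari label
ace = to_ace(label)
print(ace)            # ASCII form beginning with "xn--"
print(from_ace(ace))  # round-trips to the original label

# The part after "xn--" is plain Punycode (Bootstring) output.
print(ace[4:].encode("ascii").decode("punycode"))
```

Because visually similar characters from different scripts produce entirely different ASCII forms, comparing the ASCII forms alone does not reveal look-alike names; that is one reason the phishing concern above needs script-level policy rather than encoding alone.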
GIST has contributed to recommendations
related to the entire lifecycle of developing Indian
language compliant e-governance applications.
These recommendations arise from
C-DAC GIST's expertise in Indian languages and use of
GIST tools and technologies in various large-scale,
Indian language data-centric e-governance projects.
C-DAC GIST Tools have been used
in several turnkey G2C (Government to Citizen) applications
both at state and central level. GIST has also assisted
several agencies in implementing various medium and
It also participates in various
forums for standardizations of the languages of India.
GIST is working towards standardization
of storage, input and display for Bodo,
Santhali, Dogri, Maithili, etc., which have been added
recently to the list of official languages.
Linguistic Formats and Heritage
- Dictionary creation tools, formats and tagging
tools have been recommended for various languages.
These formats will streamline creation of digital
- C-DAC GIST is highly involved in standardization
of all Information Technology related aspects of
Heritage scripts such as Vedic, SamaVedic, Grantha,
- Several C-DAC GIST Tools have been used in various
Digital Library related projects.
- UNICODE (ISCII 88 based) - Today UNICODE is the most widely accepted
and supported encoding for Indian languages.
C-DAC GIST has contributed to the representation
of Indian languages in storage standards such as
UNICODE. Some recommendations, especially for
Urdu, Sindhi, Kashmiri, Dogri, Bodo, Santhali
and Maithili, are at various stages of review and
finalization. UNICODE is a character encoding system.
The Arabic block starting at 0x600 (used for the PersoArabic
scripts of India) and the Devanagari block starting at 0x900
(Hindi, Marathi, etc.) together cover the largest
number of Indian languages.
- The UNICODE consortium has come up with an evolving
standard, currently at version 5. Changes to the representation
of several Indian languages are still in progress.
The need for normalisation (e.g. multiple representations
of characters with Nukta), the need for a collation
sequence and sort order, ZWJ and ZWNJ issues, and issues
related to Internationalized Domain Names (IDN)
are being looked into by GIST R&D.
- As with most standards and recommendations,
compliance is a major concern. In the absence
of a certifying authority, inadequate or faulty
support in applications and in rendering and display
engines, along with slow or expensive updates, is a major
bottleneck to the proliferation of Indian languages.
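The Nukta normalisation issue mentioned above can be illustrated with Python's standard `unicodedata` module. This is a small sketch under stated assumptions: the helper `same_text` is a hypothetical name, and the choice of NFC as the comparison form is an assumption for illustration, not a GIST recommendation.

```python
import unicodedata

# Two encodings of the same Devanagari letter with Nukta:
precomposed = "\u0958"       # DEVANAGARI LETTER QA (single code point)
decomposed = "\u0915\u093C"  # DEVANAGARI LETTER KA + DEVANAGARI SIGN NUKTA


def same_text(a: str, b: str) -> bool:
    """Compare two strings after Unicode normalisation (NFC assumed here)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(precomposed == decomposed)           # raw code point comparison
print(same_text(precomposed, decomposed))  # comparison after normalisation
```

Without a normalisation step, searching, sorting and domain-name matching can treat these two visually identical forms as different text.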
ISCII - The Bureau of Indian Standards (BIS) has adopted
the Indian Script Code for Information
Interchange (ISCII) as IS 13194:1991. The 8-bit flavour is
the most commonly used standard and it makes minimal
demands on CPU and memory. ISCII is a ‘character
based encoding system’. These standards define a common
phonetic alphabetic set (character set).
- PASCII: PersoArabic Script
Code for Information Interchange. For Urdu, Sindhi
and Kashmiri - C-DAC GIST introduced standards for
these languages when there were none available.
Under the TDIL Initiative, various standards for
storage, representation and entry were recommended.
Several of these today find a place in the current
industry standards such as UNICODE.
- ISCLAP: Standard for Pager communication:
In 1997, Pager Technology was considered mature enough
to adopt Indian languages. Motorola took the
lead and requested standardization of a coding scheme
for Devanagari and Gujarati. The Telecom Engineering
Center (TEC), C-DAC, DoE, and the pager manufacturers
agreed on the formulation of the "Indian Standard
Code for Language Paging". This was done keeping
compatibility with ISCII in mind, so that data
conversion at the sending end from terminals and
at the receiving end would be based on simple formulae.
Keyboard Layout - INSCRIPT is a part of
the ISCII standard and is supported by major OS vendors
and applications. INSCRIPT is based on the phonetic
nature of Indian scripts. The BIS ISCII document
(IS 13194:1991) also describes the keyboard layout
for each script. This traditional keyboard is widely
accepted and supported by most of the Multinational
Companies (MNCs) that support Indian languages.
- The traditional INSCRIPT layout is systematic.
It places consonants on the right and vowels on the
left. It also has a phonetic base, with higher consonants
and higher vowels on the shift of the same key, thereby
increasing typing speed.
- GIST has also developed Limited Keys Input mechanism,
prediction algorithms and smart writing systems,
which reduce the number of keys required for Indian
OPEN TYPE FONTS - For UNICODE support in various applications, GIST
Labs has developed Open Type Fonts for various scripts
including Urdu (Naskh as well as Nastaleeq/Nastaliq),
Sindhi and Kashmiri. Various modern OS today support
OT Fonts for viewing UNICODE data. Several GIST Tools
have also been upgraded to support the OT-Font technology.
ISFOC - Intelligence based Script FOnt Code:
The primary rule of thumb for typography is - If the text does not look good we do not
feel like reading it. Good typography is characterized
by well-structured letterforms in a particular font,
pleasant inter-letter spacing, ideal word spacing and
healthy line spacing. Emphasis has been placed on text
compositions (horizontal as well as vertical) and final
reproduction on output devices such as screen and printers,
aesthetic rendering and display for True Type Fonts.
- Bilingual font - Allows representation
of English as well as the Indian language of choice.
Supports bare minimum features of any script. Ideal
for developing applications having bilingual data,
because it supports English as well as one Indian
- Monolingual font - Represents
many more combinational characters than
bilingual fonts. Recommended if you are looking
for a pure representation of the script.
- Bilingual-Web font / Monolingual-Web font - C-DAC GIST has recommended the use of web-font types.
The non-web fonts are supported only for backward
compatibility. Using these font types makes applications
more robust and immune to problems related to the
display of Indian languages. For some scripts like
Tamil, GIST recommends using only these types of
- ISO-8859 compliant fonts for
use with specialized applications have also been
developed and deployed for various Indian languages.
ISO fonts are used for Linux and some Windows applications
(e.g. Oracle D2K, 9iAS, Crystal Reports, .NET, etc.)
- Note: Typing is independent of
Font Type in use.
Naming conventions for GIST TRUE TYPE (TT
1. A. Mnemonics :
Assamese (AS), Bengali (BN),
Devanagari (DV - catering to Hindi, Marathi, etc.),
Gujarati (GJ), Kannada (KN), Malayalam (ML), Manipuri
(MN), Oriya (OR), Punjabi (PN), Tamil (TM), Telugu (TL)
1. B. Corresponding Bilingual : ASB, BNB, DVB,
GJB, KNB, MLB, MNB, ORB, PNB, TMB, TLB
1. C. Corresponding Bilingual Web : ASBW, BNBW,
DVBW, GJBW, KNBW, MLBW, MNBW, ORBW, PNBW, TMBW, TLBW
1. D. Corresponding Monolingual Web : ASW, BNW,
DVW, GJW, KNW, MLW, MNW, ORW, PNW, TMW, TLW
2. Followed by hyphen
3. TT - indicating True Type Font
4. Name of font Surekh, Yogesh, Mukta,
5. Numerals: EN - English numerals (optional)
Tamil, Telugu and Malayalam support only English numerals
- Tamil99 has Mnemonics - TAB for
bilingual and TAM for monolingual
Example : "GJBW-TTAvantikaEN" is
- Font for Gujarati Bilingual ISFOC data.
- It is a Web-Font
- It is a True-Type font
- It is identified or named as Avantika
- EN indicates that the font has English Numerals.
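The naming convention above is regular enough to be checked mechanically. The following is a minimal sketch of a parser for it, written as an illustration rather than as a GIST utility: the names `GistFontName`, `parse_gist_font_name`, `SCRIPTS` and `PATTERN` are hypothetical, and the sketch assumes font names contain only letters and does not cover the separate Tamil99 mnemonics (TAB / TAM).

```python
import re
from dataclasses import dataclass

# Script mnemonics as listed in the naming convention (item 1. A).
SCRIPTS = {
    "AS": "Assamese", "BN": "Bengali", "DV": "Devanagari", "GJ": "Gujarati",
    "KN": "Kannada", "ML": "Malayalam", "MN": "Manipuri", "OR": "Oriya",
    "PN": "Punjabi", "TM": "Tamil", "TL": "Telugu",
}

# <script>[B][W]-TT<name>[EN]; the name is matched lazily so that a
# trailing "EN" is read as the numerals marker, not as part of the name.
PATTERN = re.compile(
    r"^(?P<script>[A-Z]{2})(?P<bilingual>B?)(?P<web>W?)"
    r"-TT(?P<name>[A-Za-z]+?)(?P<numerals>EN)?$"
)


@dataclass
class GistFontName:
    script: str
    bilingual: bool
    web: bool
    name: str
    english_numerals: bool


def parse_gist_font_name(font_name: str) -> GistFontName:
    """Split a GIST True Type font name into its documented components."""
    match = PATTERN.match(font_name)
    if match is None or match["script"] not in SCRIPTS:
        raise ValueError(f"not a recognised GIST TT font name: {font_name!r}")
    return GistFontName(
        script=SCRIPTS[match["script"]],
        bilingual=bool(match["bilingual"]),
        web=bool(match["web"]),
        name=match["name"],
        english_numerals=match["numerals"] is not None,
    )


print(parse_gist_font_name("GJBW-TTAvantikaEN"))
```

For the example name from the text, the result describes a Gujarati, bilingual, web font named Avantika with English numerals, matching the breakdown given above.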