Characteristics of Indian Languages

  • Published: January 27, 2013
MADHAVI VARALWAR and NIXON PATEL
Bhrigus Inc., Hyderabad, India
{madhaviv@bhrigus.com, npatel@bhrigus.com}

A text-to-speech (TTS) system often requires simple information such as the language of the input text, the voice gender (male/female) to be used, or whether a telephone number should be pronounced as isolated digits. Raw input text can be embedded with such information using XML-like tags, often referred to as Speech Synthesis Markup Language (SSML), which aims to help a TTS system produce better output in various contexts. In this position paper, we discuss some possible SSML extensions with Indian language scripts and the corresponding TTS systems in view.

1. INTRODUCTION :

Bhrigus Inc. is actively involved in developing TTS and ASR for Indian languages, and is currently developing a unit selection voice for Telugu. The goal is to build high-quality voices and speech recognition for many of the Indian languages and to interface them with computer-telephony applications. These applications span verticals such as entertainment, health care, and finance in the Indian context. In this paper, we describe the nature of the Indian languages and discuss our proposal for the additional SSML elements that we feel are required to improve the rendering of Indian languages.

FEATURES OF INDIAN LANGUAGES AND SCRIPTS

Some of the features of Indian languages and the scripts used to express them are :

PHONEME SET :

Indian languages have a more sophisticated notion of a character unit, the akshara, which forms the fundamental linguistic unit. An akshara consists of 0, 1, 2, or 3 consonants and a vowel. Words are made up of one or more aksharas. Each akshara can be pronounced independently, as the languages are completely phonetic. Aksharas with more than one consonant are called samyuktaksharas or combo-characters. The last of the consonants is the main one in a samyuktakshara. All Indian languages have essentially the same alphabet, derived from the Sanskrit alphabet.
This common alphabet contains 33 consonants and 15 vowels in common practice. An additional 3-4 consonants and 2-3 vowels are used in specific languages or in the classical forms of others. This difference is not very significant in practice. Individual consonants and vowels form the basic letters of the alphabet.
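
The akshara structure described in this section (zero to three consonants followed by a vowel) can be illustrated with a minimal sketch that splits a written word into akshara-like units. The sketch is not part of the original proposal. It assumes Devanagari text in Unicode, works on the written form only, and ignores pronunciation effects such as dropped inherent vowels. The names `split_aksharas` and `count_consonants` are hypothetical, and the character ranges are taken from the Unicode Devanagari block.

```python
import re

# Assumed Unicode Devanagari ranges (illustrative, not from the paper).
CONSONANT_CLASS = "[\u0915-\u0939\u0958-\u095F]"  # ka..ha plus precomposed nukta letters
CONSONANT = CONSONANT_CLASS + "\u093C?"           # consonant with optional nukta sign
VIRAMA = "\u094D"                                 # joins consonants into a cluster
MATRA = "[\u093E-\u094C]"                         # dependent vowel signs
INDEP_VOWEL = "[\u0904-\u0914]"                   # independent vowel letters
MODIFIER = "[\u0901-\u0903]"                      # candrabindu, anusvara, visarga

# One akshara-like unit is either
#   (consonant + virama)* consonant [matra | final virama]
# or a single independent vowel, followed by optional modifiers.
AKSHARA = re.compile(
    f"(?:(?:{CONSONANT}{VIRAMA})*{CONSONANT}(?:{MATRA}|{VIRAMA})?"
    f"|{INDEP_VOWEL}){MODIFIER}*"
)


def split_aksharas(word: str) -> list[str]:
    """Return the akshara-like units found in a Devanagari word."""
    return AKSHARA.findall(word)


def count_consonants(akshara: str) -> int:
    """Count the consonant letters in one unit (more than one: a samyuktakshara)."""
    return len(re.findall(CONSONANT_CLASS, akshara))
```

For example, `split_aksharas("संस्कृत")` yields three units, सं, स्कृ, and त. The middle unit contains two consonants, so it is a samyuktakshara whose main consonant is the last one, क.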

DIFFERENT GRAPHEMES :

The commonality in the alphabet does not extend to the graphic forms used to express it in print. Each language uses a different script consisting of dissimilar graphemes. Thus, printed matter in one script is inaccessible to readers of another. There are 10-12 major scripts in India. The Devanagari script is the most widely used one, being used to write Hindi (the most spoken language), Marathi, Konkani, and Nepali, the language of neighboring Nepal.

Different scripts use different philosophies for the individual graphemes and their combinations. Some have a head-line, or shirorekha, that persists for a whole word. Others have non-touching graphemes. The grapheme of one of the consonants is usually at the heart of the printed akshara. The vowel appears as a matra, or vowel modifier, which can appear to the left, right, above, or below the main grapheme, or in combinations of these positions. The supporting consonants of a samyuktakshara also appear as modifier graphemes to the left, right, above, or below the main one. These modifiers could be truncated or scaled-down forms of the basic consonant, but could also be completely different. They may touch each other or the main consonant in some cases, or may be separate. These rules are not consistent even within a script, and certainly not across scripts.

REDUPLICATION :

All languages employ reduplicated forms to varying degrees and for different functions, but extensive use of reduplication is a particular characteristic of Indian languages. In this section we present an overview of the different kinds of reduplicative expressions found in the subcontinent. This sets the...