An Efficient Tamil Text Compaction System

N.M. Revathi, G.P. Shanthi, Elanchezhiyan K., T.V. Geetha, Ranjani Parthasarathi & Madhan Karky
Tamil Computing Lab (TaCoLa), College of Engineering Guindy, Anna University, Chennai.
haisweety18@gmail.com, jijutodo@gmail.com, madhankarky@gmail.com

Abstract
Tamil is steadily becoming the language of online communication and mobile text messaging for many Tamils around the world. Social networks and mobile platforms now extensively support Unicode and applications for keying in Tamil text. The number of characters in a message is limited on some social networks and in mobile text messages. Compacting text is therefore essential, as it saves online storage space and cost, among other factors. This paper proposes a text compaction system for Tamil, the first of its kind. The system handles common Tamil words, acronyms/abbreviations and numbers. A morphological analyzer [1] and a morphological generator are used to stem inflected words and to restore the inflections after the stems have been replaced with compact forms from a mapping repository. The proposed system was tested with over 10,000 words, and the output was found to be reduced to 40% of the original text. The paper concludes by discussing possible extensions to the system.

1. Introduction:
In all languages, the use of compact or short forms of words in text messages, emails and blogs is rapidly increasing. It is particularly popular among young urbanites because it allows voiceless communication, which is useful in noisy environments that would defeat a voice conversation, and buffered communication, since the receiver can access the sender's message at any time. Compacting text is necessary because of the limited message length on blog sites and the tiny user interface of mobile phones. There is no fixed rule for obtaining the shortest form of a word; the main aim is that the compact word should be understood by everyone. Compact words can be obtained by omitting letters and by replacing prefixes and suffixes with suitable symbols and numbers. As a result, compaction is credited with creating a language of its own. This paper proposes a text compaction system for Tamil, the first of its kind in Tamil.

2. Background:
Tamil is perhaps the only classical language whose glorious literature dates back to the pre-Christian era and which has remained in continuous use for millennia. Due to the untiring efforts of scholars, researchers and enthusiasts, it has also evolved creatively over the years, to the extent that it is used profusely today in computers, the internet, mobile phones and so on. Diverse creative efforts are under way that would pave the way for a quantum jump in the usage of Tamil in information technology; the “Tamil Virtual University”, the “Centre for Research and Applications of Tamil in Internet” and the “Tamil Software Development Fund” are a few examples. These efforts motivated the proposal of a compaction system for Tamil.

Many compaction systems have been developed for English and other languages. Lee Ming Fung [2] proposed a short form identification and categorization model based on maximum entropy, which distinguishes short forms from actual words and acronyms/abbreviations and categorizes the short forms into those formed by letter omission and those formed by phonetic substitution of parts of words. In the proposed system, compact words are formed in a variety of ways, such as omission, truncation and phonetic substitution. Acronym identification and detection has been much researched. Acrophile [3] automatically searches for acronym-expansion pairs in domain-specific databases. By acronym-expansion pairs, we refer to pairs each containing an acronym and its full expanded form or meaning. This paper makes use of acronym-expansion pairs to replace the full expanded form with the acronym.

3. Text Compaction Framework:
The figure below presents the various components of the framework.

3.1 Input Processing
The input text is tokenized on a delimiter and passed to the Morphological Analyzer. The analyzer removes the suffix (if present) attached to each word and delivers the root word (RW). For example, if the input to the analyzer is கணி பொறியி, the output is கணி பொறி.

3.2 Identification of the type
The system handles three categories of words: common Tamil words, abbreviations/acronyms, and numbers. The category to which the RW belongs must be identified next. To decide whether the RW is an abbreviation/acronym, it is compared with the keys of the hash map (Section 3.3). If a key matches, the RW is treated as an abnormal word (AW), i.e. it belongs to the category of acronyms/abbreviations; otherwise it is treated as a normal word (NW), i.e. it belongs to either the first or the third category.
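The input processing and type identification steps can be sketched roughly as follows. This is only an illustration under stated assumptions: the morphological analyzer is replaced by a hypothetical lookup table (`TOY_ROOTS`), the abbreviation keys (`ABBREVIATION_KEYS`) are invented sample data, and the function names are not taken from the paper.

```python
from enum import Enum


class Category(Enum):
    ABNORMAL = "acronym/abbreviation"    # AW in the text
    NORMAL = "common word or number"     # NW in the text


# Hypothetical stand-in for the morphological analyzer:
# maps an inflected form to its root word (RW).
TOY_ROOTS = {"வீட்டில்": "வீடு"}

# Hypothetical keys of the abbreviation hash map
# (first words of phrases that have an abbreviated form).
ABBREVIATION_KEYS = {"அண்ணா"}


def tokenize(text, delimiter=" "):
    """Split the input on a delimiter and drop empty tokens."""
    return [token for token in text.split(delimiter) if token]


def analyze(token):
    """Return the root word; unknown tokens are returned unchanged."""
    return TOY_ROOTS.get(token, token)


def classify(root_word):
    """A root word that matches a hash-map key is abnormal, otherwise normal."""
    if root_word in ABBREVIATION_KEYS:
        return Category.ABNORMAL
    return Category.NORMAL


def process(text):
    """Tokenize, stem and classify every token of the input text."""
    results = []
    for token in tokenize(text):
        root_word = analyze(token)
        results.append((root_word, classify(root_word)))
    return results
```

A real system would call the actual analyzer in place of `analyze`; the lookup table is used here only to keep the sketch self-contained.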

3.3 Extraction of the compact word
If the word is identified as a normal word, it is passed to a binary search tree that is built dynamically from the set of words already stored in the dictionary. The NW is searched in the tree and, when it is found, its compact word is retrieved with a mapping algorithm that maps each normal word to its compact word. If the word is an abnormal word, its compact word is retrieved as follows. A linked hash map is built for all the abbreviated words, using the first word of each abbreviated phrase as its key, and the compact word is again retrieved through the mapping algorithm. If the NW is a number name, it is replaced with numerals based on the place value system.

3.4 Output Processing
The extracted compact word is passed to the Tamil Morphological Generator, which adds the suitable suffix so that the output conforms to the rules of the language.
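The extraction step might look like the minimal sketch below. It makes several assumptions that are not in the paper: plain Python dictionaries stand in for both the binary search tree and the linked hash map, all table entries (including the compact forms) are invented sample data, and only single number names are converted, so composing larger numbers by place value is left out. Suffix restoration by the morphological generator is not modelled.

```python
# Hypothetical repository: normal word -> compact word (invented entry).
COMPACT_WORDS = {"வணக்கம்": "வண"}

# Hypothetical abbreviation map keyed by the first word of the full phrase.
# Each value lists (full phrase as a tuple of words, abbreviated form).
ABBREVIATIONS = {
    "அண்ணா": [(("அண்ணா", "பல்கலைக்கழகம்"), "அ.ப.")],
}

# Single number names only; place-value composition is not handled here.
NUMBER_NAMES = {"ஒன்று": "1", "இரண்டு": "2", "பத்து": "10", "நூறு": "100"}


def compact(tokens):
    """Replace phrases, number names and known words by compact forms."""
    output = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        matched = False
        # Abnormal word: try every phrase registered under this first word.
        for phrase, short_form in ABBREVIATIONS.get(token, []):
            if tuple(tokens[i:i + len(phrase)]) == phrase:
                output.append(short_form)
                i += len(phrase)
                matched = True
                break
        if matched:
            continue
        # Normal word: number name first, then the compact-word repository;
        # words found in neither table are kept unchanged.
        if token in NUMBER_NAMES:
            output.append(NUMBER_NAMES[token])
        else:
            output.append(COMPACT_WORDS.get(token, token))
        i += 1
    return output
```

Keying the abbreviation map by the first word means that only tokens that can start a phrase trigger the longer phrase comparison, which matches the check described in Section 3.2.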

4. Results and Analysis:
The paper proposes the following layout for displaying the results to the user. It has two text areas: the one on the left is for entering the input text and the one on the right is for displaying the output. The user can also view the number of characters by which the output text has been reduced.

The efficiency of the system can be calculated as (number of characters in the output text / number of characters in the input text) × 100%. The proposed system was tested with over 10,000 words, and the output was found to be reduced to 40% of the original text.
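As a worked illustration of this measure, the sketch below computes the character counts and reports the ratio in both directions, since a text reduced to 40% of its original length has an input-to-output ratio of 250%. The function name and the returned keys are hypothetical, and `len` counts Unicode code points, which is an assumption about how characters are counted: a single Tamil letter with a vowel sign spans more than one code point.

```python
def compaction_stats(original, compacted):
    """Return character counts and size ratios for a compaction result."""
    n_in = len(original)    # code points in the input text
    n_out = len(compacted)  # code points in the output text
    if n_in == 0 or n_out == 0:
        raise ValueError("both texts must be non-empty")
    return {
        "saved_characters": n_in - n_out,
        "output_percent_of_input": 100.0 * n_out / n_in,
        "input_percent_of_output": 100.0 * n_in / n_out,
    }
```

For example, an input of 10 characters compacted to 4 characters saves 6 characters, and the output is 40% of the input.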

5. Conclusion and Future work:
The paper describes the Tamil Compaction System, a framework for shrinking text such that its meaning remains the same. The subsystems and components of the framework are described in detail. Results from the implementation of the framework are provided and compared against third-party compaction applications of social networking sites that are implemented for the English language. Improving the mapping for frequently used words, conceptual reduction and the integration of a numerical analyser will take this system to its next level.

References:
[1] Anandan, R. Parthasarathi, and T.V. Geetha. Morphological Analyser for Tamil. ICON 2002, 2002.
[2] L. M. Fung. SMS short form identification and codec. Unpublished master’s thesis, National University of Singapore, Singapore, 2005.
[3] L. S. Larkey, P. Ogilvie, M. A. Price, and B. Tamilio. Acrophile: a system that automatically searches acronym expansion pairs. 2000.
[4] R. E. Beasley. Short Message Service (SMS) Texting Symbols: A Functional Analysis of 10,000 Cellular Phone Text Messages. Franklin College.
