Mobile Based Application

Manual Annotation of Amharic News Items with Part-of-Speech Tags and its Challenges*

Abstract

Since September 2005, the Ethiopian Languages Research Center of Addis Ababa University has been engaged in a project called "The Annotation of Amharic News Documents". The project aimed to manually tag each Amharic word, in its context, with the most appropriate part of speech (POS). This paper presents the POS tagset developed for annotating the news documents, the procedures followed to tag them manually, and the problems encountered in the process. The major output of the work comprises 1065 Amharic news documents (constituting 210,000 prosodic words) manually annotated with part-of-speech tags, and a new tagset for the language derived from those 1065 news items. The outcome of the POS tagging project is expected to contribute greatly to future work in natural language processing of Amharic. This includes the development of probabilistic part-of-speech taggers (software that uses a lexicon as a component to automatically assign appropriate parts of speech to words, and that serves as a central component of higher-level NLP tools such as parsers) and noun-phrase chunkers (software tools that identify noun phrases in a text), as well as work in speech synthesis, speech recognition, information retrieval, word sense disambiguation, corpus analysis and computational lexicography of Amharic.

1. Introduction

In this paper we present a recently completed project of the Ethiopian Languages Research Center that deals with the part-of-speech (POS) tagging of Amharic news items. The project ran for four months, starting in September 2005. POS tagging is the process of assigning a POS or other lexical class marker to each word in a corpus (Jurafsky 2005). The project stems from the recognition of the lack of basic Amharic
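
The definition of POS tagging above, and the idea of a tagger that uses a lexicon derived from a manually annotated corpus, can be illustrated with a minimal sketch. This is not the project's method or tooling: it is a simple most-frequent-tag baseline, written in Python, and the function names (`train_unigram_lexicon`, `tag_words`), the placeholder tokens (`w1`, `w2`, ...) and the tags (`N`, `V`, `ADJ`, `UNK`) are all hypothetical and do not come from the project's tagset or data.

```python
from collections import Counter, defaultdict


def train_unigram_lexicon(tagged_sentences):
    """Build a word -> most frequent tag lexicon from manually tagged sentences.

    Each sentence is a list of (word, tag) pairs, as a human annotator
    might produce them.
    """
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    # Keep, for every word, the tag it received most often in the corpus.
    return {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}


def tag_words(words, lexicon, default_tag="UNK"):
    """Assign each word its lexicon tag; unseen words get default_tag."""
    return [(word, lexicon.get(word, default_tag)) for word in words]


# Hypothetical toy corpus: placeholder tokens and illustrative tags only.
toy_corpus = [
    [("w1", "N"), ("w2", "V")],
    [("w1", "N"), ("w3", "ADJ"), ("w2", "V")],
    [("w1", "V"), ("w2", "V")],
]

lexicon = train_unigram_lexicon(toy_corpus)
tagged = tag_words(["w1", "w3", "w9"], lexicon)
print(tagged)
```

In this toy run, `w1` is tagged `N` because it was annotated as `N` twice and as `V` once, and the unseen token `w9` falls back to the default tag. A probabilistic tagger of the kind mentioned in the abstract would go further and use the surrounding context to choose among a word's possible tags.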

References:

Alemu, A. and Asker, L. 2005. "Web Mining for an Amharic-English Bilingual Corpus", in Proceedings of the 1st International Conference on Web Information Systems and Technologies (WEBIST 2005), Miami.
Baye Yimam. 1987 E.C. yamariñña säwasiw (Amharic Grammar). Addis Ababa: EMPDA.
Demeke, Girma A. (forthcoming). Amharic Word Classes. WCAL 5, August 2006, Addis Ababa University.
Jurafsky, D. and Martin, J. H. 2000. Speech and Language Processing. Prentice Hall.
Mersehazen Wolde Kirkos. 1935 E.C. Amharic Grammar (text in Amharic). Addis Ababa: Artistic Printing Press.
Yacob, D. 1996. System for Ethiopic Representation in ASCII (SERA). http://www.abyssiniacybergateway.net/fidel/.

Addresses of the authors:

Girma A. Demeke
Ethiopian Languages Research Center, Director
Addis Ababa University
Email: girmaad@gmail.com

Mesfin Getachew
Faculty of Informatics, Department of Information Science
Addis Ababa University
Email: mesgetachew@yahoo.com
