Abstract
Since September 2005, the Ethiopian Languages Research Center of Addis Ababa University has been engaged in a project called "The Annotation of Amharic News Documents". The project aimed to manually tag each Amharic word, in its context, with the most appropriate part of speech.
This paper presents the POS tagset developed for annotating the news documents, the problems encountered while tagging them, and the procedures followed in the manual tagging. The major outputs of the work are 1065 Amharic news documents (comprising 210,000 prosodic words) manually annotated with parts of speech, and a new tagset for the language derived from those 1065 news items.
The outcome of the POS tagging project is expected to contribute substantially to future work in natural language processing of Amharic. This includes the development of probabilistic part-of-speech taggers (software that automatically assigns each word an appropriate part of speech, using a lexicon as one component, and that serves as a central component of higher-level NLP tools such as parsers) and noun-phrase chunkers (software tools that identify noun phrases in a text), as well as work in speech synthesis, speech recognition, information retrieval, word sense disambiguation, corpus analysis and computational lexicography of Amharic.
1. Introduction
In this paper we present a recently completed project of the Ethiopian Languages Research Center on the part-of-speech (POS) tagging of Amharic news items. The project began in September 2005 and ran for four months.
POS tagging is the process of assigning a POS or other lexical class marker to each word in a corpus (Jurafsky 2005). The project stems from the recognized lack of basic Amharic resources that would enable researchers to construct new resources (Alemu 2005, Getachew 2001). One such basic resource is a large corpus in the language annotated with POS information. Accordingly, the prime objective of the project was to manually tag the 210,000 prosodic words that occur in the 1065 Amharic news documents with appropriate POS or morpho-syntactic categories.
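To make the definition concrete, the following minimal Python sketch shows what "assigning a POS to each word" looks like as data, and how a manually tagged corpus could later be used to build a simple lexicon-based tagger. It is a most-frequent-tag baseline, not the method used in this project, whose tagging was done manually. The tokens (`w1`, `w2`, ...), the tag labels (`N`, `V`, `ADJ`) and the function names are hypothetical placeholders; they are not the project's tagset or data.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: each sentence is a list of (word, tag) pairs,
# standing in for manually annotated text. Tokens and tags are placeholders.
TOY_CORPUS = [
    [("w1", "N"), ("w2", "V")],
    [("w1", "N"), ("w3", "ADJ"), ("w4", "N")],
    [("w1", "V"), ("w4", "N")],
]


def train_lexicon(tagged_sentences):
    """Build a word -> most frequent tag lexicon from tagged sentences.

    Also returns the overall most frequent tag, used for unseen words.
    """
    tag_counts_per_word = defaultdict(Counter)
    overall_tag_counts = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            tag_counts_per_word[word][tag] += 1
            overall_tag_counts[tag] += 1
    lexicon = {
        word: counts.most_common(1)[0][0]
        for word, counts in tag_counts_per_word.items()
    }
    default_tag = overall_tag_counts.most_common(1)[0][0]
    return lexicon, default_tag


def tag_words(words, lexicon, default_tag):
    """Assign each word its lexicon tag, or the default tag if unseen."""
    return [(word, lexicon.get(word, default_tag)) for word in words]


if __name__ == "__main__":
    lexicon, default_tag = train_lexicon(TOY_CORPUS)
    print(tag_words(["w1", "w2", "w5"], lexicon, default_tag))
```

In this sketch an ambiguous word such as `w1` always receives its most frequent tag regardless of context, which is exactly the limitation that context-sensitive manual annotation, and the probabilistic taggers trained on it, are meant to overcome.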
The news documents were donated by Walta Information Center, preprocessed by Dr. Lars Asker and Atelach Alemu of the University of Stockholm, and provided to the Center in electronic form by Dr. Lars Asker. Details of the preprocessing are available in Alemu (2005). Nine people, most of them from the Center, were involved in the actual manual tagging of the 1065 news documents. In addition, one technical assistant and four administrative support staff were involved at various levels during the project.
As it was not feasible to use an English tagset by directly projecting English tags onto Amharic, a new POS tagset, derived from the 210,000 prosodic words of the Amharic news documents, was developed for tagging these documents. Tagsets of other languages (English, Korean, Arabic, and French), the nature of the Amharic language itself and other constraints (such as available funding, the qualifications of the annotators and the time allotted for the project) were carefully considered in selecting the tags to include in the tagset.
The outcome of the POS tagging project is expected to contribute substantially to future work in Natural Language Processing of Amharic, including the development of probabilistic part-of-speech taggers (software that automatically assigns each word an appropriate part of speech, using a lexicon as one component, and that serves as a central component of higher-level NLP tools such as parsers), a noun-phrase chunker (a software tool in computational linguistics that seeks to identify noun phrases in a text) and...