Development of Bengali Language Stemmer

  • Topic: Information retrieval, Natural language processing, Word stem
  • Published : May 17, 2012
Project Report
Submitted in Partial Fulfillment of the Requirements for the Degree of

Bachelor of Technology

Submitted by

Barnan Das
&

Tanmoy Pal

Under the guidance of

Dr. Pabitra Mitra
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur

CERTIFICATE

This is to certify that the project report entitled “Development of Bengali Language Stemmer” is a record of bona fide work carried out by Mr. Barnan Das & Mr. Tanmoy Pal of Bengal College of Engineering and Technology, Durgapur under my supervision and guidance, as part of their Final Year Project 2009, at the Indian Institute of Technology, Kharagpur.

Dr. Pabitra Mitra
Dept. of Computer Science and Engineering
Indian Institute of Technology, Kharagpur
Date:
Place: Kharagpur

ABSTRACT

Ever since people recognized the importance of information, it has been necessary to archive it in a way that makes it easy to retrieve later. The advent of computers made it possible to store large amounts of data, and retrieving that data became a necessity. The field of Information Retrieval (IR) emerged in the 1950s, and since then many IR systems have been developed and are used every day by millions of people all over the world. Because English is so widely used, most IR systems, whether web-based or stand-alone, are built for English documents. Comparatively little has been done for Bengali documents, even though Bengali is the fourth largest language in the world. There is a great need for technology to process Bengali text, and a particularly important task is building a search engine for Bengali documents. Many of the technologies required for this are yet to be developed for Bengali.

The goal of this project is to develop such technologies for Bengali, with the primary focus on algorithms for stemming. Stemming is the process of reducing a word to its stem or root form; because it increases recall, it is widely used in information retrieval tasks. Popular stemmers such as Porter's and Lovins' encode a large number of language-specific rules built up over a long period of time. Bengali, however, is a highly inflectional language in which one root may have more than 20 morphological variants, so formulating a set of stemming rules is difficult and time-consuming. In this project we attempt to implement a clustering-based stemming algorithm proposed by P. Majumder, M. Mitra, S.K. Parui, G. Kole, P. Mitra and K. Dutta in their paper "YASS: Yet Another Suffix Stripper", ACM Transactions on Information Systems, Vol. 25, No. 4, pp. 18-38, October 2007.
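To make the idea of clustering-based stemming concrete, the following is a minimal Python sketch, not the implementation described in this report or in the cited paper. It assumes a simple prefix-sensitive string distance (`prefix_distance`), a naive complete-linkage clustering with a cut-off (`cluster_words`), and the longest common prefix of each cluster as its stem (`build_stem_map`). All three function names, the exact distance formula, and the threshold value are illustrative assumptions; English words are used only to keep the example readable.

```python
import os


def prefix_distance(a: str, b: str) -> float:
    """Illustrative distance (an assumption, not taken from the report):
    small when two words share a long prefix, infinite when they differ
    at the very first character."""
    n = max(len(a), len(b))
    # Pad the shorter word so both can be compared position by position.
    a, b = a.ljust(n, "\0"), b.ljust(n, "\0")
    # m is the index of the first mismatching character (n if none).
    m = next((i for i in range(n) if a[i] != b[i]), n)
    if m == n:
        return 0.0
    if m == 0:
        return float("inf")
    # Longer mismatching tails and shorter shared prefixes cost more.
    return (n - m) / m * sum(0.5 ** (i - m) for i in range(m, n))


def cluster_words(words, threshold):
    """Naive complete-linkage agglomerative clustering: repeatedly merge
    the two closest clusters until no pair is within the threshold."""
    clusters = [[w] for w in words]
    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: distance between the farthest members.
                d = max(prefix_distance(x, y)
                        for x in clusters[i] for y in clusters[j])
                if d <= threshold and (best is None or d < best[0]):
                    best = (d, i, j)
        if best is None:
            return clusters
        _, i, j = best
        clusters[i].extend(clusters.pop(j))


def build_stem_map(words, threshold=1.5):
    """Map every word to the longest common prefix of its cluster."""
    stem_map = {}
    for group in cluster_words(words, threshold):
        # os.path.commonprefix compares strings character by character.
        stem = os.path.commonprefix(group)
        for word in group:
            stem_map[word] = stem
    return stem_map


if __name__ == "__main__":
    vocabulary = ["astronomer", "astronomy", "astronomical",
                  "walk", "walked", "walking"]
    for word, stem in build_stem_map(vocabulary).items():
        print(word, "->", stem)
```

No hand-written suffix rules appear anywhere in the sketch: the groupings come only from the word list and the chosen threshold, which is the property that makes this style of stemmer attractive for a highly inflectional language. The quadratic pairwise search is acceptable for a toy vocabulary but would need a more efficient clustering routine for a real lexicon.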

ACKNOWLEDGEMENTS

We take this opportunity to express our deep sense of gratitude to our guide Dr. Pabitra Mitra for his guidance, support and inspiration throughout the duration of the work.

Also, we would like to thank Dr. (Prof.) Subrata Dasgupta, Head of the Department, Computer Science and Engineering, Bengal College of Engineering and Technology, Durgapur for being a constant source of inspiration.

Finally, we would like to thank the research scholars of IIT Kharagpur and our friends at BCET for their invaluable suggestions and comments, which led to the successful completion of the project.

Barnan Das Tanmoy Pal

List of Figures
Types of Stemming
Control Flow Diagram
Logical Flow Diagram

CONTENTS
Abstract
Acknowledgement
List of Figures
1. Introduction
   1.1 Brief History
   1.2 Information Retrieval
   1.3 Performance Measurement
   1.4 Stages of IR
2. Stemming
   2.1...