1. Project goal: Compilation of the Prothom-alo corpus, conversion of the corpus to Unicode format, and production of a document describing the approach taken to compiling a Bangla corpus.

2. Corpora and their importance in language research: A corpus, most simply, can be defined as a collection of texts, which may be in a single language or in more than one language. It is as important a resource as any other in linguistic research. Natural language processing has long been an active research area, and computational linguistics is an important part of it. A key resource for any linguistic research is a trained, annotated corpus, which can enhance language-processing capabilities such as automatic part-of-speech tagging, information extraction, etc. For example, many lexicographers have found that they can create dictionaries more effectively by studying word usage in very large linguistic corpora. Corpora have significantly affected research in the linguistic disciplines and have succeeded in opening a new area of research. Because the corpus is such an important resource, linguistic researchers have produced corpora for their own languages. English has many corpora varying in size, genre and purpose. The first English corpus was the Brown corpus, created by W. Nelson Francis and Henry Kucera in the early 1960s. Many other English corpora are available, such as the British National Corpus, the London-Lund Corpus, the Penn Treebank corpus, the International Corpus of English and many more. Unfortunately, no such corpus is yet available for Bangla. The goal of this project was to take the text collected from the online version of Prothom-alo and convert it to Unicode format, making it usable for further research.

3. Compilation procedure: The corpus was created in two phases: 1) collecting raw text from the Prothom-alo website; 2) converting it to Unicode.
3.1 Collection of text: The raw text for the corpus was collected from the Prothom-alo home page, WWW.Prothom-alo.net. This was done using a web-crawler program that traversed the Prothom-alo website and downloaded all the news available for the year 2005, including the magazines and periodicals they publish. The crawler ran for one night to collect all the text, which was, of course, in HTML. A Linux shell script was then used to convert all the files to plain-text files.
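The HTML-to-text step above was done with a Linux script; the following is only an illustrative Java sketch of the same idea, since the project's actual script is not reproduced here. It strips scripts, styles and tags with regular expressions and collapses whitespace; the directory path and file extensions are assumptions.

```java
import java.nio.file.*;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class HtmlToText {
    // Strip <script>/<style> blocks, then all remaining tags, then
    // collapse runs of whitespace into single spaces.
    static String toText(String html) {
        return html
            .replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ")
            .replaceAll("(?s)<[^>]+>", " ")
            .replaceAll("&nbsp;", " ")
            .replaceAll("\\s+", " ")
            .trim();
    }

    public static void main(String[] args) throws Exception {
        // Convert every downloaded .html file under the given folder
        // into a .txt file alongside it (hypothetical layout).
        try (Stream<Path> walk = Files.walk(Paths.get(args[0]))) {
            List<Path> htmlFiles = walk
                .filter(f -> f.toString().endsWith(".html"))
                .collect(Collectors.toList());
            for (Path p : htmlFiles) {
                Path out = Paths.get(p.toString().replaceAll("\\.html$", ".txt"));
                Files.writeString(out, toText(Files.readString(p)));
            }
        }
    }
}
```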
Fig: Prothom-alo corpus (directory layout: one folder per month, e.g. "december", containing day folders "day1" to "day31", each of which holds files for news categories 1, 2, ..., 15 and 38)

3.2 Unicode conversion: The second part of creating the corpus was to convert all the files to Unicode format. This was necessary because Unicode support for Bangla is much richer than that of any other format. Prothom-alo uses two fonts, namely "Bansi Alpona" and "Prothoma". The former was in use up to 2005, and they currently use "Prothoma" for the online version of the newspaper. A Java application was therefore written that recursively searches the folders and sub-folders and converts all the text files to Unicode.
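The recursive converter can be sketched as below. This is not the project's actual application: the real glyph tables for the "Bansi Alpona" and "Prothoma" fonts are not reproduced here, so the mapping holds only two placeholder entries, and the legacy files are assumed to be readable as Latin-1 bytes.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ToUnicode {
    // Hypothetical legacy-glyph-to-Unicode table. A real run would load
    // the full mapping for the legacy font in use.
    static final Map<Character, String> LEGACY_TO_UNICODE = new HashMap<>();
    static {
        LEGACY_TO_UNICODE.put('A', "\u0985"); // Bangla letter A (placeholder)
        LEGACY_TO_UNICODE.put('K', "\u0995"); // Bangla letter KA (placeholder)
    }

    static String convert(String legacy) {
        StringBuilder sb = new StringBuilder();
        for (char c : legacy.toCharArray())
            sb.append(LEGACY_TO_UNICODE.getOrDefault(c, String.valueOf(c)));
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Recursively rewrite every .txt file under the folder as UTF-8.
        try (Stream<Path> walk = Files.walk(Paths.get(args[0]))) {
            List<Path> txtFiles = walk
                .filter(f -> f.toString().endsWith(".txt"))
                .collect(Collectors.toList());
            for (Path p : txtFiles) {
                String legacy = Files.readString(p, StandardCharsets.ISO_8859_1);
                Files.writeString(p, convert(legacy), StandardCharsets.UTF_8);
            }
        }
    }
}
```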
4. Processing the text, categorizing the news: After conversion to Unicode, the corpus is ready for any further processing required. One important and useful processing step is categorizing the news. Prothom-alo presents news in 27 different categories, each with a category id; e.g. category 1 is "prothom pata" (first page), category 2 is "sesh pata" (last page), etc. If all the news items belonging to the same category are merged together, this enables analysis and research such as text categorization. A Java application is used that walks through the news for every day and collects the news of each category into one file. The corpus is also available as a single text file.

5. Analysis: We now have a corpus which is:
• 318 MB in size
• 12 million words/tokens in count

which is a large one. Some basic statistical analysis is now on offer which...
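The category-merging pass described in section 4 can be sketched as follows. It assumes, purely for illustration, that each day folder stores one file per category named "<categoryId>.txt" (e.g. "1.txt" for "prothom pata"); the project's actual file-naming scheme may differ.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class MergeByCategory {
    // Walk all day folders under corpusDir and append each category
    // file's text to a single per-category file in outDir.
    public static void merge(Path corpusDir, Path outDir) throws IOException {
        Files.createDirectories(outDir);
        try (Stream<Path> walk = Files.walk(corpusDir)) {
            List<Path> categoryFiles = walk
                .filter(Files::isRegularFile)
                .filter(f -> f.getFileName().toString().matches("\\d+\\.txt"))
                .collect(Collectors.toList());
            for (Path p : categoryFiles) {
                Path target = outDir.resolve(p.getFileName().toString());
                Files.writeString(target,
                        Files.readString(p, StandardCharsets.UTF_8) + "\n",
                        StandardCharsets.UTF_8,
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            }
        }
    }
}
```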
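The token count reported above can be reproduced with a simple whitespace tokenizer, sketched below. This is only an approximation: real Bangla tokenization would also need to handle punctuation and zero-width joiner characters.

```java
public class TokenCount {
    // Count whitespace-separated tokens in a string.
    static long countTokens(String text) {
        if (text.isBlank()) return 0;
        return text.trim().split("\\s+").length;
    }
}
```

Summing this over every file in the corpus yields the total word/token count.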