3.1 Collection of text: The raw text for the corpus was collected from the Prothom Alo home page, www.prothom-alo.net. This was done using a web crawler program that traversed the Prothom Alo website and downloaded all the news available for the year 2005, including the magazines and periodicals published by the paper. The crawler ran for one night to collect all the text, which was in HTML. The HTML files were then converted to plain text files using a Linux shell script.
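The paper does not reproduce the actual conversion script, but the HTML-to-text step it describes can be sketched as follows (a minimal Python sketch using the standard-library HTML parser; the tag-skipping choices are assumptions):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the visible text of an HTML page, skipping scripts and styles."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip = False

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip = False

    def handle_data(self, data):
        if not self.skip and data.strip():
            self.parts.append(data.strip())

def html_to_text(html: str) -> str:
    """Strip markup from one downloaded page, keeping only the text content."""
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.parts)
```

The original work used a Linux script for this step; the sketch above only illustrates the idea of reducing each crawled HTML page to raw news text.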
Fig: Prothom-alo corpus (directory structure: month, e.g. December, → day 1 … day 31 → news categories 1, 2, …, 15, 38)

3.2 Unicode conversion: The second part of creating the corpus was to convert all the files to Unicode. This was needed because Unicode support for Bangla is much richer than that of any other encoding. Prothom Alo uses two fonts, namely "Bansi Alpona" and "Prothoma". The former was in use up to 2005, and the online version of the newspaper currently uses "Prothoma". So a Java application was written that recursively searches the folders and subfolders and converts all the text files to Unicode.
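The converter described above was a Java application; the following Python sketch shows the same recursive walk-and-convert idea. The mapping table and the file encodings are placeholders (assumptions), since the actual "Bansi Alpona"/"Prothoma" font-to-Unicode tables are not given in the paper:

```python
import os

# Hypothetical mapping from legacy-font characters to Unicode Bangla.
# The real table depends on the font encodings, which are not reproduced here.
LEGACY_TO_UNICODE = {
    "A": "\u0985",  # placeholder: legacy character -> BENGALI LETTER A
}

def convert_text(legacy: str) -> str:
    """Map each legacy-font character to its Unicode equivalent."""
    return "".join(LEGACY_TO_UNICODE.get(ch, ch) for ch in legacy)

def convert_tree(root: str) -> None:
    """Recursively convert every .txt file under root, as the Java tool did."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".txt"):
                path = os.path.join(dirpath, name)
                with open(path, encoding="latin-1") as f:  # assumed legacy encoding
                    legacy = f.read()
                with open(path, "w", encoding="utf-8") as f:
                    f.write(convert_text(legacy))
```

The recursive traversal mirrors the folder/subfolder search the paper describes; only the character mapping itself would need to be filled in with the real font tables.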
4. Processing the text: Categorizing the news: After conversion to Unicode, the corpus is ready for any further processing required. An important and useful processing step is categorizing the news. Prothom Alo presents news in 27 different categories, each with a category id; e.g., category 1 is "prothom pata" (front page), category 2 is "sesh pata" (last page), and so on. If all the news belonging to the same category is merged together, it enables analysis and research such as text categorization. A Java application is used which surfs through the news of all the days and collects the news of each category in one file. The corpus is also available as a single text file.

5. Analysis: We now have a corpus which is:
• 318 MB in size
• 12 million words/tokens in count
This is a big corpus. Some basic statistical analysis is now on offer, which...
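The per-category merge described in Section 4 can be sketched as follows. The directory layout and file-naming scheme (day folders containing one file per category id) are assumptions for illustration, since the paper describes the Java tool only informally:

```python
import os
from collections import defaultdict

def merge_by_category(corpus_root: str, out_dir: str) -> None:
    """Collect all news files sharing a category id into one file per category.

    Assumed (hypothetical) layout: corpus_root/<day>/<category_id>.txt
    """
    merged = defaultdict(list)
    for day in sorted(os.listdir(corpus_root)):
        day_path = os.path.join(corpus_root, day)
        if not os.path.isdir(day_path):
            continue
        for name in sorted(os.listdir(day_path)):
            cat_id = os.path.splitext(name)[0]
            with open(os.path.join(day_path, name), encoding="utf-8") as f:
                merged[cat_id].append(f.read())
    os.makedirs(out_dir, exist_ok=True)
    for cat_id, texts in merged.items():
        out_path = os.path.join(out_dir, f"category_{cat_id}.txt")
        with open(out_path, "w", encoding="utf-8") as f:
            f.write("\n".join(texts))
```

Concatenating the merged category files would then yield the single-file version of the corpus mentioned above.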