Dennis Ramdass & Shreyes Seshasai
6.863 Final Project
May 18, 2009
In many real-world scenarios, the ability to automatically classify documents into a fixed set of categories is highly desirable. Common scenarios include classifying a large body of unclassified archival documents such as newspaper articles, legal records, and academic papers. For example, newspaper articles can be classified as 'features', 'sports', or 'news'. Other scenarios involve classifying documents as they are created, for example labeling movie review articles as 'positive' or 'negative' reviews, or assigning blog entries labels from a fixed set.
Natural language processing offers powerful techniques for automatically classifying documents. These techniques are predicated on the hypothesis that documents in different categories distinguish themselves by features of the natural language contained in each document. Salient features for document classification may include word structure, word frequency, and natural language structure in each document.

Our project looks specifically at the task of automatically classifying newspaper articles from the MIT newspaper The Tech. The Tech has archives of a large number of articles which require classification into specific sections (News, Opinion, Sports, etc.). Our project investigates and implements techniques that can be used to perform automatic article classification for this purpose. At our disposal is a large archive of already classified documents, so we are able to make use of supervised classification techniques. We randomly split this archive of classified documents into training and testing groups for our classification systems (hereafter referred to simply as classifiers). This project experiments with different natural language feature sets as well as different statistical techniques using these feature sets, and compares the performance in each case. Specifically, our project involves experimenting with feature sets for Naive Bayes classification and Maximum Entropy classification, and examining sentence structure differences across categories using probabilistic grammar parsers.

The paper proceeds as follows: Section 2 discusses related work in the area of document classification and gives an overview of each classification technique. Section 3 details our approach and implementation. Section 4 shows the results of testing our classifiers. In Section 5, we discuss possible future extensions and suggestions for improvement. Finally, in Section 6, we discuss retrospective thoughts on our approach and high-level conclusions about our results.
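The random train/test partition of the archive described above could be sketched as follows. This is an illustrative sketch only: the 80/20 split fraction and the fixed seed are assumptions for reproducibility, not details taken from the project itself.

```python
import random

def split_archive(articles, train_fraction=0.8, seed=42):
    """Randomly partition already-classified articles into a
    training set and a testing set (fraction and seed are
    illustrative assumptions, not the paper's actual values)."""
    shuffled = list(articles)
    random.Random(seed).shuffle(shuffled)
    cutoff = int(len(shuffled) * train_fraction)
    return shuffled[:cutoff], shuffled[cutoff:]
```

Fixing the seed makes the split reproducible across runs, so that different classifiers can be compared on identical training and testing groups.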
Related Work and Overview of Classification Techniques
A variety of supervised learning techniques have demonstrated reasonable performance for document classification. These techniques include k-nearest neighbor, support vector machines, boosting, and rule learning algorithms [4, 5].
For this project, we focus on related work in the areas of Naive Bayes classification [6, 7, 8], Maximum Entropy classification, and probabilistic grammar classification.
Naive Bayes Classification
This subsection cites material from  extensively to explain the basics of Naive Bayes classification. Bayesian classifiers are probabilistic approaches that make strong assumptions about how the data is generated, and posit a probabilistic model that embodies these assumptions. Bayesian classifiers usually use supervised learning on training examples to estimate the parameters of the generative model. Classification of new examples is performed with Bayes' rule by selecting the category that is most likely to have generated the example.
The naive Bayes classifier is the simplest of these classifiers, in that it assumes that all features of the examples are independent of each other given the category.
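The generative model and Bayes'-rule classification described above can be sketched as a minimal multinomial naive Bayes classifier over word features. This is a generic illustration of the technique, not the project's actual implementation; the add-one smoothing and tokenized input format are assumptions.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(documents):
    """Estimate the generative model's parameters from training
    examples: log-priors P(c) from category frequencies, and
    per-category word log-likelihoods P(w|c) with add-one smoothing.
    `documents` is a list of (tokens, category) pairs."""
    category_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, category in documents:
        category_counts[category] += 1
        word_counts[category].update(tokens)
        vocabulary.update(tokens)
    total_docs = sum(category_counts.values())
    model = {}
    for category, doc_count in category_counts.items():
        total_words = sum(word_counts[category].values())
        log_prior = math.log(doc_count / total_docs)
        log_likelihood = {
            w: math.log((word_counts[category][w] + 1) /
                        (total_words + len(vocabulary)))
            for w in vocabulary
        }
        model[category] = (log_prior, log_likelihood)
    return model

def classify(model, tokens):
    """Apply Bayes' rule: pick the category maximizing
    log P(c) + sum over tokens of log P(w|c), treating each
    word as independent given the category."""
    def score(category):
        log_prior, log_likelihood = model[category]
        return log_prior + sum(log_likelihood.get(w, 0.0) for w in tokens)
    return max(model, key=score)
```

For example, a model trained on a few toy articles labeled 'Sports' and 'News' would assign a new article containing sports vocabulary to 'Sports', since that category's words make the article most likely under the learned generative model.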