Document Classification for Newspaper Articles
Dennis Ramdass & Shreyes Seshasai
6.863 Final Project
Spring 2009
May 18, 2009



In many real-world scenarios, the ability to automatically classify documents into a fixed set of categories is highly desirable. Common scenarios include classifying a large amount of unclassified archival documents such as newspaper articles, legal records, and academic papers. For example, newspaper articles can be classified as 'features', 'sports', or 'news'. Other scenarios involve classifying documents as they are created. Examples include classifying movie review articles as 'positive' or 'negative' reviews, or classifying blog entries using a fixed set of labels.

Natural language processing offers powerful techniques for automatically classifying documents. These techniques are predicated on the hypothesis that documents in different categories distinguish themselves by features of the natural language contained in each document. Salient features for document classification may include word structure, word frequency, and natural language structure in each document. Our project looks specifically at the task of automatically classifying newspaper articles from the MIT newspaper The Tech. The Tech has archives of a large number of articles which require classification into specific sections (News, Opinion, Sports, etc.). Our project is aimed at investigating and implementing techniques which can be used to perform automatic article classification for this purpose. At our disposal is a large archive of already classified documents, so we are able to make use of supervised classification techniques. We randomly split this archive of classified documents into training and testing groups for our classification systems (hereafter referred to simply as classifiers). This project experiments with different natural language feature sets as well as different statistical techniques using these feature sets, and compares the performance in each case. Specifically, our project involves experimenting with feature sets for Naive Bayes classification and Maximum Entropy classification, and examining sentence structure differences across categories using probabilistic grammar parsers. The paper proceeds as follows: Section 2 discusses related work in the area of document classification and gives an overview of each classification technique. Section 3 details our approach and implementation. Section 4 shows the results of testing our classifiers. In Section 5, we discuss possible future extensions and suggestions for improvement. Finally, in Section 6, we discuss retrospective thoughts on our approach and high-level conclusions about our results.
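The random split into training and testing groups mentioned above can be sketched as follows. This is a minimal illustration, not the project's actual preprocessing code; the 80/20 fraction and the fixed seed are assumptions chosen for the example.

```python
import random

def split_archive(labeled_docs, train_fraction=0.8, seed=0):
    """Randomly split an archive of already-classified documents
    into training and testing groups for a supervised classifier."""
    docs = list(labeled_docs)
    # A fixed seed makes the split reproducible across runs
    random.Random(seed).shuffle(docs)
    cut = int(len(docs) * train_fraction)
    return docs[:cut], docs[cut:]
```

Holding out a test group that the classifier never sees during training is what makes the later performance comparisons between feature sets meaningful.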


Related Work and Overview of Classification Techniques

A variety of supervised learning techniques have demonstrated reasonable performance for document classification. These techniques include k-nearest neighbor [1], support vector machines [2], boosting [3], and rule learning algorithms [4, 5].

For this project, we focus on related work in the areas of Naive Bayes classification [6, 7, 8], Maximum Entropy classification [9] and probabilistic grammar classification [10].


Naive Bayes Classification

This subsection cites material from [7] extensively to explain the basics of Naive Bayes classification. Bayesian classifiers are probabilistic approaches that make strong assumptions about how the data is generated, and posit a probabilistic model that embodies these assumptions. Bayesian classifiers usually use supervised learning on training examples to estimate the parameters of the generative model. Classification on new examples is performed with Bayes' rule by selecting the category that is most likely to have generated the example.
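In standard notation (the symbols here follow common usage rather than any formula quoted from [7]), the decision rule just described selects the category c maximizing the posterior probability of having generated document d; since P(d) is constant across categories, it can be dropped:

```latex
\hat{c} \;=\; \arg\max_{c} P(c \mid d)
        \;=\; \arg\max_{c} \frac{P(c)\,P(d \mid c)}{P(d)}
        \;=\; \arg\max_{c} P(c)\,P(d \mid c)
```

Here P(c) is the category prior and P(d | c) the likelihood of the document under the category's generative model, both estimated from the training examples.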

The naive Bayes classifier is the simplest of these classifiers, in that it assumes that all features of the examples are independent of each other given the context of the category.
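A minimal sketch of such a classifier, using words as the features and Laplace smoothing, is shown below. This is an illustrative implementation under common assumptions (multinomial word counts, add-one smoothing), not the project's actual code, and the toy training data is hypothetical rather than drawn from The Tech's archive.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(documents):
    """Estimate category priors and per-category word likelihoods.

    documents: list of (tokens, category) pairs.
    Returns (log_priors, log_likelihoods, vocabulary).
    """
    category_counts = Counter(cat for _, cat in documents)
    word_counts = defaultdict(Counter)   # category -> word -> count
    vocabulary = set()
    for tokens, cat in documents:
        word_counts[cat].update(tokens)
        vocabulary.update(tokens)

    n_docs = len(documents)
    log_priors = {c: math.log(n / n_docs) for c, n in category_counts.items()}
    log_likelihoods = {}
    for cat, counts in word_counts.items():
        total = sum(counts.values())
        # Laplace (add-one) smoothing so unseen words get nonzero probability
        log_likelihoods[cat] = {
            w: math.log((counts[w] + 1) / (total + len(vocabulary)))
            for w in vocabulary
        }
    return log_priors, log_likelihoods, vocabulary

def classify(tokens, log_priors, log_likelihoods, vocabulary):
    """Select the category maximizing log P(c) + sum of log P(w|c).

    The sum over individual words is exactly the naive independence
    assumption: features are treated as independent given the category."""
    best_cat, best_score = None, float("-inf")
    for cat in log_priors:
        score = log_priors[cat]
        for w in tokens:
            if w in vocabulary:
                score += log_likelihoods[cat][w]
        if score > best_score:
            best_cat, best_score = cat, score
    return best_cat

# Toy illustration with hypothetical section labels
train = [
    (["goals", "match", "coach", "season"], "Sports"),
    (["team", "score", "playoff", "coach"], "Sports"),
    (["election", "mayor", "policy", "vote"], "News"),
    (["council", "budget", "policy", "city"], "News"),
]
priors, likelihoods, vocab = train_naive_bayes(train)
print(classify(["coach", "score"], priors, likelihoods, vocab))  # -> Sports
```

Working in log space avoids numerical underflow when multiplying many small word probabilities, which matters for full-length newspaper articles.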