Journal of Statistical Software
March 2008, Volume 25, Issue 5. http://www.jstatsoft.org/

Text Mining Infrastructure in R
Ingo Feinerer Kurt Hornik David Meyer

Wirtschaftsuniversität Wien Wirtschaftsuniversität Wien Wirtschaftsuniversität Wien

Abstract
During the last decade, text mining has become a widely used discipline utilizing statistical and machine learning methods. We present the tm package, which provides a framework for text mining applications within R. We give a survey of text mining facilities in R and explain how typical application tasks can be carried out using our framework. We present techniques for count-based analysis methods, text clustering, text classification, and string kernels.

Keywords: text mining, R, count-based evaluation, text clustering, text classification, string kernels.

1. Introduction
Text mining encompasses a vast field of theoretical approaches and methods with one thing in common: text as input information. This allows various definitions, ranging from an extension of classical data mining to texts to more sophisticated formulations like “the use of large online text collections to discover new facts and trends about the world itself” (Hearst 1999). In general, text mining is an interdisciplinary field of activity amongst data mining, linguistics, computational statistics, and computer science. Standard techniques are text classification, text clustering, ontology and taxonomy creation, document summarization, and latent corpus analysis. In addition, many techniques from related fields like information retrieval are commonly used.

Classical applications in text mining (Weiss et al. 2004) come from the data mining community, like document clustering (Zhao and Karypis 2005b,a; Boley 1998; Boley et al. 1999) and document classification (Sebastiani 2002). For both, the idea is to transform the text into a structured format based on term frequencies and subsequently apply standard data mining techniques. Typical applications in document clustering include grouping news articles or information service documents (Steinbach et al. 2000), whereas text categorization methods are used in, e.g., e-mail filters and automatic labeling of documents in business libraries (Miller 2005). Especially in the context of clustering, specific distance measures (Zhao and Karypis 2004; Strehl et al. 2000), like the Cosine, play an important role.

With the advent of the World Wide Web, support for information retrieval tasks (carried out by, e.g., search engines and web robots) has quickly become an issue. Here, a possibly unstructured user query is first transformed into a structured format, which is then matched against texts coming from a database. To build the latter, again, the challenge is to normalize unstructured input data to fulfill the repositories’ requirements on information quality and structure, which often involves grammatical parsing.

In recent years, more innovative text mining methods have been used for analyses in various fields, e.g., in linguistic stylometry (Girón et al. 2005; Nilo and Binongo 2003; Holmes and Kardos 2003), where the probability that a specific author wrote a specific text is calculated by analyzing the author’s writing style, or in search engines for learning rankings of documents from search engine logs of user behavior (Radlinski and Joachims 2007).

Recent developments in document exchange have brought up valuable concepts for automatic handling of texts. The semantic web (Berners-Lee et al. 2001) promotes standardized formats for document exchange to enable agents to perform semantic operations on them. This is implemented by providing metadata and by annotating the text with tags. One key format is RDF (Manola and Miller 2004); efforts to handle this format have already been made in R (R Development Core Team 2007) with the Bioconductor project (Gentleman et al. 2004, 2005). This...
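
To make the term-frequency representation and the Cosine measure mentioned in this introduction concrete, the following is a minimal sketch in Python using only the standard library. It is an illustration under stated assumptions, not the tm package's API: the tokenizer (lowercase alphabetic tokens), the function names `term_frequencies` and `cosine_similarity`, and the three sample documents are all hypothetical choices made for this example.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Turn raw text into a sparse term-frequency vector.

    Assumption: a term is a maximal run of the letters a-z after
    lowercasing; real systems typically add stemming, stopword
    removal, and weighting.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors."""
    # Counter returns 0 for terms that are absent from tf_b.
    dot = sum(count * tf_b[term] for term, count in tf_a.items())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_a * norm_b)


# Hypothetical sample documents, invented for illustration only.
docs = {
    "news_1": "Stocks fall as oil prices rise",
    "news_2": "Oil prices rise and stocks fall again",
    "sports": "The home team wins the final match",
}

# Structured format: one term-frequency vector per document.
vectors = {name: term_frequencies(text) for name, text in docs.items()}

sim_news = cosine_similarity(vectors["news_1"], vectors["news_2"])
sim_mixed = cosine_similarity(vectors["news_1"], vectors["sports"])

# The Cosine distance is commonly taken as one minus the similarity.
print("news_1 vs news_2:", round(1.0 - sim_news, 3))
print("news_1 vs sports:", round(1.0 - sim_mixed, 3))
```

The two news texts share most of their terms and therefore end up close to each other, while the sports text shares no terms with them and is at the maximum distance. A clustering method applied to such vectors would group the news texts together.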