Journal of Statistical Software
March 2008, Volume 25, Issue 5. http://www.jstatsoft.org/
Text Mining Infrastructure in R
Ingo Feinerer, Kurt Hornik, David Meyer
Wirtschaftsuniversität Wien
Abstract
During the last decade text mining has become a widely used discipline utilizing statistical and machine learning methods. We present the tm package which provides a framework for text mining applications within R. We give a survey on text mining facilities in R and explain how typical application tasks can be carried out using our framework. We present techniques for count-based analysis methods, text clustering, text classification and string kernels.
Keywords: text mining, R, count-based evaluation, text clustering, text classiﬁcation, string kernels.
Text mining encompasses a vast field of theoretical approaches and methods with one thing in common: text as input information. This allows various definitions, ranging from an extension of classical data mining to texts to more sophisticated formulations like "the use of large online text collections to discover new facts and trends about the world itself" (Hearst 1999). In general, text mining is an interdisciplinary field of activity among data mining, linguistics, computational statistics, and computer science. Standard techniques are text classification, text clustering, ontology and taxonomy creation, document summarization, and latent corpus analysis. In addition, many techniques from related fields, such as information retrieval, are commonly used.

Classical applications in text mining (Weiss et al. 2004) come from the data mining community, like document clustering (Zhao and Karypis 2005b,a; Boley 1998; Boley et al. 1999) and document classification (Sebastiani 2002). For both, the idea is to transform the text into a structured format based on term frequencies and subsequently apply standard data mining techniques. Typical applications in document clustering include grouping news articles or information service documents (Steinbach et al. 2000), whereas text categorization methods are
used in, e.g., e-mail filters and automatic labeling of documents in business libraries (Miller 2005). Especially in the context of clustering, specific distance measures (Zhao and Karypis 2004; Strehl et al. 2000), like the Cosine, play an important role.

With the advent of the World Wide Web, support for information retrieval tasks (carried out by, e.g., search engines and web robots) has quickly become an issue. Here, a possibly unstructured user query is first transformed into a structured format, which is then matched against texts coming from a database. To build the latter, again, the challenge is to normalize unstructured input data to fulfill the repositories' requirements on information quality and structure, which often involves grammatical parsing. In recent years, more innovative text mining methods have been used for analyses in various fields, e.g., in linguistic stylometry (Girón et al. 2005; Nilo and Binongo 2003; Holmes and Kardos 2003), where the probability that a specific author wrote a specific text is calculated by analyzing the author's writing style, or in search engines for learning rankings of documents from search engine logs of user behavior (Radlinski and Joachims 2007).

Latest developments in document exchange have brought up valuable concepts for automatic handling of texts. The semantic web (Berners-Lee et al. 2001) propagates standardized formats for document exchange to enable agents to perform semantic operations on them. This is implemented by providing metadata and by annotating the text with tags. One key format is RDF (Manola and Miller 2004), where efforts to handle this format have already been made in R (R Development Core Team 2007) with the Bioconductor project (Gentleman et al. 2004, 2005). This...
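To make the term-frequency transformation and the Cosine measure concrete, the following is a minimal, dependency-free sketch in base R (the example documents and the manual term-document construction are illustrative only; the tm package discussed in this paper provides this functionality directly):

```r
# Toy corpus of two documents.
docs <- c("text mining with R", "mining text collections")

# Tokenize and collect the vocabulary.
tokens <- strsplit(tolower(docs), "\\s+")
terms <- sort(unique(unlist(tokens)))

# Term-document matrix: rows are terms, columns are documents,
# entries are raw term frequencies.
tdm <- sapply(tokens, function(tok) table(factor(tok, levels = terms)))
rownames(tdm) <- terms

# Cosine similarity between two term-frequency vectors.
cosine <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
cosine(tdm[, 1], tdm[, 2])
```

The two documents share the terms "text" and "mining", so their Cosine similarity is positive but well below 1; clustering methods then operate on the resulting distance (1 minus similarity) between such document vectors.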