Random-Walk Term Weighting for Improved Text Classification
Samer Hassan, Rada Mihalcea, and Carmen Banea
Department of Computer Science, University of North Texas
samer@unt.edu, rada@cs.unt.edu, carmenb@unt.edu

Abstract
This paper describes a new approach for estimating term weights in a document, and shows how the new weighting scheme can be used to improve the accuracy of a text classifier. The method uses term co-occurrence as a measure of dependency between word features. A random-walk model is applied on a graph encoding words and co-occurrence dependencies, resulting in scores that represent a quantification of how a particular word feature contributes to a given context. Experiments performed on three standard classification datasets show that the new random-walk based approach outperforms the traditional term frequency approach of feature weighting.
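The implementation details of the weighting scheme are not included in this excerpt, but the idea stated in the abstract can be illustrated with a small sketch: build an undirected co-occurrence graph over the words of a document and run a PageRank-style random walk on it, taking each word's stationary probability as its weight. The window size, damping factor, and iteration count below are illustrative assumptions, not values taken from the paper.

```python
from collections import defaultdict

def cooccurrence_graph(tokens, window=2):
    """Link two words with an undirected edge if they appear
    within `window` positions of each other in the document."""
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            if tokens[j] != word:
                neighbors[word].add(tokens[j])
                neighbors[tokens[j]].add(word)
    return neighbors

def random_walk_weights(graph, damping=0.85, iterations=50):
    """PageRank-style power iteration: a word's weight is the
    probability that a random walk over the graph visits it."""
    nodes = list(graph)
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Evaluate the new scores against the previous iteration's scores.
        score = {
            n: (1 - damping) / len(nodes)
               + damping * sum(score[m] / len(graph[m]) for m in graph[n])
            for n in nodes
        }
    return score

tokens = "the cat sat on the mat while the dog watched the cat".split()
weights = random_walk_weights(cooccurrence_graph(tokens))
for word, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{word:8s} {weight:.3f}")
```

In this reading, a word receives a high weight not only when it is frequent, but also when it co-occurs with many distinct (and themselves well-connected) words, which is the kind of contextual contribution the abstract refers to.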

1 Introduction

Term frequency has long been used as a major factor for estimating the probabilistic distribution of features in a document, and it has been employed in a broad spectrum of tasks including language modeling [18], feature selection [29, 24], and term weighting [13, 20]. The main drawback of the term frequency method is that it relies on a bag-of-words approach: it assumes feature independence and disregards any dependencies that may exist between words in the text. In other words, it defines a "random choice," where the weight of a term is proportional to the probability of choosing that term at random from the set of terms that constitute the text. Such an approach may be effective for capturing the relevance of a term in a local context, but it fails to account for the global effect that the term's presence exerts on the entire text segment. We argue that the bag-of-words model may not be the best technique for capturing term importance. Instead, given that relations in the text could be preserved by maintaining the structural representation of the text, a method
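For contrast with the random-walk sketch above, the bag-of-words weighting described in this paragraph reduces to a normalized term frequency: a term's weight is the probability of drawing it at random from the document's tokens, with no regard for word order or word-to-word dependencies. A minimal sketch follows; normalizing the counts into a probability distribution is our illustrative reading of the description, not a formula quoted from the paper.

```python
from collections import Counter

def term_frequency_weights(tokens):
    """Bag-of-words weighting: each term's weight is the probability of
    picking it at random from the document, ignoring word order."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}

tokens = "the cat sat on the mat while the dog watched the cat".split()
for term, weight in sorted(term_frequency_weights(tokens).items(),
                           key=lambda kv: -kv[1]):
    print(f"{term:8s} {weight:.2f}")
```

Both schemes reward frequent terms, but only the graph-based weights are sensitive to which other words a term appears with, which is the dependency information the bag-of-words model discards.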



References:
[1] I. Androutsopoulos, J. Koutsias, K. V. Chandrinos, G. Paliouras, and C. D. Spyropoulos. An evaluation of Naive Bayesian anti-spam filtering. In Proceedings of the Workshop on Machine Learning in the New Information Age, 2000.
[2] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1-7), 1998.
[3] C. Buckley, G. Salton, J. Allan, and A. Singhal. Automatic query expansion using SMART: TREC 3. In Proceedings of the Text REtrieval Conference, 1994.
[4] C. Liao, S. Alpha, and P. Dixon. Feature preparation in text categorization. Oracle Corporation, 2002.
[5] R. Collobert and S. Bengio. SVMTorch: Support vector machines for large-scale regression problems. Journal of Machine Learning Research, 1:143-160, 2001.
[6] P. Dai, U. Iurgel, and G. Rigoll. A novel feature combination approach for spoken document classification with support vector machines, 2003.
[7] F. Debole and F. Sebastiani. Supervised term weighting for automated text categorization. In SAC '03: Proceedings of the 2003 ACM Symposium on Applied Computing, pages 784-788, New York, NY, USA, 2003. ACM Press.
[8] B. Dom, I. Eiron, A. Cozzi, and Y. Shang. Graph-based ranking algorithms for e-mail expertise analysis. In Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, San Diego, California, 2003.
[9] G. Erkan and D. Radev. LexRank: Graph-based centrality as salience in text summarization. Journal of Artificial Intelligence Research, December 2004.
[10] G. Grimmett and D. Stirzaker. Probability and Random Processes. Oxford University Press, 1989.
[11] T. Joachims. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In Proceedings of the 14th International Conference on Machine Learning, Nashville, US, 1997.
[12] A. Klautau. Speech recognition based on discriminative classifiers. In Proceedings of the Simposio Brasileiro de Telecomunicacoes (SBT), Rio de Janeiro, Brazil, 2003.
[13] M. Lan, C. Tan, H. Low, and S. Sungy. A comprehensive comparative study on term weighting schemes for text categorization with support vector machines. In Proceedings of the 14th International Conference on World Wide Web, pages 1032-1033, 2005.
[14] E. Leopold and J. Kindermann. Text categorization with support vector machines: How to represent texts in input space? Machine Learning, 46:423-444, Kluwer Academic Publishers, Hingham, MA, USA, 2002.
[15] R. Mihalcea and P. Tarau. TextRank: Bringing order into texts. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2004), Barcelona, Spain, 2004.
[16] A. Moschitti. A study on optimal parameter tuning for Rocchio text classifier. In Proceedings of the European Conference on Information Retrieval, Pisa, Italy, 2003.
[17] K. Papineni. Why inverse document frequency? In NAACL '01: Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies, pages 1-8, Morristown, NJ, USA, 2001. Association for Computational Linguistics.
[18] J. M. Ponte and W. B. Croft. A language modeling approach to information retrieval. In Research and Development in Information Retrieval, pages 275-281, 1998.
[19] M. Radovanovic and M. Ivanovic. Document representations for classification of short web-page descriptions. In DaWaK, pages 544-553, 2006.
[20] S. Robertson and K. Sparck Jones. Simple, proven approaches to text retrieval. Technical report, 1997.
[21] S. Robertson. Understanding inverse document frequency: On theoretical arguments for IDF. Journal of Documentation, 5:503-520, 2004.
[22] M. Sahami. Learning limited dependence Bayesian classifiers. In Proceedings of the International Conference on Knowledge Discovery and Data Mining, pages 335-338, 1996.
[23] K. Schneider. A new feature selection score for multinomial naive Bayes text classification based on KL-divergence. In The Companion Volume to the Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, Barcelona, Spain, July 2004.
[24] H. Schutze, D. A. Hull, and J. O. Pedersen. A comparison of classifiers and document representations for the routing problem. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Seattle, Washington, 1995.
[25] K. Sparck Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1):11-21, 1972.
[26] S. Tan, X. Cheng, M. M. Ghanem, B. Wang, and H. Xu. A novel refinement approach for text categorization. In CIKM '05: Proceedings of the 14th ACM International Conference on Information and Knowledge Management, pages 469-476, Bremen, Germany, 2005.
[27] V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.
[28] Y. Yang and X. Liu. A re-examination of text categorization methods. In Proceedings of the 22nd ACM SIGIR Conference on Research and Development in Information Retrieval, 1999.
[29] Y. Yang and J. O. Pedersen. A comparative study on feature selection in text categorization. In Proceedings of the 14th International Conference on Machine Learning, Nashville, US, 1997.
