A Parameterized Approach to Spam-Resilient Link Analysis of the Web

  • Published : April 11, 2013

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 20, NO. 10, OCTOBER 2009

A Parameterized Approach to Spam-Resilient
Link Analysis of the Web
James Caverlee, Member, IEEE, Steve Webb, Member, IEEE,
Ling Liu, Senior Member, IEEE, and William B. Rouse, Fellow, IEEE

Abstract: Link-based analysis of the Web provides the basis for many important applications (such as Web search, Web-based data mining, and Web page categorization) that bring order to the massive amount of distributed Web content. Because of the overwhelming reliance on these applications, efforts to manipulate (or spam) the link structure of the Web are on the rise. In this manuscript, we present a parameterized framework for link analysis of the Web that promotes spam resilience through a source-centric view of the Web. We provide a rigorous study of the set of critical parameters that can impact source-centric link analysis and propose the novel notion of influence throttling for countering the influence of link-based manipulation. Through formal analysis and a large-scale experimental study, we show how different parameter settings may impact the time complexity, stability, and spam resilience of Web link analysis. Concretely, we find that the source-centric model supports more effective and robust rankings in comparison with existing Web algorithms such as PageRank.

Index Terms: Internet search, information search and retrieval, information storage and retrieval, information technology and systems, distributed systems, systems and software, Web search, general, Web-based services, online information services.

1 INTRODUCTION

The Web is arguably the most massive and successful distributed computing application today. Millions of Web servers support the autonomous sharing of billions of Web pages. From its earliest days, the Web has been the subject of intense focus for organizing, sorting, and understanding its massive amount of data. One of the most popular and effective Web analysis approaches is link-based analysis, which considers the number and nature of hyperlink relationships among Web pages. Link analysis powers many critical Web applications, including Web crawling, Web search and ranking, Web-based data mining, and Web page categorization.
Since Web link analysis plays a central role in so many critical Web applications, Web spammers spend considerable effort manipulating (or spamming) the link structure of the Web to undermine the link-based algorithms that drive these applications (like the PageRank algorithm for Web page ranking). This manipulation is a serious problem, and more and more incidents of Web spam are observed, experienced, and reported [1], [2], [3]. In this manuscript, we focus on three prominent types of link-based vulnerabilities we have identified in Web ranking systems. These vulnerabilities corrupt link-based ranking algorithms like HITS [4] and PageRank [5] by making it appear that a reputable page is...

• J. Caverlee is with the Department of Computer Science, Texas A&M University, 3112 TAMU, College Station, TX 77843-3112. E-mail: caverlee@cs.tamu.edu.
• S. Webb is with Purewire, 14 Piedmont Center NE, Suite 850, Atlanta, GA 30305. E-mail: swebb@purewire.com.
• L. Liu is with the College of Computing, Georgia Institute of Technology, 266 Ferst Dr., Atlanta, GA 30332-0765. E-mail: lingliu@cc.gatech.edu.
• W.B. Rouse is with the Tennenbaum Institute, Georgia Institute of Technology, 760 Spring Street, NW, Atlanta, GA 30332-0210. E-mail: bill.rouse@ti.gatech.edu.

Manuscript received 20 Sept. 2008; revised 21 Aug. 2008; accepted 26 Sept. 2008; published online 7 Oct. 2008.
Recommended for acceptance by Y. Pan.
For information on obtaining reprints of this article, please send e-mail to: tpds@computer.org, and reference IEEECS Log Number TPDS-2007-09-0327.
Digital Object Identifier no. 10.1109/TPDS.2008.227.

1045-9219/09/$25.00 © 2009 IEEE