IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING,
Truth Discovery with Multiple Conflicting Information Providers on the Web

Xiaoxin Yin, Jiawei Han, Senior Member, IEEE, and Philip S. Yu, Fellow, IEEE

Abstract—The World Wide Web has become the most important information source for most of us. Unfortunately, there is no guarantee for the correctness of information on the Web. Moreover, different websites often provide conflicting information on a subject, such as different specifications for the same product. In this paper, we propose a new problem, called Veracity, i.e., conformity to truth, which studies how to find true facts from a large amount of conflicting information on many subjects that is provided by various websites. We design a general framework for the Veracity problem and invent an algorithm, called TRUTHFINDER, which utilizes the relationships between websites and their information, i.e., a website is trustworthy if it provides many pieces of true information, and a piece of information is likely to be true if it is provided by many trustworthy websites. An iterative method is used to infer the trustworthiness of websites and the correctness of information from each other. Our experiments show that TRUTHFINDER successfully finds true facts among conflicting information and identifies trustworthy websites better than the popular search engines.

Index Terms—Data quality, Web mining, link analysis.
Example 2 (Authors of books). We tried to find out who wrote the book Rapid Contextual Design (ISBN: 0123540518). We found many different sets of authors from different online bookstores, and we show several of them in Table 1. From the image of the book cover, we found that A1 Books provides the most accurate information. In comparison, the information from Powell’s books is incomplete, and that from Lakeside books is incorrect.

The trustworthiness problem of the Web has been recognized by today’s Internet users. According to a survey on the credibility of websites conducted by Princeton Survey Research in 2005, 54 percent of Internet users trust news websites at least most of the time, while this ratio is only 26 percent for websites that offer products for sale and is merely 12 percent for blogs.

There have been many studies on ranking web pages according to authority (or popularity) based on hyperlinks. The most influential studies are Authority-Hub analysis and PageRank, which led to Google.com. However, does authority lead to accuracy of information? The answer is unfortunately no. Top-ranked websites are usually the most popular ones. However, popularity does not mean accuracy. For example, according to our experiments (Section 4.2), the bookstores ranked on top by Google (Barnes & Noble and Powell’s books) contain many errors on book author information. In comparison, some small bookstores (e.g., A1 Books) provide more accurate information.

In this paper, we propose a new problem called the Veracity problem, which is formulated as follows: Given a large amount of conflicting information about many objects, which is provided by multiple websites (or other types of information providers), how can we discover the true fact about each object? We use the word “fact” to represent something that is claimed as a fact by some website, and such a fact can be either true or false.
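To make the problem statement concrete, the following is a minimal sketch of the mutual-reinforcement idea (a website is trustworthy if it provides many true facts, and a fact is likely to be true if trustworthy websites provide it). It is not the TRUTHFINDER algorithm itself. The update rules, the initial trust of 0.9, the fixed number of rounds, and all function, site, and fact names are assumptions chosen for illustration.

```python
from math import prod


def iterate_trust(claims, initial_trust=0.9, rounds=10):
    """Alternate between fact confidence and website trust.

    claims maps each website to a dict {object: fact it claims}.
    Returns (trust per website, confidence per (object, fact) pair).
    """
    # Assumption: every website starts with the same trustworthiness.
    trust = {site: initial_trust for site in claims}

    # Group the websites that provide each (object, fact) pair.
    providers = {}
    for site, facts in claims.items():
        for obj, fact in facts.items():
            providers.setdefault((obj, fact), []).append(site)

    confidence = {}
    for _ in range(rounds):
        # Assumed rule: a fact is wrong only if every provider is wrong,
        # treating the providers as independent.
        confidence = {
            key: 1.0 - prod(1.0 - trust[s] for s in sites)
            for key, sites in providers.items()
        }
        # Assumed rule: a website's trust is the mean confidence of its facts.
        trust = {
            site: sum(confidence[(o, f)] for o, f in facts.items()) / len(facts)
            for site, facts in claims.items()
            if facts  # skip websites that claim nothing
        }
    return trust, confidence


def best_facts(confidence):
    """Pick the highest-confidence fact for each object."""
    best = {}
    for (obj, fact), score in confidence.items():
        if obj not in best or score > best[obj][1]:
            best[obj] = (fact, score)
    return {obj: fact for obj, (fact, _) in best.items()}


# Hypothetical input: three websites disagree about one book's authors.
example_claims = {
    "site_a": {"book_1": "author_set_x"},
    "site_b": {"book_1": "author_set_x"},
    "site_c": {"book_1": "author_set_y"},
}
example_trust, example_confidence = iterate_trust(example_claims)
example_truth = best_facts(example_confidence)
```

In this toy input the fact backed by two websites ends up with higher confidence than the fact backed by one, and its providers end up with higher trust. With equal initial trust this reduces to counting support, so the sketch only illustrates the iteration; it does not show how trust learned on many objects can overrule a popular but wrong claim.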
In this paper, we only study the facts that are either properties of

Published by the IEEE Computer Society
The World Wide Web has become a necessary part of our lives and might have become the most important information source for most people. Every day, people retrieve all kinds of information from the Web. For example, when shopping online, people find product specifications from websites like Amazon.com or ShopZilla.com. When looking for interesting DVDs, they get information and read movie reviews on websites such as NetFlix.com or IMDB.com. When they want...