Google News Personalization: Scalable Online Collaborative Filtering

  • Topic: Recommender system, Cluster analysis, Google
  • Published : November 8, 2010
WWW 2007 / Track: Industrial Practice and Experience

May 8-12, 2007. Banff, Alberta, Canada

Abhinandan Das
Google Inc. 1600 Amphitheatre Pkwy, Mountain View, CA 94043
abhinandan@google.com

Mayur Datar
Google Inc. 1600 Amphitheatre Pkwy, Mountain View, CA 94043
mayur@google.com

Ashutosh Garg
Google Inc. 1600 Amphitheatre Pkwy, Mountain View, CA 94043
ashutosh@google.com

Shyam Rajaram
University of Illinois at Urbana Champaign, Urbana, IL 61801
rajaram1@ifp.uiuc.edu

ABSTRACT
Several approaches to collaborative filtering have been studied, but seldom have studies been reported for large (several million users and items) and dynamic (the underlying item set is continually changing) settings. In this paper we describe our approach to collaborative filtering for generating personalized recommendations for users of Google News. We generate recommendations using three approaches: collaborative filtering using MinHash clustering, Probabilistic Latent Semantic Indexing (PLSI), and covisitation counts. We combine recommendations from different algorithms using a linear model. Our approach is content agnostic and consequently domain independent, making it easily adaptable for other applications and languages with minimal effort. This paper will describe our algorithms and system setup in detail, and report results of running the recommendations engine on Google News.

Categories and Subject Descriptors: H.4.m [Information Systems]: Miscellaneous

General Terms: Algorithms, Design

Keywords: Scalable collaborative filtering, online recommendation system, MinHash, PLSI, MapReduce, Google News, personalization

[...] me something interesting. In such cases, we would like to present recommendations to a user based on her interests as demonstrated by her past activity on the relevant site. Collaborative filtering is a technology that aims to learn user preferences and make recommendations based on user and community data. It complements content-based filtering (e.g., keyword-based searching). Probably the best-known use of collaborative filtering is by Amazon.com, where a user’s past shopping history is used to recommend new products. Various approaches to collaborative filtering have been proposed in the research community (see Section 3 for details).
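As a rough illustration of the linear combination mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes each algorithm emits a per-story score; the function name `combine_scores`, the weights, and the story identifiers are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def combine_scores(per_algorithm_scores, weights):
    """Blend per-story scores from several recommenders with a weighted sum.

    per_algorithm_scores: {algorithm_name: {story_id: score}}
    weights: {algorithm_name: weight}  (hypothetical values, not from the paper)
    """
    combined = defaultdict(float)
    for algorithm, scores in per_algorithm_scores.items():
        # An algorithm without a configured weight contributes nothing.
        w = weights.get(algorithm, 0.0)
        for story_id, score in scores.items():
            combined[story_id] += w * score
    # Highest combined score first; ties are broken by story id for determinism.
    return sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical scores from the three approaches named in the abstract.
scores = {
    "minhash": {"story_a": 0.5, "story_b": 0.25},
    "plsi": {"story_a": 0.25, "story_c": 1.0},
    "covisitation": {"story_b": 1.0},
}
weights = {"minhash": 1.0, "plsi": 1.0, "covisitation": 0.5}
ranking = combine_scores(scores, weights)
```

A weighted sum keeps the blending step independent of how each score is produced, so an individual algorithm can be reweighted or dropped without touching the others.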
Our aim was to build a scalable online recommendation engine that could be used for making personalized recommendations on a large web property like Google News. Quality of recommendations notwithstanding, the following requirements set us apart from most (if not all) of the known recommender systems:

Scalability: Google News (http://news.google.com) is visited by several million unique visitors over a period of a few days. The number of items (news stories, as identified by clusters of news articles) is also on the order of several million.

Item Churn: Most systems assume that the underlying item set is either static or that the amount of churn is minimal, which in turn is handled either by approximately updating the models ([14]) or by rebuilding the models every so often to incorporate any new items. Rebuilding, typically an expensive task, is not done too frequently (every few hours). However, for a property like Google News, the underlying item set undergoes churn (insertions and deletions) every few minutes, and at any given time the stories of interest are the ones that appeared in the last couple of hours. Therefore any model older than a few hours may no longer be of interest, and partial updates will not work.

For the above reasons, we found the existing recommender systems unsuitable for our needs and embarked on a new approach with novel scalable algorithms. We believe that Amazon also does recommendations at a similar scale. However, it is the second point (item churn) that distinguishes us significantly from their system. This paper describes our approach and...