Recommended Systems using Collaborative Filtering and Classification Algorithms in Data Mining
Dhwani Shah 2008A7PS097G
Mentor – Mrs. Shubhangi Gawali
BITS – Pilani, K.K. Birla Goa
INDEX
S. No. Topic
1. Introduction to Recommended Systems
2. Problem Statement
3. Apriori Algorithm Pseudo Code
4. Apriori Algorithm Example
5. Classification
6. Classification Techniques
7. k-NN Algorithm
8. Determine a Good Value of k
9. References
1. Introduction to Recommended Systems
Recommended systems (more commonly called recommender systems) are a type of information filtering system that attempts to recommend information items (movies, TV programs, shows, or episodes, video on demand, music, books, news, images, web pages, scientific literature such as research papers, etc.) that are likely to be of interest to the user. Recommendations can be based on the demographics of the users, on overall top-selling items, or on users' past buying habits as a predictor of future purchases.
Collaborative Filtering (CF)
Collaborative filtering is among the most widely used and successful recommendation techniques to date. The basic idea of CF-based algorithms is to provide item recommendations or predictions based on the opinions of other like-minded users. These opinions can be obtained explicitly from the users or inferred through implicit measures. Collaborative filtering techniques collect data and build profiles, then determine the relationships among the data according to similarity models. The data in the profiles may include user preferences, user behavior patterns, or item properties.

Everyday Examples of Collaborative Filtering
• Bestseller lists
• Top 40 music lists
• The "recent returns" shelf at the library
• Many weblogs
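The idea of predicting from like-minded users can be sketched in a few lines of Python. This is a minimal illustration, not the method used in this report: the ratings table is invented, cosine similarity over co-rated items stands in for the "similarity model" (Pearson correlation is a common alternative), and the function names `cosine_similarity` and `predict` are hypothetical.

```python
from math import sqrt

# Hypothetical explicit ratings (user -> {item: stars}); invented for illustration.
ratings = {
    "alice": {"Matrix": 5, "Titanic": 1, "Inception": 4},
    "bob":   {"Matrix": 4, "Titanic": 2, "Inception": 5, "Up": 4},
    "carol": {"Matrix": 1, "Titanic": 5, "Up": 2},
}

def cosine_similarity(a, b):
    """Cosine similarity computed over the items both users rated (0.0 if none)."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(a[i] ** 2 for i in common))
    norm_b = sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)

def predict(user, item, ratings):
    """Similarity-weighted average of the ratings other users gave to `item`."""
    num = den = 0.0
    for other, their_ratings in ratings.items():
        if other == user or item not in their_ratings:
            continue
        sim = cosine_similarity(ratings[user], their_ratings)
        num += sim * their_ratings[item]
        den += sim
    # None signals that no similar user has rated the item.
    return num / den if den else None

# Alice has not rated "Up"; Bob (similar tastes) counts more than Carol.
print(round(predict("alice", "Up", ratings), 2))  # about 3.43
```

Because Bob's ratings resemble Alice's far more than Carol's do, the prediction lands closer to Bob's 4 stars than to Carol's 2. An item nobody comparable has rated yields `None`, which is the missing-information problem in miniature.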
Challenges of Collaborative Filtering
• A lack of information degrades the recommendation results. Because recommendations rely on mining relationships, new items that have not yet been rated or labeled may be left out of the recommendation process.
• Collaborative filtering handles extreme cases poorly. If user profiles are small or users have unique tastes, reliable similarity judgments cannot be established.
• If new user information has to be included in the recommendation process in real time, data latency increases the waiting time for the query result. The computational complexity of the recommendation directly affects how long the user waits.
• Synchronization of profile updates is another issue when hundreds of users query the system within a very short time period.
Explicit vs. Implicit Data Collection
In order to make any recommendations, the system has to collect data. The ultimate goal of collecting the data is to get an idea of user preferences, which can later be used to predict future user preferences. There are two ways to gather the data. The first is to ask for explicit ratings from a user, typically on a concrete rating scale (such as rating a movie from one to five stars). The second is to gather data implicitly while the user is in the domain of the system, that is, to log the actions of a user on the site.

Explicit data is easy to work with. Presumably, the ratings that a user provides can be interpreted directly as the user's preferences, making it easier to extrapolate from the data to predict future ratings. The drawback of explicit data is that it puts the responsibility for data collection on the user, who may not want to take the time to enter ratings.

Implicit data, on the other hand, is easy to collect in large quantities without any extra effort on the part of the user. Unfortunately, it is much more difficult to work with, since user behavior must first be converted into user preferences.

Of course, these two methods of gathering data are not mutually exclusive. A combination of the two has the potential for the best overall results - one could gain the advantages of explicit...
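Converting logged behavior into preferences, and combining it with explicit ratings, might look like the following sketch. Everything here is an assumption made for illustration: the action names, the weights in `ACTION_WEIGHTS`, the capping rule, and the helper names `implicit_scores` and `combine` do not come from this report, and a real system would tune them against data.

```python
from collections import defaultdict

# Hypothetical weights: how strongly each logged action suggests interest.
ACTION_WEIGHTS = {"view": 1.0, "add_to_cart": 3.0, "purchase": 5.0}

def implicit_scores(log):
    """Sum action weights per (user, item) from (user, item, action) log entries."""
    scores = defaultdict(lambda: defaultdict(float))
    for user, item, action in log:
        scores[user][item] += ACTION_WEIGHTS.get(action, 0.0)  # unknown actions count 0
    return {user: dict(items) for user, items in scores.items()}

def combine(explicit, implicit, max_stars=5.0):
    """Use an explicit rating when one exists; otherwise fall back to the
    implicit score, capped at the top of the rating scale (an assumed rule)."""
    combined = {}
    for user in set(explicit) | set(implicit):
        prefs = {item: min(score, max_stars)
                 for item, score in implicit.get(user, {}).items()}
        prefs.update(explicit.get(user, {}))  # explicit ratings override
        combined[user] = prefs
    return combined

# Invented sample data.
log = [
    ("alice", "Matrix", "view"),
    ("alice", "Matrix", "purchase"),
    ("alice", "Up", "view"),
    ("bob", "Up", "add_to_cart"),
]
explicit = {"alice": {"Up": 4}}

implicit = implicit_scores(log)
preferences = combine(explicit, implicit)
print(preferences["alice"])  # {'Matrix': 5.0, 'Up': 4}
```

Alice's explicit 4-star rating for "Up" replaces the weak implicit signal from a single view, while her view plus purchase of "Matrix" fills in a preference she never rated.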
I would like to thank Mrs. Shubhangi Gawali for being an excellent mentor and a patient guide throughout this whole learning process.