Data mining is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data (Fayyad). The most commonly used techniques in data mining are artificial neural networks, decision trees, genetic algorithms, the nearest-neighbour method, and rule induction. Data mining research has drawn on a number of other fields, such as inductive learning, machine learning, and statistics.

Machine learning: the automation of a learning process, where learning is based on observations of environmental statistics and transitions. Machine learning examines previous examples and their outcomes and learns how to reproduce these and make generalizations about new cases.

Inductive learning: induction means the inference of information from data, and inductive learning is a model-building process in which the database is analyzed to find patterns. The main strategies are supervised learning and unsupervised learning.

Statistics: used to detect unusual patterns and to explain patterns using statistical models such as linear models.

A data mining model can be either a discovery model, in which the system automatically discovers important information hidden in the data, or a verification model, which takes a hypothesis from the user and tests its validity against the data.

The web contains a collection of pages that includes countless hyperlinks and huge volumes of access and usage information. Because of the ever-increasing amount of information in cyberspace, knowledge discovery and web mining are becoming critical for successfully conducting business in the cyber world. Web mining is the discovery and analysis of useful information from the web: the use of data mining techniques to automatically discover and extract information from web documents and services (content, structure, and usage). Two different approaches were taken in initially defining web mining: i. Process-centric view: web mining as a sequence of tasks. ii.
Data-centric view: web mining defined in terms of the web data used in the mining process.

The important data mining techniques applied in the web domain include association rules, sequential pattern discovery, clustering, path analysis, classification, and outlier discovery.

i. Association rule mining: predicts the association and correlation among sets of items, “where the presence of one set of items in a transaction implies (with a certain degree of confidence) the presence of other items.” That is, it:
1) Discovers the correlations between pages that are most often referenced together in a single server session or user session.
2) Provides the following information:
   i. Which sets of pages are frequently accessed together by web users?
   ii. What page will be fetched next?
   iii. Which paths are frequently accessed by web users?
3) Draws on associations and correlations from several sources:
   i. Page associations from usage data: user sessions, user transactions.
   ii. Page associations from content data: similarity based on content analysis.
   iii. Page associations based on structure: link connectivity between pages.
Advantages:
a) Guides web site restructuring, by adding links that interconnect pages often viewed together.
b) Improves system performance by prefetching web data.

ii. Sequential pattern discovery: applied to web access server transaction logs. The purpose is to discover sequential patterns that indicate user visit patterns over a certain period, that is, the order in which URLs tend to be accessed.
Advantages:
a) Useful user trends can be discovered.
b) Predictions concerning visit patterns can be made.
c) Website navigation can be improved.
d) Advertisements can be personalized.
e) The link structure can be dynamically reorganized and web site contents adapted to individual client requirements, or clients can be provided with automatic recommendations that best suit customer profiles.

iii. Clustering: groups together items (users, pages, etc.) that have similar characteristics. a) Page clusters: groups of pages that seem to...
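To make the association rule and sequential pattern ideas described earlier more concrete, the following is a minimal sketch, not a definitive implementation. It assumes that user sessions have already been extracted from the server log as ordered lists of page URLs. The sample sessions, the function names (association_rules, next_page_counts), and the threshold values are all hypothetical and chosen only for illustration. The first function covers only rules between pairs of pages, and the second counts only immediately consecutive page requests, which is a simplified stand-in for full sequential pattern discovery.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sessions: each is the ordered list of pages one visitor requested.
SESSIONS = [
    ["/home", "/products", "/cart"],
    ["/home", "/products"],
    ["/home", "/about"],
    ["/products", "/cart"],
]


def association_rules(sessions, min_support=0.5, min_confidence=0.6):
    """Return sorted (antecedent, consequent, support, confidence) tuples.

    support    = fraction of sessions that contain both pages
    confidence = sessions containing both pages / sessions containing the antecedent
    """
    n = len(sessions)
    page_count = Counter()
    pair_count = Counter()
    for session in sessions:
        pages = sorted(set(session))  # visit order and repeats are ignored here
        page_count.update(pages)
        pair_count.update(combinations(pages, 2))

    rules = []
    for (a, b), count in pair_count.items():
        support = count / n
        if support < min_support:
            continue
        # Each frequent pair can yield a rule in either direction.
        for x, y in ((a, b), (b, a)):
            confidence = count / page_count[x]
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return sorted(rules)


def next_page_counts(sessions):
    """Count how often page b is requested immediately after page a."""
    transitions = Counter()
    for session in sessions:
        transitions.update(zip(session, session[1:]))
    return transitions


if __name__ == "__main__":
    for x, y, support, confidence in association_rules(SESSIONS):
        print(f"{x} -> {y}  support={support:.2f}  confidence={confidence:.2f}")
    for (a, b), count in next_page_counts(SESSIONS).most_common():
        print(f"{a} then {b}: {count}")
```

Under these assumptions, a rule with high support and confidence suggests a pair of pages that could be interlinked or prefetched together, while the transition counts give a rough answer to which page tends to be fetched next.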