Ghanshyam Verma, Shruthi Varadhan
Computer Technology Department
Abstract— Data mining is the process that results in the
discovery of new patterns in large data sets. Data mining involves six common classes of tasks: anomaly detection, association rule mining, clustering, classification, regression, and
summarization. Association rule learning is a popular and well-researched method for discovering interesting relations between variables in large databases. Association rules are employed today in many application areas, including Web usage mining, intrusion detection, and bioinformatics. Apriori is a classic algorithm for learning association rules. Apriori is designed to operate on databases containing transactions. As is common in association rule mining, given a set of itemsets, the algorithm attempts to find subsets which are common to at least a minimum number C of the itemsets. Apriori uses a "bottom-up" approach, where frequent subsets are extended one item at a
time (a step known as candidate generation), and groups of
candidates are tested against the data. The algorithm terminates when no further successful extensions are found. The quest to mine frequent patterns appears in many domains. The
prototypical application is market basket analysis, i.e., mining the sets of items that are frequently bought together. What makes Apriori so popular is that it uses the downward-closure property of pattern support (all subsets of a frequent pattern must themselves be frequent) to prune the search space. Thus only frequent patterns of size k are used to generate candidate patterns of size k+1. Many parallel data mining algorithms inherit this property from Apriori, which is why most parallel data mining algorithms are said to be variations of Apriori.
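The level-wise search described above (generate size-(k+1) candidates from frequent size-k itemsets, prune via downward closure, then count support against the data) can be sketched as follows. This is a minimal illustration with invented names, not a reference implementation:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise frequent-itemset mining with downward-closure pruning.
    `transactions` is a list of sets; `min_support` is the absolute
    minimum number C of transactions an itemset must appear in."""
    # Level 1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s for s, c in counts.items() if c >= min_support}
    all_frequent = set(frequent)
    k = 2
    while frequent:
        # Candidate generation: join frequent (k-1)-itemsets, then prune
        # any candidate with an infrequent (k-1)-subset (downward closure).
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k - 1))}
        # Support counting: test each surviving candidate against the data.
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        frequent = {c for c, n in counts.items() if n >= min_support}
        all_frequent |= frequent
        k += 1  # The loop terminates when no extension survives.
    return all_frequent
```

With the toy baskets `[{'bread','milk'}, {'bread','butter'}, {'bread','milk','butter'}, {'milk'}]` and C = 2, the pair {milk, butter} is dropped at level 2, so the candidate {bread, milk, butter} is pruned at level 3 without ever being counted.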
Keywords— data mining, anomaly detection, association rule mining, clustering, classification, regression, summarization, Apriori, downward-closure property, market basket analysis, itemsets, enumeration, token, support, confidence, frequent itemsets, minimum support, array list, hash list, rule generation, antecedent, consequent, implementation, parallel systems.
Data mining, a relatively young and interdisciplinary field
of computer science, is the process that results in the
discovery of new patterns in large data sets. It utilizes
methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract knowledge from an
existing data set and transform it into a human-understandable structure for further use. Besides the raw analysis step, it involves database and data management aspects, data preprocessing, inference considerations, interestingness metrics, complexity considerations, post-processing of found structures, visualization, and online updating.
Data mining involves six common classes of tasks:
Anomaly detection (Outlier/change/deviation detection) – The identification of unusual data records that might be interesting, or data errors that require further investigation.
Association rule learning (Dependency modelling) –
Searches for relationships between variables. For example, a
supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can
determine which products are frequently bought together and
use this information for marketing purposes. This is
sometimes referred to as market basket analysis.
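As a minimal illustration of market basket analysis, using made-up basket data, the standard support and confidence measures for a candidate rule can be computed as:

```python
# Hypothetical basket data; support(X) is the fraction of baskets
# containing X, and confidence(X -> Y) = support(X ∪ Y) / support(X).
baskets = [{'bread', 'milk'}, {'bread', 'butter'},
           {'bread', 'milk', 'butter'}, {'milk'}]

def support(itemset):
    return sum(1 for b in baskets if itemset <= b) / len(baskets)

# Candidate rule: {bread} -> {milk}
conf = support({'bread', 'milk'}) / support({'bread'})
print(conf)  # 2/3: two of the three bread baskets also contain milk
```

A supermarket would keep only rules whose support and confidence clear user-chosen thresholds, then use the surviving rules (e.g., for shelf placement or promotions).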
Clustering – is the task of discovering groups and
structures in the data that are in some way or another "similar", without using known structures in the data.
Classification – is the task of generalizing known
structure to apply to new data. For example, an e-mail
program might attempt to classify an e-mail as "legitimate"...