Data Mining Fundamentals

Only available on StudyMode
  • Published : January 16, 2012
Data Mining

DM Defined
Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.

It is also described as the process of analyzing data from different perspectives and summarizing it into useful information.

A third definition: a class of database applications that look for hidden patterns in a group of data that can be used to predict future behavior.

The relationships and summaries derived are referred to as models or patterns. Examples include linear equations, rules, clusters, graphs, tree structures and recurrent patterns in time series.

Data mining uses observational data as opposed to experimental data, that is, data that have already been collected for some purpose other than the data mining analysis.

The relationships and structures sought should be novel. There is little point in regurgitating known relationships unless the analysis is confirmatory, that is, testing an existing hypothesis.

“Concepts”
Definition: A “concept” is a set of objects, symbols or events grouped together because they share certain characteristics. Roughly, a concept corresponds to a set, class, group, or cluster.

Classical View: A concept is a set with well-defined, deterministic inclusion rules. E.g. A home owner is a good credit risk.

Probabilistic View: A set with probabilistic inclusion rules. E.g. A home owner has an 80% chance of being a good credit risk.
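
The classical and probabilistic views above differ only in what the inclusion rule returns: a yes/no answer versus a probability. A minimal sketch follows, assuming a single hypothetical input `owns_home`. The 80% figure is the one used in the text. Treating non-owners as "not a good risk" in the classical rule, and the 0.5 value for non-owners, are placeholder assumptions made only so the functions are complete.

```python
# Two views of the concept "good credit risk", sketched as inclusion rules.
# Function names and the handling of non-owners are hypothetical.

def classical_good_credit_risk(owns_home: bool) -> bool:
    """Classical view: a deterministic rule, the instance is in or out."""
    # Assumption: non-owners are treated as outside the concept.
    return owns_home


def probabilistic_good_credit_risk(owns_home: bool) -> float:
    """Probabilistic view: return the probability of belonging to the concept."""
    # 0.8 is the figure from the text; 0.5 for non-owners is a placeholder.
    return 0.8 if owns_home else 0.5


print(classical_good_credit_risk(True))
print(probabilistic_good_credit_risk(True))
```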

Exemplar View: A given instance is determined to be an example of a particular concept if the instance is “similar enough” to a set of “one or more known examples” of the concept. E.g. Mr. Smith owns his own home and is a good credit risk.

Example: An Investment Dataset

Possible Business Questions
“Supervised” Learning
In the last two questions, we single out ONE of the attributes, which we would like to be able to determine from the values of the others.
• What characteristics distinguish between Online and Broker investors? (DISCRIMINATION). Transaction method (categorical) is the target variable.
• Can I develop a model which will predict the average trades/month for a new investor? (PREDICTION). Trades/month (real) is the target variable.
The target variable is called the “Output variable”.

The other variables are called “Input variables”.
Clearly, which attributes are the output and input variables depends on your question. For these two questions, we KNOW the values of the output variable for the cases in the dataset. In such cases we say that we do “SUPERVISED” learning, since the learning is controlled by the known values of the output variable in the dataset.
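
As an illustration of the PREDICTION question, here is a minimal sketch of supervised learning using ordinary least squares on one input variable. The investment dataset itself is not reproduced in this text, so the input attribute (annual income) and every value below are made up. Only the output variable, trades/month, comes from the question above.

```python
# Supervised learning sketch: the known output values drive the fit.
# Each record is (annual income in $1000s, trades per month).
# The income attribute and all numbers are hypothetical.
training = [(30.0, 2.0), (50.0, 4.0), (70.0, 6.0), (90.0, 8.0)]


def fit_line(records):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(records)
    mean_x = sum(x for x, _ in records) / n
    mean_y = sum(y for _, y in records) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in records)
    sxx = sum((x - mean_x) ** 2 for x, _ in records)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(training)


def predict(income):
    """Predict trades/month for a new investor from the fitted line."""
    return slope * income + intercept


print(predict(60.0))
```

The same pattern applies to the DISCRIMINATION question, except that the output variable is categorical, so a classifier would take the place of the fitted line.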

“Unsupervised” Learning
For the question:
“Can I develop a general characterisation/profile of different investor types? (CLASSIFICATION)”, NO particular attribute is singled out as an OUTPUT variable.

The question is open-ended.
• We do not know if there are any different investor types at all.
• If there are different investor types, we do not know how many types there are.
• If there are different investor types, we do not know what the various investor types (or classes, or concepts) mean. We have to determine the meaning of the concepts, and appropriate names, after we have determined that they exist.
• The method of induction-based learning used is said to be UNSUPERVISED in such a situation, because there are no known output classes to control the learning process.

Another Example Dataset

Two Concept Learning Paradigms
Supervised Learning
Builds a learner model, or concept definitions, using data instances of known origin, and uses the model to determine the outcome of new instances of unknown origin.

Unsupervised Learning
A data mining method that builds models from data without predefined classes. Usually used for classification/clustering.
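
To make the contrast concrete, here is a minimal sketch of unsupervised learning using k-means clustering. The text does not name a specific algorithm, so k-means is just one common choice. The two attributes (age, trades/month), all values, the choice of two clusters, and the hand-picked starting centroids are assumptions for illustration.

```python
# Unsupervised learning sketch: no output variable, only input attributes.
# Each point is (age, trades per month); attributes and values are hypothetical.
points = [(25.0, 10.0), (27.0, 12.0), (30.0, 11.0),
          (55.0, 2.0), (60.0, 1.0), (58.0, 3.0)]


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def mean_point(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def k_means(data, centroids, iterations=10):
    """Alternate assignment and centroid update for a fixed number of rounds."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in data:
            nearest = min(range(len(centroids)),
                          key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: keep the old centroid if a cluster ends up empty.
        centroids = [mean_point(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


# Assumption: two clusters, seeded by hand with one point from each end.
centroids, clusters = k_means(points, centroids=[points[0], points[3]])
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

The algorithm only returns groups. Deciding what each group means, and what to call it, is still left to the analyst, as noted above.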

We return to supervised and unsupervised learning later.

Elements in Data Mining
Extract, transform, and load transaction data onto the data warehouse...