Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.
The process of analyzing data from different perspectives and summarizing it into useful information.
A class of database applications that look for hidden patterns in a group of data that can be used to predict future behavior.
DM Defined
The relationships and summaries derived are referred to as models or patterns. Examples include linear equations, rules, clusters, graphs, tree structures and recurrent patterns in time series.
Data mining uses observational data as opposed to experimental data, that is, data that have already been collected for some purpose other than the data mining analysis.
The relationships and structures sought should be novel. There is little point in regurgitating already known relationships unless the analysis is confirmatory, i.e. it tests a ‘confirmatory hypothesis’.
“Concepts”
• Definition: A “concept” is a set of objects, symbols or events grouped together because they share certain characteristics. “Concept” is roughly synonymous with set, class, group or cluster.
• Classical View: A concept is a set with well-defined, deterministic inclusion rules. E.g. A home owner is a good credit risk.
• Probabilistic View: A concept is a set with probabilistic inclusion rules. E.g. A home owner has an 80% chance of being a good credit risk.
• Exemplar View: A given instance is an example of a particular concept if it is “similar enough” to a set of one or more known examples of the concept. E.g. Mr. Smith owns his own home and is a good credit risk.
Example: An Investment Dataset
Possible Business Questions
• What characteristics distinguish between Online and Broker investors? (DISCRIMINATION) The target variable is Transaction method (categorical).
• Can I develop a model which will predict the average trades/month for a new investor? (PREDICTION) The target variable is Trades/month (real).
In these two questions, we single out ONE attribute whose value we would like to determine from the values of the others.
The target variable is called the “Output variable”.
The other variables are called “Input variables”.
Clearly, which attributes are the output and input variables depends on your question. For these questions, we KNOW the values of the output variable for the cases in the dataset. In such cases we say that we do “SUPERVISED” learning, since the learning is controlled by the known values of the output variable in the dataset.
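The dependence of input and output variables on the question can be sketched in a few lines of Python. This is only an illustration: the records, the attribute names `age` and `income`, and the helper names `split`, `distance` and `predict` are assumptions invented here, and the 1-nearest-neighbour rule is just one simple supervised learner, not a method named in these slides.

```python
# Hypothetical investor records: every value below is invented for illustration.
investors = [
    {"age": 24, "income": 35, "method": "Online", "trades_per_month": 9.0},
    {"age": 29, "income": 42, "method": "Online", "trades_per_month": 7.0},
    {"age": 55, "income": 80, "method": "Broker", "trades_per_month": 2.0},
    {"age": 61, "income": 95, "method": "Broker", "trades_per_month": 1.0},
]


def split(records, output):
    """Separate each record into its input variables and its known output value."""
    inputs = [{k: v for k, v in r.items() if k != output} for r in records]
    outputs = [r[output] for r in records]
    return inputs, outputs


def distance(a, b):
    """Unscaled squared distance over the input variables (a sketch, not tuned)."""
    total = 0.0
    for key in a:
        if isinstance(a[key], (int, float)):
            total += (a[key] - b[key]) ** 2          # numeric attribute
        else:
            total += 0.0 if a[key] == b[key] else 1.0  # categorical attribute
    return total


def predict(records, output, new_inputs):
    """1-nearest-neighbour: return the known output of the most similar case."""
    inputs, outputs = split(records, output)
    nearest = min(range(len(inputs)), key=lambda i: distance(new_inputs, inputs[i]))
    return outputs[nearest]


# DISCRIMINATION: transaction method is the output variable.
method_guess = predict(
    investors, "method", {"age": 27, "income": 40, "trades_per_month": 8.0}
)

# PREDICTION: trades/month is the output variable; method is now an input.
trades_guess = predict(
    investors, "trades_per_month", {"age": 58, "income": 85, "method": "Broker"}
)

print(method_guess, trades_guess)
```

The same records serve both questions; only the choice of output variable changes. In both cases the known output values in the dataset drive the answer, which is what makes the learning supervised.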
For the question:
“Can I develop a general characterisation/profile of different investor types? (CLASSIFICATION)”, NO particular attribute is singled out as an OUTPUT variable.
• The question is open-ended.
• We do not know if there are any different investor types at all.
• If there are different investor types, we do not know how many types there are.
• If there are different investor types, we do not know what the various investor types (or classes, or concepts) mean. We have to determine the meaning of the concepts, and appropriate names, after we have determined that they exist.
• The method of induction-based learning used in such a situation is said to be UNSUPERVISED, because there are no known output classes to control the learning process.
Another Example Dataset
Two Concept Learning Paradigms: supervised and unsupervised
Supervised learning
– builds a learner model, or concept definitions, using data instances of known origin;
– uses the model to determine the outcome of new instances of unknown origin.
Unsupervised learning
– A data mining method that builds models from data without predefined classes.
– Usually used for classification/clustering.
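The unsupervised paradigm can be sketched with a small clustering routine. k-means is used here only because it is a common clustering method; it is not named in these slides. The data points, the pairing of age with trades/month, and the function name `kmeans` are assumptions made for illustration.

```python
# Hypothetical (age, trades per month) pairs with NO class labels attached.
points = [(24, 9.0), (29, 7.0), (26, 8.0), (55, 2.0), (61, 1.0), (58, 1.5)]


def kmeans(points, k, iterations=10):
    """Plain k-means: group points into k clusters without using any labels."""
    centroids = list(points[:k])  # deterministic start: the first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        # (unscaled squared Euclidean distance, so age dominates here).
        labels = [
            min(
                range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
            )
            for point in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

The routine returns only cluster numbers and centroids. Deciding what each cluster means, and what to call it, is left to the analyst after the clusters have been found. Note also that the number of clusters `k` has to be supplied, whereas in practice it is not known in advance.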
Elements in Data Mining
•Extract, transform, and load transaction data onto the data warehouse...