1. What is data mining? In your answer, address the following:
Data mining refers to the process or method that extracts or "mines" interesting knowledge or patterns from large amounts of data.
(a) Is it another hype?
Data mining is not another hype. Instead, the need for data mining has arisen due to the wide availability of huge amounts of data and the imminent need for turning such data into useful information and knowledge. Thus, data mining can be viewed as the result of the natural evolution of information technology.
(b) Is it a simple transformation or application of technology developed from databases, statistics, machine learning, and pattern recognition?
No. Data mining is more than a simple transformation of technology developed from databases, statistics, and machine learning. It involves an integration of techniques from multiple disciplines such as database technology, statistics, machine learning, high-performance computing, pattern recognition, neural networks, data visualization, information retrieval, image and signal processing, and spatial data analysis.
(c) We have presented a view that data mining is the result of the evolution of database technology. Do you think that data mining is also the result of the evolution of machine learning research? Can you present such views based on the historical progress of this discipline? Do the same for the fields of statistics and pattern recognition.
(d) Describe the steps involved in data mining when viewed as a process of knowledge discovery.
The steps involved in data mining when viewed as a process of knowledge discovery are as follows:
* Data cleaning, a process that removes or transforms noise and inconsistent data
* Data integration, where multiple data sources may be combined
* Data selection, where data relevant to the analysis task are retrieved from the database
* Data transformation, where data are transformed or consolidated into forms appropriate for mining
* Data mining, an essential process where intelligent and efficient methods are applied in order to extract patterns
* Pattern evaluation, a process that identifies the truly interesting patterns representing knowledge based on some interestingness measures
* Knowledge presentation, where visualization and knowledge representation techniques are used to present the mined knowledge to the user
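The seven steps above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy purchase records, the function names, the choice of item-pair counting as the "mining" method, and the use of support as the interestingness measure. It is not a prescribed implementation of the knowledge discovery process, only one small instance of it.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: two sources of (customer, item) purchase records.
store_a = [("c1", "milk"), ("c1", "bread"), ("c2", "milk"),
           ("c2", None), ("c1", "gift card")]
store_b = [("c2", "bread"), ("c3", "milk"), ("c3", "bread"), ("c3", "Eggs ")]


def clean(records):
    """Data cleaning: drop records with a missing item, normalize item names."""
    return [(c, i.strip().lower()) for c, i in records if i is not None]


def integrate(*sources):
    """Data integration: combine several sources into one collection."""
    return [record for source in sources for record in source]


def select(records, items_of_interest):
    """Data selection: keep only records relevant to the analysis task."""
    return [r for r in records if r[1] in items_of_interest]


def transform(records):
    """Data transformation: consolidate records into one item set per customer."""
    baskets = {}
    for customer, item in records:
        baskets.setdefault(customer, set()).add(item)
    return baskets


def mine(baskets):
    """Data mining: count how many baskets contain each pair of items."""
    counts = Counter()
    for items in baskets.values():
        counts.update(combinations(sorted(items), 2))
    return counts


def evaluate(pair_counts, n_baskets, min_support):
    """Pattern evaluation: keep pairs whose support meets the threshold."""
    return {pair: count / n_baskets
            for pair, count in pair_counts.items()
            if count / n_baskets >= min_support}


def present(patterns):
    """Knowledge presentation: render the patterns as readable lines."""
    return [f"{a} & {b}: support {s:.2f}"
            for (a, b), s in sorted(patterns.items())]


def run_pipeline(min_support=0.5):
    records = integrate(clean(store_a), clean(store_b))
    records = select(records, {"milk", "bread", "eggs"})
    baskets = transform(records)
    patterns = evaluate(mine(baskets), len(baskets), min_support)
    return present(patterns)


if __name__ == "__main__":
    for line in run_pipeline():
        print(line)  # prints "bread & milk: support 1.00"
```

Each function corresponds to one step in the list, so the pipeline in `run_pipeline` reads in the same order as the answer. In practice the steps are often iterated rather than run once, and the mining step would be a real algorithm rather than simple pair counting.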
2. How is a data warehouse different from a database? How are they similar?
* Differences between a data warehouse and a database: A data warehouse is a repository of information collected from multiple sources, over a history of time, stored under a unified schema, and used for data analysis and decision support, whereas a database is a collection of interrelated data that represents the current status of the stored data. There could be multiple heterogeneous databases where the schema of one database may not agree with the schema of another. A database system supports ad-hoc query and on-line transaction processing.
* Similarities between a data warehouse and a database: Both are repositories of information, storing huge amounts of persistent data.
3. Define each of the following data mining functionalities: characterization, discrimination, association and correlation analysis, classification, regression, clustering, and outlier analysis. Give examples of each data mining functionality, using a real-life database that you are familiar with.
* Characterization is a summarization of the general characteristics or features of a target class of data. For example, the characteristics of students can be produced, generating a profile of all the university's first-year computing science students, which may include such information as a high GPA and a large number of courses taken.
* Discrimination is a comparison of the general features of target class data objects with the general features...