Data Preprocessing

Chapter 3

Today’s real-world databases are highly susceptible to noisy, missing, and inconsistent data because of their typically huge size (often several gigabytes or more) and their likely origin from multiple, heterogeneous sources. Low-quality data will lead to low-quality mining results. “How can the data be preprocessed in order to help improve the quality of the data and, consequently, of the mining results? How can the data be preprocessed so as to improve the efficiency and ease of the mining process?”

There are several data preprocessing techniques. Data cleaning can be applied to remove noise and correct inconsistencies in data. Data integration merges data from multiple sources into a coherent data store such as a data warehouse. Data reduction can reduce data size by, for instance, aggregating, eliminating redundant features, or clustering. Data transformations (e.g., normalization) may be applied, where data are scaled to fall within a smaller range such as 0.0 to 1.0. This can improve the accuracy and efficiency of mining algorithms that involve distance measurements. These techniques are not mutually exclusive; they may work together. For example, data cleaning can involve transformations to correct wrong data, such as transforming all entries for a date field to a common format.

In Chapter 2, we learned about the different attribute types and how to use basic statistical descriptions to study data characteristics. These descriptions can help identify erroneous values and outliers, which will be useful in the data cleaning and integration steps. Data preprocessing techniques, when applied before mining, can substantially improve the overall quality of the patterns mined, the time required for the actual mining, or both.

In this chapter, we introduce the basic concepts of data preprocessing in Section 3.1. The methods for data preprocessing are organized into the following categories: data cleaning (Section 3.2), data integration (Section 3.3), data reduction
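The two transformations mentioned above, scaling values into the range 0.0 to 1.0 and bringing a date field to a common format, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names min_max_normalize and to_iso_date are hypothetical, the income figures and date strings are made-up sample data, and the list of candidate date formats is an assumption about what a source might contain.

```python
from datetime import datetime


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map them to the lower bound to avoid dividing by zero.
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def to_iso_date(text, formats=("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")):
    """Try each assumed input format in turn and return the date as YYYY-MM-DD, or None."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    # No format matched; leave the entry as missing for a later cleaning step.
    return None


# Illustrative sample data (not taken from the text).
incomes = [12000, 73600, 98000]
normalized = min_max_normalize(incomes)

raw_dates = ["2023-01-15", "01/15/2023", "15.01.2023", "n/a"]
clean_dates = [to_iso_date(d) for d in raw_dates]
```

After normalization every value lies between 0.0 and 1.0, so an attribute with a large numeric range no longer dominates a distance computation. Entries that match none of the assumed date formats come back as None, which makes them easy to flag as missing values during cleaning.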
