
Data Preprocessing

Powerful Essays
17962 Words

Today’s real-world databases are highly susceptible to noisy, missing, and inconsistent data due to their typically huge size (often several gigabytes or more) and their likely origin from multiple, heterogeneous sources. Low-quality data will lead to low-quality mining results. “How can the data be preprocessed in order to help improve the quality of the data and, consequently, of the mining results? How can the data be preprocessed so as to improve the efficiency and ease of the mining process?”

There are several data preprocessing techniques. Data cleaning can be applied to remove noise and correct inconsistencies in the data. Data integration merges data from multiple sources into a coherent data store such as a data warehouse. Data reduction can reduce data size by, for instance, aggregating, eliminating redundant features, or clustering. Data transformations (e.g., normalization) may be applied, where data are scaled to fall within a smaller range like 0.0 to 1.0. This can improve the accuracy and efficiency of mining algorithms involving distance measurements. These techniques are not mutually exclusive; they may work together. For example, data cleaning can involve transformations to correct wrong data, such as by transforming all entries for a date field to a common format.

In Chapter 2, we learned about the different attribute types and how to use basic statistical descriptions to study data characteristics. These can help identify erroneous values and outliers, which will be useful in the data cleaning and integration steps. Data preprocessing techniques, when applied before mining, can substantially improve the overall quality of the patterns mined and/or the time required for the actual mining.

In this chapter, we introduce the basic concepts of data preprocessing in Section 3.1. The methods for data preprocessing are organized into the following categories: data cleaning (Section 3.2), data integration (Section 3.3), data reduction

You May Also Find These Documents Helpful

  • Good Essays

Data normalization is very important in the transactional, or online transaction processing (OLTP), database world, where many data modifications take place constantly and randomly throughout the stored data. In contrast, the data warehouse will contain a substantial amount of denormalized and summarized data that is…

    • 752 Words
    • 3 Pages
  • Good Essays

    Audit and organize the data. Understanding your data before cleaning improves the efficiency of your project and reduces the time and cost of data cleaning. Understand the purpose, location, flow, and workflows of your data before you start.…

    • 522 Words
    • 3 Pages
  • Powerful Essays

    Cis 500 Data Mining Report

    • 2046 Words
    • 9 Pages

This report is an analysis of the benefits of data mining to business practices. It also assesses the reliability of data mining algorithms, with examples. “Data Mining is a process that uses statistical, mathematical, artificial intelligence, and machine learning techniques…

  • Powerful Essays

    Crisp-Dm

    • 19391 Words
    • 78 Pages

    Foreword CRISP-DM was conceived in late 1996 by three “veterans” of the young and immature data mining market. DaimlerChrysler (then Daimler-Benz) was already ahead of most industrial and commercial organizations in applying data mining in its business operations. SPSS (then ISL) had been providing services based on data mining since 1990 and had launched the first commercial data mining workbench—Clementine®—in 1994. NCR, as part of its aim to deliver added value to its Teradata® data warehouse customers, had established teams of data mining consultants and technology specialists to service its clients’ requirements. At that time, early market interest in data mining was showing signs of exploding into widespread uptake. This was both exciting and terrifying. All of us had developed our approaches to data mining as we went along. Were we…

  • Good Essays

    Assuming that data mining techniques are to be used in the following cases, identify whether the task required is supervised or unsupervised learning.…

    • 362 Words
    • 2 Pages
  • Best Essays

Data Mining is an analytical process that primarily involves searching through vast amounts of data to spot useful, but initially undiscovered, patterns. The data mining process typically involves three major steps: exploration; model building and validation; and finally, deployment.…

    • 4628 Words
    • 19 Pages
  • Powerful Essays

    References: [1] R. Agrawal, T. Imielinski, A. Swami, Proceeding of the ACM SIGMOD Conference on Management of Data, 1993. [2] R. Agrawal, R. Srikant. The International Conference on Very Large Databases, 1994. [3] I. H. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, chapter 8, http://www.cs.waikato.ac.nz, 2000. [4] A. Roberts, Guide to WEKA, http://www.comp.leeds.ac.uk/andyr, 2005. [5] M. Levy, B. Weitz, Retailing Management, McGraw-Hill, New York, 2001.…

    • 2113 Words
    • 9 Pages
  • Powerful Essays

    Data Mining Problems

    • 1295 Words
    • 6 Pages

    Suppose that we are responsible for managing product placement within a local supermarket. Our shelving units have 6 shelves each and are numbered from 1 to 6—with 1 being the lowest shelf and proceeding upward until the highest shelf is assigned the number 6. While there are many placement options that we should consider, we decide to look for any correlations between the row a product is placed on and its sales. Since we have our data stored in a data warehouse, it is easily accessible and responds quickly to our data request. Consider each of the following:…

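The shelf-placement question above amounts to testing whether shelf row and sales are correlated. A minimal sketch of that check is a Pearson correlation coefficient computed by hand; the sales figures below are invented purely for illustration:

```python
# Pearson correlation between shelf row (1-6) and unit sales.
def pearson_r(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rows  = [1, 2, 3, 4, 5, 6]          # shelf the product sits on (1 = lowest)
sales = [40, 55, 90, 100, 70, 50]   # weekly units sold (invented)
print(round(pearson_r(rows, sales), 3))
```

A value near +1 or -1 would suggest a strong linear relationship between shelf height and sales; a value near 0 (as with these invented numbers, where sales peak on the middle shelves) would not, even though a nonlinear pattern may still exist.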
  • Better Essays

Companies and organizations all over the world are rushing to adopt data mining and data warehousing in an effort to keep a competitive edge. Continually improving competitiveness and business processes is a key factor in expanding and strategically maintaining the most cost-effective operation in today’s market. Every day these facilities store large amounts of data to support increased revenue, reduced cost, analysis of customer behavior patterns, and prediction of possible future trends, for example seasonal ones. Data mining is a process by which these corporations extract large amounts of data in order to analyze it from multiple angles. Data warehousing, in turn, is a process designed for analysis and queries more than transaction processing, centralizing data from multiple sources, not just online but also from transactions in stores and over the phone, as well as other procurement channels. This localizes the data, placing it into common models such as shared names and definitions. Although data mining and data warehousing can be extremely helpful and powerful tools, some organizations, such as airlines, still struggle with the information, searching for some sort of pattern (Revels, M., & Nussbaumer, H. 2013).…

    • 1305 Words
    • 6 Pages
  • Good Essays

What is data? What is information? Data is facts, numbers, statistics, or readings from a device or machine; it depends on the context. Data is what is used to make up information. In the context of transforming data into information, data is needed to produce information. Information, therefore, is the meaningful translation of a set or cluster of data into a meaningful output. Data on its own is a collection of meaningless pieces that must be composed, analyzed, and formed to produce a meaningful piece of information.…

    • 880 Words
    • 4 Pages
  • Satisfactory Essays

We will start by defining what data is and what information is, and investigate the differences between the two. Data is defined as individual facts, statistics, or items of information. Information is defined as knowledge gained through study, communication, research, instruction, etc. As we can see, data is the unit, and information is a collection of facts. In order to logically process this data into presentable facts, we must mathematically assess it into a reliable representation. For example, we could use the linear formula y = mx + b and fit the data to the slope of a trend line, giving the reader a visual depiction of the information presented.…

    • 591 Words
    • 3 Pages
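The y = mx + b trend described above can be fit to observed data with ordinary least squares. This is a minimal sketch; the data points are invented and chosen to lie exactly on a line so the result is easy to check:

```python
# Ordinary least-squares fit of y = m*x + b.
def fit_line(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - m * mean_x
    return m, b

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]          # exactly y = 2x + 1
m, b = fit_line(xs, ys)
print(m, b)                # slope 2.0, intercept 1.0
```

With real, noisy data the fitted slope summarizes the trend rather than reproducing it exactly, which is what makes the line useful as a visual depiction of the underlying information.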
  • Better Essays

    “Data mining is the practice of automatically searching large stores of data to discover patterns and trends that go beyond simple analysis. Data mining uses sophisticated mathematical algorithms to segment the data and evaluate the probability of future events. Data mining is also known as Knowledge Discovery in Data (KDD)” (Oracle, 2008). As stated, data mining is used to help find patterns and relationships stored within large sets of data, these patterns and relationships are then used to provide knowledge and value to the end user. The data can help prove and support earlier predictions usually based on statistics or aid in uncovering new information about products and customers. It is usually used by business intelligence organizations, and financial analysts, but is increasingly being used in the sciences to extract information from the enormous data sets generated by modern experimental and observational methods. Data…

    • 3024 Words
    • 13 Pages
  • Satisfactory Essays

    Rapidminer

    • 493 Words
    • 2 Pages

1. Dataset
For this tutorial, we will work on some unlabeled data from the US Census Bureau. The following introduction to the dataset will help you learn about its attributes and interpret results. The attributes of the raw data were discretized to have fewer attribute values, producing the data we see now. Descriptions of the raw data attributes are at: http://archive.ics.uci.edu/ml/databases/census1990/USCensus1990raw.attributes.txt

Some attributes are kept the same from the raw dataset to the current dataset, with an “i” attached to the front of the current attribute name indicating it is unchanged; the discretized attributes of the raw dataset are named with a “d” added in front of their original names. For example, in the current dataset, the attribute “dAge” is discretized from the raw dataset, and its description is “AAGE” (Age) in the raw data description; “iAvail” means the attribute values are unchanged from the raw values, and its corresponding attribute is “AVAIL” (Available for work) in the raw data description. For more information, the mapping functions from raw attributes to current attributes can be found here: http://archive.ics.uci.edu/ml/databases/census1990/USCensus1990.mapping.sql

The file used in this tutorial is an abbreviated version of the dataset, containing the first 10,000 instances out of 2,458,285. [Note: If your computer does not have much memory, you will notice that the following clustering process executes very slowly. In that case you may use the file UScensus_3000.xlsx for this lab. With only 3,000 instances it may not give results as interesting as the larger file, but it requires much less memory than the 10,000-instance set.]

Start RapidMiner, use Read Excel on UScensus_10000.xlsx, set the role of “case ID” to id, then store the dataset to your repository (please recall tutorial 2 on importing and storing data). Please note the dataset is a little bigger than those we have worked on,…

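The clustering step that this tutorial runs inside RapidMiner can be mirrored in miniature. The sketch below is a hand-rolled k-means on invented two-column numeric rows, not the actual census attributes, just to show what the operator is doing under the hood:

```python
import random

# Minimal k-means: assign each point to its nearest center, then move
# each center to the mean of its assigned points, and repeat.
def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # start from k distinct data points
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centers[c])))
            groups[nearest].append(p)
        centers = [tuple(sum(col) / len(g) for col in zip(*g)) if g
                   else centers[i]           # keep an empty cluster's old center
                   for i, g in enumerate(groups)]
    return centers

rows = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]  # two obvious clusters
print(sorted(kmeans(rows, k=2)))
```

On the real census data the memory cost comes from holding all instances and recomputing distances each iteration, which is why the tutorial suggests the smaller 3,000-instance file on low-memory machines.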
  • Satisfactory Essays

    Data Processing

    • 395 Words
    • 2 Pages

During the collection of data, our group noted the effect that temperature change had on aquatic macroinvertebrates. Our data was collected from three different ponds in the Lake Harriet/Lake Calhoun vicinity. We took samples from the bird sanctuary pond, the Lake Calhoun holding pond, and the Lake Harriet duck area. Prior to our procedure, we measured the temperatures of each pond area. We used the low-temperature climate (bird sanctuary pond) to compare against the higher-temperature climates (Lake Calhoun holding pond and Lake Harriet duck area). After completing our experiment by surveying various sections of each of the three experimental sites, we gathered our information using a stream study. We surveyed each area four times for maximum reliability. After recording each sample study four times for each area, we added up the water quality rating total index counts and divided by four, generating an average for our results.…

  • Powerful Essays

    Data Processing Made Easy

    • 3213 Words
    • 13 Pages

    The edited data are classified and coded. The responses are classified into meaningful categories so as to bring out essential pattern. By this method, several hundred responses…
