Rapidminer

Use K-Means for Clustering
1. Dataset

For this tutorial, we will work on unlabeled data from the US Census Bureau. The following introduction to the dataset will help you understand its attributes and interpret the results.

The attributes of the raw data have been discretized so that each has fewer distinct values; this discretized version is the data we are working with. Descriptions of the raw attributes are at: http://archive.ics.uci.edu/ml/databases/census1990/USCensus1990raw.attributes.txt

Some attributes are carried over unchanged from the raw dataset; their names in the current dataset start with an “i”. Attributes that were discretized from the raw dataset have a “d” added in front of their original names. For example, in the current dataset, “dAge” is discretized from the raw dataset, and its description is listed under “AAGE” in the raw data description (Age); “iAvail” means the attribute values are unchanged from the raw values, and the corresponding attribute is “AVAIL” in the raw data description (Available for work). For more information, the mapping functions from raw attributes to current attributes can be found here: http://archive.ics.uci.edu/ml/databases/census1990/USCensus1990.mapping.sql

The file used in this tutorial is an abbreviated version of the dataset, containing the first 10,000 instances out of 2,458,285.

[Note: If your computer has limited memory, the clustering process below may run very slowly. In that case, use the file UScensus_3000.xlsx for this lab. It has only 3,000 instances; the results may be less interesting than with the larger file, but it takes much less memory than the set with 10,000 instances.]

Start RapidMiner, use the Read Excel operator to load UScensus_10000.xlsx, set the role of the “case ID” attribute to id, and then store the dataset in your repository (recall tutorial 2 on importing and storing data). Please note the dataset is a little bigger than those we have worked on,
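
For readers who want to see what k-means does with data prepared this way, here is a minimal sketch in plain Python. It is not RapidMiner's implementation: it is the standard Lloyd's iteration (assign each row to the nearest centroid, then recompute centroids), written under a few assumptions. The function name `kmeans`, its parameters, and the four toy rows are hypothetical; the attribute names “dAge” and “iAvail” come from the dataset description above, but the values are made up purely to form two obvious groups. The “case ID” column is left out of the distance calculation, which mirrors the idea of giving it the id role.

```python
import random


def kmeans(rows, k, id_key="case ID", max_iter=100, seed=0):
    """Cluster rows (dicts) with Lloyd's k-means; the id column is not a feature."""
    feature_keys = sorted(key for key in rows[0] if key != id_key)
    points = [[float(row[key]) for key in feature_keys] for row in rows]

    # Assumption: initial centroids are k rows picked at random.
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]

    assignment = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_assignment = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        if new_assignment == assignment:
            break  # no row changed cluster, so the result is stable
        assignment = new_assignment

        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]

    return {row[id_key]: cluster for row, cluster in zip(rows, assignment)}


# Made-up toy rows (not real census values): two clearly separated groups.
rows = [
    {"case ID": 1, "dAge": 0, "iAvail": 0},
    {"case ID": 2, "dAge": 1, "iAvail": 1},
    {"case ID": 3, "dAge": 10, "iAvail": 10},
    {"case ID": 4, "dAge": 11, "iAvail": 11},
]

labels = kmeans(rows, k=2)
print(labels)  # maps each case ID to a cluster index
```

The cluster indices themselves are arbitrary and depend on the random start; what matters is which case IDs end up sharing an index. On the real 10,000-row file, each pass compares every row against every centroid, which is one reason the note above suggests the 3,000-row file on machines with limited memory.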
