March 14, 2011
Kudler Dimensional Model Hands-On Project
Kudler is looking for ways to increase sales and customer satisfaction. To achieve this goal, Kudler will use data mining tools to predict future trends and behaviors, allowing it to make proactive, knowledge-driven decisions. Kudler’s marketing director has access to information about all of its customers: their age, ethnicity, demographics, and shopping habits. The starting point will be a data warehouse containing internal data tracking all customer contacts, coupled with external market data about competitor activity. Background information on potential customers also provides an excellent basis for prospecting.
ETL processes that populate the dimensional model database
The ETL process will extract data from the source systems and load it into the data warehouse. Data extracted from the OLTP database is transformed to match the data warehouse schema and loaded into the data warehouse database, along with other non-OLTP data such as text files, legacy systems, and spreadsheets. ETL runs update the warehouse as required (monthly, daily, or hourly). ETL operations are performed on a relational database server separate from both the source databases and the data warehouse, minimizing the impact of the intense periodic ETL activity on those systems.
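The extract, transform, and load steps described above can be sketched in Python. This is a minimal illustration, not Kudler's actual implementation: the CSV source format, the column names (`name`, `amount`), and the `fact_sales` target table are all assumptions made for the example.

```python
import csv
import sqlite3

def extract(csv_path):
    """Extract: read raw rows from a source system export (here, a CSV file)."""
    with open(csv_path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: reshape and clean source rows to match the warehouse schema."""
    out = []
    for row in rows:
        out.append({
            "customer_name": row["name"].strip().title(),
            "sales_amount": round(float(row["amount"]), 2),
        })
    return out

def load(rows, conn):
    """Load: write the transformed rows into the warehouse fact table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_sales (customer_name TEXT, sales_amount REAL)"
    )
    conn.executemany(
        "INSERT INTO fact_sales VALUES (:customer_name, :sales_amount)", rows
    )
    conn.commit()
```

In practice each step would run on the dedicated ETL server, so the intensive transform work never competes with the source or warehouse databases.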
To extract the data, each source’s distinct set of characteristics will need to be managed and integrated into the ETL system. The logical data mapping document has proven to be the critical element required to plan the ETL process. Its primary purpose is to provide the ETL developer with a clear-cut blueprint of exactly what is expected from the ETL process.
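A logical data mapping can be represented directly in code, pairing each target column with its source column and transformation rule. The column names below (`cust_nm`, `city_nm`, and their targets) are hypothetical examples, not Kudler's actual schema.

```python
# One mapping entry per target attribute:
# (target_column, source_column, transform_fn)
CUSTOMER_MAP = [
    ("customer_name", "cust_nm", lambda v: v.strip().title()),
    ("customer_city", "city_nm", lambda v: v.strip().upper()),
]

def apply_mapping(source_row, mapping):
    """Build a target row by applying each mapping rule in turn."""
    return {target: fn(source_row[source]) for target, source, fn in mapping}
```

Driving the transform from the mapping document keeps the ETL developer's blueprint and the running code in one place.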
In the data discovery phase, the key criterion for the success of the data warehouse is the cleanliness and cohesion of the data within it. It is the responsibility of the ETL team to drill down further into the data requirements to determine every source system, table, and attribute required to load the data warehouse. The ETL team will have the following responsibilities: collecting and documenting source systems, keeping track of those systems, and determining the system of record, the point of origin of the data. Defining the system of record is important because data is stored redundantly across many systems. When collecting the data, the team will verify that there are no varying versions of the same data. Understanding the content of the data is crucial for determining the best approach for retrieval.
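The check for varying versions of the same data can be automated: group incoming rows by their natural key and flag any key whose attribute values disagree across sources. This is a sketch under the assumption that rows are plain dictionaries sharing a key field.

```python
from collections import defaultdict

def find_conflicting_versions(rows, key_field):
    """Return the natural keys whose attribute values disagree across rows,
    i.e. varying versions of the same data from redundant source systems."""
    by_key = defaultdict(set)
    for row in rows:
        # Record each distinct combination of non-key attributes per key.
        version = tuple(sorted((k, v) for k, v in row.items() if k != key_field))
        by_key[row[key_field]].add(version)
    return [key for key, versions in by_key.items() if len(versions) > 1]
```

Any key this function reports must be resolved against the system of record before loading.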
Transformation is the main step where the ETL process adds value, changing the data so that it can be used for its intended purposes. The data quality paradigm checks for correct, unambiguous, consistent, and complete data. Quality checks are run at two points: after extraction and after cleaning and conforming. Structure enforcement verifies that tables have proper primary and foreign keys and obey referential integrity.
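The quality checks above can be expressed as a simple validation routine run after extraction and again after cleaning. The column names (`customer_id`, `store_id`, `amount`) are illustrative assumptions, not Kudler's actual schema.

```python
def quality_check(row, valid_store_ids):
    """Return a list of data-quality errors for one row: completeness,
    correctness, and referential integrity against a dimension's keys."""
    errors = []
    # Completeness: required attributes must be present and non-empty.
    for col in ("customer_id", "store_id", "amount"):
        if not row.get(col):
            errors.append(f"missing {col}")
    # Correctness: the amount must parse as a non-negative number.
    try:
        if float(row.get("amount", "")) < 0:
            errors.append("negative amount")
    except ValueError:
        errors.append("non-numeric amount")
    # Referential integrity: the foreign key must resolve to a known dimension row.
    if row.get("store_id") and row["store_id"] not in valid_store_ids:
        errors.append("unknown store_id")
    return errors
```

Rows that fail are routed to a reject file for review rather than loaded, so the warehouse stays clean and cohesive.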
The data loading module consists of all the steps required to administer slowly changing dimensions (SCD) and write the dimension to disk as a physical table in the proper dimensional format with correct primary keys, correct natural keys, and final descriptive attributes.
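One common way to administer a slowly changing dimension is the Type 2 technique: when a tracked attribute changes, the current row is expired and a new versioned row is appended with a fresh surrogate key. This is a minimal in-memory sketch; the field names and the list-of-dicts representation are assumptions for illustration.

```python
from datetime import date

def apply_scd2(dimension, incoming, today):
    """Type 2 slowly changing dimension update: expire the current row
    and append a new versioned row when the attributes change."""
    for row in dimension:
        if row["natural_key"] == incoming["natural_key"] and row["current"]:
            if row["attributes"] == incoming["attributes"]:
                return dimension  # no change; nothing to do
            row["current"] = False      # expire the old version
            row["valid_to"] = today
    new_key = max((r["surrogate_key"] for r in dimension), default=0) + 1
    dimension.append({
        "surrogate_key": new_key,
        "natural_key": incoming["natural_key"],
        "attributes": incoming["attributes"],
        "valid_from": today,
        "valid_to": None,
        "current": True,
    })
    return dimension
```

Because each version keeps its own surrogate key, facts loaded before the change continue to join to the historical attribute values.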
Loading Dimensions: dimension tables are physically built with a minimal set of components. The primary key is a single field containing a meaningless unique integer – a surrogate key; the data warehouse owns these keys and never allows any other entity to assign them. Dimensions are de-normalized flat tables – all attributes in a dimension must...
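The surrogate-key rule above – the warehouse owns the keys and no other entity assigns them – can be sketched as a small key generator that maps each source-system natural key to a warehouse-assigned integer. The class and key names are hypothetical, for illustration only.

```python
class DimensionKeyAssigner:
    """Assigns warehouse-owned surrogate keys; source systems supply only
    natural keys and never dictate the surrogate values."""

    def __init__(self):
        self._next_key = 1
        self._by_natural_key = {}

    def get_surrogate_key(self, natural_key):
        # First time this natural key is seen: hand out the next integer.
        if natural_key not in self._by_natural_key:
            self._by_natural_key[natural_key] = self._next_key
            self._next_key += 1
        return self._by_natural_key[natural_key]
```

Keeping the assignment inside the warehouse means source-system key changes or collisions never ripple into the fact tables.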