IST-722

Introduction

Agenda

1.Define data warehousing

2.Explain 4 characteristics of a data warehouse

3.Discuss the relationship between data warehouse, business intelligence and analytics.

4.Explain the 5 types of analytics

5.Demonstrate how the process of data warehousing works

6.Learn about the fathers of data warehousing

7.Cover our case studies

Starts with the OPERATIONAL Database (OLTP)

  • On-Line Transaction Processing System
  • Typically stored in a Relational Database or files.
  • Highly Normalized (Data stored as efficiently as possible, lots of tables.)
  • Optimized for processing speed and handling the “now”.
  • Designed for capturing data, not for reporting on it.
  • Designed to support the operational needs of the organization.

Issues Reporting with Transactional Databases

  • Difficult, time-consuming and error prone
    • ­Many joins, sub-selects, Due to vast number of tables.
    • ­How do you know your query is correct?
  • Resource-intensive
  • The database is not optimized for this purpose.
  • ­Multi table joins are RAM and CPU hogs
  • Impossible
  • ­Transactional systems are flushed or archived frequently to maintain performance.
  • ­You can’t query data you no longer have

Solution? The Data Warehouse

  • Designed to support an organization’s informational needs.
  • Data is re-structured conducive to reporting and analytic applications.
  • OLTP databases are data sources for the Data Warehouse.
  • Data grows over time; existing data in the warehouse never changes.

4 Characteristics of the Data Warehouse

  • Subject-Oriented
  • Non-volatile
  • Integrated
  • Time-variant

Analytics is the Technology-Driven Analysis of Data

1.Retrospective: Traditional Business Intelligence / Reporting “What Happened?” 2.Diagnostic: Analytic Dashboard / Drill-Down “Why did it happen?” 3.Descriptive: Real-time Dashboard “What is happening now?” 4.Predictive: Machine Learning / Forecasting “What is likely to happen?” 5.Prescriptive Analytics: Make a decision or Take Action “What should I do about it?”

Architectures

Agenda

  • Understand the different types of data warehouse architectures.
  • Understand the difference between technical architecture and systems architecture.
  • Explain the components essential to all technical architectures.
  • Learn how the technical architecture components integrate
  • Discuss systems architecture common to data warehousing
  • Understand key terminology related to both technical architcture and systems architecture.

Logical Architecture

How the data stores are arranged in the data warehouse and how data moves from data store to data store

Physical Infrastructure

Physical Configuration of Systems, Networks and Servers to support the Technical Architecture

Stage Data Store

  • An internal data store. It is not user-facing.
  • Stores extracts from source systems acting as a source for other systems in the data warehouse.
  • Reduces contention with source systems
  • Consolidates data from multiple sources
  • Change Detection
  • Snapshot data to a point in time.

4 Reasons To Stage Data

1.Resource Contention 2.Consolidation 3.Change Detection 4.Snapshotting

NDS: Normalized Data Store

  • An internal data store. Not user-facing.
  • Used as the Organization's source of a "single version of the truth" for other systems.
  • Subject-oriented, integrated, non-volatile and time-variant data from the OLTP sources.
  • Stored in 3rd Normal form, to reduce redundancy.
  • Use as a source for data marts and decision support systems, which use DDS.
  • Grows in size over time due to historical data.

ODS: Operational Data Store

  • A hybrid data store. Parts are internal, parts are user-facing.
  • Integrated, detailed, volatile, and current data from source systems.
  • Key differences:
    • Volatile – data are updated and removed to reflect current.
    • Consolidated from disparate sources.
  • Does not grow over time. References a point in time, which is typically "now".
  • Structured differently than NDS or DDS and therefore should be stored as a separate DBMS.

DDS: Dimensional Data Store

  • A user-facing data store.
  • Subject-oriented, integrated, non-volatile, and time-variant data from source systems.
  • Stored in dimensional format to support ad-hoc analytical query by end users and decision support systems.
  • RDBMS Star Schema
  • MDBMS Cube
  • Grows in size over time due to historical data.
  • Data are consolidated and de-normalized. So no single version of the truth, but its easier for business users to query.

DDS: ROLAP / Star Schema / Data Mart

  • When the DDS is implemented in a Relational DBMS it is called ROLAP.
  • Relational Online analytical processing (ROLAP)
  • The schema is a star schema because of the consistent M-1 structure between Fact and Dimension tables
  • A Single Star Schema is known as a Data Mart.

MDS: MOLAP / Cube

  • When the DDS is implemented in a Multi-Dimensional DBMS it is called MOLAP.
  • Multi-Dimensional online analytical processing (MOLAP)
  • Facts are pre-aggregated across all dimensions for improved performance.
  • Supports Semantic Metadata

Metadata

  • Metadata means "Data about the data." It is an essential part of the data warehouse technical architecture.
  • Metadata is internal
  • 3 Types:
    • Technical Metadata – Infrastructure oriented. Indexes, table partitions, data types, data transformations. Security.
    • Business Metadata – User oriented. Data structure definitions, Data dictionaries, implicit data hierarchies, data quality screens.
    • Process Metadata – System oriented. Performance metrics and measurements. Auditing the ETL Processes.

Common Technical Architectures

1.Independent Data Marts 2.Centralized 3.Enterprise Bus Architecture - With ODS (ODS + DDS) 4.Hub And Spoke - With ODS 5.Federated With ETL - Federate with EII

System Architectures

1.SMP Symmetric Multi-Processing 2.MPP Massively Parallel Processing 3.Hadoop Map-Reduce/HDFS

  • General purpose distributed batch processing framework.
  • Fault Tolerant.
  • Runs on affordable commodity hardware.
  • Processing and Data are decoupled and distributed over the network
  • Slower Query execution than MPP.
  • Vendors: IBM, Cloudera, Hortonworks, MapR

Building The Data Warehouse

Agenda

  • Data Warehouse development methodologies
  • The Kimball Lifecycle for building a data warehouse
  • Requirements gathering
  • Creating dimensional models from functional requirements
  • Walk through the high level dimensional modeling worksheet.

Understanding The Business

  • Event – a frequent activity within the business. Also known as a business process. Helps you to identify business transactions.
    • Examples: include a purchase order; student registers for a course.
  • Status - The condition of an object at a point in time. Helps you to identify workflows.
    • Examples: order was packaged but now has shipped; student has registered and now receives a grade.
  • Level - is a quantitative measurement of an object at a point in time. Helps you to identify periodic snapshots.
    • Examples: credit card balance; student GPA
  • Roles – The who, what and when of the event, status or level. Helps you to identify dimensions.

Functional Requirements

  • Define what the system does or should do.
  • These requirements address needs of business users.
  • Examples:
    • Business users must be able to analyze sales of product over time and by geographic region, customer segment or sales territory.
    • Business users must be able to view their finance data within their own department such as revenues and expenses by fiscal period broken down into account codes.

Non-Functional Requirements

  • Guide and constrain the system architecture.
  • These requirements outline procedures, rules or regulations.
  • Examples:
    • The maximum query response time should be no longer than 20 seconds.
    • Extracts from the Accounting system can only take place on Sundays between 5 and 6 AM.

Dimensional Modeling

Agenda

  • Dimensional Model Designs in Detail
  • Slowly Changing Dimensions
  • Rapidly Changing Dimensions
  • Advanced Dimensional Modeling Concepts
  • Walkthrough: Detailed Dimensional Modeling Worksheet

Slowly Changing Dimensions

Dimensional data changes infrequently but when it does you need a strategy for addressing the change.

­Ex: What happens when a customer has a new address, or an Employee has a name change?

3 Popular strategies

  • Type 1: Overwrite the existing attribute
  • Type 2: Add a new Dimension row
  • Type 3: Add a new Dimension attribute

These strategies are not mutually exclusive, and can be combined within a single dimension.

Conformed Dimensions

  • These are master or common reference dimensions and a key part of the Enterprise Bus technical architecture.

  • Shared across business processes (fact tables) in the DW.

  • Reusable, can be used for drill-across, where you combine facts across a common dimension

  • Lower time to develop next star schema.

  • Contain a super-set of attributes required by all fact tables.

Two types of Conformed Dimensions:

­Identical Dimensions – exactly the same dimensions (Ex. Dates)

­Perfect Subset of an existing dimension.

Dimensional Model Development

Agenda:

  • The Physical Design Process and Environments
  • Developing Standards
  • Physical Design Guildelines
  • Demo: Star Schema Generation from the Detailed Worksheet
  • Demo: Validating your star schema with data.

Partitioning

#2 way to improve performance.

  • Most data in fact tables are inserted in chronological order.

  • It makes sense to physically organize the fact table by date.

Index #3 way to improve performance

  • Index improves search of a table by creating an internal structure with keys build from one or more columns. You can have multiple indexes per table.
  • Clustered index is an index on the table itself. It determines the order the data is written to the table, thus there can only be one.

Introduction to ETL

Agenda

  • Learn about ETL Approaches and Architectures
  • Discuss Common Subsystems of ETL
  • Describe Data Extraction and Staging Techniques
  • Explain Common and Advanced ETL Patterns

ETL Development

Agenda

  • Explain the ETL Development Process
  • Demonstrate an ETL Tool: SQL Server Integration Services
  • Walk Through a Demonstration:
  • Fudgemart Employee Timesheets
  • Source to stage
  • Loading Dimensions
  • Loading the Fact Table
  • Putting it all Together

Managing The Data Warehouse

Agenda:

  • Understand the importance of Data Governance when it comes to a data warehousing initiative
  • Explain Master Data Management
  • Define Data Quality and Discuss its importance in the data warehouse
  • Explain the role of metadata in the data warehouse
  • Discuss ways to secure data in the data warehouse

Business Intelligence

Agenda:

  • Define business intelligence

  • Explain the 6 categories of BI

  • Discuss how we measure BI

  • Describe the process of developing BI

  • Demonstrate how to build OLAP cubes in a multi-dimensional database.

Big Data And The Data Warehouse

Agenda

  • Understanding Growth Of Data In the Enterprise

  • Defining Scalability

  • Define Big Data

  • Explain how Hadoop, MapReduce and HDFS Work

  • Discuss YARN Applications used in Data Warehousing