Friday 9 January 2015

Part 1: Data Warehousing Concepts

What is a Data Warehouse ?

A data warehouse is a relational database that is designed for faster querying and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates analysis workload from transaction workload and enables an organization to consolidate data from several sources.

In addition to a relational database, a data warehouse environment includes an extraction, transportation, transformation, and loading (ETL) solution, an online analytical processing (OLAP) engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.

A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon:
  • Subject Oriented 
  • Integrated 
  • Nonvolatile 
  • Time Variant 

Subject Oriented

Data warehouses are designed to help you analyze data. For example, to learn more about your company's sales data, you can build a warehouse that concentrates on sales. Using this warehouse, you can answer questions like "Who was our best customer for this item last year?" This ability to define a data warehouse by subject matter, sales in this case, makes the data warehouse subject oriented.

Example: In an insurance company, there are many subject areas like premium collected from the customers, claimed given to the customers, etc. The data warehouse always concentrate on one more of subjects illustrated above.

Integrated

Integration is closely related to subject orientation. Data warehouses must put data from disparate sources into a consistent format. They must resolve such problems as naming conflicts and inconsistencies among units of measure. When they achieve this, they are said to be integrated.

Example: Consolidating the sales details of Super Market chain located in different cities to a homogeneous format.

Nonvolatile

Nonvolatile means that, once entered into the warehouse, data should not change. This is logical because the purpose of a warehouse is to enable you to analyze what has occurred.

Example: Once data is loaded to Data Warehouse for analysis, it will not change. The change happens only in the OLTP.

Time Variant

In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. A data warehouse's focus on change over time is what is meant by the term time variant. The data is analysed for specific data for specific period only. 
Example: The analysis of 2014 sales data on best performing products will not be useful in 2015 Sales unless it is used to forecast the sales.

The following example explains how the data is integrated from source systems to target systems.

Example of Source Data
Example of Target Data
In the above example, the customer application date is represented in different formats and with different attribute names across multiple source systems. But when it is integrated to the data warehouse, the data types, formats and attribute names are standardized. And analysts will be able to depend on single version of truth rather than analyzing multiple source systems.

OLTP to Data Warehouse Architecture Diagram


In the above diagram, sections 1,2 and 3 are OLTP Modules and 4, 5 and 6 are Data Warehouse and Business Intelligence Modules.

Business Modelling

This is the first step in the OLTP System development and it involves the following processes.
  • Identifying the Business Processes modules by visualizing and analyzing the business requirements. Example: Billing / Sales Module, Stock Management Module, Employee Module, etc in a Super Market.
  • Identifying the process flow between the Modules already identified. Example: how the process Sales Module is related with Stock Module.
  • Identifying the Data Flow between Modules. Example: In Sales and Stock module, quantity and price details are flown. 

Data Modelling

At this stage the data models of OLTP System and dimensional data models of Data Warehouse is developed. In most cases the data warehouse is built after the OLTP development is completed and Go-Live. This step involves identifying the physical tables for each module in OLTP and Data Warehouse. In OLTP, tables will be highly normalized. But in Data Warehousing the tables are normalized up to only 3NF.

Online Transaction Processing (OLTP)

In OLTP, transaction data of business day to day operations are stored in the database where more number of insert, update and delete operations are performed. Here the data is highly normalized and scattered in many tables. The data model for OLTP is designed in such a way to reduce storage space of the database and to minimize the update overhead. An OLTP system may store data in databases, files, XML data stores for data exchange and special systems like PeopleSoft, Siebel, SAP, etc.

ETL Process

The process starts only if the Data Warehouse dimensional modelling explained in the Data Modelling Step is completed. As I already told you, in most cases that I have seen Dimensional Modeling of Data Warehouse is performed just before the ETL process development. But there are end to end products like Oracle's Siebel Analytics, now it is called OBIEE where the OLTP, Data Warehouse and Business Intelligence or Reporting Modules are pre-built. The ETL process involves the below steps.
  • Creation of Sources and Targets. The identification of these objects is done in the Data Modelling Step.
  • Extract right data through proper Data Profiling and Cleansing Techniques.
  • Transforming the data in the desired format in order to meet the business's reporting requirements. Aggregation, Filtering, Joining, Sorting, etc are some of the transforming techniques available in the ETL Tools like Informatica, Ab-Initio, etc. There are ELT Tools as well available in the market. Oracle Data Integrator is an example for this.
  • The transformed data from the OLTP System is loaded in to the Data Warehouse Dimensional Tables by creating workflows and these Workflows can be executes based on the business requirement.
  • The Data Warehouse is built using the ETL process. Data Warehouse may have one or more Data Marts. Data Marts are subset of a Data Warehouse. There are 2 approaches of building Data Warehouse, one is Top-Down Approach and other Bottom-Up Approach. I will explain this in the upcoming posts.

Business Intelligence

The business reports are generated from the Data Warehouse which helps in Business Continuity Plans, Forecasting, Internal Reporting, etc.

In the next post, I will explain the difference between OLTP and Data Warehouse and Data Warehouse Architecture in detail. 
If my post helped you understand the topic, please like and share with your friends as well. Thanks for reading this article.

0 comments:

Post a Comment