Data Warehousing





Trends and Challenges

Every business owner dreams of having the ability to know what is happening in all aspects of his or her operation and of being able to use that information to better market--and profit by--his or her wares. Also, the business user wants to analyze his past data and improve his business strategy to maximize return on investment.

In years past, a one-owner enterprise could accomplish this function fairly well. It took a small army of
clerks and accountants collecting and entering data into vast numbers of ledgers and analysts then using the data to analyze and report trends, sales summaries, growth potential, cost containment, and so on. It was time-consuming for small and midsize enterprises, but worth the effort and expense when the statistical
reporting resulted in growth and increased profits. As businesses grew, merged, and expanded into other locations and markets throughout the world, the ability to track and maintain these minutiae became increasingly difficult. As the amount of data increased, this army of clerks was replaced with computers, but
the ability to input so much data from so many sources into some central repository for analysis and reporting became paramount. The introduction of data warehousing applications answered this need.

By definition - "Data warehousing is revolutionizing the way businesses in a wide variety of industries perform analysis and make strategic decisions."

A Data Warehouse is a repository of similar information belonging to a single entity, available for queries and analysis. Data and information are extracted, cleansed and stored from heterogeneous sources into a single database. This makes it much easier and more efficient to run queries over data that originally came from different sources.

A data warehouse is a record of an enterprise's past historical transactional and operational activities, stored in a single database. The database design favours data analysis and reporting in order to gain strategic insight and to facilitate decision making. Data warehouses are not used for current, "live"
data.

Data warehouses often hold large amounts of information of an enterprise or an organisation which are sometimes grouped into smaller logical units called data marts. And hense Data Marts are subsets of Data Warehouse.

Why Data Warehousing???

"The need to transform huge amount of raw databases into actionable information for Competitive Advantage".

The current business environment is competitive, ever changing, chaotic and global. Competitive advantage stems from well informed decisions based on an understanding of processes, products, customers preferences, competition and the company's strengths.

There are Information and Knowledge Gaps in various business areas.

There is information deficiency in various areas:

  • Sales and marketing
  • Healthcare Administration
  • Production planning
  • Quality control
  • Inventory management
  • Finance and Administration
  • Information Technology

Business Pains across all departments:

Marketing Analyst

Why is this product not selling well in this location ?
What is the current market share of my products across regions ?
Who are the customers who contribute towards 85% of my revenue ?
Why are my customers leaving ?
Who are the customers likely to leave in the near future ?
Which product segment is most profitable ?

Production Manager

What is the defect rate ?
How do I improve my production processes ?
What is the optimum amount of this item to keep in stock at a particular time ?
Why do I get different stock position reports for the same depot ?

Financial Controller

Who are my Top 10 bad debtors ?
Which customers have declining payment patterns

I.T. Manager

How should I integrate the different data sources ?
How do I insure VALIDITY and QUALITY of the data ?
How do I produce reports from diverse systems for various departments ?
How should I reduce dependency of business users on I.T. ?

 



Our Solution

click image for larger view

A warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process".

Subject Oriented:
Data that gives detailed information about a particular subject instead of about a company's ongoing operations.

Integrated:
Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole.

Time-variant:
All data in the data warehouse is identified with a particular time period.

Non-volatile:
Data is stable in a data warehouse. More data is added but data is never removed. This enables management to gain a consistent picture of the business.


A data warehouse is a central repository for all, or significant parts, of the data that an enterprise's various computer systems collect. Typically, a data warehouse is housed on an enterprise's mainframe server, but it can reside with a storage service provider. Data from various online transaction processing
(OLTP) applications and other sources are selectively extracted, cleansed and loaded into the Data Warehouse database in a organized manner. Analytical applications such as online analytical processing (OLAP) tools, data mining, statistical modeling, geographical information systems (GIS), decision support
systems (DSS), and other user queries are then applied to the repository. For example, the XYZMart nationwide chain of discount stores can establish a data warehouse that collects all the transactions from every cash register, inventory tracking, and customer service entry of every store.

A data warehouse is designed to capture data from diverse sources for useful analysis and access. It's not generally designed for an end user who may need access to specialized, sometimes local, databases. This function is filled by a data mart.

A data mart is a repository of data gathered from operational data and other sources and is designed to serve a particular community of end users for a specific purpose. A data mart doesn't always derive from a data warehouse. For example: The shoe buyers for XYZMart might collect data regarding the inventory and sales of various brands and models of shoes in every store. They might also want to include data regarding stocking activity, personnel turnover, and major athletic events in the communities for each of the store locations. Defined user queries and other analytical applications would then be applied to the data mart
database.

The design of a data mart tends to start from an analysis of user needs, and the design of a data warehouse tends to start from an analysis of what data already exists and how it can be collected in such a way that the data can later be used.

Elements of data warehousing


click image for larger view

click image for larger view


There are five major elements of data warehousing: Data Acquisition, Data Modeling and Schema, Metadata, Data Management, and Data Analysis.

Data Acquisition:
Data acquisition involves identifying, capturing, and transforming data in operational systems so the data can be loaded into a data warehouse or data mart. This phase is generally refferred as ETL (Extraction, Transformation, Loading ) process. During the ETL process, data is extracted from an OLTP database, transformed to match the data warehouse schema, and loaded into the data warehouse database. Many data warehouses also incorporate data from non-OLTP systems, such as text files, legacy systems, and spreadsheets; such data also requires extraction, transformation, and loading.

In its simplest form, ETL is the process of copying data from one database to another. This simplicity is rarely, if ever, found in data warehouse implementations; in reality, ETL is often a complex combination of process and technology that consumes a significant portion of the data warehouse development efforts and requires the skills of business analysts, database designers, and application developers.

Data acquisition is a complex, time-consuming, and costly phase of building and managing a data warehouse, but if this phase isn't correctly carried through, the data warehouse won't be effective. During data acquisition, data is extracted, transformed, transported, and loaded. These steps are generally
performed by either a single tool set or a combination of tool sets and internal processes designed to efficiently coordinate the various steps and maintain data integrity.

The extraction step is the process of identifying and retrieving a set of data from an operational system. Transformation uses application tools that allow data warehouse administrators to apply business rules for integrating data from multiple tables and source systems. The transporting step moves operational data
from the source system to the transformation server or target environment and is usually performed through a file transfer or messaging-based process. The loading step uses output preformatted to meet the requirements of the target database's load utility and a scheduling mechanism to trigger the load.

Data Modeling and Schema:
Data modeling is the analysis of data objects used in a business or other context and the identification of the relationships among these data objects. A data model consists of objects (for example, a product, a product price, or a product sale) and expressions of the relationships between each of these
objects. The total description of the objects and relationships is used to create a class or subclass that defines all the general relationships. These become templates for objects that, when executed in a program, manage the variables of new transactions and other activities in a way that effectively represents the real world. A schema, in computer programming, is the organization or structure for a database. The activity of data modeling leads to a schema.

Data modeling is probably the most labor intensive and time consuming part of the development process. Why bother especially if you are pressed for time? A common response is that you should no more build a database without a model than you should build a house without blueprints.

The goal of the data model is to make sure that the all data objects required by the database are completely and accurately represented. Because the data model uses easily understood notations and natural language, it can be reviewed and verified as correct by the end-users. Once designed and started developing data
warehouse, changes to the design would be very difficult, on short it will be a Re-Design.
A poorly designed database will require more time in the long-term. Without careful planning you may create a database that omits data required to create critical reports, produces results that are incorrect or inconsistent, and is unable to accommodate changes in the user's requirements.

Metadata:
Metadata is a definition or description of data. It's commonly referred to as "data about data." It's considered the glue that holds together all components and views of a data warehouse. For data warehouse administrators, metadata simplifies the process of managing all the processes involved in creating and
sustaining a data warehouse. For end users, metadata is the entranceway to the data warehouse and functions much like a card catalog in a library. Unfortunately, most application tools generate their own metadata, which they store in proprietary formats or storage mechanisms. Several standards-setting
bodies are currently working on some type of metadata standard to facilitate information sharing and common administration.

Following are some of the examples of Meta data:

  • Used for process-automization, analysis,administration
  • Helps user to understand, access and use the data
  • Schema definitions, ETL-information, access rights, physical location of data
  • Materialized views, query plans, statistics

The extensible markup language (XML) explains how to describe a collection of data and is sometimes referred to as metadata. In XML, meta implies underlying definition or set of rules. In other usages, meta sometimes implies description rather than definition.

Data Management:
Data management includes the access and storage mechanisms that support the data warehouse. This is usually a relational, multidimensional, or other specialized database designed to facilitate complex queries. A relational database is a collection of data items organized as a set of formally described tables from
which data can be accessed or reassembled in many different ways without having to reorganize the database tables. During the creation of a relational database, definitions are established of possible values and further constraints that may apply to each data value. The definition results in a table of metadata or
formal descriptions of the tables, columns, domains, and constraints. The standard user and application program interface to a relational database is the structured query language (SQL).

A multidimensional database stores data in a matrix-like structure (similar to spreadsheets) and is designed to provide rapid query responses to multidimensional queries. Because they have limited data storage capabilities, multidimensional databases are generally used to support application-specific
data marts. These types of databases are optimized for data warehouse and online analytical processing applications and are frequently created using input from existing relational databases.

Specialized analytical databases employ specialized index structures to speed responses to complex queries.

Data Analysis:
Data analysis applications enable end users to access and analyze data stored in data warehouses or data marts. There are many variants of data analysis software. The main types of data analysis software include data mining tools, online analytical processing (OLAP) tools, Enterprise Business Intelligence Suites, and decision support systems.

Data mining is the analysis of data for relationships that haven't previously been discovered. It enables users to identify patterns and relationships within a set of data and create models to anticipate behavior or events based on trends in the data.Data mining results can include:

  • associations--one event correlated to another event
  • sequences--one event leading to another later event
  • classification--the recognition of patterns and a resulting new organization
    of data
  • clustering--finding and visualizing groups of facts not previously known
  • forecasting--discovering patterns in the data that can lead to predictions
    about the future.



SEEinfobiz's recommended Data Warehousing Process

In order to address diverse business pains we recommend businesses to build a warehouse of all activities such as sales, purchase, quality, production, inventory, delivery,Supply Chain, human resource, finance and administration. We advise businesses to build data marts across departments

SEEinfobiz follows a rapid Data Warehousing implementation methodology which includes the following phases:

  • Assessment/Feasibility Study
  • Business Information Requirement by Data Marts and Subject areas
  • Product selection and installation
  • Data Modeling - Logical and Physical Design
  • Extraction Transformation Loading ( ETL )
  • OLAP Cube Design
  • Front End Development
  • Performance Tuning
  • Quality Assurance
  • Rolling out to Production
  • Production Maintenance
  • Incremental Enhancements

Assessment/Feasibility Study :

A data warehousing project feasibility study either refines a Project Business Case by examining the range of possible options and potential issues, or forms a basis for its development. The feasibility study addresses issues that could influence the success of a potential project and assess the advantages and disadvantages of each option so they can be ranked. It includes a cost/benefit analysis and results in the development of a Feasibility Report. In many ways, a Feasibility Report can be treated like a mini-project in its own right; its outcome being a decision on how the larger project should be handled. If it is completed as the
project is being initiated, one of its outputs could be a preliminary Project Business Case document. If it is completed after the Business Case has already been developed, it can help refine it. In either case, it should narrow the range of options, assess each of the remaining options, and propose solutions to issues raised.

- Examine the scope of the study to assess the work involved and any constraints, such as quality, cost, time and so forth. A resulting work plan should outline the study's delivery time, interim and final reports required.

- Create a plan for the study. This plan should include a milestone plan and a esponsibility chart. The milestone chart should be robust, but flexible enough to cope with emerging or unexpected issues and adequate time should be allowed for requesting and collecting information, as well as interpreting and analysing the results.

Business Information Requirement by Data Marts and Subject areas :
The first thing that the project team should engage in is gathering requirements from end users. Because end users are typically not familiar with the data warehousing process or concept, the help of the business sponsor is essential. Requirement gathering can happen as one-to-one meetings or as Joint Application
Development sessions, where multiple people are talking about the project scope in the same meeting.

The primary goal of this phase is to identify what constitutes as a success for this particular phase of the data warehouse project. In particular, end user reporting / analysis requirements are identified, and the project team will spend the remaining period of time trying to satisfy these requirements.

Associated with the identification of user requirements is a more concrete definition of other details such as hardware sizing information, training requirements, data source identification, and most importantly, a concrete project plan indicating the finishing date of the data warehousing project.

Based on the information gathered above, a disaster recovery plan needs to be developed so that the data warehousing system can recover from accidents that disable the system. Without an effective backup and restore strategy, the system will only last until the first major disaster, and, as many data warehousing
DBA's will attest, this can happen very quickly after the project goes live.

Deliverables -
- A list of Statements, static / operational reports, Analytical reports / cached cubes to be delivered to the end users by the end of this current phase.
- An updated project plan that clearly identifies resource loads and milestone delivery dates.
- Possible Pitfalls

Data Modeling:
This is a very important step in the data warehousing project. Indeed, it is fair to say that the foundation of the data warehousing system is the data model. A good data model will allow the data warehousing system to grow easily, as well as allow for good performance.

In data warehousing project, the logical data model is built based on user requirements, and then it is translated into the physical data model.

Part of the data modeling exercise is often the identification of data sources. Sometimes this step is deferred until the ETL step. Should the data not be available, this is a good time to raise the alarm. If this was delayed until the ETL phase, rectifying it will becoming a much tougher and more complex process.

Deliverables -
Identification of data sources
Logical data model.
Physical data model.

Data Modeling
Relational (OLTP) Data Modeling

Relational Data Model is a data model that views the real world as entities and relationships. Entities are concepts, real or abstract about which information is collected. Entities are associated with each other by relationship and attributes are properties of entities. Business rules would determine the relationship between each of entities in a data model.
Dimensional modeling (OLAP) is the design concept used for building data warehouse. In this model, all data is contained in two types of tables called Fact Table and Dimension Table.

Fact Table : Fact table contains the measurements or metrics or facts of business processes. If your business process is Sales, then a measurement of this business process such as "monthly sales number" is captured in the fact table. In addition to the measurements, the only other things a fact table contains are foreign keys for the dimension tables.

Dimension Table : Context of the measurements are represented in dimension tables. You can also think of the context of a measurement as the characteristics such as who, what, where, when, how of a measurement (subject ). In your business process Sales, the characteristics of the 'monthly sales number' measurement can be a Location (Where), Time (When), Product Sold (What).
The Dimension Attributes store foreign keys to these dimension tables. E.g. Location attribute may contain citycode which is in turn stored in dimension table for cities and their description. The dimension attributes may also contain one or more hierarchical relationships. E.g. City might be located in a country, state etc.
Before designing your data warehouse, you need to decide what this data warehouse contains. Say if you want to build a data warehouse containing monthly sales numbers across multiple store locations, across time and across products then your dimensions are:
Location :
Time
Product

OLAP & its Hybrids
OLAP, an acronym for Online Analytical Processing is an approach that helps organization to take advantages of DATA. Popular OLAP tools are Cognos, Business Objects, Micro Strategy etc. OLAP cubes provide the insight into data and helps the topmost executives of an organization to take decisions in an efficient manner.
Technically, OLAP cube allows one to analyze data across multiple dimensions by providing multidimensional view of aggregated, grouped data. With OLAP reports, the major categories like fiscal periods, sales region, products, employee, promotion related to the product can be ANALYZED very efficiently, effectively and responsively. OLAP applications include sales and customer analysis, budgeting, marketing analysis, production analysis, profitability analysis and forecasting etc.

ROLAP

ROLAP stands for Relational Online Analytical Process that provides multidimensional analysis of data, stored in a Relational database(RDBMS).

click image for larger view

MOLAP

MOLAP(Multidimensional OLAP), provides the analysis of data stored in a multi-dimensional data cube.

click image for larger view

HOLAP
HOLAP(Hybrid OLAP) a combination of both ROLAP and MOLAP can provide multidimensional analysis simultaneously of data stored in a multidimensional database and in a relational database(RDBMS).

click image for larger view

DOLAP
DOLAP(Desktop OLAP or Database OLAP)provide multidimensional analysis locally in the client machine on the data collected from relational or multidimensional database servers.

 

Data Modeling Development Cycle

Gathering Business Requirements - First Phase:
Data Modelers have to interact with business analysts to get the functional requirements and with end users to find out the reporting needs.

Conceptual Data Modeling(CDM) - Second Phase:
This data model includes all major entities, relationships and it will not contain much detail about attributes and is often used in the INITIAL PLANNING PHASE.

Logical Data Modeling(LDM) - Third Phase:
This is the actual implementation of a conceptual model in a logical data model. A logical data model is the version of the model that represents all of the business requirements of an organization.

Physical Data Modeling(PDM) - Fourth Phase:
This is a complete model that includes all required tables, columns, relationship, database properties for the physical implementation of the database.

Database - Fifth Phase:
DBAs instruct the data modeling tool to create SQL code from physical data model. Then the SQL code is executed in server to create databases.


Schema for representing multidimensional data

What is SnowFlake Shema ?

click image for larger view


What is Star Schema?

Star Schema is a relational database schema for representing multimensional data. It is the simplest form of data warehouse schema that contains one or more dimensions and fact tables. It is called a star schema because the entity-relationship diagram between dimensions and fact tables resembles a star where one fact table is connected to multiple dimensions. The center of the star schema consists of a large fact table and it points towards the dimension tables. The advantage of star schema are slicing down, performance increase and easy understanding of data.

click image for larger view

Steps in designing Star Schema

  • Identify a business process for analysis(like sales).
  • Identify measures or facts (sales dollar). Identify dimensions for facts(product dimension, location dimension, time dimension, organization dimension).
  • List the columns that describe each dimension.(region name, branch name, region name).
  • Determine the lowest level of summary in a fact table(sales dollar).

Important aspects of Star Schema & Snow Flake Schema

  • In a star schema every dimension will have a primary key.
  • In a star schema, a dimension table will not have any parent table.
  • Whereas in a snow flake schema, a dimension table will have one or more parent tables. Hierarchies for the dimensions are stored in the dimensional table itself in star schema.
  • Whereas hierachies are broken into separate tables in snow flake schema. These hierachies helps to drill down the data from topmost hierachies to the lowermost hierarchies.


Extraction ,Transformation, Loading:
The ETL process typically takes the longest to develop, and this can easily take up to 50% of the data warehouse implementation cycle or longer. The reason for this is that it takes time to get the source data, understand the necessary columns, understand the business rules, and understand the logical and physical
data models.

Deliverables -
Data Mapping Document
ETL process and Scripts

OLAP Cube Design:
Usually the design of the olap cubes can be derived from the Requirement Gathering phase. More often than not, however, users have some idea on what they want, but it is difficult for them to specify the exact report / analysis they want to see. When this is the case, it is usually a good idea to include enough
information so that they feel like they have gained something through the data warehouse, but not so much that it stretches the data warehouse scope by a mile. Remember that data warehousing is an iterative process - no one can ever meet all the requirements all at once.

Front End Development :
Regardless of the strength of the OLAP engine and the integrity of the data, if the users cannot visualize the reports, the data warehouse brings zero value to them. Hence front end development is an important part of a data warehousing initiative.

So what are the things to look out for in selecting a front-end deployment methodology? The most important thing is that the reports should be delivered over the web, so the only thing that the user needs is the standard browser. These days it is no longer desirable nor feasible to have the IT department doing program installations on end-users desktops just so that they can view reports. So, whatever strategy one pursues, make sure the ability to deliver over the web is a must.

Performance Tuning:
There are three major areas where a data warehousing system can use a little performance tuning:

ETL - Given that the data load is usually a very time-consuming process (and hence they are typically relegated to a nightly load job) and that data warehousing-related batch jobs are typically of lower priority, that means that the window for data loading is not very long. A data warehousing system that has
its ETL process finishing right on-time is going to have a lot of problems simply because often the jobs do not get started on-time due to factors that is beyond the control of the data warehousing team. As a result, it is always an excellent idea for the data warehousing group to tune the ETL process as much as
possible.

Query Processing - Sometimes, especially in a OLAP environment or in a system where the reports are run directly against the relationship database, query performance can be an issue. A study has shown that users typically lose interest after 30 seconds of waiting for a report to return. Our experience has
been that OLAP reports or reports that run directly against the RDBMS often exceed this time limit, and it is hence ideal for the data warehousing team to invest some time to tune the queries, especially the most popularly ones.

Report Delivery - It is also possible that end users are experiencing significant delays in receiving their reports due to factors other than the query performance. For example, network traffic, server setup, and even the way that the front-end was built sometimes play significant roles. It is important
for the data warehouse team to look into these areas for performance tuning.

Quality Assurance :
Once the development team declares that everything is ready for further testing, the QA team takes over. The QA team is always from the client side, QA team may also include members from development team to identify exact problem. Usually the QA team members will know little about data warehousing, and some of them may even resent the need to have to learn another tool or tools. This makes the QA process a tricky one

Rolling out to Production :
Once the QA team gives thumbs up, it is time for the data warehouse system to go live. Some may think this is as easy as flipping on a switch, but usually it is not true. Depending on the number of end users, it sometimes take up to a full week to bring everyone online! Fortunately, nowadays most end users access the
data warehouse over the web, making going production sometimes as easy as sending out an URL via email.

Production Maintenance :
Once the data warehouse goes production, it needs to be maintained. Tasks as such regular backup and crisis management becomes important and should be planned out. In addition, it is very important to consistently monitor end user usage.

This serves two purposes:
1. To capture any runaway requests so that they can be fixed before slowing the entire system down, and
2. To understand how much users are utilizing the data warehouse for return-on-investment calculations and future enhancement considerations.

Incremental Enhancements :
Once the data warehousing system goes live, there are often needs for incremental enhancements. These changes will not be a new data warehousing phase, but simply small changes that follow the business itself. For example, the original geographical designations may be different, the company may originally have 4 sales regions, but now because sales are going so well, now they have 10 sales regions. Also, changes in the look and feel.