The requirements of a Data Warehouse solution change continually during its lifetime. Some organizations can typically expect more changes than others; think of maturity and the history of organizational change: mergers, acquisitions, and so on.
You should adopt the solution type that yields the best value given these circumstances.
Data Warehouse Solutions
When I am confronted with a new data warehouse solution to design or redesign, I typically choose one of the following four approaches:
- One-Shot Data Warehouse
- Keeping History in Dimensions
- Historical / Persistent Staging Area
- Data Vault
In this blog post I’ll briefly dive into these approaches and discuss some of the implementation issues.
The choice for a specific approach depends on a number of circumstances, notably:
- The expected total size of the solution
- The number of source systems and how the information in them overlap
- The average expected lifetime of source systems: source system volatility
- Current user requirements
- Expected number of changes in user requirements: requirements volatility
1. One-Shot Data Warehouse (small- and mid-sized solution)
Current computing power typically makes it possible to build a complete small- or mid-sized data warehouse solution from scratch within the typical nightly maintenance window. This can be a very efficient solution with a high return on investment. Some people call it the destroy-and-rebuild approach, since you remove all previous data from the data warehouse before rebuilding it.
An obvious disadvantage of this approach is that there is no record of changes in the source system: if a customer moves from London to Paris, all historical sales will be reported as Paris sales. Another disadvantage is the sheer amount of data that is transferred every night, which can accumulate to a volume that cannot be loaded within the nightly maintenance window.
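The destroy-and-rebuild pattern can be sketched in a few lines. This is a minimal illustration using Python's built-in sqlite3 module; the `FactSales` table and its columns are hypothetical, and a real solution would rebuild all dimensions and facts, not a single table.

```python
import sqlite3

# Hypothetical warehouse with one fact table and yesterday's load still in it.
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE FactSales (ProductId TEXT, Amount REAL)")
dw.execute("INSERT INTO FactSales VALUES ('P1', 10.0)")

def rebuild_warehouse(dw, source_rows):
    # Destroy: remove all previously loaded data...
    dw.execute("DELETE FROM FactSales")
    # ...then rebuild: reload everything from the source in one shot.
    dw.executemany("INSERT INTO FactSales VALUES (?, ?)", source_rows)
    dw.commit()

rebuild_warehouse(dw, [("P1", 10.0), ("P2", 25.0)])
print(dw.execute("SELECT COUNT(*) FROM FactSales").fetchone()[0])  # 2
```

Note that the full source extract is reloaded every night, which is exactly why this approach stops scaling once the nightly window fills up.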
2. Keeping History in Dimensions
Keeping track of history has been a major issue in data warehousing. In Ralph Kimball's theories, Slowly Changing Dimensions play an important role; in his books he describes several patterns to handle history in dimensions.
Type 1: Overwrite history
For some attributes the business isn't interested in keeping historical information. This is most appropriate when correcting previous errors, or in other situations where there is no business value in keeping track of history, for example when a person's name changes due to marriage. The rows that were imported earlier into the dimension table are corrected with an UPDATE statement.
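A Type 1 change is a plain in-place UPDATE, as in this minimal sketch (the `DimCustomer` table and its columns are illustrative):

```python
import sqlite3

# Hypothetical dimension table with one customer row already loaded.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE DimCustomer (CustomerKey INTEGER PRIMARY KEY, CustomerId TEXT, Name TEXT)")
con.execute("INSERT INTO DimCustomer VALUES (1, 'C001', 'Ann Smith')")

# Type 1: overwrite the attribute in place -- the old value is lost.
con.execute("UPDATE DimCustomer SET Name = ? WHERE CustomerId = ?", ("Ann Jones", "C001"))
print(con.execute("SELECT Name FROM DimCustomer WHERE CustomerId = 'C001'").fetchone()[0])  # Ann Jones
```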
Type 2: Keep history
In this pattern history is preserved: a new record is inserted for every change in a dimension attribute. The old row is marked as inactive and given an end date. Any new fact table records inserted after the dimension row changed will link to the newly added row.
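The Type 2 pattern (end-date the current row, insert a new current row) can be sketched as follows; table and column names such as `DimCustomer`, `IsCurrent`, `StartDate`, and `EndDate` are a common but assumed convention:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE DimCustomer (
    CustomerKey INTEGER PRIMARY KEY AUTOINCREMENT,
    CustomerId TEXT, City TEXT,
    StartDate TEXT, EndDate TEXT, IsCurrent INTEGER)""")
con.execute("""INSERT INTO DimCustomer (CustomerId, City, StartDate, EndDate, IsCurrent)
               VALUES ('C001', 'London', '2020-01-01', NULL, 1)""")

def scd2_change(con, customer_id, new_city, change_date):
    # End-date the current row and mark it inactive...
    con.execute("UPDATE DimCustomer SET EndDate = ?, IsCurrent = 0 WHERE CustomerId = ? AND IsCurrent = 1",
                (change_date, customer_id))
    # ...then insert a new current row; new fact records will link to this row's key.
    con.execute("INSERT INTO DimCustomer (CustomerId, City, StartDate, EndDate, IsCurrent) VALUES (?, ?, ?, NULL, 1)",
                (customer_id, new_city, change_date))

# The customer from the earlier example moves from London to Paris.
scd2_change(con, "C001", "Paris", "2024-06-01")
rows = con.execute("SELECT City, IsCurrent FROM DimCustomer ORDER BY CustomerKey").fetchall()
print(rows)  # [('London', 0), ('Paris', 1)]
```

Because the London row survives with an end date, historical facts keep reporting as London sales, which is exactly what the one-shot approach cannot do.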
Type 1 and Type 2 are the most commonly used, but there are some alternatives; they are described in this Wikipedia article.
Besides tracking history, another major advantage is that you only need the changed and new rows (the delta rows) from your source system, which reduces the time needed to load the data into your data warehouse. However, getting only these delta rows can be challenging.
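When the source system cannot deliver deltas itself, one common workaround is to compare row hashes between the previous and the current extract. A minimal sketch, assuming each row is a dict with an `id` business key (deleted rows are ignored here for brevity):

```python
import hashlib

def row_hash(row):
    # Hash all non-key columns; a changed hash signals a changed row.
    payload = "|".join(str(row[k]) for k in sorted(row) if k != "id")
    return hashlib.sha256(payload.encode()).hexdigest()

def detect_deltas(previous, snapshot):
    prev_hashes = {r["id"]: row_hash(r) for r in previous}
    new, changed = [], []
    for row in snapshot:
        h = prev_hashes.get(row["id"])
        if h is None:
            new.append(row)          # key not seen before
        elif h != row_hash(row):
            changed.append(row)      # key seen, but attributes differ
    return new, changed

previous = [{"id": 1, "city": "London"}]
snapshot = [{"id": 1, "city": "Paris"}, {"id": 2, "city": "Berlin"}]
new, changed = detect_deltas(previous, snapshot)
```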
3. Historical / Persistent Staging Area
In this approach you typically take a copy of all relevant source tables and add temporal information in a ‘StartDate’ and an ‘EndDate’ column. Again you only process the delta rows, loading new and changed rows, and whenever a row has changed you end-date the old row.
Based upon this historical staging area you can adopt method 1 or 2 to load the data in your dimensional model.
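The end-dating mechanics of a persistent staging area can be sketched as below. This is an illustrative merge of a full source snapshot into an in-memory staging list; in practice the staging area is a database table, and row deletions (not handled here) need their own treatment.

```python
def merge_snapshot(staging, snapshot, load_date):
    """Merge a source snapshot into a persistent staging area.

    staging:  list of dicts, each with the source columns plus
              'start_date' and 'end_date' (None means the row is current).
    snapshot: list of dicts with the source columns, keyed by 'id'.
    """
    current = {r["id"]: r for r in staging if r["end_date"] is None}
    for row in snapshot:
        old = current.get(row["id"])
        # Load the row if it is new, or if any attribute changed.
        if old is None or any(old[k] != v for k, v in row.items() if k != "id"):
            if old is not None:
                old["end_date"] = load_date   # end-date the superseded row
            staging.append({**row, "start_date": load_date, "end_date": None})
    return staging

staging = []
merge_snapshot(staging, [{"id": 1, "city": "London"}], "2024-01-01")
merge_snapshot(staging, [{"id": 1, "city": "Paris"}], "2024-02-01")
# staging now holds both versions: London (end-dated) and Paris (current).
```

Because every historical version survives in staging, the dimensional model downstream can be destroyed and rebuilt at any time, with history intact.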
What are the reasons for having a persistent staging area?
In an earlier blog post I gave this answer:
Well… first of all, it could be a demand from auditors or data governance initiatives, possibly driven by external regulations (Sarbanes-Oxley, Basel I, Basel II, HIPAA, ..).
A second reason, which can be tied to the first, has to do with keeping control of, and reporting on, data quality issues in the source systems, and thus:
- Identify possible improvements in processes. (e.g. same data entered twice)
- Increase / decrease confidence in the data, information and decisions
The third reason has to do with agility: the ability to respond to changes rapidly. These can be small changes, like switching a dimension attribute from Type 1 to Type 2 or adding attributes to a dimension, but also large ones, like big organizational changes, mergers, or new ERP implementations. With a persistent staging area it is far easier to respond to such changes and rebuild the data marts (with history), resulting in quicker, better reporting afterwards.
4. Data Vault
The Data Vault is a special implementation of a persistent staging area. Source tables are split by column, separating the structural items (the business keys and the associations between the business keys) from the descriptive attributes.
The same advantages therefore apply to the Data Vault as well. The power of the Data Vault comes from integration on business keys, which makes this approach especially suitable if you have source systems with overlapping information and/or source systems that change frequently over time.
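The column split described above can be illustrated with a small sketch. A flattened source order row is divided into hubs (business keys), a link (the association between keys), and a satellite (descriptive attributes); the entity and column names are hypothetical, and real Data Vault tables also carry load dates and record sources on every structure.

```python
def to_data_vault(order_row):
    """Split one flattened source row into Data Vault structures."""
    hub_customer = {"CustomerId": order_row["customer_id"]}   # business key
    hub_order = {"OrderId": order_row["order_id"]}            # business key
    link_order_customer = {                                   # association between keys
        "OrderId": order_row["order_id"],
        "CustomerId": order_row["customer_id"],
    }
    sat_order = {                                             # descriptive attributes
        "OrderId": order_row["order_id"],
        "Amount": order_row["amount"],
        "Status": order_row["status"],
        "LoadDate": order_row["load_date"],
    }
    return hub_customer, hub_order, link_order_customer, sat_order

row = {"order_id": "O1", "customer_id": "C1",
       "amount": 99.5, "status": "shipped", "load_date": "2024-06-01"}
hub_customer, hub_order, link_order_customer, sat_order = to_data_vault(row)
```

Because hubs only hold business keys, rows from a second source system describing the same customer integrate on the same hub, which is where the overlapping-sources advantage comes from.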
This wiki article by Ronald Kunenberg gives a nice overview of the Data Vault Methodology with a lot of additional references.
In this article I discussed some aspects that you should consider when you choose an approach for (re)designing your Data Warehouse. This table summarizes these aspects:
| Aspect / Approach | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| Simple, fast solution | x | | | |
| Only process delta rows | | x | x | x |
| Keep track of historical information | | x | x | x |
| Solve auditing / data governance issues | | | x | x |
| Control over data quality in source systems | | | x | x |
| Easily adapt to changing user requirements | | | x | x |
| Source systems with overlapping information | | | | x |
| Frequently changing source systems | | | | x |