Truncate all Tables


There are many reasons for me to blog. One of the less vain ones is to recover and reuse code I wrote earlier. In my current project we have a need to test, test and retest the initial load of the data warehouse.

For this purpose I wrote this T-SQL script that:

  • Drops all foreign key constraints;
  • Truncates all the tables;
  • Recreates all foreign key constraints that were dropped earlier.
/*
Truncate all tables in a database, by:
- dropping all foreign key constraints
- truncating all the tables
- recreating the dropped foreign key constraints
*/
DECLARE @DropConstraints nvarchar(max)
DECLARE @TruncateTables nvarchar(max)
DECLARE @RecreateConstraints nvarchar(max)
SELECT
-- DROP CONSTRAINTS
@DropConstraints = STUFF(
(
SELECT char(10) + 'ALTER TABLE '
+ QUOTENAME(OBJECT_SCHEMA_NAME(f.parent_object_id))
+ '.' + QUOTENAME(OBJECT_NAME(f.parent_object_id))
+ ' DROP CONSTRAINT ' + QUOTENAME(f.name)
FROM sys.foreign_keys f
FOR XML PATH('')
)
,1,1,'')
-- TRUNCATE TABLES
, @TruncateTables = STUFF(
(
SELECT char(10) + 'TRUNCATE TABLE '
+ QUOTENAME(OBJECT_SCHEMA_NAME(object_id))
+ '.' + QUOTENAME(OBJECT_NAME(object_id))
FROM sys.tables FOR XML PATH('')
)
,1,1,'')
-- RECREATE CONSTRAINTS
, @RecreateConstraints = STUFF(
(
SELECT char(10) + 'ALTER TABLE '
+ QUOTENAME(OBJECT_SCHEMA_NAME(f.parent_object_id))
+ '.' + QUOTENAME(OBJECT_NAME(f.parent_object_id))
+ ' ADD CONSTRAINT ' + QUOTENAME(f.name)
+ ' FOREIGN KEY ( ' + COL_NAME(fc.parent_object_id, fc.parent_column_id) + ' )'
+ ' REFERENCES ' + QUOTENAME(OBJECT_SCHEMA_NAME(f.referenced_object_id))
+ '.' + QUOTENAME(OBJECT_NAME(f.referenced_object_id))
+ ' ( ' + COL_NAME(fc.referenced_object_id, fc.referenced_column_id) + ' )'
FROM sys.foreign_keys f
JOIN sys.foreign_key_columns fc
ON f.object_id = fc.constraint_object_id FOR XML PATH('')
)
,1,1,'')
PRINT @DropConstraints
PRINT '--------------------------------------------------'
PRINT @TruncateTables
PRINT '--------------------------------------------------'
PRINT @RecreateConstraints
EXEC sp_executesql @DropConstraints;
EXEC sp_executesql @TruncateTables;
EXEC sp_executesql @RecreateConstraints

As an alternative, based on a comment by Roland Bouman, you can also use:

EXEC sp_msforeachtable 'ALTER TABLE ? NOCHECK CONSTRAINT all'
EXEC sp_msforeachtable 'TRUNCATE TABLE ?'
EXEC sp_msforeachtable 'ALTER TABLE ? WITH CHECK CHECK CONSTRAINT all'
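Note that TRUNCATE TABLE is refused on any table that is referenced by a foreign key, even when that constraint is disabled, so this shorter variant only works if no such references exist (the drop/recreate script above does not have this limitation). A DELETE-based sketch that does work with merely disabled constraints:

-- Slower than TRUNCATE and it does not reset identity columns,
-- but DELETE is allowed while the foreign keys are merely disabled.
EXEC sp_msforeachtable 'ALTER TABLE ? NOCHECK CONSTRAINT all'
EXEC sp_msforeachtable 'DELETE FROM ?'
EXEC sp_msforeachtable 'ALTER TABLE ? WITH CHECK CHECK CONSTRAINT all'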

Viewing or Saving the Compiled Biml

Some time ago a Biml enthusiast asked me for the location of the temporary files that BIDSHelper creates when you generate the SSIS packages in BIDSHelper. I couldn't help him and explained that the Biml engine first compiles the Biml documents, applying any BimlScript, in memory and then creates the SSIS packages. The intermediate step isn't persisted to disk.

Debugging Biml Files Made Easier

Obviously he needed this intermediate result to better debug his Biml files and improve his development efforts. Recently I learned a simple trick to create the intermediate results he wanted, and I'd like to share it with you in this blog post:

  • Add a new Biml File to your project and name it SaveCompiledBiml.biml
  • Replace the contents of this file with the next code block
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
<#
string xBiml = RootNode.GetBiml();
XmlDocument xmlDoc = new XmlDocument();
xmlDoc.LoadXml(xBiml);
string sFile = @"C:\temp\myBimlFile.xml";
xmlDoc.Save(sFile);
#>
</Biml>
<!-- Directives: -->
<#@ template tier="5" #>
<#@ import namespace="Varigence.Hadron.CoreLowerer.SchemaManagement" #>
<#@ import namespace="System.Xml" #>
<!-- Biml code to save the compiled Biml to file: SaveCompiledBiml.biml -->

Some remarks:

  • Change the directory and filename of the sFile variable to match your environment / wishes.
  • In this example I have used tier 5. If you have files with a higher tier, change the template tier directive in the SaveCompiledBiml file. (Biml files are compiled in the order of their "tier": the files with the lowest tier first and those with the highest last. The default tier is 0 for files without BimlScript and 1 for files with BimlScript.)

Do you want to improve the quality and speed of your ETL-development?

Biml is quickly gaining popularity as the way to automate the creation of SSIS packages, saving time and money while improving quality. Are you interested in automating SSIS packages with Biml? Consider my Biml Workshop to get a head start.

Building a data warehouse while the source systems are still in development

Some years ago the client of a contractor I worked for made some major investments in their IT landscape. They replaced their ERP and CRM systems, and of course BI/reporting was part of the acceptance criteria for the new system. It was a big project and a disaster! ERP, CRM and BI consultants from the contractor were running in and out, discussing requirement details with the client and adapting the systems to these wishes. For the BI team it was especially hard: when we built something on Monday, chances were slim that it still worked on Thursday. We depended upon the ERP and CRM teams to communicate their changes and provide us with correct test data, and there was no love wasted between the teams. I was very glad when the opportunity arose to leave this war zone and move on. And I did.

Trending

Nevertheless it seems to be becoming a trend to build/adapt the BI system before a new source system has been launched, and to make it part of the acceptance criteria for the new system. This of course offers the client an easy way to verify the new system by comparing the reports they work with. In my previous three projects (part of) the source system was still in development, and I would probably have gone crazy keeping up with all the changes if I hadn't been able to automate the work.

Agility is the name of the game


In these situations I have found that you (as in an agility drill) need three things:

  • Light feet:
    An alert/reporting system that gives you insight into the changes that are made in the source systems. I typically import metadata of the source systems' tables, views and columns into the database on a daily basis (a minimal sketch follows this list).
  • Run on balls of feet:
    A data warehouse automation framework that automates the creation of the data warehouse objects (tables, views, primary keys, foreign key relations and indexes) and the ETL procedures (stored procedures and SSIS packages) to load the data into the data warehouse.
    I have written extensively about this earlier.
  • Use arms (to keep your balance):
    A testing framework that daily checks and alerts you about the results, compared to the results the business users expect.
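To give an idea of the first point, here is a minimal sketch of such a daily metadata snapshot; all object names (the meta schema, the SourceDB database) are illustrative assumptions, not part of any particular framework:

-- Hypothetical daily snapshot of the source system's metadata.
-- Comparing today's snapshot with yesterday's reveals added or removed columns.
CREATE TABLE meta.SourceColumns (
  SnapshotDate date          NOT NULL DEFAULT (CAST(GETDATE() AS date))
, TableSchema  sysname       NOT NULL
, TableName    sysname       NOT NULL
, ColumnName   sysname       NOT NULL
, DataType     nvarchar(128) NOT NULL
, MaxLength    int           NULL
)

-- Daily load (SourceDB is a placeholder for the source system's database)
INSERT INTO meta.SourceColumns (TableSchema, TableName, ColumnName, DataType, MaxLength)
SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, DATA_TYPE, CHARACTER_MAXIMUM_LENGTH
FROM SourceDB.INFORMATION_SCHEMA.COLUMNS

-- Alert on columns that appeared since yesterday (a similar query finds dropped columns)
SELECT cur.TableSchema, cur.TableName, cur.ColumnName
FROM meta.SourceColumns cur
WHERE cur.SnapshotDate = CAST(GETDATE() AS date)
AND NOT EXISTS (SELECT 1 FROM meta.SourceColumns prv
                WHERE prv.SnapshotDate = DATEADD(day, -1, cur.SnapshotDate)
                AND prv.TableSchema = cur.TableSchema
                AND prv.TableName = cur.TableName
                AND prv.ColumnName = cur.ColumnName)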

Testing Framework

A testing framework has a complicated ring to it. But if you keep it simple and focus on the results the users expect to see it will be easy and the advantages are great:

  • You’ll notice any errors early in the process.
  • You can correct the errors before the users start testing.
  • You can pinpoint any changes made in the source system that ruin your reports.
  • You'll gain confidence in your solution and sleep better.

I’ll describe a method that has worked for me. Key to success herein is starting early!

First get hard copies and digital versions of the reports the users are using now. Ask them to highlight the important measures. Start entering the tests in an Excel spreadsheet and use the digital versions of the reports to copy any relevant data. I use the following format, where I keep the values in the testName column unique:

testName            | testType | criterion1 | criterion2 | expectedValue | expression
InternetSales201401 | DWH      | 201404     | Internet   | 125.035       |

Now the difficult/laborious part: start entering SQL statements in the expression column that will return the expected value. Use [criterion1] and/or [criterion2] as placeholders in these statements. The values in the criterion1 and criterion2 columns will replace these placeholders at execution time. You will now be able to copy the expression to similar tests with different values for the criteria. Example expression:
SELECT CAST(SUM(f.sales) AS int)
FROM factSales f
INNER JOIN dimDate d
ON f.DateKey = d.DateKey
INNER JOIN dimChannel ch
ON f.ChannelKey = ch.ChannelKey
WHERE d.YearMonth = [criterion1]
AND ch.ChannelName = '[criterion2]'
/* SQL code of example test expression */

Testing Framework: Tables

Import your Excel spreadsheet into a SQL Server table with the following definition:
CREATE TABLE DataTests (
testName nvarchar(64)
, testType nvarchar(20)
, criterion1 nvarchar(128)
, criterion2 nvarchar(128)
, expectedValue int
, expression nvarchar(1024)
)
/* SQL code to create the DataTests table */
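For reference, the spreadsheet row shown earlier would be loaded like this; I read the expected value 125.035 as 125,035 (the column is an int) and reuse the example expression from above, so treat the values as illustrative:

-- Illustrative load of the example test; note the doubled quotes around [criterion2]
INSERT INTO DataTests (testName, testType, criterion1, criterion2, expectedValue, expression)
VALUES ('InternetSales201401', 'DWH', '201404', 'Internet', 125035,
  'SELECT CAST(SUM(f.sales) AS int)
   FROM factSales f
   INNER JOIN dimDate d ON f.DateKey = d.DateKey
   INNER JOIN dimChannel ch ON f.ChannelKey = ch.ChannelKey
   WHERE d.YearMonth = [criterion1] AND ch.ChannelName = ''[criterion2]''')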

Create an additional table to store the result of the tests that were performed. Use the following definition:
CREATE TABLE DataTestResults (
testName nvarchar(64)
, testDate datetime default getdate()
, actualValue int
, expectedValue int
)
/* SQL code to create the DataTestResults table */
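Once the package below has run a few times, a simple query over DataTestResults flags the tests that need attention; a sketch:

-- Most recent result per test, with a pass/fail flag
SELECT r.testName
  , r.testDate
  , r.actualValue
  , r.expectedValue
  , status = CASE WHEN r.actualValue = r.expectedValue THEN 'OK' ELSE 'FAILED' END
FROM DataTestResults r
WHERE r.testDate = (SELECT MAX(r2.testDate)
                    FROM DataTestResults r2
                    WHERE r2.testName = r.testName)
ORDER BY status, r.testName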

Testing Framework: SSIS Package

Now create an SSIS Package that will perform the tests. It uses:

  • An Execute SQL Task to get a list of tests
  • A For Each Loop Container that is used to loop through the list, containing:
  • An Execute SQL Task to execute every test
  • An Execute SQL Task to log the result of every test
(screenshot: control flow of the test package)

This package can be described with the following BIML (Business Intelligence Markup Language) snippet:

<Packages>
<Package Name="PKG TestData" ConstraintMode="Linear">
<Variables>
<Variable Name="TestList" DataType="Object" />
<Variable Name="testQuery" DataType="String" />
<Variable Name="testName" DataType="String" />
<Variable Name="expectedValue" DataType="Int32"> 0 </Variable>
<Variable Name="actualValue" DataType="Int32"> 0 </Variable>
</Variables>
<Tasks>
<!--Get list with tests to be performed-->
<ExecuteSQL Name="SQL GetTestList" ConnectionName="META" ResultSet="Full">
<DirectInput>
SELECT testName
, testQuery = REPLACE(REPLACE(expression,'[criterion1]',criterion1),'[criterion2]',criterion2) , expectedValue
FROM DataTests
WHERE testType='DWH'
</DirectInput>
<Results>
<Result Name="0" VariableName="User.TestList" />
</Results>
</ExecuteSQL>
<!--Loop through tests and perform them-->
<ForEachAdoLoop Name="FELC Test" SourceVariableName="User.TestList" ConstraintMode="Linear" >
<VariableMappings>
<VariableMapping Name="0" VariableName="User.testName" />
<VariableMapping Name="1" VariableName="User.testQuery" />
<VariableMapping Name="2" VariableName="User.expectedValue" />
</VariableMappings>
<Tasks>
<!--Perform test-->
<ExecuteSQL Name="SQL Execute Test" ConnectionName="DWH" ResultSet="SingleRow" >
<VariableInput VariableName="User.testQuery" />
<Results>
<Result Name="0" VariableName="User.actualValue" />
</Results>
</ExecuteSQL>
<!--Log test result-->
<ExecuteSQL Name="SQL Log Test Result" ConnectionName="META" ResultSet="None">
<DirectInput>
INSERT INTO DataTestResults (testName,actualValue,expectedValue) VALUES (?,?,?)
</DirectInput>
<Parameters>
<Parameter Name="0" Direction="Input" DataType="String" VariableName="User.testName" />
<Parameter Name="1" Direction="Input" DataType="Int32" VariableName="User.actualValue" />
<Parameter Name="2" Direction="Input" DataType="Int32" VariableName="User.expectedValue" />
</Parameters>
</ExecuteSQL>
</Tasks>
</ForEachAdoLoop>
</Tasks>
</Package>
</Packages>
<!-- Biml code to create the testing SSIS package -->

Conclusion

In this blog post I discussed the components of an agile system that are necessary when you build a data warehouse where the source systems are still in development:

  • An Alert System to inform you of changes in the source systems.
  • A Data Warehouse Automation Framework.
  • A Testing Framework.

I ended with a simple implementation of a testing framework that worked for me. Does this help you, or did you implement something similar? … I'd love to hear from you!

Making your Biml files less complex

The combination of XML and C# code in a Biml file can make Biml files very complex. In web development Microsoft overcame this problem by introducing the code-behind model with one file for the static (html) text and a separate file for the dynamic code. Unfortunately this is not possible with Biml. There are however some ways to make your files less complex:

  1. Separate files
  2. Move code logic to T-SQL

In this post I’ll briefly describe the first option and use some examples to explain the second option.


Separate files

You can use the include directive to insert a piece of static XML from another file into your Biml file. John Minkjan has a nice example on his website.
If you need more dynamic content you can also opt for the CallBimlScript function, which allows you to pass properties. Of course you will have to handle these properties in the callee file.


Move some code logic to T-SQL

A typical Biml solution consists not only of Biml files but also of metadata that describes the packages you want to build. I typically use a table mapping table and a column mapping table in a SQL Server database. This allows me to create a stored procedure that combines information from both tables into one dataset, with all relevant information in one row.

For the next examples I will use the following information in this source to target mapping table:

(screenshot: source-to-target mapping table)
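The mapping tables themselves are not shown in this post. A minimal sketch of definitions that would support the statements below; the column names are inferred from the stored procedure further down, everything else (keys, lengths) is an assumption:

-- Hypothetical definitions for the metadata tables used in the examples below;
-- the column names match the ones referenced in meta.getPackageDetails.
CREATE TABLE meta.TableMapping (
  TableMappingID int IDENTITY(1,1) PRIMARY KEY
, MappingName    nvarchar(128) NOT NULL
, MappingType    varchar(50)   NOT NULL   -- e.g. 'dim' or 'fact'
, srcConnection  nvarchar(128) NOT NULL
, srcSchema      nvarchar(128) NOT NULL
, srcTable       nvarchar(128) NOT NULL
, tgtConnection  nvarchar(128) NOT NULL
, tgtSchema      nvarchar(128) NOT NULL
, tgtTable       nvarchar(128) NOT NULL
)

CREATE TABLE meta.ColumnMapping (
  ColumnMappingID int IDENTITY(1,1) PRIMARY KEY
, TableMappingID  int NOT NULL REFERENCES meta.TableMapping (TableMappingID)
, srcColumn       nvarchar(128) NULL
, tgtColumn       nvarchar(128) NOT NULL
, tgtDataType     nvarchar(50)  NOT NULL   -- Biml data type, e.g. 'AnsiString', 'Int32'
, tgtLength       int           NULL
)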


Using FOR XML PATH('') in T-SQL

With the "FOR XML PATH('')" syntax you can transform columns into a string. The T-SQL statement:

SELECT srcColumnList = (SELECT ', ' + srcColumn
FROM meta.ColumnMapping WHERE srcColumn IS NOT NULL
FOR XML PATH(''))

returns ", Id, f_name, l_name, sex": a string you can almost use as a column list in a source component in the SELECT <column list> FROM <source table> statement. Almost… because you'll have to remove the first comma and handle strange characters in column names. So you'll have to do some STUFF to remove this comma:

SELECT srcColumnList = STUFF(( SELECT ', ' + QUOTENAME(srcColumn) FROM meta.ColumnMapping WHERE srcColumn is not null FOR XML PATH('')),1,2,'')

which returns the string we need: [Id], [f_name], [l_name], [sex]


Creating the data conversion task

Expanding on the previous statement you can create additional dynamic content for your Biml solution, in this case the column definitions in a Data Conversion Task. In Biml you would write:

<DataConversion Name="DC">
  <Columns>
    <Column SourceColumn="f_name" TargetColumn="FirstName"
            DataType="AnsiString" Length="40" CodePage="1252" />
    <!-- more columns -->
  </Columns>
</DataConversion>

To create the column list for this data conversion use the statement:

SELECT DCColumns = REPLACE(REPLACE(STUFF(
    (SELECT  char(10) + '<Column SourceColumn="' + srcColumn 
        + '" TargetColumn="' + tgtColumn
        + '" DataType="' + tgtDataType
        + CASE WHEN tgtDataType='AnsiString'
        THEN '" Length="' + CAST(tgtLength AS varchar(10))
             + '" CodePage="1252" />'
        ELSE '" />' END
     FROM meta.ColumnMapping
     FOR XML PATH('')),1,1,''),'&lt;','<'),'&gt;','>')

Which returns the Biml string we need:

<Column SourceColumn="Id" TargetColumn="CustomerID"
     DataType="Int32" />
<Column SourceColumn="f_name" TargetColumn="FirstName"
     DataType="AnsiString" Length="40" CodePage="1252" />
<Column SourceColumn="l_name" TargetColumn="LastName"
     DataType="AnsiString" Length="40" CodePage="1252" />
<Column SourceColumn="sex" TargetColumn="Gender"
     DataType="AnsiString" Length="6" CodePage="1252" />

Some remarks on the SQL statement:

  • We don’t really need the char(10), but the line break is convenient when we look at the results.
  • Since this query uses the FOR XML syntax and XML can't handle the opening and closing tag signs, SQL Server replaces them with their escape codes. We have to use the REPLACE function to change these escape codes back into the opening and closing tag signs.

Creating the stored procedure

For this example I created the following stored procedure that combines the two metadata tables:

CREATE PROCEDURE meta.getPackageDetails (@MappingType varchar(50)) AS
SELECT
      PackageName = t.MappingName
    , TargetTable = t.tgtSchema + '.' + t.tgtTable
    , TargetConnection = t.tgtConnection
    , SourceTable = t.srcSchema + '.' + t.srcTable
    , SourceConnection = t.srcConnection
    , srcColumnList = STUFF((SELECT  ', ' + QUOTENAME(srcColumn)
        FROM meta.ColumnMapping
        WHERE srcColumn is not null
        AND TableMappingID=t.TableMappingID
        FOR XML PATH('')),1,2,'')
    , DCColumns = REPLACE(REPLACE(STUFF(
        (SELECT  char(10) + '<Column SourceColumn="' + srcColumn
        + '" TargetColumn="' + tgtColumn
        + '" DataType="' + tgtDataType
        + CASE WHEN tgtDataType='AnsiString'
        THEN '" Length="' + CAST(tgtLength AS varchar(10))
            + '" CodePage="1252" />'
        ELSE '" />' END
        FROM meta.ColumnMapping
        WHERE TableMappingID=t.TableMappingID
        FOR XML PATH('')),1,1,''),'&lt;','<'),'&gt;','>')
FROM meta.TableMapping t
WHERE t.MappingType = @MappingType
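Calling the procedure for the dimension packages then looks like this ('dim' matches the mapping type used in the Biml file below):

-- One row per package, with ready-to-use column lists and Biml fragments
EXEC meta.getPackageDetails @MappingType = 'dim'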

Less complex Biml file

The Biml file that you need to create the packages:

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <#@ include file="Connection.biml" #>

    <Packages>
    <# string sConn = "Provider=SQLNCLI10;Server=.\\SQL2012;
       Initial Catalog=BimlSamples;Integrated Security=SSPI;"; #>
    <# string sSQL = string.Format("Exec meta.getPackageDetails
      {0}", "dim"); #>
    <# DataTable tblPackages = ExternalDataAccess.GetDataTable
      (sConn,sSQL); #>
    <# foreach (DataRow pkg in tblPackages.Rows){ #>
       
        <Package Name="<#=pkg["PackageName"]#>"
                ConstraintMode="Linear">
            <Tasks>
        <Dataflow Name="DFT <#=pkg["PackageName"]#>">
          <Transformations>
            <OleDbSource Name="ODS Source"
              ConnectionName="<#=pkg["SourceConnection"]#>" >
              <DirectInput>
                SELECT <#=pkg["srcColumnList"]#>
                FROM <#=pkg["SourceTable"]#>
              </DirectInput>
            </OleDbSource>
           
            <DataConversion Name="DC">
              <Columns>
                <#=pkg["DCColumns"]#>
              </Columns>
            </DataConversion>

            <OleDbDestination Name="ODD Target"
                ConnectionName="<#=pkg["TargetConnection"]#>">
              <ExternalTableOutput
                 Table="<#=pkg["TargetTable"]#>"/>
            </OleDbDestination>

          </Transformations>
        </Dataflow>
      </Tasks>
        </Package>
        <#}#>
    </Packages>
</Biml>
<#@ import namespace="System.Data" #>

Conclusion

In this blog post I discussed some ways to make your Biml files less complex. You can move (semi-)static content to other files and import it with the include directive or the CallBimlScript function. And you can move parts of complicated Biml code to T-SQL. How much you move to T-SQL will largely depend on your proficiency in C# or T-SQL.

8 Practical BIML Tips

You can leverage Business Intelligence Markup Language – or BIML – to automate the creation of Microsoft SQL Server Integration Services (SSIS) Packages. The popular open source BIDSHelper project includes Biml functionality, enabling anyone to write and execute BIML code for free. Among professional SSIS Developers BIML is quickly gaining popularity.
In this post I’ll share some practical tips ….

1. Better copy and paste experience

The irritating behavior of the XML editor in Visual Studio when copying BIML script, and how to overcome it, has been documented on several blogs. In this overview of tips I couldn't leave it out.
In Visual Studio / SSDT select Options in the Tools menu.
In the tree view of the Options form expand Text Editor, expand XML and choose Formatting.
Next uncheck both options under Auto Reformat.

(screenshot: Visual Studio Options dialog, XML Formatting settings)

2. Put directive tags <#@ .. #> at the bottom of your BIML file

In many examples (including mine) these directives are placed at the top of the BIML file, which makes sense because this is the default location in software development. However, when Visual Studio opens a file with a directive before the <Biml> XML tag it doesn't use the XML editor, and we lose the formatting and IntelliSense features in the GUI.
So instead place the directives at the bottom of the file, after the closing </Biml> tag. This will not have any effect on the creation of the packages.

3. Vanilla Biml file

Before using script in a BIML file, create a working BIML file that can create a package with more than 80% of the expected functionality.
Copy this file, use it as a base and then start scripting.
Why? The combination of XML and code in one document makes it more complicated to select the correct BIML elements and attributes. Also use a small dataset, so that only a small number of packages is created when you test your work.

4. Test and save often

During development, regularly check your work. Use the options "Check Biml for Errors" or "Generate SSIS Packages" from the context menu. This way you not only test your work but save it as well.
Why? Debugging BIML files is mostly a pain. Error messages are limited and often refer to the wrong row, and small typos can have a huge impact. So you'd better find your errors early in development.

5. Special XML characters

XML has some special characters that you need to enclose in a CDATA section or replace with their escape codes:

  • double quote " ( &quot; )
  • single quote ' ( &apos; )
  • less than sign < ( &lt; )
  • greater than sign > ( &gt; )
  • ampersand & ( &amp; )

As an example suppose you have the following BIML:

<DirectInput>
SELECT Name FROM dbo.People WHERE Age > 25
</DirectInput>

then the XML processor will fail at "Age > 25". As a remedy change your BIML into:

<DirectInput>
<![CDATA[SELECT Name FROM dbo.People WHERE Age > 25]]>
</DirectInput>

or

<DirectInput>
SELECT Name FROM dbo.People WHERE Age &gt; 25
</DirectInput>

6. Special C# characters

C# also has some special characters that you will need to escape with a backslash. Most notably:

  • the backslash itself \ ( \\ )
  • single quote ' ( \' )
  • double quote " ( \" )

As an example, escaping the backslashes in a file location:
string FileName = "C:\\Dir\\File.txt";
or use the verbatim string construction:
string FileName = @"C:\Dir\File.txt";

7. Learn some basic C#

C# is the principal language of the .NET framework and is widely used for all sorts of programs: Web Services, Web Applications, Windows Form applications, SSIS Scripts, SQL CLR Stored Procedures etc. An investment in learning some C# will pay off. There is an abundant supply of websites and books with relevant information.

To get you started: read the chapters Basics and Flow control of this tutorial: http://zetcode.com/lang/csharp/

8. Learn from the samples

Steal / use the samples on:

SQL2012 Windowing Functions In The Data Warehouse–2. Reporting

This is the second post of a diptych on the magical windowing functions in data warehouse scenarios. With these functions you can greatly simplify the TSQL you write. Many complex queries with CTEs, temp tables and subqueries can be rewritten into simpler, better maintainable and better performing queries. In this post I'll dive into some possibilities for reporting.

For the examples in this post I'll use the Contoso Retail Data Warehouse database, a sample database for data warehouses provided by Microsoft.

Year To Date (YTD) Calculations

On the Internet you’ll find a lot of examples on using the running total technique to calculate year to date values. In this example I need the monthly sales and the YTD sales for every store.

SELECT CalendarMonth
  , StoreName
  , PeriodSales
  , SalesYTD = SUM(PeriodSales) OVER 
     (PARTITION BY StoreName, CalendarYear ORDER BY CalendarMonth)
FROM
  (
  SELECT CalendarYear
    , CalendarMonth
    , StoreName
    , PeriodSales = SUM(sal.SalesAmount)
  FROM FactSales sal
  JOIN DimDate dat ON sal.DateKey = dat.Datekey
  JOIN DimStore sto ON sal.StoreKey = sto.StoreKey
  GROUP BY CalendarYear, CalendarMonth, StoreName
  ) SalesByMonth
ORDER BY StoreName, CalendarMonth

The subquery "SalesByMonth" aggregates the sales amount per store per month. The windowing function SUM() OVER() calculates the YTD sales, which results in the required dataset:

(screenshot: query results)
The SUM(SUM()) OVER() Construction

Since you can use windowing functions over an aggregate, we don't need the subquery and we can simplify this query to:

SELECT CalendarMonth
  , StoreName
  , PeriodSales = SUM(SalesAmount)
  , SalesYTD = SUM(SUM(SalesAmount)) OVER 
     (PARTITION BY StoreName, CalendarYear ORDER BY CalendarMonth)
FROM FactSales sal
JOIN DimDate dat ON sal.DateKey = dat.Datekey
JOIN DimStore sto ON sal.StoreKey = sto.StoreKey
GROUP BY CalendarYear, CalendarMonth, StoreName
ORDER BY StoreName, CalendarMonth

The second SUM in "SUM(SUM()) OVER() … GROUP BY" is used in conjunction with the GROUP BY clause to calculate the monthly sales first.
The first SUM in "SUM(SUM()) OVER() … GROUP BY" is then used in conjunction with the OVER clause to calculate the YTD sales.

Comparing to previous year

Adding the figures of the previous year as a comparison is a common reporting requirement. You can easily realize this by using the LAG function returning the results 12 months back in time. Building upon our earlier query:

SELECT *
  , PeriodSalesPrevYear = LAG(PeriodSales,12,0)
      OVER (PARTITION BY StoreName ORDER BY CalendarMonth)
  , YTDSalesPrevYear = LAG(SalesYTD,12,0)
      OVER (PARTITION BY StoreName ORDER BY CalendarMonth)
FROM
(
  SELECT CalendarMonth
    , StoreName
    , PeriodSales = SUM(SalesAmount)
    , SalesYTD = SUM(SUM(SalesAmount))
        OVER (PARTITION BY StoreName, CalendarYear ORDER BY CalendarMonth)
  FROM FactSales sal
  JOIN DimDate dat ON sal.DateKey = dat.Datekey
  JOIN DimStore sto ON sal.StoreKey = sto.StoreKey
  GROUP BY CalendarYear, CalendarMonth, StoreName
) Base
ORDER BY StoreName, CalendarMonth

Which results in:

(screenshot: query results)
How Do We Do Compared to the other stores?

In this example I use the RANK() Function to determine the store’s rank in the total monthly sales and the store’s sales as a percentage of the total monthly sales:

SELECT CalendarMonth
  , StoreName
  , PeriodSales = SUM(SalesAmount)
  , StoreRank = RANK() OVER
      (PARTITION BY CalendarMonth ORDER BY SUM(SalesAmount) DESC)
  , StoreShare = 100*SUM(SalesAmount)/
      SUM(SUM(SalesAmount)) OVER (PARTITION BY CalendarMonth)
FROM FactSales sal
JOIN DimDate dat ON sal.DateKey = dat.Datekey
JOIN DimStore sto ON sal.StoreKey = sto.StoreKey
GROUP BY CalendarMonth, StoreName

Compare to (Average of) Previous Periods

In a recent client engagement a report used to audit the monthly invoice process gave us a lot of trouble. The SQL query behind it was very difficult to comprehend and consisted of several subqueries. By using windowing functions our team was able to greatly simplify it. The requirement, restated and simplified to our example: give us the current month's sales, the previous 3 and the average of those previous 3. This is the resulting query:

SELECT CalendarMonth
  , StoreName
  , PeriodSales = SUM(SalesAmount)
  , SalesPrevPeriod1 = LAG(SUM(SalesAmount),1,0)
      OVER (PARTITION BY StoreName ORDER BY CalendarMonth)
  , SalesPrevPeriod2 = LAG(SUM(SalesAmount),2,0)
      OVER (PARTITION BY StoreName ORDER BY CalendarMonth)
  , SalesPrevPeriod3 = LAG(SUM(SalesAmount),3,0)
      OVER (PARTITION BY StoreName ORDER BY CalendarMonth)
  , AveragePrevPeriods = AVG(SUM(SalesAmount))
      OVER (PARTITION BY StoreName ORDER BY CalendarMonth
      ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING)
FROM FactSales sal
JOIN DimDate dat ON sal.DateKey = dat.Datekey
JOIN DimStore sto ON sal.StoreKey = sto.StoreKey
GROUP BY CalendarYear, CalendarMonth, StoreName
ORDER BY StoreName, CalendarMonth

(screenshot: query results)

I especially like the way you can use the window frame clause to limit the average to 3 periods: ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING

More Information:

SQL2012 Windowing Functions In The Data Warehouse–1. Dimensions

Windowing functions, introduced in SQL Server 2005 and greatly enhanced in SQL Server 2012, have something magical: Within the context of one row in the result set you have access to the contents of the other rows of the result set.

With these functions you can greatly simplify the TSQL you write. Many complex queries with CTEs, temp tables and subqueries can be rewritten into simpler, better maintainable and better performing queries. In this post I'll dive into some possibilities for dimensions.

In SQL 2008 and earlier

In SQL 2008 and earlier I will typically build a dimension table based upon the Type 2 Slowly Changing Dimension system, and then use a view with a self join to present the user with the historical and/or current attribute values. I will use this small example of a customer dimension:

(screenshot: example rows in dim.Customer)

When Marco moved to Paris on July 20th 1988 and married Jose, a new row was added for both of them with the new attributes, and the EndDate of the old rows was set to the date of the change. This is how attribute changes are handled in Type 2 Slowly Changing Dimensions.
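The screenshot with the example rows is not reproduced here; hypothetical rows that are consistent with the example (the exact table layout is an assumption) would look like this:

-- Hypothetical content of dim.Customer matching the example above
CREATE TABLE dim.Customer (
  Id            int          NOT NULL
, Number        int          NOT NULL   -- business key
, Name          nvarchar(50) NOT NULL
, City          nvarchar(50) NOT NULL
, MaritalStatus nvarchar(20) NOT NULL
, StartDate     date         NOT NULL
, EndDate       date         NOT NULL
)

INSERT INTO dim.Customer (Id, Number, Name, City, MaritalStatus, StartDate, EndDate) VALUES
  (1, 1001, 'Marco', 'London', 'Single',  '1900-01-01', '1988-07-20')
, (2, 1001, 'Marco', 'Paris',  'Married', '1988-07-20', '9999-12-31')
, (3, 1002, 'Jose',  'London', 'Single',  '1900-01-01', '1988-07-20')
, (4, 1002, 'Jose',  'London', 'Married', '1988-07-20', '9999-12-31')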

To consume the dimension information I will typically use a view in the model schema as a source for Analysis Services / PowerPivot / reports:

CREATE VIEW model.Customer AS
SELECT his.Id
  , his.Number
  , his.Name
  , his.City
  , CurrentCity = cur.City 
  , his.MaritalStatus
  , CurrentMaritalStatus = cur.MaritalStatus
FROM dim.Customer his
JOIN dim.Customer cur
  ON his.Number = cur.Number
WHERE cur.EndDate = '9999-12-31'

Which will result in:

(screenshot: query results with historical and current attribute values)
In SQL 2012: Using the LAST_VALUE Function

If you're not working for the Oracle at Delphi, the last value will typically be the current value of an attribute. So in SQL 2012 this view can be replaced with:

CREATE VIEW model.Customer AS
SELECT Id
  , Number
  , Name
  , City
  , CurrentCity = LAST_VALUE(City)
    OVER(PARTITION BY Number ORDER BY StartDate
     ROWS BETWEEN UNBOUNDED PRECEDING
     AND UNBOUNDED FOLLOWING)
  , MaritalStatus
  , CurrentMaritalStatus = LAST_VALUE(MaritalStatus)
    OVER(PARTITION BY Number ORDER BY StartDate
     ROWS BETWEEN UNBOUNDED PRECEDING
     AND UNBOUNDED FOLLOWING)
FROM dim.Customer

Although the LAST_VALUE function seems a bit awkward to write due to the long window frame clause "ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING", it has some nice advantages:

  • No need to use a self join, which will enhance performance.
  • No need to use the EndDate column.
  • Much easier to maintain. Because the purpose of the LAST_VALUE function will be more obvious for your successors.
SCD Type 0 with the FIRST_VALUE Function

Occasionally you may stumble upon a request/requirement to show the original value of an attribute (e.g. the sales that landed the customer). In that case you can simply add a column using the FIRST_VALUE function:

FirstCity = FIRST_VALUE(City) 
  OVER(PARTITION BY Number ORDER BY StartDate)

Mapping Queries Using Windowing Functions

When you load fact tables you will want to look up the surrogate keys of the dimensions. In the simplest variant you would use (in SQL 2008):

SELECT Number, Id FROM dim.Customer
  WHERE EndDate = '9999-12-31'

In SQL 2012, assuming you will not store the EndDate in your ETL process, you can use:

SELECT Number, Id FROM
(SELECT Number, Id, RowNumber = ROW_NUMBER() OVER(PARTITION BY Number ORDER BY StartDate DESC)
FROM dim.Customer) Sub
WHERE RowNumber=1

Unfortunately you will have to use the sub query construct here because it’s not yet possible to use Windowing Functions in the WHERE clause.

But why would you not add an EndDate in the ETL process?

If you don't end date rows, the ETL process gets much easier, faster and less error-prone: you don't have to distinguish between new and changed rows, you just add both in the same way to the dimension table. And you don't have to identify and update the 'old' rows.

And of course, if you really need the EndDate, you can just get it with the new LEAD function:

EndDate = LEAD(StartDate, 1, '9999-12-31')
    OVER(PARTITION BY Number ORDER BY StartDate)

More Information:

MSDN (Books on Line) about the OVER Clause

TSQL: MERGE With EXCEPT to Handle Type 2 Loading

Type 2 loading is a pattern that is often used to incrementally load dimensions in a data warehouse. But you can also use it for loading satellites in a data vault or tables in a historical/persistent staging area. It consists of three parts:

  1. Identify new rows and add them to the target database.
  2. Identify changed rows and add them to the target database.
  3. End date the old rows in the target database that have been replaced with those of step 2.
MERGE

There are several ways to implement a Type 2 loading pattern. In this blog post I'll describe using the TSQL statement MERGE. I have come to like it, and compared to other methods it seems to be a well performing option for handling slowly changing dimension type solutions (see this study from Microsoft's Matt Mason).

Warren Thornthwaite from the Kimball Group gave a good explanation of using the SQL MERGE statement for slowly changing dimensions in Design Tip #107: Using the SQL MERGE Statement for Slowly Changing Dimension Processing.

In his example for processing a type 2 dimension he uses a sample similar to the one below:

INSERT INTO dimCustomer (AuditID, StartDate, EndDate, CustomerID, CustomerName, Adress, ..)
SELECT @AuditID, GETDATE(), NULL, CustomerID, CustomerName, Adress, ..
FROM

( MERGE DWH.dbo.dimCustomer tgt
USING STG.dbo.Customer src
ON src.CustomerID = tgt.CustomerID

WHEN NOT MATCHED THEN
INSERT VALUES (@AuditID, GETDATE(), NULL, CustomerID, CustomerName, Adress, ..)

WHEN MATCHED
AND tgt.EndDate IS NULL
AND (tgt.CustomerName <> src.CustomerName
OR tgt.Adress <> src.Adress
..
)
THEN
UPDATE SET tgt.EndDate = GETDATE()

OUTPUT $Action Action_Out, src.CustomerID, src.CustomerName, src.Adress, ..
) AS MERGE_OUT
WHERE MERGE_OUT.Action_Out = 'UPDATE';

The outcome of the MERGE statement is used as the source for the INSERT INTO statement at the top
(the aforementioned step 2: identify changed rows and add them to the target database).
In the MERGE statement there are two actions:
1.   Insert any new rows.
2.   End date the rows in the target that have changed.

NULL <> NULL

A problem with this example is the implicit assumption that there are no NULL values stored in the columns. Comparisons that involve NULL (NULL <> NULL, or NULL <> 'some value') never evaluate to true, so changes to or from NULL would not be detected by the MERGE.

There are however two SQL operators that consider NULL values as equal: INTERSECT and EXCEPT. So we can easily circumvent this problem by changing those column comparisons into:

AND EXISTS (SELECT  tgt.CustomerName, tgt.Adress …
EXCEPT SELECT  src.CustomerName, src.Adress …)
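A quick way to convince yourself: EXCEPT removes rows that are identical, treating two NULLs as equal, so the EXISTS only returns a row when at least one column really differs. A standalone illustration:

-- A direct comparison with NULL never evaluates to true, so no row is returned:
SELECT ChangeDetected = 1 WHERE NULL <> NULL

-- EXCEPT treats NULLs as equal: identical rows cancel out, no change detected
SELECT ChangeDetected = 1
WHERE EXISTS (SELECT CAST(NULL AS int), 'Main St'
              EXCEPT
              SELECT CAST(NULL AS int), 'Main St')

-- A real difference (NULL changed to 42) survives the EXCEPT: one row, change detected
SELECT ChangeDetected = 1
WHERE EXISTS (SELECT CAST(NULL AS int), 'Main St'
              EXCEPT
              SELECT 42, 'Main St')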

Four Approaches to Data Warehousing

The requirements of a data warehouse solution continually change during its lifetime. For some organizations you can expect more changes than for others; think of maturity and the history of organizational changes: mergers, acquisitions, etc.

You should adopt a solution type that will yield the best value taking these things into account.

Data Warehouse Solutions

When confronted with a new data warehouse solution that I am going to design or redesign I will typically choose between one of the following four approaches:

  1. One-Shot Data Warehouse
  2. Keeping History in Dimensions
  3. Historical / Persistent Staging Area
  4. Data Vault

In this blog post I’ll briefly dive into these approaches and discuss some of the implementation issues.

The choice for a specific approach depends on a number of circumstances, notably:

  • The expected total size of the solution
  • The number of source systems and how the information in them overlap
  • The average expected life-time of source systems: source system volatility
  • Current user requirements
  • Expected number of changes in user requirements: requirements volatility
1. One-Shot Data Warehouse (small- and mid-sized solution)

Current computing power will typically enable the complete build (from scratch) of a small- or mid-sized data warehouse solution within the typical nightly maintenance window. This can be a very efficient solution with a high return on investment. Some people call it the destroy-and-rebuild approach, since you remove all previous data from the data warehouse before rebuilding it.

An obvious disadvantage of this approach is that there is no track of changes in the source system: if a customer moves from London to Paris, all the historical sales will be reported as Paris sales. Another disadvantage is the sheer amount of data that is transferred every night, which can accumulate to an amount that cannot be loaded within the nightly maintenance window.

2. Keeping History in Dimensions

The problem of keeping track of history has been a major issue in data warehousing. In the theories of Ralph Kimball, Slowly Changing Dimensions play an important role. In his books he describes several patterns to handle history in dimensions.

Type 1: Change history

For some attributes the business isn't interested in keeping historical information. This is most appropriate when correcting previous errors, or in other situations where there is no business value in keeping track of history, for example when a person's name changes due to marriage. The rows that were imported earlier into the dimension table are corrected with an UPDATE statement, as sketched below.
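A minimal, hypothetical example of such a type 1 correction (table and column names are illustrative):

-- Type 1: overwrite the attribute in all existing rows for this customer;
-- no trace of the old value is kept.
UPDATE dimCustomer
SET CustomerName = 'Jansen-de Vries'
WHERE CustomerNumber = 1001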

Type 2: Keep history

In this pattern history is preserved because a new record is inserted for every change in a dimension attribute. The old row is marked as inactive and an end date is added to it. Any fact table records that are inserted after this dimension row has changed will link to the newly added row.

Other types

Type 1 and Type 2 are the most commonly used. However there are some alternatives. They are described in this Wikipedia article.

Other advantages

Besides the tracking of history, another major advantage is that you only need the changed and new rows (the delta rows) from your source system, thus diminishing the time you need to load the data into your data warehouse. However, getting only these delta rows can be challenging.

3. Historical / Persistent Staging Area

In this approach you'll typically take a copy of all relevant source tables and add temporal information in a 'StartDate' and an 'EndDate' column. Again you'll only process the delta rows, loading new and changed rows, and whenever a row has changed you'll end date the old row.
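A minimal sketch of such a persistent staging table (all names are illustrative; the source columns are simply copied one-to-one):

-- Hypothetical persistent staging table: a copy of the source table plus temporal columns
CREATE TABLE hstg.Customer (
  CustomerID   int           NOT NULL   -- primary key of the source table
, CustomerName nvarchar(100) NULL
, Address      nvarchar(100) NULL       -- ... remaining source columns ...
, StartDate    datetime      NOT NULL DEFAULT GETDATE()
, EndDate      datetime      NULL       -- NULL (or a sentinel like 9999-12-31) marks the current row
)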

Based upon this historical staging area you can adopt method 1 or 2 to load the data in your dimensional model.

What are the reasons for having a persistent staging area?

In an earlier blog post I gave this answer:

Well… first of all it could be a demand from auditors or data governance initiatives, possibly driven by external regulations (Sarbanes-Oxley, Basel I, Basel II, HIPAA, ...).

A second reason which can be tied to the first has to do with keeping control of / reporting on data quality issues in the source systems and thus:

  • Identify possible improvements in processes. (e.g. same data entered twice)
  • Increase / decrease confidence in the data, information and decisions

The third reason has to do with agility: the ability to respond to changes rapidly. Small changes, like changing a dimension attribute from type 1 to type 2 or adding additional attributes to a dimension, but also large changes like big organizational changes, mergers or new ERP implementations. With a persistent staging area it'll be far easier to respond to these changes and rebuild the data marts (with history), resulting in quicker, better reporting after such a change.

4. Data Vault

The Data Vault is a special implementation of a persistent staging area. Source tables are split based on column aspects separating the structural items (business keys and the associations between the business keys) from the descriptive attributes.

And thus the same advantages apply to the Data Vault as well. The power of the Data Vault comes from the integration on business keys. This approach is therefore especially suitable if you have source systems with overlapping information and/or if source systems change frequently over time.

This wiki article by Ronald Kunenberg gives a nice overview of the Data Vault Methodology with a lot of additional references.

Conclusion

In this article I discussed some aspects that you should consider when you choose an approach for (re)designing your Data Warehouse. This table summarizes these aspects:

Aspect / Approach                            | 1 | 2 | 3 | 4
Simple, fast solution                        | x |   |   |
Only process delta rows                      |   | x | x | x
Keep track of historical information         |   | x | x | x
Solve auditing / data governance issues      |   |   | x | x
Control over data quality in source systems  |   |   | x | x
Easily adapt to changing user requirements   |   |   | x | x
Source systems with overlapping information  |   |   |   | x
Frequently changing source systems           |   |   |   | x

Dimensional modeling and Data Vault – a happy marriage?

A lot of emphasis on the Internet lately has been placed on the difference between both methods. This blogpost from Dan Linstedt and this discussion on the Kimball forums are examples of this strange focus.

A focus I don’t totally understand since both techniques serve different purposes and can happily live together in most data warehouse solutions:

  • Data Vault is all about the back-room, efficiently collecting, integrating and preserving data from the source systems.
  • Dimensional modeling is all about the front-room, publishing the organization’s data assets to effectively support decision making.

From a Dimensional modeling (Kimball) perspective Data Vault is an implementation of a persistent staging area. Problems that some people have with this view/name seem to boil down to the definition of what a data warehouse is. I personally don’t have a problem with this name nor do I consider it to be less important. And yes it will qualify as a data warehouse.

Furthermore, Dimensional modeling is generally acknowledged to be the best way to publish data to the business users and has wide-spread adoption. A way to further the acceptance of Data Vault could be to position it as the best implementation of a persistent staging area.

So let’s try to make this case … but first:

What are the reasons for having a persistent staging area?

Well… first of all it could be a demand from auditors or data governance initiatives, possibly driven by external regulations (Sarbanes-Oxley, Basel I, Basel II, HIPAA, ...).

A second reason which can be tied to the first has to do with keeping control of / reporting on data quality issues in the source systems and thus:

  • identify possible improvements in processes. (e.g. same data entered twice)
  • increase / decrease confidence in the data, information and decisions

The third reason has to do with agility: the ability to respond to changes rapidly. Small changes, like changing a dimension attribute from type 1 to type 2 or adding additional attributes to a dimension, but also large changes like big organizational changes, mergers or new ERP implementations. With a persistent staging area (preferably integrated on business keys) it'll be far easier to respond to these changes and rebuild the data marts (with history), resulting in quicker, better reporting after such a change.

Of course you need to balance these advantages against the extra costs. Before we take a look at these extra costs and other raised disadvantages let’s examine some …

Similarities between Data Vault and Dimensional modeling

Let’s first focus on the methodology of both approaches. They both advocate a bottom-up iterative implementation: Identify scope, select business area, build the solution and .. focus on a new iteration.

Both use a limited number of strict table types with a well-defined function: dimension and fact tables in Dimensional modeling; hub, link and satellite tables in Data Vault modeling. A hub table and its satellites typically roll up into a dimension, and a link table and its satellites typically roll up into a fact table.

Another "fit" is the use of a volatile staging area as a synchronization and recovery point.

Focus on the extra costs

Some would argue that since you are building two data warehouses you'll end up with twice the costs. That seems far-fetched… remember that both use a staging area, and the existence of the Data Vault layer will make it easier to construct the Dimensional model on top of it. Furthermore, Data Vault modeling lends itself (as does Dimensional modeling, to some extent) to automation, which greatly reduces the costs of building and maintaining the data warehouse solution.

Others complain about the extra storage that is needed to support this solution. Why store an extra copy of the source? Well, I hope the reasons described above for having a persistent staging area answer that question. You should of course take the extra storage costs into account when evaluating your choices, but also look at the opportunity costs: do you need an extra DBA or an extra SAN to support the extra layer?

A third complaint has to do with query performance. It is suggested that the extra joins introduced with Data Vault modeling will hurt query performance. This of course depends on size, hardware, database and indexing strategy, but even if there is a significant impact, in most cases it won't have a major effect: after all, we only use these queries to populate the Dimensional model, and they only apply to new and changed rows.

Conclusion

Consider the combination of both methods when you are (re)building your data warehouse. In this blog post I have given you some handles to evaluate this option. Do you need to keep control over data quality issues, or do you want the agility to respond to changes more quickly? Can you quantify these advantages? Compare them to the extra costs and make a better informed decision.

And of course you can cont(r)act me if you want any help in this evaluation.