IN2BI, Microsoft Business Intelligence
data, stories & insights
  • Friday, 24 May 2013

    SQL2012 Windowing Functions In The Data Warehouse–1. Dimensions

    Windowing functions, introduced in SQL Server 2005 and greatly enhanced in SQL Server 2012, have something magical: within the context of one row in the result set you have access to the contents of the other rows.

    With these functions you can greatly simplify the T-SQL you write. Many complex queries with CTEs, temp tables and subqueries can be rewritten as simpler, more maintainable and better-performing queries. In this post I’ll dive into some possibilities for dimensions.

    In SQL 2008 and earlier

    In SQL 2008 and earlier I typically build a dimension table based upon the Type 2 Slowly Changing Dimension system, and then use a view with a self join to present the user with the historical and/or current attribute values. I will use this small example of a customer dimension:

    [image: example rows of the customer dimension]
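
    For reference, a minimal sketch of the dimension table that these examples assume (the column names follow from the view definitions in this post; the exact data types are an assumption):

    CREATE TABLE dim.Customer (
        Id            int IDENTITY(1,1) NOT NULL PRIMARY KEY -- surrogate key
      , Number        int NOT NULL                           -- business key
      , Name          varchar(50)
      , City          varchar(50)
      , MaritalStatus varchar(20)
      , StartDate     date NOT NULL                          -- first day this version is valid
      , EndDate       date NOT NULL DEFAULT '9999-12-31'     -- last day this version is valid
    );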

    When Marco moved to Paris on July 20th 1988 and married Jose, a new row was added for each of them with the new attribute values, and the EndDate of the old rows was set to the date of the change. This is how attribute changes are handled for Type 2 Slowly Changing Dimensions.

    To consume the dimension information I typically use a view in the model schema as a source for Analysis Services / PowerPivot / Reports:

    CREATE VIEW model.Customer AS
    SELECT his.Id
      , his.Number
      , his.Name
      , his.City
      , CurrentCity = cur.City 
      , his.MaritalStatus
      , CurrentMaritalStatus = cur.MaritalStatus
    FROM dim.Customer his
    JOIN dim.Customer cur
      ON his.Number = cur.Number
    WHERE cur.EndDate ='9999-12-31'

    Which results in:

    [image: the resulting view output]

     

    In SQL 2012: Using the LAST_VALUE Function

    If you’re not working for the Oracle at Delphi, the last value will typically be the current value of an attribute. So in SQL 2012 this view can be replaced with:

    CREATE VIEW model.Customer AS
    SELECT Id
      , Number
      , Name
      , City
      , CurrentCity = LAST_VALUE(City)
        OVER(PARTITION BY Number ORDER BY StartDate
         ROWS BETWEEN UNBOUNDED PRECEDING
         AND UNBOUNDED FOLLOWING)
      , MaritalStatus
      , CurrentMaritalStatus = LAST_VALUE(MaritalStatus)
        OVER(PARTITION BY Number ORDER BY StartDate
         ROWS BETWEEN UNBOUNDED PRECEDING
         AND UNBOUNDED FOLLOWING)
    FROM dim.Customer

    Although the LAST_VALUE function seems a bit awkward to write due to the long window frame clause (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), it has some nice advantages:

    • No need to use a self join, which will enhance performance.
    • No need to use the EndDate column.
    • Much easier to maintain, because the purpose of the LAST_VALUE function will be more obvious to your successors.

     

    SCD Type 0 with the FIRST_VALUE Function

    Occasionally you may stumble upon a request/requirement to show the original value of an attribute (e.g. the sales rep that landed the customer). In that case you can simply add a column using the FIRST_VALUE function:

    FirstCity = FIRST_VALUE(City) 
      OVER(PARTITION BY Number ORDER BY StartDate)

     

    Mapping Queries Using Windowing Functions

    When you load fact tables you will want to look up the surrogate keys of the dimensions. In the simplest variant you would use (in SQL 2008):

    SELECT Number, Id FROM dim.Customer
      WHERE EndDate ='9999-12-31'

    In SQL 2012, assuming you will not store the EndDate in your ETL process, you can use:

    SELECT Number, Id
    FROM
      (SELECT Number, Id
         , RowNumber = ROW_NUMBER()
             OVER(PARTITION BY Number ORDER BY StartDate DESC)
       FROM dim.Customer) Sub
    WHERE RowNumber = 1

    Unfortunately you will have to use the sub query construct here because it’s not yet possible to use Windowing Functions in the WHERE clause.
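
    A sketch of how this mapping query might be used while loading a fact table, assuming a hypothetical staging table stg.Sales with a CustomerNumber column:

    SELECT s.OrderDate
      , s.Amount
      , CustomerId = map.Id
    FROM stg.Sales s
    JOIN
      (SELECT Number, Id
         , RowNumber = ROW_NUMBER()
             OVER(PARTITION BY Number ORDER BY StartDate DESC)
       FROM dim.Customer) map
      ON map.Number = s.CustomerNumber
     AND map.RowNumber = 1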

     

    But Why Would You Not Add an EndDate in the ETL Process?

    If you don’t end-date rows, the ETL process gets much easier, faster and less error-prone: you don’t have to distinguish between new and changed rows, you just add both in the same way to the dimension table. And you don’t have to identify and update the ‘old’ rows.

    And of course, if you really need the EndDate you can just derive it with the new LEAD function:

    EndDate = LEAD(StartDate, 1, '9999-12-31')
        OVER(PARTITION BY Number ORDER BY StartDate)
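
    As a sketch, the same expression could be wrapped in a (hypothetical) helper view that exposes the validity interval of every dimension row without storing an EndDate:

    CREATE VIEW model.CustomerValidity AS
    SELECT Id
      , Number
      , StartDate
      , EndDate = LEAD(StartDate, 1, '9999-12-31')
          OVER(PARTITION BY Number ORDER BY StartDate)
    FROM dim.Customer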

     

    More Information:

    MSDN (Books Online) about the OVER Clause

    Jamie Thomson Debunking Kimball Effective Dates part 2 – Windowing Functions

    Introduction blog series on Windowing Functions by Fabiano Amorim on Simple Talk

    Blog post with an instructional video on the SQL 2012 Windowing Functions: Leaving the Windows Open by Jeremiah Peschka

    Book: Microsoft® SQL Server® 2012 High-Performance T-SQL Using Window Functions by Itzik Ben-Gan

  • Saturday, 18 May 2013

    Logging in SSIS with BIML 2 – Logging Package Execution in the Control Flow

    This is the second post in a series on logging in SSIS. In the first post I described logging packages and tasks within a package by using an event handler. In this post I’ll describe a second method: logging package execution in the control flow. It’s a very common pattern:

    • At the start of the package create a record in an audit table.
    • Use the identity of this row in the dataflow and add it to every row you insert in the target table.
    • Use several counters to count the rows passing through the dataflow.
    • At the end of the package update the audit row that we created in the first step with the end time and the row counters.

    Data Lineage

    The most important advantage of this method lies in step 2: by adding a reference to the audit row to every row we insert into the target table, we can trace every row back to when and how it was inserted. This has saved my ass several times, for instance by being able to roll back a faulty import from two weeks earlier and repopulate the table with the correct data.
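
    As an illustration of such a rollback, a minimal sketch assuming a hypothetical fact table dbo.FactSales that carries the AuditID reference from step 2:

    -- Remove the rows that were loaded by the faulty execution.
    DECLARE @FaultyAuditID int = 42;   -- hypothetical id of the faulty load

    DELETE FROM dbo.FactSales
    WHERE AuditID = @FaultyAuditID;

    -- Optionally flag the audit row, e.g. by resetting IsProcessed,
    -- so the load can be repeated (this use of the flag is an assumption).
    UPDATE meta.SsisAudit
    SET IsProcessed = 0
    WHERE AuditID = @FaultyAuditID;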

    Andy Leonard wrote a nice blog post on how you can set this up in SSIS. In this post I’ll focus on creating a similar package with BIML. My package will look like this:

    [image: the package control flow]

    The Audit Table

    To store the package execution information we need an audit table. Create it with the following statement:

    CREATE TABLE meta.SsisAudit(
        AuditID int IDENTITY(1,1) NOT NULL
      , PackageName  varchar(50)
      , PackageGUID uniqueidentifier
      , PackageVersionGUID uniqueidentifier
      , ExecutionID uniqueidentifier
      , StartDate datetime
      , EndDate datetime
      , PackageDuration AS DATEDIFF(second, StartDate
            , EndDate) PERSISTED
      , ExtractRowCount int
      , InsertRowCount int
      , UpdateRowCount int
      , IsProcessed bit DEFAULT(0)
      , CONSTRAINT PK_SsisAudit PRIMARY KEY CLUSTERED
        ( AuditID ASC));
    GO

    The Stored Procedures

    To log the package execution information we’ll use two stored procedures that write the information to the SsisAudit table. The first one is used at the start of the package and returns the id of the inserted row. The second one is used to update this row with the end date and the row counters.

    CREATE PROCEDURE [meta].[uspNewAuditRow]
        @PackageName varchar(50)
      , @PackageGUID uniqueidentifier
      , @PackageVersionGUID uniqueidentifier
      , @ExecutionID uniqueidentifier
      , @StartDate datetime
    AS
    INSERT INTO meta.SsisAudit
      ( PackageName
      , PackageGUID
      , PackageVersionGUID
      , ExecutionID
      , StartDate
      , IsProcessed)
    VALUES
      ( @PackageName
      , @PackageGUID
      , @PackageVersionGUID
      , @ExecutionID
      , @StartDate
      ,0);

    /* Return the id of the inserted row*/
    SELECT CAST(SCOPE_IDENTITY() AS int) AS AuditID
    GO


    CREATE PROCEDURE [meta].[uspUpdateAuditRow] 
        @ExtractRowCount int
      , @InsertRowCount int
      , @UpdateRowCount int
      , @AuditID int
    AS
    UPDATE meta.SsisAudit
    SET EndDate = SYSDATETIME()   
      , ExtractRowCount = @ExtractRowCount
      , InsertRowCount = @InsertRowCount
      , UpdateRowCount = @UpdateRowCount
      , IsProcessed = 1
    WHERE AuditID = @AuditID
    GO
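
    A quick way to test the two procedures from Management Studio before wiring them into a package (the GUIDs and row counts below are made-up sample values):

    DECLARE @AuditID int;
    DECLARE @Start TABLE (AuditID int);

    -- Simulate the start of a package execution.
    INSERT INTO @Start
    EXEC meta.uspNewAuditRow
        @PackageName        = 'PKG LoadCustomer'
      , @PackageGUID        = '00000000-0000-0000-0000-000000000001'
      , @PackageVersionGUID = '00000000-0000-0000-0000-000000000002'
      , @ExecutionID        = '00000000-0000-0000-0000-000000000003'
      , @StartDate          = '2013-05-18T01:00:00';

    SELECT @AuditID = AuditID FROM @Start;

    -- Simulate the end of the same execution.
    EXEC meta.uspUpdateAuditRow
        @ExtractRowCount = 1000
      , @InsertRowCount  = 900
      , @UpdateRowCount  = 100
      , @AuditID         = @AuditID;

    SELECT * FROM meta.SsisAudit WHERE AuditID = @AuditID;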

    BIML Snippets

    To create the logging facility in an SSIS Package use the following code snippets in the BIMLScript that creates the package:

    <!--Variables-->
    <Variables>
       <Variable Name="RcExtract"
          DataType="Int32"
           Namespace="User" >0</Variable>
       <Variable Name="RcInsert"
          DataType="Int32"
          Namespace="User" >0</Variable>
       <Variable Name="RcUpdate"
          DataType="Int32"
          Namespace="User" >0</Variable>
       <Variable Name="AuditID"
          DataType="Int32" Namespace="User" >0</Variable>
    </Variables>

    <Tasks>
       <!--Create audit row and get id-->
       <ExecuteSQL Name="SQL Create AuditRow and Get AuditID"
          ConnectionName="metaDB"
          ResultSet="SingleRow">
          <DirectInput>
    EXEC [meta].[uspNewAuditRow] ?,?,?,?,?
          </DirectInput>
          <Parameters>
             <Parameter Name="0" Direction="Input"
                DataType="Int32"
                VariableName="System.PackageName" />
             <Parameter Name="1" Direction="Input"
                DataType="Guid"
                VariableName="System.PackageID" />
             <Parameter Name="2" Direction="Input"
                DataType="Guid"
                VariableName="System.VersionGUID" />
             <Parameter Name="3" Direction="Input"
                DataType="Guid"
                VariableName="System.ExecutionInstanceGUID" />
             <Parameter Name="4" Direction="Input"
                DataType="DateTime"
                VariableName="System.StartTime" />
          </Parameters>
          <Results>
             <Result Name="0"
                VariableName="User.AuditID" />
          </Results>
       </ExecuteSQL>

       <!--In Dataflow:-->
       <Dataflow>
          <Transformations>

             <!--Add meta data-->
             <DerivedColumns Name="DC add meta data">
                <Columns>
                   <Column Name="AuditID"
                      DataType="Int32"
                      ReplaceExisting="false">
    @AuditID
                   </Column>
                </Columns>
             </DerivedColumns>
          </Transformations>
       </Dataflow>

       <!--Update audit row-->
       <ExecuteSQL Name="SQL Update Audit Row"
          ConnectionName="metaDB"
          ResultSet="None">
          <DirectInput>
    EXEC [meta].[uspUpdateAuditRow] ?,?,?,?
          </DirectInput>
          <Parameters>
             <!--The parameter order matches meta.uspUpdateAuditRow:
                 @ExtractRowCount, @InsertRowCount, @UpdateRowCount, @AuditID-->
             <Parameter Name="0" Direction="Input"
                DataType="Int32"
                VariableName="User.RcExtract" />
             <Parameter Name="1" Direction="Input"
                DataType="Int32"
                VariableName="User.RcInsert" />
             <Parameter Name="2" Direction="Input"
                DataType="Int32"
                VariableName="User.RcUpdate" />
             <Parameter Name="3" Direction="Input"
                DataType="Int32"
                VariableName="User.AuditID" />
          </Parameters>
       </ExecuteSQL>

     

    ABOUT BIML

    You can leverage Business Intelligence Markup Language – or BIML – to automate the creation of Microsoft SQL Server Integration Services (SSIS) packages. BIML is a creation of Varigence, Inc.; it is available in proprietary products and open source projects, and has been published as an open language specification. The popular open source BIDSHelper project includes Biml functionality, enabling anyone to write and execute Biml code for free.

    Do you need a head start with the automation of SSIS packages? Consider my BIML Workshop.

  • Thursday, 3 January 2013

    Logging in SSIS with BIML 1 – Logging Task Execution

    The project deployment model in SSIS (2012) offers great default logging functionality. However you may find yourself in need of additional functionality, especially if you use the package deployment model or an earlier version of SQL Server. In this series on logging I dive into 3 methods:

    1. Logging Task Execution using the OnPostExecute Event
      (this post)
    2. Logging Package Execution in the control flow.
    3. Logging Errors using the OnError Event.

    In this post I’ll describe the first method: logging task execution using the OnPostExecute event. This event is fired after the execution of each task and of the package itself, and thus provides a great way to log relevant information like package duration and row counts in the data flow.

    The SsisEventLog Table

    To store these events we need a table:

    CREATE TABLE  meta.SsisEventLog (
          EventID int IDENTITY(1,1) NOT NULL
        , ExecutionID uniqueidentifier 
        , EventLogDate datetime 
        , PackageStartDate datetime 
        , PackageDuration AS datediff(second, PackageStartDate
            , EventLogDate) PERSISTED
        , PackageGUID uniqueidentifier 
        , PackageVersionGUID   uniqueidentifier 
        , PackageName  varchar(50)
        , TaskGUID   uniqueidentifier 
        , TaskName   varchar(50)
        , ExtractRowCount int 
        , InsertRowCount int 
        , UpdateRowCount int 
        , CONSTRAINT  PK_SsisEventLog 
            PRIMARY KEY CLUSTERED (EventID)
    )

    Some remarks:

    • The PackageDuration column is calculated using the datediff function. It calculates the difference in seconds between the time the event was fired and the start date of the package.
    • If the TaskGUID equals the PackageGUID, the event was fired directly after the package itself was executed.
    • The row count columns will be filled by variables we define in the package and we’ll have to find a way to reset them after logging the event.

    To log this information we’ll use a stored procedure that writes the information to the SsisEventLog table and returns a row with 3 columns, all with a value of 0, to reset the variables after logging.

    CREATE PROCEDURE [meta].[uspLogEvent]      
          @ExecutionID [uniqueidentifier]
        , @PackageStartDate [datetime]
        , @PackageGUID [uniqueidentifier]
        , @PackageVersionGUID [uniqueidentifier]
        , @PackageName [varchar](50)
        , @TaskGUID [uniqueidentifier]
        , @TaskName [varchar](50)
        , @ExtractRowCount [int]
        , @InsertRowCount [int]
        , @UpdateRowCount [int]
    AS       

    INSERT INTO [meta].[SsisEventLog](
          ExecutionID
        , EventLogDate
        , PackageStartDate
        , PackageGUID
        , PackageVersionGUID
        , PackageName
        , TaskGUID
        , TaskName
        , ExtractRowCount
        , InsertRowCount
        , UpdateRowCount
    )
    VALUES(
          @ExecutionID
        , SYSDATETIME()   
        , @PackageStartDate
        , @PackageGUID
        , @PackageVersionGUID
        , @PackageName
        , @TaskGUID
        , @TaskName
        , @ExtractRowCount
        , @InsertRowCount
        , @UpdateRowCount
    );
    SELECT 0 AS ExtractRowCount
        , 0 AS InsertRowCount
        , 0 AS UpdateRowCount
    ;

    Creating the log event handler in the package

    Before we can create this event handler we have to create three variables of type Int32 with value 0, as in this image:

    [image: OnPostExecuteVariables]

    To create the event handler in a package, open the Event Handlers tab, select the package in the Executable dropdown box and select OnPostExecute in the Event handler dropdown box.

    Next drop an Execute SQL Task from the toolbox onto the canvas and use the following settings in the General tab:

    • Name: SQL Log Event
    • ResultSet: Single row
    • ConnectionType: OLE DB
    • Connection: The name of the connection with your log table
    • SQLSourceType: Direct input
    • SQLStatement:
      EXEC [meta].[uspLogEvent] ?,?,?,?,?,?,?,?,?,?

     

    [image: OnPostExecute1]

    In the Parameter Mapping tab add 10 mappings using the following variable names:

    • System::ExecutionInstanceGUID
    • System::StartTime
    • System::PackageID
    • System::VersionGUID
    • System::PackageName
    • System::SourceID
    • System::SourceName
    • User::RcExtract
    • User::RcInsert
    • User::RcUpdate

    [image: OnPostExecute2]

    In the Result Set tab add 3 results as in the image below.

    [image: OnPostExecute3]

    That completes the implementation of the event handler in the SSIS package.

    The Results

    After executing a package with one Execute SQL Task and one Data Flow Task, you will see three rows in the SsisEventLog table: two for the tasks and one for the package:

    [image: OnPostExecuteResult]

    [image: OnPostExecuteResult2]

    Using this method you can now easily troubleshoot any (performance) issues with the package.
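
    For example, a simple query against the log table (a sketch using only the columns defined above) shows, for the most recent events, how much time had elapsed since the package started and how many rows were processed:

    SELECT TOP (50)
        PackageName
      , TaskName
      , PackageDuration      -- seconds elapsed since the package started
      , ExtractRowCount
      , InsertRowCount
      , UpdateRowCount
      , EventLogDate
    FROM meta.SsisEventLog
    ORDER BY EventLogDate DESC;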

    Using BIML to implement the event handler

    To implement this event handler with BIML use the following code snippets in the BIMLScript that creates the package:

    <!--Variables-->
    <Variables>
        <Variable Name="RcExtract"
            DataType="Int32"
            Namespace="User" >0</Variable>
        <Variable Name="RcInsert"
            DataType="Int32"
            Namespace="User" >0</Variable>
        <Variable Name="RcUpdate"
            DataType="Int32"
            Namespace="User" >0</Variable>
    </Variables>

    <!--Events-->
    <Events>
    <Event EventType="OnPostExecute"
           Name="OnPostExecute"
           ConstraintMode="Linear">
    <Tasks>
        <ExecuteSQL Name="SQL Log Event"
            ConnectionName="NorthwindSales"
            ResultSet="SingleRow">
            <DirectInput>
    EXEC [meta].[uspLogEvent] ?,?,?,?,?,?,?,?,?,?
            </DirectInput>
            <Parameters>
                <Parameter Name="0" Direction="Input"
                    DataType="Guid"
                    VariableName="System.ExecutionInstanceGUID" />
                <Parameter Name="1" Direction="Input"
                    DataType="DateTime"
                    VariableName="System.StartTime" />
                <Parameter Name="2" Direction="Input"
                    DataType="Guid"
                    VariableName="System.PackageID" />
                <Parameter Name="3" Direction="Input"
                    DataType="Guid"
                    VariableName="System.VersionGUID" />
                <Parameter Name="4" Direction="Input"
                    DataType="String"
                    VariableName="System.PackageName" />
                <Parameter Name="5" Direction="Input"
                    DataType="Guid"
                    VariableName="System.SourceID" />
                <Parameter Name="6" Direction="Input"
                    DataType="String"
                    VariableName="System.SourceName" />
                <Parameter Name="7" Direction="Input"
                    DataType="Int32"
                    VariableName="User.RcExtract" />
                <Parameter Name="8" Direction="Input"
                    DataType="Int32"
                    VariableName="User.RcInsert" />
                <Parameter Name="9" Direction="Input"
                    DataType="Int32"
                    VariableName="User.RcUpdate" />
            </Parameters>
            <Results>
                <Result Name="0"
                    VariableName="User.RcExtract" />
                <Result Name="1"
                    VariableName="User.RcInsert" />
                <Result Name="2"
                    VariableName="User.RcUpdate" />
            </Results>
        </ExecuteSQL>
    </Tasks>
    </Event>
    </Events>


  • Tuesday, 18 December 2012

    SQL Server, SSIS, SSAS and SSRS on ONE Server

    Best practice dictates that we use a separate server for each of these SQL Server services. And this seems logical, because otherwise these services will compete over server resources. In IT this competition is called contention.

    However there are some great reasons to put these services on one box:

    • Licensing: SQL Server licensing can be expensive, and you need licenses for every server on which a service runs.
    • Better resource utilization: fewer servers, less power usage, and lower maintenance and monitoring costs.
    • Sometimes the network is the problem, as a lot of data moves from the SQL engine to SSIS or SSAS, resulting in network congestion. If the services run on the same machine, SQL Server uses the Shared Memory protocol, which is faster and leads to less network congestion.

    In a scenario with all SQL/BI services on one server we need some strategies to diminish the described contention. This is of course a big subject; in this post I can only touch the basics and give you some tips.

    TIP 1: Limit the software and services on the server

    This seems logical, but I have been to several sites where developers remote desktop into the production server to do maintenance and other jobs. This is certainly not a best practice. It is better not to install any client tools on the server and to use a separate development/test server for maintenance.

    TIP 2: Get an overview of your typical daily workload

    [image: a typical daily workload mapped to the services]

    In this image I have mapped a typical daily workload to the services needed to perform the workload:

    • From 07:00 to 23:00 is the extended working day window. Some users start early, others end late: all are extensively opening SSRS reports and querying SSAS cubes.
    • From 23:00 to 01:00 backups are running for all (source) systems. This is the period in which we do our own SSAS and SQL Server backups.
    • From 01:00 to 07:00 is our process window.

    In this period:

    • we use SSIS to land data from our source systems into the staging database
    • we use SSIS to load data from our staging database into the data warehouse
    • we process the dimensions and cubes of our SSAS databases.
    • we warm the cache of our SSAS database and start distributing reports using SSRS subscriptions and SSIS for large Excel reports.

    With this knowledge we have a clear understanding of potential contention issues. Furthermore we can schedule services to stop and start on a need-to-use basis, using the operating system commands NET START and NET STOP.

    TIP 3 Set Maximum Memory Setting for SQL Server

    With the max server memory setting you can place a limit on the amount of buffer pool memory used by SQL Server. This leads to the question: to what amount should we limit the buffer pool memory?

    During our extended working day window we will probably not benefit from a large buffer pool: most queries will be answered by Analysis Services. However, in the process window we will benefit from a larger buffer pool. We can change this property during the day and schedule these changes by using the sp_configure system stored procedure (a sketch follows the list below). Based on 24 GB of RAM, reserve 4 GB for the operating system and:

    • During working day window set max server memory to 4 GB, reserve 8 GB for SSAS and 8 GB for SSRS.
    • During the process window set max server memory to 10 GB, reserving 10GB for SSIS and 8GB for SSAS.
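
    A minimal sketch of such a scheduled change for the 24 GB example above; a similar script with a higher value would run at the start of the process window:

    -- Lower 'max server memory' to 4 GB for the working day window.
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sp_configure 'max server memory (MB)', 4096;
    RECONFIGURE;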

    TIP 4 Set SSAS Memory Settings

    The SSAS memory settings are available in the msmdsrv.ini file and in the properties window of the server. A value between 0 and 100 is interpreted as a percentage of total available physical memory; a value above 100 is interpreted as bytes. Change these settings to:

    • Memory\HardMemoryLimit: from the default of 0 to 32
    • Memory\TotalMemoryLimit: from the default of 80 to 28
    • Memory\LowMemoryLimit: from the default of 65 to 24

    This means that Analysis Services will start freeing up memory once it has reached the LowMemoryLimit threshold of 24% of physical memory. This process gets more aggressive as memory use reaches the other thresholds.
    While you’re at it, change some other memory properties as well:

    • OLAP\Process\BufferMemoryLimit from 60 to 20
    • OLAP\Process\AggregationMemoryLimitMax from 80 to 14
    • OLAP\Process\AggregationMemoryLimitMin from 10 to 4

    TIP 5 Set SSRS Memory Settings

    In the RSReportServer.config file add the WorkingSetMaximum property and set it to 8 GB (for our 24 GB example; the value is specified in kilobytes):
    <WorkingSetMaximum>8000000</WorkingSetMaximum>
    Restart the Reporting Services service, so that other memory-related properties that are based on WorkingSetMaximum get set.

    TIP 6 Monitor and adapt the memory setting

    Use the settings in tips 3, 4 and 5 as a starting point, but monitor memory usage and change these properties to match your workload.

    TIP 7 Use Windows System Resource Manager to restrict CPU utilization

    With Windows System Resource Manager (WSRM) you can create custom resource allocation policies to restrict the resources (CPU, RAM) available to an application. You can map your workload windows to policies and define these policies, e.g.:

    • a policy ProcessData, which runs when Analysis Services is processing data, setting the CPU percentage for SQL Server to 45% and for SSAS to 45%
    • a policy ProcessIndex, which runs when Analysis Services is processing aggregations, setting the CPU percentage for SQL Server to 10% and for SSAS to 80%

    [image: wsrm_resource_allocation]

    Next you can setup calendar rules to apply the different policies to the correct time frames.

    However, this has the drawback that you have two calendar schemes, one in SQL Agent and one in WSRM, that you have to keep in sync. Furthermore, it’s possible that there aren’t any CPU contention issues at all. My advice would be to monitor CPU usage and plan resource allocation policies, if necessary, based on the monitoring results.

    TIP 8 Don’t write to the same disk you’re reading from

    In the daily process window we will typically load large amounts of data into our data warehouse:

    • We load data from the source systems into a staging database.
    • From the staging database we load the data into the data warehouse.
    • From the data warehouse we load the Analysis Services cubes.

    In this scenario, put the data warehouse database files on a different drive than the staging database and the Analysis Services cubes.

    Conclusion

    In this post I focused on diminishing contention issues when using one server for your BI solution running SQL Server, Analysis Services, Integration Services and Reporting Services. The key aspect is understanding your workload and the roles the different applications play, as well as the properties you can tweak to diminish contention.


  • Thursday, 6 December 2012

    The Red Light Focus

    [image: Red Light District]

    I’m a big fan of Stephen Few, who has written some nice books on data visualization and dashboard design. What I dislike in his writings (and those of a lot of other business intelligence practitioners) is the overemphasis on bad performance.

    The winner and runner-up of the recently held dashboard competition by Stephen Few both overemphasize the worst performers of a fictitious high school mathematics class. This is done by ranking the students from worst to best grades and by adding additional (red) markers flagging the worst performers.

    Of course this overemphasis isn’t that strange, because our analytical mind has a problem-solving attitude. And yes: bad performers are a problem. And most of the time the solution to tackle these underachievers is easy: fire them.

    But the good and top performers are much more interesting. If you analyze these sweet spots you may find interesting behavior that can be repeated in more places in your organization, e.g.:

    • In department XYZ a strict quality control on incoming key products is conducted, resulting in far less scrap and rework.
    • In company ABC management enforces a great service identity, resulting in lower staff turnover and great customer satisfaction.
    • Sales representative Brit has found a great way to identify prospective buyers with significant budgets and her new customers realize 200% more than the average new customer.
    • James uses the online lectures of Khan Academy to deepen his understanding of mathematics, which contributed to his better grades.

    These best practices can probably be reproduced by others in your company or class, and can potentially have much more impact than firing underachievers. However, identifying the root cause of top-performing behavior is more difficult: you probably need to ask these overachievers a lot of questions to determine better and best practices.

    This difficulty shouldn’t be a reason to ignore them!

  • Sunday, 2 December 2012

    BIML Script To Disable Indexes And Later Rebuild Indexes

    In an earlier post I described the necessity of dropping non-clustered indexes before loading data into your data warehouse and rebuilding them afterwards. Later I found a better approach: instead of dropping the indexes you can simply disable them with:
    ALTER INDEX <IndexName> ON <TableName> DISABLE
    The loading process will now perform faster because SQL Server won’t update these indexes while you are loading. After completion of the load you can rebuild them with:
    ALTER INDEX <IndexName> ON <TableName> REBUILD
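
    For a single table the pattern boils down to the following sketch (the table and index names are just examples):

    -- 1. Disable the non-clustered indexes on the target table.
    ALTER INDEX IX_FactSales_CustomerKey ON dbo.FactSales DISABLE;
    ALTER INDEX IX_FactSales_DateKey     ON dbo.FactSales DISABLE;

    -- 2. Load the data (e.g. with an SSIS data flow).

    -- 3. Rebuild the indexes after the load has finished.
    ALTER INDEX IX_FactSales_CustomerKey ON dbo.FactSales REBUILD;
    ALTER INDEX IX_FactSales_DateKey     ON dbo.FactSales REBUILD;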
     
    The BIML code at the bottom of this post is part of a larger script in which I loop through a dataset that describes a series of packages. In the relevant packages this script is added and executed. In the image of the package you can see what happens:
    • With the Execute SQL Task ‘SQL GetIndexList’ I get a list of statements needed to disable and rebuild the non-clustered indexes of the target table. The result (full result set) of this SQL statement is stored in an object variable with the name IndexList.
    • In the For Each Loop Container ‘FELC Disable Indexes’ I loop through the rows of this IndexList and use the Execute SQL Task ‘SQL Disable Index’ for each row, thus disabling the relevant non-clustered indexes.
    • Next is the Data Flow Task, which I have kept empty for this purpose.
    • I end with a For Each Loop Container ‘FELC Rebuild Indexes’ in which I rebuild every index I disabled earlier.

    [image: HandleIndexesWithBiml]

    WARNING

    In most data warehouse scenarios this is a viable pattern. However, in a scenario where you load relatively few rows into a large table with many non-clustered indexes, the rebuild of these indexes can take very long.

    The Biml Script to create this package:

      <Biml xmlns="http://schemas.varigence.com/biml.xsd">
       
        <# string TableName="[Production].[ProductCategory]";#>
        <Connections>
          <OleDbConnection Name="TargetDatabase"
                      ConnectionString="Data Source=.\SQL2012; Initial Catalog=AdventureWorks2012; Provider=SQLNCLI10.1; Integrated Security=SSPI;"
                      CreatePackageConfiguration="true">
          </OleDbConnection>
        </Connections>
     
        <Packages>
          <Package Name="PKG Handle Indexes Gracefully" ConstraintMode="Linear">
            <Variables>
              <Variable Name="IndexList" DataType="Object" />
              <Variable Name="DisableStatement" DataType="String" />
              <Variable Name="RebuildStatement" DataType="String" />
            </Variables>
            <Tasks>
              <ExecuteSQL Name="SQL GetIndexList" ConnectionName="TargetDatabase" ResultSet="Full">
                <DirectInput>
    SELECT
       DisableStatement = 'ALTER INDEX ' + QUOTENAME(i.Name) + ' ON '
        + QUOTENAME(SCHEMA_NAME (o.SCHEMA_ID)) + '.'
          + QUOTENAME(o.name) + ' DISABLE'
       ,RebuildStatement = 'ALTER INDEX ' + QUOTENAME(i.Name) + ' ON '
        + QUOTENAME(SCHEMA_NAME (o.SCHEMA_ID)) + '.'
          + QUOTENAME(o.name) + ' REBUILD'
    FROM sys.indexes i
    JOIN sys.objects o
        ON o.object_id=i.object_id
    WHERE i.is_primary_key = 0       
    AND i.is_unique_constraint = 0  
    AND i.is_disabled = 0    
    AND i.type_desc = N'NONCLUSTERED'       
    AND o.TYPE = N'U'  
    AND QUOTENAME(SCHEMA_NAME (o.SCHEMA_ID)) + '.'
        + QUOTENAME(o.name)='<#=TableName#>'
                </DirectInput>
                <Results>
                  <Result Name="0" VariableName="User.IndexList" />
                </Results>
              </ExecuteSQL>
             
              <!--Loop through indexes and disable them-->
              <ForEachAdoLoop Name="FELC Disable Indexes"
                              SourceVariableName="User.IndexList"
                              ConstraintMode="Linear" >
                <VariableMappings>
                  <VariableMapping Name="0"
                                   VariableName="User.DisableStatement" />
                </VariableMappings>
                <Tasks>
                  <ExecuteSQL Name="SQL Disable Index"
                              ConnectionName="TargetDatabase"
                              ResultSet="None" >
                    <VariableInput VariableName="User.DisableStatement" />
                  </ExecuteSQL>
                </Tasks>
              </ForEachAdoLoop>

              <Dataflow Name="DFT YourDataFlowTask" />
              <!--DataFlow Logic-->
             
             
              <!--Loop through indexes and rebuild them-->
              <ForEachAdoLoop Name="FELC Rebuild Indexes"
                              SourceVariableName="User.IndexList"
                              ConstraintMode="Linear" >
                <VariableMappings>
                  <VariableMapping Name="1"
                                   VariableName="User.RebuildStatement" />
                </VariableMappings>
                <Tasks>
                  <ExecuteSQL Name="SQL Rebuild Index"
                              ConnectionName="TargetDatabase"
                              ResultSet="None" >
                    <VariableInput VariableName="User.RebuildStatement" />
                  </ExecuteSQL>
                </Tasks>
              </ForEachAdoLoop>
             
             
            </Tasks>
          </Package>
      </Packages>
    </Biml>


  • Thursday, 15 November 2012

    DWH Deck 2.0

    In my presentation at the DWHAutomation conference in Amsterdam on 20.09.2012 I gave a demo of the new version of the DWH Deck: DWH Deck 2.0. Here are the slides of that presentation:

    This new release is now available for interested parties.
    (If you are interested send me an e-mail: marco@in2bi.nl)

    The DWH Deck is a simple tool that allows you to build and maintain your data warehouse better and faster. The basic idea is best described by looking at the input and the output:

    INPUT:

    • Metadata that describes the source object
    • Templates that describe the target table and load pattern
      (I have provided 9, but you can change these and add additional templates)

    OUTPUT:

    • SQL Statements to create target tables and additional objects
    • SQL Statements to create stored procedures to load the data
    • BIML Files to create SSIS Packages that will handle the data load
       

    In these blog posts you’ll find additional information:

  • Thursday, 15 November 2012

    DWH Deck 2.0: How to change or add templates

    In previous blog posts I described the basics of working with the DWH Deck. In this post I’ll dive a bit deeper into the templates.

    Importance of templates

    The template concept is the basis of the DWH Deck:
    The DWH Deck combines the information from the source system with the selected templates to create the target data warehouse and the necessary ETL.

    [image: the template tab of the DWH Deck with the 3 areas indicated]

    In the template tab of the DWH Deck there are 3 areas, indicated above:

    1. The area that describes the naming conventions you want to use for your target schema, target table and job.
    2. The area where you can define additional (housekeeping) columns that you want to add to the target table:
      • AuditID
      • Start date
      • End date
      • Record source
      • Surrogate key
    3. The template statement area. In this part you can edit the selected template statement. You can choose between:

     

    Placeholder tokens

    At several places you can use placeholders that are replaced with the actual value when a job is created.

    In the next image of the dimension template the placeholder @ObjectName is marked. When you create a job for the Customer dimension this placeholder is replaced with the name of the source object (Customer), which results in:

    • Load_dimCustomer for the job name
    • Customer for the target table name (in schema dim)
    • Additional surrogate key column with the name: dimCustomerKey

    [image: the dimension template with the @ObjectName placeholder marked]

     

    In the template statement 2 types of placeholders are used:

    • Type A starts with one @ character (e.g.: @SourceConnection)
    • Type B, representing a column list, starts with two @ characters (e.g.: @@SourceColumns)

    The current Type A placeholders are:

    • @CreatedOn, the date the job was created
    • @TemplateID, the id of the template used
    • @TemplateName, the name of the template used
    • @MetaConnection, the name of the connection with the solution (meta) tables
    • @MetaServer, the name and instance of the server for the meta connection
    • @MetaDatabase, the name of the database for the meta connection
    • @SourceConnection, the name of the source connection
    • @SourceServer, the name and instance of the server for the source connection
    • @SourceDatabase, the name of the source database
    • @SourceObject, the source object e.g: [dbo].[vw_Customers]
    • @ObjectName, the name of the source object e.g.: Customers
    • @JobID, the id of the job
    • @JobName, the name of the job
    • @TargetConnection, the name of the target connection
    • @TargetServer, the name and instance of the server for the target connection
    • @TargetDatabase, the name of the target database
    • @TargetSchema, the name of the target schema
    • @TargetTable, the name of the target table

     

    The current Type B placeholders are:

    • @@SourceColumns
      ... [col1], [col2]..
    • @@NotSelectedSourceColumns
    • @@TargetColumns
    • @@DefintionTargetColumns
      … [col1] INT IDENTITY(1,1), [col2] varchar(50) ..
    • @@PKColumns
    • @@BKColumns
    • @@srcColumns
      … src.[col1], src.[col2]
    • @@tgtColumns
      … tgt.[col1], tgt.[col2]
    • @@WhereList
      … src.[col1]=tgt.[col1] AND src.[col2]=tgt.[col2]
    • @@DimWhereList
      … his.[col1]=cur.[col1] AND his.[col2]=cur.[col2]
    • @@LKPColumns
      … <Column SourceColumn=col1 TargetColumn=col1 />
    • @@SCDColumns
      … <Column SourceColumn=col1 TargetColumn=col1  MappingType=Key />
    • @@DimViewColumns
      … cur.[col1], his.[col1] as col1_his

    The Type B placeholders are defined in a separate stored procedure: meta.usp_ColumnLists.
    You can view and add definitions that are used for these placeholders.
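
    To give an idea of how the two placeholder types work together, here is a hypothetical fragment of a staging template and a possible expansion for a Categories job (an illustration, not the literal template shipped with the DWH Deck):

    -- Template fragment:
    TRUNCATE TABLE @TargetSchema.@TargetTable;
    INSERT INTO @TargetSchema.@TargetTable (@@TargetColumns)
    SELECT @@SourceColumns
    FROM @SourceObject;

    -- Possible expansion for the Categories job:
    TRUNCATE TABLE stg.Categories;
    INSERT INTO stg.Categories (CategoryID, CategoryName, Description)
    SELECT CategoryID, CategoryName, Description
    FROM [dbo].[Categories];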

    In this blog post I described the templates.
    This concludes my first series of posts on this version of the DWH Deck.

  • Wednesday, 14 November 2012

    DWH Deck 2.0: Step 1 Create solution database and other connections

    The DWH Deck is a simple tool that allows you to build and maintain your data warehouse better and faster.

    In this blog post I’ll describe the steps necessary to create the solution database and other connections that will be used as source or target connection.

    In this sample we will use the Northwind database as the source connection.
    And an empty database NorthwindDW1, which we will use as the solution and target database.

    [image]

    In the DWH Deck first create the solution:

    • Enter DW1 in the Name Textbox
    • Enter Server[\Instance] in the Server Textbox
      (Do not use “localhost” or “.” for your server name)
    • Choose NorthwindDW1 in the Database Listbox.
    • Click the Create solution button

    [image]

    The DW1 connection is added to the tree view, beneath Connections.
    In the Message text box, feedback is provided about the actions performed by the DWH Deck.

     

    Next check in SQL Server Management Studio that the tables and stored procedures are added to the database.

    • In Object Explorer open the NorthwindDW1 database.
    • Open Tables, Programmability and Stored Procedures and check the tables and stored procedures that are added to the solution database.

    [image]

     

    Next we will create the source connection to the Northwind database:

    • Click Connection in the menu bar.
    • Enter Northwind in the Name Textbox
    • Enter Server[\Instance] in the Server Textbox
    • Choose Northwind in the Database Listbox.
    • Click Save in the menu bar.

    [image]

     

    After we have created the connections we can start creating the jobs.
    This will be the subject of my next blog post.

    Overview of the DWH Deck 2.0

  • Wednesday, 14 November 2012

    DWH Deck 2.0: Step 2 Add Jobs by selecting source objects and applying a template

    The DWH Deck is a simple tool that allows you to build and maintain your data warehouse better and faster.

    In a previous blog post we created the solution database and added connections which we’ll use as source or target connection in this post.
    In this post we will create jobs that will describe a source to target mapping.

    Add the first job with the DWH Deck

    • Click Job in the menu bar
    • Select Northwind in the Source connection listbox
    • Select [dbo].[Categories] in the Source table listbox
    • Select 1 Staging in the Template listbox
    • Select DW1 in the Target connection listbox
    • Click Save in the menu bar

    [image]

    Repeat above steps for the following tables:

    • Customers
    • Employees
    • OrderDetails
    • Orders
    • Products

    [image]

    In this step we created the jobs that describe the source to target mapping.
    After we have created these jobs we can publish them.
    This will be the subject of my next blog post.

    Overview of the DWH Deck 2.0