Sunday, September 19, 2010

Super fast Data Warehouse dimensions loading with SSIS and Change Tracking

The Challenge: Updating Type 1 dimensions as fast as possible. Source tables can be quite large and retrieving only changes and new data will reduce the time it takes to update your dimensions dramatically.



As always you are limited by the source table design and as you know the design often does not have an ETL process in mind. Chances are the tables don't have a "created date" and "updated date" field, and if they do are they indexed? Indexing such fields would benefit an ETL process that retrieves only changes but will hurt performance on the OLTP system. Some times the owners of the source systems don't mind adding an index to the table in order to support the ETL but should they? If the source is a replicated source it might be acceptable to add such indexes. There are a lot of "IFs" in the previous sentences. The ETL architect should be resourcefull in a way that he can work around this challenges by using different Techniques.


This scenario has been discussed many times and has been addressed in many ways in the past. In some cases the table is small enough that the easiest thing to do is to just extract the whole table and use a slowly changing dimension to update the target dimension. In other cases more sofisticated techniques are used such as Andy Leonard's approach to incremental load or  using MD5 hashes to identify changes in source data and only do the necessary updates in the target dimension..


This techniques were developed in a pre SQL Server 2008 world. And were great answers then. But SQL 2008 came around and implemented Change Tracking and Change Data Capture.


The Solution: Change tracking is a database engine feature that allows you to do just what we want. Retrieve current values for the records that have changed since the last extraction (including new records and deletions) from a table. The only requirement is that the source table must have a primary key. Assuming that Change tracking has been enabled on the database and the table in question (see Configuring and managing change tracking) this is how it works in a nutshell:


1.- Prior to extracting the records from the source to load the dimension for the first time, the current Commit Sequence Number (CSN) is retrieved from the source database by using the function change_tracking_current_version()


2.- This value needs to be stored somewhere (I store it on the SSIS control table I use for every package where I log run times, outcome, and any other useful information to support recovery from failure and troubleshooting).


3.- The entire data set that makes the dimension is extracted and loaded into it.


4.- On the next run it must be established if the CSN last used is still valid in the source system. When change tracking is enabled a retention policy is put into place (see Configuring and managing change tracking). This is done by comparing the stored CSN from the previous run to the value returned by this function change_tracking_min_valid_version (table_object_id).
If the CSN from the previous run is lesser than the minimum valid version the tables must be re-synched (Back to step 1). This should not happen if your package runs more often than the retention policy established when change tracking was enabled for the database. The default retention policy is 2 days.


5.- Once established that the last CSN is still valid you can go ahead and obtain the current CSN as done prior to the first load. Store this CSN in a temporary location (you don't want to store it in the control table until you have successfully committed your data in the destination) and proceed to retrieve the changes for that table by using the changetable() function. Here is an example from MSDN:


 This function will return the primary keys for all the rows that have changed since the last CSN.



SELECT CT.ProductID, CT.SYS_CHANGE_OPERATION, CT.SYS_CHANGE_COLUMNS, CT.SYS_CHANGE_CONTEXT FROM CHANGETABLE(CHANGES SalesLT.Product, @last_synchronization_version) AS CT


 This function will return the primary keys for all the rows that have changed since the last CSN and other fields of interest for every row.



SELECT CT.ProductID, P.Name, P.ListPrice, CT.SYS_CHANGE_OPERATION, CT.SYS_CHANGE_COLUMNS, CT.SYS_CHANGE_CONTEXT FROM SalesLT.Product AS P RIGHT OUTER JOIN CHANGETABLE(CHANGES SalesLT.Product, @last_synchronization_version) AS CT ON P.ProductID = CT.ProductID

 Note the use of OUTER JOIN. This is required to make sure that the change information is returned for those rows that have been deleted from the user table. 

6.- Send your changes through a Slowly Changing Dimension to update your dimension. 

7.- Update your control table to reflect the last CSN used to retrieve changes. 
I always include all this steps and related tasks within a sequence container and set the transaction option for the container as required. Also make sure that the tasks included in the sequence container have the transaction option set as supported. In this way I can ensure that the last used CSN recorded in the control table is accurate and the the changes associated with it are reflected in the dimension. 

A picture is worth a thousand words:
Here is an example of what the control flow would look like for one dimension:

The first two Execute SQL task queries the control table to retrieve the last CSN used to retrieve the last changes committed to the target database and to obtain the current CSN or change tracking version from the source database. Inside the DimWDList the first Execute SQL task checks if the table has data. Using and expression in the precedence constraint we can control what tasks will execute next. If the Dimension does not contain data (Think re-population or first run) the "Full Population" data flow gets executed. 
This data flow retrieves all data from the source table and does a bulk insert or fast load in the target table. If there is data already in the target table, the "Checks min valid ver" task gets executed (Step 4 from paragraph above). Retrieves the minimum valid change tracking version for the source table and stores this value in a variable. Select CHANGE_TRACKING_MIN_VALID_VERSION(object_id('List')) Next, the precedence constraints evaluate an expression to determine if the CSN used to retrieve the previous updates is still valid. if it is, the retrieve updates component gets executed as described in step 5.
The OLE DB source task looks like this..
if the CSN used to retrieve the previous updates is no longer valid, the table needs to be re-synched and the "Re-synch" data flow gets executed. It retrieves all records from the source table and pushes them through a slowly changing dimension to bring the dimension to a current state without having to remove or disable foreign keys in order to truncate.  Under normal operations almost always the package should execute the "Retrieve changes" component. The other data flows were added to avoid manual intervention in the event that the package does not run successfully for a period of time grater than the change tracking retention policy specified by the source. 
Conclusions: Change tracking allows you to retrieve only those rows that have changed with minimum effort. Therefore reducing dramatically I/O and the extraction time. Transformations need to be applied only to the records that need to be updated and the slowly changing dimension only updates those records that need it and inserts new records. 
I have improved the performance of packages packages that populate 20+ dimensions go from 15 minutes to under 1 minute. Do i need to say more? :-) Change tracking is the way to go... 
P.S: This example does not necessarily represent the most efficient way to handle the dimension update process as a whole. The purpose of this example is only to explain how to use change tracking in this particular scenario. Also, the logic to delete records from the dimension deleted in the source is not represented in this example as most likely you will not want to do this.

5 comments:

  1. Good article Luis. Seems that CT is the fastest route if you do not have changing historical data.

    ReplyDelete
  2. Hi,

    Can you Please provide a this Sample SSIS ?

    ReplyDelete
  3. Luis really it's very helpful.
    Please provide us a sample SSIS?

    ReplyDelete
  4. The Challenge: Updating Type 1 dimensions as fast as possible. Source tables can be quite large and retrieving only changes and new data will reduce the time it takes to update your dimensions dramatically. http //www.survival-warehouse.com

    ReplyDelete