Monday 29 April 2013

Using Checksum in SSIS for Delta Loads (Incremental Loads)

The CHECKSUM function in SQL Server is a great way to compare two rows to see if the data matches. CHECKSUM returns a number computed from the values in the row. When you compare the checksum values of two rows and they match, the data on the rows almost certainly matches, and if a change has occurred on one of the rows then the checksum value will change. (Strictly speaking, CHECKSUM can occasionally return the same number for different inputs, so if you need guaranteed change detection look at HASHBYTES; for a typical delta load CHECKSUM is usually good enough.)
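A quick way to see this behavior in action (the values here are made up for illustration):

SELECT CHECKSUM('Mike', 'Davis', '123 Main St') AS OriginalRow
,CHECKSUM('Mike', 'Davis', '123 Main St') AS UnchangedRow
,CHECKSUM('Mike', 'Davis', '456 Oak Ave') AS ChangedRow

The first two checksums come back identical; the third differs because the address changed.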

Here is a table that has some data that I am going to load into a second database. The ID column is the primary key and therefore unique.

[Image: source table with sample data]

Here is the table that needs to be updated.

[Image: the table to be updated]

Notice that Mike Davis has moved to a different address in the same city, so this is the only row that needs to be updated. If I use an SSIS package to load this data, I could truncate and reload the entire table. If the table only has a few rows this would be fine, but if the table contains millions of rows it would be a very long process. We could use a lookup and compare each column to see if there are any changes, which is fine if there are only a few columns. If the table has hundreds of columns, mapping every column in the lookup transform would take a long time and would perform badly due to all of the comparisons. With the checksum value we can compare just one column in a conditional split transform.
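To make the walkthrough concrete, here is a rough sketch of the schema both tables share; the data types are an assumption based on the column names used throughout this post:

CREATE TABLE dbo.InputData
(
ID INT PRIMARY KEY
,[First] VARCHAR(50)
,[Last] VARCHAR(50)
,Address1 VARCHAR(100)
,Address2 VARCHAR(100)
,City VARCHAR(50)
,[State] CHAR(2)
,ZipCode VARCHAR(10)
)
-- TableToUpdate in the Testing database is assumed to have the same structure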

Here is the data flow in an SSIS package that uses the checksum value to compare the input table to the table to update. Below we will look at each transform in detail.

[Image: the full data flow]

The source is an OLE DB source using the following SQL command.

SELECT CHECKSUM(ID, [First], [Last], Address1, Address2, City, [State], ZipCode) AS [Checksum]
,ID ,[First] ,[Last] ,Address1 ,Address2 ,City ,[State] ,ZipCode
FROM InputData

The results from this query are shown below.

[Image: query results showing the Checksum column]

Notice the checksum value in the first column. Here the CHECKSUM function is wrapped around every column in the table; if you only want to compare a subset of the columns, list only that subset inside the CHECKSUM function. The syntax is CHECKSUM(Col1, Col2, Col3, ...) AS Alias.
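For example, if you only cared about address changes, you could compute the checksum over just those columns (a sketch against the assumed schema above):

SELECT CHECKSUM(Address1, Address2, City, [State], ZipCode) AS [Checksum]
,ID ,[First] ,[Last] ,Address1 ,Address2 ,City ,[State] ,ZipCode
FROM InputData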

The next transform is a lookup, which matches on the ID columns and retrieves the checksum from the table that needs to be updated.

I set the lookup to redirect non-matching rows to the no match output.

[Image: lookup transform settings, redirecting non-matching rows to the no match output]

This lookup transform also uses a query with the checksum value in it.

[Image: the lookup query]
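Based on the description, the lookup query presumably looks something like this, mirroring the source query against the target table (a sketch; only the ID and the checksum are needed downstream):

SELECT CHECKSUM(ID, [First], [Last], Address1, Address2, City, [State], ZipCode) AS [Checksum]
,ID
FROM TableToUpdate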

The results of this query are shown below. Notice that all the checksums match the input table's checksums from above, except for the row that has changed, which is the row with ID 1.

[Image: lookup query results]

In the Columns node of the lookup transform I mapped the ID columns together and placed a check next to the Checksum column on the lookup table so it is added to the data flow. This way we can compare the checksums in the next transform, the conditional split.

[Image: lookup column mappings]

If the lookup does not find a match, the record is new and needs to be added to the table regardless of the checksum value. The no match output goes to an OLE DB destination.

The matching rows go to the conditional split, which compares the checksum values. The conditional split has two outputs: the “No Changes” output receives the matching checksums, and the “Updates Needed” output receives the non-matching checksums. If the checksums do not match, something on the incoming row is different from the existing table and the row needs to be updated. We know the table has a matching ID because the preceding lookup found the matching IDs. The split condition itself is just a comparison of the two checksum columns, as sketched below.

[Image: conditional split configuration]
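The exact expression depends on what you named the checksum column coming out of the lookup; assuming the lookup column was added as LKP_Checksum (an assumed name; use whatever alias you gave it), the split conditions would be along these lines in SSIS expression syntax:

Updates Needed:  [Checksum] != [LKP_Checksum]
No Changes:      [Checksum] == [LKP_Checksum]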

The “Updates Needed” rows go to an OLE DB Command transform, which runs an update statement against every column of the row with the matching ID. Note that this transform fires once per incoming row.

Here is the update command.

UPDATE [Testing].[dbo].[TableToUpdate]
SET
[First] = ?
,[Last] = ?
,[Address1] = ?
,[Address2] = ?
,[City] = ?
,[State] = ?
,[ZipCode] = ?
WHERE [ID] = ?
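A quick note on how this wires up: the statement above goes into the OLE DB Command transform's SqlCommand property, and each ? placeholder surfaces as a parameter named Param_0, Param_1, and so on, in the order the ?s appear, so the final Param_7 maps to the ID in the WHERE clause.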

Here is the parameter mapping in the OLE DB Command transform.

[Image: OLE DB Command parameter mapping]

Each parameter is mapped to the proper column to update, and the ID parameter is mapped to the WHERE clause. Go back to the picture of the entire data flow and see if it makes sense now.

This is not the only way to accomplish this. I could have written the updated rows to a staging table and done a set-based update against the target table to improve performance.
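A sketch of that alternative, assuming a staging table named dbo.UpdateStaging with the same columns as the target: point the “Updates Needed” output at an OLE DB destination loading the staging table, then run one set-based statement in an Execute SQL task after the data flow:

UPDATE t
SET t.[First] = s.[First]
,t.[Last] = s.[Last]
,t.Address1 = s.Address1
,t.Address2 = s.Address2
,t.City = s.City
,t.[State] = s.[State]
,t.ZipCode = s.ZipCode
FROM [Testing].[dbo].[TableToUpdate] t
JOIN dbo.UpdateStaging s ON s.ID = t.ID

TRUNCATE TABLE dbo.UpdateStaging

One UPDATE over the whole batch avoids firing the OLE DB Command once per row, which is where the performance gain comes from.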

On a side note, all of this work could be done with one simple add-on. Task Factory has an Upsert destination that does all of this work, including the batch update, so it is very fast. If you used the Task Factory Upsert destination, your data flow would look like this.

[Image: data flow using the Task Factory Upsert destination]


HAPPY CODING........

Original post: http://mikedavissql.com/2011/08/15/using-checksum-in-ssis-for-delta-loads/