I've been writing about SSIS for quite some time now and probably one of the more popular articles post I have written is this one:
Get all from Table A that isn't in Table B
which I wrote way back in 2004.
Its not popular because of the quality of the writing, oh no. Its because of the subject matter. It details two ways that a SSIS developer might discover whether a row already exists in the target or not:
Use a MERGE JOIN or
Use a LOOKUP
Of the two options I think its fair to say that over time, the choice of using the LOOKUP component has proved to be the most popular.
Well interestingly Greg van Mullem has defined another way of achieving the same thing and he has written about it here: http://www.mathgv.com/sql2005docs/SSISTransformScriptETL.htm
Greg's method uses a script component to do the comparison between source and target, and highly useful code it is too. I recommend you digest Greg's article fully because there is a lot to learn from it - particularly in regard to accessing relational sources from within a script component.
The main reason Greg defined his own method was that he determined that while the LOOKUP option was useful, it had its limitations. He figured that while this method could determine that a row was already existing in the target, it couldn't determine whether the row had changed or not. Well I beg to differ. The key to determining whether the row has changed or not is make columns from the LOOKUP dataset available in the pipeline subsequent to the LOOKUP component - that way they can be compared with the incoming values using a DERIVED COLUMN or CONDITIONAL SPLIT component.
That was a rather verbose explanation so let me try and demo this. I have built a package as so:
The sequence container on the left hand side merely sets up some lookup data for us. It creates it in a table called tempdb..CustomerLookupData. Here's the entire contents of that table after it has been created and populated.
Very simple stuff!
Now here is the real crux of this post. The data-flow that contains my LOOKUP component:
The important things to show you here are the LOOKUP component and the CONDITIONAL SPLIT. Here's the LOOKUP:
Remember that you need to setup the LOOKUP component to redirect error rows. This means that any rows that don't find a match in the lookup cache will go down the error output. It isn't an error per se but for these purposes it proves to be very very useful.
What I need to point out to you here is that I am putting a column from the lookup dataset (i.e. from the CustomerLookupData table) into the pipeline after I have done the lookup and I have given it the alias LookupAge.
In the CONDITIONAL SPLIT pictured here:
I compare LookupAge with the Age column of the incoming row. If they are different then I direct them to the "Age has changed" output, otherwise they are the same and they go to the default output called "Age has stayed the same".
I've got data viewers on the three outputs:
Age has changed
Age has stayed the same
so we can see what appears in each data path. Here is the executed data-flow:
As you can see the package has successfully determined:
I'm not trying to say that this is the right way and Greg's is wrong - I'm a great advocate of there being more than one way to achieve a goal within SSIS. I was slightly concerned however that anyone reading Greg's article may think that this wasn't possible without resorting to writing code - and that simply isn't the case.
If you want to try this for yourself then download the package from here. It executes against the SQL Server instance called localhost so make sure you have an instance of SQL Server running there - or just change the connection manager.
Hope that is useful
UPDATE 2006-10-23: Someone who goes by the pseudonym "The Viewmaster" has done a performance comparison of the various mthods of achieving this. I will let you read the results for yourself here.