Welcome to EMC Consulting Blogs Sign in | Join | Help

SSIS Junkie

SSIS: A case study of using LOOKUPs instead of MERGE JOINs

Recently Mag (surname withheld :) from California contacted me with a few queries around a package that she was building. The package was a straightforward fact table load that required to use the dimensional natural keys in the fact data to get the corresponding surrogate keys from the dimension tables. All pretty simple stuff right? Well unfortunately Mag was loading 2.5million rows of fact data and it was taking far far too long. After a fairly lengthy email exchange it transpired that Mag was using MERGE JOIN components in the pipeline to get the surrogate keys and eventually we agreed it would be best to try using LOOKUP components instead.

Well Mag did just that and its not hyperbole to say that the performance improvement she got was stupendous. She has been kind enough to write a case study and allow me to publish it herein. I really recommend you take a read, the results are pretty astounding.

==============================================================================================

SQL Server Integration Services

Sort and Merge Components vs. Lookup Component

 

 

As I have worked with SQL Server Integration Services (SSIS) to load data into a dimensional database, I have seen the huge difference when using sort and merge components and lookup components to populate over a million records.   At first, I adopted the sort and merge components in my approach to populate a fact table and later saw that using the sort and merge component was too memory intensive and found using the Lookup component to be a much better approach. Let me tell you about my findings.

 

Let’s start with detailing my hardware components and software.  I used the all-great Microsoft Virtual Server.  I created a virtual server with 45 GB static (non-dynamic) hard disk and gave my virtual server 1 GB of memory. The swap file virtual memory setting on the virtual server that I created was set to 1536-3072MB. As for the software: I loaded Windows 2003 with all the service packs.  I installed the IIS windows application.   I installed SQL 2005, Integration Services, Analysis Services and Reporting services, Report Builder on the same server.

 

Then I proceeded to build my dimension and fact table population packages.  The first fact table population package I built was using the Sort and Merge components.  After having huge memory problems with this package when transferring in 2.5 million records Jamie Thomson advised me to change my approach and build a different package using the Lookup component.  There I used the Lookup component in lieu of the sort and merge components and boy, what a difference a component makes!!

 

Let’s look at in detail the configuration of the package that used the lookup component.

20051012CaseStudy1.JPG

Next edit the Lookup component.   Choose the table or view that you will need to lookup the desired field from.  In this case we will want to compare the patient_id coming in from the stg_fact_transaction table with the patient_id in the Dim_Patient table.

20051012CaseStudy2.JPG

Next be sure to join the staging table field with the dimension table field.  In this case its patient_id. 

Then in the Dimension table choose the surrogate key field by clicking the box next to the field. 

This is so that the field you want copied to your fact table will be the surrogate key in the dimension table.

In the Output Alias box, alias the surrogate key field something that is similar to what you named the foreign key in your fact table that pertains to that dimension. 

Click OK.

20051012CaseStudy3.JPG

For your next dimension you repeat the same steps as above. Those steps summarized are:

  1. Add the lookup component again.
  2. Edit the lookup component.
  3. Choose the table or view that you will need to lookup the desired field from.
  4. Join the staging table field with the dimension table field, alias accordingly.

 

Keep doing that until you are done with all the dimensions you will be looking up.  Then lastly add the destination component.

 

 

Add the destination

20051012CaseStudy4.JPG

Map the input column to the destination column. 

 

That is it!! This will successfully populate the Foreign Keys in your Fact table and relate the dimensional descriptions to your additive facts!!!

 

Using Lookup incredibly reduced populating 2.5 million records in my fact table from 2 days to 45 minutes!!!!

================================================================

 

Great work Mag. That's a fantastic performance improvement. Music to the ears of the dev team in Redmond no doubt!!! And thanks again for producing this!

 

-Jamie

Published Friday, October 21, 2005 7:36 AM by jamie.thomson

Comments

 

Michael Barrett said:

Great work!

The performance of the pipeline can probably be further improved by modifying the lookup queries in the lookup component so that only the needed columns are cached in memory. In the case of the Dim_Patient table this lookup query would be:

SELECT Patient_id, SKPatient_Id
FROM Dim_Patient
October 21, 2005 7:59 AM
 

Umberto Gatti said:

45 minutes from 2 days isn't an improvement, it's a miracle :)

Anyway, this case study was interesting, but i'd like to know more about using memory restriction in the advanced tab. When am I supposed to use it and in which ways can it improve lookups?

Thanx
October 21, 2005 8:49 AM
 

Professional Association for SQL Server (PASS) SIG said:

October 21, 2005 8:43 PM
 

Frans van Bree said:

You might consider using a SQL update statement, if speed is your main concern.

Firstly, I assume the looked-up fields have an index set. So in the example there should be an index on a field like DIM_PAT.Patient_Id. This will make lookups lightning fast.

Secondly, the single UPDATE-statement might be something like:

UPDATE STG_FACT_TRANSACTION
SET
FK_SKPatient_Id = ISNULL(pa.SKPatient_Id,-1),
FK_SKBillDate_Id = ISNULL(bd.SKBillDate_Id,-1),
FK_SKEntryDate_Id = ISNULL(ed.SKEntryDate_Id,-1)
FROM
STG_FACT_TRANSACTION s
LEFT OUTER JOIN DIM_PAT pa ON s.PATIENT_ID = pa.PATIENT_ID
LEFT OUTER JOIN DIM_DATE bd ON s.BILLDATE = bd.DATE
LEFT OUTER JOIN DIM_DATE ed ON s.ENTRYDATE = eb.DATE

Add all the other lookups in a simular fashion. (I made up the fact that the time dimension is called DIM_DATE and has a lookup-field called DATE)

You can add a WHERE-clause to optimize things even further:

WHERE
ISNULL(s.FK_SKPatient_Id,-1) = -1 OR
ISNULL(s.FK_SKBillDate_Id,-1) = - 1 OR
ISNULL(s.FK_SKEntryDate_ID,-1) = -1

What you lose, amongst others, is detailed logging functionality. Something the built-in task would support.

Hopefully this has some value in your discussion.

Greetings,

Frans van Bree
October 27, 2005 9:49 AM
 

Frans van Bree said:

Correction:

UPDATE STG_FACT_TRANSACTION
SET
FK_SKPatient_Id = ISNULL(pa.SKPatient_Id,-1),
FK_SKBillDate_Id = ISNULL(bd.SKDate_Id,-1),
FK_SKEntryDate_Id = ISNULL(ed.SKDate_Id,-1)
FROM
STG_FACT_TRANSACTION s
LEFT OUTER JOIN DIM_PAT pa ON s.PATIENT_ID = pa.PATIENT_ID
LEFT OUTER JOIN DIM_DATE bd ON s.BILLDATE = bd.DATE
LEFT OUTER JOIN DIM_DATE ed ON s.ENTRYDATE = eb.DATE
October 27, 2005 10:02 AM
 

SSIS Junkie said:

I thought it would be worth publishing a list of guidelines that I see as SSIS development best practices.

November 8, 2006 9:52 PM
 

SSIS Guy said:

One reason why it was also taking so long is because the sort component is much slower than if you let database engine do the sort for you.  If you have data from an oledb source then pass that to the sort component, that's slower than if your oledb source is already sorted.  So instead of selecting a table as your oledb sourece, use a sql query with an order by.

January 26, 2007 6:08 PM
 

Shim said:

Hey guys,

i'm a DTS user, moving toward SSIS. and i have a question regards to mapping source and destination data.

In DTS, we were able to write up a script during the mapping process. so i could of say something like,

if field A is blank, then set the value to NUll

or

if field B is blank, then set the value to 0

Can we do something like that in SSIS? i can't find anywhere to write up a script in the mapping process

Shimmy

August 9, 2007 7:27 AM
 

jamie.thomson said:

Shimmy,

You can use a Derived Column component to do the same thing.

But it used to work in DTS (5) - Changing values in the pipeline

(http://blogs.conchango.com/jamiethomson/archive/2006/07/03/SSIS_3A00_-But-it-used-to-work-in-DTS-_2800_5_2900_-_2D00_-Changing-values-in-the-pipeline.aspx)

-Jamie

August 9, 2007 5:20 PM
 

HR said:

I want to know is the lookup component works the same as a LEFT JOIN or more like a INNER JOIN?

I've take out the data with a OLE DB Source,then I use another OLE DB component to populate the data from the iother DB and finally I uses a MERGE JOIn component. All of this to do a LEFT JOIN query.

It would be better if I use LOOK Up component?

September 5, 2007 8:08 AM
 

Muhamed Ahmed Kukaradza said:

To HR said

Lookup doesn’t work like INNER JOIN or LEFT OUTER JOIN. In INNER JOIN or LEFT OUTER JOIN records may multiply like rabbits ;)

September 5, 2007 3:11 PM
 

jamie.thomson said:

Muhamed is absolutely correct.

I think its OK to say "lookup is closer to INNER JOIN than LEFT OUTER JOIN" but it would be dangerous to think of them as the same thing.

-Jamie

September 5, 2007 4:53 PM
 

HR said:

I belived that LOOKUP component took all the records in table A and "look" on the table B for the fields where you conditions match, for example:

Table A contains IDemployee and Name

Table B contains IDempoloyee and DeparmentName

I belived that if you use a OLE DB component to populate the records from table A. Then use a LOOK UP component to find the DeparmentName it would OutPut all the records for the OLE DB component (table A) with the daparmet name. Returning deparmentName as NULL whe condition doesn´t match. (like a Left Outer Join).

If it isn't that the way it works, could you ilustrate me how if its done?

Thanks in advance!

P.S.: I congratulate you guys for this blog, it has helped me a lot. Thanks again!

September 6, 2007 9:19 AM
 

jong said:

HR, I have the same exact question.  

What I would like to do is use the lookup component to basically perform an update on a datasource. If the ID exists in the lookup column, update the table. If the id doesn't exist in the lookup column, move on.

It doesn't appear to work like this.  Does anyone have a good example?

November 29, 2007 3:20 AM
 

jamie.thomson said:

November 29, 2007 3:25 AM
 

Ashwin said:

What if I had 45 million rows in the fact table and 2 million rows in one of the dim tables. Whats the option for me? Do I go with Lookup , Merge or just the SQL statement. I cannot afford more than 45mins to load my fact table.

January 2, 2008 9:35 PM
 

jamie.thomson said:

Ashwin,

I have absolutely no idea. There are so many variables in play here its impossible for anyone except you to say what is the best configuration.

I can tell you my best guess. The number of rows in the fact table (I presume you mean number of rows in the pipeline) is largely irrelevant. When using the Lookup then the biggest bottleneck is USUALLY (not always) the size of the cache. 2m rows in the cache is quite a lot but again, its impossible for me to say whether SSIS cna handle it or not because of the paucity of informaiton here. Try it and let me know - I'd be interested to find out.

-Jamie

January 3, 2008 8:38 AM
 

Ashwin said:

Thanks for the info. This is what I would like to do. The process runs for every 2 hours and loads the Dimension and Fact tables. You can call it as incremental loads. So while trying to load the fact table using the Lookup, the lookup is caching the fact table rows before the data flow task gets executed which is taking forever. I did wait for more than an hour and lost my patience and stopped the process execution. I am not even sure if SSIS can handle heavy loads. I am not even thinking of using MergeJoin. The only other option I have is to build the SQL query on a fly and passing it to the Ole DB source transformation.

Thanks

Ashwin

January 4, 2008 7:29 AM
 

jamie.thomson said:

The lookup will always cache data before the data flow starts. That's how it works - there's not a lot of point otherwise.

Certainly it seems as though Lookup is not an option for you given your data volumes. Why is Merge Join not an option? You shouldn't read the article above and automatically assume that merge Joins are bad. They are very useful given the right scenario.

Whichever way you look at it, trying to cross 45m rows with another 2m is a HUGE amount of work to do in 2 hours. its not that SSIS isn't up to it - your data volumes are big and your time window is small. Something needs to change..perhaps beefier hardware? Who knows...the point is that you can't blame SSIS for not achieving something that might even be bound by the laws of physics. That's not to say it can't be achieved, but there's a lot of factors in play here.

-Jamie

January 4, 2008 7:16 PM
 

KoolMay said:

I have around 30 dimensions and i need to built a fact table taking the keys from the dimesnion tables. Could someone explain to me how to built this?

Thanks in advance

January 12, 2008 10:19 AM
 

SSIS MOJO said:

Very nice post. I have now learned some thing I can take to my next job :)

February 27, 2008 6:28 PM
 

Sanjay Sajeevan said:

How does merge join compare with lookup if the data is pre-sorted ?

One disadvantage I saw with lookup was that lookup doesn't have expressions

I can't change the reference table name dynamically

Whereas in merge join, I can have 2 inputs ( coming from OLE DB source )

I can set the table name dynamically in the OLE DB Source

March 21, 2008 7:35 AM
 

jamie.thomson said:

Sanjay,

It would depend on many things, not the least the size of the data. Why don't you try it and find out.

You're absolutely right about the inability to use expressions - this is a major shortcoming. FYI: This has been fixed in SSIS2008.

-Jamie

March 21, 2008 2:06 PM
 

Pedro Perfeito (PedroCGD) said:

Good job... and if your try the new SQL 2008, using the cache component, you will be reducing your package to 5 minutes!! :-)

Cheers!!!!

March 21, 2008 7:43 PM
 

Jonhy said:

Hi guys

I am new on SSIS, and I am trying to follow above sample to update 1 column on one table.

But when I run the SSIS, it dispaly an error saying that some other columns does not allow nulls.

Seems my SSIS is trying to Insert new records insted of update the existing...

I do not know what I am missing to use the update insted of the insert... any light will be really appreciate it

Thanks

May 1, 2008 12:10 AM
 

jamie.thomson said:

Hi Johny,

The only component within the pipeline that can do UPDATEs is the OLE DB Command component. Is that what you are using?

-Jamie

May 1, 2008 9:41 AM
 

Jonhy said:

Thanks Jamie

Nope, I was using a SQL Destination and OLEDB Destination, I tried the OLEDB Command and it worked :)

But I notice it is slow to make the updates, seems it is updating record by record...

There is not other option to update tables... the situation is that I have table A on one Server and Table B on other server, then I am trying to update Table A with values of Table B... but I was looking for options insted of use LinkedServers

Thanks

May 2, 2008 3:40 PM
 

jamie.thomson said:

Jonhy,

Yes, OLE DB Command is very very slow.

The only other option is to INSERT the data into a temporary table and then issue a set-based update from the Execute SQL Task.

Regards

Jamie

May 2, 2008 3:59 PM
 

RABINDRA JHA said:

Hi,

I want to insert data in fact table in such a way that only new data is inserted.I have source in which the data is increasing day by day.for example In fact table earlier if there 100 rows I want to insert 10 more rows.One method I know is delete fact table when we run the query and reinsert the data but I think it is not gud solution.So please suggest me how can I insert data if the row is found in destination fact table then it will update it and if no it will insert it.

May 14, 2008 7:55 AM
 

jamie.thomson said:

Rabindra,

Take a read of this:

Checking if a row exists and if it does, has it changed?

http://blogs.conchango.com/jamiethomson/archive/2006/09/12/SSIS_3A00_-Checking-if-a-row-exists-and-if-it-does_2C00_-has-it-changed.aspx

-Jamie

May 14, 2008 9:46 AM
 

RABINDRA JHA said:

Thanks Jamie,it is working but I have one confusion as we dnt have primary key in fact table so how I will map from available  input column to available output column.I have map all the foreign keys.

May 14, 2008 12:05 PM
 

jamie.thomson said:

Do you mean in your lookup? The existence of a PRIMARY KEY is irrelevant. It only makes sense to do this if you compare against a candidate key (i.e. a combination of columns that uniquely defines each row).

-Jamie

May 14, 2008 12:12 PM
 

RABINDRA JHA said:

Thanks a lot..

May 14, 2008 1:24 PM
 

RABINDRA JHA said:

Hi, I have more query.Error output of lookup is actual destination.I want the output in actual output.How can I do this.

May 14, 2008 2:07 PM
 

jamie.thomson said:

Sorry, you'll have to rephrase. I don't know what you mean.

May 14, 2008 2:17 PM
 

RABINDRA JHA said:

Hi,

I want to insert data in fact table in such a way that only new data is inserted.I have source in which the data is increasing day by day.for example In fact table earlier if there 100 rows I want to insert 10 more rows.As in ur blog I have taken lookup tranformations and redirect rows to error output all the unmatched column will go in error output and this will insert new data in the destination table .  Instead of using error output I want unmatched column in main output.Can I do this.It is just reverse of lookup.Unmatched column in main output and matched column in error output.

May 14, 2008 2:24 PM
 

jamie.thomson said:

The rows that do not find a match go to the error output. There is no actual error here per se, its just that that is how they are treated. Once the rows are outputted from the component it doesn't matter what colour the arrow is, it only matters what you do with the data.

Sure, it would be nice if the output containing the unmatched rows were not an error output but its no big deal.

-Jamie

May 14, 2008 2:31 PM
 

rabindra said:

The whole answer is not visible in this blog.Some part is hidden.

May 14, 2008 3:00 PM
 

RABINDRA JHA said:

Hi,

If you find the solution please reply me.

May 14, 2008 3:07 PM
 

jamie.thomson said:

The whole answer to what? What is hidden?

May 14, 2008 3:11 PM
 

jamie.thomson said:

The solution to what?

May 14, 2008 3:16 PM
 

RABINDRA JHA said:

Hi,

Some Light green cover your answer.But I find the way just copy it and paste it nodepad.We will see complete answer.

Solution of lookup ummatched row in loopkup output not in error output

May 15, 2008 6:04 AM
 

rabindra said:

Hi,

Can You me one thing suppose I got new concept then I jsut want to know how to write article about this in this blog site

May 15, 2008 6:28 AM
 

jamie.thomson said:

Hi Rabindra,

Yes, one of the diagrams bleeds off the page. Sorry about that. If you want to see the diagram in its entirety then right-click and save a local copy.

"Can You me one thing suppose I got new concept then I jsut want to know how to write article about this in this blog site"

Are you saying that you would like to submit articles for inclusion on my blog? I would be willing to accept submissions if they are of sufficient quality.

Regards

Jamie

May 15, 2008 10:01 AM
 

RABINDRA JHA said:

ok..Thanks a lot..

May 15, 2008 11:41 AM
 

Alok Singh said:

Hi ,

I am currently using Lookup in SQL 2000 where I have two tables ..

one table contains ID's and other contains ID and its corresponding description.

Both these tables are on Open GL Progress database and I am connecting to this using ODBC drivers to fetch data .

Now these two tables are in two different databases and both of these databases are running on different port numbers (which is possible in progress) ,I can not directly join these two tables in a single sql statement because of progress limitations and hence I have created a lookup in DTS where I search the corresponding description for a Id.

Now I am trying to build this same thing in SSIS and there I have found that I can create lookups only on OLE DB connection and not on ODBC connections.And hence I can not do a remote lookup.

Since this desription table is huge (around 5 million records) I can not afford to do a daily refresh of this dimension table.

I am badly stuck at this point .. Can someone help me out ...

Thanks in advance.

May 21, 2008 1:39 PM
 

Tod means Fox | ETL Subsystem 14: Surrogate Key Manager said:

June 24, 2008 4:29 PM
 

BI_Junkie said:

Hi Jamie,

Came across a situation where I had 2 tables on 2 different servers as follows

Table1 on Server1

-------------------

IDStartRange bigint not null

IDFinishRange  bigint not null

Description nvarchar(255)

Table2 on Server2

-----------------

ID bigint not null

Description nvarchar(255)

I am trying to update "Description" column on Table2 from Table1, based on the condition "Table2.ID >= Table1.IDStartRange and Table2.ID <= Table1.IDFinishRange".

As you can see i cannot use "Merge Join" as it uses equi-joins.

It will be too expensive to fill all the possible IDs in Table1 (or in Query temporarily) between IDStartRange and IDFinishRange as they span across millions.

Is there anyway we can make use of any of built-in "Data flow transformations"?

Cheers

Raj Chidipudi

August 14, 2008 3:15 PM
New Comments to this post are disabled

This Blog

Syndication

Powered by Community Server (Personal Edition), by Telligent Systems