Recently Mag (surname withheld :) from California contacted me with a few queries around a package that she was building. The package was a straightforward fact table load that required to use the dimensional natural keys in the fact data to get the corresponding surrogate keys from the dimension tables. All pretty simple stuff right? Well unfortunately Mag was loading 2.5million rows of fact data and it was taking far far too long. After a fairly lengthy email exchange it transpired that Mag was using MERGE JOIN components in the pipeline to get the surrogate keys and eventually we agreed it would be best to try using LOOKUP components instead.
Well Mag did just that and its not hyperbole to say that the performance improvement she got was stupendous. She has been kind enough to write a case study and allow me to publish it herein. I really recommend you take a read, the results are pretty astounding.
SQL Server Integration Services
Sort and Merge Components vs. Lookup Component
As I have worked with SQL Server Integration Services (SSIS) to load data into a dimensional database, I have seen the huge difference when using sort and merge components and lookup components to populate over a million records. At first, I adopted the sort and merge components in my approach to populate a fact table and later saw that using the sort and merge component was too memory intensive and found using the Lookup component to be a much better approach. Let me tell you about my findings.
Let’s start with detailing my hardware components and software. I used the all-great Microsoft Virtual Server. I created a virtual server with 45 GB static (non-dynamic) hard disk and gave my virtual server 1 GB of memory. The swap file virtual memory setting on the virtual server that I created was set to 1536-3072MB. As for the software: I loaded Windows 2003 with all the service packs. I installed the IIS windows application. I installed SQL 2005, Integration Services, Analysis Services and Reporting services, Report Builder on the same server.
Then I proceeded to build my dimension and fact table population packages. The first fact table population package I built was using the Sort and Merge components. After having huge memory problems with this package when transferring in 2.5 million records Jamie Thomson advised me to change my approach and build a different package using the Lookup component. There I used the Lookup component in lieu of the sort and merge components and boy, what a difference a component makes!!
Let’s look at in detail the configuration of the package that used the lookup component.
Next edit the Lookup component. Choose the table or view that you will need to lookup the desired field from. In this case we will want to compare the patient_id coming in from the stg_fact_transaction table with the patient_id in the Dim_Patient table.
Next be sure to join the staging table field with the dimension table field. In this case its patient_id.
Then in the Dimension table choose the surrogate key field by clicking the box next to the field.
This is so that the field you want copied to your fact table will be the surrogate key in the dimension table.
In the Output Alias box, alias the surrogate key field something that is similar to what you named the foreign key in your fact table that pertains to that dimension.
For your next dimension you repeat the same steps as above. Those steps summarized are:
- Add the lookup component again.
- Edit the lookup component.
- Choose the table or view that you will need to lookup the desired field from.
- Join the staging table field with the dimension table field, alias accordingly.
Keep doing that until you are done with all the dimensions you will be looking up. Then lastly add the destination component.
Add the destination
Map the input column to the destination column.
That is it!! This will successfully populate the Foreign Keys in your Fact table and relate the dimensional descriptions to your additive facts!!!
Using Lookup incredibly reduced populating 2.5 million records in my fact table from 2 days to 45 minutes!!!!
Great work Mag. That's a fantastic performance improvement. Music to the ears of the dev team in Redmond no doubt!!! And thanks again for producing this!