Just like Andrea Merkel, Tottenham Hotspur and Britney Spears' career, the Lookup task in SSIS 2005 is in serious need of a makeover. Whilst it is one of the more useful components in the SSIS toolbox it has a number of shortcomings including:
- Inability to dynamically build the SQL statement that populates the cache when in 'Full Cache' mode
- Inability to populate the cache from anywhere other than a table (actually there is a little-known method in SSIS2005 that can be used to populate the cache from a text file but I'll save that for a later date)
- Having to use the error output for rows that do not result in a cache hit
- If the dataflow containing the lookup component is used in a ForEach loop the lookup cache has to be populated on every iteration. This is often not the desired behaviour.
- The same lookup cache cannot be used in seperate dataflows.
I'm pleased to say that in SSIS 2008 these shortcomings have all been overcome and in this blog entry I'm going to show you the various improvements that have taken place.
The first thing you'll notice is that the Lookup component now has a new UI. Its not particularly flashy but its new nonetheless. I've highlighted a couple of specific details that I'll cover below.
The first big improvement I want to talk about is the ability to use a cache that is populated prior to the containing dataflow even executing. This is made possible by the provision of a new connection manager type in SSIS 2008 called the Cache Connection Manager. Let's take a look at it.
There's not much to talk about in this UI so let's about what the Cache Connection Manager actually does. Purely and simply it caches data using either an in-memory cache or a file. The default behaviour is to use the in-memory cache.
The cache is populated by a new component called the Cache Transform which very simply populates a cache designated by the Cache Connection Manager. What I particularly like about the Cache Transform component is that it can be used as a destination component or as a component mid-flow. It doesn't transform the data in any way, and therefore that same data can be used downstream in the same dataflow.
So, that's what it does but what does it all mean? Well chiefly it means that the cache can now be populated from text files, raw files, XML files, Excel files, proprietary-format binary files, web services, a recordset, or anything that can be accessed using an ADO.Net provider; in SSIS 2005 we were limited to populating it from an OLE DB source. The reason that this is all possible is because this new cache type is not populated from a SQL statement like the Lookup cache in SSIS2005, it is populated from a dataflow. Not only that but the cache can be used in multiple lookups thus solving the problem where each iteration in a loop needs to repopulate the cache. And more still, because we can dynamically build SQL statements within an OLE DB Command we can now change at execution-time the data that is stored in the cache. Those may seem like small enhancements but SSIS developers have been crying out for these features since SSIS 2005 was first thrust upon the world so kudos to the SSIS team for listening to the feedback and coming up with a better story around Lookups.
Once I'd found all this out my curiosity started to get the better of me and I got to poking around in these new features a little more. I was pleased to discover that if you use a file cache then the resultant file can be read by a raw source adapter so obviously the SSIS team have leveraged their own proprietary format for storing the file. Offhand I'm not sure if this has any practical use so if you do end up making use of this 'side-feature' then let me know. Be warned however, reusing the raw file format doesn't mean that you can use the raw file destination adapter to populate the file because the cache connection manager stores its own proprietary metadata in some extra columns that I'm guessing it must require for its own purposes.
When I first heard that this new cache feature was coming I was hoping that the in-memory cache might be persisted after the package had finished executing thus allowing us to use it in another package. Alas this isn't the case but all is not lost because the file cache WILL exist after the package has completed execution therefore you can use that in another package. Good news.
I then asked myself the question "What happens if we try and put more data into a cache that already has some data in it? Does it overwrite the data or append it?" Well the first thing I did when investigating this was to put another Cache Transform component on the "no match" output (more on this later) from the lookup component to see if the data that was not in the cache originally could be appended to it. When I executed I got the following error:
- Error: 0xC0010206 at Data Flow Task 1, Cache Transform : The cache is being read from component "Lookup" (67) and cannot be written to by component "Cache Transform"
You can't write to a cache while another component in the same dataflow is trying to read from it - that makes sense. So, instead what I did was push that data into a raw file and then tried to append to the cache from within another dataflow. This time I got a different error:
- Error: 0xC0010200 at Data Flow Task 2, Cache Transform : The component "Cache Transform 2" (18) cannot write to the cache because component "Cache Transform" (622) has already written to it.
So, I can't append to or even repopulate a cache and that is hugely disappointing. The ability to append to the cache would mean that the 'Dynamic Lookup Cache' feature that Informatica has could be replicated within SSIS. As it is that is not the case and I shall be campaigning hard to get that behaviour changed before SSIS 2008 gets released.
One more final note on the cache connection manager feature. Books Online states "When you persist the cache to a file, the system loads the cache faster". That's a bit ambiguous. Does that mean that a cache file is faster than the in-memory dataset or faster than using the existing OLE DB Connection Manager cache? Furthermore, it may be faster to populate but does it perform faster when the Lookup component is reading from it? I'll endeavour to find out and if/when I do, I'll let you know.
No Match Output
In the past three years I have noticed a number of people on the SSIS forum complaining about the fact that the error output had to be used for the rows that resulted in a cache miss. This has changed in SSIS2008; there is a new output on the Lookup component called the 'Lookup no match output' and the package developer can choose to direct rows that result in a cache miss to that output.
This doesn't add any new functionality (other than the ability to catch "real" errors whilst at the same time redirecting rows) however it is much more intuitive than the method of accomplishing the same in SSIS 2005.
Dynamic Lookups on OLE DB cache
The OLE DB cache is how I now refer to the existing cache feature in SSIS 2005's Lookup component and which still exists in SSIS 2008. In fact, unless you have just cause to use the new Lookup features described above there is no real need to move away from the traditional cache type (which indeed is still the default choice), especially when you learn what I am about to tell you.
As I said at the top of this blog entry one of the big complaints about the Lookup component in SSIS 2005 is that the SQL statement that populates the cache cannot be built dynamically at execution-time. Well, I'm glad to say that the SSIS team have made the eminently sensible decision to allow us to set that SQL statement (which is stored within the custom property [SQLCommand]) using an expression.
I'm shocked that this didn't happen in an earlier service pack actually given that this is simply a case of flipping the IDTSCustomProperty90.ExpressionType property but never mind, its there now and that is great news for SSIS developers!
Now for some bad news...
As you can see the Lookup component has been greatly enhanced but there is one new feature that I hoped would be included but isn't. The Lookup component still can't do case-insensitive lookups. I'm rather disappointed about that. There is an eminently simple workaround to this problem but I'd still rather not have to do it.
All in all though I don't think we have much cause to complain. There have been some major enhancements to the Lookup component that have addressed some big gaps in the SSIS 2005 functionality. In my opinion the new Lookup component is probably the headline change to SSIS 2008.
I've put together a video in order to demo some of these new features. I've embedded the video below although if you're reading this in an RSS reader its likely that you won't be able to see it so either go and see it here or download the video and watch it at your own leisure in better fidelity.
I didn't have a microphone available to me when I recorded the video so rather than wait until I got one I thought I'd record it as a homage to the silent movie era and instead use the noble art of "pointing at stuff" to try and get my point across.
Some things to notice:
- I am populating the cache using the statement select object_id from [master].[sys].[tables] . The lookup component's input comes from select object_id from [master].[sys].[objects] so the data in the cache is a subset of the data in the input. Hence, some rows in the input will result in a cache hit (6 of them in fact) and the rest (65) result in a cache miss.
- Those rows that result in a cache miss are diverted to the 'no match output' which for clarity I have called "Not tables".
- The lookup cache is defined by a connection manager called 'Cache Connection Manager' It is populated by the Cache Transform which exists in a separate dataflow to the Lookup component.
This screen shot was taken from an interim build between CTP4 and CTP5
UPDATE, 7th May 2008: John Welch has written a useful blog post explaining how to add data into a cache file. Read it here.