In my dim and distant past I was fortunate enough to have some exposure to a product called Informatica. Informatica is the market-leading enterprise Extract, Transform & Load (ETL) product and will therefore be a direct competitor to SQL Server Integration Services (SSIS) when SSIS is released later this year.
We are currently engaged in a piece of work in which we are comparing Informatica functionality with SSIS functionality. We are attempting to demonstrate that, functionally, SSIS operates in the same high end space as Informatica. The output from this engagement will be a whitepaper that will be freely available in the public domain before the first half of this calendar year is out.
About 7 months ago I carried out an ad-hoc comparison of the 2 products and my key conclusions were:
- The SSIS architecture is simpler than Informatica's, which makes installation and deployment easier
- Informatica's architecture allows a more modular and reusable approach to implementing data flow. I have previously lamented SSIS's shortcomings in this area here.
- The port-to-port method of designing data-flow that Informatica uses in a mapping is preferable to the transformation-to-transformation method that SSIS leverages (I have since changed my opinion about this: both methods have their plus and minus points.)
- Informatica's metadata architecture provided a better mechanism for logging than SSIS (I have since *completely* altered my opinion on this).
- Informatica's metadata architecture allows for real-time monitoring of a batch, even a scheduled one. SSIS's architecture does not.
All in all I felt SSIS had some way to go to catch up with Informatica. Well, what a difference 7 months makes :)
All of the conclusions I came to 7 months ago were based on a visual feature-by-feature comparison. Our current engagement on the whitepaper is a real-life scenario, so it has afforded me the opportunity to delve much deeper into the products and produce a more balanced view of functionality and (crucially) performance, which I wasn't able to do before.
Perhaps a little bit of context is necessary before I go any further. For the whitepaper we are comparing the products using the scenario of processing web logs that capture activity from our website (so get on there and create me some test data :). This involves extracting data from multiple files all at once, parsing the pertinent information out of what is essentially a long data string, aggregating the data to provide summary information & loading the results into a star schema model. It is a classic data mart scenario.
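To make the transform and aggregate steps a bit more concrete, here is a minimal sketch in plain Python (not SSIS or Informatica). It assumes the logs are in the W3C extended format, where directive lines start with `#` and a `#Fields:` line names the columns; the post itself does not name the format, so treat that as an assumption. The sample data and the function names `parse_log` and `hits_per_uri` are hypothetical and exist only for illustration.

```python
from collections import Counter

# Hypothetical sample in W3C extended log format (an assumption; the
# actual format of the web logs is not stated in the post).
SAMPLE_LOG = """\
#Software: Example Web Server
#Fields: date time c-ip cs-uri-stem cs(Referer)
2005-03-01 10:00:01 10.0.0.1 /index.html -
2005-03-01 10:00:02 10.0.0.2 /about.html /index.html
2005-03-01 10:00:05 10.0.0.1 /index.html /about.html
"""


def parse_log(lines):
    """Yield one dict per data row, named by the latest #Fields directive."""
    fields = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Remember the column names, then drop the comment line itself.
            if line.startswith("#Fields:"):
                fields = line[len("#Fields:"):].split()
            continue
        values = line.split()
        # Skip rows that do not match the declared column count.
        if len(values) == len(fields):
            yield dict(zip(fields, values))


def hits_per_uri(rows):
    """Aggregate step: count requests per URI."""
    return Counter(row["cs-uri-stem"] for row in rows)


rows = list(parse_log(SAMPLE_LOG.splitlines()))
summary = hits_per_uri(rows)
```

In a real data mart the `summary` counts would be loaded into a fact table keyed against the dimension tables of the star schema; that load step is left out here.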
Here are some of my findings thus far:
- Informatica does not have an equivalent of SSIS’s UNION component. This is a big problem for what I’m trying to do because a lot of the logs that I’m trying to load are in different folders (to represent the different web servers). Informatica requires 2 pipelines (see the third finding below) to extract this data whereas SSIS can just have 2 source adapters in the same data-flow and UNION the data together.
- SSIS’s method of loading multiple files (i.e. the "Multiple Flat Files Source Adapter") is a lot better than Informatica's. With Informatica you have to, externally to Informatica, build a list of files to process and then pass that list back into Informatica. To make this dynamic at runtime you would have to shell out to an external process to produce the file list. This is not pleasant – especially compared to SSIS’s very simple method of just specifying “*.log” in the source adapter.
- I have a pipeline (built in both SSIS & INFA) that filters out all comment lines from the web logs, extracts all the individual fields (e.g. Timestamp, ClientIP address, URI, Referrer etc.), and inserts them into a SQL Server table. The SSIS pipeline works on ~150,000 rows and completes in ~23 seconds. The Informatica method (which uses 2 pipelines because of the first finding above) takes ~45 seconds. Even 1 Informatica pipeline on its own (working on about half the records) took ~27 seconds. Bear in mind also that the SSIS pipeline was run from BIDS and, as I have previously mentioned, BIDS places a large overhead on the execution of a package. I would suppose that Informatica does not have the same restriction. In short, SSIS seems a lot quicker!
- SSIS’s method of dynamically setting the destination at runtime (i.e. configurations) is a lot better than the Informatica equivalent. With Informatica you have to configure each task to use the dynamic value, and setting up the value itself is a manual process because you have to manually handcraft what is termed a parameter list. SSIS does this using the configuration wizard. Let me say again, EVERY Informatica task that uses an external connection has to have a dynamic configuration set up using this method; with SSIS you do it in one place, on the Connection Manager.
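The wildcard-plus-union idea from the findings above can be sketched in a few lines of plain Python. This is only an illustration of the concept, not how either product works internally: each folder stands in for one web server, a `*.log` pattern picks up its files, and the per-folder streams are chained into a single flow with comment lines filtered out. The folder names (`web01`, `web02`), the file name and the helper names `log_files`, `read_rows` and `union_all` are all hypothetical.

```python
import glob
import itertools
import os
import tempfile


def log_files(folder, pattern="*.log"):
    """Return the files in one folder that match the wildcard, in name order."""
    return sorted(glob.glob(os.path.join(folder, pattern)))


def read_rows(paths):
    """Yield data lines from each file, skipping blanks and '#' comment lines."""
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.rstrip("\n")
                if line and not line.startswith("#"):
                    yield line


def union_all(folders, pattern="*.log"):
    """Chain one source per folder into a single stream of rows."""
    sources = (read_rows(log_files(folder, pattern)) for folder in folders)
    return itertools.chain.from_iterable(sources)


# Small demo: two hypothetical web-server folders, one log file in each.
with tempfile.TemporaryDirectory() as root:
    samples = (("web01", "#Fields: x\nrow-a\n"), ("web02", "row-b\nrow-c\n"))
    for server, body in samples:
        folder = os.path.join(root, server)
        os.makedirs(folder)
        with open(os.path.join(folder, "ex050301.log"), "w", encoding="utf-8") as handle:
            handle.write(body)
    folders = [os.path.join(root, server) for server, _ in samples]
    combined = list(union_all(folders))
```

Because the file list is built by the wildcard at run time, nothing outside the script has to prepare a list of files first, which is the convenience the second finding describes.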
All pretty interesting stuff I hope you'll agree! And very refreshing for those of us hoping to see SSIS gain a strong foothold in the enterprise ETL market.
At the moment I don't know what I don't know but I hope to learn more about the 2 products in the coming weeks. Watch this space for more information as and when I have it!
[If any of the information above is not accurate then please let me know. I stand here to be shot down]
I have been informed via a posted comment (see below) that Informatica does indeed have a UNION transformation as of v7.1. The latest version that I have had visibility of is v6.2.1.
Thank you very much to the anonymous commenter for this information!