A long time ago my colleague The Galloping Data Architect and I wrote a paper entitled "A Comparison of Enterprise Class Data Integration Tools. SQL Server Integration Services for the Informatica Customer". You can get it from here and read what I had to say about it on this blog here.
I guess that at least one person has read it because today I got asked the following question via email:
I'm working on a conversion project and I'm trying to compare performance of SSIS with Other ETL Tools, especially Informatica PowerCenter. Which one do you think is better ETL performer, when source and destination being SQL Server databases? Is there any benchmark available?
I'll presume that the questioner was referring to the throughput performance of an Informatica Mapping versus a SSIS Dataflow Task - for the purposes of this blog entry I'll refer to this metric as "pipeline throughput". The short answers to the two questions posed there are "I don't know" & "Not that I know of". That's not very helpful though so its only fair that I expand and explain a bit more.
As anyone that has worked on benchmarking in this industry will tell you there are many many factors that may influence benchmark tests and hence producing a single definitive answer is a difficult thing to do and in a lot of cases proving that your benchmark tests are trustworthy is more difficult than actually performing the tests. Also bear in mind that there are so many "knobs to tweak" these days I suspect that the "winner" in a performance comparison often comes down to the skill of the person performing the test rather than the abilities of the software. The scenario in which you are testing is a factor; some tools may perform better in some scenarios than others. Furthermore, it is rare that a published benchmark will be truly unbiased - usually you will find that a benchmark is affiliated with the "winner" in some way so I generally take those benchmarks with a proverbial grain of salt. The old adage that you can prove anything with statistics is one that tends to hold true.*
Anyway, I digress. Back to talking about ETL tools. I stated earlier that I didn't know whether Informatica or SSIS had quicker pipeline throughput and to that I would add "...and I don't really care either". The main contributors to pipeline throughput will be the hardware that you run it on and after that you may be able to tweak the software for extra performance gains. At the heart of most (any???) performance metrics on a computer system is the number of CPU cycles that you can eek out in a given time period and that holds true no matter what test you happen to be running and which software tools you happen to be testing. Hence, its eminently possible that I could take the same software, same operating system, same tests and run them on two different hardware configurations and derive completely conflicting conclusions.
I'm still no closer to helping the guy who asked the question above to decide which is quicker. And I'm not going to get any closer either - partly because I don't want to say something that someone could later prove to be false (remember - you can prove anything with statistics). I will say this though, I suspect that the difference in performance will be so minutely negligible that it does not even warrant consideration. There are bigger differentiating factors to be taken into account when evaluating ETL tools, primarily cost. For example, Ab Initio routinely claim that their ETL tool is quicker than any other out there and given that none of their competitors ever challenges them on that claim I suspect that its true for the scenarios in which Ab Initio want to be quickest. That's why companies like Vodafone use Ab Initio to process their call records and why Capital One use it to process their credit card transactions. Those companies need the raw speed that Ab Initio can provide; they can also afford to pay the big BIG bucks to get it. Other factors that should influence purchasing decisions include existing hardware/software infrastructure, existing skills base, ease of administration, feature set, ongoing support costs, developer productivity and many many more.
I suspect I haven't much helped the guy who asked the questions above but I hope I've explained WHY I haven't helped him. The simple truth is that I don't know whether SSIS or Informatica is faster and, more than that, I suspect I'll never know.
*Note that not all benchmarks are subjective, the Transaction Processing Performance Council benchmarks are the most objective that I know of in the database world and Kevin Kline has talked about those a lot if you are interested in knowing more about them from a SQL Server perspective.