There has been, and still is, quite alot of conjecture in Conchango as to whether one should use SSIS data-flows or stored procedures for ETL. Traditionally we have implemented ETL solutions using stored procedures. DTS was used but as little more than a workflow engine, occasionally using the data-pump for importing data from files. I dare say that many people took this approach with DTS.
Now that SSIS has come along you have much more powerful data movement engine than the DTS data pump and also alot more inherent functionality. But is that compelling enough to stop using T-SQL stored procedures and only use SSIS as a workflow engine? Well here's a comparison of the two approaches. These are only my opinions and I dare say I've forgotten some important things but I think I've raised some salient points:
Advantages of using Stored Procedures
- A SQL statement will outperform a SSIS data-flow when the data transform is table-to-table on the same server
- No new learning curve as ETL developers will already know how to write SQL code. SSIS data-flows are a completely new technology.
- Utilise transactions within the database rather than use MSDTC
- Easier to generate stored procedures from known metadata than it is with a data-flow (at the time of writing it is anyway)
Advantages of using a SSIS data-flow
- Handle data from heterogenous sources in the same place
- Consume data from sources that can't be accessed using a SQL statement
- Data can be transformed without the need for an intermediate staging area
- If data does need to be persisted temporarily it can be to a raw file whereas T-SQL requires temporary tables for which there may be a management/security overhead
- Extra transformation functionality (e.g. Fuzzy logic, data mining, text mining, insert to Analysis Services)
- Visual representation of the "work"
- Bad data can be captured to a different data sink for examination later
- Exception Handling
- Use .Net Data providers as an alternative to OLE DB Providers (e.g. mySAP Business Suite .Net Data Provider)
- Data-flows are, to some extent, self-documenting
- User-controlled parallel execution of data-flows is possible where it isn't really in the inherent batch operated world of stored procedures
- "Heavy-lifting" of data can occur on a different to machine to that storing the data (thanks to John in the comments section)
- By default, events containing very pertinent information such as "component "<component name>" (5824)" wrote 2398156 rows" are raised
These are essentially feature lists and should all be considered. Remember though that in situations with complex transformations requirements you are more likely to "hit the wall" with T-SQL because there are things that it simply cannot do. For simpler implementations the lines become more blurred and you have a choice to make.
What is clear is that there is no "one size fits all". If you have to make a decision about which approach to take then consider these points and decide which approach best suits your requirements.
And let me know what approach you prefer. I'm interested to know.
UPDATE, 9th April 2006: I've just come across a very useful webcast that explains how to integrate hand-coded ETL scripts into SSIS. It is here: http://msevents.microsoft.com/cui/WebCastEventDetails.aspx?EventID=1032276873&EventCategory=5&culture=en-US&CountryCode=US. What makes this relevant to this blog post is that the use of T-SQL and SSIS components is compared in place so will help you to make a rational judgement on which method you should go for.