Welcome to EMC Consulting Blogs Sign in | Join | Help

SSIS Junkie

SSIS: A distinct component please

One of the most widely used components in SSIS is the Sort component.


As many of you will know, the Sort component is a fully blocking component rather than a partially blocking component or a row transform component. [N.B. This paper from Elizabeth Vitt talks about fully blocking, partially blocking, and row transform components, as does this webcast from Donald Farmer)

The Sort component actually has two functions:

  1. It sorts data (that much is obvious)
  2. It can eliminate duplicates

The reason that the Sort component has this second function is that in order to determine duplicates it must do a Sort at some point - hence it makes sense to put these two functions into the same component.

I've been using the Sort component alot today to eliminate duplicates and as I was doing so, something occurred to me. I wanted my output to have duplicates eliminated but I didn't care whether the output was sorted or not. Therefore, why do I have to wait until ALL the rows have been consumed from the upstream buffer by the Sort component until it start to pass rows to the new downstream buffer? If I don't care about the output being sorted, why can't the component pass each new row into the output buffer without waiting for all rows from upstream? In other words, why can't my component that eliminates duplicates be a partially blocking component rather than a fully blocking component?

At that point I paid a visit to Connect and submitted this suggestion . Namely, I want a new component, a partially blocking component, that eliminates duplicates but doesn't bother to sort the output for me. If you think this would be useful then click-through and add a comment to the Connect submission (12 hours after originally posting this and some people already have). We're more likely to get it that way. And reply to this post as well (you'll need to sign-up) - let me know what you think.




Published Friday, December 08, 2006 10:27 PM by jamie.thomson



reckless said:

This is a good idea. I have actually implemented this already using a script component and an in-memory hash-table to keep track of the distinct values from the input. If a value does not exist in the hash-table it is immediately passed downstream. I don't know if this is the most performant of solutions, but it works - and does not hold op all the records before they are passed downstream. :-)

December 9, 2006 9:14 AM

Michael's BI corner said:

As a follow up to Jamie Thomsons post on "SSIS: A distinct component please" (found here) I thought I...

December 12, 2006 3:18 PM

Alberto Ferrari said:

Jamie, take a look at http://www.sqlbi.eu/Home/tabid/36/ctl/Details/mid/374/ItemID/9/Default.aspx, I have written the component mainly to solve memory problems but, doing it, it has been very easy to make it partially blocking.


February 8, 2007 10:00 AM

Marco Russo said:

I'm just writing a post on my blog about this topic. Another option is the Aggregate component. It is still a fully blocking component, but at least it doesn't force a sort, which is the very slow part of the whole process.

March 7, 2007 8:42 PM

SSIS Junkie said:

A few items related to SSIS have come to light over the past few days that are definately worth mentioning.

March 10, 2007 5:15 PM

SQLBI - Marco Russo said:

Almost one month ago I made a post about a brand new Distinct component for SSIS made by Alberto Ferrari

March 18, 2007 8:55 AM
New Comments to this post are disabled

This Blog


Powered by Community Server (Personal Edition), by Telligent Systems