Welcome to EMC Consulting Blogs Sign in | Join | Help

SSIS Junkie

SSIS Nugget: A simple demo of Union and Aggregation

While I was at the PASS Summit recently I was lucky enough to meet SQL Server guru and all round nice bloke Chris Hedgate. Chris had some polite things to say about my blog but did offer some constructive criticism - I should try and put more introductory type stuff on here cos most of what i write is too in depth for people just starting out with SSIS.

Fair point I thought, and I promised that I'd come up with some such content. Well...its only taken me two months!!

I've built a package that demos how SSIS handles one of the most common ETL operations - aggregation. Let's be a bit more specific though and explain exactly what I've defined.

In this mythical scenario I am processing log files from an online retailer. I have 3 log files in total:

  • Views - The number of times that a customer clicked a link to view details on a product
  • Picks - The number of times a customer put a particular product in their shopping cart
  • Purchases - The value of each customer's purchase of each product

Here's a screenshot of my data-flow:

20051116dataflow.JPG

The three "boxes" (properly called source components) represent my three source files.

The "Union All" component amalgamates those three streams of data into one - very similar to the UNION operator in ANSI SQL. Let me explain how the UNION ALL component works by showing you the editor (which is displayed by doubleclicking on the component):

20051116UnionAll.JPG

Some things to notice here:

  • My three inputs are listed along the top. You can also see a column headed "Output Column Name" which specifies the name of the columns output from the Union All component
  • Columns in each of the three inputs are mapped to these output columns. Notice that each input has a Customer column and a product column. these are the columns that we are going to GROUP BY when we aggregate later on in our data-flow.
  • The inputs do not contain all of the columns in the output. Where an input does not contain the column you can see <ignore> which will result in NULL values in our data-flow. We will deal with these later.

After the Unon All component comes our Sort component. This component does exactly what it says on the tin - it sorts the data. We have to tell it how to sort the data though and we do that, again, in the component editor:

20051116Sort.JPG

More things to notice:

  • We are sorting by the columns that we are going to GROUP BY when we aggregate
  • We can specify whether to sort ascending or descending

Next is the all important Aggregate component. This is the thing that takes our sorted data and groups it over some defined columns and carries out an aggregate function on each of the other columns. The Aggregate component doesn't actually require the inputted data to be sorted, so the prior Sort component wasn't actually required, I just included it so as to demo some extra functionality. Here's the Aggregate editor:

20051116Aggregate.JPG

Note that we specify whether to group by a column or to apply a aggregate function to it.

 

At this point we can view the data at runtime using the very handy Data Viewers that SSIS provides. Here is a Data Viewer that shows us what the data looks like in the pipeline when it comes out of the Aggregate component:

20051116DataViewer1.JPG

This is pretty good. We have aggreated our data and can easily see, per customer and product, what they are interested on our website and what they have spent their money on. Notice though that we still have some NULL values in there and in an ideal world we'd set them to be zero instead. We can do that using a Derived Column component:

20051116DerivedColumn.JPG

Here we have used SSIS's expression language (which I promise you is very easy to learn) to check for a NULL value in the pipeline. If it finds one it replaces it with a zero. Simple.

 

So there you have it - an easy aggregator. I dare say that if you're still reading then you may be interested in running this for yourself. Well I've provided the package and the three source files in a zip file here: http://blogs.conchango.com/Admin/ImageGallery/blogs.conchango.com/jamie.thomson/20051116AggregatorDemo.zip

All you have to do to run the demo is place the three source files into C:\temp, add the package to an existing Integration Services project, and run it. You'll see just how easy it is to quickly build working, production-quality packages.

Please let me have feed back. Did you find this easy? Did the demo work? Should I not have bothered? :)

 

-Jamie

 

 

Published 16 November 2005 22:33 by jamie.thomson

Comments

 

Kristian Wedberg said:

Since I'm sure there are at least *some* limits on how much you can blog Jamie:,) my vote would certainly be for you to focus on more complex SSIS issues:

* There's already introductory articles available

* The SSIS walk throughs and 'How to' guides in BOL covers the basics quite well

What to do after the basics and how to apply SSIS to real world problems, *that's* where BOL falls short, and your articles become very very valuable...

My 0.02SKr
November 17, 2005 16:18
 

Professional Association for SQL Server (PASS) SIG said:

November 18, 2005 16:06
 

Chris Hedgate said:

I for one very much enjoyed this short intro. As much as I would like to I do not spend a lot of time browsing BOL or articles on topics that are not in my main areas of focus. I do however read a lot of blogs on many different subjects and have learnt a lot of things that are off-topic to me from them.

Cheers Jamie!
November 21, 2005 15:43
 

Ronald said:

Hoi,

Very nice, dou't forget to make the numeric colums the right datatype, otherwise you don't see the 'sum' in the grid combobox.

Groet, Ronald
March 29, 2006 20:58
 

Tim McCurdy said:

Well, this Aggrgate thing WOULD be nice except that it limits what you can do.  For example, there are no Min / Max operators for String DataTypes (and as anyone can tell you, Min and Max work perfectly well for Strings).  I REALLY need this functionality, it just sucks how some SSIS Components get your hopes up and then cause you to resort to writing custom script components (which I find happens 80% of the time I need something moderate hard).

October 29, 2007 12:45
 

jamie.thomson said:

Tim,

I agree. That is a huge limitation of the Aggregate component. I've requested it before but I'd appreciate it if you could do the same by heading to http://connect.microsoft.com/sqlserver/feedback and explaining what you want and (crucially) why you want it.

Thanks

Jamie

October 29, 2007 13:11
 

Tim Costello said:

The aggregating max(string) thing is a huge problem for me at the moment and I'm not sure how to get around it.  Do you know of a solution?  Could you point me to any resource on the web that might help me figure out how to get around this?  I'm OK with coding a solution, but I can't figure out how to get it done.  Any help would be appreciated.

November 28, 2007 21:02
 

jamie.thomson said:

Tim,

The oly way to solve it is to use the script component or write your own. I don't know of a third party tool that does it.

-Jamie

November 28, 2007 21:12
 

Shay said:

hi guys,

does the union all component always does a group by?

February 12, 2008 06:44
 

jamie.thomson said:

Shay,

The Union All never does a Group By.

-Jamie

February 12, 2008 13:15
 

Chris Anzalone said:

I know this is an old thread, but it was exactly what I was looking for right now. I'm new to SSIS and this was a great example of a common task that helped me get my feet wet with the product.

Thanks Jamie!

btw - you mention data viewers with the Aggregate Component and show a screen shot. Where can I fiond out more about the data viewers and how to use them? Is this available in Debug mode?

November 11, 2008 01:34
 

jamie.thomson said:

Hi Chris,

Nowhere really that you can find out more about them - just learn about them by using them (they're very easy).

They AE available in debugmode (in fact that's the only place where they are available). To use one, right-click on a data path (i.e. the green or red line between 2 components in a data flow) and select Data Viewers - should be fairly self-explanatory from there.

-Jamie

November 11, 2008 08:45
New Comments to this post are disabled

This Blog

Syndication

Powered by Community Server (Personal Edition), by Telligent Systems