Here is the latest collection of submissions to http://connect.microsoft.com/sqlserver/feedback that pertain to SQL Server Integration Services (SSIS) and which I think are worthy of your attention.
Take a read of the items below and if you agree with the aim of the submission, please click through and vote for it – it only takes a few seconds. If you want to leave a comment supporting your vote, that would be even better.
Don’t feel obliged to vote on everything carte blanche, only vote for those that you would like the SSIS product team to concentrate on. Most of these are already closed but that doesn’t mean they can’t be opened again. And remember, voting really does make a difference; if you don’t believe me take a read of Doug Laudenschlager’s blog post Your vote on Microsoft Connect influenced SQL Server 2008 Service Pack 1.
Ability to create an empty raw file with required metadata
In the raw file destination adapter, have the option to create the an empty raw file with the appropriate metadata when (for example) you close the raw file destination adapter editor.
Or perhaps just have a button in raw file destination adapter editor called "Go create an empty raw file"
#225974 : Create an empty raw file with required metadata
A better performing distinct component
SORT component is an asynchronous, fully blocking component. This makes sense of course but it seemed to me that if 'Remove rows with duplicate sort values' check box is checked then it could be made into a partially blocking component instead.
Let me explain. We know that the component is only going to output one row for duplicates. That doesn't mean though that the output necessarily is required to be sorted does it? Hence, why not just give us a new component that puts every new combination of values into the pipeline as soon as it is encountered. Sure, it still has to do a SORT internally in order to determine whether a combination of values is new or not, but if the output doesn't need to be sorted, why bother waiting until we have all the values before we start processing them downstream?
#244313 : SSIS: Improve performance of DISTINCT sort component
Keep Raw files sorted
I have an input into a raw file destination that has IsSorted=TRUE. Any output from a raw file source that uses that raw file will have IsSorted=FALSE.
Proposed solution: Store IsSorted property of the input and SortKey property of the columns in a raw file
#242351 : Should IsSorted metadata be stored in raw files?
Improve DtExecUI by adding a package browser
There is no ability be able to browse the package and select a value to be set via the /SET option of dtexec. Add the ability to browse the package. Simply use the browser that is used in the configuration wizard.
#124577 : Put package browser on Set Values tab of dtexecui.exe
To read more in this series of SSIS Connect digests go to SSIS Connect Digest
I recently inherited some SQL that someone else had written and had the job of “tidying it up” before it gets pushed out to production. Here’s a slightly simplified (yes, simplified) version of that SQL:
2: select asset_class
3: , case when volume > 0 then 'profit'
4: else 'loss'
5: end as profit_or_loss
6: , customer
7: , sum(volume) as total_volume
8: from t1
9: where asset_class = 'options'
10: group by
12: , case when volume > 0 then 'profit'
13: else 'loss'
15: , customer
16: union all
18: select asset_class
19: , case when volume > 0
20: then 'profit'
21: else 'loss'
22: end as profit_or_loss
23: , customer
24: , sum(volume) as total_volume
25: from t2
26: where asset_class = 'swaps'
27: group by
29: , case when volume > 0 then 'profit'
30: else 'loss'
32: , customer
The SQL basically takes data from two tables (t1 & t2), aggregates each, carries out some inline expressions (for profit_or_loss) and finally unions it all together. On the surface it looks fine but there are a few problems here, namely that there is a lot of repeated code; it violates the principle of don’t repeat yourself (DRY) which preaches “single point of maintenance” and “deduplication” of code. If you take the time to check it out you’ll see that identical aggregations are carried out on the two datasets (lines 7 & 24) as are the same conversions for profit_or_loss (lines 3-5 & 19-22). Not only that but we have expressions appearing in both the SELECT clause and GROUP BY of both halves of the query (lines 12-14 & 29-31), another violation of DRY.
A bit of refactoring is called for. First job, eliminate those expressions for profit_or_loss which appear in the GROUP BY clauses:
2: select asset_class
3: , profit_or_loss
4: , customer
5: , sum(volume) as total_volume
6: from (
7: select asset_class
8: , customer
9: , case when volume > 0 then 'profit'
10: else 'loss'
11: end as profit_or_loss
12: , volume
13: from t1
14: where asset_class = 'options'
16: group by
18: , profit_or_loss
19: , customer
20: union all
22: select asset_class
23: , profit_or_loss
24: , customer
25: , sum(volume) as total_volume
26: from (
27: select asset_class
28: , customer
29: , case when volume > 0 then 'profit'
30: else 'loss'
31: end as profit_or_loss
32: , volume
33: from t2
34: where asset_class = 'swaps'
36: group by
38: , profit_or_loss
39: , customer
Ok cool, we’ve moved the expression for profit_or_loss into a subquery (aka derived table) and hence expressions have gone from our GROUP BY clauses … but we’ve still got DRY violations. The aggregation (lines 5 & 25) and expression for profit_or_loss (lines 9-11 & 29-31) still appear in two places. More refactoring….
1: select asset_class
2: , customer
3: , sum(volume) as total_volume
4: from (
5: select asset_class
6: , case when volume > 0 then 'profit'
7: else 'loss'
8: end as profit_or_loss
9: , customer
10: , volume
11: from (
13: select asset_class
14: , customer
15: , volume
16: from t1
17: where asset_class = 'options'
18: union all
20: select asset_class
21: , customer
22: , volume
23: from t2
24: where asset_class = 'swaps'
27: group by
29: , customer
There, much better; the aggregation only occurs once as does the expression for profit_or_loss (which in the code that I inherited occurred four different times). Our SQL is DRY and its going to be a lot easier to maintain for whomever picks the code up from me.
You’ll notice we’ve got 2 levels of nested subqueries (aka derived tables). I make no apologies for that - derived tables are a great mechanism for eliminating repeated code and if you take but one bit of advice away from this blog post it would be this: derived tables are your friend.
Would you have refactored this the same way? Maybe you might have moved the derived tables into a dedicated view. Perhaps you might even have put the derived tables into a WITH clause like so:
1: with t as (
3: select asset_class
4: , customer
5: , volume
6: from t1
7: where asset_class = 'options'
8: union all
10: select asset_class
11: , customer
12: , volume
13: from t2
14: where asset_class = 'swaps'
16: q as (
17: select asset_class
18: , case when volume > 0 then 'profit'
19: else 'loss'
20: end as profit_or_loss
21: , customer
22: , volume
23: from t
25: select asset_class
26: , customer
27: , sum(volume) as total_volume
28: from q
29: group by
31: , customer
Let me know in the comments!
Views keep your SQL queries DRY - Seth Schroeder
Just lately I’ve been doing a lot of messing around with Windows Azure the main focus of which has been exploring Azure storage (tables, queues, blobs) and how those things can be manipulated from Azure worker roles. One output from that “messing around” has been my Tweetpoll application (read more: Tweetpoll my first Windows Azure application is live).
Since then I’ve found some blog resources that would have been useful during development had I found them earlier so I’m linking to them from here mainly so that I can easily find them again but also so that other people might be made aware of them because there’s some really good content here.
Firstly, Dom Green has developed a library that provides rich application tracing capabilities on top of Azure’s bog standard logging mechanisms. Read more at AzureTrace … managing all your Azure Logging Needs. I’ve put together something slightly similar myself recently but its not as feature rich as Dominic’s and, as anyone at
Conchango EMC Consulting will tell you, I’m no .Net expert by any stretch of the imagination so I’ll be working to put Dom’s library into my own stuff forthwith.
Secondly, Dom linked to a great series of articles from Neil Kidd which cover how we could (and should) build reliable Azure worker roles. Those links:
Thank you Dom. Thank you Neil. There’s some really useful material here that I’ll be devouring constantly over the coming weeks and months.
Of late one may have noticed that I have become very interested in high-brow, generally vaporous, disciplines such as RESTful data services, data interoperability and cloud computing whilst coincident with that has been the inexorable rise of the term “mashup” in the information technology lexicon.
“Mashup” means different things to different people but to me its simply the practice of combining data from multiple places with the aim of discovering or passing on knowledge that wasn’t known before. Well hey, that sounds a lot like what I do in my day job; the main difference being that I don’t generally hear the term “mashup” being bandied about the London meeting rooms that I frequent to the same extent that it does in the funky web 2.0 and swanky startup world; the term I hear (and use) is the considerably less cool “data integration”. Fundamentally though I don’t think there’s that much difference between the two so maybe enterprise data integration people like myself have something to learn from these so-called mashup players.
One of my favourite mashup tools out there is one I’ve spoken about before – Yahoo Pipes. If you haven’t had a look at this it really is worth taking a glance. Yahoo Pipes enables you to extract data from multiple web-based data sources, transform it using a series of operations like sorting, joining, unioning and filtering before finally outputting that transformed data in one of a number of different formats; its a data pipeline for web-based data (A pipeline? Oh, there’s something else I’ve talked about before – noticing a pattern here?). Here is an example of a Yahoo Pipe: Yahoo Finance Stock Quote Watch List Feed w/Chart
Extracting. Transforming. Sorting. Unioning. Filtering. Outputting. This Yahoo Pipes thing is starting to sound awfully like ETL tools such as SQL Server Integration Services (SSIS) wouldn’t you say? They even look a little bit like each other with their boxes joined up with lines between them:
I’m now reminded of what my good friend Andy Britcliffe of Sharpcloud once said to me upon reading my blog post (and viewing the embedded video) Consuming web services in SSIS 2008 a full two years ago. I distinctly remember Andy’s words on that occasion: “SSIS is the ultimate mashup tool”! I didn’t disagree!
Most mashup tools share one common characteristic in that they invariably require someone with some technical nous to set them up in advance so that they can be used by the less tech-savvy amongst us and the same applies in enterprises as well; data is distributed by the IT guys to the information workers and this distribution of data typically takes months whereas the consumers of that data want it available in hours. In both arenas I sense a shift occurring; now the consumers of the data are being empowered to find and interrogate data for themselves and in the enterprise this is happening through the adoption of tools such as Qlikview, Omniscope and (in the near future) Microsoft’s Gemini. I find this to be a fascinating development not because it means there may be less work for me to do (admittedly that would be nice) but because information workers now have the opportunity to be much more productive in their daily jobs and I expect those who invest in learning these new technologies to be the cream that rises to the top of enterprises in the near future.
Up until recently I hadn’t been all that interested in Microsoft’s Gemini project, indeed I was very sceptical of it, but as I started to formulate some of the thoughts that I’m writing about here I began to realise how important it will be when it gets released sometime (hopefully) in early 2010. I earlier described mashups as being “the practice of combining data from multiple places with the aim of discovering or passing on knowledge that wasn’t known before” and that description fits very well with Gemini. If you don’t know what Gemini is take a look at this video:
That demo glosses over the main point I’m making which is that here we see data that is originally pulled from multiple sources and combined in a familiar place (Excel) where the end user can consume it. The person speaking in the video is Donald Farmer and he has a blog entry with many other links to Gemini resources at Microsoft Project Gemini links.
At the top of this email I also talked about how I’m interested in data services, that is data available over the web that we can consume via an API and use for our own knowledge discovery and I was introduced to such a data service just yesterday when listening to Jon Udell’s “Interviews with Innovators” podcast. In the most recent episode Jon interviewed Stephen Willmott whose company 3scale Networks has taken it upon themselves to make data held by the United Nations freely available via a data service to anyone that would like to consume it [UPDATE: Read Jon's own writeup of the interview at Influencing the production of public data]. For example, if you want to know the United Kingdom population’s annual growth rate since 1991, that data is available, for free, at http://undata-api.appspot.com/data/query/Population%20annual%20growth%20rate%20(percent)/United%20Kingdom?user_key=XXXX (you need to sign-up for a free user-key and substitute it for XXXX in order for this query to work) and is returned like so:
“Wouldn’t it be cool” I thought, “if I could consume that data inside of Excel using Gemini”, perhaps in this example to combine it with birth rates over the same period to discover if there is a correlation between the two. At the time though I didn’t know if Gemini made it possible to consume data directly from data sources so I went straight to ask the man who would know, the aforementioned Donald Farmer. I contacted Donald over Twitter and here is the conversation that ensued:
- Me: @donalddotfarmer Is there a list of data sources types from which #Gemini can get data? Interested in data from web APIs e.g. undata-api.org (link)
- Donald: @jamiet I'll need to check out that site in particular, but we do support Atom feeds. (link)
- Me: @donalddotfarmer Ahh that's good news. How about POX/RSS? Does Gemini allow us to parse it or use XQuery? (link)
- Donald: @jamiet No we don't support XQuery - we just consume Atom feeds as they come - the users can then filter and sort in Gemini (link)
- Me: @donalddotfarmer OK, so Atom only right now. Looking forward to getting hands dirty, think I know what 1st feature request will be :) (link)
Lots of techy abbreviations in there so let me summarise. Gemini will be able to consume data from web services that deliver it in the popular Atom XML dialect (more on Wikipedia) which is great news and no great surprise given that Microsoft announced in February 2008 that Atom would be their XML syndication format of choice going forward (see my blog post Windows Live Dev announcements for a more complete commentary). I happen to know that the United Nations data provided by Stephen Willmott is not currently delivered in Atom format but no matter, at least things are moving in the right direction and as I alluded during my last tweet to Donald I’ll be asking for support for other syndication formats in the future.
This has turned into a rather rambling blog post so I’ll call a halt here. As always though I’d be interested to know other people’s thoughts on data services, usage of that data in enterprises or anything else I’ve mentioned herein so if you have any thoughts please leave comments in the space below!
I like to think that most of the stuff I blog about on here is in some way useful but sometimes I just want to put something out there that i think is simply damn cool, this is one of the latter cases.
Today I stumbled upon Joe Pruitt’s PoshBing – a Powershell library for Microsoft’s new Bing search engine. Its a wrapper around the Bing API and enables you to call the Bing search engine and consume the results inside the Powershell pipeline. A picture speaks a thousand words so here are some example of what you can do with it:
Basic web search
Not sure how useful it is but its all very cool stuff I’m sure you’ll agree (if you’re a fan of Powershell)! Download the Powershell script file containing this good stuff from http://poshbing.codeplex.com/ and happy querying!
As an aside, Joe has also built a Twitter wrapper around his Powershell Bing wrapper which means you can now use Twitter to use Powershell to query Bing; simply head for @askbing and ask it whatever you want. Probably not all that useful but nonetheless very very cool.
Please check out more of my Powershell related blog posts:
I’ve just read a mildly interesting article The Twitpocalypse is Near: Will Your Twitter Client Survive? which talks about how any Twitter client using a signed 32bit integer to store twitter status_ids is going to break in the next few days when the number of tweets surpasses the maximum possible value of a signed 32bit integer which is 2,147,483,648.
That got me thinking about how much space Twitter must be using up to store all of those status updates. If we disregard storage of who posted a tweet and where it was posted from then we can assume (that’s the first of many assumptions in this blog post) that Twitter’s table of tweets looks something like this:
CREATE TABLE statuses (
According to the MySQL documentation BIGINTs take up 8 bytes and TIMESTAMPs take up 4 bytes. The space used by a VARCHAR depends both on the length of the value being stored and the character set being used so if we assume the character set is UTF-8 and we also assume that that means each character takes up 2 bytes (not a safe assumption apparently) then the maximum space used by a particular status will be 281 bytes (140*2 + 1 byte to represent the length). Hence, the maximum possible length of a record in this table is 293 bytes.
Of course, very few tweets are 140 characters in length so we need to know what the average tweet length is. I’ve examined the underlying data for my Tweetpoll application* and rudimentarily calculated that the average tweet length is around about 92 characters. Thus, we can expect that the average length of a tweet in the table can be calculated as:
|status_id ||+ ||status ||+ ||post_dt |
|8 ||+ ||(92 * 2) + 1 ||+ ||4 |
That’s 197 bytes in total per row. So, when twitter hits that maximum possible value of a 32bit signed integer sometime in the next 7 days I estimate that this table is going to be occupying 197 * 2147483648 = 423054278656 bytes. Or, in numbers we can understand, 394GB.
By the way, don’t fret that Twitter itself is going to break any time soon; they use 64bit unsigned integers to store status_ids so the Twitter service itself is going to be OK for a while. Storing status_ids in 64bit unsigned integers means the theoretical maximum number of tweets is 18,446,744,073,709,551,615 or, as Programmable web point out, 2.7 billion tweets for every person on the planet. When that limit is (theoretically) reached in the year [fill in arbitrarily chosen year here] Twitter are going to need something in the region of 3.2million petabytes of disk space to store them all (i.e. 3,634,008,582,520,781,668,155 bytes). To put that into perspective, that’s about the same as 68 billion Blu Ray disks completely filled up with tweets.
Now, back to some real work….
Disclaimer: I’m sure I don’t need to point out that this blog post is purely based on assumptions and my, usually rather hopeless, mathematical abilities so the numbers are completely bogus and anyone else attempting this calculation would probably come up with totally different ones!
*At the time of writing Tweetpoll is viewable because I don’t have to pay for it but if it is not available by the time you come to read this then its because Windows Azure has reached general availability in which case you can read more about Tweetpoll at Tweetpoll – My first Windows Azure application is live
Google have, today, released what looks like a very interesting product in Google Squared which is an extension to the search engine that tabulates your search results. Interesting because it presents results in a structured format that I assume one may, one day, be able to consume in any manner of ways (more on that in a minute).
Here’s an example, the results of a Google Squared search for “prime ministers”:
or how about “SSIS tasks”:
Google Squared has pulled back relevant information and put it into a table format which does seem very useful but in many ways it highlights a problem with search engines in this day and age; the information simply isn’t complete and therefore can’t be trusted. We don’t have a definitive list of all prime minsters here and neither do we have a definitive list of all tasks provided with SQL Server Integration Services (SSIS) either. Based on this would you trust the results coming out of Google Squared without additional research? I certainly wouldn’t.
On the subject of consuming this data, wouldn’t it be great if there were an easy mechanism of consuming structured data like this? Something none-proprietary, open and based on a standard? Well, if you had read and taken any notice of my recent blog post Whatever happened to Live Clipboard? you’d be aware that a mechanism that purports to solving this problem (i.e. Live Clipboard) has already been mooted but unfortunately has failed to gain any traction in the industry. That’s a real shame because Live Clipboard could provide a way to (for example) copy data from a Google Squared result into an Excel spreadsheet and vice versa. As I said in that previous blog post:
Here’s hoping [the lack of adoption of Live Clipboard] changes soon because it sounds like a very useful technology and to a fella like me whose primary interest is data integration anything that uses well-known standards as a method of doing that is worthy of attention.
All that being said Google Squared is a very interesting advancement in the increasingly intermingling worlds of data integration and search, even if it does think that George Clinton is an ex-US Vice President :)
Any thoughts? Let me know in the comments!
[If you’re using SSIS to pull data out of Oracle then this may be of interest to you.]
In SQL Server 2008 Microsoft introduced a feature called Change Data Capture (CDC) which basically enables you to query a SQL Server table from within SSIS and ask for all rows that have been created or changed since a given point-in-time. Attunity have now released a product called Oracle-CDC for SSIS that purportedly allows you to do the same with a table stored in Oracle too.
- Non-intrusive, log-based Oracle CDC for Oracle 9i, 10g, and 11g
- Full integration into SSIS and the Business Intelligence Developer Studio
- Generation of SSIS packages for processing Oracle data, reducing development efforts and ensuring best practice implementation
- Automatic synchronization of full-load and CDC processing
- Monitoring and control of CDC processing
If you’re interested in knowing more then Attunity are running a webinar on 3rd June, details are at https://www1.gotomeeting.com/register/543263673.
Attunity are the company that provide the officially sanctioned Microsoft connectors for Oracle and Teradata so they have a pretty good pedigree; Oracle-CDC might just be worth paying attention to.
I’ve noticed some confusion in the community over the mythical “4000 character limit” in SQL Server Integration Services (SSIS) and thought it may be worth clarifying the situation.
The most common misconception I hear is “A SSIS string variable has a maximum length of 4000 characters”. This isn’t true and its easy to prove; simply create a string variable and paste in a value that is greater than 4000 characters (which is easily constructed in Notepad using copy/paste). You can then use a script task to display the whole value, thus proving that variable values an be greater than 4000 characters:
So then where does this notion of a 4000 character limit come from? Well, 4000 is the maximum length of the result of an expression and that’s easily proved as well by using REPLICATE(“a”, 5000) in an expression. If you use that expression to return the value of a variable then you will get, at design-time, the following message:
If you examine the message in the screenshot above you’ll notice that it mentions the maximum length of a DT_WSTR value is 4000. This is true, but DT_WSTR is not the type of a SSIS string variable and hence as I’ve just proven we can place a string greater than 4000 characters into such a variable. The SSIS variable type system is basically a subset of the .Net type system whereas expressions and dataflow columns use a type system that is proprietary to SSIS (i.e. DT_WSTR, DT_STR, DT_I4 etc…). It is of course utterly ridiculous that SSIS has two different type systems; I’m completely miffed as to why that is the case.
Hopefully that clears up the confusion! This information is true of both SSIS 2005 and SSIS 2008.
Anyone out there remember Live Clipboard? It was a very interesting incubation technology that came out of Microsoft’s Live Labs group way back in 2006 (I think) and how now been open sourced under the Creative Commons Attribution-ShareAlike License. It was also backed from way up on high – Ray Ozzie (now Chief Software Architect at Microsoft and also the guy who took over from Bill Gates) was its chief backer. Live Clipboard was in fact one of the few technologies that Ray blogged about on his, now defunct, MSN Spaces blog but sadly that blog post is no longer available.
So, what was Live Clipboard? From the site itself:
In other words, it is copy and paste for the web. You may think we already have that; after all, I can copy and paste some text from a text box on one page into a text box on another web page; but that isn’t really what this is about. Live Clipboard used XML markup to describe the data that was being copied thus if that XML was a well-known representation of the data (e.g. a microformat) then the receiving website could act upon that data accordingly. The canonical example is the one given at http://www.liveclipboard.org/:
“Let's say you have two sites both of which understand calendar data. I want to move an appointment from one site to another. With Live Clipboard, there is now an icon on each site, next to each piece of data that can be transported. Bring site A to the front, click on the icon and choose Copy, then bring site B to the front, click on the icon and choose Paste.“
Its not hard to envisage many other uses for such a technology, http://microformats.org has a number of fledgling microformat specifications that could all benefit from Live Clipboard:
Imagine finding someone’s contact details on their website and easily being able to transport those details into your address book with a couple of clicks – that’s the promise of Live Clipboard and microformats. Copy and paste is nearing ubiquity for smart devices (iPhone is expected to announce support in the next couple of weeks) and I doubt anyone reading this would contemplate using a PC that didn’t support it so I’m surprised that this similar concept for the web hasn’t taken off.
A number of large organisations have started to support Microformats most notably Google who recently announced that Googlebot would start seeking out Microformats and Microsoft themselves who have released Oomph, a microformats toolkit. Given that the use of microformats is now taking off I’m surprised that Live Clipboard hasn’t been heard of in such a long time. Here’s hoping that changes soon because it sounds like a very useful technology and to a fella like me whose primary interest is data integration anything that uses well-known standards as a method of doing that is worthy of attention.
Does anyone out there have any information to share about Live Clipboard?
Live Clipboard main site - http://www.liveclipboard.org/
Live Clipboard on Wikipedia - http://en.wikipedia.org/wiki/Live_Clipboard
UPDATE: Ray is well known for inviting people to contact him so I did that and asked him what had happened to Live Clipboard. Paraphrasing his reply: even though there was initial interest, once it was open sourced there was a lack of take-up within the wider web developer community.