EMC Consulting Blogs

SSIS Junkie

  • The curtain falls

    In October 2004 I was in Orlando airport returning home from the annual SQL PASS summit and I happened to pick up a copy of MIT’s Technology Review magazine in which the cover story was an interview with Tim Berners-Lee called “Internet 2.0”.

    MIT Technology Review October 2004 cover

    One subject covered in the interview was the emerging phenomenon of blogging and it was then that I decided I wanted a blog of my own to talk about the stuff I was interested in which, at the time, was the early beta of SQL Server Yukon (later renamed SQL Server 2005).

    Soon after returning to the UK it was announced internally that Conchango were soon to launch their own blog site at http://blogs.conchango.com and that anyone interested in contributing should come forward. Inspired by the Berners-Lee article I threw my hat into the ring and on 3rd November 2004 I wrote my first blog entry, Intro, in which I said:

    So...I've got a blog. Now the inevitable question...what on earth to put in it?

    Well, I guess I found something to write about because since that original post I’ve written another 674 of the things!

    Conchango blogger statistics

    Why am I telling you this? Well, this blog post is going to be both the 676th and the last at http://blogs.conchango.com/jamiethomson for in a few weeks I shall be leaving Conchango (now EMC Consulting) after five and a half very, very happy years. It’s been a difficult decision to make because I’ll be leaving behind a bunch of brilliantly talented colleagues and fun people whom it has been a genuine pleasure to be around, but I feel the time is right to branch out on my own.

    Thanks for reading here for at least some of the past five years, see you on the other side!


    Given my interest in all things BI I thought I’d provide some stats to summarise my activity here over the past 5 years (also because I’m not going to have access to some of this data for much longer):

    • # of posts: 676
    • # of comments: 4109
    • # of hits: 6.99 million

    Distribution of blog posts per month (unsurprisingly my output declined once I’d run out of stuff to talk about with regard to SSIS 2005)

    blog posts per month on http://blogs.conchango.com/jamiethomson


    Number of people subscribed to my blog on Google Reader since March 2008

    google reader subscribers to http://blogs.conchango.com/jamiethomson

    Top posts in terms of hits

  • Reporting Services FixedColumnHeaders property bug [Note to self]

    [Every time I come across this problem I seem to spend a good five minutes scanning the interwebs trying to find the answer - maybe if I write it here it'll be easier to find in the future!]

    SQL Server Reporting Services 2008 has a bug concerning the FixedColumnHeaders property of a tablix.

    The property's description is given as:

    "Indicates whether the column headers remain displayed on the page when the user scrolls the tablix data region off the page."

    Well, that's not actually true because setting that property to, well, anything has no discernible effect. Microsoft have acknowledged the bug (without actually calling it a bug) and provided a workaround in a knowledge base article:

    You cannot keep the column headers or the row headers visible when you scroll through a report in SQL Server 2008 Reporting Services

    I haven't yet checked whether this is fixed in the recent CTP2 build of SQL Server 2008 R2. If I find out I'll try and remember to come back and update this post.


  • Editing Configuration files : SSIS Nugget

    It’s been a long-g-g-g-g time since I did an SSIS-related post so here’s a little ditty that I picked up while reviewing some internal stuff today.

    Editing SSIS configuration (i.e. .dtsConfig) files isn’t particularly easy. Most people I see who try to do this use a traditional text editor like Notepad but that isn’t a great experience because the contents don’t look particularly friendly when you open them up. Observe:


    Instead you may wish to use Visual Studio which, if you’re messing about with SSIS configuration files, you should already have installed. The steps are pretty simple: first, hit Ctrl+O to open your config file. It may look something like this:


    Well, that’s not much better but with a little wave of Visual Studio’s magic wand (otherwise known as Ctrl+K, Ctrl+D) it will get turned into this:


    which, I’m sure you’ll agree, is a whole lot friendlier. You can save that and it will continue to work just fine; SSIS will of course ignore the whitespace (why BIDS doesn’t create configuration files like this in the first place I do not know).
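    If you’d rather not open Visual Studio at all, the same tidy-up can be scripted. Here’s a minimal sketch in Python (the sample XML below is a skeletal stand-in for a real .dtsConfig, not the complete format) that does roughly what Ctrl+K, Ctrl+D does:

```python
# Pretty-print a compact XML configuration string - roughly what
# Visual Studio's Format Document (Ctrl+K, Ctrl+D) does for us.
import xml.dom.minidom

# Skeletal stand-in for a .dtsConfig - not the complete format.
raw = ('<DTSConfiguration><Configuration Path="\\Package.Variables'
       '[User::foo].Value"><ConfiguredValue>bar</ConfiguredValue>'
       '</Configuration></DTSConfiguration>')

# Parse and re-emit with two-space indentation.
pretty = xml.dom.minidom.parseString(raw).toprettyxml(indent="  ")
print(pretty)
```

    SSIS ignores the whitespace either way, so the reformatted file behaves identically.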


  • Peering into the SQL Azure documentation

    Some documentation for SQL Azure has been released on MSDN at http://msdn.microsoft.com/en-us/library/ee336279.aspx and upon reading it there are some interesting nuggets of information that are worth calling out.

    Under a section headed “Key benefits of the service” the following statement appears:


    A key advantage of SQL Azure is the ease with which you can scale your solution. After partitioning your data, the service scales as your data grows.

    Hmmm…there are two seemingly contradictory phrases in there: “ease” and “partitioning your data”. Last time I looked there was nothing particularly easy about partitioning data, certainly not when it is going to be partitioned across multiple databases. I wonder if the captain of the Titanic proudly declared “Sailing to New York will be easy once we’re past all those meddlesome icebergs”!

    Another name for this partitioning technique is sharding for which Dare Obasanjo has a great discussion up at Building Scalable Databases: Pros and Cons of Various Database Sharding Schemes. Dare says database sharding is:

    the process of splitting up a database across multiple machines to improve the scalability of an application

    which is exactly what SQL Azure proposes except that they use the word “partitioning” rather than “splitting” and it’s across multiple virtual instances rather than multiple machines.

    Dare talks about numerous disadvantages of sharding but the key one is that you lose the benefit of referential integrity (RI) which is, to my mind, THE main reason for using a relational database in the first place. Let’s take a look at a simple example. Suppose (incredibly hypothetically) that Amazon adopted SQL Azure and chose to shard its product catalogue across multiple instances; they would then face the very real situation where it is not possible to have full RI between their products, customers and the orders that link them together. An order would most likely contain products from different shards, so which shard should the order live on? Maybe Amazon could shard the order line items by product, but how do they enforce RI back to the order header and, onward, to the customer? Clearly Amazon can’t enforce RI across shards so they would have to use some different data storage mechanism, in which case why bother using SQL Azure at all?
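    To make that concrete, here’s a tiny Python sketch (nothing to do with SQL Azure’s actual API; the hash-based shard-assignment scheme is my own invention for illustration) showing why a foreign key can’t span shards:

```python
# Hash-shard a product catalogue across several "databases" and show
# that one order can reference products living on different shards -
# so no single database holds both sides of the relationship, and an
# in-database foreign key cannot cover it.
NUM_SHARDS = 3

def shard_for(product_id: int) -> int:
    """Pick a shard by hashing the product key."""
    return product_id % NUM_SHARDS

# Each shard is stood in for by a plain dict of product_id -> name.
shards = [dict() for _ in range(NUM_SHARDS)]
for pid, name in [(101, "book"), (205, "kettle"), (309, "lamp")]:
    shards[shard_for(pid)][pid] = name

# One order whose line items span products on different shards...
order_lines = [101, 205]
homes = {shard_for(pid) for pid in order_lines}
# ...involves more than one shard, so RI cannot be enforced by any
# single database.
print(sorted(homes))
```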

    This is an extreme case but it’s one that you have to consider when using SQL Azure and the problem is exacerbated when you consider that the maximum size of a database in SQL Azure is 10GB. “Oh, you want to store your new product in my database? You say it’ll only use up 5KB of storage? No can do, you’re maxed out! Sorry!” (Although arguably the 10GB limit is an advantage because at least your shard’s maximum limit is absolute rather than theoretical.)

    There are of course solutions to these problems and they generally involve writing your application to work around these limitations. Doing that is, however, an expensive and time-consuming undertaking, which is something anyone adopting SQL Azure needs to be aware of, and you won’t find it written down anywhere in the SQL Azure documentation.

    Be aware, that’s all I’m saying! If your database is not going to grow anywhere near 10GB then SQL Azure might well be a good fit for you.


  • .Net Service bus demo by EasyJet at UK Azure Net user group meeting

    Yesterday evening I attended the second meeting of the UK Azure Net user group in London Victoria. Normally this wouldn’t in itself be blog worthy but one of the presentations given really piqued my interest and I wanted to draw some attention to it.

    Bert Craven from EasyJet gave a demo of an application that EasyJet are going to be trialling in the near future, one that allows their employees to use handheld devices to check in passengers rather than have those passengers queue up at check-in desks. The really compelling part of the demo though was the underlying technology; Bert’s team are using the Azure .Net Service Bus to expose service endpoints from their existing firewalled systems in order to make them consumable from the handheld devices. It was (as I fed back to Bert last night) by some distance the best Azure demo I’ve seen yet.

    Bert claimed that they had to do little more than change a configuration file in order to expose their services in this way and that’s when I had the “AHA” moment that had thus far escaped me and I began to understand why people like Simon Evans and Simon Davies have been speaking so enthusiastically about the .Net Service Bus. I realised the power inherent in this thing - it allows you to deliver features that previously would have taken reams and reams of code using little more than a configuration change.
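    For illustration only, here’s roughly what such a configuration change might look like in a WCF app.config (the service name, contract, namespace and address below are my own guesses, not EasyJet’s actual configuration): an on-premises endpoint becomes reachable through the relay by swapping its binding.

```xml
<system.serviceModel>
  <services>
    <service name="CheckIn.CheckInService">
      <!-- Previously an ordinary on-premises endpoint (e.g. netTcpBinding).
           Swapping the binding and address exposes the same contract
           through the Azure .Net Service Bus relay. -->
      <endpoint
          address="sb://mynamespace.servicebus.windows.net/CheckIn"
          binding="netTcpRelayBinding"
          contract="CheckIn.ICheckInService" />
    </service>
  </services>
</system.serviceModel>
```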

    Bert’s presentation included a demo where a service and a client were able to communicate via a service bus relay in under a second – this involved two round trips to a datacentre on the west coast of the US no less. He also showed data being synchronised between two handheld devices (running on emulators) via the service bus and again in sub-second intervals. To say this was impressive is an understatement.

    Bert has blogged about his talk at Azure Service Bus Presentation and Demo Code and as the title suggests he has made demo code available there. I’m going away now to try it for myself.

    That’s all really. The .Net Service Bus is the ace in the pack of Microsoft’s Azure offering and if you work in B2B scenarios or with distributed systems it could be well worth spending some time understanding it.


    UPDATE: A video of Bert’s session is now available:

  • Sync is the word

    I recently tweeted the following rather silly little ditty…

    Sync is the word
    Its in Groove, its in GMail
    Sync works any time, any place, in motion
    Sync is the way we'll be storing

    Ahem, yes. Well… silly, but with a relevant point. Sync is becoming an integral aspect of how we build and consume apps and I expect that trend to increase in the years to come. Let’s look at the evidence: big tech companies are tripping over themselves in their efforts to build synchronisation capabilities into their products.

    I often hear that we are living in an increasingly connected world but in some ways the opposite is true; we are actually living in an increasingly disconnected world due to the growing number of devices in our pockets that are online either some or none of the time. Until online connectivity is ubiquitous, sync is an essential part of any smartphone platform.

    Indeed, almost five years ago Russell Beattie said:

    “Syncing is *THE* most important piece of technology in the future of mobility. Voice is and will be the number one service, but after that it's syncing. Syncing! SYNCING!”
    from Syncing. Syncing! Syncing! Syncing! SYNCING!!!

    I quite agree! More about sync to follow…


  • String Aggregation in T-SQL & PL/SQL : SQL Nugget

    In my current day job I’m doing a lot of work against an Oracle back-end and I’ve just come across a situation where I need to do some string aggregation. Effectively I needed to turn this:

    Charles William
    Charles Harry
    Anne Peter
    Anne Zara
    Andrew Beatrice
    Andrew Eugenie

    into this:

    Charles William,Harry
    Anne Peter,Zara
    Andrew Eugenie,Beatrice

    In other words I wanted to take a list of children per parent and produce a comma-delimited list of each of their children.
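    For clarity, here’s that transformation sketched in Python (purely illustrative; the real work had to happen in SQL):

```python
# Group consecutive (parent, child) rows and build a comma-delimited
# list of children per parent. groupby relies on rows for the same
# parent being adjacent, which they are in this sample data.
from itertools import groupby

rows = [("Charles", "William"), ("Charles", "Harry"),
        ("Anne", "Peter"), ("Anne", "Zara"),
        ("Andrew", "Beatrice"), ("Andrew", "Eugenie")]

aggregated = {parent: ",".join(child for _, child in group)
              for parent, group in groupby(rows, key=lambda r: r[0])}
print(aggregated["Charles"])  # William,Harry
```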

    I know how to do this in T-SQL; you use the FOR XML PATH('') construct like so:

    with t as (
     select 'Charles' parent, 'William' child union
     select 'Charles', 'Harry' union
     select 'Anne', 'Peter' union
     select 'Anne', 'Zara' union
     select 'Andrew', 'Beatrice' union
     select 'Andrew', 'Eugenie'
    )
    SELECT parent, STUFF( ( SELECT ','+ child
                            FROM t a
                            WHERE b.parent = a.parent
                            FOR XML PATH('')),1 ,1, '')  children
    FROM t b
    GROUP BY parent

    which, sure enough, produced exactly the output shown above.
    Unfortunately I didn’t know how to accomplish it in Oracle; however, after a bit of searching around I found the answer:

    with t as (
     select 'Charles' parent, 'William' child from dual union
     select 'Charles', 'Harry' from dual union
     select 'Anne', 'Peter' from dual union
     select 'Anne', 'Zara' from dual union
     select 'Andrew', 'Beatrice' from dual union
     select 'Andrew', 'Eugenie' from dual
    )
    select parent, rtrim(xmlagg(xmlelement(e,child || ','))
                    .extract('//text()'),',') childs from t
    group by parent

    So, now you know. And so will I if I ever need to find this again!


  • The Longest Tweet

    I read an article today called The Longest Tweet In History which explained how it is now possible to send Tweets via Twitter that are longer than the supposed 140 character limit.

    As it happens I’ve written a noddy app called Tweetpoll, deployed on Windows Azure, that periodically polls Twitter’s public timeline to determine the distribution of Tweet length and displays that distribution at http://tweetpoll.cloudapp.net/. Here’s a screenshot of the latest distribution:


    This is based on a sample set of 2,940,150 tweets (so far). Unsurprisingly it’s a fairly smooth curve (I can’t explain the peaks though – probably a bug in my code) although before I started I expected the frequency of tweet lengths to increase exponentially and clearly that’s not the case because we have a sustained increase around 30-50 characters before it drops off again.

    Anyway, I digress. It’s been running for a couple of months now and I became puzzled when, soon after launching it, I began getting results that were greater than 140 characters (as you can see on the graph above). I didn’t have an explanation for those numbers so I set about uncovering why. Three days ago I deployed an update to the app so that it now explicitly captures all tweets greater than 140 characters and stores them somewhere. Thanks to my new best friend LINQPad and the following query:

    var results = svc.LongTweetsTable.ToList().Select(r => new { r.RowKey, Length = r.RowKey.Length})
                        .OrderByDescending(r => r.Length );

    I can now find out what’s going on in those tweets:


    Notice anything about those tweets? They all contain either "&lt;" or "&gt;", the escaped forms of the less-than/greater-than symbols in markup, and hence the mystery is explained; the markup for a tweet might well be longer than the actual tweet itself. Pretty logical if you think about it, although it didn’t occur to me without actually examining the data. That’s an important lesson learned – make sure you know your data intimately.
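    The effect is easy to reproduce. Here’s a small Python illustration (this mimics markup escaping in general, not Twitter’s exact pipeline):

```python
# Why "140 character" tweets overflow: when a tweet is delivered as
# markup, < and > are escaped to the entities &lt; and &gt;, so the
# markup can be longer than the tweet itself.
import html

tweet = "x" * 138 + "<>"          # exactly 140 characters of text
escaped = html.escape(tweet)       # what arrives embedded in markup
print(len(tweet), len(escaped))    # 140 146
```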


  • Kapow – ETL for HTML

    A couple of weeks ago Chris Webb sent me an IM telling me about a new technology he’d just seen a demo of called Kapow. Chris has since blogged about it at Kapow Technologies and in that blog post he described Kapow as:

    “a cross between a screenscraper and an ETL tool”

    That’s a very apt description, for what Kapow enables you to do is build what are effectively ETL packages (although they call them Robots) that extract data from HTML pages and either (a) load it into a database for you or (more interestingly) (b) make that data available as a RESTful web service. A robot pulls out the data embedded in the markup and presents it as strongly-typed data entities.

    I never really thought about a web page as being structured data but actually nothing could be further from the truth; HTML is after all nothing more than a hierarchical dataset with the added luxury of metadata - otherwise known as the Document Object Model (DOM). Kapow takes that structured data and its rich metadata, parses it for us, and presents it to us in ways that are easily consumable.
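    To illustrate the point (this is just a toy sketch using Python’s standard-library parser; Kapow’s robots are built visually, and the page structure below is invented, not from any real search engine):

```python
# Treat HTML as structured data: walk the markup with the stdlib
# parser and pull each search result out as a typed record.
from html.parser import HTMLParser

PAGE = """
<div class="result"><a href="http://example.com/1">First hit</a></div>
<div class="result"><a href="http://example.com/2">Second hit</a></div>
"""

class ResultScraper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.results = []    # list of {"url": ..., "title": ...} records
        self._href = None    # href of the <a> we are currently inside

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        # Text inside an <a> becomes the result's title.
        if self._href and data.strip():
            self.results.append({"url": self._href,
                                 "title": data.strip()})
            self._href = None

scraper = ResultScraper()
scraper.feed(PAGE)
print(scraper.results[0]["title"])  # First hit
```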

    I was given a tour of Kapow by their UK rep Dominic Dunkley. Dominic had a great demo where he built, from scratch, a Kapow robot that:

    1. visited a search engine before…
    2. …entering a search term (in this case “EMC”)
    3. iterated over each page of results within which it…
    4. …iterated over each search result….
    5. …and extracted the title, URL and description before…
    6. …making all the search results available in a strongly-typed dataset

    It all took about 15 minutes and that was with him pausing to explain it step-by-step. That demo really grabbed my attention because I realised that not only does Kapow have the ability to parse the DOM but it also has notions of workflow and data composition which are of course two vital features of any ETL tool.

    I’m reminded of 3scale Networks that I mentioned in my blog post Enterprise Mashups a couple of weeks back (it was actually after reading that blog post that Chris got in touch with me); in it I describe how 3scale have taken information made freely available by the United Nations at http://data.un.org/ and made it available as an easily consumable data service. In essence that same service could be built using Kapow without being monotonously handcrafted, which is how I suspect 3scale did it.

    I’m hoping I have some reason to use Kapow in the future because I think there are some very interesting scenarios that come into play here. Chiefly, as Chris pointed out in his blog post, we can pull data from any web site and use it for BI purposes. For example, suppose you work for an airline and you want to easily compare your advertised prices with those of your competitors – Kapow is a one-stop shop for enabling that.

    Impressive stuff. If you’re interested go and check out Kapow for yourself at http://kapowtech.com/index.php/solutions/web-and-business-intelligence


  • LINQPad and Azure

    Since I started dealing with Azure tables I’ve become frustrated that there is no ad-hoc query tool, nothing equivalent to SQL Server Management Studio. Then I heard about LINQPad and figured there must be a way to use it to query Azure tables using LINQ and indeed there is as I’ll explain here (this post assumes that you have a working knowledge of LINQPad and Azure storage).

    Firstly you need to reference the correct assemblies. I’ve been using the StorageClient library that is provided with the Azure samples in the Azure SDK (March 2009 CTP of the SDK linked here) which in turn uses the ADO.Net Data Services client library, hence you’ll need to reference the following assemblies:

    • StorageClient.dll
    • System.Data.Services.Client.dll

    You’ll also need to reference your DataServiceContext implementation. For demo purposes I am pulling data from my Tweetpoll application so I’m referencing the assembly containing my TweetPollDataServiceContext class - TweetPoll.dll:


    We also need to import the namespaces into LINQPad. You’ll need the following:

    • Microsoft.Samples.ServiceHosting.StorageClient
    • System.Data.Services.Client

    plus whichever namespace holds your DataServiceContext implementation:


    Thereafter you just need to write some code. You need to supply the following:

    1. Your Azure storage account name
    2. Your Azure storage shared key
    3. The name of your DataServiceContext implementation
    4. A LINQ query

    Here’s my code:

    var accountName = "jamiekt";  // 1) Enter your storage account name
    var sharedKey = "smXblLn+UgXR6ysbIhoeTfE3dyOZhAaONXOP/SUawRSLzCWwDXrhkpRG45A5aeAP5IEjSBEN2mEmPM8O5HnWGQ=="; // 2) SharedKey
    var uri = new System.Uri("http://table.core.windows.net/");
    var tableAccountInfo = new StorageAccountInfo(uri, null, accountName, sharedKey);
    var svc = new TweetPollDataServiceContext(tableAccountInfo); // 3) Specify your DataServiceContext
    // 4) Supply your query
    var query = from row in svc.Table
                select row;

    And here are the results in LINQPad:

    Hope that helps!!!


Powered by Community Server (Personal Edition), by Telligent Systems