|
|
Once upon a time this blog was a hive of activity. Now however its pretty lifeless as you can probably tell so if are pining for more of the same you can find me over at http://sqlblog.com/blogs/jamie_thomson. I look forward to seeing you there!
-
In October 2004 I was in Orlando airport returning home from the annual SQL PASS summit and I happened to pick up a copy of MIT’s Technology Review magazine in which the cover story was an interview with Tim Berners-Lee called “Internet 2.0”.  One subject covered in the interview was the emerging phenomenon of blogging and it was then that I decided I wanted to get my own blog to talk about the stuff I was interested which, at the time, was the early beta of SQL Server Yukon (later renamed to SQL Server 2005). Soon after returning to the UK it was announced internally that Conchango were soon to support their own blog site at http://blogs.conchango.com and would anyone be interested in contributing. Inspired by the Berners-Lee article I threw my hat in and on 3rd November 2004 I wrote my first blog entry Intro where I said: So...I've got a blog. Now the inevitable question...what on earth to put in it? Well I guess I found something to write about because since that original post I’ve written another 674 of the things!: Why am I telling you this? Well, this blog post is going to be both the 676th and the last at http://blogs.conchango.com/jamiethomson for in a few weeks I shall be leaving Conchango EMC Consulting after five and a half very very happy years. Its been a painstaking decision to do this for I’ll be leaving behind a bunch of brilliantly talented colleagues and fun people whom it has been a genuine pleasure to be around but I feel the time is right to branch out on my own. My blogging activity will not die, I shall be continuing over at http://sqlblog.com/blogs/jamie_thomson and I would encourage you to point your RSS readers at http://feeds.feedburner.com/jamiet if you want to continue to read my inane wafflings. Thanks for reading here for at least some of the past five years, see you on the other side! @Jamiet Given my interest in all things BI I thought I’d provide some stats to summarise my activity here over the past 5 years (also because I’m not going to have access to some of this data for much later): - # of posts: 676
- # of comments: 4109
- # of hits: 6.99million
Distribution of blog posts per month (unsurprisingly my output declined once I’d run out of stuff to talk about in regard to SSIS 2005) Number of people subscribed to my blog on Google Reader since March 2008 Top posts in terms of hits
|
-
[Every time I come across this problem I seem to spend a good five minutes scanning the interwebs trying to find the answer - maybe if I write it here it'll be easier to find in the future!]
SQL Server Reporting Services 2008 has a bug concerning the FixedColumnHeaders property of a tablix.
The property's description is given as :
"Indicates whether the column headers remain displayed on the page when the user scrolls the tablix data region off the page."
Well that's not actually true because setting that property to, well, anything has no discernible effect. Microsoft have acknowledge the bug (without actually calling it a bug) and provided a workaround in a knowledge base article:
You cannot keep the column headers or the row headers visible when you scroll through a report in SQL Server 2008 Reporting Services
I haven't yet checked whether this is fixed in the recent CTP2 release build of SQL Server 2008 R2. If I find out I'll try and remember to come back and update this post.
@Jamiet
|
-
Its been a long-g-g-g-g time since I did a SSIS-related post so here’s a little ditty that I picked up while reviewing some internal stuff today. Editing SSIS configuration (i.e. .dtsConfig) files isn’t particularly easy. Most people I see who try to do this use a traditional text editor like Notepad but that isn’t a great experience because the contents don’t look particularly friendly when you open them up. Observe: Instead you may wish to use Visual Studio which, if you’re messing about with SSIS configuration files, you should already have installed. The steps are pretty simple, firstly hit CTRL-O to open your config file. It may look something like this: Well, that’s not much better but with a little wave of Visual Studio’s magic wand (otherwise known as CTRL-K, CTRL-D) then it will get turned into this: which, I’m sure you’ll agree, is a whole lot friendlier. You can save that and it will continue to work just fine, SSIS will of course ignore the whitespace (why BIDS doesn’t create configuration files like this in the first place I do not know). @Jamiet
|
-
Some documentation for SQL Azure has been released on MSDN at http://msdn.microsoft.com/en-us/library/ee336279.aspx and upon reading there are some interesting nuggets of information that are worth calling out. Under a section headed “Key benefits of the service” the following statement appears: Scalability A key advantage of SQL Azure is the ease with which you can scale your solution. After partitioning your data, the service scales as your data grows. Hmmm…there’s two seemingly contradictory statements in there, “ease” and “partitioning your data”. Last time I looked there was nothing particularly easy about partitioning data, certainly not when it is going to be partitioned across multiple databases. I wonder if the captain of the Titanic proudly declared “Sailing to New York will be easy once we’re past all those meddlesome icebergs”! Another name for this partitioning technique is sharding for which Dare Obasanjo has a great discussion up at Building Scalable Databases: Pros and Cons of Various Database Sharding Schemes. Dare says database sharding is: the process of splitting up a database across multiple machines to improve the scalability of an application which is exactly what SQL Azure proposes except that they use the word “partitioning” rather than “splitting” and its across multiple virtual instances rather than multiple machines. Dare talks about numerous disadvantages of sharding but the key one is that you lose the benefit of referential integrity (RI) which is, to my mind, THE main reason for using a relational database in the first place. Let’s take a look at a simple example. Suppose (incredibly hypothetically) that Amazon adopted SQL Azure and chose to shard its product catalogue across multiple instances; they then face the very real situation where it is not possible to have full RI between their products, customers and the orders that link them together. An order would most likely contain products from different shards therefore which shard should the order live on? Maybe Amazon could shard the order line items according by product but how do they enforce RI back to the order header and, onward, to the customer? Clearly Amazon can’t enforce RI across shards so they have to use some different data storage mechanism in which case why bother using SQL Azure at all? This is an extreme case but its one that you have to consider when using SQL Azure and the problem is exacerbated when you consider that the maximum size of a database in SQL Azure is 10GB. “Oh, you want to store your new product in my database? You say it’ll only use up 5KB of storage? No can do, you’re maxed out! Sorry!” (Although arguably the 10GB limit is an advantage because at least your shard’s maximum limit is absolute rather than theoretical.) There are of course solutions to these problems and they generally involve writing your application to workaround these limitations. Doing that is, however, an expensive and time-consuming undertaking which is something anyone adopting SQL Azure needs to be aware of and you won’t find it written down anywhere in the SQL Azure documentation. Be aware, that’s all I’m saying! If your database is not going to grow anywhere near 10GB then SQL Azure might well be a good fit for you. @Jamiet
|
-
Yesterday evening I attended the second meeting of the UK Azure Net user group in London Victoria. Normally this wouldn’t in itself be blog worthy but one of the presentations given really piqued my interest and I wanted to draw some attention to it. Bert Craven from EasyJet gave a demo of an application that EasyJet are going to be trialling in the near future, one that allows their employees to use handheld devices to check-in passengers rather than have those passengers queue up at check-in desks. The really compelling part of the demo though was the underlying technology; Bert’s team are using the Azure .Net Service Bus to expose service endpoints from their existing firewalled systems in order to make them consumable from the handheld devices. It was (as I fed back to Bert last night) by some distance the best Azure demo I’ve seen yet Bert claimed that they had to do little more than change a configuration file in order to expose their services in this way and that’s when I had the “AHA” moment that had thus far escaped me and I began to understand why people like Simon Evans and Simon Davies have been speaking so enthusiastically about the .Net Service Bus. I realised the power inherent in this thing - it allows you to deliver features that previously would have taken reams and reams of code using little more than a configuration change. Bert’s presentation included a demo where a service and a client were able to communicate via a service bus relay in under a second – this involved two round trips to a datacentre on the west coast of the US no less. He also showed data being synchronised between two handheld devices (running on emulators) via the service bus and again in sub-second intervals. To say this was impressive is an understatement. Bert has blogged about his talk at Azure Service Bus Presentation and Demo Code and as the title suggests he has made demo code available there. I’m going away now to try it for myself. That’s all really. The .Net Service Bus is the ace in the pack of Microsoft’s Azure offering and if you work in B2B scenarios or with distributed systems it could be well worth spending some time understanding it. @Jamiet UPDATE: A video of Bert’s session is now available:
|
-
I recently tweeted the following rather silly little ditty… Sync is the word Its in Groove, its in GMail Sync works any time, any place, in motion Sync is the way we'll be storing Ahem, yes. Well… silly but with a relevant point. Sync is becoming an integral aspect of how we build and consume apps and I expect that trend to increase in the years to come. Let’s look at the evidence. Big tech companies are tripping up over themselves with their efforts to build synchronisation capabilities into their products: I often hear that we are living in an increasingly connected world but in some ways the opposite is true, we are actually living in an increasingly disconnected world due to the increasing number of devices in our pocket that are online either some or none of the time. Until online connectivity is ubiquitous sync is an essential part of any smartphone platform. Indeed, almost five years ago Russell Beattie said: “Syncing is *THE* most important piece of technology in the future of mobility. Voice is and will be the number one service, but after that it's syncing. Syncing! SYNCING!” from Syncing. Syncing! Syncing! Syncing! SYNCING!!! I quite agree! More about sync to follow… @Jamiet
|
-
In my current day job I’m doing a lot of work against an Oracle back-end and I’ve just come across a situation where I need to do some string aggregation. Effectively I needed to turn this:
Parent |
Child |
| Charles |
William |
| Charles |
Harry |
| Anne |
Peter |
| Anne |
Zara |
| Andrew |
Beatrice |
| Andrew |
Eugenie |
into this:
Parent |
Children |
| Charles |
William,Harry |
| Anne |
Peter,Zara |
| Andrew |
Eugenie,Beatrice |
In other words I wanted to take a list of children per parent and produce a comma-delimited list of each of their children.
I know how to do this in T-SQL, you use the FOR XML PATH(‘’) construct like so:
with t as( select 'Charles' parent, 'William' child union select 'Charles', 'Harry' union select 'Anne', 'Peter' union select 'Anne', 'Zara' union select 'Andrew', 'Beatrice' union select 'Andrew', 'Eugenie' ) SELECT parent, STUFF( ( SELECT ','+ child FROM t a WHERE b.parent = a.parent FOR XML PATH('')),1 ,1, '') childrenFROM t b GROUP BY parent
which, yes, turned this:
into this:
Unfortunately I didn’t know how to accomplish it in Oracle however after a bit of searching around I found the answer:
with t as( select 'Charles' parent, 'William' child from dual union select 'Charles', 'Harry' from dual union select 'Anne', 'Peter' from dual union select 'Anne', 'Zara' from dual union select 'Andrew', 'Beatrice' from dual union select 'Andrew', 'Eugenie' from dual ) select parent, rtrim(xmlagg(xmlelement(e,child || ',')) .extract('//text()'),',') childs from tgroup by parent
So, now you know. And so will I if I ever need to find this again!
@Jamiet
|
-
I read an article today called The Longest Tweet In History which explained how it is now possible to send Tweets via Twitter that are longer than the supposed 140 character limit. As it happens I’ve written a noddy app called Tweetpoll onto Windows Azure that periodically polls Twitter’s public timeline to determine the distribution of Tweet length and displays that distribution at http://tweetpoll.cloudapp.net/. Here’s a screenshot of the latest distribution: This is based on a sample set of 2940150 tweets (so far). Unsurprisingly its a fairly smooth curve (I can’t explain the peaks though – probably a bug in my code) although before I started I expected the frequency of tweet lengths to increase exponentially and clearly that’s not the case because we have a sustained increase around 30-50 characters before dropping off again. Anyway, I digress. Its been running a couple of months now and I became puzzled when, soon after launching it, I began getting results that were greater than 140 characters (as you can see on the graph above). I didn’t have an explanation for those numbers so I set about uncovering why. Three days ago I deployed an update to the app so that it now explicitly captures all tweets greater than 140 characters and stores them somewhere. Thanks to my new best friend LINQPad and the following query: var results = svc.LongTweetsTable.ToList().Select(r => new { r.RowKey, Length = r.RowKey.Length})
.OrderByDescending(r => r.Length );
I can now find out what’s going on in those tweets:
Notice anything about those tweets? They all contain either “<” or “>” which are the escape characters for less-than/greater-than symbols in markup and hence the mystery is explained; the markup for tweets might well be longer than the actual tweet itself. Pretty logical if you think about it although it didn’t occur to me without actually examining the data. That’s an important lesson learned – make sure you know your data intimately.
@Jamiet
|
-
A couple of weeks ago Chris Webb sent me an IM telling me about a new technology he’d just seen a demo of called Kapow. Chris has since blogged about it at Kapow Technologies and in that blog post he described Kapow as: “a cross between a screenscraper and an ETL tool” That’s a very apt description, for what Kapow enables you to do is build what are effectively ETL packages (although they call them Robots) that extract data from HTML pages and either (a) load it into a database for you or (more interestingly) (b) make that data available as a RESTful web service. A robot pulls out the data embedded in the markup and presents it as strongly-typed data entities. I never really thought about a web page as being structured data but actually nothing could be further from the truth; HTML is after all nothing more than a hierarchical dataset with the added luxury of metadata - otherwise known as the Document Object Model (DOM). Kapow takes that structured data and its rich metadata, parses it for us, and presents it to us in ways that are easily consumable. I was given a tour of Kapow by their UK rep Dominic Dunkley. Dominic had a great demo where he built, from scratch, a Kapow robot that: - visited a search engine before…
- …entering a search term (in this case “EMC”)
- iterated over each page of results within which it…
- …iterated over each search result….
- …and extracted the title, URL and description before…
- …making all the search results available in a strongly-typed dataset
It all took about 15 minutes and that was with him pausing to explain it step-by-step. That demo really grabbed my attention because I realised that not only does Kapow have the ability to parse the DOM but it also has notions of workflow and data composition which are of course two vital features of any ETL tool. I’m reminded of 3scale Networks that I mentioned in my blog post Enterprise Mashups a couple of weeks back (it was actually after reading that blog post that Chris got in touch with me); in that I describe how 3scale have taken information made freely available by the United Nations at http://data.un.org/ and made it available as an easily consumable data service. In essence this same service could be built using Kapow without being monotonously handcrafted which is how I suspect 3scale did it. I’m hoping I have some reason to use Kapow in the future because I think there are some very interesting scenarios that come into play here. Chiefly, as Chris pointed out in his blog post, we can pull data from any web site and use it for BI purposes. For example, suppose you work for an airline and you want to easily compare your advertised prices with those of your competitors – Kapow is a one-stop shop for enabling that. Impressive stuff. If you’re interested go and check out Kapow for yourself at http://kapowtech.com/index.php/solutions/web-and-business-intelligence @Jamiet
|
-
Since I started dealing with Azure tables I’ve become frustrated that there is no ad-hoc query tool, nothing equivalent to SQL Server Management Studio. Then I heard about LINQPad and figured there must be a way to use it to query Azure tables using LINQ and indeed there is as I’ll explain here (this post assumes that you have a working knowledge of LINQPad and Azure storage). Firstly you need to reference the correct assemblies. I’ve been using the StorageClient that is provided with the Azure Samples in the Azure SDK (March 2009 CTP of the SDK linked here) which in turn uses ADO.Net Services client library hence you’ll need to reference the following assemblies: - StorageClient.dll
- System.Data.Services.Client.dll
You’ll also need to reference your DataServiceContext implementation. For demo purposes I am pulling data from my Tweetpoll application so I’m referencing the assembly containing my TweetPollDataServiceContext class - TweetPoll.dll: We also need to import the namespaces into LINQPad. You’ll need the following: - Microsoft.Samples.ServiceHosting.StorageClient
- System.Data.Services.Client
plus whichever namespace holds your DataServiceContext implementation: Thereafter you just need to write some code. You need to supply the following: - Your Azure storage account name
- Your Azure storage shared key
- The name of your DataServiceContext implementation
- A LINQ query
Here’s my code: var accountName = "jamiekt"; // 1) Enter your storage account name
var sharedKey = "smXblLn+UgXR6ysbIhoeTfE3dyOZhAaONXOP/SUawRSLzCWwDXrhkpRG45A5aeAP5IEjSBEN2mEmPM8O5HnWGQ=="; // 2) SharedKey
var uri = new System.Uri("http://table.core.windows.net/");
var tableAccountInfo = new StorageAccountInfo(uri, null, accountName, sharedKey);
var svc = new TweetPollDataServiceContext(tableAccountInfo); // 3) Specify your DataServiceContext
// 4) Supply your query
var query = from row in svc.Table
select row;
query.Dump();
And here are the results in LINQPad:
Hope that helps!!!
@Jamiet
|
-
Here is the latest collection of submissions to http://connect.microsoft.com/sqlserver/feedback that pertain to SQL Server Integration Services (SSIS) and which I think are worthy of your attention. Take a read of the items below and if you agree with the aim of the submission, please click through and vote for it – it only takes a few seconds. If you want to leave a comment supporting your vote, that would be even better. Don’t feel obliged to vote on everything carte blanche, only vote for those that you would like the SSIS product team to concentrate on. Most of these are already closed but that doesn’t mean they can’t be opened again. And remember, voting really does make a difference; if you don’t believe me take a read of Doug Laudenschlager’s blog post Your vote on Microsoft Connect influenced SQL Server 2008 Service Pack 1. Ability to create an empty raw file with required metadata In the raw file destination adapter, have the option to create the an empty raw file with the appropriate metadata when (for example) you close the raw file destination adapter editor. Or perhaps just have a button in raw file destination adapter editor called "Go create an empty raw file" #225974 : Create an empty raw file with required metadata (https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=225974) A better performing distinct component SORT component is an asynchronous, fully blocking component. This makes sense of course but it seemed to me that if 'Remove rows with duplicate sort values' check box is checked then it could be made into a partially blocking component instead. Let me explain. We know that the component is only going to output one row for duplicates. That doesn't mean though that the output necessarily is required to be sorted does it? Hence, why not just give us a new component that puts every new combination of values into the pipeline as soon as it is encountered. Sure, it still has to do a SORT internally in order to determine whether a combination of values is new or not, but if the output doesn't need to be sorted, why bother waiting until we have all the values before we start processing them downstream? #244313 : SSIS: Improve performance of DISTINCT sort component (https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=244313) Keep Raw files sorted I have an input into a raw file destination that has IsSorted=TRUE. Any output from a raw file source that uses that raw file will have IsSorted=FALSE. Proposed solution: Store IsSorted property of the input and SortKey property of the columns in a raw file #242351 : Should IsSorted metadata be stored in raw files? (https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=242351) Improve DtExecUI by adding a package browser There is no ability be able to browse the package and select a value to be set via the /SET option of dtexec. Add the ability to browse the package. Simply use the browser that is used in the configuration wizard. #124577 : Put package browser on Set Values tab of dtexecui.exe (https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=124577) To read more in this series of SSIS Connect digests go to SSIS Connect Digest @Jamiet
|
-
I recently inherited some SQL that someone else had written and had the job of “tidying it up” before it gets pushed out to production. Here’s a slightly simplified (yes, simplified) version of that SQL: 1: --options
2: select asset_class
3: , case when volume > 0 then 'profit'
4: else 'loss'
5: end as profit_or_loss
6: , customer
7: , sum(volume) as total_volume
8: from t1
9: where asset_class = 'options'
10: group by
11: asset_class
12: , case when volume > 0 then 'profit'
13: else 'loss'
14: end
15: , customer
16: union all
17: --swaps
18: select asset_class
19: , case when volume > 0
20: then 'profit'
21: else 'loss'
22: end as profit_or_loss
23: , customer
24: , sum(volume) as total_volume
25: from t2
26: where asset_class = 'swaps'
27: group by
28: asset_class
29: , case when volume > 0 then 'profit'
30: else 'loss'
31: end
32: , customer
The SQL basically takes data from two tables (t1 & t2), aggregates each, carries out some inline expressions (for profit_or_loss) and finally unions it all together. On the surface it looks fine but there are a few problems here, namely that there is a lot of repeated code; it violates the principle of don’t repeat yourself (DRY) which preaches “single point of maintenance” and “deduplication” of code. If you take the time to check it out you’ll see that identical aggregations are carried out on the two datasets (lines 7 & 24) as are the same conversions for profit_or_loss (lines 3-5 & 19-22). Not only that but we have expressions appearing in both the SELECT clause and GROUP BY of both halves of the query (lines 12-14 & 29-31), another violation of DRY.
A bit of refactoring is called for. First job, eliminate those expressions for profit_or_loss which appear in the GROUP BY clauses:
1: --options
2: select asset_class
3: , profit_or_loss
4: , customer
5: , sum(volume) as total_volume
6: from (
7: select asset_class
8: , customer
9: , case when volume > 0 then 'profit'
10: else 'loss'
11: end as profit_or_loss
12: , volume
13: from t1
14: where asset_class = 'options'
15: )t1
16: group by
17: asset_class
18: , profit_or_loss
19: , customer
20: union all
21: --swaps
22: select asset_class
23: , profit_or_loss
24: , customer
25: , sum(volume) as total_volume
26: from (
27: select asset_class
28: , customer
29: , case when volume > 0 then 'profit'
30: else 'loss'
31: end as profit_or_loss
32: , volume
33: from t2
34: where asset_class = 'swaps'
35: )t2
36: group by
37: asset_class
38: , profit_or_loss
39: , customer
Ok cool, we’ve moved the expression for profit_or_loss into a subquery (aka derived table) and hence expressions have gone from our GROUP BY clauses … but we’ve still got DRY violations. The aggregation (lines 5 & 25) and expression for profit_or_loss (lines 9-11 & 29-31) still appear in two places. More refactoring….
1: select asset_class
2: , customer
3: , sum(volume) as total_volume
4: from (
5: select asset_class
6: , case when volume > 0 then 'profit'
7: else 'loss'
8: end as profit_or_loss
9: , customer
10: , volume
11: from (
12: --options
13: select asset_class
14: , customer
15: , volume
16: from t1
17: where asset_class = 'options'
18: union all
19: --swaps
20: select asset_class
21: , customer
22: , volume
23: from t2
24: where asset_class = 'swaps'
25: )t
26: )q
27: group by
28: asset_class
29: , customer
There, much better; the aggregation only occurs once as does the expression for profit_or_loss (which in the code that I inherited occurred four different times). Our SQL is DRY and its going to be a lot easier to maintain for whomever picks the code up from me.
You’ll notice we’ve got 2 levels of nested subqueries (aka derived tables). I make no apologies for that - derived tables are a great mechanism for eliminating repeated code and if you take but one bit of advice away from this blog post it would be this: derived tables are your friend.
Would you have refactored this the same way? Maybe you might have moved the derived tables into a dedicated view. Perhaps you might even have put the derived tables into a WITH clause like so:
1: with t as (
2: --options
3: select asset_class
4: , customer
5: , volume
6: from t1
7: where asset_class = 'options'
8: union all
9: --swaps
10: select asset_class
11: , customer
12: , volume
13: from t2
14: where asset_class = 'swaps'
15: ),
16: q as (
17: select asset_class
18: , case when volume > 0 then 'profit'
19: else 'loss'
20: end as profit_or_loss
21: , customer
22: , volume
23: from t
24: )
25: select asset_class
26: , customer
27: , sum(volume) as total_volume
28: from q
29: group by
30: asset_class
31: , customer
Let me know in the comments!
@Jamiet
Further reading:
Views keep your SQL queries DRY - Seth Schroeder
|
-
Just lately I’ve been doing a lot of messing around with Windows Azure the main focus of which has been exploring Azure storage (tables, queues, blobs) and how those things can be manipulated from Azure worker roles. One output from that “messing around” has been my Tweetpoll application (read more: Tweetpoll my first Windows Azure application is live). Since then I’ve found some blog resources that would have been useful during development had I found them earlier so I’m linking to them from here mainly so that I can easily find them again but also so that other people might be made aware of them because there’s some really good content here. Firstly, Dom Green has developed a library that provides rich application tracing capabilities on top of Azure’s bog standard logging mechanisms. Read more at AzureTrace … managing all your Azure Logging Needs. I’ve put together something slightly similar myself recently but its not as feature rich as Dominic’s and, as anyone at Conchango EMC Consulting will tell you, I’m no .Net expert by any stretch of the imagination so I’ll be working to put Dom’s library into my own stuff forthwith. Secondly, Dom linked to a great series of articles from Neil Kidd which cover how we could (and should) build reliable Azure worker roles. Those links: Thank you Dom. Thank you Neil. There’s some really useful material here that I’ll be devouring constantly over the coming weeks and months. @JamieT
|
-
Of late one may have noticed that I have become very interested in high-brow, generally vaporous, disciplines such as RESTful data services, data interoperability and cloud computing whilst coincident with that has been the inexorable rise of the term “mashup” in the information technology lexicon. “Mashup” means different things to different people but to me its simply the practice of combining data from multiple places with the aim of discovering or passing on knowledge that wasn’t known before. Well hey, that sounds a lot like what I do in my day job; the main difference being that I don’t generally hear the term “mashup” being bandied about the London meeting rooms that I frequent to the same extent that it does in the funky web 2.0 and swanky startup world; the term I hear (and use) is the considerably less cool “data integration”. Fundamentally though I don’t think there’s that much difference between the two so maybe enterprise data integration people like myself have something to learn from these so-called mashup players. One of my favourite mashup tools out there is one I’ve spoken about before – Yahoo Pipes. If you haven’t had a look at this it really is worth taking a glance. Yahoo Pipes enables you to extract data from multiple web-based data sources, transform it using a series of operations like sorting, joining, unioning and filtering before finally outputting that transformed data in one of a number of different formats; its a data pipeline for web-based data (A pipeline? Oh, there’s something else I’ve talked about before – noticing a pattern here?). Here is an example of a Yahoo Pipe: Yahoo Finance Stock Quote Watch List Feed w/Chart Extracting. Transforming. Sorting. Unioning. Filtering. Outputting. This Yahoo Pipes thing is starting to sound awfully like ETL tools such as SQL Server Integration Services (SSIS) wouldn’t you say? They even look a little bit like each other with their boxes joined up with lines between them: I’m now reminded of what my good friend Andy Britcliffe of Sharpcloud once said to me upon reading my blog post (and viewing the embedded video) Consuming web services in SSIS 2008 a full two years ago. I distinctly remember Andy’s words on that occasion: “SSIS is the ultimate mashup tool”! I didn’t disagree! Most mashup tools share one common characteristic in that they invariably require someone with some technical nous to set them up in advance so that they can be used by the less tech-savvy amongst us and the same applies in enterprises as well; data is distributed by the IT guys to the information workers and this distribution of data typically takes months whereas the consumers of that data want it available in hours. In both arenas I sense a shift occurring; now the consumers of the data are being empowered to find and interrogate data for themselves and in the enterprise this is happening through the adoption of tools such as Qlikview, Omniscope and (in the near future) Microsoft’s Gemini. I find this to be a fascinating development not because it means there may be less work for me to do (admittedly that would be nice) but because information workers now have the opportunity to be much more productive in their daily jobs and I expect those who invest in learning these new technologies to be the cream that rises to the top of enterprises in the near future. Up until recently I hadn’t been all that interested in Microsoft’s Gemini project, indeed I was very sceptical of it, but as I started to formulate some of the thoughts that I’m writing about here I began to realise how important it will be when it gets released sometime (hopefully) in early 2010. I earlier described mashups as being “the practice of combining data from multiple places with the aim of discovering or passing on knowledge that wasn’t known before” and that description fits very well with Gemini. If you don’t know what Gemini is take a look at this video: That demo glosses over the main point I’m making which is that here we see data that is originally pulled from multiple sources and combined in a familiar place (Excel) where the end user can consume it. The person speaking in the video is Donald Farmer and he has a blog entry with many other links to Gemini resources at Microsoft Project Gemini links. At the top of this email I also talked about how I’m interested in data services, that is data available over the web that we can consume via an API and use for our own knowledge discovery and I was introduced to such a data service just yesterday when listening to Jon Udell’s “Interviews with Innovators” podcast. In the most recent episode Jon interviewed Stephen Willmott whose company 3scale Networks has taken it upon themselves to make data held by the United Nations freely available via a data service to anyone that would like to consume it [UPDATE: Read Jon's own writeup of the interview at Influencing the production of public data]. For example, if you want to know the United Kingdom population’s annual growth rate since 1991, that data is available, for free, at http://undata-api.appspot.com/data/query/Population%20annual%20growth%20rate%20(percent)/United%20Kingdom?user_key=XXXX (you need to sign-up for a free user-key and substitute it for XXXX in order for this query to work) and is returned like so: “Wouldn’t it be cool” I thought, “if I could consume that data inside of Excel using Gemini”, perhaps in this example to combine it with birth rates over the same period to discover if there is a correlation between the two. At the time though I didn’t know if Gemini made it possible to consume data directly from data sources so I went straight to ask the man who would know, the aforementioned Donald Farmer. I contacted Donald over Twitter and here is the conversation that ensued: - Me: @donalddotfarmer Is there a list of data sources types from which #Gemini can get data? Interested in data from web APIs e.g. undata-api.org (link)
- Donald: @jamiet I'll need to check out that site in particular, but we do support Atom feeds. (link)
- Me: @donalddotfarmer Ahh that's good news. How about POX/RSS? Does Gemini allow us to parse it or use XQuery? (link)
- Donald: @jamiet No we don't support XQuery - we just consume Atom feeds as they come - the users can then filter and sort in Gemini (link)
- Me: @donalddotfarmer OK, so Atom only right now. Looking forward to getting hands dirty, think I know what 1st feature request will be :) (link)
Lots of techy abbreviations in there so let me summarise. Gemini will be able to consume data from web services that deliver it in the popular Atom XML dialect (more on Wikipedia) which is great news and no great surprise given that Microsoft announced in February 2008 that Atom would be their XML syndication format of choice going forward (see my blog post Windows Live Dev announcements for a more complete commentary). I happen to know that the United Nations data provided by Stephen Willmott is not currently delivered in Atom format but no matter, at least things are moving in the right direction and as I alluded during my last tweet to Donald I’ll be asking for support for other syndication formats in the future. This has turned into a rather rambling blog post so I’ll call a halt here. As always though I’d be interested to know other people’s thoughts on data services, usage of that data in enterprises or anything else I’ve mentioned herein so if you have any thoughts please leave comments in the space below! -Jamie
|
-
I like to think that most of the stuff I blog about on here is in some way useful but sometimes I just want to put something out there that i think is simply damn cool, this is one of the latter cases. Today I stumbled upon Joe Pruitt’s PoshBing – a Powershell library for Microsoft’s new Bing search engine. Its a wrapper around the Bing API and enables you to call the Bing search engine and consume the results inside the Powershell pipeline. A picture speaks a thousand words so here are some example of what you can do with it: Spellchecking  Translations Basic web search News articles Not sure how useful it is but its all very cool stuff I’m sure you’ll agree (if you’re a fan of Powershell)! Download the Powershell script file containing this good stuff from http://poshbing.codeplex.com/ and happy querying! As an aside, Joe has also built a Twitter wrapper around his Powershell Bing wrapper which means you can now use Twitter to use Powershell to query Bing; simply head for @askbing and ask it whatever you want. Probably not all that useful but nonetheless very very cool. -Jamie Please check out more of my Powershell related blog posts:
|
|
|
|