
SSIS Junkie

Peering into the SQL Azure documentation

Some documentation for SQL Azure has been released on MSDN at http://msdn.microsoft.com/en-us/library/ee336279.aspx and, upon reading it, there are some interesting nuggets of information that are worth calling out.

Under a section headed “Key benefits of the service” the following statement appears:

Scalability

A key advantage of SQL Azure is the ease with which you can scale your solution. After partitioning your data, the service scales as your data grows.

Hmmm…there are two seemingly contradictory notions in there: “ease” and “partitioning your data”. Last time I looked there was nothing particularly easy about partitioning data, certainly not when it is going to be partitioned across multiple databases. I wonder if the captain of the Titanic proudly declared “Sailing to New York will be easy once we’re past all those meddlesome icebergs”!

Another name for this partitioning technique is sharding, for which Dare Obasanjo has a great discussion up at Building Scalable Databases: Pros and Cons of Various Database Sharding Schemes. Dare says database sharding is:

the process of splitting up a database across multiple machines to improve the scalability of an application

which is exactly what SQL Azure proposes, except that they use the word “partitioning” rather than “splitting” and it’s across multiple virtual instances rather than multiple machines.
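To see what that “easy” partitioning actually means in practice, here is a minimal sketch (in Python, purely illustrative - the shard count, connection strings and routing scheme are all my own invention, not anything prescribed by SQL Azure) of what your application ends up owning once the data is split across databases: a routing function that decides which database a given key lives in.

    # Minimal sketch of application-side shard routing (illustrative only).
    # The connection strings and shard count are hypothetical; with SQL Azure it
    # is the application, not the service, that decides where a row lives.
    import zlib

    # One connection string per shard - each one is a separate SQL Azure database.
    SHARD_CONNECTION_STRINGS = [
        "Server=tcp:server0.database.windows.net;Database=Catalogue0;...",
        "Server=tcp:server1.database.windows.net;Database=Catalogue1;...",
        "Server=tcp:server2.database.windows.net;Database=Catalogue2;...",
    ]

    def shard_for(key: str) -> str:
        """Map a business key (e.g. a ProductId) to the shard that stores it."""
        index = zlib.crc32(key.encode("utf-8")) % len(SHARD_CONNECTION_STRINGS)
        return SHARD_CONNECTION_STRINGS[index]

    # Every data access call now has to ask the router where to go first, e.g.
    # connection = pyodbc.connect(shard_for("PRODUCT-12345"))

Note that adding a fourth shard changes the result of the modulo for most keys, so even this trivial scheme drags in the problem of re-balancing existing data - none of which feels especially “easy”.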

Dare talks about numerous disadvantages of sharding but the key one is that you lose the benefit of referential integrity (RI) which is, to my mind, THE main reason for using a relational database in the first place. Let’s take a look at a simple example. Suppose (incredibly hypothetically) that Amazon adopted SQL Azure and chose to shard its product catalogue across multiple instances; they would then face the very real situation where it is not possible to have full RI between their products, customers and the orders that link them together. An order would most likely contain products from different shards, so which shard should the order live on? Maybe Amazon could shard the order line items by product, but how would they enforce RI back to the order header and, onward, to the customer? Clearly Amazon can’t enforce RI across shards, so they would have to use some different data storage mechanism, in which case why bother using SQL Azure at all?
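To put that in concrete terms: a foreign key can only reference a table in the same database, so once products and orders live in different shards the integrity check has to move into application code. A sketch of what that looks like, using the hypothetical shards and shard_for() router from the earlier snippet plus a made-up Products/OrderLines schema (pyodbc is assumed purely for illustration):

    # Referential integrity enforced by hand in the application, because the
    # database can no longer do it across shards. Schema and routing are
    # hypothetical; builds on SHARD_CONNECTION_STRINGS / shard_for() from the
    # earlier sketch.
    import pyodbc

    def add_order_line(order_id: int, product_id: str, quantity: int) -> None:
        # 1. The "foreign key" check: does the product exist in its own shard?
        with pyodbc.connect(shard_for(product_id)) as product_conn:
            exists = product_conn.execute(
                "SELECT 1 FROM dbo.Products WHERE ProductId = ?", product_id
            ).fetchone()
            if exists is None:
                raise ValueError("Unknown product: " + product_id)

        # 2. Insert the order line into the shard that holds the order. Nothing
        #    stops the product being deleted between step 1 and step 2; that
        #    race is the price of giving up declarative RI.
        with pyodbc.connect(shard_for(str(order_id))) as order_conn:
            order_conn.execute(
                "INSERT INTO dbo.OrderLines (OrderId, ProductId, Quantity) "
                "VALUES (?, ?, ?)",
                order_id, product_id, quantity,
            )
            order_conn.commit()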

The Amazon example is an extreme case but it’s one that you have to consider when using SQL Azure, and the problem is exacerbated when you consider that the maximum size of a database in SQL Azure is 10GB. “Oh, you want to store your new product in my database? You say it’ll only use up 5KB of storage? No can do, you’re maxed out! Sorry!” (Although arguably the 10GB limit is an advantage because at least your shard’s maximum size is an absolute limit rather than a theoretical one.)
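Just to illustrate the sort of housekeeping that cap forces on you: the application has to keep an eye on how close each shard is to 10GB and re-balance (or refuse writes) before one fills up. A rough sketch, again reusing the hypothetical shard list from above and assuming sys.dm_db_partition_stats is available for measuring the space a database is using:

    # Watching the 10GB cap per shard (illustrative only). Sums the reserved
    # 8KB pages in each shard database via sys.dm_db_partition_stats.
    import pyodbc

    SHARD_SIZE_LIMIT_BYTES = 10 * 1024 ** 3   # the 10GB cap discussed above

    SIZE_QUERY = """
        SELECT SUM(reserved_page_count) * 8192.0   -- pages are 8KB
        FROM sys.dm_db_partition_stats
    """

    def shard_usage() -> dict:
        """Return the fraction of the cap used by each shard."""
        usage = {}
        for conn_str in SHARD_CONNECTION_STRINGS:
            with pyodbc.connect(conn_str) as conn:
                size_bytes = conn.execute(SIZE_QUERY).fetchone()[0] or 0
            usage[conn_str] = float(size_bytes) / SHARD_SIZE_LIMIT_BYTES
        return usage

    # e.g. trigger a re-shard (itself no small undertaking) once any shard
    # passes ~90% of the cap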

There are of course solutions to these problems and they generally involve writing your application to work around these limitations. Doing that is, however, an expensive and time-consuming undertaking; it is something anyone adopting SQL Azure needs to be aware of, and you won’t find it written down anywhere in the SQL Azure documentation.
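For example, a query that used to be a single SELECT against a single database becomes a fan-out: the application (or a data access layer you now have to write and maintain yourself) runs the same statement against every shard and stitches the results back together, re-implementing any cross-shard ordering, paging or joining along the way. A sketch, with the same caveats as before:

    # Fan-out query across all shards (illustrative only); reuses the
    # hypothetical SHARD_CONNECTION_STRINGS list from the earlier sketch.
    import pyodbc

    def query_all_shards(sql: str, *params):
        """Run the same statement against every shard and concatenate the rows."""
        rows = []
        for conn_str in SHARD_CONNECTION_STRINGS:
            with pyodbc.connect(conn_str) as conn:
                rows.extend(conn.execute(sql, *params).fetchall())
        return rows

    # Usage: find every order containing a given product, wherever it lives.
    # matching = query_all_shards(
    #     "SELECT OrderId FROM dbo.OrderLines WHERE ProductId = ?", "PRODUCT-12345"
    # )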

Be aware, that’s all I’m saying! If your database is not going to grow anywhere near 10GB then SQL Azure might well be a good fit for you.

@Jamiet

Published Tuesday, August 04, 2009 10:54 AM by jamie.thomson

Comments

 

simon.munro said:

Careful Jamie! I have discovered that if you point this out too much you can become unpopular very quickly.  Then again, maybe you have a more friendly demeanour than me.

I asked this question on the forums and it remained unanswered for 4 months.

http://social.msdn.microsoft.com/Forums/en-US/ssdsgetstarted/thread/01cf09ea-34dd-4431-97bd-c9ed41ded1eb

Answered, and marked as answered, by the moderator. The long answer (does SDS scale beyond 10GB per instance?) is 'partitioning'; the short answer is 'no'.

August 4, 2009 11:40 AM
 

James.Rowland-Jones said:

At the scale of Amazon it's quite normal to lose in order to gain. I believe Amazon were strong adopters of Brewer's CAP theorem, so I expect they gave up on consistency in favour of availability and partitioning some time ago.

Simon's right - the current generation of cloud offerings probably do require too much physical knowledge to be considered "pure cloud".

As for the 10GB limit, I see this in a slightly different light. To me it's a building block - a manual RAC-type solution, if you will. Azure is crying out for a layer over the top to fan the data out dynamically, and the person who builds this most effectively will probably have Microsoft knocking on their door soon after. Dynamic positioning of data on SQL Azure 1.0? No. Dynamic positioning of data on SQL Azure 3.0? Now you are talking.

August 4, 2009 2:59 PM
 

simon.evans said:

Hi Jamie

I don't think it's fair to say that if your database is over 10GB then SQL Azure is not for you. All the limitations of SQL Azure really mean is that you need to think carefully about your partition strategy in order to scale effectively. But can SQL Azure achieve mass scale? I think so.

Partitioning has always been a reasonable strategy for dealing with scaling issues in SQL Server; the difference here is that you have no other option, because you can't scale up the hardware SQL Azure is running on.

I think the feature missing from SQL Azure is the ability to treat all your partitions as a single SQL database - to automatically query against the whole partition set. Maybe this is a feature request for v.Next of SQL Azure?

Funny thing is, the same constraints apply to Windows Azure table storage, which forces partitioning on you - but no one seems to be complaining about that.

August 7, 2009 11:47 AM