It was pointed out to me recently that my critical position on SQL Data Services (SDS) and my support for Azure in general are inconsistent, so I thought a bit of clarification was required.
SQL Server is my database of choice and I have been using it since version 4.21, long before it became mainstream. I even spent a couple of years on Oracle to make sure that I had broader database experience, and breathed a sigh of relief when I returned to the SQL Server world. As a fan of SQL Server I am really supportive of any efforts by Microsoft to put SQL Server into the cloud and was very supportive of their SDS initiative; imagine having a database platform that was both cloud capable and on-premise – it would be the only one in the market. As information has been made available about SDS, it seems that it has hit the scale-out problem of the SQL (relational) model and has no way of managing consistency and partition tolerance easily (at all?). The cloud idea of scalability by adding additional nodes as required has been lost on the SDS team and they seem to remain quiet on the issue (my question on the forums from April 2009 is still unanswered).
Now I don’t have a problem with the (lack of) scalability of SDS per se, but rather with the misleading marketing and evangelising that SDS is a relational database for the cloud, and the implication that you develop as you normally would for a SQL database, without clearly outlining the risks associated with the lack of simple scalability. In a Tech-Ed video earlier this year, members of the SDS team jovially discuss the approaches to handling scalability on SDS, which come down to two things: 1) database sharding and 2) tools/frameworks that Microsoft is working on to make it easier to shard a database. In my opinion, (SQL/relational) database sharding as a starting point is bad design (you get pain and suffering without the ACID benefits of SQL), and waiting for Microsoft to develop some tools and frameworks is hopeful and risky.
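To make the sharding concern concrete, here is a minimal sketch in Python of key-based shard routing. It is only an illustration under stated assumptions: the three in-memory dictionaries stand in for separate database instances, and `shard_for`, `save_order` and `total_sales` are hypothetical names, not part of SDS or any Microsoft framework.

```python
import hashlib

# Hypothetical stand-ins for three separate database instances.
SHARDS = [dict() for _ in range(3)]


def shard_for(customer_id: str) -> dict:
    """Pick a shard by hashing the key; a stable hash keeps routing repeatable."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return SHARDS[int.from_bytes(digest[:4], "big") % len(SHARDS)]


def save_order(customer_id: str, order_id: str, total: float) -> None:
    # A write for one customer touches exactly one shard, so it stays simple.
    shard_for(customer_id).setdefault(customer_id, {})[order_id] = total


def total_sales() -> float:
    # A question that spans customers has to visit every shard. No single
    # transaction covers all of them, so each shard is read at a different
    # moment and the result is not an ACID-consistent snapshot.
    return sum(
        total
        for shard in SHARDS
        for orders in shard.values()
        for total in orders.values()
    )
```

Even in this toy form, the routing, the fan-out query and the missing cross-shard transaction all live in application code, which is the work the data access layer inherits once a database is sharded.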
Theoretical and academic discussions aside, there is one simple problem with building an Azure app that uses SDS as its primary persistence mechanism. If your business gets mentioned on Oprah and suddenly makes it big, it will hit the wall. You can spin up lots of Windows Azure Hosted Services but only one SDS service, and the load on that lonely, stressed-out instance of SDS will render the application useless. Part of the attraction of the cloud in general and Azure in particular is the ability to lower startup costs by hosting on the bare minimum and being able to scale as needed and on demand (almost). If you do hit the wall on SDS you have two options: 1) rewrite the data access layer to use sharding or the mythical framework that is on its way, or 2) move the entire application on-premise (which may be difficult because of the dependency on the Azure fabric). Neither option will be quick or easy to implement, and opportunities will be missed.
The Azure platform (hosted services and storage) does not have the same scalability problems and is (seemingly) engineered with regard to scale-out, as has been learned in stateless web farms and message-oriented systems over the years. Azure also forces developers to architect their applications in such a way that scalability is built in, by placing restrictions on access to disk and other processes and so removing most of the urge to implement cloud-unfriendly practices. The focus on worker processes, queues, Azure storage and RESTful storage styles biases implementations towards more service-oriented styles which, almost by definition, can handle scalability. This is what I find most attractive about Azure – provided we can get developers to think a little bit differently about how to process data, Azure provides the platform for massive (and painless) scalability.
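The queue-and-worker style described above can be sketched roughly as follows. This is a local illustration only, assuming Python's standard `queue` and `threading` modules as stand-ins: `work_queue` plays the role of an Azure queue, the `storage` dictionary plays the role of Azure storage, and `worker` and `run` are hypothetical names rather than Azure APIs.

```python
import queue
import threading

# Hypothetical stand-ins: a local queue for an Azure queue,
# and a dict for Azure storage.
work_queue: queue.Queue = queue.Queue()
storage: dict = {}
storage_lock = threading.Lock()


def worker() -> None:
    """Stateless worker: everything it needs arrives in the message."""
    while True:
        message = work_queue.get()
        if message is None:  # sentinel: shut this worker down
            work_queue.task_done()
            return
        key, value = message
        with storage_lock:
            storage[key] = value
        work_queue.task_done()


def run(messages, worker_count: int) -> None:
    """Process all messages using worker_count identical workers."""
    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for message in messages:
        work_queue.put(message)
    for _ in threads:
        work_queue.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()
```

Because the workers hold no state of their own, handling more load is a matter of raising `worker_count`, for example `run(updates, worker_count=8)`; nothing in the worker code changes.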
In all the cloud discussions, not only at Microsoft, persistence is the elephant in the room that everyone seems to ignore. Current applications and architectures have a high dependency on ACID data operations and all of the goodness that comes from using a SQL database. Although it is generally known that SQL databases don’t scale out very well, it seems that the competitor products are immature (at least in addressing a larger problem space), so their vendors would rather keep quiet than point fingers at the entrenched database vendors who, in turn, don’t want to highlight scalability problems. My issue is that these problems are ignored at the cost of the customer, who eventually comes across problems that everybody seemed to know about the whole time. So it is imperative that Microsoft talk openly and honestly with customers about the architectural considerations that need to be made on Azure, and not just target small businesses who don’t have the skills in-house to ask the uncomfortable questions.
I have no doubt that Microsoft has the engineering skills to provide tools, technologies and frameworks to build cloud-oriented data storage mechanisms, even if it is on a SQL model (Madison is, after all, a data sharding architecture). There are clever people like Pat Helland who have a lot of experience in building distributed systems (although he is working more with unstructured data now, I think). I sincerely hope that Microsoft gives the engineers a chance to build what is needed and does not just leave it up to the marketers and the customers. After all, the potential technical barrier is far greater than whether or not Outlook renders HTML with Word.
So, if you are a small business and evaluating Azure and SDS, take note of the following:
- Data consistency (as offered for free by SQL databases) is not always needed.
- Any Azure application should make use of a number of persistence mechanisms depending on the need:
  - SDS – where consistency is required
  - Azure storage (Tables/Blobs) for high volume and high throughput operations
  - Cache – to reduce the load on the primary persistence services
  - Client side data – particularly with fatter clients like Silverlight
- Azure applications should make use of worker processes as much as possible and queue updates/requests against the storage mechanism if possible.
- Sharding should be considered as a last resort and should not be generally applied.
- Considerations should be made for cloud and on-premise storage – stale and historical data can (and should) be moved to less congested storage mechanisms.
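As one illustration of the caching point in the list above, here is a minimal cache-aside sketch in Python. It is written under assumptions of my own: `CacheAside` is a hypothetical class, and the `load` callable stands in for whatever read is made against the primary store (SDS or otherwise).

```python
from typing import Callable, Dict


class CacheAside:
    """Serve reads from a local cache and fall back to the primary store on a miss."""

    def __init__(self, load: Callable[[str], str]) -> None:
        self._load = load  # hypothetical read against the primary store
        self._cache: Dict[str, str] = {}
        self.loads = 0  # how many reads actually reached the primary store

    def get(self, key: str) -> str:
        if key not in self._cache:
            self.loads += 1
            self._cache[key] = self._load(key)
        return self._cache[key]

    def invalidate(self, key: str) -> None:
        # Drop the cached copy so the next read goes back to the primary store.
        self._cache.pop(key, None)
```

Repeated reads of the same key reach the primary store only once, which is where the reduced load comes from. The price is that a cached value can be stale until `invalidate` is called, so this only suits data where strict consistency is not needed.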
Disclaimer: My opinions are formed by experience and publicly accessible information. I have no access to MVP, partner or private beta materials and Microsoft (for obvious reasons) doesn’t talk to me directly.
Simon Munro
@simonmunro