Many moons ago I wrote a blog entry MDM -> Entity Framework -> ADO.Net Data Services. Better together? where I opined about the inherent problems of using staticly-typed programming languages (such as C#) to interact with schema-less data stores. In that case the problem reared its head when trying to use Entity Framework on top the Microsoft MDM (aka Stratature) data model; Entity Framework assumes that the underlying data model is known at design-time and is constant (i.e. static) but by its very nature the Microsoft MDM data model is always changing.
I’m now experiencing similar problems when building my Powershell client for SQL Data Services (SDS) – SDS Shell (http://www.codeplex.com/sdssh). It is a problem that I think is worth exploring some more because this it is only going to become more and more prevalent as schema-less data stores like SDS, Azure tables, Amazon’s SimpleDB and Google’s BigTable start to take hold.
N.B. Microsoft MDM is not really schema-less, it is more accurately described as schema-varying. Nevertheless the problem remains the same.
Querying SDS entities has a couple of problems in this regard:
- A single query can return a dataset of multiple entity kinds (a kind is roughly analogous to a table in the traditional database scheme that we know and love)
- Each entity of a particular kind can have a completely different set of properties from the next
Its not inconceivable to request the contents of an SDS container and got something back like the following:
| <Book xmlns:s="http://schemas.microsoft.com/sitka/2008/03/" |
| xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" |
| xmlns:x="http://www.w3.org/2001/XMLSchema"> |
| <s:Id>MySampleBook</s:Id> |
| <s:Version>13453</s:Version> |
| <Title xsi:type="x:string">My Book</Title> |
| <ISBN xsi:type="x:string">1-57880-066-36</ISBN> |
| <Author xsi:type="x:string">Mr. Author</Author> |
| <InPrint xsi:type="x:boolean">false</InPrint> |
| <NumberOfCopiesSold xsi:type="x:decimal">250</NumberOfCopiesSold> |
| </Book> |
| <Book xmlns:s="http://schemas.microsoft.com/sitka/2008/03/" |
| xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" |
| xmlns:x="http://www.w3.org/2001/XMLSchema"> |
| <s:Id>MySampleBook2</s:Id> |
| <s:Version>38735</s:Version> |
| <Title xsi:type="x:string">Another Book</Title> |
| <Publisher xsi:type="x:string">Mr. Publisher</Publisher> |
| <InPrint xsi:type="x:boolean">true</InPrint> |
| </Book> |
| <Car xmlns:s="http://schemas.microsoft.com/sitka/2008/03/" |
| xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" |
| xmlns:x="http://www.w3.org/2001/XMLSchema"> |
| <s:Id>Car1</s:Id> |
| <s:Version>86435</s:Version> |
| <Make xsi:type="x:string">Ford</Make> |
| <Model xsi:type="x:string">Focus</Model> |
| </Car> |
| <Car xmlns:s="http://schemas.microsoft.com/sitka/2008/03/" |
| xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" |
| xmlns:x="http://www.w3.org/2001/XMLSchema"> |
| <s:Id>Car2</s:Id> |
| <s:Version>54762</s:Version> |
| <NumberOfWheels xsi:type="x:string">4</NumberOfWheels> |
| </Car> |
| |
| |
That dataset has got two different entity kinds {Book, Car}; there are two Books {MySampleBook, MySampleBook2} each with different properties, and the same goes for the two Cars as well. The problem we have is how to output that dataset to the Powershell command-line in a way that can be understood by the downstream cmdlets. We can output any Object we like (as far as I know a Powershell cmdlet can accept a DataRow just as easily as it can accept a DataSet or an array because all it does is reflect on it) but we need to work out what is the best way to output this data. I conferred with my far more learned colleague James Saull about this and here are the options that we came up with.
Method 1 – Return a property bag
This was inspired by SQL Server 2008’s sparse columns feature. Simply we output each row one by one with a column for all the properties that every entity has {Id, Version, Kind} and output the rest of the properties in a single column (called PropertyBag) as an XML snippet. For the dataset above this would conceptually look like:
| Kind | Id | Version | PropertyBag |
| Book | MySampleBook | 13453 | <Title xsi:type="x:string">My Book</Title> <ISBN xsi:type="x:string">1-57880-066-36</ISBN> <Author xsi:type="x:string">Mr. Author</Author> <InPrint xsi:type="x:boolean">false</InPrint> <NumberOfCopiesSold xsi:type="x:decimal">250</NumberOfCopiesSold> |
| Book | MySampleBook2 | 38735 | <Title xsi:type="x:string">Another Book</Title> <Publisher xsi:type="x:string">Mr. Publisher</Publisher> <InPrint xsi:type="x:boolean">true</InPrint> |
| Car | Car1 | 86435 | <Make xsi:type="x:string">Ford</Make> <Model xsi:type="x:string">Focus</Model> |
| Car | Car2 | 54762 | <NumberOfWheels xsi:type="x:string">4</NumberOfWheels> |
and here’s a screenshot of what it actually returns in Powershell:
Advantages:
Disadvantages:
- Difficult to work with the properties
Method 2 – Return every column for every row regardless of existence
This method requires that we return a column for every property that exists in ANY entity and return NULLs for all entities that don’t have that property. In other words, if I have 100 entities but only one of them has a property “foo”, all entities will return a column called “foo” but 99 of them won’t have a value for it. Here is a conceptual look at how our dataset appears to the Powershell pipeline under this method:
| Id | Version | Title | ISBN | Author | InPrint | NumberOfCopiesSold | Publisher | Make | Model | NumberOfWheels |
| MySampleBook | 13453 | My Book | 1-57880-066-36 | Mr. Author | false | 250 | | | | |
| MySampleBook2 | 38735 | Another Book | | | true | | Mr. Publisher | | | |
| Car1 | 86435 | | | | | | | Ford | Focus | |
| Car2 | 54762 | | | | | | | | | 4 |
That’s far from ideal because we open ourselves up to the problem of immense sparsity in our dataset; the more rows we have the more properties can exist thus the more columns returned in our dataset, the majority of which are empty. In actual fact we can be slightly more cute about it because we can build groups of properties per Kind rather than the whole dataset but that only really helps for presentation purposes as we can see in the following screenshot of this method being implemented in Powershell:
In the screenshot you can see that I’ve made two calls to my cmdlet; on the first occasion I let Powershell naturally format the data as a list but in the second I push the data to the Format-Table cmdlet. You can see that when output as a list I have managed to group the properties by kind thus limiting the sparsity however when formatted as a table I still get sparsity across the whole dataset (e.g. a ‘Book’ with property ‘NumberOfWheels’). Not really what I want but there’s not a lot we can do about it!
Advantages:
Disadvantages
At the time of writing I’m not sure which option to go with and SDS Shell is a long way from finished so I don’t have to make a choice right now. Maybe I’ll offer the ability to do both, any thoughts or advice on that? Do you have a strategy for dealing with schema-less datasets and/or sparsity?
In practice of course this is unlikely to be a problem because we assume that most entities will have a known and uniform format (won’t they?). However, while SDS can return data as demonstrated in this blog entry then SDS Shell has to be able to handle it.
It must also be said that with the advent of dynamic typing in C# 4.0 we’ll have another option for dealing with schema-less data stores but until then we have to do what we can.
-Jamie