Welcome to EMC Consulting Blogs Sign in | Join | Help

SSIS Junkie

Once upon a time this blog was a hive of activity. Now however its pretty lifeless as you can probably tell so if are pining for more of the same you can find me over at http://sqlblog.com/blogs/jamie_thomson. I look forward to seeing you there!

The perils of staticly-typed languages and schema-less data stores

Many moons ago I wrote a blog entry MDM -> Entity Framework -> ADO.Net Data Services. Better together? where I opined about the inherent problems of using staticly-typed programming languages (such as C#) to interact with schema-less data stores. In that case the problem reared its head when trying to use Entity Framework on top the Microsoft MDM (aka Stratature) data model; Entity Framework assumes that the underlying data model is known at design-time and is constant (i.e. static) but by its very nature the Microsoft MDM data model is always changing.

I’m now experiencing similar problems when building my Powershell client for SQL Data Services (SDS) – SDS Shell (http://www.codeplex.com/sdssh). It is a problem that I think is worth exploring some more because this it is only going to become more and more prevalent as schema-less data stores like SDS, Azure tables, Amazon’s SimpleDB and Google’s BigTable start to take hold.

N.B. Microsoft MDM is not really schema-less, it is more accurately described as schema-varying. Nevertheless the problem remains the same.

Querying SDS entities has a couple of problems in this regard:

  1. A single query can return a dataset of multiple entity kinds (a kind is roughly analogous to a table in the traditional database scheme that we know and love)
  2. Each entity of a particular kind can have a completely different set of properties from the next

Its not inconceivable to request the contents of an SDS container and got something back like the following:

<Book         xmlns:s="http://schemas.microsoft.com/sitka/2008/03/"
              xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
              xmlns:x="http://www.w3.org/2001/XMLSchema">
      <s:Id>MySampleBook</s:Id>
      <s:Version>13453</s:Version>
      <Title xsi:type="x:string">My Book</Title>
      <ISBN xsi:type="x:string">1-57880-066-36</ISBN>
      <Author xsi:type="x:string">Mr. Author</Author>
      <InPrint xsi:type="x:boolean">false</InPrint>
      <NumberOfCopiesSold xsi:type="x:decimal">250</NumberOfCopiesSold>
</Book>
<Book         xmlns:s="http://schemas.microsoft.com/sitka/2008/03/"
              xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
              xmlns:x="http://www.w3.org/2001/XMLSchema">
      <s:Id>MySampleBook2</s:Id>
      <s:Version>38735</s:Version>
      <Title xsi:type="x:string">Another Book</Title>
      <Publisher xsi:type="x:string">Mr. Publisher</Publisher>
      <InPrint xsi:type="x:boolean">true</InPrint>
</Book>
<Car         xmlns:s="http://schemas.microsoft.com/sitka/2008/03/"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xmlns:x="http://www.w3.org/2001/XMLSchema">
      <s:Id>Car1</s:Id>
      <s:Version>86435</s:Version>
      <Make xsi:type="x:string">Ford</Make>
      <Model xsi:type="x:string">Focus</Model>
</Car>
<Car         xmlns:s="http://schemas.microsoft.com/sitka/2008/03/"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xmlns:x="http://www.w3.org/2001/XMLSchema">
      <s:Id>Car2</s:Id>
      <s:Version>54762</s:Version>
  <NumberOfWheels xsi:type="x:string">4</NumberOfWheels>
</Car>
 
 

That dataset has got two different entity kinds {Book, Car}; there are two Books {MySampleBook, MySampleBook2} each with different properties, and the same goes for the two Cars as well. The problem we have is how to output that dataset to the Powershell command-line in a way that can be understood by the downstream cmdlets. We can output any Object we like (as far as I know a Powershell cmdlet can accept a DataRow just as easily as it can accept a DataSet or an array because all it does is reflect on it) but we need to work out what is the best way to output this data. I conferred with my far more learned colleague James Saull about this and here are the options that we came up with.

 

Method 1 – Return a property bag

This was inspired by SQL Server 2008’s sparse columns feature. Simply we output each row one by one with a column for all the properties that every entity has {Id, Version, Kind} and output the rest of the properties in a single column (called PropertyBag) as an XML snippet. For the dataset above this would conceptually look like:

Kind Id Version PropertyBag
Book MySampleBook 13453

  <Title xsi:type="x:string">My Book</Title>
  <ISBN xsi:type="x:string">1-57880-066-36</ISBN>
  <Author xsi:type="x:string">Mr. Author</Author>
  <InPrint xsi:type="x:boolean">false</InPrint>
  <NumberOfCopiesSold xsi:type="x:decimal">250</NumberOfCopiesSold>

Book MySampleBook2 38735

  <Title xsi:type="x:string">Another Book</Title>
  <Publisher xsi:type="x:string">Mr. Publisher</Publisher>
  <InPrint xsi:type="x:boolean">true</InPrint>

Car Car1 86435

  <Make xsi:type="x:string">Ford</Make>
  <Model xsi:type="x:string">Focus</Model>

Car Car2 54762

  <NumberOfWheels xsi:type="x:string">4</NumberOfWheels>

and here’s a screenshot of what it actually returns in Powershell:

powershell cmdlet output

Advantages:

  • Uniform structure

Disadvantages:

  • Difficult to work with the properties

Method 2 – Return every column for every row regardless of existence

This method requires that we return a column for every property that exists in ANY entity and return NULLs for all entities that don’t have that property. In other words, if I have 100 entities but only one of them has a property “foo”, all entities will return a column called “foo” but 99 of them won’t have a value for it. Here is a conceptual look at how our dataset appears to the Powershell pipeline under this method:

Id Version Title ISBN Author InPrint NumberOfCopiesSold Publisher Make Model NumberOfWheels
MySampleBook 13453 My Book 1-57880-066-36 Mr. Author false 250        
MySampleBook2 38735 Another Book   true   Mr. Publisher      
Car1 86435             Ford Focus  
Car2 54762                 4

 

That’s far from ideal because we open ourselves up to the problem of immense sparsity in our dataset; the more rows we have the more properties can exist thus the more columns returned in our dataset, the majority of which are empty. In actual fact we can be slightly more cute about it because we can build groups of properties per Kind rather than the whole dataset but that only really helps for presentation purposes as we can see in the following screenshot of this method being implemented in Powershell:

powershell cmdlet output

In the screenshot you can see that I’ve made two calls to my cmdlet; on the first occasion I let Powershell naturally format the data as a list but in the second I push the data to the Format-Table cmdlet. You can see that when output as a list I have managed to group the properties by kind thus limiting the sparsity however when formatted as a table I still get sparsity across the whole dataset (e.g. a ‘Book’ with property ‘NumberOfWheels’). Not really what I want but there’s not a lot we can do about it!

Advantages:

  • Every value atomic

Disadvantages

  • Sparsity

 

At the time of writing I’m not sure which option to go with and SDS Shell is a long way from finished so I don’t have to make a choice right now. Maybe I’ll offer the ability to do both, any thoughts or advice on that? Do you have a strategy for dealing with schema-less datasets and/or sparsity?

 

In practice of course this is unlikely to be a problem because we assume that most entities will have a known and uniform format (won’t they?). However, while SDS can return data as demonstrated in this blog entry then SDS Shell has to be able to handle it.

 

It must also be said that with the advent of dynamic typing in C# 4.0 we’ll have another option for dealing with schema-less data stores but until then we have to do what we can.

-Jamie

Published 14 January 2009 19:35 by jamie.thomson

Comments

 

Reflective Perspective - Chris Alcock » The Morning Brew #266 said:

January 16, 2009 08:11
 

SSIS Junkie said:

I like to think that most of the stuff I blog about on here is in some way useful but sometimes I just

June 9, 2009 21:43
New Comments to this post are disabled

This Blog

Syndication

Powered by Community Server (Personal Edition), by Telligent Systems