Just recently I was doodling in Visual Studio, writing some .Net code (dangerous, I know). The stuff I was doing wasn’t particularly process intensive but it WAS data intensive; by that I mean I was looping over a collection of entities (let’s pretend that the entities were vehicles sales) and producing lots of statistics about those entities. For each vehicle I built an object VehicleSale each of which was stored in a List(T), hence I ended up with a List<VehicleSale>.
Huh? A what? Ok, I’ll rephrase that in a way that you data guys can understand. I was basically building a denormalised list of vehicle sales and it looked something like this:
| Type | Make | Model | OwnerName | OwnerHomeCity | Colour | Date | PurchasePrice |
| Car | Ford | Focus | John James | London | Red | 16/03/2009 | 14000 |
| Motorbike | Honda | 599 | John James | London | Blue | 02/03/2008 | 10000 |
| Car | Reliant | Robin | Simon Smith | Hull | Blue | 26/01/2009 | 6500 |
| … | … | … | … | … | … | … | … |
| … | … | … | … | … | … | … | … |
| Bicycle | Emmelle | Phantom | Simon Smith | Hull | Black | 14/08/2008 | 50 |
| Van | Ford | Transit | Henry Hodge | Bristol | Blue | 26/09/2008 | 18000 |
You get the idea!
[N.B. By the way the data, and indeed the scenario, are figments of my imagination. This is purely for demo purposes]
I find myself querying this data structure to ask a number of questions:
- How many cars have been sold in London?
- What is the most popular type of vehicle sold?
- What is the least popular colour?
- How many cars were sold in March 2009 versus March 2008?
- What was the value of sales in 2008?
This is a classic Business Intelligence (BI) scenario. In the SQL Server BI circles in which I hang out we typically put this data into a multidimensional database (aka Analysis Services) which pre-aggregates the data for us and hence enables us to answer those types of questions a lot quicker than it would ordinarily take if the data simply resided in a relational database. Analysis Services is after all nothing much more than an aggregation engine (not that I want to belittle that capability, its very very valuable).
In this case however the use of Analysis Services or any other multidimensional database product is not an option. My data isn’t in a database table, in fact it isn’t even physically stored anywhere. Its purely an in-memory data structure (effectively an unindexed heap) that I am building and querying on-the-fly and which will cease to exist once my application stops running. If the amount of data that I am storing in this data structure becomes suitably large then I would expect query time to increase exponentially, eventually rendering my application unusable.
And so that brings me onto the point of this blog entry. I started to consider whether there was a gap in the BI market; is there a need for a .Net object into which we can add data (just like we can a List(T)) but that will automatically aggregate the data for us as we add data into it? I envisage such a utility could be incredibly powerful especially if the data aggregations are immediately consistent because then you’re moving into the realm of real-time BI.
Does anyone have any immediate thoughts on this you’d like to share? I’m only throwing some vague ideas out here and I’d like to know if anyone thinks this is an interesting proposition or not, for all I know there might already be something that does this. Either way, share your thoughts in the comments.
-Jamie