SQL Server 2008 is introducing some interesting querying and storage enhancements such as the introduction of spatial data and the FILESTREAM attribute on VARBINARY(MAX) columns and my thoughts have recently turned to what querying enhancements we are going to see in future versions. I was chatting to Rob Farley recently and we started to wonder aloud about whether the database engine might one day support fuzzy querying methods in the engine (Rob spoke about it here).
What do we mean by fuzzy querying? Well I'll try to put this as close to lay terms as possible. Fuzzy querying allows us to compare two things that aren't the same but are a little bit the same. Are you any the wiser? Perhaps an example might suffice.
Take the following list of values:
Now as far as SQL Server 2008 is concerned those five names are all different, that's just how SQL works. Something either is true or it isn't. Yay or nay. Zero or one. Clearly though we as humans can see there is some sort of similarity between those five names and that is important information to know; its not hard to imagine a class of applications that could make use of this ability to know that things are similar as opposed to being the same. For example, perhaps I am maintaining a customer database for which different telesales operators have the ability to enter data. Its not inconceivable that different people may enter data for the same customer except with slightly different spellings and I am going to want to know about that. Fuzzy querying would allow us to check for these anomalies.
Those of you that have used SSIS 2005 will know that Fuzzy Querying exists in that product as provided by the Fuzzy Lookup and Fuzzy Grouping components however I personally don't want to have to call out to a SSIS package to process data that is already in my database. Rob posted a submission to Connect requesting fuzzy querying behaviour in the database engine so if you think this is an important feature go here and vote for it.
Following on from the basic ability to query on similarity, I wonder if indexing on similarity could be possible? Take the example above. Placing an index on those five values today wouldn't result in them appearing contiguously in the index however if we indexed on similarity...perhaps they would. Something to think about.
OK, so that covers off fuzzy querying of textual values. Looking even further into the future I wonder if there are even more complex scenarios that could leverage this inate ability to know that two things are similar. Here's a scenario to consider:
I use a free tool called Windows Live Photo Gallery (WLPG) to manage my personal photo library and one of the features of this tool is the ability to tag photos (WLPG uses SQL Server CE under the covers by the way to manage all of the photo data). For example, I might tag all of the photos in which my fiance, Helen, appears.
Its occurred to me before that a really great feature of WLPG would be auto-tagging. That is, the ability to compare a batch of new photos with all my existing photos and then discover that Helen appears in some of my new photos and tag them appropriately.
Here again we are comparing things (images rather than text this time) based on similarity. It may seem a bit far-fetched based on our pre-conceptions of what a RDBMS does but I see no reason why the ability to compare images to discover whether a person appears in both of them shouldn't be a function of the database rather than a standalone tool. Its clear in SQL Server 2008 that Microsoft are taking the product beyond what we traditionally think a RDBMS should do so the capabilities I'm talking about here seem, to me, to be a logical extension.
What do you think?
P.S. My colleague Peter Hanlon has an interesting perspective on fuzzy querying where he considers the impact that it could have on house buying. Read it here.