Welcome to EMC Consulting Blogs Sign in | Join | Help

SSIS Junkie

Once upon a time this blog was a hive of activity. Now however its pretty lifeless as you can probably tell so if are pining for more of the same you can find me over at http://sqlblog.com/blogs/jamie_thomson. I look forward to seeing you there!

SSIS: Adventures with Fuzzy Matching

I have been doing some work with the Fuzzy Matching transformation and as anyone that has done any form of fuzzy logic before will know, it is far from being an exact science.

Fuzzy logic for those that don't know can be defined in many ways:
"The process of reaching conclusions based on information and facts that are not 100 percent certain." - www.stanford.edu/~buc/SPHINcsX/bkhm15.htm
"A way of dealing with uncertain information and variables that do not permit simple yes/no categorisations (e.g. colour). Can also be used to make decisions where uncertainty occurs." - www.calresco.org/glossary.htm
"Fuzzy logic is applied to fuzzy sets where membership in a fuzzy set is a probability, not necessarily 0 or 1. Non-fuzzy logic manipulates outcomes that are either true or false. Fuzzy logic needs to be able to manipulate degrees of "maybe" in addition to true and false." - www.twocrows.com/glossary.htm

Or here's a back-of-the-***-packet description courtesy of yours truly:
"Fuzzy logic is a method used for approximating things that are not exact"


All of these descriptions are a bit fuzzy (pardon the extraordinarily bad pun) so lets apply a bit of context, namely some SSIS context.

SSIS provides 2 transformations that provide fuzzy computation, the Fuzzy lookup transform and the Fuzzy Matching transform. The Fuzzy Lookup transform allows you to match records in a dataset to records in a secondary set that are approximately the same. The Fuzzy Matching transform enables you to group records together that are approximately the same.

I have recently been using the fuzzy matching component a little and I thought it might be worth sharing my experiences.
I have been using fuzzy matching in a scenario of processing web server log files. I wanted to use fuzzy matching to say, for example, a referrer of http://www.google.com should be considered the same as http://www.google.co.uk which should be considerd the same as http://www.google.net etc... All I care about is that someone visited my site from Google.

The Fuzzy Matching algorithm attempts to match all records that are suitably similar according to a similarity "score". The algorithm requires that some temporary tables are created in SQL Server in order to determine this "score" and then to achieve the matching.

I had 154237 records flowing through my Fuzzy Matching transform and herein arose the first problem. There were many many identical records in my dataset. For example, the web server logs capture data from http://www.conchango.com and therefore we would expect that there would be many records with a referrer of "www.conchango.com", and indeed there were. Other sites also appeared many many times. The Fuzzy Matching algorithm attempts to group all records that suitably similar regardless of how many times a record may be considered familar to something else. Before you chew that statement over let me, more simply, tell you how it manifested itself. One of the temporary tables that I mentioned before had over 21million records in it when I pulled the plug. Goodness knows how big it would have grown if I hadn't. This resulted in approximately 10GB of hard disk being used by the database file and transaction file and up to that point had taken about 15 minutes to execute.

I needed to cut down the amount of data being processed by my Fuzzy Matching transform without losing any accuracy. My approach was to use a Multicast transform to seperate the data flow into 2 identical flows. I sent one of the flows to a sort transform to eliminate duplicates (my recordset was reduced from 154237 records to 236 records), sent that data through the fuzzy matching transformation and then join to the other flow using the original referrer value and insert the 154237 records into my destination table, complete with the "cleaned" referrer.

The screenshot here shows my data flow. The sort transformation prior to the join are required because the Merge Join transformation requires its inputs to be sorted.

This worked and it worked well. The whole data flow executed in seconds and it successful eliminated records such as http://www.google.com./ and replaced it with http://www.google.com

So, my first tip when using the fuzzy logic transforms in SSIS is to work with as small a data set as possible. You can use the technique I utilised here, that of eliminating duplicates, or any of a number of other possible techniques.


I wasn't satisfied with the results though. I still had 50 "cleaned" referrers that had the word "Google" in them. When you think that there are sites such as http://images.google.com, http://gmail.google.com as well as the obvious http://www.google.com and on top of that there are .com, .co.uk, .de, .fr, .es etc etc... (in fact most countries in the world) you start to see how problems can arise.  I wanted those 50 referrers to appear as 1 single referrer - Google, and after all this is the functionality that the Fuzzy Matching component is meant to provide.

I mentioned earlier the existence of a each record being given a "score" to indicate how well it matches to another record. The score will be a number between 0 and 1. A score of 1 indicates an exact match. Any record with a score that is above a specified threshold is considered a satisfactory fuzzy match. Part of the skill of optimising the fuzzy matching process is to modify this threshold so as to achieve the desired results.

The screenshot below shows the dialog that is used to set the similarity threshold.

The default setting for this threshold is 0.8 and that is the value I was running with when I got the 50 different Google referrers. I wanted these 50 records to be considered suitably similar so I had to lower the similarity threshold.

A threshold of 0.6 resulted in 39 different occurences with the word "Google" in them. Interestingly "www.google.com.pk" and "www.google.com.ph" are not considered suitably close enough but "www.google.co.id" and "www.google.co.in" are. Weird!!

A threshold of 0.5 resulted in 29 different occurences containing the word "Google". Note that it thinks
"google", "www.google.com", "www.google.com.au" & "www.google.com.ar" are suitably similar but "www.google.com.ph" is not.

I gave up at this point. Clearly I could eventually lower the threshold to something that would allow everything with the word "google" in it to appear together, it just seems as though that threshold needs to be a lot lower than we would imagine. Setting the threshold too low increases the danger that records that have no real-world match could in fact be matched.

Clearly if you are going to be using fuzzy logic within SSIS you need to do it with caution. Test beforehand the settings that you apply and ensure that you have the optimum balance between matching those records that are indeed similar and eliminating those rogue values that look similar but in fact are not!

I plan to do more work with fuzzy components in the future so if I find out anything else that is worthy of sharing you can be sure I'll put it up here.

-Jamie

 

Published 30 March 2005 20:48 by jamie.thomson

Comments

 

TrackBack said:

April 3, 2005 12:40
 

TrackBack said:

For those of you interested in implementing fuzzy logic in SQL Server Integration Services Jamie Thomson's...
May 3, 2005 14:09
 

jamie.thomson said:

could you send me a copy of the source code. sanpoco@sbcglobal.com
May 19, 2005 17:55
New Comments to this post are disabled

This Blog

Syndication

Powered by Community Server (Personal Edition), by Telligent Systems