Welcome to EMC Consulting Blogs Sign in | Join | Help

SSIS Junkie

SSIS: Regular Expressions & Fuzzy Matching

A few days ago I posted a blog entry discussing some experiences of using fuzzy matching in SSIS. In it I talked of a problem I was having resolving 50 different records with the word "google" in them down to a single record "www.google.com".

Another problem I was having was that fuzzy matching doesn't allow much control over what these 50 records may resolve to. I want them all to resolve to "www.google.com" but they may easily all resolve to "images.google.jp" which is not good.

Nils Erik Ihlen read the article and wrote to me suggesting that regular expressions may be able to help solve my problem. He told me he had implemented regular expressions in SSIS using the script component transformation and was kind enough to share his work with me. This was tremendously helpful to me because I am in no way a conventional .Net developer and I don't really know what the .Net Framework provides so I learnt alot from this.

The following screenshot shows how Nils used regular expressions in the script component to search for records containing the word "google".

 

Basically what the code does is to search the incoming record for the string "google" and if it is found output the string "http://www.google.com" and "" otherwise. And there you have it....exactly what I wanted...everything coming out as google.

Now, let's try and apply this to my practical example. I can add this regular expression to my package that currently uses fuzzy matching but we still have problems with it. Every string that I wanted to search for would require its own regular expression. There may be literally hundreds of these strings (e.g. microsoft, msn, yahoo, askjeeves, conchango) and building a regular expression for every one of them is going to become unwieldy. I still need the fuzzy matching component to deal with all the records that my regular expressions cannot resolve.

What I'm really saying is that on their own, regular exressions and fuzzy matching have alot of uses but combining them together gives me a powerful mechanism for dealing with irregular data.

I have posted an example package demonstrating the use of regular expressions here (apologies that its not a .dtsx file, our blogging engine doesn't seem to like them for some reason). Feel free to download it and have a play...it doesn't require any configuration, just download it, run it, and see what regular expressions can do for you. For help with regular expressions, try this link.

Many thanks to Nils for the help and advice he has provided me with.

I have been discussing what Conchango's approach to using fuzzy matching should be with my colleague Mick Horne. Mick did alot of work using SSIS's fuzzy components way back in summer 2004. His overall conclusion was that if you want fuzzy matching/lookups to do exactly what you want then there has to be some form of human element involved. His suggestion was that the outputs from fuzzy matching can be sent to a lookup table which can then be used on future executions, prior to the fuzzy matching, to ensure that repeated occurences are always mapped to the same value. To make this truly useful and intuitive we would need to periodically check over the lookup table and decide whether a mapping is satisfactory. For example if fuzzy matching says "images.google.com" always maps to "www.google.jp" that might not be what we want, so we want to get rid of that mapping from the lookup table. It wouldn't be hard to knock up a sinple interface that allowed you to do this.

[Warning, "stating the bleeding obvious" coming up!!] Fuzzy logic in SSIS isn't an exact science. SSIS's architecture allows you many methods of refining fuzzy computation to make it work for you. Explore these methods before you start to implement fuzzy computation en masse. Be sure you know what you're getting into. That sounds very cautionary, and it is. But it doesn't mean to say with the correct approach fuzzy matching/lookups can't be a powerful tool in your SSIS armoury for dealing with inexact data.

-Jamie

 

Published 03 April 2005 12:40 by jamie.thomson

Comments

 

jamie.thomson said:

Great blog Jamie. I think this nicely underlines the advantages of our "toolkit" approach to data quality components over the more monolithic approach of some quality vendors. Being able to conditionally pipeline a complex array of data quality processes should be very useful, even if it does require more design.

Thanks for the great posts - we love it!

Donald
April 6, 2005 04:56
 

jamie.thomson said:

A RegEx component that one can use for parsing strings would be pretty useful... any takers?
April 7, 2005 04:11
 

jamie.thomson said:

Its on my list of things to do Ash...its a long list though!

What exactly do you think it should do?
April 7, 2005 08:00
 

stinger said:

There is SSIS RegEx component:

http://www.microsoft.com/downloads/details.aspx?FamilyID=C16F11AD-150A-4091-B3A2-83D21D3E0973&displaylang=en

But the idea of using RegEx from the script component is better I think :-)

Good Work

June 11, 2007 12:05
New Comments to this post are disabled

This Blog

Syndication

Powered by Community Server (Personal Edition), by Telligent Systems