
SSIS Junkie

Kapow – ETL for HTML

A couple of weeks ago Chris Webb sent me an IM telling me about a new technology called Kapow that he’d just seen demonstrated. Chris has since blogged about it at Kapow Technologies, and in that blog post he described Kapow as:

“a cross between a screenscraper and an ETL tool”

That’s a very apt description. Kapow lets you build what are effectively ETL packages (although they call them Robots) that extract data from HTML pages and either (a) load it into a database for you or, more interestingly, (b) make that data available as a RESTful web service. A robot pulls out the data embedded in the markup and presents it as strongly-typed data entities.
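To make those two output options concrete, here is a minimal Python sketch. It is not Kapow's API, which I have not seen documented here; the Product entity, the table name and both function names are made up for illustration. It takes a list of strongly-typed entities and either loads them into a database (an in-memory SQLite one) or renders them as the kind of JSON payload a RESTful service might return.

```python
import json
import sqlite3
from dataclasses import asdict, dataclass


@dataclass
class Product:
    """Hypothetical strongly-typed entity standing in for whatever a robot extracts."""
    name: str
    price: float


def load_into_database(conn, products):
    """Option (a): load the extracted entities into a database table."""
    conn.execute("CREATE TABLE IF NOT EXISTS product (name TEXT, price REAL)")
    conn.executemany(
        "INSERT INTO product (name, price) VALUES (?, ?)",
        [(p.name, p.price) for p in products],
    )
    conn.commit()


def as_json_payload(products):
    """Option (b): render the same entities as a JSON body for a web service."""
    return json.dumps([asdict(p) for p in products])


# Made-up sample data; a real robot would have scraped this from a page.
extracted = [Product("Widget", 9.99)]

conn = sqlite3.connect(":memory:")
load_into_database(conn, extracted)
print(conn.execute("SELECT name, price FROM product").fetchall())
print(as_json_payload(extracted))
```

The point is only that once the data is typed, the destination is an afterthought: the same list feeds both the table and the service response.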

I had never really thought of a web page as structured data, but that is exactly what it is; HTML is after all nothing more than a hierarchical dataset with the added luxury of metadata, exposed through the Document Object Model (DOM). Kapow takes that structured data and its rich metadata, parses it for us, and presents it to us in ways that are easily consumable.
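A quick way to convince yourself that HTML is hierarchical data with metadata is to walk it. The sketch below uses Python's standard html.parser module, not anything from Kapow; the OutlineParser class, the outline function and the sample markup are my own illustrative names. It records each element's depth in the tree along with its attributes, and it assumes well-formed markup in which every element has an explicit closing tag.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Records each element as (depth, tag, attributes) to expose the hierarchy."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        # The attributes are the "metadata" attached to each node.
        self.nodes.append((self.depth, tag, dict(attrs)))
        self.depth += 1

    def handle_endtag(self, tag):
        # Assumes every start tag has a matching end tag.
        self.depth -= 1


def outline(html):
    """Return the document as a flat list of (depth, tag, attributes) tuples."""
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    return parser.nodes


# Made-up markup for illustration only.
SAMPLE = (
    '<div id="results">'
    '<p class="hit"><a href="http://example.com/">Example</a></p>'
    '</div>'
)

for depth, tag, attrs in outline(SAMPLE):
    print("  " * depth + tag, attrs)
```

Each tuple is a row in a hierarchical dataset: the depth gives the position in the tree and the attribute dictionary carries the metadata.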

I was given a tour of Kapow by their UK rep Dominic Dunkley. Dominic had a great demo where he built, from scratch, a Kapow robot that:

  1. visited a search engine before…
  2. …entering a search term (in this case “EMC”)
  3. iterated over each page of results within which it…
  4. …iterated over each search result…
  5. …and extracted the title, URL and description before…
  6. …making all the search results available in a strongly-typed dataset
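The steps above can be sketched in plain Python to show the shape of the workflow. This is not how a Kapow robot is actually built (Dominic did it visually in their designer), and everything here is an assumption of mine: the SearchResult type, the ResultParser and run_robot names, the markup layout with one li class="result" per hit, and the canned pages that stand in for a real search engine so that the example runs without a network connection.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class SearchResult:
    title: str
    url: str
    description: str


class ResultParser(HTMLParser):
    """Collects one SearchResult per <li class="result"> element (assumed layout)."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._current = None  # fields of the result currently being read
        self._field = None    # field that the next text node belongs to

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "result":
            self._current = {"title": "", "url": "", "description": ""}
        elif self._current is not None and tag == "a":
            self._current["url"] = attrs.get("href", "")
            self._field = "title"
        elif self._current is not None and tag == "p":
            self._field = "description"

    def handle_data(self, data):
        if self._current is not None and self._field:
            self._current[self._field] += data

    def handle_endtag(self, tag):
        if tag in ("a", "p"):
            self._field = None
        elif tag == "li" and self._current is not None:
            self.results.append(SearchResult(**self._current))
            self._current = None


def run_robot(fetch_page, term):
    """Steps 1-6: search, loop over pages, loop over results, return typed rows."""
    results = []
    page = 1
    while True:
        html = fetch_page(term, page)
        if html is None:  # no more result pages
            break
        parser = ResultParser()
        parser.feed(html)
        parser.close()
        results.extend(parser.results)
        page += 1
    return results


# Stand-in for a search engine: made-up pages keyed by (term, page number).
FAKE_PAGES = {
    ("EMC", 1): '<ul><li class="result"><a href="http://example.com/a">First</a><p>One</p></li>'
                '<li class="result"><a href="http://example.com/b">Second</a><p>Two</p></li></ul>',
    ("EMC", 2): '<ul><li class="result"><a href="http://example.com/c">Third</a><p>Three</p></li></ul>',
}


def fake_fetch(term, page):
    return FAKE_PAGES.get((term, page))


for result in run_robot(fake_fetch, "EMC"):
    print(result)
```

The outer loop over pages and the inner collection of results per page are the workflow part; turning each hit into a SearchResult is the data composition part.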

It all took about 15 minutes, and that was with him pausing to explain it step-by-step. That demo really grabbed my attention because I realised that not only does Kapow have the ability to parse the DOM but it also has notions of workflow and data composition, which are of course two vital features of any ETL tool.

I’m reminded of 3scale Networks, which I mentioned in my blog post Enterprise Mashups a couple of weeks back (it was actually after reading that blog post that Chris got in touch with me). In that post I describe how 3scale have taken information made freely available by the United Nations at http://data.un.org/ and republished it as an easily consumable data service. In essence the same service could be built using Kapow rather than being monotonously handcrafted, which is how I suspect 3scale did it.

I’m hoping I have some reason to use Kapow in the future because I think there are some very interesting scenarios that come into play here. Chiefly, as Chris pointed out in his blog post, we can pull data from any web site and use it for BI purposes. For example, suppose you work for an airline and you want to easily compare your advertised prices with those of your competitors – Kapow is a one-stop shop for enabling that.

Impressive stuff. If you’re interested go and check out Kapow for yourself at http://kapowtech.com/index.php/solutions/web-and-business-intelligence


Published Wednesday, July 08, 2009 10:29 PM by jamie.thomson