Welcome to EMC Consulting Blogs Sign in | Join | Help

SSIS Junkie

DRY SQL

I recently inherited some SQL that someone else had written and had the job of “tidying it up” before it gets pushed out to production. Here’s a slightly simplified (yes, simplified) version of that SQL:

   1: --options
   2: select    asset_class
   3: ,        case    when volume > 0 then 'profit' 
   4:                 else 'loss' 
   5:         end as profit_or_loss
   6: ,        customer
   7: ,        sum(volume) as total_volume
   8: from    t1
   9: where    asset_class = 'options'
  10: group    by
  11:         asset_class
  12: ,        case    when volume > 0 then 'profit' 
  13:                 else 'loss' 
  14:         end
  15: ,        customer
  16: union    all
  17: --swaps
  18: select    asset_class
  19: ,        case    when volume > 0 
  20:                     then 'profit' 
  21:                 else 'loss' 
  22:         end as profit_or_loss
  23: ,        customer
  24: ,        sum(volume) as total_volume
  25: from    t2
  26: where    asset_class = 'swaps'
  27: group    by
  28:         asset_class
  29: ,        case    when volume > 0 then 'profit' 
  30:                 else 'loss' 
  31:         end
  32: ,        customer

The SQL basically takes data from two tables (t1 & t2), aggregates each, carries out some inline expressions (for profit_or_loss) and finally unions it all together. On the surface it looks fine but there are a few problems here, namely that there is a lot of repeated code; it violates the principle of don’t repeat yourself (DRY) which preaches “single point of maintenance” and “deduplication” of code. If you take the time to check it out you’ll see that identical aggregations are carried out on the two datasets (lines 7 & 24) as are the same conversions for profit_or_loss (lines 3-5 & 19-22). Not only that but we have expressions appearing in both the SELECT clause and GROUP BY of both halves of the query (lines 12-14 & 29-31), another violation of DRY.

A bit of refactoring is called for. First job, eliminate those expressions for profit_or_loss which appear in the GROUP BY clauses:

   1: --options
   2: select    asset_class
   3: ,        profit_or_loss
   4: ,        customer
   5: ,        sum(volume) as total_volume
   6: from    (
   7:         select    asset_class
   8:         ,        customer
   9:         ,        case    when volume > 0 then 'profit' 
  10:                         else 'loss' 
  11:                 end as profit_or_loss
  12:         ,        volume
  13:         from    t1
  14:         where    asset_class = 'options'
  15:         )t1
  16: group    by
  17:         asset_class
  18: ,        profit_or_loss
  19: ,        customer
  20: union    all
  21: --swaps
  22: select    asset_class
  23: ,        profit_or_loss
  24: ,        customer
  25: ,        sum(volume) as total_volume
  26: from    (
  27:         select    asset_class
  28:         ,        customer
  29:         ,        case    when volume > 0 then 'profit' 
  30:                         else 'loss' 
  31:                 end as profit_or_loss
  32:         ,        volume
  33:         from    t2
  34:         where    asset_class = 'swaps'
  35:         )t2
  36: group    by
  37:         asset_class
  38: ,        profit_or_loss
  39: ,        customer

Ok cool, we’ve moved the expression for profit_or_loss into a subquery (aka derived table) and hence expressions have gone from our GROUP BY clauses … but we’ve still got DRY violations. The aggregation (lines 5 & 25) and expression for profit_or_loss (lines 9-11 & 29-31) still appear in two places. More refactoring….

   1: select    asset_class
   2: ,        customer
   3: ,        sum(volume) as total_volume
   4: from    (
   5:         select    asset_class
   6:         ,        case    when volume > 0 then 'profit' 
   7:                         else 'loss' 
   8:                 end as profit_or_loss
   9:         ,        customer
  10:         ,        volume
  11:         from    (
  12:                 --options
  13:                 select    asset_class
  14:                 ,        customer
  15:                 ,        volume
  16:                 from    t1
  17:                 where    asset_class = 'options'
  18:                 union    all
  19:                 --swaps
  20:                 select    asset_class
  21:                 ,        customer
  22:                 ,        volume
  23:                 from    t2
  24:                 where    asset_class = 'swaps'
  25:                 )t
  26:         )q
  27: group    by
  28:         asset_class
  29: ,        customer

There, much better; the aggregation only occurs once as does the expression for profit_or_loss (which in the code that I inherited occurred four different times). Our SQL is DRY and its going to be a lot easier to maintain for whomever picks the code up from me.

You’ll notice we’ve got 2 levels of nested subqueries (aka derived tables). I make no apologies for that - derived tables are a great mechanism for eliminating repeated code and if you take but one bit of advice away from this blog post it would be this: derived tables are your friend.

Would you have refactored this the same way? Maybe you might have moved the derived tables into a dedicated view. Perhaps you might even have put the derived tables into a WITH clause like so:

   1: with    t as (
   2:         --options
   3:         select    asset_class
   4:         ,        customer
   5:         ,        volume
   6:         from    t1
   7:         where    asset_class = 'options'
   8:         union    all
   9:         --swaps
  10:         select    asset_class
  11:         ,        customer
  12:         ,        volume
  13:         from    t2
  14:         where    asset_class = 'swaps'
  15: ),
  16: q as    (
  17:         select    asset_class
  18:         ,        case    when volume > 0 then 'profit' 
  19:                         else 'loss' 
  20:                 end as profit_or_loss
  21:         ,        customer
  22:         ,        volume
  23:         from    t
  24: )
  25: select    asset_class
  26: ,        customer
  27: ,        sum(volume) as total_volume
  28: from    q
  29: group    by
  30:         asset_class
  31: ,        customer

Let me know in the comments!

@Jamiet


Further reading:

Views keep your SQL queries DRY - Seth Schroeder

Published Thursday, July 02, 2009 10:05 PM by jamie.thomson
Filed under: , , ,

Comments

 

Dew Droplet – July 1-3, 2009 | Alvin Ashcraft's Morning Dew said:

July 4, 2009 3:56 AM
 

AdamTappis said:

Jamie,

In general I'm a great fan of refactoring code and using CTE's to tidy up my queries. One important thing to notice regarding your approach to have a single group by for the unioned set is that may be more optimal to do the group by before the UNION based on the content and indexing of t1 and t2.

As an aside it appears redundant to do the CASE logic for profit and loss in your example as the column appears to be missing from the final SELECT and GROUP BY clauses.

July 7, 2009 2:25 PM
 

jamie.thomson said:

Hi Adam,

Regarding the missing profit_or_loss, well observed :)   My mistake.  As I alluded at the top of this post this wasn't, word for word, the query that I inherited. I changed it (a lot) for demo purposes and it appears I amde the odd mistake along the way!!!! Doh!

The issue of performance impact here is of course a consideration - that would make for a good follow-up post.

cheers for the comment

-Jamie

July 7, 2009 2:42 PM
 

Claypole's World - The SQL Server Side said:

My colleague Jamie Thomson   posted an intriguing article on DRY SQL which you can read all about

July 15, 2009 4:53 PM
New Comments to this post are disabled

This Blog

Syndication

Powered by Community Server (Personal Edition), by Telligent Systems