I recently inherited some SQL that someone else had written and had the job of “tidying it up” before it gets pushed out to production. Here’s a slightly simplified (yes, simplified) version of that SQL:
1: --options
2: select asset_class
3: , case when volume > 0 then 'profit'
4: else 'loss'
5: end as profit_or_loss
6: , customer
7: , sum(volume) as total_volume
8: from t1
9: where asset_class = 'options'
10: group by
11: asset_class
12: , case when volume > 0 then 'profit'
13: else 'loss'
14: end
15: , customer
16: union all
17: --swaps
18: select asset_class
19: , case when volume > 0
20: then 'profit'
21: else 'loss'
22: end as profit_or_loss
23: , customer
24: , sum(volume) as total_volume
25: from t2
26: where asset_class = 'swaps'
27: group by
28: asset_class
29: , case when volume > 0 then 'profit'
30: else 'loss'
31: end
32: , customer
The SQL basically takes data from two tables (t1 & t2), aggregates each, carries out some inline expressions (for profit_or_loss) and finally unions it all together. On the surface it looks fine but there are a few problems here, namely that there is a lot of repeated code; it violates the principle of don’t repeat yourself (DRY) which preaches “single point of maintenance” and “deduplication” of code. If you take the time to check it out you’ll see that identical aggregations are carried out on the two datasets (lines 7 & 24) as are the same conversions for profit_or_loss (lines 3-5 & 19-22). Not only that but we have expressions appearing in both the SELECT clause and GROUP BY of both halves of the query (lines 12-14 & 29-31), another violation of DRY.
A bit of refactoring is called for. First job, eliminate those expressions for profit_or_loss which appear in the GROUP BY clauses:
1: --options
2: select asset_class
3: , profit_or_loss
4: , customer
5: , sum(volume) as total_volume
6: from (
7: select asset_class
8: , customer
9: , case when volume > 0 then 'profit'
10: else 'loss'
11: end as profit_or_loss
12: , volume
13: from t1
14: where asset_class = 'options'
15: )t1
16: group by
17: asset_class
18: , profit_or_loss
19: , customer
20: union all
21: --swaps
22: select asset_class
23: , profit_or_loss
24: , customer
25: , sum(volume) as total_volume
26: from (
27: select asset_class
28: , customer
29: , case when volume > 0 then 'profit'
30: else 'loss'
31: end as profit_or_loss
32: , volume
33: from t2
34: where asset_class = 'swaps'
35: )t2
36: group by
37: asset_class
38: , profit_or_loss
39: , customer
Ok cool, we’ve moved the expression for profit_or_loss into a subquery (aka derived table) and hence expressions have gone from our GROUP BY clauses … but we’ve still got DRY violations. The aggregation (lines 5 & 25) and expression for profit_or_loss (lines 9-11 & 29-31) still appear in two places. More refactoring….
1: select asset_class
2: , customer
3: , sum(volume) as total_volume
4: from (
5: select asset_class
6: , case when volume > 0 then 'profit'
7: else 'loss'
8: end as profit_or_loss
9: , customer
10: , volume
11: from (
12: --options
13: select asset_class
14: , customer
15: , volume
16: from t1
17: where asset_class = 'options'
18: union all
19: --swaps
20: select asset_class
21: , customer
22: , volume
23: from t2
24: where asset_class = 'swaps'
25: )t
26: )q
27: group by
28: asset_class
29: , customer
There, much better; the aggregation only occurs once as does the expression for profit_or_loss (which in the code that I inherited occurred four different times). Our SQL is DRY and its going to be a lot easier to maintain for whomever picks the code up from me.
You’ll notice we’ve got 2 levels of nested subqueries (aka derived tables). I make no apologies for that - derived tables are a great mechanism for eliminating repeated code and if you take but one bit of advice away from this blog post it would be this: derived tables are your friend.
Would you have refactored this the same way? Maybe you might have moved the derived tables into a dedicated view. Perhaps you might even have put the derived tables into a WITH clause like so:
1: with t as (
2: --options
3: select asset_class
4: , customer
5: , volume
6: from t1
7: where asset_class = 'options'
8: union all
9: --swaps
10: select asset_class
11: , customer
12: , volume
13: from t2
14: where asset_class = 'swaps'
15: ),
16: q as (
17: select asset_class
18: , case when volume > 0 then 'profit'
19: else 'loss'
20: end as profit_or_loss
21: , customer
22: , volume
23: from t
24: )
25: select asset_class
26: , customer
27: , sum(volume) as total_volume
28: from q
29: group by
30: asset_class
31: , customer
Let me know in the comments!
@Jamiet
Further reading:
Views keep your SQL queries DRY - Seth Schroeder