
Claypole's World - The SQL Server Side

Concerning development of bespoke applications and database administration.

The Curious Case of the Dubious Deadlock and the Not So Logical Lock

One of the many reasons I love consulting is that you get the opportunity to work in diverse businesses, review and inspect various implementations and experience the unique challenges that come with each client visit. My latest engagement has involved working with one of the largest Teradata installations in the world, which acts as the primary data source to our application. To give you an idea of the scale: this system has over 40 billion rows in just one of the fact tables… 40. Billion. That's a lot of rows. During this engagement we encountered a rather bizarre issue that doesn't seem to be well understood, so I thought I'd blog about it. Get ready for a mind-bender...

The Scenario

We had been pulling data from this monster into our system. The table in question had a paltry 900 million rows in it, which we had been incrementally loading into for a while. All was well. Then suddenly, out of nowhere, we got a deadlock during the load process. We re-ran the load and it went through successfully the second time. Odd. We didn't get another deadlock for a while, and then at about 1 billion rows we were getting them all the time. What on earth was going on? Time to put away the violin, stick on the deerstalker and investigate...

Before proceeding it's probably worth sharing a bit about the implementation (without actually divulging any industry details of course) so that you can understand the context of these deadlocks and why they were so strange.  Remember though this is a locking problem not an ETL problem. The load pattern caused the issue to manifest but in itself isn’t responsible for its occurrence.  The data itself was completely clean and when the process was executed sequentially this issue did not arise.

The Table

Column Name     Data Type   Key
Date            datetime    Clustered
CountryID       tinyint     Clustered
GroupID         smallint    Clustered
CodeID          int         Clustered
Other Columns   ...

There are no additional indexes.

The Data

To illustrate the uniqueness of the data, below are two data sets. You can see that each row is in itself a unique set of values. Yet these four rows can, and will, cause a deadlock based on row locks. I promise.

Data Loaded by Parameters 2009-05-19 & CountryID 2

Date CountryID GroupID CodeID
2009-05-19 2 4271 5835066
2009-05-19 2 4619 2546652

…..

Data Loaded by Parameters 2009-05-19 & CountryID 3

Date CountryID GroupID CodeID
2009-05-19 3 4245 2651987
2009-05-19 3 4657 5744053

The Load Pattern

[Image: diagram of the parallel load pattern]

Stored Proc Pseudo Code

INSERT INTO dbo.BillionRowTable
(    [Date]
    ,CountryID
    ,GroupID
    ,CodeID
)
SELECT   [Date]
        ,[CountryID]
        ,[GroupID]
        ,[CodeID]
FROM    dbo.[25MillionRowTable]
/*
The WHERE clause filters the 25 million rows for the day
down to the ~5 million for the Date/CountryID combination
*/
WHERE   [Date]    = @Date
AND     CountryID = @CountryID;

In short then we were loading 5 sets of data in parallel through a parameterised stored procedure.  The data set used in each stored procedure execution was for a different country on a given day.  A country could have as many as 5 million rows in it making for a total of 25 million rows in each parallel load.  Given that there was no other processing on this server it was therefore more than a little odd that we would get a deadlock. At the risk of repeating myself, each procedure was  wholly responsible for its own set of data. How on earth could a deadlock occur? The data was clean and the logic was fine so what could have been the problem?  We even ran the stored procedures sequentially without a problem. The issue only arose when running the same set of data in parallel.
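To make the load pattern concrete, here is a hedged sketch of what the parallel execution looked like (the procedure name is hypothetical; it simply wraps the INSERT shown above). Each EXEC ran in its own session, all at the same time, each for a different country on the same day:

-- Sketch only; dbo.LoadBillionRowTable is a hypothetical name for the
-- parameterised procedure containing the INSERT shown above.
-- Session 1:
EXEC dbo.LoadBillionRowTable @Date = '20090519', @CountryID = 1;

-- Session 2 (a separate connection, running concurrently):
EXEC dbo.LoadBillionRowTable @Date = '20090519', @CountryID = 2;

-- ...and so on for CountryID 3, 4 and 5 - five sessions in total.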

Deadlocking Recap

So what then is a deadlock? It occurs when two processes each hold a lock, or set of locks, that is incompatible with a lock the other process is trying to acquire. As neither process can complete, the two are said to be deadlocked. Rather than look on in morbid fascination, SQL Server chooses a victim: whichever of the two processes is cheapest to roll back, based on the amount of resources it has consumed. The victim is then rolled back.

If you have trace flags 1204 or 1222 enabled, or are using Profiler/SQL Trace to capture deadlock events, you can gather information on which processes were involved in the deadlock and which one (or ones) was chosen as the victim. This normally makes identifying the deadlock, and why it occurred, a relatively straightforward process. Resolving the deadlock can be another matter entirely, but perhaps that's a subject for another post.
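For reference, a minimal sketch of enabling the more detailed of those two trace flags globally (1222 writes XML-style deadlock information to the error log):

-- Enable deadlock reporting to the SQL Server error log for all sessions.
DBCC TRACEON (1222, -1);

-- Check which trace flags are currently active.
DBCC TRACESTATUS (-1);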

However, in this case the deadlock info was not so straightforward.  First though let’s look at a ‘normal’ scenario.

A Simple Graph – A Community Example

Below is a deadlock generated by code from fellow MVP Peter Ward’s blog.  The code to reproduce this example can be found here.  I have used profiler to reveal the deadlock graph.

[Image: Profiler deadlock graph for the community example]

In this instance we can see that SPID 64 requested an update lock on a row that was exclusively locked by SPID 65. However, SPID 65 needed to update a row on associated object 3602879703904288768 (this is the object's HoBT ID). Unfortunately for SPID 65 this was already locked exclusively by SPID 64. Neither could therefore proceed, and SPID 65 was chosen as the victim.

By hovering over the graph (see above) you can also see the last statement in the batch for each SPID and by querying sys.partitions (see below) you can see how the associated objid maps back to the table in question.

USE tempdb;

SELECT   OBJECT_NAME(p.[object_id]) AS ObjectName
        ,p.[object_id]              AS ObjectID
        ,i.name                     AS IndexName
        ,hobt_id                    AS HoBTID
        ,partition_id               AS PartitionID
        ,partition_number           AS PartitionNo
FROM    sys.partitions p
JOIN    sys.indexes i
ON      p.index_id     = i.index_id
AND     p.[object_id]  = i.[object_id]
WHERE   hobt_id IN (3602879703904288768,3530822109825400832);
The result set for these two queries is as follows:

[Image: result set mapping the HoBT IDs back to the table and index]

My Graph – Not So Obvious

So now that we have recapped the basics and understand the essential deadlock information we are likely to be presented with, it's worth looking at the graph that was making my eyes bleed. Perhaps you can test yourself to see if you can spot what, on the face of it, is so wrong about it. N.B. I have had to change some of the names for security and privacy reasons.

[Image: the deadlock graph from our parallel load]

So what is this graph saying?  It is telling me that SPID 82 is deadlocked against SPID 85.  It is saying that they were both trying to get exclusive locks for a ROW in hobt 72057594693222400. 

How was this possible?  Why should SPID 82 have a lock on any data being loaded by SPID 85? There was no overlapping data that would allow this to happen.  Why was this only starting to happen now? It just didn’t make any sense. Initially you might think that this is due to page splits or some other aspect of the storage where the data ranges may overlap but you will see these are both “Key Locks”.

I ran this query hoping for some clarification.

SELECT   OBJECT_NAME(p.[object_id]) AS ObjectName
        ,p.[object_id]              AS ObjectID
        ,i.name                     AS IndexName
        ,hobt_id                    AS HoBTID
        ,partition_id               AS PartitionID
        ,partition_number           AS PartitionNo
FROM    sys.partitions p
JOIN    sys.indexes i
ON      p.index_id     = i.index_id
AND     p.[object_id]  = i.[object_id]
WHERE   hobt_id = 72057594693222400;

This just returned the name of the target table and the primary key, so not much help to be found there. What I can tell you is that both queries were loading into the same partition, in the same hobt_id, on the same index.

What could we conclude at this point? Well, we knew the deadlock was occurring on the same clustered index of the same table. What could we do next?

First things first we needed to find out what lock resources were involved in the deadlock.  If you look at the community sample and / or my graph you will not see the lock resource hash value in the graphical depiction.  We had to find another way to get this information. We decided to crack open the deadlock graphs to see what we could see…

Did you know you can export your deadlock graphs from profiler and save them as xml files (with a .xdl file extension)?  This is helpful for a number of reasons but the biggest reason I have found is that you can then open the file in notepad, management studio, or even your favourite XML editor (mine is XML Spy from Altova) and get even more information!

My Deadlock Graph – in text format

<deadlock-list>
    <deadlock victim="process389e718">
        <process-list>
            <process id="process384e478" taskpriority="0" logused="500719924" waitresource="KEY: 7:72057594693222400 (0801bf3924dc)" waittime="60718" transactionname="INSERT" XDES="0x80028c80" lockMode="X" schedulerid="8" kpid="360" status="suspended" spid="85" sbid="0" ecid="0" priority="0" transcount="2" isolationlevel="read committed (2)" lockTimeout="4294967295">
                <executionStack> 
                   ...
                </executionStack>
                <inputbuf>INPUTBUFFER</inputbuf>
            </process>
            <process id="process389e718" taskpriority="0" logused="350193272" waitresource="KEY: 7:72057594693222400 (0801a4a3d7f1)" waittime="4000" transactionname="INSERT" XDES="0x800ce8f0" lockMode="X" schedulerid="16" kpid="4552" status="suspended" spid="82" sbid="0" ecid="0" priority="0" transcount="2" isolationlevel="read committed (2)" lockTimeout="4294967295">
                <executionStack> 
                </executionStack>
                <inputbuf>INPUTBUFFER</inputbuf>
            </process>
        </process-list>
        <resource-list>
            <keylock hobtid="72057594693222400" dbid="7" objectname="DB.DBO.TABLE" indexname="TABLE_PK" id="lock59b304700" mode="X" associatedObjectId="72057594693222400">
                <owner-list>
                    <owner id="process389e718" mode="X"/>
                </owner-list>
                <waiter-list>
                    <waiter id="process384e478" mode="X" requestType="wait"/>
                </waiter-list>
            </keylock>
            <keylock hobtid="72057594693222400" dbid="7" objectname="DB.DBO.TABLE" indexname="TABLE_PK" id="lock5df249400" mode="X" associatedObjectId="72057594693222400">
                <owner-list>
                    <owner id="process384e478" mode="X"/>
                </owner-list>
                <waiter-list>
                    <waiter id="process389e718" mode="X" requestType="wait"/>
                </waiter-list>
            </keylock>
        </resource-list>
    </deadlock>
</deadlock-list>

I have highlighted the two key pieces of information for this diagnosis and have copied them below for ease of reference. The values between the parentheses are the lock resource hash values.

waitresource="KEY: 7:72057594693222400 (0801a4a3d7f1)

waitresource="KEY: 7:72057594693222400 (0801bf3924dc)

The question is what keys do these values relate to? My data is working on unique keys in each insert so there must be something else going on. But what?

When Locks Get Physical – Using %%lockres%%

I was struggling to come up with an answer.  So I thought I’d run the scenario past my good friend Simon Sabin to see if he had any ideas.  Sure enough he came up with an undocumented feature in SQL Server 2005 and above (courtesy of the late Ken Henderson) that helped us understand this particular problem.  The feature is called %%lockres%%.

When people think about a row lock they often think about the DDL of their table.  In my case I had a composite clustered index that was telling me I had no dupes across Date, CountryID, GroupID and CodeID.  It was logical therefore to assume that SQL Server was using this key when saying that a lock has occurred.  Well… The answer is that it is … and it isn’t.

Locks in SQL Server are often said to be a logical concept, whereas latches are seen as physical. After being bitten by this issue I don't think that's a wholly fair distinction. My reasoning is as follows: locking may be a logical concept, but it has a physical implementation. It's true that latches look after the physical movement of data between memory addresses, but the lock's physical implementation is driven by how we, the designers, define our table. There is an internal mechanism that translates my DDL definition into something that works for SQL Server – and, most notably, performs for SQL Server irrespective of the number of columns I specify in my key.

When SQL Server needs to lock a row it creates a hash value based on the key values of the table. It is this hash that the lock manager uses, giving it a single value to check when deciding whether a row is already locked. The values in my key (which was 16 bytes in length) were therefore being distilled down to a 6 byte hash. So you see, locking has a physical side as well – to think of it as a purely logical concept would be to miss the underlying mechanism that translates the definition of our DDL into information the lock manager can consume.

This was the missing link (and I do not mean in the Ida context http://news.bbc.co.uk/1/hi/sci/tech/8057465.stm).  Perhaps the really perceptive people can already see where this is going.  The hypothesis then was that SQL Server was generating a duplicate hash value across the countries I was loading.

Now we needed to prove it, and %%lockres%% was the key. %%lockres%% always returns the appropriate wait resource whether you are scanning a clustered index or a heap. If it's a heap then %%lockres%% returns the resource for the RID lock; if it's a clustered index being scanned then the resource is for the key lock. (You can also use it to get the hashes for non-clustered indexes by using an index hint, but Simon is going to blog about that in a separate post.)

We could have queried the entire day's processing and done a GROUP BY to see if that gave us anything, but there was a better way to filter the data – and I'd rather not run huge queries scanning a billion rows if I can help it. (The table was much larger than the available RAM, so that would have resulted in a huge amount of I/O.)

By filtering our query for the day and also by the lock resource hashes that had given us this problem, we could really cut into this dataset. As the Date field is the first key of the clustered index, a clustered index seek was used, and so %%lockres%% returned the key hashes for our clustered index, which was what we wanted.

SELECT   *
        ,%%lockres%%
FROM    dbo.[Table] WITH (NOLOCK)
WHERE   %%lockres%% IN ('(0801a4a3d7f1)','(0801bf3924dc)')
AND     [Date] >= '2009-05-18';

Sure enough, we had the same hash value appearing in different countries for the same day's data. Even though the data itself was unique, SQL Server's hashing algorithm was not able to achieve a unique result. When this occurs it is called a collision.

One clash, though, wouldn't have been enough for a deadlock to occur. It would have slowed the process down, as one process would have had to finish before the waiting process continued – a classic blocking scenario. We needed two. That is exactly what we got – how unlucky was that! Now, normally you need to be at serious scale to hit this. Remember we had loaded happily to 900 million rows before we encountered this problem. With a 6 byte hash there are 281,474,976,710,656 possible hash values.  EDIT: Please see Remus Rusanu's blog for the statistics and probability on this clash. He's written a great follow-up post that does a far better job of explaining it!
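As a rough back-of-the-envelope check (a naive estimate only – it assumes the 48-bit hash were uniformly distributed, and Remus's post explains why the real collision rate is higher still), the classic birthday-problem approximation already suggests that collisions at this scale are far from unlikely:

P(at least one collision) ≈ 1 − e^( −n(n−1) / (2 · 2^48) )

For n = 25,000,000 rows (one day's parallel load) the exponent is roughly 25,000,000² / (2 × 281,474,976,710,656) ≈ 1.1, giving P ≈ 0.67 – a better-than-even chance of at least one duplicate hash in a single day's load.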

However, once you have the values you can see the problem with only 4 rows of data.

Check this out.

CREATE TABLE #tmp
                (  [Date]       DATETIME
                   ,CountryID   TINYINT
                   ,GroupID     SMALLINT
                   ,CodeID      INT
                )
INSERT INTO #tmp 
SELECT '20090519'
        ,2
        ,4271
        ,5835066
UNION ALL
SELECT
        '20090519'
        ,2
        ,4619
        ,2546652
UNION ALL
SELECT
        '20090519'
        ,3
        ,4245
        ,2651987
UNION ALL
SELECT
        '20090519'
        ,3
        ,4657
        ,5744053;

CREATE UNIQUE CLUSTERED INDEX tmp_PK 
ON #tmp
    (    [Date]
        ,CountryID
        ,GroupID
        ,CodeID
    )

SELECT   %%lockres%% as LockHash
        ,*
FROM    #tmp
ORDER BY %%lockres%%
        ,CountryID

DROP TABLE #tmp

The results are there for all to see:

[Image: result set showing the same LockHash value for rows in CountryID 2 and CountryID 3]

CountryID 2 could deadlock with CountryID 3!

If you want to see this actually happen for yourself, just follow the steps below:

1) Run this script to create the table and the clustered index

CREATE TABLE LockResCheck
                (  [Date]       DATETIME
                   ,CountryID   TINYINT
                   ,GroupID     SMALLINT
                   ,CodeID      INT
                )
CREATE UNIQUE CLUSTERED INDEX LockResCheck_PK 
ON LockResCheck
    (    [date]
        ,CountryID
        ,GroupID
        ,CodeID
    )

2) Open two connections to the database in which you created LockResCheck

3) In Connection 1 Execute the following (N.B. Only Data in CountryID 3 is inserted in Connection 1)

BEGIN TRANSACTION

INSERT INTO LockResCheck
SELECT
        '20090519'
        ,3
        ,4245
        ,2651987

4) In Connection 2 Execute the following (N.B. Only Data in CountryID 2 is inserted in Connection 2)

BEGIN TRAN
INSERT INTO LockResCheck 
SELECT '20090519'
        ,2
        ,4271
        ,5835066
UNION ALL
SELECT
        '20090519'
        ,2
        ,4619
        ,2546652

5) In Connection 1 Execute the following (N.B. Still Only Data from  CountryID 3 is inserted in Connection 1)

INSERT INTO LockResCheck
SELECT
        '20090519'
        ,3
        ,4657
        ,5744053;

Voila – Deadlock!!
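SQL Server rolls back the victim's transaction automatically; to tidy up after the repro, something like the following in the surviving connection will do:

-- In the surviving connection, release its locks:
ROLLBACK TRANSACTION;   -- or COMMIT, if you want to keep the rows

-- Then, from either connection, drop the demo table:
DROP TABLE LockResCheck;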

Mitigation – If this happens to you, what are your options?

So we have seen the problem, but what about the resolution? The truth is that when you get into big volumes with composite keys there is no “complete” answer. However, you can take steps to make this much less likely.

Change your key to a surrogate integer based key

With a unique integer-based key you can guarantee this won't happen to you. However, if you need a BIGINT you must assume that it could still happen (but only if you are loading an obscene amount of data in parallel). Remember the hash value is 6 bytes, and whilst an INT is 4 bytes a BIGINT is 8.
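As a hedged sketch of what this option might look like (the table and constraint names here are hypothetical), the surrogate becomes the clustered key while the original combination is still enforced as a unique constraint:

CREATE TABLE dbo.BillionRowTable_Surrogate
(    RowID      INT IDENTITY(1,1) NOT NULL
    ,[Date]     DATETIME          NOT NULL
    ,CountryID  TINYINT           NOT NULL
    ,GroupID    SMALLINT          NOT NULL
    ,CodeID     INT               NOT NULL
    ,CONSTRAINT PK_BillionRowTable_Surrogate
        PRIMARY KEY CLUSTERED (RowID)                                -- 4 byte key feeding the lock hash
    ,CONSTRAINT UQ_BillionRowTable_Surrogate
        UNIQUE NONCLUSTERED ([Date], CountryID, GroupID, CodeID)     -- original business key preserved
);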

Additional Value into the Key

We could also add a value into the composite key to mix things up a bit in the hashing algorithm and add a bit more diversity to the data. If you look at those initial values I don't think I am really helping SQL Server in terms of uniqueness. That said, values of this nature are very common at scale where intelligent keys are prevalent.

In the example below I have extended the original example by adding an extra ID column to the data. You will see from the results that this has had the desired effect, but remember it has only been shown to work for the four rows in question. There is no guarantee that this will eliminate all collisions.

CREATE TABLE #tmp
                (  [Date]       DATETIME
                   ,CountryID   TINYINT
                   ,GroupID     SMALLINT
                   ,CodeID      INT
                   ,ID          INT
                )
INSERT INTO #tmp 
SELECT '20090519'
        ,2
        ,4271
        ,5835066
        ,1
UNION ALL
SELECT
        '20090519'
        ,2
        ,4619
        ,2546652
        ,2
UNION ALL
SELECT
        '20090519'
        ,3
        ,4245
        ,2651987
        ,3
UNION ALL
SELECT
        '20090519'
        ,3
        ,4657
        ,5744053
        ,4;

CREATE UNIQUE CLUSTERED INDEX tmp_PK 
ON #tmp
    (    [Date]
        ,CountryID
        ,GroupID
        ,CodeID
        ,ID
    )

SELECT   %%lockres%% as LockHash
        ,*
FROM    #tmp
ORDER BY %%lockres%%
        ,CountryID

DROP TABLE #tmp

[Image: result set showing four distinct LockHash values after adding the ID column]

Partitioning

If partitioning is an option for you then this is possibly the one to go for.

It is worth remembering that partitioning is an “Enterprise Only” feature and so is not always an available option.

By partitioning the data you will get one hobt_ID per partition. SQL Server guarantees that locks in different partitions will never collide.  As part of the process of taking a key lock SQL Server stores two pieces of information:

  1. 6 byte hobt_id
  2. 6 byte hash key value.

This ensures that the hash is scoped to the hobt_id to which it belongs.  However, that wouldn’t necessarily mean you would be 100% out of the woods as you’d still have a chance of hitting this issue inside the scope of a hobt. Having said that, by keeping the partitions reasonably tight you would be greatly minimising your chances.  Remember we had successfully loaded 900 million records before this problem surfaced!

To complete the series of examples and repros, please find below an example that shows how partitioning can deliver unique lock resource hash values.

CREATE DATABASE LockResCheck;
GO
USE LockResCheck;
GO

CREATE PARTITION FUNCTION tmp_PF1 (INT) AS RANGE RIGHT FOR VALUES (1,2,3,4,5);

CREATE PARTITION SCHEME tmp_PS1 AS PARTITION tmp_PF1 ALL TO ([PRIMARY]);

CREATE TABLE tmp
                (  [Date]       DATETIME
                   ,CountryID   TINYINT
                   ,GroupID     SMALLINT
                   ,CodeID      INT
                   ,PartitionID INT
                ) ON tmp_PS1 (PartitionID)
INSERT INTO tmp 
SELECT '20090519'
        ,2
        ,4271
        ,5835066
        ,2 %15
UNION ALL
SELECT
        '20090519'
        ,2
        ,4619
        ,2546652
        ,2 %15
UNION ALL
SELECT
        '20090519'
        ,3
        ,4245
        ,2651987
        ,3 %15
UNION ALL
SELECT
        '20090519'
        ,3
        ,4657
        ,5744053
        ,3 %15;

CREATE UNIQUE CLUSTERED INDEX tmp_PK 
ON tmp
    (    [Date]
        ,CountryID
        ,GroupID
        ,CodeID
        ,PartitionID
    ) on tmp_PS1(PartitionID)

SELECT *
FROM    sys.partitions
WHERE   [object_id] = OBJECT_ID('LockResCheck.dbo.tmp')

SELECT   %%lockres%% as LockHash
        ,*
FROM    tmp
ORDER BY %%lockres%%
        ,CountryID

[Image: sys.partitions output and result set showing lock hashes scoped per partition]

Change your insert batch size

Our issue occurred because we had a single insert statement, which is processed as a single transaction. Changing this to a batched approach wouldn't have removed the collisions, but it would have drastically reduced the likelihood of the colliding rows being locked at the same time.

In our scenario each process/procedure was inserting 5 million rows. This could have been split into batches of 10,000 rows, each taking seconds to complete, making the likelihood of a deadlock small. In the event of rows actually deadlocking we could use SSIS to resubmit the failed data, using the error output of the OLE DB Destination. One limitation here though is that the Maximum Insert Commit Size (MICS) parameter is not able to be set via an expression, so changing it would mean making a code change, as opposed to a configuration change. You would also need to implement either a recovery or a retry process in the event of a partial load, to back out the data or compensate for the failures.
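Purely as a hedged illustration of the batching idea (the table and column names follow the earlier pseudo code, and the NOT EXISTS check is one simple, if not the most efficient, way to make each batch pick up where the last one left off):

DECLARE @Date       DATETIME;
DECLARE @CountryID  TINYINT;
DECLARE @BatchSize  INT;

SET @Date      = '20090519';
SET @CountryID = 2;
SET @BatchSize = 10000;

WHILE 1 = 1
BEGIN
    -- Each iteration commits on its own, so far fewer row locks are held
    -- at any one time than with a single 5 million row insert.
    INSERT INTO dbo.BillionRowTable ([Date], CountryID, GroupID, CodeID)
    SELECT TOP (@BatchSize)
             s.[Date], s.CountryID, s.GroupID, s.CodeID
    FROM    dbo.[25MillionRowTable] s
    WHERE   s.[Date]    = @Date
    AND     s.CountryID = @CountryID
    AND     NOT EXISTS (SELECT 1
                        FROM   dbo.BillionRowTable t
                        WHERE  t.[Date]    = s.[Date]
                        AND    t.CountryID = s.CountryID
                        AND    t.GroupID   = s.GroupID
                        AND    t.CodeID    = s.CodeID);

    IF @@ROWCOUNT < @BatchSize BREAK;   -- last (partial or empty) batch loaded
END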

Wrap up

At the end of the day, what can I say about this issue? Whilst challenging and fun, it's fair to say that diagnosing it wasn't easy. I'd like to see the following things improved in the next version:

  1. Make %%lockres%% a documented feature – without this functionality I don’t think it’d have been possible to diagnose this issue.
  2. Make the deadlock graph more intuitive. At least show the lock resource in the GUI – ideally it would show the key values.
  3. Extend Books Online to cover deadlocking more comprehensively. Writing this post I looked at the examples in there and they are based on custom code that isn't part of the sample databases AdventureWorks or AdventureWorks2008.
  4. Make this less likely to happen in the first place!

If you think any of this is as important as I do then please head over to Connect and vote for the feedback items I have posted.

  1. %%LockRes%% a documented feature - https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=458076
  2. Deadlock graph more intuitive - https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=458080
  3. Improve BOL coverage of deadlocks - https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=458084
  4. Reduce likelihood of collisions - https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=458091

Finally

I think now would be an appropriate moment to say that I am co-authoring a SQL Server internals book with Christian Bolton, Justin Langford & Brent Ozar, with contributions from Cindy Gross and Jonathan Kehayias. You might say that this issue hit me at the perfect time, as I am about to start work on a locking and latching chapter. If you are interested, I will be posting updates on my authoring adventures here and on Twitter: hash tag #NewSQLBook, ID @jrowlandjones.

Thanks again to Simon Sabin and Marcel van der Holst for their kind help resolving this issue and to Paul, Peter and Kevin for providing the quality control on the editorial!

Cheers, James

@jrowlandjones

Published Thursday, May 28, 2009 10:54 PM by James.Rowland-Jones


Comments

 

Remus Rusanu said:

Good post, but you got the math wrong. At 900 million records you're guaranteed collisions. See my blog for an explanation why: http://rusanu.com/2009/05/29/lockres-collision-probability-magic-marker-16777215/

May 29, 2009 1:47 AM
 

The Rambling DBA: Jonathan Kehayias said:

This blog post by James Rowland Jones on a specific type of deadlock is so interesting that I have to

May 29, 2009 2:12 AM
 

James Rowland-Jones said:

Hi Remus,

I have read your post and it is certainly a good addition to my observations.  

I'd like to clarify one point though. I wasn't meaning to imply that you needed to load 900 million records to experience this deadlock. Quite the opposite, as I have demonstrated with only 4 records. I was making the point that this only affected this system once we had incrementally loaded 900 million.

Furthermore, we were "only" loading 25 million rows at a time, so whilst the probability based on the maths you have provided is still higher than I might have expected, it wasn't the case that we were loading the full 900 million.

Thank you for taking the time to contribute to this issue.

May 29, 2009 7:38 AM
 

Paul McMillan said:

When we encountered this issue, we seemed to alleviate it by switching off row locking on the clustered index and forcing it to use page locks via sp_indexoption ... We were then getting blocking rather than deadlocking, which is much better. As a side effect, the process seems much faster, as SQL Server does not have to manage millions of individual row locks, only thousands of page locks.

May 29, 2009 8:51 AM
 

Alejandro Mesa said:

Hi James,

Thank you for sharing this experience with us.

The article is very interesting, not only because of the use of an undocumented feature that helped to solve and understand the problem, but also for the variety of options presented to solve it, including the one from Paul McMillan.

I look forward to reading the book; no doubt it will be a valuable resource.

AMB

May 29, 2009 1:44 PM
 

Alejandro Mesa said:

In case somebody wants to read more about this, I recommend the book.

SQL Server 2005 Practical Troubleshooting: The Database Engine

http://www.amazon.com/SQL-Server-2005-Practical-Troubleshooting/dp/0321447743

You can read about this virtual column in chapter 1 "Waiting and Blocking Issues", page 34.

AMB

May 29, 2009 5:18 PM
 

James Rowland-Jones said:

I couldn't agree more Alejandro!

This is where Simon found the info on %%lockres%% in the first place.  The link to the late Ken Henderson routes directly to this book so we did credit it as the source.  I believe we just found a good context for this information.  It is however an undocumented feature and I hope you'll support me in requesting that this feature is documented in a future release.

Thanks for the comments on this post.

Cheers, James

May 29, 2009 5:23 PM
 

Alejandro Mesa said:

James,

> The link to the late Ken Henderson routes directly to this book so we did credit it

> as the source.

Sorry about that, I did not click on the link.

We should also thank Santeri Voutilainen, who was the person who wrote that chapter.

> It is however an undocumented feature and I hope you'll support me in

> requesting that this feature is documented in a future release.

I did not think twice. I voted as soon as I finished reading the article.

AMB

May 29, 2009 5:50 PM
 

YRB said:

Is it possible to reorder the columns of the clustered index such that we minimize the collision probability? I would assume a hierarchy of the following type: Country-->Group-->Code-->Date. Hence we could create a clustered index with the column order opposite to the hierarchy? Using probability against itself :)

CREATE UNIQUE CLUSTERED INDEX tmp_PK

ON #tmp

   (    

       [Date]

       ,CodeID

       ,GroupID

       ,CountryID

   )

May 29, 2009 7:23 PM
 

Alex Kuznetsov said:

Hi James,

Thank you for sharing this with us!

AK

May 29, 2009 7:42 PM
 

James Rowland-Jones said:

Hi Alex,

It was my pleasure, although I did wonder what happened to the week-end!

Cheers, James

May 30, 2009 10:07 AM
 

James Rowland-Jones said:

Hi Remus,

I have edited the post to help clarify things (by pointing them to your post).  Thanks again.

http://rusanu.com/2009/05/29/lockres-collision-probability-magic-marker-16777215/

May 30, 2009 10:49 AM
 

Remus Rusanu said:

Thank you James. Still a great post, didn't want my comment to sound negative :)

May 30, 2009 4:30 PM
 

Andrei Rinea said:

@all :

I do not understand why the size of the hash was chosen to be 6 bytes. An 8 byte hash would have been 65,536 times larger (fewer hash collisions) and would have suited x64 architectures perfectly. Well... at least in my opinion.

@author :

Could you please tell us exactly how much time, and how many people, it took to solve this?

May 31, 2009 10:30 PM
 

James Rowland-Jones said:

Hi Andrei,

I worked on this over the course of a day (I had other things on as well) once the issue was happening regularly. I worked with Simon during that evening, and with Marcel, who I believe was in the States, late into the night. Once we were able to apply %%lockres%%, and I had cracked open the deadlock graph to get the lock resources, that's when things really started to pick up in terms of progress.

The write up however took quite a bit longer! :o).

Cheers, James

June 1, 2009 8:51 AM
 

Simon Sabin said:

I can concur the write up took probably 4 times longer than the troubleshooting of the problem. It was one of those bizarre, "it just can't be happening" situations. A bit like getting 1+1 = 3, it just didn't compute.

June 1, 2009 4:36 PM
 

Simon Sabin said:

ps. The calculation error was my mistake. It was late and I hit *256 too many times :)

June 1, 2009 4:38 PM
 

When does 1+1 = 3 - SimonS Blog on SQL Server Stuff said:

June 1, 2009 4:42 PM
 

James Rowland-Jones said:

Hi Andrei,

I asked a colleague at Microsoft why they chose a six byte hash and this was their response.

"Since locking is very commonly used structure. If we use 8 bytes instead of 6, it will add 2 bytes to all lock structures. Also, 6 bytes should actually suffice but somehow we are getting more collisions. We may revisit the hashing algorithm

6 bytes => 256 *(10**12). This is a pretty huge number."

Kind Regards, James

June 2, 2009 6:26 PM
 

Log Buffer said:

"On Claypole’s World, James Roland Jones presents The Curious Case of the Dubious Deadlock and the Not So Logical Lock. The scene is set thus: [...]"

http://www.pythian.com/news/2761/log-buffer-149-a-carnival-of-the-vanities-for-dbas

 

June 5, 2009 9:16 PM
 

Mike Smith said:

Great post, and explains some stuff I've seen before.

In addition to disabling row locks mentioned in the comments, I have also had success - when and only when the data is strictly "partitioned" and therefore safe - using set context_info.

June 11, 2009 7:05 PM
 

Andy Hughes said:

James

Thanks for an excellent and informative post.

We have experienced this problem with SQL 2000 recently and were similarly confused about why this seemingly illogical deadlock was happening.

While our investigation mirrored yours initially, I'm not going to be able to prove it to this level on SQL 2000.

We use a 6-part unique key in the ETL load, but with batches of only 75000 records into tables of only 10s of millions.  The batches are similarly isolated data and also run concurrently.

The triggering event in our case was a minimal change in DB structure, and we have been able to mitigate by altering the batch size.

Unfortunately, this is mostly a production issue and in mitigating it, we have reduced our ability to troubleshoot it.  Such is life.

June 15, 2009 7:15 PM
 

Winston said:

Hi James,

To elaborate on the comment by YRB, the fundamental issue is with the composite index. The best practice for composite indexes is to order the columns from high to low cardinality. Assuming that the data is loaded on a daily basis, then the optimum column order for the index would be Code ID, Group ID, Country ID and Date. This should address the hash value problem.

Regards,

Winston.

June 18, 2009 7:13 PM
 

Claypole's World - The SQL Server Side said:

As those of you who managed to get to the bottom of my recent deadlocking post may have seen I am busy

July 19, 2009 6:30 PM
 

Yusino said:

Hi,

We also have this deadlock problem.

Our software vendor suggested we change the primary index from clustered to non-clustered to remove the deadlock.

I think clustered or non-clustered has nothing to do with the hash algorithm of locking.

Just want to get 2nd Opinion.

Regards,

Yusino

August 10, 2009 8:05 PM
 

James.Rowland-Jones said:

Hello Yusino,

Actually you may be surprised to know that the hash used on a heap is different to that used on a clustered index.  The reason this goes away on a heap is because the hash includes the slot of the row on the database page therefore you aren't going to get collisions.

However, I would be extremely careful before making this switch.  You might seriously affect both storage capacity and performance.

Cheers, James

August 10, 2009 8:25 PM
 

sqlity said:

This is a great article and, even though my article about table compares (http://www2.sqlity.net/content/working-table-compare-approach) is not directly related, it provided just the last missing piece to rest my case. Thanks.

October 30, 2009 11:54 AM
 

john.rayner said:

Great write-up!  On your fourth connect issue, MS indicate that it has been resolved in SQL 2008 R2.  Is there any chance of you independently verifying this?

Cheers,

John

January 7, 2010 11:07 AM
 

Rohit Paliwal said:

Guys....

Interesting article... but I always oppose creating a composite clustered index, and I feel it's always better to take a surrogate key approach and then have the combination as a unique non-clustered index.

Any thoughts on this

Rohit

August 11, 2010 10:50 AM
 

Claire Duncan said:

Fantastic blog.

February 6, 2011 8:17 PM
 

XML Schema Editor said:

thanks for your addition Remus, very useful indeed

March 25, 2011 10:21 AM
 

Patrick Flynn said:

If you run the code example for the 4 rows on 2008 R2 the lock collisions no longer occur.

This verifies the statements in regard to Connect Item 4

April 30, 2011 3:26 PM
 

James Rowland-Jones said:

Hello Patrick,

Thank you for your comments. Yes I have been a bit remiss here. The lock hash algorithm has been changed for R2 (in part as a response to this blog post) and so this doesn't apply any more.

Thanks for clarifying that.

Cheers, James

May 1, 2011 10:00 PM
 

Bahaa11 said:

Hello,

Great article, very insightful.

I am facing a problem as I don't understand how MAX(%%lockres%%) behaves. When you SELECT MAX(%%lockres%%) from a table it always returns one value. On the other hand, if you GROUP BY a few columns, in some cases I get a list of %%lockres%% rows as a result. Any ideas on why that is?

November 3, 2011 3:22 PM


About James.Rowland-Jones

James is an Advisory Practice Consultant with EMC Consulting. He works primarily with SQL Server technologies in architecture, development and administration capacities. He also co-ordinates our Internal SQL Server Community.

