Welcome to EMC Consulting Blogs Sign in | Join | Help

Christian Wade's Blog

SQL Server Standard - Recursive Hierarchies to XML

I wrote an article for the September 2006 issue of SQL Server Standard magazine:

 SQL Server Standard

The version that was finally published was a very cut down version with minimal code snippets.  This was for obvious reasons; they had to fit the content onto small columns in the magazine.  For the readers who would prefer the more verbose version, here it is!  I guess this is further reference material for the published article; I’m sure my friends at SQL Server Standard won’t mind.

Suppose we have a sizeable recursive hierarchy in our SQL Server 2005 relational database.  We want to export it to XML.  This could be for a variety of reasons.  We may wish to use it as a pre-cached input to a UI control (e.g. a tree control); we may wish to export it to another system using a predefined format.  The possibilities are endless.


Here is an example input dataset using the Northwind Employees table.

Northwind Employees

And this is how we want it represented in XML based on the ReportsTo self-referencing relationship.


<Employee EmployeeID="2" LastName="Fuller" FirstName="Andrew">

            <Employee EmployeeID="1" LastName="Davolio" FirstName="Nancy" />

            <Employee EmployeeID="3" LastName="Leverling" FirstName="Janet" />

            <Employee EmployeeID="4" LastName="Peacock" FirstName="Margaret" />

            <Employee EmployeeID="5" LastName="Buchanan" FirstName="Steven">

                        <Employee EmployeeID="6" LastName="Suyama" FirstName="Michael" />

                        <Employee EmployeeID="7" LastName="King" FirstName="Robert" />

                        <Employee EmployeeID="9" LastName="Dodsworth" FirstName="Anne" />

            </Employee>

            <Employee EmployeeID="8" LastName="Callahan" FirstName="Laura" />

</Employee>


On the face of it, this seems like a simple nut to crack.  However, there are various design options available to us.  This article explores some of the options.  Performance is deemed the overriding factor when evaluating the optimal approach.


It is assumed that it is beneficial to convert the data to XML in the database; i.e. in-process with SQL Server.  However, the overall evaluation of the different approaches would be the same even if the transformation to the desired format took place in the middle tier.


All the code in this post can be downloaded from here.  This is a link to a zip file containing a solution and a SQL Server project.  To use it, it is necessary to change the database connection string in the project properties.  The TSQL scripts that generate the XML output can be found in the Option1.sql and Option2.sql solution files.


Note: A Northwind database restored to SQL Server 2005 is also required to run the code.  It is necessary to build and deploy the above project.  A backup of Northwind from SQL Server 2005 can be downloaded from here.



What were the options we had in SQL Server 2000 for exporting relational data to XML?  We had the FOR XML clause of the SELECT statement.  Incidentally, SQL Server 2005 has a new mode called FOR XML PATH.  This mode is powerful because it allows us to use XPath-type syntax to define our XML structure.  FOR XML PATH will meet about 95% of the use cases for which we traditionally had to resort to FOR XML EXPLICIT.  FOR XML PATH is a lot less cumbersome and easy to use.  However, both these options require us to know the depth of the lowest level node when writing the query.  For recursive hierarchies, we do not actually know the lowest depth beforehand because it is variable.  This means it is not feasible to use FOR XML and directly arrive at our desired XML output.  The best we can do with FOR XML is to run the output through an XSLT transform.



Option 1: FOR XML AUTO and XSLT

The SQL Server 2005 functionality for XML provided through XQuery and XML indexing can be considerably enhanced by using SQL CLR to apply XSLT transformations.  Here we are using a SQL CLR function to convert the output of a FOR XML AUTO query against our Northwind dataset in order to arrive at the desired output.


DECLARE @input xml, @stylesheet xml

SELECT            @input =

                        (SELECT EmployeeID, ReportsTo, LastName, FirstName

                        FROM Employees AS Employee

                        FOR XML AUTO, ROOT('Employees')),

                        @stylesheet =

                        CAST(

'<?xml version="1.0" encoding="UTF-8"?>

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

            <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>

            <xsl:template match="/Employees" >

                        <xsl:apply-templates select="Employee[not(@ReportsTo)]" />

            </xsl:template>

            <xsl:template match="Employee[not(@ReportsTo)]">

                        <xsl:element name="Employee">

                                    <xsl:attribute name="EmployeeID">

                                                <xsl:value-of select="@EmployeeID" />

                                    </xsl:attribute>

                                    <xsl:attribute name="LastName">

                                                <xsl:value-of select="@LastName" />

                                    </xsl:attribute>

                                    <xsl:attribute name="FirstName">

                                                <xsl:value-of select="@FirstName" />

                                    </xsl:attribute>

                                    <xsl:call-template  name="FindChildNodes">

                                                <xsl:with-param name="ReportsTo" select="@EmployeeID" />

                                    </xsl:call-template>

                        </xsl:element>

            </xsl:template>

            <xsl:template name="FindChildNodes">

                        <xsl:param name="ReportsTo" />

                        <xsl:for-each select="//Employees/Employee[@ReportsTo=$ReportsTo]">

                                    <xsl:variable name="EmployeeID" select="@EmployeeID" />

                                    <xsl:element name="Employee">

                                                <xsl:attribute name="EmployeeID">

                                                            <xsl:value-of select="@EmployeeID" />

                                                </xsl:attribute>

                                                <xsl:attribute name="LastName">

                                                            <xsl:value-of select="@LastName" />

                                                </xsl:attribute>

                                                <xsl:attribute name="FirstName">

                                                            <xsl:value-of select="@FirstName" />

                                                </xsl:attribute>

                                                <!-- only call template recursively if EmployeeID is a ReportsTo of another node -->

                                                <xsl:if test="count(//Employees/Employee[@ReportsTo = $EmployeeID]) > 0">

                                                            <xsl:call-template  name="FindChildNodes">

                                                                        <xsl:with-param name="ReportsTo" select="@EmployeeID" />

                                                            </xsl:call-template>

                                                </xsl:if>

                                    </xsl:element>

                        </xsl:for-each>

            </xsl:template>

</xsl:stylesheet>

' AS xml)

SELECT dbo.ApplyTransform(@input, @stylesheet)


Further discussion regarding the implementation of the XSLT stylesheet is outside the scope of this article.


Here is an implementation of the ApplyTransform function that receives the stylesheet.  It basically just exposes the System.Xml.Xsl.CompiledTransform class’ Transform method to TSQL.


using System;

using System.Data;

using System.Data.SqlClient;

using System.Data.SqlTypes;

using System.Xml;

using System.Text;

using System.IO;

using Microsoft.SqlServer.Server;

 [SqlFunction(DataAccess = DataAccessKind.None, SystemDataAccess = SystemDataAccessKind.None, IsDeterministic = false, IsPrecise = false)]

[return: SqlFacet(IsFixedLength = false, MaxSize = -1)]

public static SqlXml ApplyTransform(SqlXml data, SqlXml styleSheet)

{

    // Create a stream for the xml output

    MemoryStream ms = new MemoryStream();

    XmlWriter xw = XmlWriter.Create(ms);

    // Load the XML and transform it

    XslCompiledTransform ctx = new XslCompiledTransform(false);

    ctx.Load(styleSheet.CreateReader());

    ctx.Transform(data.CreateReader(), xw);

    // return the output

    return new SqlXml(ms);

}


Assuming some prerequisite knowledge of XSL stylesheets, this appears to be a good option to achieve our goal.  However, let us evaluate others before deciding the optimum choice.



We could of course simply access the relational data in a SQL CLR stored procedure and write it to XML.  However, if we don’t access the data in the same order it will appear in the XML, we will have to load the whole structure into the DOM.  Loading the whole structure into the DOM means loading it into memory; if we are talking about a million rows, we may well not have enough memory available!  The ‘right order’ is as follows.  This allows us to write it to XML sequentially, which is vastly more performant and efficient.

  • Node1
    • Node2
      • Node3
      • Node4
    • Node5

How can we return the data in this order from a single TSQL query?  Probably the best option is to use a recursive Common Table Expression (CTE).  CTEs are a new feature in SQL Server 2005.  They can be conceptualized similarly to views and derived tables in that they encapsulate an inner query.  However, they differentiate themselves from views and derived tables in the way they deal with recursion.  The syntax for a recursive CTE is as follows.


WITH RecursiveCTE(<column_list>)

AS

(

            -- Anchor Member:

            SELECT ...

            FROM <some_table(s)>

            ...

           

            UNION ALL

           

            -- Recursive Member

            SELECT ...

            FROM <some_table(s)>

            JOIN RecursiveCTE

            ...

)

-- Outer Query

SELECT ...

FROM RecursiveCTE

...


The WITH clause is the definition of the CTE and it precedes the outer query, which refers back to the CTE.  Within the WITH clause, the anchor member is a SELECT statement that acts as the seed for recursion.  It is merged using the UNION ALL operator to the recursive member, which is a SELECT statement that refers back to the CTE; hence it is recursive.


Note: it was possible to return the data in the ‘right order’ from a single TSQL query in SQL Server 2000 using a recursive table-valued user-defined function.  I tried this approach and the performance is far worse than the recursive CTE.  For large datasets, the execution time was multiplied by a factor of at least 5!  For example code of how this would be done using a recursive table-valued user-defined function, see this old blog post of mine.

Note: this whitepaper describes a way of avoiding using a table-valued UDF.  It does however use a scalar UDF, so it is still limited to 32 nested recursions.  If you know your hierarchies will never exceed 32 nested recursions, the scalar UDF method is probably a good way to go.  However, given the nature of recursive hierarchies, this may be an assumption you cannot make.

Note: recursive CTEs are by default limited to 100 nested recursions.  However, this only acts as a safeguard against infinite recursion.  This limit can be changed using the MAXRECURSION query hint to anything up to 32,767 nested recursions.  Recursive table-valued user-defined functions, on the other hand, are limited to 32 nested recursions.


Note: FOR XML cannot be used by the anchor or recursive member SELECT statements in a recursive CTE.  This is why we need a CLR routine to present the data as XML.



Option 2: Recursive CTE and sequential writing of XML


The usp_OrderedEmployeeHierarchy stored procedure below returns the data in the order we require.


CREATE PROCEDURE usp_OrderedEmployeeHierarchy (@Seed int)

AS

            ;WITH EmployeeCTE(EmployeeID, ReportsTo, LastName, FirstName, Depth, SortCol)

            AS

            (

              SELECT EmployeeID, ReportsTo, LastName, FirstName, 0, CAST(EmployeeID AS varbinary(max))

              FROM Employees

              WHERE EmployeeID = @Seed

              UNION ALL

              SELECT E.EmployeeID, E.ReportsTo, E.LastName, E.FirstName, M.Depth+1,

                        CAST(SortCol + CAST(E.EmployeeID AS binary(4)) AS varbinary(max))

              FROM Employees AS E

                        JOIN EmployeeCTE AS M

                          ON E.ReportsTo = M.EmployeeID

            )

            SELECT

              EmployeeID, ReportsTo, LastName, FirstName, Depth

            --,SortCol

            FROM EmployeeCTE

            ORDER BY SortCol

GO


When passing a @Seed parameter of 2, the following dataset is returned.


Ordered Northwind Employees

The way the CTE handles ordering the data is through the SortCol column.  For the seed of recursion, this column is the EmployeeID cast to varbinary(max).  For each subsequent recursive member, it is the existing SortCol value concatenated with the binary representation of the EmployeeID.  If we include it in the output, the following dataset is returned.


Ordered Northwind Employees with SortCol

As shown, the SortCol column provides a useful structure on which we can sort the dataset.  The data is thereby retrieved in the required order.


The Depth column is initialized at zero for the seed of the recursion and then incremented by one for each recursion.


This is the SQL CLR stored procedure that writes the XML sequentially.


using System;

using System.Data;

using System.Data.SqlClient;

using System.Data.SqlTypes;

using System.Xml;

using System.Text;

using System.IO;

using Microsoft.SqlServer.Server;

[Microsoft.SqlServer.Server.SqlProcedure(Name = "usp_GetEmployeeHierarchyXml")]

public static void GetEmployeeHierarchyXml(SqlInt32 seed)

{

    // Create a stream for the xml output

    MemoryStream ms = new MemoryStream();

    XmlWriter xw = XmlWriter.Create(ms);

    // Depth counters to identify when to close elements

    int previousDepth = -1;

    int newDepth = 0;

    using (SqlConnection conn = new SqlConnection("context connection=true"))

    {

        conn.Open();

        using (SqlCommand cmd = conn.CreateCommand())

        {

            cmd.CommandType = CommandType.StoredProcedure;

            cmd.CommandText = "usp_OrderedEmployeeHierarchy";

            cmd.Parameters.Add(new SqlParameter("@Seed", seed));

            SqlDataReader reader = cmd.ExecuteReader();

            while (reader.Read())

            {

                // Get the depth of the current node

                newDepth = Convert.ToInt32(reader["Depth"]);

                // Do we need close an open element(s)?

                CloseOpenElements(xw, previousDepth, newDepth);

                xw.WriteStartElement("Employee");

                xw.WriteAttributeString("EmployeeID", Convert.ToString(reader["EmployeeID"]));

                xw.WriteAttributeString("LastName", Convert.ToString(reader["LastName"]));

                xw.WriteAttributeString("FirstName", Convert.ToString(reader["FirstName"]));

                previousDepth = newDepth;

            }

            // Now we are back at depth 0; do we need close an open element(s)?

            newDepth = 0;

            CloseOpenElements(xw, previousDepth, newDepth);

            xw.Close();

            // Return the results to the client

            SqlDataRecord record;

            record = new SqlDataRecord(new SqlMetaData[] { new SqlMetaData("EmployeeHierarchy", SqlDbType.Xml) });

            // Set the record field.

            record.SetSqlXml(0, new SqlXml(ms));

            // Return the record to the client.

            SqlContext.Pipe.Send(record);

        }

    }

}

private static void CloseOpenElements(XmlWriter xw, int previousDepth, int newDepth)

{

    if (newDepth <= previousDepth)

        for (int i = 0; i <= previousDepth - newDepth; i++)

            xw.WriteEndElement();

}


The GetEmployeeHierarchyXml stored procedure executes the usp_OrderedEmployeeHierarchy stored procedure for data access.  A System.Xml.XmlWriter object is used to sequentially write the XML.  The Depth column is used to keep track of when to close open XML elements.  A Microsoft.SqlServer.Server.SqlDataRecord object is used to return a one-row, one-column, XML-datatype result set to the client.  This is done using the static Send method, which is a member of the Microsoft.SqlServer.Server.SqlContext class’ Pipe property.  The SqlContext class is a class dedicated solely to in-process data access.  It contains other interesting properties and methods, which are outside the scope of this article.


Executing usp_GetEmployeeHierarchyXml returns the XML in the desired format.



Some readers may be asking the question as to why we do not simply write a SQL CLR stored procedure that retrieves a set of child nodes at a time for every single node in the hierarchy.  This does mean a lot more TSQL queries being executed, but is conceptually the simplest approach.  However, this is not feasible because multiple active resultsets (MARS) is not supported in SQL CLR.  We would therefore have to instantiate many connection objects.  This also presents a problem because we can only have one open ‘context’ connection at a time.  I did attempt this approach processing the resultset in a client-side application and the performance was far worse than Option 1 or 2.  This approach has therefore been ruled out.


So, here is the question we’ve all been waiting for: which is the most efficient choice?  I tried both approaches on a product hierarchy, which happened to originate from Analysis Services, containing 8,700 members.  Here are the findings.  Note that Option 2 assumes a cached execution plan for the usp_OrderedEmployeeHierarchy stored procedure.


Option 1: 38 seconds

Option 2: 15 seconds


This is a performance gain of over 60% !!


We can therefore conclude that CTEs provide a well architected method for accessing recursive relational data.  A prime example of this is when exporting such data to XML.


Published Wednesday, September 20, 2006 3:21 PM by christian.wade

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Dave Parkinson said:

Your CTE doesn't work properly on my installation of SQL Server.  This is because the + operator adds rather than concatenating.  It's probably due to the server properties being different on my SQL Server.  Maybe it's a collation difference.  Why not just use lastname + firstname instead.  This always works and most people would want the data ordered by name anyway.  If you want to use the ID then the sort expression needs to be tweaked to work on all SQL Server installations.  Converting the ID to a hex string is easy enough but that's character data again.  I understand that character data takes up more space but there's probably a memory vs. time trade off anyway because lastname + firstname takes no time at all.

February 22, 2011 12:23 PM
 

Brett Anthony said:

Christian, your method of ordering saved my butt!

I was trying to work this out using partition by etc but I never even thought of the concatination of binarys, works great thanks heaps.

June 9, 2011 6:18 AM

Leave a Comment

(required) 
(optional)
(required) 
Submit
Powered by Community Server (Personal Edition), by Telligent Systems