Hierarchical XML - importing into SQL Server
Loading XML into a SQL Server 2000 database is now rather easy thanks to the
inbuilt XML features (such as OPENXML) and the freely downloadable SQLXML add-on.
It becomes a little more complicated when loading in a hierarchy unless you have
some way to uniquely identify each element, or you're comfortable getting deep into
the Updategram structure supported by SQLXML (be ready to add lots of "at-identity"
attribute to your XML).
At first glance it would seem impossible to import an infinite-depth hierarchy,
as you would need your SQL statements to be recursive.
I pondered over this very problem as I had a need to import information like
this, and now I am sharing my findings.
Firstly lets start with some simple XML:
<orgTree>
<Employee FirstName="Andrew" LastName="Fuller" Title="Vice President, Sales">
<Employee FirstName="Nancy" LastName="Davolio" Title="Sales Representative" />
<Employee FirstName="Janet" LastName="Leverling" Title="Sales Representative" />
<Employee FirstName="Margaret" LastName="Peacock" Title="Sales Representative" />
<Employee FirstName="Steven" LastName="Buchanan" Title="Sales Manager">
<Employee FirstName="Michael" LastName="Suyama" Title="Sales Representative" />
<Employee FirstName="Robert" LastName="King" Title="Sales Representative" />
<Employee FirstName="Anne" LastName="Dodsworth" Title="Sales Representative" />
</Employee>
<Employee FirstName="Laura" LastName="Callahan" Title="Inside Sales Coordinator" />
</Employee>
</orgTree>
If you're interested in how I got that XML from the Northwind database, please
see a related article "Hierarchical XML - exporting from SQL Server" (link coming
soon).
As we can see from the XML, it could theoretically continue in depth forever.
How do we process that!?
Take a look at the following SQL:
USE [tempdb]
if exists (select * from dbo.sysobjects where id = object_id(N'[Employees]') and OBJECTPROPERTY(id, N'IsUserTable') = 1)
drop table [Employees]
GO
CREATE TABLE [Employees] (
[EmployeeID] [int] IDENTITY (1, 1) NOT NULL ,
[ReportsTo] [int] NULL,
[LastName] [nvarchar] (20) NOT NULL ,
[FirstName] [nvarchar] (10) NOT NULL ,
[Title] [nvarchar] (30) NULL ,
CONSTRAINT [PK_Employees] PRIMARY KEY CLUSTERED
(
[EmployeeID]
),
CONSTRAINT [FK_Employees_Employees] FOREIGN KEY
(
[ReportsTo]
) REFERENCES [Employees] (
[EmployeeID]
)
)
GO
if exists (select * from dbo.sysobjects where id = object_id(N'[dbo].[ImportEmployeeHierarchyXml]') and OBJECTPROPERTY(id, N'IsProcedure') = 1)
drop procedure [dbo].[ImportEmployeeHierarchyXml]
GO
CREATE PROC dbo.ImportEmployeeHierarchyXml
(
@Xml text,
@Debug bit = 0
)
AS
DECLARE
@xmlDoc int,
@intCacheID int,
@intDbID int,
@intLoopCount int,
@intMaxLoopCount int
DECLARE @tblCache TABLE
(
ID int NOT NULL,
ParentID int NULL,
DbID int NULL,
ParentDbID int NULL,
FirstName varchar(10),
LastName varchar(20),
Title varchar(30)
)
SET @intLoopCount = 0
SET @intMaxLoopCount = 10000
EXEC sp_xml_prepareDocument @xmlDoc OUTPUT, @Xml
INSERT INTO @tblCache
(
ID,
ParentID,
FirstName,
LastName,
Title
)
SELECT
ID,
CASE
WHEN [ParentID] IS NULL OR [ParentID] = 0 OR ParentNodeName <> 'Employee' THEN NULL
ELSE [ParentID]
END AS [NewParentID],
FirstName,
LastName,
Title
FROM
OPENXML( @xmlDoc, '//Employee', 8 )
WITH
(
[ID] int '@mp:id',
[ParentID] int '@mp:parentid',
[ParentNodeName] varchar(50) '@mp:parentlocalname',
[FirstName] varchar(10) '@FirstName',
[LastName] varchar(20) '@LastName',
[Title] varchar(30) '@Title'
)
IF @Debug = 1
BEGIN
SELECT * FROM @tblCache
END
SELECT TOP 1 @intCacheID = ID
FROM @tblCache
WHERE
DbID IS NULL
AND
(
ParentID IS NULL
OR
ParentDbID IS NOT NULL
)
WHILE @intCacheID IS NOT NULL AND @intLoopCount < @intMaxLoopCount
BEGIN
INSERT INTO Employees
(
ReportsTo,
FirstName,
LastName,
Title
)
SELECT TOP 1
ParentDbID,
FirstName,
LastName,
Title
FROM
@tblCache
WHERE
ID = @intCacheID
SET @intDbID = SCOPE_IDENTITY()
UPDATE @tblCache
SET DbID = @intDbID
WHERE ID = @intCacheID
UPDATE @tblCache
SET ParentDbID = @intDbID
WHERE ParentID = @intCacheID
SET @intCacheID = NULL
SELECT TOP 1 @intCacheID = ID
FROM @tblCache
WHERE
DbID IS NULL
AND
(
ParentID IS NULL
OR
ParentDbID IS NOT NULL
)
SET @intLoopCount = @intLoopCount + 1
END
IF @Debug = 1
BEGIN
SELECT * FROM @tblCache
SELECT * FROM Employees
END
EXEC sp_xml_removeDocument @xmlDoc
The above code does a few things:
- Because this is a sample, it switches to the "tempdb" database (this database
is emptied each time SQL Server starts up).
- Creates a new table called "Employees", with a sub-set of the fields in
the Northwind.dbo.Employees.
- Creates a new stored procedure called "ImportEmployeeHierarchyXml" which
does the following:
- Declares an in-memory table to cache the XML, note the comment starting
with "-- **" as this is where you'll need to change it to customise it for
your own setup.
- Sets some initial loop checking variables, this is just a safety check
to prevent a dead loop, currently it will loop a maximum of 10,000 times.
- Get a handle to an XML DOM object for the XML text passed in.
- Load the XML into the cache table variable using OPENXML.
- Grabs the ID of the first cache item to add.
- Executes a while loop to insert each item, update it's cache entry with
database ID's, and grab the ID for the next item.
Most of the code is pretty straight-forward and easy to follow, but some things
might be worth explaining further:
- The OPENXML statement provides some extra meta-property attributes which
I have used, one of particular note is "@mp:parentid" which will return either
NULL (if the current node is the root), 0 (if the parent node is the root node),
or an integer that allows us to uniquely identify the parent. The case statement
allows us to set base level node(s) in our tree to have a parent id of NULL.
- The stored procedure takes an optional @Debug parameter, which if set to
1 will output the before/after view of the Cache table, and the after view of
the Employees table to help you debug.
- Anywhere you see a comment that starts with two *'s (-- ** comment)
it's something you'll need to update if you use this for your own purposes.
- This methodology assumes that your table uses an IDENTITY column to uniquely
identify each row, if you use something different you'll need to make a few
changes.
The next step is to try it out, here's a sample script you can run:
USE [tempdb]
TRUNCATE TABLE Employees
EXEC ImportEmployeeHierarchyXml '<orgTree>
<Employee FirstName="Andrew" LastName="Fuller" Title="Vice President, Sales">
<Employee FirstName="Nancy" LastName="Davolio" Title="Sales Representative" />
<Employee FirstName="Janet" LastName="Leverling" Title="Sales Representative" />
<Employee FirstName="Margaret" LastName="Peacock" Title="Sales Representative" />
<Employee FirstName="Steven" LastName="Buchanan" Title="Sales Manager">
<Employee FirstName="Michael" LastName="Suyama" Title="Sales Representative" />
<Employee FirstName="Robert" LastName="King" Title="Sales Representative" />
<Employee FirstName="Anne" LastName="Dodsworth" Title="Sales Representative" />
</Employee>
<Employee FirstName="Laura" LastName="Callahan" Title="Inside Sales Coordinator" />
</Employee>
</orgTree>'
SELECT * FROM Employees
Once again we're using the tempdb. First we make sure the Employees table is
empty, then we execute our stored procedure, then select all the rows to see if
it was all added as we expected. At this point your results should be rather satisfying!
So, here I've demonstrated how it can be done, all that's left is for your to
play with it and see if it suits your needs.
I'm keen to hear any feedback you have, good or bad. I'll update this article
as required if you spot anything that needs changing (I spotted a few errors myself
already, but I'll wait till someone else points them out first! :-) How many can
you find?).
A challenge: If you're keen to extend this, try designing something to handle
an XML document with different types of nodes, e.g. an XML representation of a file
system that has <Folder> and <File> nodes, each with their own attributes.