Archives

Theory and Practice of Database and Data Analysis (3) – Exhaustive Search by Traversing Table Relationships

Thursday, September 29, 2011
General Software Development SQL Server

In this post, I will discuss building a script that finds all the relevant data starting from a table name and a primary key value. In the previous post, I discussed how to get a list of the tables that reference a given table. From a parent table, we just need to get a list of its child tables. We then loop through each child table and select records using the parent's primary key value. If the child tables themselves have child tables, we repeat the same approach, recursively getting the grandchildren.

I often need to remote-desktop into clients' systems where I have only read privilege; I can neither create stored procedures nor execute them. So I am going to present a script that requires only read privilege.

In a Transact-SQL (T-SQL) batch, looping through records can be implemented using a cursor or a temp table. I used the latter approach because a temp table can grow, so I can use it for recursion as well.

Recursion is a harder problem here. In most programming languages, recursion is done by having a procedure call itself. However, that is not possible here since I am restricted to a batch. In computer science, recursion can be implemented as a loop with a stack. In many programming languages, when a caller calls a callee, the program saves the caller's local data and return location onto the stack. The area of the stack used by each call is called a stack frame. Once the callee returns, the program restores the local data from the stack and resumes from the saved location.
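
As a minimal sketch of the loop-with-a-stack idea, the batch below walks a small hierarchy without any procedure calls. The table names in @edges are hypothetical stand-ins for foreign key relationships, and the sketch uses table variables rather than the temp tables of the script in this post; it is an illustration of the technique, not that script.

```sql
-- Hypothetical parent/child pairs standing in for foreign key relationships
declare @edges table (parent varchar(50), child varchar(50))
insert into @edges (parent, child) values ('Invoices', 'InvoiceLines')
insert into @edges (parent, child) values ('Invoices', 'Payments')
insert into @edges (parent, child) values ('InvoiceLines', 'LineTaxes')

-- The stack: the identity column records the push order
declare @stack table (pos int not null identity(1,1), name varchar(50), depth int)
declare @pos int
declare @name varchar(50)
declare @depth int

--push the starting table
insert into @stack (name, depth) values ('Invoices', 0)

while exists (select * from @stack)
begin
    --pop: take the most recently pushed row
    select top 1 @pos = pos, @name = name, @depth = depth
    from @stack
    order by pos desc

    delete from @stack where pos = @pos

    --"visit" the table; indentation shows the depth
    print replicate('  ', @depth) + @name

    --push the children; they are popped on later iterations
    insert into @stack (name, depth)
    select child, @depth + 1
    from @edges
    where parent = @name
end
```

Popping the newest row gives a depth-first walk. Popping the oldest row instead (order by pos asc) turns the stack into a queue and the walk into a breadth-first one.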

In my script, I use the temp table #ref_contraints to accomplish both looping and recursion. Referential constraints reachable from the top table are added to the table as they are discovered and removed from the table when they are consumed (that is, no longer needed). I used depth-first traversal in my script. With this structure, I can change to breadth-first traversal with minimal effort.

The #keyvalues temp table contains the primary key values for the tables that I have already traversed. I can get the records from a child table by joining to the key values in this table, so I only need to query each child table once for each foreign key relationship.

In order to make the script simpler, I omit the schema and assume all tables are under the schema “dbo”. This works with our database and with the databases in Microsoft Dynamics CRM 2011.

I have also assumed that every primary key consists of a single integer column. This is true of our database. It is also true of Microsoft Dynamics CRM 2011, except that you need to change the integer id to a GUID.
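
Before running the script against an unfamiliar database, it may be worth checking whether these assumptions hold. The following queries are a sketch of such a check using only INFORMATION_SCHEMA views; an empty result from the second and third queries, and a single "dbo" row from the first, suggests the assumptions are met.

```sql
--Which schemas contain tables? The script assumes only dbo.
select TABLE_SCHEMA, count(*) as table_count
from INFORMATION_SCHEMA.TABLES
where TABLE_TYPE = 'BASE TABLE'
group by TABLE_SCHEMA

--Tables whose primary key spans more than one column
select kcu.TABLE_SCHEMA, kcu.TABLE_NAME, count(*) as key_columns
from INFORMATION_SCHEMA.TABLE_CONSTRAINTS tc
inner join INFORMATION_SCHEMA.KEY_COLUMN_USAGE kcu
    on tc.CONSTRAINT_SCHEMA = kcu.CONSTRAINT_SCHEMA
    and tc.CONSTRAINT_NAME = kcu.CONSTRAINT_NAME
where tc.CONSTRAINT_TYPE = 'PRIMARY KEY'
group by kcu.TABLE_SCHEMA, kcu.TABLE_NAME
having count(*) > 1

--Primary key columns that are not of type int
select kcu.TABLE_SCHEMA, kcu.TABLE_NAME, kcu.COLUMN_NAME, c.DATA_TYPE
from INFORMATION_SCHEMA.TABLE_CONSTRAINTS tc
inner join INFORMATION_SCHEMA.KEY_COLUMN_USAGE kcu
    on tc.CONSTRAINT_SCHEMA = kcu.CONSTRAINT_SCHEMA
    and tc.CONSTRAINT_NAME = kcu.CONSTRAINT_NAME
inner join INFORMATION_SCHEMA.COLUMNS c
    on c.TABLE_SCHEMA = kcu.TABLE_SCHEMA
    and c.TABLE_NAME = kcu.TABLE_NAME
    and c.COLUMN_NAME = kcu.COLUMN_NAME
where tc.CONSTRAINT_TYPE = 'PRIMARY KEY'
    and c.DATA_TYPE <> 'int'
```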

I added comments to the script so that one can see where the code would correspond to looping and procedure calls if it were written in a language like VB.NET or C#. The comments also indicate the insertion points if additional code is needed.

--input values
declare @ptable varchar(256)
declare @pid int
set @ptable = 'Invoices'
set @pid = 55813

--variables
declare @sql varchar(2000)
declare @pcolumn varchar(256) --primary key column of the parent table
declare @ftable varchar(256) --child table name
declare @fcolumn varchar(256) --column of the child table that links to the primary key of the parent table
declare @fpcolumn varchar(256) --primary key column of the child table
declare @frame int
declare @rid int
set @frame = 0

--Temp table used to hold key values in related tables
--drop table #keyvalues
--select * from #keyvalues order by id asc
create table #keyvalues (
    id int not null identity(1,1),
    tablename varchar(256),
    keyvalue int
)

--Temp table used to hold referential constraints so that we can navigate down
--drop table #ref_contraints
--select * from #ref_contraints order by rid
create table #ref_contraints (
    rid int not null identity (1,1),
    frame int,
    ptable varchar(256),
    pcolumn varchar(256),
    ftable varchar(256),
    fcolumn varchar(256)
)

--get the primary column name
select @pcolumn = kcu.COLUMN_NAME
from INFORMATION_SCHEMA.TABLE_CONSTRAINTS tc
inner join INFORMATION_SCHEMA.KEY_COLUMN_USAGE kcu
    on tc.CONSTRAINT_NAME = kcu.CONSTRAINT_NAME
where tc.CONSTRAINT_TYPE = 'PRIMARY KEY'
    and tc.TABLE_NAME = @ptable

--select the parent record
set @sql = 'select t1.* from ' + @ptable + ' t1 where ' + @pcolumn + '=' + cast(@pid as varchar(10))
--print @sql
exec (@sql)

if (@@ROWCOUNT > 0)
begin
    --Save the key value of the parent record
    set @sql = 'insert into #keyvalues select ''' + @ptable + ''', t1.' + @pcolumn
        + ' from ' + @ptable + ' t1 where ' + @pcolumn + '=' + cast(@pid as varchar(10))
    exec (@sql)

    GOTO GET_CHILDREN_BEGIN --call the get_children function
GET_CHILDREN_RETURN2: --return from the last get_children function
    drop table #keyvalues
    drop table #ref_contraints
    GOTO BATCH_END
end

GET_CHILDREN_BEGIN: --parameters are @ptable and
    set @frame = @frame + 1

    insert into #ref_contraints(frame, ptable, pcolumn, ftable, fcolumn)
    select @frame, kcup.TABLE_NAME, kcup.COLUMN_NAME, kcuf.TABLE_NAME, kcuf.COLUMN_NAME
    from INFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS rc
    inner join INFORMATION_SCHEMA.KEY_COLUMN_USAGE kcup
        on kcup.CONSTRAINT_SCHEMA = rc.UNIQUE_CONSTRAINT_SCHEMA
        and kcup.CONSTRAINT_NAME = rc.UNIQUE_CONSTRAINT_NAME
    inner join INFORMATION_SCHEMA.KEY_COLUMN_USAGE kcuf
        on kcuf.CONSTRAINT_SCHEMA = rc.CONSTRAINT_SCHEMA
        and kcuf.CONSTRAINT_NAME = rc.CONSTRAINT_NAME
        and kcup.ORDINAL_POSITION = kcuf.ORDINAL_POSITION
    left outer join #ref_contraints trc --Exclude existing records
        on trc.frame = @frame
        and trc.ptable = kcup.TABLE_NAME
        and trc.pcolumn = kcup.COLUMN_NAME
        and trc.ftable = kcuf.TABLE_NAME
        and trc.fcolumn = kcuf.COLUMN_NAME
    where kcup.TABLE_NAME = @ptable
        and not (kcuf.TABLE_NAME = @ptable) --Exclude foreign key relationship within the same table
        and trc.frame is null

FOR_EACH_CHILD_LOOP_BEGIN:
    select @rid = rid, @ptable = ptable, @pcolumn = pcolumn, @ftable = ftable, @fcolumn = fcolumn
    from #ref_contraints
    where frame = @frame

    if (@@ROWCOUNT > 0)
    begin
        delete #ref_contraints where rid = @rid --delete the one that is used

        set @sql = 'select ''' + CAST(@frame as varchar(10)) + ''',''' + @ptable + ''', ''' + @ftable
            + ''', t1.* from ' + @ftable + ' t1 inner join #keyvalues t2 on t1.' + @fcolumn
            + '= t2.keyvalue where t2.tablename=''' + @ptable + ''''
        --print @sql
        exec (@sql)

        if (@@ROWCOUNT > 0)
        begin
            --get the primary column name
            select @fpcolumn = kcu.COLUMN_NAME
            from INFORMATION_SCHEMA.TABLE_CONSTRAINTS tc
            inner join INFORMATION_SCHEMA.KEY_COLUMN_USAGE kcu
                on tc.CONSTRAINT_NAME = kcu.CONSTRAINT_NAME
            where tc.CONSTRAINT_TYPE = 'PRIMARY KEY'
                and tc.TABLE_NAME = @ftable

            --outer join t3 to avoid duplicated records
            set @sql = 'insert into #keyvalues select distinct ''' + @ftable + ''', t1.' + @fpcolumn
                + ' from ' + @ftable + ' t1 inner join #keyvalues t2 on t1.' + @fcolumn
                + '= t2.keyvalue left outer join #keyvalues t3 on t3.tablename=''' + @ftable
                + ''' and t3.keyvalue=t1.' + @fpcolumn
                + ' where t3.id is null and t2.tablename=''' + @ptable + ''''
            exec (@sql)

            set @ptable = @ftable --The child table is the new parent table now
            GOTO GET_CHILDREN_BEGIN --Call get_children
GET_CHILDREN_RETURN1:
        end
    end
    else
    begin
        GOTO FOR_EACH_CHILD_LOOP_EXIT
    end

FOR_EACH_CHILD_LOOP_END:
    --Clean up before returning to the loop
    --On finishing cleaning up, return to the beginning of the loop
    GOTO FOR_EACH_CHILD_LOOP_BEGIN

FOR_EACH_CHILD_LOOP_EXIT:
    --Execute code after the FOR_EACH_CHILD_LOOP here
    --On finishing everything, exit the routine
    GOTO GET_CHILDREN_END

GET_CHILDREN_END:
    --clean up the frame here
    set @frame = @frame - 1
    if @frame > 0
        GOTO GET_CHILDREN_RETURN1
    else
        GOTO GET_CHILDREN_RETURN2

BATCH_END:
--end of the batch

In future posts of this series, I will discuss how to capture human intelligence to make the search even more powerful.

The case for living in the cloud

Monday, September 26, 2011
General Software Development

These days one cannot go a day without hearing about the cloud; vendors are pushing it. Google’s Chromebook is already nothing but the cloud. Apple’s iOS and Microsoft’s upcoming Windows 8 also have increased cloud features. Until now, I have been skeptical about the cloud. Apart from security, my primary concern is what happens if I lose my connection to the internet.

Recently, I have been migrating to the cloud. The main reason is that I now triple-boot my laptop. I used to run Windows 7 as my primary OS and everything else as virtual machines. However, the performance of these virtual machines was less than ideal. Windows Virtual PC on Windows 7 would not run a 64-bit OS, so I had to rely on VirtualBox to run Windows 2008 R2. Recently, I have been booting Windows 8 Developer Preview and SharePoint/Windows 2008 from VHD. This is my most satisfying configuration so far. Although the VHD is slightly slower than the real hard drive, I have full access to the rest of the hardware. The issue I am facing now is that I need to access my data no matter which OS I boot. That motivated me to migrate to the cloud.

My primary data are my email, documents and code.

Email is the least of my concerns. Both Google and Hotmail offer ample space for me to store my email. I do download a copy of my email into Windows 7 so that I can read it when I do not have access to the internet.

Documents have also become easier. Google has been offering Google Docs, with web-based document editing, for a while. Recently, Microsoft has been offering 25GB of free space on Windows Live SkyDrive. From SkyDrive, I can edit documents using Office Web Apps. So accessing and editing documents from an OS that does not have Office installed is no longer a concern. From Office 2010, it is possible to save documents directly to SkyDrive. In addition, Windows Live Mesh can automatically sync the local file system with SkyDrive.

Lastly, there are many free online source code versioning systems. Microsoft CodePlex, Google Code and GitHub, just to name a few, all offer free source code version control for open source projects. Other vendors such as Bitbucket also offer source control for closed-source projects. Open source version control clients such as TortoiseHg and TortoiseSVN are easy to get. The free Visual Studio Express is also becoming more useful for real-world projects; the recent Windows 8 Developer Preview has Visual Studio Express 11 preinstalled.

So my conclusion is that the currently available services and software are sufficient for my needs, and I am ready to live in the cloud.

Theory and Practice of Database and Data Analysis (2) – Navigating the table relationships

Friday, September 23, 2011
SQL Server

In the first part of the series, I discussed how to use the SQL Server metadata to find database objects. In this part, I will discuss how to find all the data related to a transaction by navigating the table relationships. Let us suppose that the transaction is a purchase order stored in a table called Orders. Anyone who has seen the Northwind database knows that it has a child entity called OrderDetails. Even in a modest real-world database, the complexity can grow very fast. If we can ship a partial order and put the rest on back order, or we need to ship from multiple warehouses, we could have multiple Shipping records for each Order record and have each ShippingDetails record linked to an OrderDetails record. The customers can return the order, either in full or in part, with or without a return authorization number, and the actual return may or may not match the return authorization. The system needs to be flexible enough to capture all the events related to the order. As you can see, a very modest order transaction could easily grow to a dozen tables. If you are not familiar with the database, how can you find all the related tables? If you need to delete a record, how do you delete it cleanly? If you need to copy a transaction from a production system to the development system, how do you copy the entire set of data related to the transaction?

Fortunately, in any well-designed database, table relationships can be navigated through the foreign key relationships. In SQL Server, we can usually find the information using the sp_help stored procedure. Here we will use INFORMATION_SCHEMA to obtain more refined results to be used in our tools. The TABLE_CONSTRAINTS view contains the primary and foreign keys on a table. The REFERENTIAL_CONSTRAINTS view contains the name of the foreign key on the many side and the name of the unique constraint on the one side. It is possible to navigate the table relationships using these two views. The following query will find all the tables that reference the Sales.SalesOrderHeader table in the AdventureWorks database:

select tcf.TABLE_SCHEMA, tcf.TABLE_NAME
from INFORMATION_SCHEMA.TABLE_CONSTRAINTS tcp
inner join INFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS rc
    on tcp.CONSTRAINT_SCHEMA = rc.UNIQUE_CONSTRAINT_SCHEMA
    and tcp.CONSTRAINT_NAME = rc.UNIQUE_CONSTRAINT_NAME
    and tcp.CONSTRAINT_TYPE = 'PRIMARY KEY'
inner join INFORMATION_SCHEMA.TABLE_CONSTRAINTS tcf
    on rc.CONSTRAINT_SCHEMA = tcf.CONSTRAINT_SCHEMA
    and rc.CONSTRAINT_NAME = tcf.CONSTRAINT_NAME
where tcp.TABLE_SCHEMA = 'Sales'
    and tcp.TABLE_NAME = 'SalesOrderHeader'

Here are the results:

The following query will return all the tables referenced by the Sales.SalesOrderHeader table:

select tcp.TABLE_SCHEMA, tcp.TABLE_NAME
from INFORMATION_SCHEMA.TABLE_CONSTRAINTS tcp
inner join INFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS rc
    on tcp.CONSTRAINT_SCHEMA = rc.UNIQUE_CONSTRAINT_SCHEMA
    and tcp.CONSTRAINT_NAME = rc.UNIQUE_CONSTRAINT_NAME
    and tcp.CONSTRAINT_TYPE = 'PRIMARY KEY'
inner join INFORMATION_SCHEMA.TABLE_CONSTRAINTS tcf
    on rc.CONSTRAINT_SCHEMA = tcf.CONSTRAINT_SCHEMA
    and rc.CONSTRAINT_NAME = tcf.CONSTRAINT_NAME
where tcf.TABLE_SCHEMA = 'Sales'
    and tcf.TABLE_NAME = 'SalesOrderHeader'

And the results:

The Person.Address table appeared twice because we have two columns referencing the table.

The CONSTRAINT_COLUMN_USAGE view contains the columns in the primary and the foreign keys. By using these 3 views, we can construct a simple tool that drills down into the data. In the next part of this series, we will construct such a tool.
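
As a rough illustration of how the third view fits in, the query below lists the key columns of the Sales.SalesOrderHeader table by joining TABLE_CONSTRAINTS to CONSTRAINT_COLUMN_USAGE. It is a sketch that assumes the same AdventureWorks sample database as the queries above, and that SQL Server reports a foreign key's own (referencing) columns in this view.

```sql
--Columns that take part in the primary key and foreign keys of one table
select tc.CONSTRAINT_TYPE, tc.CONSTRAINT_NAME, ccu.COLUMN_NAME
from INFORMATION_SCHEMA.TABLE_CONSTRAINTS tc
inner join INFORMATION_SCHEMA.CONSTRAINT_COLUMN_USAGE ccu
    on ccu.CONSTRAINT_SCHEMA = tc.CONSTRAINT_SCHEMA
    and ccu.CONSTRAINT_NAME = tc.CONSTRAINT_NAME
where tc.TABLE_SCHEMA = 'Sales'
    and tc.TABLE_NAME = 'SalesOrderHeader'
    and tc.CONSTRAINT_TYPE in ('PRIMARY KEY', 'FOREIGN KEY')
order by tc.CONSTRAINT_TYPE, tc.CONSTRAINT_NAME
```

This view carries no column ordering, so it cannot by itself pair up the columns of a composite key.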

Added by Li Chen 9/26/2011

The KEY_COLUMN_USAGE view also contains the columns in the primary and the foreign keys. I prefer to use this view because it also contains the ORDINAL_POSITION that I can use to match the columns in the primary and the foreign keys when we have composite keys. The following example shows how to match the columns. The last line in the query eliminates the records from a table joining to itself.

declare @tablename varchar(256)
declare @schema varchar(256)
set @schema = 'Sales'
set @tablename = 'SalesOrderHeader'

select kcup.TABLE_SCHEMA, kcup.TABLE_NAME,
    kcuf.TABLE_SCHEMA, kcuf.TABLE_NAME,
    kcup.ORDINAL_POSITION, kcup.COLUMN_NAME, kcuf.COLUMN_NAME
from INFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS rc
inner join INFORMATION_SCHEMA.KEY_COLUMN_USAGE kcup
    on kcup.CONSTRAINT_SCHEMA = rc.UNIQUE_CONSTRAINT_SCHEMA
    and kcup.CONSTRAINT_NAME = rc.UNIQUE_CONSTRAINT_NAME
inner join INFORMATION_SCHEMA.KEY_COLUMN_USAGE kcuf
    on kcuf.CONSTRAINT_SCHEMA = rc.CONSTRAINT_SCHEMA
    and kcuf.CONSTRAINT_NAME = rc.CONSTRAINT_NAME
    and kcup.ORDINAL_POSITION = kcuf.ORDINAL_POSITION
where kcup.TABLE_SCHEMA = @schema
    and kcup.TABLE_NAME = @tablename
    and not (kcuf.TABLE_SCHEMA = @schema and kcuf.TABLE_NAME = @tablename)

Theory and Practice of Database and Data Analysis (1) – Searching for objects

Wednesday, September 21, 2011
SQL Server

Recently, I had to spend a significant portion of my time on production data support. Since we have a full-featured agency management system, I have to deal with parts of the application that I am not familiar with. Through accumulated experience and improved procedures, I have been able to locate problems with increasing speed. In this series, I will try to document my experiences and approaches. In the first part, I will discuss how to search for objects.

Suppose we are told that there are problems in the address data. The very first thing we need to do is locate the tables and fields that contain the data. Microsoft SQL Server has a set of sys* tables. It also supports the ANSI-style INFORMATION_SCHEMA. So that the knowledge discussed here can be reused with other database systems, I will try to use INFORMATION_SCHEMA as much as possible. To find any table that is related to “Address”, we can use the following query:

select * from Information_Schema.Tables where Table_Type = 'BASE TABLE' and Table_name like '%Address%'

Using AdventureWorks 2008 sample database, it yields the following results:

If we suspect there are columns in other tables relating to address, we can search for the columns with the following query:

select c.*
from Information_Schema.Columns c
inner join Information_Schema.Tables t
    on c.Table_Schema = t.Table_Schema
    and c.Table_Name = t.Table_Name
where t.Table_Type = 'BASE TABLE'
    and c.Column_name like '%Address%'

We will get the following results this time:

Suppose we need to find all the stored procedures and functions that reference a table. We can use the sp_depends stored procedure. If we need to search thoroughly using a string, we can use the following query:

select Routine_Schema, Routine_Name, Routine_Type from Information_Schema.Routines where Routine_Definition like '%Accounting%'

However, we cannot find triggers through Information_Schema. We have to use sysobjects:

select name, object_name(parent_obj) as table_name from sysobjects where type = 'TR'

Once we find the triggers, we can find the text using sp_helptext or query the syscomments table.
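
As a sketch of the second option, the queries below search trigger text for a string. The search term '%Address%' is only an example. The first query uses the sysobjects and syscomments tables mentioned above; the second uses the sys.triggers and sys.sql_modules catalog views available in SQL Server 2005 and later.

```sql
--Search trigger text through syscomments.
--Long definitions are split across several rows (ordered by colid),
--so a string that straddles two rows would be missed.
select o.name as trigger_name, object_name(o.parent_obj) as table_name, c.colid, c.text
from sysobjects o
inner join syscomments c
    on c.id = o.id
where o.type = 'TR'
    and c.text like '%Address%'
order by o.name, c.colid

--The same search through the newer catalog views,
--where each definition is a single nvarchar(max) value
select t.name as trigger_name, object_name(t.parent_id) as table_name, m.definition
from sys.triggers t
inner join sys.sql_modules m
    on m.object_id = t.object_id
where m.definition like '%Address%'
```

Both queries require only read access to the metadata, although the definition text comes back as NULL for triggers whose definition the current login is not permitted to view.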

In the next part of the series, I will discuss navigating the table relationships.