Archives / 2011 / September
  • Theory and Practice of Database and Data Analysis (3) – Exhaustive Search by Traversing Table Relationships

    In this post, I will discuss about building a script to find all the relevant data starting from a table name and primary key value. In the previous post, I have discussed how to a list of tables that references a table. From a parent table, we just need to get a list of child tables. We then loop through each child table to select records using the parent primary key value. If the child tables themselves have child tables, we repeat the same approach by recursively getting the grand children.

    I often need to remote-desktop into clients system where I only have read privilege; I can neither create stored procedures nor execute them. So I am going to present a script that requires only read privilege.

    In transact-SQL (T-SQL) batch, looping through records can be implemented using cursor or temp table. I used the later approach as temp table can grow so that I can also use it for recursion as well.

    Recursion is a harder problem here. Recursion can be done in most of programming languages by calling procedures recursively. However, that is not possible here since I am restricted to a batch. In computer science, recursion can be implemented with a loop with a stack. In many programming languages, when a caller calls a callee, the program would save the current local data as well as the returning location of the caller into the stack. The area in the stack used by each call is called a stack frame. Once the callee returns, the program will restore the local data from the stack and resume from the previous location.

    In my script, I use temp table #ref_contraints to accomplish both looping and recursion. referential constraints from the top table are added to the table as they are discovered and removed from the table when they are consumed (that is, no longer needed). I used the depth-first traversal in my script. With this structure, I can change to breath-first traversal with minimum efforts.

    The #keyvalues temp table contains the primary key values for the table that I have already traversed. I can get the records from the child tables by joining to the key values in this table so that I only need to query each child table once for each foreign key relationship.

    In order to make the script simpler, I eliminate the schema and assume all tables are under the schema “dbo”. This works with our database, and works with databases in Microsoft Dynamics CRM 2011.

    I have also assumed that all primary keys contain only one integer column. This is true with out database. This is also true with Microsoft Dynamics CRM 2011 except you need to change integer id to guid.

    I added comments the script so that one would know where the code would correspond to looping and procedure calling if the code is written in a language like VB.NET or C#. The comments also indicate the insertion points if additional code is needed.

    In future posts of this series, I will discuss how to capture human intelligence to make search even more powerful.

  • The case for living in the cloud

    These days one cannot have a day without hearing the cloud; vendors are pushing it. Google’s Chrome Book is already nothing but the cloud. Apple’s iOS and Microsoft’s upcoming Windows 8 also have increase cloud features. Until now, I have been skeptical about the cloud. Apart from the security, my primary concern is what if I lose the connection to the internet.

    Recently, I have been migrating to the cloud. The main reason is that I triple boot my laptop now. I once run Windows 7 as my primary OS and anything else as virtual machines. However, the performance of these virtual machines has been less than ideal. The Windows 7 Virtual PC would not run 64bit OS, so I have to rely on Virtual Box to run Windows 2008 R2. Recently, I have been booting Windows 8 Developers Preview and Sharepoint/Windows 2008 from VHD. This is the my currently most satisfying configuration. Although the VHD is slightly slower than the real hard drive, I have the full access to the rest of the hardware. The issue I am facing now is that I need to access my data no matter which OS I boot with. That motivate me to migrate to the cloud.

    My primary data are my email, document and code.

    Email is least of my concern. Both Google and Hotmail have been offering ample space for me to store my email. I do download a copy email into Windows 7 so I can access them when I do not have access to the internet.

    Documents also have been easier. Google has been offering Google Docs as well as web based document editing for a while. Recently, Microsoft has been offering 25GB of free space on Windows Live Skydrive. From Skydrive, I can edit documents using Office Web Apps. So accessing and editing document from a boot OS that does not have Office installed is no longer a concern. From Office 2010, it is possible to save documents directly to Skydrive. In addition, Windows Live Mesh can automatically sync the local file system with Skydrive.

    Lastly, there are many free online source code versioning systems. Microsoft Codeplex, Google Code and Git, just to name a few, all offer free source code versions systems for open source projects. Other vendors such BitBucket also offer source control for close source projects. Open source version control software such as TortoiseHg or TortoiseSVN are easy to get. The free Visual Studio Express is also becoming more useful for real world projects; the recent Windows 8 Developer Preview has Visual Studio Express 11 preinstalled.

    So my conclusion is that the currently available services and software are sufficient to my needs and I am ready to live in the cloud.

  • Theory and Practice of Database and Data Analysis (2) – Navigating the table relationships

    In the first part of the series, I discussed how to use the Sql Server meta data to find database objects. In this part, I will discuss how to find all the data related to a transaction by navigating the table relationships. Let us suppose that the transaction is a purchase order stored in a table called Orders. Any one that has seen the Northwind database knows that it has a child entity called OrderDetails. In even a modest real world database, the complexity can grow very fast. If we can ship a partial order and put the rest in back order, or we need to ship from multiple warehouses, we could have multiple Shipping records for each Order record and have each ShippingDetails record linked to an OrderDetails record. The customers can return the order, either in full or in part, with and without return authorization number, and the actually return may or may not match the return authorization. The system needs to be flexible enough to capture all the events related to the order. As you can see, a very modest order transaction could easily grow to a dozen of tables. If you are not familiar with the database, how can you find all the related table? If you need to delete a record, how do you delete it clean? If you need to copy a transaction from a production system to the development system, how do you copy the entire set of the data related to the transaction?

    Fortunately, in any well-designed database, table relationships can be navigated through the foreign key relationships. In Sql server, we can usually find the information using the sp_help stored procedure. Here we will use INFORMATION_SCHEMA to obtain more refined results to be used in our tools. The TABLE_CONSTRAINTS view contains the primary and foreign keys on a table. The REFERENTIAL_CONSTRAINTS view contains the name of the foreign key on the many side and the name of the unique constraint on the one side. It is possible to navigate the table relationships using these two views. The following query will find all the tables that references the Sales.SalesOrderHeader table in the AdventureWorks database:

    Here are the results:

    The following query will return all the tables referenced by the Sales.SalesOrderHeader table:

    And the results:


    The Person.Address table appeared twice because we have two columns referencing the table.

    The CONSTRAINT_COLUMN_USAGE view contains the columns in the primary and the foreign keys. By using these 3 views, we can construct a simple tool that drills down into the data. In the next part of this series, we will construct such a tool.

    Added by Li Chen 9/26/2011

    The KEY_COLUMN_USAGE view also contains the columns in the primary and the foreign keys. I prefer to use this view because it also contains the ORDINAL_POSITION that I can use to match the columns in the primary and the foreign keys when we have composite keys. The following example shows how to match the columns. The last line in the query eliminates the records from a table joining to itself.

  • Theory and Practice of Database and Data Analysis (1) – Searching for objects

    Recently, I had to spend a significant portion of my time on production data support. Since we have a full-featured agency management system, I have to deal with parts of the application that I am not familiar with. Through the accumulated experiences and improved procedure, I was able to locate the problems with the increasing speed. In this series, I will try to document my experiences and approaches. In the first part, I will discuss how to search for objects.

    Supposing we were told that there were problems in address data, the very first thing we need to do is to locate the table and the field that contains the data. Microsoft SQL Server has a set of sys* tables. It also supports ANSI style INFORMATION_SCHEMA. In order to reuse the knowledge discussed here to other database systems, I will try to use the INFORMATION_SCHEMA as much as possible. So to find any table that is related to “Address”, we can use the following query:

    Using AdventureWorks 2008 sample database, it yields the following results:


    If we suspect there are columns in other tables relating to address, we can search for the columns with the following query:

    We will get the following results this time:


    Supposing we need to find all the stored procedures and functions that reference a table, we can use sp_depends stored procedure. If we need to search thoroughly using a string, we can use the following query:

    However, we cannot find triggers through Information_Schema. We have to use sysobjects:

    Once we find the triggers, we can find the text using sp_helptext or query the syscomments table.

    In the next part of the series, I will discuss navigating the table relationships.