The Gospel of the GUID - And Why It Matters!
The
Gospel of the GUID
There
is only one commandment in the Gospel of the GUID:
I.
Thou shalt always use a GUID to uniquely identify a row in a table.
Whenever
I hire a new member for my team, this is one of the first rules that I teach
them. At first, they almost always look at me like a puppy with their ears up
and their head tilted to the side as if to say "huh?"
So
I show them that every single table has field that is a unique identifier, and
most of the time, that field is the primary key and usually the clustered index
at that. For example, if we have an Employee table, it has a field named
EmployeeGUID that is a unique identifier, and it is the primary clustered key.
It's at this point that I usually start get the "uh oh, my boss is an
idiot but I don't want him to know I think that look." Next the questions
and the "but I do x" comments start to come:
- I just use an
int.
- I don't do that
because they are so big.
- Don't you know
how hard it is to go look up something by that, an int is so much easier.
- Surely you don't
do that for many-to-many bridge table?
- What happens if
they run out of GUIDs?
- I don't think I
trust that they won't be duplicated.
- I have never
heard of this, therefore it is probably a bad idea.
- Don't you think
there is a performance penalty for doing that?
So then begins my quest to make them a convert. I have a
high conversion rate among "thinking" people, and therefore I
challenge you to read this and see if you too will become a believer!
Reason
1: I can make less trips to the database, now THAT is a performance
enhancement!
There are many reasons to use a GUID as the primary key for your table. My
primary reason is directly related to how I build my object models. I prefer to
create a "new" instance of my object without round tripping the
database. So if I am going to create a new Order object, I don't have to round
trip the database to get the OrderID (OrderGUID in my world) like I would if I
was using an int as the "ID" of the order. While not all that
impressive at this level, consider this. I can create a new Order object with
an OrderGUID, and ten OrderLineItem objects with their own OrderLineItemGUID without
EVER round tripping the database unlike the INT solution which needs to make 11
calls to the DB.
Reason 2: Data Merging is so easy, Mac using developers can do it!
My second reason for always using a GUID is data merging. I
can't count how many times that I have had to merge data. Back before I saw the
light, I too would use INTs or some other thing to make my row unique but when
I had to merge data, I always had to write a conversion.
Consider the following:
|
Customer
Database 1:
Order Table
|
Customer
Database 2:
Order Table
|
|
OrderID:
1
|
OrderID:
1
|
|
CustomerID:
1
|
CustomerID:
1
|
If customer 1 acquires customer 2 and I need to merge their data, I have to
convert someone's OrderID 1 and CustomerID 1 to some INT that is not used, and
update a lot of rows of data, and possibly monkey around with seed values.
Multiply that out by dozens of tables, millions of rows of data, and I have a REALLY
big task ahead of me, that will require a lot of testing after I write the SQL and/or
code.
However, if I had followed the Gospel of the GUID
|
Customer
Database 1: Order Table
|
Customer
Database 2: Order Table
|
|
OrderID:
{C290A815-BAAA-4455-82EA-F18F5D0CEF2E}
|
OrderID:
{79E9F560-FD70-4807-BEED-50A87AA911B1}
|
|
CustomerID:
{52335ABA-2D8B-4892-A8B7-86B817AAC607}
|
CustomerID:
{60FA3045-BAE8-4526-87A2-DF22ED5F093B}
|
Now all I have to do is insert all the rows from one
customer's db into the other's db. No conversion, no testing of the conversion
necessary, just nice clean and efficient. I recently needed to merge two of my client's
data in just such a manner: AT&T and Cingular. My "conversion" took all of 45 minutes.
Another similar example: imagine if you had a need for
people to work disconnected and you need to merge in their changes once they
get back online. Now it is child's play... if you are a believer in the Gospel of
the GUID. You can easily push changes made in either DB to the other.
Reason
3: Type/Table Ignorance
The third reason I believe in the Gospel of the GUID is
what I call "Type Ignorance" which is a play on "Type Inference" in .NET 3.5 as
it is kind of the opposite meaning: I don't care what the type is at the row
level in my table. Let's assume that I have an application that has been in
production for a while. It has the following tables (among others) Customer,
Order, OrderLineItem, Product, and finally Vendor. Now I determine that I want
to be able to add "notes" to all of these items. I can simply create a new
table and object that looks like:
|
Notes
|
|
|
NoteGUID
|
Unique
Identifier
|
|
ParentGUID
|
Unique
Identifier
|
|
Note
|
VarChar(500)
|
I can insert all my notes into a single table, using
the GUID of the parent object in the ParentGUID field. If I want to get to any
object's notes all I need is a simple join. For example, to get to a Vendor's
notes, all I need is a simple join of Note.ParentGUID to Vendor.VendorGUID. I
don't need a "type" indicator so I know which table to join, nor do I need a
myriad of join tables to know which note goes with which type.
You would be surprised how often this little technique
comes in handy. We recently added "auditing" to our application where we needed
to know exactly who added, updated, and deleted what. We simply added a handful
of lines of code to the SubmitChanges() method of the datacontext (We use LINQ
exclusively in our new app) and write out the appropriate record to our
audit table. Any time we add a new object type/table to our app, it will
automatically get the auditing ability without us having to change a single
line of code, or make any modifications to our "auditing table."
Reason
4: The hits just keep on coming!
There are many less than obvious reasons for using
GUIDs to identify things, but here is one that I did not plan for and was incredibly
grateful for and ultimately it saved my client millions of dollars... yes I did
say MILLIONS!
I was developing an AP Feed ability for a large client.
They were going to be able to pay millions of dollars of their bills at the
push of a button. Basically, push a button in our system, we would generate a
feed file, ship it to their AP system, and it would generate the checks... and
out the money would go. Of course, I used GUIDs to identify every single thing.
So when the AP system generated a reconciliation file, I could easily match it
up to my end.
At our site, this client had a production database and a
test system which was a copy of their production system, just somewhat stale. A
couple of months stale to be exact. During our testing, someone on their end
saw one of our test feed files and mistakenly assumed that it was a production
feed which should be "paid" so they dropped it over into their production AP
system. You guessed it... they just paid a lot of legit vendors twice (once from
the production system, and once from the test system) and some not so legit
vendors got checks they shouldn't have because they were no longer the landlord
(remember, the test db was a couple of months stale.) Through no fault of my own, I now had a horrible
mess of data... or at least my client thought so. However, since all of my
payment records had GUIDs for identifiers, I could easily join the two db's against
AP System which had my GUID for each payment, to determine which payments were
legit and therefore should be paid, and which should get a good ol "stop
payment." Imagine if I had used INTs. I
would have no way of knowing if PaymentID 1000 was from the good data or the
test data.
So how did this save my client millions? Easy...
thousands of payments multiplied by "stop payment" fees adds up in a hurry. Especially
when they make this horrible mistake 3 times! J
So
are there any downsides to the Gospel of the GUID?
In short, yes there are, however, they are so miniscule
that they don't even come close to changing my mind. The most "in your face"
downside is when you have to hand write SQL to find something.
SELECT * FROM ORDER WHERE ORDERID = 12
is a lot easier than typing
SELECT * FROM ORDER WHERE ORDERGUID = ‘{45F57B42-38A4-46ce-A180-6DE0E7051178}'
A second downside is that there is a small performance
hit for joins that are based on GUIDs if measured against the same data using
INTs instead. In my experience, even when using multimillion row tables, this
has NEVER been an issue. A few milliseconds are such a minor price to pay for
all of the goodness.
I encourage you to try out this technique in your next
small project, especially if you are still a skeptic. I think you will find it
more useful than you ever dreamed.