Gospel of the GUID - Answers to your Burning Questions, Comments and Insults.
First and foremost, I was blown away with the number of nice responses and questions to my Gospel of the GUID post. For the most part, I got a lot of “wow, I never thought of that” and “cool, we do that too, glad to see someone else doing it” type of responses. I enjoyed the slew of “I am a convert now” type of responses also. It was nice to see people play along with the motif of the post. I had expected a few comments, but not the flood I received. Instead of just “approving” the monster load of comments, I wanted to condense it all down into a single readable response post. So many of the questions were flavors of the same question that I would hate for people to read 100+ comments where most of it was just duplication.
Before I go any further, I should state, and should have stated in my Gospel of the GUID post, that every rule in my shop can have exceptions. My GUID rule is no exception to this. (Sorry… couldn’t resist the pun) However, to break my rules, you must have valid reason for doing so. Many people have pointed out the more common exceptions; data warehousing, apps that need high speed multi-million inserts, etc. These would all be allowable exceptions to me, however, these are not the majority of my applications and they are certainly not the norm of most business applications. After all, someone has to actually build the apps that generate the data for the warehouseJ!
Another interesting thing that I did not expect, were the insults. I certainly didn’t expect the anti-Microsoft slurs but I guess if you are unemployed, sitting at home in your whitey-tighties, taking a break from leveling your level 70 Night Elf, what else are you going to do? Luckily after many years of playing MMORPGs, I am quite used to bravado that seems to be induced on the weak minded and immature by the anonymity of the net. I just didn’t expect so much of it from the readers of ASP.NET … but what do I know? J Truth be known, it was the most entertaining part of it all for me. For those of you that took the time to go out of your way to insult me (and Bill Gates), you will sleep better knowing that I, my team, and my friends had a good time laughing at you… not with you… but at you.
For the most part, I received a lot of emails and many comments asking “what about x?” I will answer the most common questions and comments below. I am so sorry it has taken so long to get these answers back to you.
1) Can you explain number one a little better? I still don’t quite get it. Why don’t you just return the unique identifier from a stored proc?
The beauty of the GUID approach is that I can generate the identifier that I need in my object model WITHOUT going to the database. This allows me to create an “Order” object, and all of its “Line Item” objects without making any calls to the database until I am ready to persist them, thereby saving trips to the database. I don’t care how small you can make your index, no trips to the database is faster than 1. J
Note: If I need to generate an identifier at the database level, I can do that also. If I need to “return” it from a stored proc, I would need to use an output parameter.
2) Don’t you know that using GUID’s as clustered indexes is bad?
At the risk of sounding flippant, yes, I know in certain cases it is bad.
Yes, I know that if you have a GUID as a clustered index AND you are doing high volume inserts, this really slows down your inserts. For 99% of the applications I write, I truly don’t care about how long the inserts take as long as they perform adequately.
Yes, I know that over the long haul, the more records that get inserted the more fragmentation I could possibly have. For the most part I deal with this by using the appropriate Fill Factor or through database maintenance. If it is really an issue after that, I don’t use a clustered index on the GUID here either. MOST of my tables aren’t million+ rows, and most don’t need to be able to handle high volume inserts, therefore MOST of my tables do have the GUID as the clustered index.
3) Why not use the new NewSequentialId()?
There are several reasons really. The only one that really matters is that I can’t do it in my object model, and therefore it is of no use to me.
4) What happens if we run out of GUIDs? What if I don’t trust that they can’t be duplicated?
Simple, your “unique index” will guarantee you don’t get that duplicated bugger into your database. Unless, of course, you don’t trust SQL to make sure it doesn’t allow duplicates J
Now you have a huge problem though, you need to rush out and buy a lottery ticket but be careful, because you are going to get hit by lightning on the way because you have defied the odds on a scale that is nearly unimaginable. After all, the odds of duplicating a GUID is 1 in 2122 (or 5.3×1036) Good luck with that. By the way, anyone know how long that would be if I could line up 5.3×1036 dollar bills one after the other?
5) What about referential integrity regarding #3?
I knowing sacrifice referential integrity for the flexibility I get with my object model. I rarely physically delete data anyway. Most of the time, we do logical deletes so referential integrity doesn’t even come into play.
6) What if you need to know which parent a row belongs to in #3?
If you absolutely must know at the row level of the child record, then you will either need to add a “type” identifier column, or you will need to do a left join against your appropriate parent tables to see who it belongs to. It is extremely rare for me to need to go “up the chain” in my object graph from an object that is stored this way. I almost always only go “down the chain.” For example, I would start with my customer and get the notes for that customer. Not start with some note and try to figure out, “Gee what the heck is this note for?”
7) What about Customers who don’t want to see a GUID for an Order Number?
GUIDs are NEVER for human consumption. (I don’t consider devs and dba’s human…they are superhumans J ) People like to see things like “Order ID 1000” which is totally fine. Obviously you should make sure those are unique also, but that has nothing to do with using GUIDs.
Besides, this breaks another one of my rules, which is never make something that normal people feel is unique, your identifier. Some day humans will change this little piece of data, and when they do, your app won’t work. I can’t tell you how many times I have seen an order “number” go from numeric to alpha numeric, or an identifier that will “always” be 5 characters, be changed to 6 characters years later. Here me now, and believe me later, don’t let your PKs be anything that means something to your customer. Your PK should only be something that means something to your database… and that something should be what row am I getting… even if it is an INT J
Additional Thoughts:
1) In my world, which admittedly is different from many of yours, what I care about the most is:
a. How my app behaves from the user’s perspective. I truly don’t care if a DBA thinks that speedy inserts of millions of records is important. DBA’s don’t dictate design in my apps. DBA’s don’t “own” the database in my world. They are an important part of the design team of our app, but they don’t get to make unilateral design decisions based on whether we can do millions of inserts quickly, if that’s not what my users do. If my users are ACTUALLY doing millions of inserts, I care very much and certainly wouldn’t do anything that caused that task to perform poorly, including making a clustered index on a GUID.
b. Code maintainability. This is a topic for a future post of mine, but the short version is that I will take code that performs acceptably and is easier to maintain over hyper optimized code that is harder to maintain. The Gospel of the GUID is based on this philosophy, and makes my application framework very easy to maintain. When you know that EVERY row is identifiable by a GUID, your framework can easily deal with any row of data, without having to “figure it out.” For example, in our UI framework, we have an abstract base class for all of our pages that deal with singletons to inherit. In the base page, we have helper functions that do things like get the singleton etc. While not a huge amount of code in and of itself, across several hundred pages, it’s quite a nice savings.
2) COMBS - http://www.informit.com/articles/article.aspx?p=25862&seqNum=1 Now here is something that I must take a closer look at. Trading some of the uniqueness for better performing inserts merits some investigation on my part.
Off the top of my head, it would be pretty easy to modify my framework to use a COMB instead of GUID and of course it would be trivial to create a UDF in SQL for the same thing if I needed to generate some IDs at the DB level. I will have to do some testing and see if this is worth doing. Thanks for the heads up on this dani.
The
Gospel of the GUID
There
is only one commandment in the Gospel of the GUID:
I.
Thou shalt always use a GUID to uniquely identify a row in a table.
Whenever
I hire a new member for my team, this is one of the first rules that I teach
them. At first, they almost always look at me like a puppy with their ears up
and their head tilted to the side as if to say "huh?"
So
I show them that every single table has field that is a unique identifier, and
most of the time, that field is the primary key and usually the clustered index
at that. For example, if we have an Employee table, it has a field named
EmployeeGUID that is a unique identifier, and it is the primary clustered key.
It's at this point that I usually start get the "uh oh, my boss is an
idiot but I don't want him to know I think that look." Next the questions
and the "but I do x" comments start to come:
- I just use an
int.
- I don't do that
because they are so big.
- Don't you know
how hard it is to go look up something by that, an int is so much easier.
- Surely you don't
do that for many-to-many bridge table?
- What happens if
they run out of GUIDs?
- I don't think I
trust that they won't be duplicated.
- I have never
heard of this, therefore it is probably a bad idea.
- Don't you think
there is a performance penalty for doing that?
So then begins my quest to make them a convert. I have a
high conversion rate among "thinking" people, and therefore I
challenge you to read this and see if you too will become a believer!
Reason
1: I can make less trips to the database, now THAT is a performance
enhancement!
There are many reasons to use a GUID as the primary key for your table. My
primary reason is directly related to how I build my object models. I prefer to
create a "new" instance of my object without round tripping the
database. So if I am going to create a new Order object, I don't have to round
trip the database to get the OrderID (OrderGUID in my world) like I would if I
was using an int as the "ID" of the order. While not all that
impressive at this level, consider this. I can create a new Order object with
an OrderGUID, and ten OrderLineItem objects with their own OrderLineItemGUID without
EVER round tripping the database unlike the INT solution which needs to make 11
calls to the DB.
Reason 2: Data Merging is so easy, Mac using developers can do it!
My second reason for always using a GUID is data merging. I
can't count how many times that I have had to merge data. Back before I saw the
light, I too would use INTs or some other thing to make my row unique but when
I had to merge data, I always had to write a conversion.
Consider the following:
|
Customer
Database 1:
Order Table
|
Customer
Database 2:
Order Table
|
|
OrderID:
1
|
OrderID:
1
|
|
CustomerID:
1
|
CustomerID:
1
|
If customer 1 acquires customer 2 and I need to merge their data, I have to
convert someone's OrderID 1 and CustomerID 1 to some INT that is not used, and
update a lot of rows of data, and possibly monkey around with seed values.
Multiply that out by dozens of tables, millions of rows of data, and I have a REALLY
big task ahead of me, that will require a lot of testing after I write the SQL and/or
code.
However, if I had followed the Gospel of the GUID
|
Customer
Database 1: Order Table
|
Customer
Database 2: Order Table
|
|
OrderID:
{C290A815-BAAA-4455-82EA-F18F5D0CEF2E}
|
OrderID:
{79E9F560-FD70-4807-BEED-50A87AA911B1}
|
|
CustomerID:
{52335ABA-2D8B-4892-A8B7-86B817AAC607}
|
CustomerID:
{60FA3045-BAE8-4526-87A2-DF22ED5F093B}
|
Now all I have to do is insert all the rows from one
customer's db into the other's db. No conversion, no testing of the conversion
necessary, just nice clean and efficient. I recently needed to merge two of my client's
data in just such a manner: AT&T and Cingular. My "conversion" took all of 45 minutes.
Another similar example: imagine if you had a need for
people to work disconnected and you need to merge in their changes once they
get back online. Now it is child's play... if you are a believer in the Gospel of
the GUID. You can easily push changes made in either DB to the other.
Reason
3: Type/Table Ignorance
The third reason I believe in the Gospel of the GUID is
what I call "Type Ignorance" which is a play on "Type Inference" in .NET 3.5 as
it is kind of the opposite meaning: I don't care what the type is at the row
level in my table. Let's assume that I have an application that has been in
production for a while. It has the following tables (among others) Customer,
Order, OrderLineItem, Product, and finally Vendor. Now I determine that I want
to be able to add "notes" to all of these items. I can simply create a new
table and object that looks like:
|
Notes
|
|
|
NoteGUID
|
Unique
Identifier
|
|
ParentGUID
|
Unique
Identifier
|
|
Note
|
VarChar(500)
|
I can insert all my notes into a single table, using
the GUID of the parent object in the ParentGUID field. If I want to get to any
object's notes all I need is a simple join. For example, to get to a Vendor's
notes, all I need is a simple join of Note.ParentGUID to Vendor.VendorGUID. I
don't need a "type" indicator so I know which table to join, nor do I need a
myriad of join tables to know which note goes with which type.
You would be surprised how often this little technique
comes in handy. We recently added "auditing" to our application where we needed
to know exactly who added, updated, and deleted what. We simply added a handful
of lines of code to the SubmitChanges() method of the datacontext (We use LINQ
exclusively in our new app) and write out the appropriate record to our
audit table. Any time we add a new object type/table to our app, it will
automatically get the auditing ability without us having to change a single
line of code, or make any modifications to our "auditing table."
Reason
4: The hits just keep on coming!
There are many less than obvious reasons for using
GUIDs to identify things, but here is one that I did not plan for and was incredibly
grateful for and ultimately it saved my client millions of dollars... yes I did
say MILLIONS!
I was developing an AP Feed ability for a large client.
They were going to be able to pay millions of dollars of their bills at the
push of a button. Basically, push a button in our system, we would generate a
feed file, ship it to their AP system, and it would generate the checks... and
out the money would go. Of course, I used GUIDs to identify every single thing.
So when the AP system generated a reconciliation file, I could easily match it
up to my end.
At our site, this client had a production database and a
test system which was a copy of their production system, just somewhat stale. A
couple of months stale to be exact. During our testing, someone on their end
saw one of our test feed files and mistakenly assumed that it was a production
feed which should be "paid" so they dropped it over into their production AP
system. You guessed it... they just paid a lot of legit vendors twice (once from
the production system, and once from the test system) and some not so legit
vendors got checks they shouldn't have because they were no longer the landlord
(remember, the test db was a couple of months stale.) Through no fault of my own, I now had a horrible
mess of data... or at least my client thought so. However, since all of my
payment records had GUIDs for identifiers, I could easily join the two db's against
AP System which had my GUID for each payment, to determine which payments were
legit and therefore should be paid, and which should get a good ol "stop
payment." Imagine if I had used INTs. I
would have no way of knowing if PaymentID 1000 was from the good data or the
test data.
So how did this save my client millions? Easy...
thousands of payments multiplied by "stop payment" fees adds up in a hurry. Especially
when they make this horrible mistake 3 times! J
So
are there any downsides to the Gospel of the GUID?
In short, yes there are, however, they are so miniscule
that they don't even come close to changing my mind. The most "in your face"
downside is when you have to hand write SQL to find something.
SELECT * FROM ORDER WHERE ORDERID = 12
is a lot easier than typing
SELECT * FROM ORDER WHERE ORDERGUID = ‘{45F57B42-38A4-46ce-A180-6DE0E7051178}'
A second downside is that there is a small performance
hit for joins that are based on GUIDs if measured against the same data using
INTs instead. In my experience, even when using multimillion row tables, this
has NEVER been an issue. A few milliseconds are such a minor price to pay for
all of the goodness.
I encourage you to try out this technique in your next
small project, especially if you are still a skeptic. I think you will find it
more useful than you ever dreamed.
If you were like me and downloaded the CTP today, all anxious to give it a whirl, you probably found out like I did, that the credentials that the web page state to use DON'T WORK!
The website says: VPC Credentials: Username: Administrator Password: P2ssw0rd (NOTE: Microsoft strongly recommends that immediately after installation you log in and change the administrator password)
What a bunch of crap!
Anyway... yours truly screwed around with it, decided man..that 2 sure looks weird... bet they meant @ and sure enough I was right.
USE THIS:
Username: Administrator
Password: P@ssw0rd
To make matters worse... you have to "activate" this verson of windows also.... despite their warning to "not turn on the network". At any rate...there ya go.
To Download the CTP
http://www.microsoft.com/downloads/details.aspx?familyid=1ff0b35d-0c4a-40b4-915a-5331e11c39e6&displaylang=en
ps... Blogmailr ain't working for me. Just ANOTHER thing to go screw with.