October 2005 - Posts

"Ha," an email from a colleague started, "I think you can finally admit that MySQL is ready to compete with the big boys!" I rolled my eyes and let out a skeptical "uh huh." His email continued, "Check out Version 5. They now have views, stored procedures, and triggers."

My colleague has been a MySQL fan since day one. He loves the fact that it's free and open source and could never quite understand why anyone would spend tens of thousands of dollars on something else. But then again, he has never really had an interest in understanding; data management just isn't his "thing." Thankfully, he readily admits this and stays far, far away from anything to do with databases, leaving all of that "stuff" to the experts. No less, he'll still cheers whenever there's a MySQL "victory." it is, after all, free and open source.

Data professionals have traditionally relegated MySQL as a "toy" relational database management system (RDBMS). Don't get me wrong, it's perfectly suitable for blogs, message boards, and similar applications. But despite what its proponents claim, it has always been a non-choice for data management in an information-system. This is not a criticism of the "free open source" aspect of the product, but of its creators.

The MySQL developers claim to have built a reliable RDBMS yet seem to lack a thorough understanding of RDBMS fundamentals, namely data integrity. Furthermore, they will often surrogate their ignorance with arrogance. Consider, for example, their documentation on invalid data [emphasis added]:

MySQL allows you to store certain incorrect date values into DATE and DATETIME columns (such as '2000-02-31' or '2000-02-00'). The idea is that it's not the job of the SQL server [sic] to validate dates.

Wait a minute. It's not the job of the RDBMS to ensure data are valid?!? One of the greatest revelations in information systems is that applications are not good at managing their data: they change too frequently are too-bug prone. It just doesn't work. That's the whole point of a DBMS; it ensures that data are typed and valid according to business rules (i.e. an employee can't have -65 dependents).

But I digress. This is the 5.0 release. They've added views. They've added stored procedures. They've added triggers. Maybe things have changed.

I thought I'd check out MySQL 5.0 first hand, so I visited their website and downloaded the product. I have to say, the installation process was painless. It even defaulted to and recommended "strict mode," which apparently disallows the invalid dates as seen above. This is certainly progress!

After it installed, I fired up the MySQL prompt and started hackin' around.

 mysql> CREATE DATABASE ALEXP; Query OK, 1 row affected (0.00 sec) 
mysql> USE ALEXP; Database changed
mysql> CREATE TABLE HELLO (
-> WORLD VARCHAR(15) NOT NULL PRIMARY KEY,
-> CONSTRAINT CK_HELLO CHECK (WORLD = 'Hello World')
-> );
Query OK, 0 rows affected (0.14 sec)

Wow! I'm impressed! MySQL 5.0 has check constraints! Maybe I was wrong about these guys ...

 mysql> INSERT INTO HELLO(WORLD) VALUES('Hi World'); 
Query OK, 1 row affected (0.05 sec)

Err … umm … wait a minute. You did just see me put that check constraint on the HELLO table, right? It's not a very complicated check, maybe, I did it wrong?

 mysql> SELECT * FROM INFORMATION_SCHEMA.TABLE_CONSTRAINTS
-> WHERE TABLE_NAME='HELLO';
+--------------------+-------------------+-----------------+--------------+------------+-----------------+
| CONSTRAINT_CATALOG | CONSTRAINT_SCHEMA | CONSTRAINT_NAME | TABLE_SCHEMA | TABLE_NAME | CONSTRAINT_TYPE |
+--------------------+-------------------+-----------------+--------------+------------+-----------------+
| NULL | alexp | PRIMARY | alexp | hello | PRIMARY KEY |
+--------------------+-------------------+-----------------+--------------+------------+-----------------+
1 row in set (0.01 sec)

Well how about that. It didn't actually add the constraint. Of course, had I done my research before hand, I would have known that MySQL 5.0 does not support check constraints. Apparently, in the MySQL world, one can buy a negative number of items and can be hired long before being born.

Alas, maybe I'm being too harsh; I suppose we could implement data validation logic in triggers. After all, this is often required in other RDBMS when CHECK constraints do not allow cross-table tests (e.g., making sure items cannot be added to a shipped order). So, why not just do it here?

The trigger cannot use statements that explicitly or implicitly begin or end a transaction such as START TRANSACTION, COMMIT, or ROLLBACK.

Oh that's just lovely. Leave it to MySQL to drop the most important use of triggers (complex data validation) and encourage their most obnoxious use (difficult to maintain business logic).

As far as other features added in MySQL, I think they are definitely a step in the right direction. Stored Procedures are a key component in creating a clean interface with strong-cohesion to the data layer (see, Strong vs Weak cohesion). Views (virtual tables) are absolutely essential for creating an effective and maintainable data model.

However, the new features do very little to budge MySQL's position as a "toy" RDBMS. As I mentioned before, this does not preclude MySQL from being an excellent tool for niche applications. But without mechanisms to ensure data are valid, MySQL will remain a non-choice for information systems.

No one has asked me that question just yet, but with the release of SQL Server 2005 just around the corner, I'm sure a handful of people will. Unlike regular User Defined Types, CLR UDTs are a new feature of SQL Server 2005 that allows one to create a .NET class and use it as a column datatype. As long as a few requirements are followed, one can create any class with any number of properties and methods and use that class as a CLR UDT.

Generally, when a new feature is introduced with a product, it can be a bit of a challenge to know when and how to use that feature. Fortunately, with SQL Server's CLR UDTs, knowing when to use them is pretty clear:

Never.

Let me repeat that. Never. You should never use SQL Server CLR User Defined Types. I'm pretty sure that this answer will just lead to more questions, so allow me to answer a few follow-up questions I'd anticipate.

Why Not?
CLR UDTs violate a fundamental principle of relational databases: a relationship's underlying domains must contain only atomic values. In other words, the columns on a table can contain only scalar values. No arrays. No sub-tables. And, most certainly, no classes or structures. Remember all the different levels of normalization? This is the first normal form, you know, the "duh" one.

This is a big thing. One can't just go and fudge a tried-and-true, mathematically-validated, theoretically-sound concept and "add and change stuff to it 'cause it'll be cool." Think of how much your car would love driving on a road made of stained glass blocks three years after it was built by an engineer who thought it'd look better.

Deviating so grossly from the relational model will bring as much joy as a dilapidated glass road. Take Oracle's foray into relational abuse: nested tables. I don't believe that there has ever been a single, successful implementation of that abomination. Sure, it may work out of the box, but after a year or two of use and maintenance, it decays into a tangled mess of redundancy and "synch" procedures -- both completely unnecessary with a normalized relational model.

And if that doesn't convince you, just think of having to change that CLR UDT. How easy do you think it would be to add a property to the class representing a few million rows of binary-serialized objects? And, trust me, it won't be nearly as easy as you think.

But wouldn't I want to share my .NET code so I don't have to duplicate logic?
This is always a novel goal, but an impossible one. A good system (remember, good means maintainable by other people) has no choice but to duplicate, triplicate, or even-more-licate business logic. Validation is the best example of this. If "Account Number" is a seven-digit required field, it should be declared as CHAR(7) NOT NULL in the database and have some client-side code to validate it was entered as seven digits. If the system allows data entry in other places, by other means, that means more duplication of the "Account Number" logic.

By trying to share business logic between all of the tiers of the application, you end up with a tangled mess of a system. I have illustrated this in the diagram below.

As you can see, the diagram on the right is a nicely structure three-tier architecture. The system on the right is the result of someone trying to share business logic between tiers, making a horribly tangled mess. One can expect to end up with the latter system by using CLR UDTs.

Never?!? How can there never, ever be an application of CLR UDTs?
Though I may not live by the cliché "never say never," I do follow the "never say 'never, ever'" rule. The only possible time where one might possibly want to use this feature is for developing non-data applications. But therein lies the crux: why would one develop a non-data application using SQL Server? There are certainly better tools out there for what the non-data application needs to accomplish. If you can come up with an appropriate use of a CLR UDT in an information system, I'll buy you a t-shirt or a mug.

But what about the samples provided? That's a use, right there!
Allow me to address these ...

Supplementary-Aware String Manipulation / UTF8 String User-Defined Data Type
Both of these samples have to do with UTF-8 character encoding. Without getting into the details, UTF-8 encodes characters as one, two, three, or four 8-bit bytes, meaning you can not do anything with characters in the string (length, substring, etc) unless you read it byte-by-byte. This works great for preserving "funny characters" while transmitting data but is a poor choice for storage. UCS-2 uses a fixed-size character format of 16-bits per character and is what should be used for storing character data.

Calendar-Aware Date/Time UDTs
Let's think about this. A point in time is a point in time; how it's described varies by culture ("Monday", "Lunes"), time zone (+6:00 GMT, -3:00GMT), calendar (Gregorian, Aztek), and format (2005-08, Aug '05). Describing a point in time properly is essential when interfacing with people or other systems. The keyword in that last sentence was "interface;" such description is best done in the "interface" tier of a system, not in the data tier. Doing this makes as much sense as putting currency conversion and language translation in the database.

Multi-dimensional Points and Latitude/Longitude
A geospatial location is described with Latitude and Longitude. Not Lati-longi-tude. These are two separate attributes and putting them in the same column violates First Normal Form. The same goes for points and other "array-like" structures.

Imaginary Numbers
Seriously? Correct me if I'm wrong, but the only actual use for imaginary numbers is in solving of differential equations. If you're not sure why this invalidates the example, say these two phrases aloud: "solving differential equations" and "relational database." Didn't that feel just like saying "drilling a hole" and "hacksaw?"

But what about if I want to put down "SQL CLR UDTs" on my resume?
What's stopping you now? By reading this article, you know everything you will ever need to about CLR UDTs. With this on your resume, you will be able to use your expert knowledge on the topic to never use CLR UDT.

I hope that clears things up about CLR UDT. Hopefully now you look forward to not using them and strongly opposing anyone who suggests it. Oh, and I really am serious about sending The Daily WTF swag to whoever can come up with a use for these things. So think about a use; you may just get a free t-shirt.

I'm asked that question every now and then from other developers who've played around in SQL Server Enterprise Manager and noticed the "User Defined Data Types" tab under their database. UDT seem a bit strange and pointless because they do not allow one to define (as one might expect) a data structure with more than one related data element. A UDT consists simply of a name and a base type (INT, VARCHAR(6), etc).

So why then would one use a UDT? It all has to do with a fundamental concept of data known as "domains." I'm not referring to a dot-com type domain, but a domain in the mathematical sense of restricting the value of a particular value. For example, the domain of x for "f(x) = 1/x" is "!=0".

We don't get domains in C++ / C# / VB / etc; all we have are types (integer, date, string, etc). But we're used to not having this; everyone knows you need to check if "x != 0" before trying to divide by x.  Imagine how much less coding (and related bugs) we'd have if trying to assign "0" to "x" threw an exception from the start, instead of in the middle. That's exactly what you can (and should) be doing with your databases. 

When I start on this same explanation to others, it turns out a lot don't quite understand what check constraints are. Basically, check constraints are used to define the domain of a column to ensure that a row can only contain valid data according to the business rules. For example, your Products table should have a check constraint on the Price column, requiring it to be greater than zero (this would cause an exception to be raised if you tried to update the price to zero). Here's another example of some code:

CREATE TABLE [Transactions] (
  [Transaction_Id] INT IDENTITY(1,1) PRIMARY KEY NOT NULL,
  [Transaction_Type] VARCHAR(5) NOT NULL
    CHECK ([Transaction_Type] IN ('Debit','Credit','Escrow')),
  [Transaction_Amount] DECIMAL(4,2) NOT NULL
    CHECK ([Transaction_Amount] <> 0),
  [Reference_Code] CHAR(5)
    CHECK ([Reference_Code] LIKE '[A-Z][ A-Z][A-Z][A-Z][A-Z]'))
)

Get the idea? Each column has a constraint to ensure only valid data is allowed in the table. This way, there is no way that [Reference_Code] could contain anything but a five character string of upper case letters. No need to write code to test it, no need to ever validate it (except maybe on the data entry form so that the user doesn't see an ugly exception message), and no need to assume that it will be anything but that.

Now, immagine that you wanted to have the same [Reference_Code] attribute throughout your database. You'd have to define that check constraint time and time again. If the rules ever changed, you'd need to change it in every place. That's where UDTs come into place. UDTs are the SQL Server imlementation of domains.

If you have a common data element that will be used throughout the system, then it should be a UDT. Account number, Username, Order Number, etc; all should be UDT. When you define these types, you can easily apply rules (which are essentially just check constraints that apply whenever the type is used) to the type, and have it automatically enforced throughout the system.

It's really easy to do. I'll use the SQL 2005 syntax, but you can do the same things in 2000 using sp_addtype and sp_addrule:

CREATE TYPE USERNAME FROM VARCHAR(20)
GO

CREATE RULE USERNAME_Domain
    AS @Username = LTRIM(RTRIM(@Username))
   AND LOWER(@Username) NOT IN ('admin','administrator','guest')
GO

EXEC sp_bindrule 'USERNAME_Domain', 'USERNAME'
GO

And that's it. Now you can use the type throughout the database just as you normally would, and you'll never need to check or verify to make sure that someone slipped in an invalid value ...

CREATE TABLE [User_Logons] (
  [Username] USERNAME NOT NULL,
  [Logon_Date] DATETIME NOT NULL,
  [Success_Indicator] CHAR(1) NOT NULL
    CHECK ([Success_Indicator] IN ('Y','N')),
  PRIMARY KEY ([Username],[Logon_Date])
)

More Posts