"When Should I Use SQL-Server CLR User Definied Types (UDT)?"

No one has asked me that question just yet, but with the release of SQL Server 2005 just around the corner, I'm sure a handful of people will. Unlike regular User Defined Types, CLR UDTs are a new feature of SQL Server 2005 that allows one to create a .NET class and use it as a column datatype. As long as a few requirements are followed, one can create any class with any number of properties and methods and use that class as a CLR UDT.

Generally, when a new feature is introduced with a product, it can be a bit of a challenge to know when and how to use that feature. Fortunately, with SQL Server's CLR UDTs, knowing when to use them is pretty clear:

Never.

Let me repeat that. Never. You should never use SQL Server CLR User Defined Types. I'm pretty sure that this answer will just lead to more questions, so allow me to answer a few follow-up questions I'd anticipate.

Why Not?
CLR UDTs violate a fundamental principle of relational databases: a relationship's underlying domains must contain only atomic values. In other words, the columns on a table can contain only scalar values. No arrays. No sub-tables. And, most certainly, no classes or structures. Remember all the different levels of normalization? This is the first normal form, you know, the "duh" one.

This is a big thing. One can't just go and fudge a tried-and-true, mathematically-validated, theoretically-sound concept and "add and change stuff to it 'cause it'll be cool." Think of how much your car would love driving on a road made of stained glass blocks three years after it was built by an engineer who thought it'd look better.

Deviating so grossly from the relational model will bring as much joy as a dilapidated glass road. Take Oracle's foray into relational abuse: nested tables. I don't believe that there has ever been a single, successful implementation of that abomination. Sure, it may work out of the box, but after a year or two of use and maintenance, it decays into a tangled mess of redundancy and "synch" procedures -- both completely unnecessary with a normalized relational model.

And if that doesn't convince you, just think of having to change that CLR UDT. How easy do you think it would be to add a property to the class representing a few million rows of binary-serialized objects? And, trust me, it won't be nearly as easy as you think.

But wouldn't I want to share my .NET code so I don't have to duplicate logic?
This is always a novel goal, but an impossible one. A good system (remember, good means maintainable by other people) has no choice but to duplicate, triplicate, or even-more-licate business logic. Validation is the best example of this. If "Account Number" is a seven-digit required field, it should be declared as CHAR(7) NOT NULL in the database and have some client-side code to validate it was entered as seven digits. If the system allows data entry in other places, by other means, that means more duplication of the "Account Number" logic.

By trying to share business logic between all of the tiers of the application, you end up with a tangled mess of a system. I have illustrated this in the diagram below.

As you can see, the diagram on the right is a nicely structure three-tier architecture. The system on the right is the result of someone trying to share business logic between tiers, making a horribly tangled mess. One can expect to end up with the latter system by using CLR UDTs.

Never?!? How can there never, ever be an application of CLR UDTs?
Though I may not live by the cliché "never say never," I do follow the "never say 'never, ever'" rule. The only possible time where one might possibly want to use this feature is for developing non-data applications. But therein lies the crux: why would one develop a non-data application using SQL Server? There are certainly better tools out there for what the non-data application needs to accomplish. If you can come up with an appropriate use of a CLR UDT in an information system, I'll buy you a t-shirt or a mug.

But what about the samples provided? That's a use, right there!
Allow me to address these ...

Supplementary-Aware String Manipulation / UTF8 String User-Defined Data Type
Both of these samples have to do with UTF-8 character encoding. Without getting into the details, UTF-8 encodes characters as one, two, three, or four 8-bit bytes, meaning you can not do anything with characters in the string (length, substring, etc) unless you read it byte-by-byte. This works great for preserving "funny characters" while transmitting data but is a poor choice for storage. UCS-2 uses a fixed-size character format of 16-bits per character and is what should be used for storing character data.

Calendar-Aware Date/Time UDTs
Let's think about this. A point in time is a point in time; how it's described varies by culture ("Monday", "Lunes"), time zone (+6:00 GMT, -3:00GMT), calendar (Gregorian, Aztek), and format (2005-08, Aug '05). Describing a point in time properly is essential when interfacing with people or other systems. The keyword in that last sentence was "interface;" such description is best done in the "interface" tier of a system, not in the data tier. Doing this makes as much sense as putting currency conversion and language translation in the database.

Multi-dimensional Points and Latitude/Longitude
A geospatial location is described with Latitude and Longitude. Not Lati-longi-tude. These are two separate attributes and putting them in the same column violates First Normal Form. The same goes for points and other "array-like" structures.

Imaginary Numbers
Seriously? Correct me if I'm wrong, but the only actual use for imaginary numbers is in solving of differential equations. If you're not sure why this invalidates the example, say these two phrases aloud: "solving differential equations" and "relational database." Didn't that feel just like saying "drilling a hole" and "hacksaw?"

But what about if I want to put down "SQL CLR UDTs" on my resume?
What's stopping you now? By reading this article, you know everything you will ever need to about CLR UDTs. With this on your resume, you will be able to use your expert knowledge on the topic to never use CLR UDT.

I hope that clears things up about CLR UDT. Hopefully now you look forward to not using them and strongly opposing anyone who suggests it. Oh, and I really am serious about sending The Daily WTF swag to whoever can come up with a use for these things. So think about a use; you may just get a free t-shirt.

20 Comments

  • How about as a monetary unit? I think this scenario would make perfect sense no?



    From "A First Look at SQL Server 2005 for Developers" (Beauchemin, Berglund, Sullivan):



    "If the object represents a scalar and requires more than one firled ot be described, a user-defined type should be used to represet it; otherwise, it should not."



    "Is a geometric point a scalar? No, because its representation requires two dimensions, not just two numbers. But why is a data that has three numbers a scalar but a point that has only two numbers associated with it not a scalar? You can ask of any date something like 'Is d2 between date d1 and date d3?' and get a consistent answer."



    To me this says that if the rules of inequality are transitive, then a CLR UDT can be used to create a cleaner representation of the data. I'm thinking that it would be helpful to work with monetary units without using two columns to store the amount and the currency.

  • I can't imagine anywhere in a non-trivial application where a CLR UDT would be preferable to normal data in the DB mapped with a decent Object Relational Mapper.



    Now, if MS goes and does something drastic like turning the filesystem or the registry into a shallow wrapper around SQL Server... well then you may see some uses I guess.



    Maybe we're forgetting one thing though -- polymorphism. With .NET as an SP language, storing CLR objects in the DB lets you do "interesting" (aka "so new it's bound to be abused") things. Imagine iterating through a table and calling virtual methods on the objects in a CLR UDT column.

  • @Marty.

    What you describe (storing unfinished "scratch" data) is fine to store as a BLOB -- in this case, the database ("data layer") should have no ability or knowledge of what the scratch data contains. Putting it as a CLR UDT would give the database such knowledge and throw out the clean tiered design.



    @Charles

    Don't be sold by the authors' double-speak! If something requires two attributes to be described, then it is, by definition, not scalar.



    Money has always been two-parts and not scalar: Currency_Code (CHAR(3) per ISO-4217) and Amount (DECIMAL(9,4), or whatever). That's how the real world models and and how the real world describes it.



    We could apply that logic to anything that "we" consider to be scalar. An address, for example, is always stamped as a block, so why not store it in a TEXT field? Or as a CLR UDT?



    @Richard

    Ah yes, "interesting" things. That usually translates to completely unmaintainable and "cool while it lasted" systems. That's where systems modeled on "Object-Relational" models end up going.

  • Another reason to take a pass on CLR UDTs is that you have now tied your database to SQL Server 2005. If you are developing code that can run on multiple versions of SQL Server or even different server vendors, you're screwed.

  • But money _is_ scalar.



    If A > B and B > C and C > D, then A > D, B > D, A > C, etc.



    The problem is that no one thought about creating a monetary data type that not only described the amount, but also the currency of the amount.



    The unit qualifies the amount much the same way that say the year qualifies the month. Which is greater, June or July? It depends on how you qualify the two, as June, 2005 is obviously greater than July 1999. Similarly, which is greater, 100.00 or 99.00? It depends on how you qualify the two, as 99.00 GBP is obviously greater than 100.00 USD. And this is a good example of where the CLR integration becomes very powerful as when you write your custom aggregation funtions, you can call out to a web service to build a lookup table of the currencies :)



    Very powerful (and useful) indeed. Otherwise, the alternative is not nearly as clean. To answer a question like "What is the total value of all of our global positions in USD?" (assuming that you store the original currency value along with the currency) you would either need to break it out into another layer (read all of the data as is and perform conversions in some business code) or add another layer to import the conversion data into SQL Server (and do the math in SQL Server).



    Using a CLR UDT, you can resolve all of this in SQL Server, which is beauty of it all.

  • @Charles

    Time is one-dimensional. "June" is not a point in time, nor a unit of measure. It's the name of a month and cannot be evaluated with "greater than." Points in time can be easily evaluated.



    Money is two dimensional and cannot be evaluated with "greater than." For example, you can not say whether "20.00CDN > 15.00USD" is true or false. Five years ago that was probably true, but today it isn't.



    The web-service solution you describe to aggregate global holdings would be a complete nightmare. It can be solved with much simpler means:

    ACCOUNT_BALANCES { ACCOUNT_NUM, BALANCE_DTE, ACCOUNT_CUR, BALANCE_AMT }

    CURRENCY_CONVERSION { CONV_DTE, FROM_CUR, TO_CUR, EXCH_RTE }



    A few simple join and you've got global holdings given in any currency at any time. No web services. No CLR. Simplicity.

  • @Charles

    I take it you've never done anything like this before? You never "pull" into your prod db or DW -- having the db interface with vendors would be just like the diagram I have above.



    As far as exchange rate tables, vendors for that are dime-a-dozen. Just add a simple job to your batch processing servers to pull from the vendor and push into the database.



    I understand the analogy, but it's incorrect comparison. Months, Month Names, and Points in Time are totally different concepts. Without getting into semantic nonesense, "June" is the name of a month, not a point in time. A point in time scalar in that it is N units past epoch.



    Money and Monetary Value are two different concepts, the former being described as Imentioned before, and the second being a non-quantifiable perception. If we could quantify monetary value, we wouldn't really be having these problems i suppose ...



    Currency types also works significantly differently than units of measure. A cup will always be 12 tablespoons, but 99GBP may not always be more than 100USD.

  • @DavidM

    There seems to be some disagreement in the field as far as what 1NF is and what 1NF is not. Celko, for example, in DATA & DATABASES book, contends that scalar values are required for 1NF.



    However, in that same INTRODUCTION TO RELATIONAL DATABASES (Chap 25), it's pretty clear that CLR UDT falls painfully short of the extended-relational model that Date calls for.



    Consider that the CLR UDT do nothing but serialize their data into a BLOB column. On top of that, with the use of properties on the CLR types, we have the potential for non-deterministic "mystery" values exposed to the user. Surely this is *not* part of the non-scalar types either.



    Informix's DataBlades are a prime example of a good implementation of an extended-relational model. It is actually an extention at the engine level, as opposed to CLR UDT which exist on top of the relational level.

  • Alex,



    Cheers for the response!



    The disagreement seems to boil down to: One Value vs One Scalar Value.



    Personally, I like the simple definition and feel the term "scalar" be banished!



    It shouldn't matter how the value is being physically stored. People get REALLY confused between the logical and physcial.

    I once read someone argue that indexed views violated normal forms because the view stored the data.



    I haven't had a chance to use 2005, but regardless of the issues/limitations CLR UDTs have, it is a step in the right direction AFAIAK.

  • Alex,

    how is something like a complex number different from a string? A string can be viewed as a sequence of characters (and often is: witness CHAR(30)), which makes it not scalar by your definition.

    What makes CHAR(30) ok and INT(2) not ok?

  • Alex,



    Your confused. A point is a datum, an array is a collection of data. Are you next going to argue that rational numbers should be stored in two columns because they can be represented as a numerator and a denominator, and not a numi-denominator? And, as an earlier post mentionied, what about strings? Getting a table into first normal form isn't about breaking each and every datum up into it's components and then putting that into separate columns, but rather it is about breaking the data apart into single pieces.



    A point is a single entity, not two: the two components are very tightly bound, in fact it doesn't even make sence when working with a point, to only look at one component, you have to look at both components when doing transformations with that point. This same result is also true of time, complex numbers (Oh, and by the way, nobody is going to need complex numbers? Complex numbers are only used in solving differential equations? You're now an expert on what everybody needs in a database? So, you're just going to ignore all of those scientific applications that need a reliable data store?), and vectors (math/physics vectors). Vectors are not scalar in the math sence, but they are in the computer science sence, because the components of a vector are so tighly coupled. (Again, you can't talk about a vector without working with both components.) This tight coupling of components is (in general) not true of arrays and lists. An array is a collection of values, but each value is distinct (unlike in points, math/physics vectors, etc). This is exactly why an array of chars to represent a string is valid in a column, but an array of chars to represent grades would not be. Each component of the array of chars in the string doesn't mean anything buy itself (just like vectors, points, rational numbers, etc.), however, in the array of chars representing grades, each component does have meaning by itself.



    Being a scalar value (in computer science terms) has nothing to do with being an ordered set (being a scalar value has nothing to do with being an ordered set in mathematics either, but I digress). Being a scalar value has eveything to do with whether the components of that value (the numerator and denominator of a rational number, the two components of a vector) have any meaning (as a piece of data) by themselves.

  • I have one more thing to say. Concerning my rant above, my point isn't that you should always put tightly coupled pieces of data in the same column, but rather, that it is often difficult to determine whether something should go into seperate columns or not. I am not a database expert, but I do know inconsistancy when I see it, and it would be inconsistant to insist that (mathematical/physics) vectors should be broken into separate components unless you were also insisting that rational numbers, strings, and other (mistaken for merely aggregate) datatypes should also be broken up into components and put into separate columns. We're thinking about too much detail. We need to think about our data in more abstract terms. We need to really understand what Codd was saying when he proposed the normal forms, and also understand the mathematical reasons for using the normal forms.

  • I entirely agree that only scalar values should be allowed in database columns. The trouble is that one person's "atom" is another's "bucket o' quarks with electrons in shells around it".

    After all, if you look at them the right way, strings aren't really scalars -- they are arrays of characters. Even integers can be seen as arrays of digits. But obviously 1NF doesn't require decomposing these.

    1) By UCS-2 you hopefully mean UTF-16, as UCS-2 is not capable of representing all Unicode values. Unfortunately, UTF-8 and UTF-16 don't sort the same way on a simple binary sort order. UTF-8 sorts according to the order of Unicode codepoints, but UTF-16 does not.

    2) If talking about the past, it's fine to capture moments in time using a single time zone such as UTC and push off converting to local time to the interface layer. For future moments, that won't work: you need a local time and a place. Nobody knows what UTC instant corresponds to "midnight November 1, 2010 in New York", because Congress could change the time-zone rules or (more likely) the daylight-savings rules. It's happened before.

    3) I correct you: you are wrong about complex numbers. You need 'em for anything concerned with electrical engineering, a perfectly fit subject for relational databases. And to their users, complex numbers are no more non-atomic than decimal fractions are.

  • 1. You cannot sort any string using binary, and especially not unicode strings.

    2. I agree, that is a hard problem.

    3. I have to agree with you here.

  • Dude, if you think the only use of imaginary numbers is in solving differential equations, I think it is time to go back to uni and study some more maths.

  • SQL Server CLR User Defined Types is Microsoft's attempt at supporting the new Object Relational Extensions of SQL3. However, I personally don't like the way M$ decided to implement this. Oracle has a much easier way of defining user defined types:
    create type mytype as (
    somenumber number,
    somechar varchar2(50),
    );

    I don't have an opinion as to whether creating/using UDT's is a good idea tho...

  • >A geospatial location is described with Latitude and Longitude. Not Lati-longi-tude. These are two separate attributes and putting them in the same column violates First Normal Form

    I think you are completely wrong. They are atomic in that both are required in the attribute. A latitude on its own is meaningless wrt a point on the globe, just as an hour number is meaningless wrt a point in time.

    There are many attributes that contain what look like multiple values but are atomic values. For example, the quads in an IP address. For example an IP address without a subnet mask. For example the digits in a phone number.

    Arguing that these should be split into component properties is like arguing that date time information should be stored like this:
    DOBYear int,
    DOBMonth int,
    DOBDay int,
    DOB Hour int,
    DOBMinute int,

    A user-defined type, for example DateTime, is the best way to store this data.

  • The use of SQL Server CLR UDTs would be warranted in certain limited situations. I can see it being useful in a space-limited environment where one needs to share data whose representation is smaller in UDT form than separated into traditional columnar format. It would also be useful when rapidly developing a scalable system that used SQL Server as its mode of transfer between machines rather than developing a hand-coded system of server message queueing. It's also useful in many queueing systems that could be scaled across servers using SQL Server's logic automatically. Use of UDTs would be prime in any case that SQL Server native capabilities can be borrowed upon, especially since developers would save time from wrapping them in database column / tableobject data transfer systems. CLR UDTs are also useful where the point of the data IS the object rather than further breakdowns of the object's components. Yes, yes, you can argue scalars and other theory all day long, but if the point of using SQL Server is to store the object itself, that's it - you cannot break down the item beyond its own component. The very word "atomic" when used to describe data that has no further subcomponent is kind of a misnomer, considering that the "atom" itself is comprised of subatomic particles (as the very elements we describe as atomic in DBMS can themselves be composites). Now let's examine "best" from another viewpoint: your boss would agree with me. Compare 2 months of extra development time to implement it in a way that conforms to a set of arguable theoretical standards versus an easily maintainable system that saves all of that time and money? I think we have a winner. Let me know when you've got that t-shirt ready!

  • In my oppinion, reading your comment on NEVER it is very said to see a culture still exists where people are very much against change.

    Currently,..and I stress, CURRENTLY one is suppose to design databases only scalar, where each column only represents one piece of dead data.

    Yes, we all have learned the way of DB dev.
    But hang on,... how many times have we been sitting down thinking, ..I wonder if I could store a value in a column of a data type where the data type could actually do some validation itself and report back when inserting into that table that the value is incorrect,..instead of recieving a SQL error,..mostly unreadable.

    Now, with CLR UDT this is possible.
    Also one would not introduce business logic into the wrong place as one would add data validation routines with readable responses.

    Controling the data coming into the database is by far simpler.

    I come from the days of good old multi-tier development were DCOM objects would be placed where their belong. Business logic on the application server, data validation on the database server and minor relevant validation in a component on the client pc.

    Well times are changing and if want to part of the future we have to adabt new ways of improving.

    One can not say the earth is flat cause some else said so. Well I think for yourself, oldtimer, its about time you get out of your comfort zone and let the future improve how we design applications.

    Lets strive torwards making things faster, bigger, better and use less resources while we are at it.

    By all means I'm not saying CLR is a wonder,..not at all,..in fact I think that so far it has little need.

    My point is that people like you, that out-right have issues with something new without even having considered the possiblities are very sad.

    Get a life my friend.

  • @Ben:

    You are dead wrong about a Point being a datum. In fact, your explanation describes the exact reason it's a bad idea to store two values in one record.

    Imagine you want to know what records live in a box defined by (0,0) and (10,10). Try writing that query when all your point data is concatenated into "x,y" pairs. In a bounding box check, you will absolutely want to know about the X coordinate of your point independantly of the Y.

Comments have been disabled for this content.