Normalization

Pat Helland has been talking about immutable data for a while. His latest post, 'Normalization is for sissies', is quite fun. A not-very-accurate post from Dare reminded me of it and pushed me to post this.

Pat is playing with two ideas.

One is that immutable data should not be normalized, as normalization is designed to help you deal with updates.

Another is that you actually don't need to delete/update rows in the database. 'Deleting' a row means setting InvalidationTimeStamp = now(); updating a row means setting InvalidationTimeStamp = now() on the current row and inserting a new row with SinceTimeStamp = now() and InvalidationTimeStamp = null (you actually need two sets of dates, but that's for another post).
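
A minimal sketch of that pattern in SQL Server syntax, against the Customer table described below (the customer id and new name are made up for illustration):

    -- 'Delete' customer 1: close the current row instead of removing it.
    UPDATE Customer
    SET    InvalidationTimeStamp = CURRENT_TIMESTAMP
    WHERE  CustomerId = 1
      AND  InvalidationTimeStamp IS NULL;

    -- 'Update' customer 1: close the current row...
    UPDATE Customer
    SET    InvalidationTimeStamp = CURRENT_TIMESTAMP
    WHERE  CustomerId = 1
      AND  InvalidationTimeStamp IS NULL;

    -- ...and insert the new version as a fresh, immutable row.
    INSERT INTO Customer (CustomerId, CustomerName, SinceTimeStamp, InvalidationTimeStamp)
    VALUES (1, 'New Name', CURRENT_TIMESTAMP, NULL);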

Now, if you put the two ideas together, all the data is immutable, so you don't need to normalize anything. This means each record holds the 'extended table': the 'base table' plus all the fields from the related tables in your normalized model. If you have Orders, Customers, and Countries, your tables will look like

Order: OrderId, OrderDate, CustomerId, CustomerName, CountryId, CountryName, SinceTimeStamp, InvalidationTimeStamp

Customer: CustomerId, CustomerName, CountryId, CountryName, SinceTimeStamp, InvalidationTimeStamp

Country: CountryId, CountryName, SinceTimeStamp, InvalidationTimeStamp

You will waste a lot of disk space, but that's not something to worry about. The advantages of this approach are very important: you don't need to join, and you can cache/replicate most of your data.
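
To make the 'no joins' point concrete, here is an illustrative comparison in SQL Server syntax (OrderId 42 is made up, and [Order] is bracket-quoted because Order is a reserved word):

    -- Normalized model: reading an order's customer and country takes two joins.
    SELECT o.OrderId, o.OrderDate, c.CustomerName, co.CountryName
    FROM   [Order] o
    JOIN   Customer c  ON c.CustomerId = o.CustomerId
    JOIN   Country  co ON co.CountryId = c.CountryId
    WHERE  o.OrderId = 42;

    -- Denormalized, immutable model: one table, no joins; just pick the
    -- row version that has not been invalidated.
    SELECT OrderId, OrderDate, CustomerName, CountryName
    FROM   [Order]
    WHERE  OrderId = 42
      AND  InvalidationTimeStamp IS NULL;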

The main physical issue I see today with this approach is that database engines have a limit on the number of columns they can store per table, and an approach like this one requires a large number of columns per table.

I wonder how a model like this would impact O/R mapping tools.

They could probably hide this kind of schema automatically by changing the semantics of delete/update, so that reading Order.Customer.Name returns the name stored in the Order row.

How would they handle object identity? Today, if I have 'Customer #1' in memory, every reference to Customer #1 points to the same instance, because the object model is normalized. In this model, references would point to different read-only instances.

7 Comments

  • >A not-very-accurate post from Dare reminded me of it and pushed me to post this.

    I read your post twice. I didn't see the inaccuracy mentioned. Maybe I'm not smart enough to read between the lines. :)

  • I'm sorry, I actually commented on your post but for some reason the comment did not show up. This post was not an answer to yours ;).

    My comments are actually along the same lines as John Prevost's, so it does not make sense to post them again. In your use case, you don't need joins, and the proposed ways of denormalizing it aren't good.

  • I updated my post after seeing his comments and others like them. Denormalization is usually a last resort, when you have reached the limits of your database (indexes & views no longer help) and caches don't help because you have a long tail of queries (e.g. social networking sites).

    Tomnipotent has a great response to John Prevost in the comments on my post. You should read it.

  • Dare, I actually agree with him and with you in the sense that normalization is not a good idea in those kinds of databases. I just think you chose a bad example to show it.

  • Isn't this known as a temporal database?

    WRT denormalization, I'd say that if you denormalize your relational data that much, you may want to question why you are using an RDBMS at all. Dare updated his post to say you should do things X, Y, and Z first. I'd go so far as to say, move to an OO DBMS first as well. Denormalizing is a hack. It is doing the wrong thing that might work. I'm not a fan of the wrong thing, EVER. The number of things that can go wrong increases greatly with denormalized data. Of course, if your data is really immutable, then OK. But Dare's topic didn't talk about immutable data. It talked about "myspace user profile data" or whatever.

  • This type of schema, which gives a historical perspective, is very valuable, and yes, it can throw a wrench into some O/R mapping tools. One solution to the O/R mapping problem is to create a database VIEW of the table which hides the timestamps and only shows the current value of each row (i.e. where InvalidationTimeStamp is null). However, you don't want just read-only access to this view, so the next step is to use the INSTEAD OF trigger capability of the DBMS to make the view updatable. You add an INSTEAD OF UPDATE trigger that runs when you attempt to update the view and instead updates the history table with the proper values.

    Now, you can use the O/R mapping tool to map objects to the 'now' view, knowing that updates to that view will properly update the history table (see the sketch after the comments). Note that this approach only works for the 95% of use cases where the objects are only interested in the latest value from the table. For the 5% of use cases where you want to show history, you still need to query the real table. Also, when you add effective dating (i.e. the bitemporality element which the poster reserved for another day) this gets more complicated, as you need to specify a transaction time of now and an effective date of now.

  • In the Wikipedia entry about temporal databases (http://en.wikipedia.org/wiki/Temporal_database) there are some links to resources about this.

    I read two of the books that are cited (Date's and Snodgrass'). Date's has a lot of theory and is difficult to read. The other one is more practical; its PDF is linked from the Wikipedia article.

    Regards
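
A sketch of the view-plus-trigger approach described in the sixth comment above, in SQL Server syntax. The view and trigger names are made up for illustration, the table and columns are taken from the post, and it assumes CustomerId itself is never updated:

    -- A view that exposes only the current version of each customer.
    CREATE VIEW CustomerNow AS
    SELECT CustomerId, CustomerName, CountryId, CountryName
    FROM   Customer
    WHERE  InvalidationTimeStamp IS NULL;
    GO

    -- Turn updates against the view into 'close the old row, insert the
    -- new version' against the underlying history table.
    CREATE TRIGGER CustomerNow_Update ON CustomerNow
    INSTEAD OF UPDATE
    AS
    BEGIN
        -- Close the current versions of the updated rows.
        UPDATE c
        SET    c.InvalidationTimeStamp = CURRENT_TIMESTAMP
        FROM   Customer c
        JOIN   inserted i ON i.CustomerId = c.CustomerId
        WHERE  c.InvalidationTimeStamp IS NULL;

        -- Insert the new versions as fresh, immutable rows.
        INSERT INTO Customer (CustomerId, CustomerName, CountryId, CountryName,
                              SinceTimeStamp, InvalidationTimeStamp)
        SELECT CustomerId, CustomerName, CountryId, CountryName,
               CURRENT_TIMESTAMP, NULL
        FROM   inserted;
    END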
