Database theory, your friend for success.

In the past weeks I've read several articles, blog posts and other digitally expressed thoughts about relational databases and query systems, and how they all suck compared to K/V stores, CQRS, OODBs or whatever the hype-of-the-day term is. While most of them were simply re-labeling 20+ year old common knowledge, others were pretty stupid and downright sent the (novice) reader the wrong message. By 'wrong' I mean: the conclusions are based on false 'facts', assumptions and hand-waving n==1 pseudo-science.

Instead of writing a long essay here, I'll quote from and link to several Wikipedia articles and other articles which can help you learn what relational models and databases are all about, what theory they're based on, why they work and what tools (as in: methodologies) are at your disposal. It's not meant to sell you the picture of 'OODB == bad, RDBMS == good', as that would be as silly and short-sighted as the articles I mentioned above. Instead, you should see this small subset of knowledge about relational models and databases as a starting point for your own research into what to use and how to approach a problem domain. After all, you can only make an informed decision if you know what you're talking about.

Relational model and theory

Relational databases are based on the relational model as described by E.F. Codd. Its core is based on predicate logic, and operations on the model are expressed in relational algebra, which is equivalent in expressive power to queries expressed in first-order predicate logic (the relational calculus).

Links:

Why is this so important? Because it teaches you the idea behind grouping attributes together into entities, using those entities to define the meaning of data and, above all, creating new entities from them by using projections.
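To make that concrete, here is a minimal, hypothetical sketch in Python (all names invented) that models a relation as a header plus a set of tuples and implements two core operators of the algebra, selection and projection:

    # A relation: a header (attribute names) plus a body (a set of tuples).
    customers_header = ("id", "name", "city")
    customers = {
        (1, "Alice", "Amsterdam"),
        (2, "Bob", "Berlin"),
        (3, "Carol", "Amsterdam"),
    }

    def select(header, body, predicate):
        """Selection: keep only the tuples satisfying the predicate."""
        return header, {row for row in body if predicate(dict(zip(header, row)))}

    def project(header, body, attributes):
        """Projection: keep only the named attributes. The result is a
        set, so duplicate tuples collapse, as in set theory."""
        idx = [header.index(a) for a in attributes]
        return tuple(attributes), {tuple(row[i] for i in idx) for row in body}

    # 'Customers in Amsterdam': a new entity derived by selection.
    _, in_amsterdam = select(customers_header, customers,
                             lambda r: r["city"] == "Amsterdam")
    # 'All cities we have customers in': a new entity derived by projection.
    _, cities = project(customers_header, customers, ["city"])
    print(in_amsterdam)  # {(1, 'Alice', 'Amsterdam'), (3, 'Carol', 'Amsterdam')}
    print(cities)        # {('Amsterdam',), ('Berlin',)}

Note that the result of each operator is again a relation, so operators compose freely; that closure property is what makes the algebra so powerful.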

Normalization and De-normalization

Relational models are in a given normal form. The normal form describes how much redundant information is stored in the data set defined by the relational model. Transforming a model from a lower normal form to a higher one is called normalization; the other way around is called de-normalization. Normalization is considered a good thing because it eliminates various problems which arise from redundant information in a de-normalized model when you perform data manipulation operations. A small sketch of such a problem follows below.
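As a hypothetical illustration (table and column names invented, using Python's built-in sqlite3 module), here is the classic update anomaly that normalization removes: in the de-normalized table the customer's city is stored redundantly with every order, so updating it means touching every copy, and missing one leaves the data inconsistent.

    import sqlite3

    con = sqlite3.connect(":memory:")

    # De-normalized: customer attributes repeated for every order.
    con.execute("CREATE TABLE orders_flat (order_id INTEGER, customer TEXT, city TEXT)")
    con.executemany("INSERT INTO orders_flat VALUES (?, ?, ?)",
                    [(1, "Alice", "Amsterdam"), (2, "Alice", "Amsterdam")])

    # Alice moves, but the UPDATE misses one of the redundant copies:
    con.execute("UPDATE orders_flat SET city = 'Berlin' WHERE order_id = 1")
    print(con.execute("SELECT DISTINCT customer, city FROM orders_flat").fetchall())
    # [('Alice', 'Berlin'), ('Alice', 'Amsterdam')] -> inconsistent data!

    # Normalized: every fact is stored exactly once, so one UPDATE suffices.
    con.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
    con.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, "
                "customer_id INTEGER REFERENCES customer)")
    con.execute("INSERT INTO customer VALUES (1, 'Alice', 'Amsterdam')")
    con.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 1), (2, 1)])
    con.execute("UPDATE customer SET city = 'Berlin' WHERE id = 1")  # one row, one fact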

Links:

An important quote, which should make clear why normalization is a Good Thing:

Normalized tables are suitable for general-purpose querying. This means any queries against these tables, including future queries whose details cannot be anticipated, are supported. In contrast, tables that are not normalized lend themselves to some types of queries, but not others.

This is precisely the point of using a relational model: you store data in a model which gives it meaning in a general form, so you can derive new information from it using relational algebra. It doesn't matter which queries you might need in the future; the model is already suitable for answering them.

Normalization has a downside: reading data can require expensive operations. For performance reasons, relational models therefore sometimes get a de-normalized variant, used in situations where a lot of queries have to be run across many different tables, with many joins, to obtain the requested result. By keeping a copy of the data for reads in an optimized, de-normalized form, projected (using normal relational algebra) from the normalized model onto the de-normalized one, read queries can be sped up drastically. As the de-normalized form is a copy of the original data and isn't used for data manipulation, it doesn't suffer from the downsides of a de-normalized model.

In modern relational databases, these de-normalized variants are implemented with materialized views. Materialized views (in SQL Server they're called indexed views) are queries whose results are stored on disk like a normal table. You can add indexes to these 'tables' to optimize them further for read performance, and materialized views can be partitioned across multiple systems to handle massive data sets. In today's hyped-up term CQRS one can clearly recognize these two variants of the same model: for data manipulation the normal relational model is used, and for reads the de-normalized variant with materialized views. A small sketch of the idea follows below.
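SQLite (used here because it ships with Python) has no materialized views, so the hypothetical sketch below simulates one by hand: a table refreshed from the normalized model, paying the join cost once at refresh time instead of on every read. All names are invented.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
        INSERT INTO customer VALUES (1, 'Alice'), (2, 'Bob');
        INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 15.0), (3, 2, 7.5);
    """)

    def refresh_order_summary(con):
        """Project the normalized model into a de-normalized read copy."""
        con.executescript("""
            DROP TABLE IF EXISTS order_summary;
            CREATE TABLE order_summary AS
                SELECT c.name, COUNT(o.id) AS order_count, SUM(o.total) AS revenue
                FROM customer c JOIN orders o ON o.customer_id = c.id
                GROUP BY c.name;
            CREATE INDEX ix_summary_name ON order_summary(name);  -- index the copy
        """)

    refresh_order_summary(con)
    print(con.execute("SELECT * FROM order_summary").fetchall())
    # [('Alice', 2, 25.0), ('Bob', 1, 7.5)] -- reads never touch the join again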

Links:

Information Analysis

No matter what people say, always remember that performing analysis of your problem domain (which functionality is required, which abstract entity definitions you can recognize in the problem domain, and so on) is a good thing. If you don't know what you're dealing with, you can't create successful software to solve the problem at hand. Period.

The big problem with analysis is this: if you use a flaky analysis method, you end up with a skewed picture of your problem domain, and therefore your software will likely suck. But how do you know whether your analysis method is flaky?

In short, analysis comes down to gathering enough information to make informed, well-reasoned decisions during the software development process. You need to convince yourself, based on facts, that your analysis was complete and that you have gathered all the information needed for those decisions. This is a complex process, and you need a methodology which lets you analyze your problem domain properly without leaving important areas untouched. One of the pioneers on this subject is E. Yourdon. Today, some people consider his work very 'dated', but I firmly disagree: analysis of problem domains hasn't suddenly changed, and the tools Yourdon provides are as alive and useful today as they were 20+ years ago. If you look closer and understand what Yourdon meant with all the elements he described, you'll see that most of them have simply been given new names in recent years and are not that dated at all.

As the field of analysis is very wide and deep, it's impossible to describe all possible forms and methodologies here. The list of links below is far from complete, but it should get you started.

Links:

Friction and the Impedance Mismatch

One important aspect of a relational model is that it doesn't define behavior. Behavior is defined outside the relational model, through operations in relational algebra. This is a mismatch with modern-day object-oriented software design, where data and behavior are combined in an object. If you look at software as code, you run into a problem when you want to store the live objects you have in memory in a relational database: relational databases work with tables and relational algebra, not with objects and their embedded behavior. A small sketch of the mismatch follows below.
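To put the mismatch in a hypothetical nutshell (class and names invented): the object below combines data and behavior, but a relational table can only hold the data part.

    class Order:
        """Data and behavior combined, the OO way."""
        def __init__(self, id, total, discount_pct):
            self.id = id
            self.total = total
            self.discount_pct = discount_pct

        def final_price(self):
            # Behavior: there is no table column for this method.
            return self.total * (1 - self.discount_pct / 100)

    # What ends up in the database is only the data, as a flat row:
    order = Order(1, 100.0, 10)
    row = (order.id, order.total, order.discount_pct)  # (1, 100.0, 10)
    # final_price() stays behind in code; in the relational world the same
    # result would be derived by an operation (a query expression) instead.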

This has resulted in a different form of database: the OO database, or OODB. OODBs are nothing new; they've been around for many years. They're ideal if you want to store data in the same shape it has in your objects in memory: live objects can be stored and retrieved without any visible conversion (there are conversions, but they're not 'in your face'; they happen behind the scenes). You effectively look at the data as if it's inside your object model, so you navigate from one object to the next, not from one table to the next.

If you go back to the relational model and the important quote about a normalized model above, you'll see that an OODB has an important side-effect: it's not really suitable for queries which don't match your object model (or don't use objects at all). This gets particularly problematic when the software using the OODB is replaced by a different system: the database outlives the software that required its existence, yet it isn't usable in a different form, as it's set up and designed to meet only the requirements of the replaced software.

Being agile means being able to cope with this, to deal with change. If you use a relational model which is designed to represent reality, the facts you recognize in the problem domain, then the software consuming it doesn't dictate how the model is defined; reality does, as the relational model represents reality. That's an important aspect, and it makes clear why relational models are so important, even in OO software: if the software is replaced with something else, the relational model and all its data can be re-used, simply because the model isn't tied to (or, I should say, coupled with) the software consuming it.

To overcome this impedance mismatch between OO software and relational models, the concept of Object / Relational mapping was introduced. This 'mapping' between two sides A and B is based on the idea that for an element E in A there is a directly related element F in B, and vice versa. This theoretical connection between E and F, the connection the mapping is based on, is the sole reason why it works in the first place. That's why you can save an object to a relational model and fetch it back (you're not saving the object, you're saving its data, the entity instance, but I've already written about that some time ago, see below). It works because both sides give the same meaning to the data. This is important because it lets you work with a relational database in an OO fashion without running the risk of being unable to deal with change. A small sketch of such a mapping follows below.
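As a minimal, hypothetical sketch of that E-to-F correspondence (real O/R mappers add identity maps, change tracking, query translation and much more; all names here are invented): each attribute of the class is paired with a column, and the same mapping drives both saving and loading.

    import sqlite3

    class Customer:
        """Plain object: the behavior lives in code, not in the database."""
        def __init__(self, id, name, city):
            self.id, self.name, self.city = id, name, city

        def greeting(self):  # behavior: never stored, only the data is
            return f"Hello {self.name} from {self.city}"

    # The mapping: object attribute E  <->  table column F.
    CUSTOMER_MAPPING = {"id": "id", "name": "name", "city": "city"}

    def save(con, obj):
        cols = ", ".join(CUSTOMER_MAPPING.values())
        marks = ", ".join("?" for _ in CUSTOMER_MAPPING)
        values = [getattr(obj, attr) for attr in CUSTOMER_MAPPING]
        con.execute(f"INSERT INTO customer ({cols}) VALUES ({marks})", values)

    def load(con, customer_id):
        cols = ", ".join(CUSTOMER_MAPPING.values())
        row = con.execute(f"SELECT {cols} FROM customer WHERE id = ?",
                          (customer_id,)).fetchone()
        return Customer(**dict(zip(CUSTOMER_MAPPING, row)))

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
    save(con, Customer(1, "Alice", "Amsterdam"))
    print(load(con, 1).greeting())  # data round-trips; behavior stays in code

Because both sides give the same meaning to the data, the table stays queryable by any other consumer, which is exactly the property the OODB scenario above loses.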

As many databases outlive the software they were initially created for, it's important to realize this. All that matters is: when you look at your data, can you distill information from it in any shape or form you might wish or need, without needing the special / original software the database was created for?

Links:

I hope this article has given you some insight into how to deal with databases: what they are, what the essential aspects of databases and relational models are, why they're important and how to successfully embed them in your own software. Now go and build great software, based on well-informed decisions.

11 Comments

  • Well said, Frans. I downloaded db4o 18 months ago but I still haven't got around to looking at it. The main reason is that I trust relational databases for two of the reasons you mentioned. The database is likely to be used in programs you didn't envisage originally. This happens to me over and over again.
    Secondly, the data will outlive the programming environment. I am currently upgrading a .NET 1.1 app I wrote years ago to .NET 3.0 and it may yet go to version 4. But the data and the database just live on. That clinches it for me.

  • Oh, please... Who needs theory? Just look at what the cool guys are using.

  • There are a lot of good points here, but I sense some bias. One implication I picked up is that it is the database, not the data, that's important ("As many databases outlive the software they're initially created for..."). What's truly important is the data. If you can get your data out of an old database and into a new one, then the (old) database no longer matters either. All too often, I've seen software saddled with an archaic database when really what was important in it was the data, not the database.

    The second implication is that the database is longer lived when you make it represent reality ("...the relational model represents reality"). But you even state that the database is just the data, not the behavior. That says to me that a database is only part of the reality. Indeed, I've seen perfectly good software systems fail because someone thwarted the encapsulation and data protection of the software by using a general-purpose RDBMS tool to manipulate the data into a state that violates reality, a reality that was defined by the software and database as a whole.

    I'm not at all saying that relational databases are a bad thing, but I think it's important to highlight their power and their Achilles' heels, especially in how they're used by software applications.

  • Brian: Exactly the kind of statements one would expect from someone who is not aware of the mathematical guarantees of a relational database system.

    The unfortunate downside of all of these newfangled "better" ways of persisting data is the lack of certain guarantees. Here are a few:
    1. Elimination of data redundancy
    2. Elimination of data inconsistency caused by update anomalies
    3. Mathematical guarantees of lossless joins (see Chase Test for lossy joins)
    etc...
    These things can NEVER be guaranteed mathematically by some other persistence framework UNLESS they adhere to the theories behind relational databases.
    It is further discouraging that we are willing to throw away nearly 40 years of provable theory because a relational database is "too hard." Poor Codd must be turning over in his grave over this apparent lack of rigor.

  • @Brian:
    "Database" is what I call the data + the model definition (relational model). Not the system surrounding it. So for example: 'database' == .mdf file, not sqlserver.

    Data is important, but without context (the relational model which gives the bits meaning), it's just a bucket of bits. A table with 20 columns where 17 of them have the value "1": does that mean anything? It does if you have a relational model to give the data meaning. So both are important.

    Of course you can migrate data from one schema to another; you just have to make sure the new schema gives the same meaning to the data. (When creating migrations through projections, the meaning of the new sets has to be in line with what the projections mean, of course.)

    About reality: you should see 'reality' as the facts you can distill from the problem domain. So, for example, you can determine which attributes form the customer entity, which form the order entity, how customer relates to order, etc. That's not behavior, but it is a description of reality. The behavior is the functionality consuming and manipulating the data stored in that model's tables.

    Behavior can of course change over time, though that doesn't imply the data will change. Only if the problem domain changes (e.g. you first were a chair factory and now you provide chair-making services) does the relational model change, and always by migration: projecting the current data onto a new schema.

  • @FransBouma, I'll accept your definition of a database as the data + model, separate from a database system. And, yes, a data model does give more meaning to the data, but there are plenty of "facts" that won't be represented in a relational model (as far as I can fathom). By facts, I'm including everything in the "reality" that the system (database system and software system) is modelling.

    Consider a reality where things must progress in a certain way. This is often modelled in software as a state machine. A state machine might say you have to go from state A to C by way of state B. Using programmatic logic, you can enforce this. I've not yet seen a relational model that enforces that. Instead, someone could reach into the database and manipulate the data in a way that's consistent with the relational model but still violates the reality the combined system was supposed to model. And the reality for me is: I have seen this happen FAR too often: people manipulating data in a database system directly using SQL, violating the part of reality the database couldn't model.

    Again, my purpose is not to attack relational databases, but to clarify the limits of a relational database.

  • @Brian: Fair enough.

  • This is cruel.

    While Udi, Ayende and Greg Young explicitly show how RDBMS alternatives may change the application development experience, you give a lot of computer science links.

    To read and comprehend all of them, one needs to spend a lot of time. Actually, the DB theory behind the first 8 links is taught at university for about a year.

  • @Brian:
    You have the meaning of data, which turns data into information, and you have the interpretation of that information to drive actions. I think these two are different things. Using your state machine metaphor: the state transition table is the data; what the states do and when events occur is behavior. The transition table has data, which is given meaning by the relational model it's in. So that model gives meaning to a column called 'State', a column called 'Event' and a column called 'NextState'. This data then becomes information, as it is usable by a state machine. How that state machine works, how the states work, that's up to the code consuming the model's data. The state machine implementation can change, the model stays the same. You can even alter the data in the table to meet the requirements of new states/events; it will still work. That was my point. A tiny sketch of this split follows below.
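    (A hypothetical sketch of that split, with invented names: the transition table is pure data a relational model can hold; the code merely interprets it.)

        # Transition table: pure data; the rows could come straight from a
        # table with columns State, Event, NextState.
        transitions = [
            ("A", "go", "B"),
            ("B", "go", "C"),
        ]

        def next_state(state, event):
            # Behavior: interpret the data. Changing the rows changes the
            # machine without touching this code.
            for s, e, n in transitions:
                if s == state and e == event:
                    return n
            raise ValueError(f"no transition from {state} on {event}")

        print(next_state("A", "go"))  # 'B'; to reach C you must pass through B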

    @FallenGameR:
    It's not cruel. Think about it this way: when you charge your client 100 an hour to do software development, the client expects a professional. If that person isn't a professional but an amateur, s/he's really not being honest with the client. In other words: be aware that educating yourself is part of becoming and STAYING a professional. Of course that takes time, and one has to be willing to pay that price, or forever stay behind.

    But it's not that much info really. You can grasp this in a couple of weeks and, as it's fundamental knowledge, it's usable everywhere you go. :)

  • @FallenGameR: there's a difference between a piece of software which enables you to find a piece of data and a theoretical model which gives meaning to that data.

    You refer to an article about the former, which pretends to be about the latter. That's what this article I posted is all about.

  • Spot on, Frans. On the other hand, people need to think outside the box, lest they stagnate.

    For the novice, yes, I absolutely agree; it simply sends them down the road to failure. Fundamentals and pragmatism will always trump the latest trend or 'hype of the day'.
