Relationships

Friday, May 9, 2003

No, this is not a piece of text about a broken/just started relationship between two people :). 'Relationships' should be read in a geeky context: relationships between attributes and entities in databases / object hierarchies, you know, the kind of relationships a computer-savvy person feels comfy with.

On April 18, Edgar (Ted) Codd died. This man is probably not the man you remember as the 'hero' you adored and who convinced you computers were your future, but it is the man who invented the concept of 'relationships' in our world, the software developers world, which resulted in the concept of Relational Databases. Now, as we say in The Netherlands, "Over de doden niets dan goeds", which means something like "Say only good things about the people who died", I shouldn't be saying this but in the last couple of days I'm seriously starting to doubt the quality of the form Codd defined the concept of a relational database as we know today. I think it is seriously flawed, and when I say seriously, I mean: really seriously, as much that we should start using other models today, instead of keeping this flawed model around any longer.

Entities and Attributes
In the database world, we talk about 'Entities', like 'Customer', 'Order' etc. These entities have 'Attributes', like CustomerName, and some of these attributes are used to identify an entity, like CustomerID. Entities are thus formed by groups of attributes and use one or more of these attributes to make a group of values stored in these attributes unique. So far no new stuff. The first question which arises in the minds of database-newbies is: why are these attributes in the group of entity 'Customer' and these other attributes, the ones in entity 'Order', are not? The reason for this is that we see instances of entities, rows in the table formed by the group of attributes of a given entity, as identifiable units, and including the attributes of other entities would break the abstract definition of the entity the attributes were added to (f.e.: we could add the attributes of 'Order' to the entity of 'Customer' but that would enlarge the 'Customer' entity with attributes which are not as closely related to each other as the current attributes of 'Customer'). To achieve such a model, we often use modelling techniques such as NIAM / ORM make it easy to define on an abstract level the model we want to achieve: the E/R (Entity / Relationship) Model. Such a model is in most cases a 1:1 representation of a table-structure and can be transformed easily to these table-structures we all work with. Entities become tables, the attributes of an entity form the fields (or columns) of the entity's table and the identifying attribute(s) become the primary key. The E/R model also embeds several other elements like constraints, the little fellows which are, among others, the biggest reason why the relationship model is flawed as it is today.

Constraints are used to define relationships (foreign key constraints) or to define uniqueness of an attribute in an entity (unique constraint) or to define custom logic (check constraints). All of these constraints define relationships: the foreign key constraint defines a relationship between 1 or more attributes of one entity and 1 or more attributes of another entity. The unique constraint defines a 1:1 (one-to-one) relationship between the attribute(s) with the unique constraint and the attribute(s) which are used to identify an entity (the primary key). Check constraints define semantic relationships between attributes of one or more entities based on logic defined in the check constraint (f.e. an attribute can only hold positive numbers (f.e. an account) if another attribute of perhaps another identity is set to the value of 'true'), which means these two attributes are related. As if that wasn't enough, there are more types or relationships: the 1:n (one-to-many) relationship defined between the attributes in an entity and the attribute(s) used to identify the entity, or the semantically m:n relationship which spans 3 tables, f.e. Employee, EmployeeTasks and Tasks (Employee can have more than 1 task, task can be assigned to more than 1 employee).

The flawed relationship system
The problem with all these relationships is that all of them are defined differently, in sometimes horrible ways. To illustrate this, ask yourself this: the entities 'Customer' and 'Order' have both an attribute 'CustomerID': in Customer this is the identifying attribute, in Order this is an attribute which holds a foreign key constraint which points to the CustomerID attribute of Customer. Order thus holds a piece of information borrowed from the entity Customer. Are Customer and Order related as entities directly? No! The attributes CustomerID in both tables are related. Because these attributes hold a relationship with the other attributes in their group, we can identify entity instances based on the values of these attributes.

"Well, that's great, but what's the problem?". The problem is: Order holds an attribute of Customer but Customer doesn't hold an attribute of Order. The relationship between Customer and Order is defined as a relationship between Order and Customer plus the concept that relationships are bi-directional. If I take the entity 'Customer' I do not know if there is a relationship with 'Order', but there is, it's however defined in another entity. If I do this in the OO world, this would be not that clever: if I hold an object 'Customer' and I hold a list of objects of type 'Order', and to find the objects related to my Customer object, I have to traverse the list of Order objects each time. Goodbye performance!

Another great example of how flawed the model really is, is the following: using the knowledge that a non-identifying attribute which is the foreign key field in a foreign key constraint forms a 1:n relationship with the identifying attribute which forms the primary key field in a foreign key constraint, how is it possible that when I add an unique constraint to the foreign key attribute, the relationship changes from 1:n to 1:1? So to identify a relationship I need 2 constraints at least, and if I'm really lucky 3 (a nice, good maintainable, easy to locate check constraint (that was sarcasm, folks)). WIthout the unique constraint, the foreign key constraint defines a relation between one entity and a set of entities, and with the unique constraint, the foreign key constraint defines a relation between one entity and another entity. That's a big difference.

To define a m:n (many-to-many) relationship, we have to jump through hoops to get that defined in the database: we have to introduce an intermediate table which holds the relationship. Because we learned that entities are stored as tables, which entity is this intermediate table representing? None, if you do not objectify the relationship (add a new, identifying attribute to the attributes forming the m:n relationship). So the relationship between 2 attributes (and thus between 2 entities) is not stored in a combination of constraints between 2 sets of attributes but between 3 sets of attributes, where the two foreign key constraints defined between the three tables are indirectly related to each other (via the relation between the attributes in the intermediate table). Also the relation is stored in an entity, although a vague type of entity, which is odd, all other relations are not stored in an entity of some sort.

Some relations between entities are not even definable in the relational model. Take for example a content management system (CMS). Let's say the CMS has containers located on pages, and each container can have an unlimited amount of items. Each Item is defined with an theoretically unlimited amount of elements, like strings, tables, images etc. To model this in an OO-way, one could opt for the solution to relate the Item class with an abstract Element class (perhaps even using an interface for that). The different element classes for the string element, image element etc, can be derived from the abstract Element class and can be related to the item they're in at runtime, using polymorphism. However how are you going to store this in a relational database, when you consider the entity 'Element' is abstract and can hold a lot of different data (image f.e. holds different data then a bulleted list)? The solution most developers will choose is the one where the Item entity is related to an Element entity which holds a type attribute which identifies the element's true entity: string element, image element etc, and thus which Element table to pick (string element table etc.). However, you can't define relationships between the attributes in a way other relationships are defined, i.e. via foreign key constraints or even check constraints. The relationships are not definable in the relational model and have to be defined in code, because the relationships are defined based on a value of an attribute (the type column) instead of the definition of the attribute itself. Looking at the relational model, you will not see this relation, it's purely a semantic relationship, based on a value in a column.

Still with me? Good. Let's get to the real heart of the problem.

The Problem
If you think about this mess, and you look at how you want to work with entities like 'Customer' and 'Order' in an OO world, like in the .NET world, it won't come as a surprise that these two types of working with the same thing (OO vs. Relational Model) are not interchangeable without serious effort, and thus time, sleepless nights, missed deadlines and angry PHB's. In a world which has worked with entities for a long time, why is the metadata which define the entities, their attributes and the relations between the entities, stored and structured in such a weird way that it is hard to work with? Isn't it more wise to work with entities on each level of the application (application, persistent storage (database)), as we think we should do? I think it is. That's why I think the relational model as it is defined today is seriously flawed and should be restyled in such a way that the 'solutions' to the limits which were present in the days Codd defined the model are removed and new ways of working with the same abstract concepts of 'entities' and 'attributes' are embedded. Relationships are members of the metadata of a complete model and should not be part of an entity itself (like the CustomerID attribute in the Order entity).

A solution?
Is there a solution now? Not really. There are some object-oriented databases, however none of them is very mainstream. Unless database vendors fundamentally change the way entities are stored instead of hiding the flawed model with a layer of sugar, we as developers have to bite the bullet and have to deal with the transformation work which has to be done to work with an Entity 'Customer' on the OO level as also on the relational database level. A lot of work is done today to investigate new ways of dealing with data altogether: O/R mappers to do the transformations for you, post-Relational-Model-databases, etc. However I doubt we'll soon see a mainstream database which handles entities and more important: the relationships between the entities, in a way the developer uses them and thus stores them (perhaps indirectly using a runtime like the CLR)...

That's a big post Frans! :-)

Eric J. Smith - Friday, May 9, 2003 7:32:00 AM

heh :) yeah, it turned out much larger than expected because I thought I had to explain some stuff before diving into what I wanted to say. :)

Frans Bouma - Friday, May 9, 2003 7:40:00 AM

There is a solution to your first woe: denormalize for performance.

What is the difference between maintaining an OrderCount property in a Customers object (or OrderExists, or whatever it is you want to record about the relationship), and providing a field in a Customers table with OrderCount?

If you don't like the performance hit of traversing Orders, then don't. Denormalize for performance.

Your complaint about using intermediate tables: "which entity is this intermediate table representing? None..." shows either a basic misunderstanding of abstraction, or a double-standard, as you don't seem to have a problem with the concept when OO-methods do it.

Yes, a relationship is an entity -- an abstract entity. (Note, "abstract" is used with its common English definition here, not the OO definition.) Not every table which defines a many-to-many relationship entity contains additional fields describing the relationship, and this does not make the relationship's abstract reality a "bad thing" in any way.

Hiding the reality of the relationship entity, or giving it less "value" as a real thing to be considered in design is a mistake.

Then, happily your CMS example discusses the real differences between object and relational data models. For a given situation, one will provide a more efficient representation than the other. And once you commit to one or the other, translating back and forth is a rough boundary to cross, best avoided if possible. These are the points that need to be blogged and taught until everyone knows 'em.

Since your point was not to write about the merits of the relational model, it is understandable that you did not include an example where the storage and retrieval inefficiences of the object model make it a worse choice than the relational model.

The real solution is to understand both approaches well enou

Eli Robillard - Saturday, May 10, 2003 9:36:00 AM

The problem I have is about the very different ways of how relations are physically stored in the database: via foreign key constraints, via combinations of constraints, via check-constraints, via instead of triggers etc. While you can store relations ALWAYS via a separate object which holds the complete information about the relationship, which entities are involved, which attributes participate in the relationship etc. You then eliminate the necessitiy of intermediate tables plus f.e. the necessity of needing f.e. combinations of constraints or the addition of attributes which serve no meaning. Intermediate tables which formulate the m:n relations (like non-obvious intermediate tables like 'Orders' in Northwind which defines the m:n relation between Employees and Customers) can be entities, but are necessary to formulate a relation between two OTHER entities. That's weird. You'd like to define that relationship with only the two entities involved, since these entities have the relationship.

Frans Bouma - Sunday, May 11, 2003 12:58:00 AM

4 Comments