An Introduction to Document Databases

Friday, August 13, 2010

When most people say database, they mean relational database. Edgar Codd defined and coined the term at IBM's Almaden Research Center about 40 years ago. Since that time, relational databases have become the foundation of nearly every enterprise system. However, Internet-scale systems have begun to push the limits of this venerable technology. What has sprung up to fill the need? Various next-generation databases that address some of the following points: being non-relational, distributed, and horizontally scalable. These attributes are characteristics of the "NO SQL" movement. In this case, NO stands for "Not Only". So how many NO SQL databases are there? More than I care to count. But most of them fall into the following categories: Document, Graph, Key/Value, and Tabular/Wide Column.

Document Databases are especially interesting. So what makes them different from the relational model?

A document-oriented database is, unsurprisingly, made up of a series of self-contained documents. This means that all of the data for the document in question is stored in the document itself, not in a related table as it would be in a relational database. In fact, there are no tables, rows, columns or relationships in a document-oriented database at all. This means that they are schema-free; no strict schema needs to be defined in advance of actually using the database. If a document needs to add a new field, it can simply include that field, without adversely affecting other documents in the database. This also means that documents do not have to store empty data values for fields they do not have a value for. [ from Exploring CouchDB ]
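
To make the schema-free idea concrete, here is a minimal Python sketch. It is not the API of CouchDB or any other product: the `store` dict, the `put` and `get` helpers, and the customer documents are all made up for illustration, with a plain dictionary standing in for the database.

```python
import json

# Hypothetical in-memory "database": document id -> JSON text.
# A real document database persists and indexes this; the dict is only a stand-in.
store = {}

def put(doc_id, doc):
    """Serialize a document to JSON and store it under its id."""
    store[doc_id] = json.dumps(doc)

def get(doc_id):
    """Load a stored document back into a Python dict."""
    return json.loads(store[doc_id])

# Two documents of the same kind with different fields. No schema is declared,
# and neither document stores empty placeholders for fields it lacks.
put("customers/1", {"name": "Ada", "email": "ada@example.com"})
put("customers/2", {"name": "Grace", "phones": ["555-0100"],
                    "address": {"city": "Arlington"}})

# Adding a new field touches only the document that needs it.
doc = get("customers/1")
doc["twitter"] = "@ada"
put("customers/1", doc)
```

The second customer is untouched by the change to the first, which is the point of the quote above: no table was altered and no existing data was migrated.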

They have some special characteristics that make them kick some serious SQL.

Objects can be stored as documents: The object-relational impedance mismatch is gone. Just serialize the object model to a document and go.
Documents can be complex: Entire object models can be read & written at once. No need to perform a series of insert statements or create complex stored procs.
Documents are independent: This improves performance and reduces concurrency side effects.
Open Formats: Documents are described using JSON or XML or derivatives. Clean & self-describing.
Schema free: Strict schemas are great, until they change. Schema free gives flexibility for an evolving system without forcing the existing data to be restructured.
Built-in Versioning: Most document databases support versioning of documents with the flip of a switch.
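
As a rough illustration of the first two points, the sketch below serializes a small object model to a single JSON document and reads it back in one step. The `Order` and `OrderLine` classes and the `to_document` and `from_document` helpers are hypothetical names invented for this example, and it uses only the Python standard library rather than any particular database client.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List

@dataclass
class OrderLine:
    product: str
    quantity: int
    unit_price: float

@dataclass
class Order:
    order_id: str
    customer: str
    lines: List[OrderLine] = field(default_factory=list)

def to_document(order: Order) -> str:
    """Serialize the whole object graph (order plus lines) to one JSON document."""
    return json.dumps(asdict(order))

def from_document(text: str) -> Order:
    """Rebuild the object graph from a single JSON document."""
    data = json.loads(text)
    lines = [OrderLine(**line) for line in data.pop("lines")]
    return Order(lines=lines, **data)

order = Order("orders/1", "Ada",
              [OrderLine("keyboard", 1, 49.0), OrderLine("cable", 2, 5.5)])
document = to_document(order)       # one write instead of several inserts
restored = from_document(document)  # one read instead of a join
```

In a relational design the order and its lines would usually live in two tables and need several statements to save. Here the nested lines simply travel inside the document.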

A few of the top document databases are CouchDB, RavenDB, and MongoDB.

CouchDB is an Apache project created by Damien Katz (built using Erlang) that just reached 1.0. Damien has a background working on Lotus Notes & MySQL.
RavenDB is built using C# and has some interesting extension capabilities using .NET classes. RavenDB was created by Ayende Rahien (the creator of Rhino Mocks & much more).
MongoDB is written in C++ and provides some unique querying capabilities. MongoDB was originally developed by 10gen.

So, where is the best place to use a document database?

The schema-less nature makes it ideal for storing dynamic data, such as CMS and CRM entities, which the end user can usually customize as necessary, or semi-structured data (provided by humans).
Web Related Data, such as user sessions, shopping carts, etc. - Its document-based nature means that you can retrieve and store all the data required to process a request in a single remote call.
Dynamic Entities, such as user-customizable entities, entities with a large number of optional fields, etc. - The schema free nature means that you don't have to fight a relational model to implement it.
Persisted View Models - Instead of recreating the view model from scratch on every request, you can store it in its final form in a document database. That leads to reduced computation, reduced number of remote calls and improved overall performance.
Large Data Sets - The underlying storage mechanism for Raven is known to scale in excess of 1 terabyte (on a single machine), and the non-relational nature of the database makes it trivial to shard the database across multiple machines, something that Raven can do natively.

[ from About RavenDB ]
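
The persisted view model idea can be sketched in a few lines of Python. This is an illustration under assumptions, not RavenDB's API: the `documents` dict stands in for the database, and `build_cart_view`, `load_cart_view`, and the cart contents are hypothetical.

```python
import json

documents = {}   # hypothetical stand-in for a document database
build_count = 0  # counts how often the expensive build actually runs

def build_cart_view(user_id):
    """Pretend this gathers data from several sources; here it assembles a dict."""
    global build_count
    build_count += 1
    items = [{"name": "keyboard", "price": 49.0, "qty": 1},
             {"name": "cable", "price": 5.5, "qty": 2}]
    total = sum(item["price"] * item["qty"] for item in items)
    return {"user": user_id, "items": items, "total": total}

def load_cart_view(user_id):
    """Return the stored view model, building and saving it only if it is missing."""
    key = f"cartviews/{user_id}"
    if key not in documents:
        documents[key] = json.dumps(build_cart_view(user_id))
    return json.loads(documents[key])

first = load_cart_view("ada")   # builds the view model and stores it
second = load_cart_view("ada")  # served from the stored document
```

The second request is a single lookup that returns everything needed to render the page. A real system would also have to refresh or delete the stored document whenever the underlying cart changes, which this sketch leaves out.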

I'd be interested to hear your experiences with document databases. I'll go more into RavenDB in a future post.

UPDATE [1]: Included MongoDB

UPDATE [2]: You may also want to take a look at my book - RavenDB High Performance
