The Shift: how Orchard painlessly shifted to document storage, and how it’ll affect you

Monday, November 4, 2013

ASP.NET CMS Orchard

We’ve known it all along. The storage for Orchard content items would be much more efficient using a document database than a relational one. Orchard content items are composed of parts that serialize naturally into infoset kinds of documents. Storing them as relational data like we’ve done so far was unnatural and requires the data for a single item to span multiple tables, related through 1-1 relationships. This means lots of joins in queries, and a great potential for Select N+1 problems.

Document databases, unfortunately, are still a tough sell in many places that prefer the more familiar relational model. Being able to x-copy Orchard to hosters has also been a basic constraint in the design of Orchard. Combine those with the necessity at the time to run in medium trust, and with license compatibility issues, and you’ll find yourself with very few reasonable choices. So we went, a little reluctantly, for relational SQL stores, with the dream of one day transitioning to document storage.

We have played for a while with the idea of building our own document storage on top of SQL databases, and Sébastien implemented something more than decent along those lines, but we had a better way all along that we didn’t notice until recently… In Orchard, there are fields, which are named properties that you can add dynamically to a content part. Because they are so dynamic, we have been storing them as XML into a column on the main content item table. This infoset storage and its associated API are fairly generic, but were only used for fields. The breakthrough was when Sébastien realized how this existing storage could give us the advantages of document storage with minimal changes, while continuing to use relational databases as the substrate.

public bool CommercialPrices {
    get { return this.Retrieve(p => p.CommercialPrices); }
    set { this.Store(p => p.CommercialPrices, value); }
}

This code is very compact and efficient because the API can infer from the expression what the type and name of the property are. It is then able to do the proper conversions for you. For this code to work in a content part, there is no need for a record at all. This is particularly nice for site settings: one query on one table and you get everything you need.

This shows how the existing infoset solves the data storage problem, but you still need to query. Well, for those properties that need to be filtered and sorted on, you can still use the current record-based relational system. This of course continues to work. We do however provide APIs that make it trivial to store into both record properties and the infoset storage in one operation:

public double Price {
    get { return Retrieve(r => r.Price); }
    set { Store(r => r.Price, value); }
}

This code looks strikingly similar to the non-record case above. The difference is that it will manage both the infoset and the record-based storages. The call to the Store method will send the data in both places, keeping them in sync.

The call to the Retrieve method does something even cooler: if the property you’re looking for exists in the infoset, it will return it, but if it doesn’t, it will automatically look into the record for it. And if that wasn’t cool enough, it will take that value from the record and store it into the infoset for the next time it’s required. This means that your data will start automagically migrating to infoset storage just by virtue of using the code above instead of the usual:

public double Price {
    get { return Record.Price; }
    set { Record.Price = value; }
}

As your users browse the site, it will get faster and faster as Select N+1 issues will optimize themselves away. If you preferred, you could still have explicit migration code, but it really shouldn’t be necessary most of the time. If you do already have code using QueryHints to mitigate Select N+1 issues, you might want to reconsider those, as with the new system, you’ll want to avoid joins that you don’t need for filtering or sorting, further optimizing your queries.

There are some rare cases where the storage of the property must be handled differently. Check out this string[] property on SearchSettingsPart for example:

public string[] SearchedFields {
    get { return
        (Retrieve<string>("SearchedFields") ?? "")
            .Split(new[] {',', ' '},
        StringSplitOptions.RemoveEmptyEntries);  }
    set { Store("SearchedFields", String.Join(", ", value));  }
}

The array of strings is transformed by the property accessors into and from a comma-separated list stored in a string. The Retrieve and Store overloads used in this case are lower-level versions that explicitly specify the type and name of the attribute to retrieve or store.

You may be wondering what this means for code or operations that look directly at the database tables instead of going through the new infoset APIs. Even if there is a record, the infoset version of the property will win if it exists, so it is necessary to keep the infoset up-to-date. It’s not very complicated, but definitely something to keep in mind. Here is what a product record looks like in Nwazet.Commerce for example:

And here is the same data in the infoset:

The infoset is stored in Orchard_Framework_ContentItemRecord or Orchard_Framework_ContentItemVersionRecord, depending on whether the content type is versionable or not. A good way to find what you’re looking for is to inspect the record table first, as it’s usually easier to read, and then get the item record of the same id.

Here is the detailed XML document for this product:

<Data>
  <ProductPart Inventory="40" Price="18" Sku="pi-camera-box"
    OutOfStockMessage="" AllowBackOrder="false"
    Weight="0.2" Size="" ShippingCost="null" IsDigital="false" />
  <ProductAttributesPart Attributes="" />
  <AutoroutePart DisplayAlias="camera-box" />
  <TitlePart Title="Nwazet Pi Camera Box" />
  <BodyPart Text="[...]" />
  <CommonPart CreatedUtc="2013-09-10T00:39:00Z"
    PublishedUtc="2013-09-14T01:07:47Z" />
</Data>

The data is neatly organized under each part. It is easy to see how that document is all you need to know about that content item, all in one table. If you want to modify that data directly in the database, you should be careful to do it in both the record table and the infoset in the content item record.

In this configuration, the record is now nothing more than an index, and will only be used for sorting and filtering.

Of course, it’s perfectly fine to mix record-backed properties and record-less properties on the same part. It really depends what you think must be sorted and filtered on. In turn, this potentially simplifies migrations considerably.

So here it is, the great shift of Orchard to document storage, something that Orchard has been designed for all along, and that we were able to implement with a satisfying and surprising economy of resources. Expect this code to make its way into the 1.8 version of Orchard when that’s available.

11 Comments

And what document database engine are you settling for?

Pierre-Alain Vigeant - Tuesday, November 5, 2013 4:11:18 PM

Pierre-Alain: I suppose the post is not making that clear enough, but we're using the same database engines as before. We store documents in there instead of only relational data.

Bertrand Le Roy - Tuesday, November 5, 2013 4:38:15 PM

Sounds interesting. Is this a step along the road to converting Orchard to use a document store?

Dave Gardner - Wednesday, November 6, 2013 12:56:55 AM

Oh that's awesome. So, so awesome. (I think Pierre didn't bother reading the article at all...)

I'm in the process of writing a heap of modules for a project - if I don't need any of the filtering/sorting "features" of using Records, could I update my parts and remove all the Table creation code from Migrations?

Rhys Lloyd - Wednesday, November 6, 2013 3:17:30 AM

@Dave: I recommend you read the post ;)

@Rhys: yes, absolutely, you can remove table creation code, if you don't need the records for querying. If you look at the branch changes, you'll see that's exactly what's been done on all settings parts in core.

Bertrand Le Roy - Wednesday, November 6, 2013 6:29:32 PM

In other words, Orchard just became SharePoint ;P Seriously, I 'get' why this can be a good option, but fear the issues it could bring. Due in part to this sort of DB design, SP has some pretty wonky limits on list/library view sizes (search for SharePoint Boundaries and Limits), and requires developers who go over those limits to jump through some fairly unreasonable hoops. Hopefully the hybrid solution here will be the best of both worlds.

Daniel - Wednesday, November 6, 2013 7:45:03 PM

@Daniel: it changes nothing about the nature or size of what can be stored. Also, it's an option, and the "old" way of doing things will continue to work. So yes, best of both worlds.

Bertrand Le Roy - Wednesday, November 6, 2013 9:36:57 PM

I'd been building a document based solution with Sql Server since 2005, when they released XML data type support with XQuery. It's all started when one of solution for a local hospital reached 100 or so tables they all should be a simple document. Since then we never look back.

I would very much like to share our framework and our experience with the Orchard team, you guys had been doing fantastic on content management.

Briefly this how we store an item in the database

http://i.imgur.com/DgxjOj2.png as XML data type

and this is how we query them, all using custom LINQ provider

http://i.imgur.com/RTiN3mP.png

and with Where clause

http://i.imgur.com/0FM8b7x.png

then this we add/update or delete

http://i.imgur.com/nWSR5rO.png

it's basically built usin Domain Driven Design approach, and all the logic and events are wired with AMQP(we use RabbitMQ/Azure Service Bus) to achieve a simple code.

We use a lot of reflection and dynamic binding in some situation to achieve things like automatic mapping for the columns, at first we were all worried, that it would performed badly, but it wasn't the case, at least not yet. Basically it's all downed to few questions.

Do we need a relational model? not in most cases, while our domain are not relational, why should our database

Can we go NoSQL route, mongodb,raven etc? If we are building something for enterprise, then it's hard to tell them our solution run on this WHAT.. so using SQL Server/Oracle/DB2 is still the only choice.

erymuzuan - Thursday, November 7, 2013 12:33:29 PM

@Bertrand: yes the post is clear as far as it goes, but what I was hoping to elicit from you is how you see the document db options after The Shift (ie after 1.8). Is this the start of the shift or the end of it? Is ravendb back on the agenda? Are other storage options on the horizon?

Can you point to discussions in the forums that I should follow?

Dave Gardner - Friday, November 8, 2013 11:34:11 PM

@Dave: this is it, arguably. I think it cancels the need for a full document DB transition. That could only happen in a fork at this point I'm afraid. Raven has never been an option because of licensing issues. Discussions were mostly at the weekly meetings.

Bertrand Le Roy - Friday, November 8, 2013 11:59:14 PM

Wow! It's good to hear you guys knew all along a document store was ideal for CMS. I always wondered why a relational store was the default choice and figured it was Microsoft pushing SQL Server.

John - Sunday, November 10, 2013 6:14:26 PM

Comments have been disabled for this content.