Frans Bouma's article about why a cache in an O/R mapper doesn't make it fetch data faster.

 
Via Frans Bouma's blog -

Preface
One of the biggest myths in O/R mapper land is about 'caching'. It's often believed that using a cache inside an O/R mapper makes queries much faster and thus makes the O/R mapper more efficient. With that conclusion in hand, every O/R mapper which doesn't use a cache is therefore less efficient than the ones that do, right?

Well... not exactly. In this article I hope to explain that caching in O/R mappers is not there to make queries more efficient, but for uniquing. More on that later on. I hope that at the end of the article, I have convinced the reader that the myth 'caching == more efficiency' is indeed a myth. Beware, it's perhaps a bit complicated here and there; I'll try to explain it in layman's terms as much as possible.

What's a cache?
Before I can explain what a cache is, it's important to understand what an entity is, what an entity instance is, etc. Please consult this article first to learn what's meant by these terms.

A cache is an object store which manages objects so you don't have to re-instantiate them over and over again; you can simply re-use the instance you need from the cache. A cache of an O/R mapper caches entity objects. Pretty simple actually. When an entity is fetched from the persistent storage (i.e. the database), the entity object (i.e. the entity class instance which contains the entity instance (== data)) holding the fetched data is stored in the cache, if it's not there already. What exactly does that mean, "if it's not there already"? It means that no entity object for that particular entity is present in the cache yet.

Caches in O/R mappers are above all used for a concept called uniquing. Uniquing is about having a single entity object for every entity (== data) loaded. This means that if you load the entity of type Customer with PK "CHOPS" from the Northwind database, it gets stored in an entity object, namely an instance of the Customer entity class. What happens if you load the same entity with PK "CHOPS" again into another instance of the Customer entity class? You would end up with two instances of the same class holding the same data. So effectively the objects represent the same entity.

This doesn't have to be a problem. Most actions on entities don't require a unique entity object. After all, they're all mirrors of the real entities in the database, and with a multi-appdomain application (like desktop applications accessing the same database, or a web application running on multiple webservers) you have the chance of having multiple entity objects containing the same entity data anyway.

However, sometimes it can be a problem or an inconvenience. When that happens, it's good that there's a way to have unique objects per entity loaded. Most O/R mappers use a cache for this: when an entity is loaded from the database, the cache is consulted to see if there's already an entity object holding the data of that same entity. If that's the case, that instance is updated with the data read from the database, and that instance is returned as the object holding the data. If there's no object already containing the same entity, a new instance is created, the fetched entity data is stored in that instance, that instance is stored in the cache and the instance is returned. This leads to unique objects per entity.
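
To make the mechanism concrete, here is a minimal sketch of such a uniquing cache. The names (EntityCache, GetUnique, copyFields) are illustrative and not any particular O/R mapper's API; the point is simply one entity object per (entity type, PK value) combination:

    using System;
    using System.Collections.Generic;

    // Minimal uniquing cache: at most one entity object per (entity type, PK value).
    public class EntityCache
    {
        private readonly Dictionary<(Type, object), object> _entities =
            new Dictionary<(Type, object), object>();

        // Returns the unique instance for a freshly fetched entity: either the already
        // cached object (refreshed with the newly read field values) or the new object.
        public TEntity GetUnique<TEntity>(object pkValue, TEntity fetched,
                                          Action<TEntity, TEntity> copyFields)
        {
            var key = (typeof(TEntity), pkValue);
            if (_entities.TryGetValue(key, out var existing))
            {
                var cached = (TEntity)existing;
                copyFields(fetched, cached);   // refresh the cached instance with the db data
                return cached;                 // same object identity as all earlier fetches
            }
            _entities[key] = fetched;          // first time this entity is seen: store it
            return fetched;
        }
    }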

Not all O/R mappers use a cache for uniquing however, or don't call it a 'cache'. You see, a central cache is really a very generalized solution. What if you need a unique entity object within a given semantic context, but outside that context you don't need a unique instance, or you need a different, unique instance? Some O/R mappers, like LLBLGen Pro, use Context objects which provide uniquing for a semantic context, e.g. inside a wizard or an edit form. All entity objects inside that context are stored as unique objects.
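
To illustrate the difference in scope, here is a hedged example which reuses the EntityCache sketch above as a stand-in for such a context object. This is not LLBLGen Pro's actual Context API; the Customer class and the fetchFromDb delegate are hypothetical:

    using System;
    using System.Diagnostics;

    public class Customer
    {
        public string CustomerId = "";
        public string CompanyName = "";
    }

    public static class ContextScopeDemo
    {
        // fetchFromDb stands in for the O/R mapper's normal fetch pipeline.
        public static void EditForm(Func<string, Customer> fetchFromDb)
        {
            var formContext = new EntityCache();   // uniquing scoped to this form only

            var a = formContext.GetUnique("CHOPS", fetchFromDb("CHOPS"),
                                          (src, dst) => dst.CompanyName = src.CompanyName);
            var b = formContext.GetUnique("CHOPS", fetchFromDb("CHOPS"),
                                          (src, dst) => dst.CompanyName = src.CompanyName);

            // Inside this context both variables refer to the same object; another
            // form with its own context would get a different, independent instance.
            Debug.Assert(ReferenceEquals(a, b));
        }
    }

Note that both fetch calls still hit the database; the context only guarantees that they resolve to the same object.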

Caches and queries: more overhead than efficiency
So, when does this efficiency the myth talks about occur exactly? Well, almost never. In fact, using a cache is often less efficient. I said almost, as there are situations where a cache can help, though these are minor or require a lot of concessions. However, I'll discuss them as well so you have a complete picture.

Let's say I have a cache in my O/R mapper and I want to see, in theory, how efficient it might be. So I have my application running for a given period of time, which means that the cache contains a number of entity objects. My application is a CRM application, so at a given time T the user wants to view all customers who have placed at least 5 orders in the last month. This thus leads to a query for all customer entities which have at least 5 related order entities placed in the last month.

What to do? What would be logical and correct? Obviously: fetching the data from the persistent storage, as the entities live there, and only then will we get all known and valid customer entities matching the query. We can consult our cache first, but we'll never know if the entities in the cache are all the entities matching the query: what if there are many more in the database matching the query? So we can't rely on the cache alone; we always have to consult the persistent storage as well.

This thus causes a roundtrip and a query execution on the database. As roundtrips and query executions are a big bottleneck of the complete entity fetch pipeline, the efficiency the myth talks about is nowhere in sight. But it gets worse. With a cache, there's actually more overhead. This is caused by the uniquing feature of a cache. Every entity fetched from the database matching the query for the customers has to be checked against the cache: is there already an instance available? If so, update the field values and return that instance; if not, create a new instance (which had to be done anyway) and store it in the cache.

So effectively, it has more overhead, as it has to consult the cache for each entity fetched, as well as store all new entities in the cache. Storing entities inside a cache is mostly done with hash values calculated from the PK values, stored per type. As hash values can result in duplicates (it's just an int32 in most cases) and compound PKs can complicate the calculation process, it's not that straightforward to make the entity lookup process very efficient.
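
To illustrate why this lookup isn't free, here is a rough sketch of a compound-PK cache key (the type and names are hypothetical). The hash value only narrows the search down to a bucket; because different keys can produce the same int32 hash, every PK field still has to be compared before a cache hit can be declared:

    using System;

    // Cache key for an entity with a (possibly compound) primary key.
    public struct EntityPkKey : IEquatable<EntityPkKey>
    {
        private readonly Type _entityType;
        private readonly object[] _pkValues;

        public EntityPkKey(Type entityType, params object[] pkValues)
        {
            _entityType = entityType;
            _pkValues = pkValues;
        }

        public bool Equals(EntityPkKey other)
        {
            // A matching hash is not enough: compare the type and every PK field.
            if (_entityType != other._entityType || _pkValues.Length != other._pkValues.Length)
            {
                return false;
            }
            for (int i = 0; i < _pkValues.Length; i++)
            {
                if (!Equals(_pkValues[i], other._pkValues[i]))
                {
                    return false;
                }
            }
            return true;
        }

        public override bool Equals(object obj) => obj is EntityPkKey key && Equals(key);

        public override int GetHashCode()
        {
            // Fold the type and all PK fields into a single int32; collisions are possible.
            int hash = _entityType.GetHashCode();
            foreach (var value in _pkValues)
            {
                hash = unchecked(hash * 31 + (value == null ? 0 : value.GetHashCode()));
            }
            return hash;
        }
    }

A dictionary keyed on something like this gives the per-type, per-PK lookup described above; the extra hashing and comparing is overhead paid for every row fetched.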

Some tricks can help... a bit and for a price
Before I get burned down to the ground, let me say that there are some tricks to speed things up a bit. I have to say "a bit" because it comes at a high price: you have to store a version field in every entity and the O/R mapper of choice must support that version field. This means that you have no freedom over what the entities or your datamodel look like. This is a high price to pay, but perhaps it's something you don't care about.

The trick is that instead of returning the full resultset in the first query execution, only the PK values and the version values are returned for every entity matching the query. By checking the cache, you use the version value to see if an entity has been changed in the db since it was fetched and stored in the cache. If it's not changed, you don't have to fetch the full entity, as the data is already in the cache. If it is changed, or not in the cache at all, you have to fetch the full entity. So you then fetch all entities matching the PKs collected from the cache investigation.

The advantage of this is that the second query might be very quick. It can however also bomb: what if you're using Oracle and you have 3000 customers matching your query? You then can't use a WHERE customerid IN (:a, :b, ...) query, as you'll exceed the limit on the number of parameters you can send in a single query. It also causes a second query run, which might actually add more time than simply doing a single fetch: first the PK-version fetch query has to be run, then the second full fetch query (which might result in fewer rows, but still...).
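
A rough sketch of this two-pass fetch, under the stated assumptions: every entity carries a version field, and the hypothetical queryPkAndVersion and fetchFullByPks delegates stand in for the two real queries:

    using System;
    using System.Collections.Generic;

    public static class VersionedFetchSketch
    {
        // cache maps PK -> (version, entity object) for entities fetched earlier.
        public static List<TEntity> Fetch<TEntity>(
            Func<IEnumerable<(object Pk, long Version)>> queryPkAndVersion,
            Func<IReadOnlyCollection<object>, IEnumerable<(object Pk, long Version, TEntity Entity)>> fetchFullByPks,
            Dictionary<object, (long Version, TEntity Entity)> cache)
        {
            var results = new List<TEntity>();
            var pksToFetch = new List<object>();

            // First roundtrip: only PK and version for every row matching the query.
            foreach (var (pk, version) in queryPkAndVersion())
            {
                if (cache.TryGetValue(pk, out var cached) && cached.Version == version)
                {
                    results.Add(cached.Entity);   // unchanged since the last fetch: reuse it
                }
                else
                {
                    pksToFetch.Add(pk);           // missing or stale: fetch the full row
                }
            }

            // Second roundtrip: full rows for the stale/missing entities only. This is
            // where a large IN (...) clause can hit the database's parameter limit.
            if (pksToFetch.Count > 0)
            {
                foreach (var (pk, version, entity) in fetchFullByPks(pksToFetch))
                {
                    cache[pk] = (version, entity);
                    results.Add(entity);
                }
            }
            return results;
        }
    }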

You might wonder: what if I control all access to the database? Then I know when an entity is saved, and thus can keep track of which entities are changed and when, and thus can make assumptions based on that info about whether an entity is updated or not! Well, that's true, but it doesn't scale very well. Unlike Java, .NET doesn't have a proper cross-appdomain object awareness system. This means that if you have even two systems targeting the same database (webfarm, multiple desktops), you can't use this anymore. And even if you're in the situation where it could help (single appdomain targeting a single database), it still takes time to get things up to speed: until all entities of a given type are inside the cache, you still have to fetch from the database.

Cache and single entity fetches
There is one situation where a cache could help and be more efficient. That is: if you know or assume the data you might find in the cache is 'up to date' enough to be used. That situation is single entity fetches using a PK value. Say the user of our previously mentioned CRM application wants to see the details of the customer with PK value "CHOPS". So all that should happen is an entity fetch using the PK value "CHOPS". Consulting the cache, it appears that the entity with the PK value "CHOPS" is already loaded and available in the cache! Aren't we lucky today!

Again, what to do? What's logical in this situation? Pick the one from the cache, or consult the database and run the risk of fetching the exact same entity data as is already contained in the entity object in the cache? I'd say: go to the database. You only know for sure you're working with the right data by consulting the persistent storage. If you pick the one from the cache, you run the risk that another user has updated the customer data, and you're working with outdated, actually wrong data which could lead to wrong conclusions, while you would have drawn the right conclusions had you looked at the right data. If you pick the one from the database, you run the risk of burning a few cycles which turn out to be unnecessary. A trick here could be that if the entity in the cache was fetched less than X seconds ago, it's still considered 'fresh enough', as all data outside the database is stale the moment it's read anyway. But if correctness is required, you can't be more sure than by reading from the database and bypassing the cache.
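
A minimal sketch of that trade-off, with a hypothetical maxAge window and a fetchFromDb delegate standing in for the real single-entity fetch (passing TimeSpan.Zero always consults the database, which is the correctness-first choice):

    using System;
    using System.Collections.Generic;

    public class SingleEntityCache<TEntity>
    {
        private readonly Dictionary<object, (TEntity Entity, DateTime FetchedAtUtc)> _byPk =
            new Dictionary<object, (TEntity Entity, DateTime FetchedAtUtc)>();

        public TEntity GetByPk(object pk, TimeSpan maxAge, Func<object, TEntity> fetchFromDb)
        {
            if (_byPk.TryGetValue(pk, out var cached) &&
                DateTime.UtcNow - cached.FetchedAtUtc <= maxAge)
            {
                return cached.Entity;              // accepted as 'fresh enough', possibly stale
            }

            var entity = fetchFromDb(pk);          // correctness: read from the database
            _byPk[pk] = (entity, DateTime.UtcNow);
            return entity;
        }
    }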

So what's left of this efficiency?
Well, not much. We've seen that it actually adds more overhead than efficiency, so it's even less efficient than not using a cache. We've seen that it could be solved in a way which could lead to more efficiency, but only in a small group of situations, and it requires concessions you're likely not willing to make. We've also seen that a cache could be more efficient for single-entity fetches, but only if you're willing to sacrifice correctness, or if you're sure the data in your cache is valid enough.

Are caches then completely bogus? Why do O/R mappers often have a cache?
They're not bogus, they're just used for a different feature than efficiency: uniquing. Some O/R mappers even use the cache for tracking changes on entity objects (and thus the entities inside these objects). Claiming that the cache makes the O/R mapper more efficient is simply sending the wrong message: it could be a bit more efficient in a small group of situations, and it often adds more overhead than efficiency.

I hope that with this article people stop spreading this myth and realize that caches (or contexts, or whatever they're called in the O/R mapper at hand) in O/R mappers are used for uniquing, and not for object fetch efficiency.

 

8 Comments

  • Would you please be so kind as to simply LINK to my blog instead of copying the complete article? Thanks.

  • Frans doesn't have comments or any other kind of contact information on his blog so I can only reply here.

    I maintain an O/R mapping tool that's used mostly within my own company at this point. But I found this headline very surprising, because a couple of months ago I implemented caching in my tool specifically for performance purposes and got an orders-of-magnitude boost from it.

    It works for three reasons: (1) My code is the only consumer of the database in question; I can for all practical purposes guarantee that if my code didn't do any updates, the database hasn't been updated, (2) the database is read much more than it's written (it's a CMS, the site gets viewed far more often than updated) and (3) the semantics of my O/R tool specifically *don't* guarantee uniqueness. They guarantee the exact opposite, in fact: that if you call GetById() on the same table with the same ID twice in a row, you'll get two separate and independent objects representing the same row.

    So I can cache the results of every query in an appropriately-sized per-query LRU cache, discard the cache entirely any time any record in that table gets updated, and return results by cloning the record in the cache if present.

    I can say from direct practical experience that the speed boost (and network load boost) from doing this was phenomenal.

    The main reason for this is that IMO a good O/R mapping tool makes DB access so convenient and transparent that it's easy for developers not to realize that they're doing an awful lot of it. Code like this:

    foreach (CMSPage child in CMSPage.GetById(pageId).Children) {
        foreach (Panel panel in child.Surface.Panels) {
            doSomethingWith(panel.PanelType.SourceFile);
        }
    }

    results in one database hit per panel (to get the PanelType), plus two per child (to get the Surface, then the list of Panels), plus one to get the page's children, as well as the one that's explicit (GetById).

    On the other hand, with well-chosen caches on each of those queries, every single one of these database hits is likely to be eliminated.
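
    A stripped-down sketch of that per-query caching (the LRU bounding and the clone-on-return step are omitted, and all names are illustrative rather than actual API):

    using System;
    using System.Collections.Generic;

    // One cache per query ("GetById", "GetByParentId", ...) for a single table. Any
    // insert, update or delete on that table throws away every cache registered for it.
    public class TableQueryCaches
    {
        private readonly Dictionary<string, Dictionary<object, object>> _perQuery =
            new Dictionary<string, Dictionary<object, object>>();

        public TResult GetOrFetch<TResult>(string queryName, object key, Func<object, TResult> runQuery)
        {
            if (!_perQuery.TryGetValue(queryName, out var cache))
            {
                cache = new Dictionary<object, object>();
                _perQuery[queryName] = cache;
            }
            if (!cache.TryGetValue(key, out var result))
            {
                result = runQuery(key);   // cache miss: hit the database once
                cache[key] = result;      // a real implementation would bound this (LRU)
            }
            return (TResult)result;       // a real implementation would return a clone
        }

        // Called on every insert/update/delete against the table (and on tables that
        // cascade-delete off it): correctness comes from discarding everything.
        public void InvalidateAll() => _perQuery.Clear();
    }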

  • Stuart,

    Your situation works because you meet the criteria for when caching is actually useful.

    From the POV of API development, the "silver bullet" is to come up with a general-purpose caching mechanism that works with most situations. What's become apparent is that the gains from a cache are realized only within a narrow range of situations.

  • hey foobar, I beg to differ:

    it is Frans who narrowed the definition of 'cache' to fit his argument...

  • foo: no, I didn't narrow the definition of 'cache'; how would you define 'cache' then?

    Foobar: exactly.

    Stuart: you miss the point. The problem is with a general query which fetches a set. You can't know whether all data matching the query is in the cache unless you consult the db. This is especially hard because the set fetch query, for example, simply uses a couple of criteria which don't refer to PK values.

    Of course, if you just fetch single entities and you know the PK, you can consult the cache and get it from the cache IF you know the entity isn't updated, and in your situation you're able to do that. Which is a situation I discussed as one in which caching could work. However, it's a narrow situation. Your query example suggests (but it might be different) that you fetch individual entities and work with these individual entities. In that case, you fall into the situation I described. Though if you want all panels on a page X, how would you do that with your cache? You can't know if all panels on page X are in the cache, unless you consult the db and run SELECT ... FROM Panel WHERE PageID = @p (lame query, but you get the idea) or something similar.

  • Frans, I didn't miss the point, and like I said, I've implemented this while still preserving correctness and getting a verifiable, dramatic boost in performance. (My code does a LOT of DB access; I was able to reduce the number of DB hits by a factor of hundreds, and the only bug that's been verifiably a result of caching was that I initially forgot about cascading-delete foreign keys - since fixed, so that's also covered automatically.)

    In other words as long as the rule that only my code does DB access is preserved, the caching gives a massive perf improvement and guarantees correct results.

    The reason for this is that I don't cache just single entities, and I don't make any attempt whatsoever to search the cache for items that match the criteria of a particular query. I cache *the results of queries*. This may mean that the same entity is in the cache several times, once for the query based on its PK, once for a query based on some other unique field, and once as part of a set that's cached for some other query. But since I *don't* guarantee uniquing and the semantics are specifically that the result of a query should reflect the contents of the database at the time the query is made, that's not a problem.

    To give a concrete example: Suppose I have a CMSPage.GetById (returning a CMSPage) and a CMSPage.GetByParentId (returning a List<CMSPage>). I can do CMSPage.GetById on every page in the entire site, and those results will be cached (up to the limit on the LRU cache size for that query) but only for future GetById calls. If I do a GetByParentId, it will NOT attempt to trawl through the GetById cache for pages where the parent ID matches. It'll only look at its own cache to see if GetByParentId has been called before with the same parent ID value, and return the results if so.

    Also, to ensure correctness, *every single time* an update, insert or delete is done on ANY CMSPage record, ALL caches related to ALL gets on CMSPage are cleared (along with those of any table that might cascade-delete off it, if it's a delete). Like I said, the perf improvement is greatest because reads vastly outnumber writes for my use case.

    All you're really saying is that in order for caching to give a boost in performance it must be done correctly and the situation must be one that can benefit. Which isn't really saying very much ;) Although it's a fair point that there are a lot of situations that *aren't* ones that can benefit. And depending on the semantics of your O/R tool's API, it may not be possible to do the same tricks I did to guarantee correctness per the rules of that API.

  • "In other words as long as the rule that only my code does DB access is preserved, the caching gives a massive perf improvement and guarantees correct results."
    Also in webapps in a webfarm? An app on multiple desktops targeting a central server? No way, simply because appdomain A is on a different machine than appdomain B. So what's 'reality' for A isn't reality for B. A can assume something isn't changed, while B could have changed it.

    "All you're really saying is that in order for caching to give a boost in performance it must be done correctly and the situation must be one that can benefit. Which isn't really saying very much ;)"
    No, the point of my article is that the claim "a cache boosts performance" isn't one you can make in a generic way. You have to precisely define the situation your app is in; otherwise it's a moot claim. You claim I'm wrong, and I definitely ain't, simply because your narrow situation in which caching does work doesn't scale up: as soon as you use a webfarm, your caching scheme no longer works.

    Stuart Carnie: 'if you always use your O/R mapper framework'. That's an 'if' which can have deep consequences. What if there's a trigger somewhere to create reporting records? What if there's another app targeting the db which can't use your O/R mapper layer? What if you're using multiple machines targeting the db?

    A network-based cache seems nice, but is it really? You need a remoting boundary to get the object into your appdomain. Furthermore, you need to replicate the security you have in the real RDBMS. It then comes down to: is it really more efficient than the query cache in the RDBMS? I doubt it, due to the overhead it creates.

    Caching only works at the far ends of the pipeline: either at the end of the processing pipeline (i.e. cache pages, webcontrols or processing output) or at the beginning (i.e. the cache in the RDBMS, which is already there).

  • "No, the point of my article is that claiming a cache boosts performance isn't a term you can use in a generic way."

    I'm not disputing that. I'm just saying that claiming that a cache *never* boosts performance is equally incorrect.

    "You have to precisely define what's the situation your app is in, it's otherwise a moot claim."

    Yes. Generalizations are bad, mmmkay. We're agreeing here :)

    "You claim I'm wrong, and I definitely ain't"

    Your article is titled "Why a cache in an O/R mapper doesn't make it fetch data faster". Not "... doesn't *necessarily* make it faster", or "... doesn't make it faster in the general case", either of which would be true statements that I wouldn't be disputing.

    By all means point out that caching isn't a panacea and that there are many cases it doesn't apply to, but don't claim that it's always pointless. We were having severe performance issues on one particular site; caching eliminated those entirely with no code changes outside the O/R tool (except to tune the size of the caches for particular queries). The change in the load on the DB server was spectacular.

    "simply because your narrow situation in which caching does work doesn't scale up: as soon as you use a webfarm, your caching scheme isn't working anymore."

    Fortunately I'm not using a webfarm. I do think there are ways to get many of the same benefits even in a webfarm though, as long as updates are sufficiently rare. Have a single table storing a version number; increment this at the end of every transaction that included any updates; and at the *beginning* of every transaction, compare this version number to the last known value. If it doesn't match, discard your entire cache.
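
    A rough sketch of that scheme, assuming hypothetical readGlobalVersion and incrementGlobalVersion delegates that read and bump the single-row version table:

    using System;

    public class GlobalVersionGuard
    {
        private long _lastKnownVersion = -1;

        // At the start of every transaction: if another node bumped the shared version
        // since we last looked, throw the whole local cache away.
        public void OnTransactionStart(Func<long> readGlobalVersion, Action discardEntireCache)
        {
            long current = readGlobalVersion();
            if (current != _lastKnownVersion)
            {
                discardEntireCache();
                _lastKnownVersion = current;
            }
        }

        // At the end of every transaction that did any updates: increment the shared
        // version number (inside that same transaction) and remember the new value.
        public void OnUpdatingTransactionEnd(Func<long> incrementGlobalVersion)
        {
            _lastKnownVersion = incrementGlobalVersion();
        }
    }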

    Stuart.
