Why a cache in an O/R mapper doesn't make it fetch data faster.
Preface
One of the biggest myths in O/R mapper land is about
'caching'. It's often believed that using a cache inside an
O/R mapper makes queries much faster and thus makes the O/R
mapper more efficient. With that conclusion in hand, every
O/R mapper which doesn't use a cache must therefore be less
efficient than the ones that do, right?
Well... not exactly. In this article I hope to explain that
caching in O/R mappers is not there for making queries more
efficient, but is there for uniquing. But more on
that later on. I hope that at the end of the article, I have
convinced the reader that the myth
Caching == more efficiency is indeed a myth. Be warned,
it's perhaps a bit complicated here and there; I'll try to
explain it in layman's terms as much as possible.
What's a cache?
Before I can explain what a cache is, it's important to
understand what an entity is, what an entity instance is
etc. Please consult
this article
first to learn what's meant by these terms.
A cache is an object store which manages objects so
you don't have to re-instantiate objects over and over
again, you can just re-use the instance you need from the
cache. A cache of an O/R mapper caches entity objects.
Pretty simple actually. When an entity is fetched from the
persistent storage (i.e. the database), the entity object
(i.e. the entity class instance containing the entity
instance (== data)) holding the fetched data is stored in
the cache, if it's not there already. What exactly does
"if it's not there already" mean? It means that no entity
object holding that same entity's data is present in the
cache yet.
Caches in O/R mappers are above all used for a concept which
is called uniquing. Uniquing is about having a single
entity object for every entity (== data) loaded. This
means that if you load the entity of type Customer and with
PK "CHOPS" from the Northwind database, it gets stored in an
entity object, namely an instance of the Customer
entity class. What happens if you load the same entity with
PK "CHOPS" again in another instance of the Customer
entity class? You would end up with two instances of the
same class, both containing the same data. So effectively
both objects represent the same entity.
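A minimal sketch of this situation, using a hypothetical Customer class and fetch function (not the API of any real O/R mapper): without uniquing, fetching the same entity twice yields two distinct objects holding the same data.

```python
class Customer:
    def __init__(self, customer_id, company_name):
        self.customer_id = customer_id
        self.company_name = company_name

# Stand-in for the database: PK -> row data.
FAKE_DB = {"CHOPS": {"company_name": "Chop-suey Chinese"}}

def fetch_customer(pk):
    # Without uniquing, every fetch materializes a brand-new entity object.
    row = FAKE_DB[pk]
    return Customer(pk, row["company_name"])

a = fetch_customer("CHOPS")
b = fetch_customer("CHOPS")
assert a is not b  # two different instances...
# ...which nevertheless represent the same entity.
assert a.customer_id == b.customer_id and a.company_name == b.company_name
```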
This doesn't have to be a problem. Most actions on entities
don't require a unique entity object. After all,
they're all mirrors of the real entities in the
database, and with a multi-appdomain application (like
multiple desktop applications accessing the same database,
or a webapplication running on multiple webservers) you
have the chance of having multiple entity objects
containing the same entity data anyway.
However sometimes it can be a problem or an inconvenience.
When that happens, it's good that there's a way to have
unique objects per entity loaded. Most O/R mappers
use a cache for this: when an entity is loaded from the
database, the cache is consulted to see whether there's
already an entity object holding the data of the entity
just fetched. If that's the case, that instance is updated with
the data read from the database, and that instance is
returned as the object holding the data. If there's no
object already containing the same entity, a new
instance is created, the entity data fetched is stored in
that instance, that instance is stored in the cache and the
instance is returned. This leads to unique objects
per entity.
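The procedure just described can be sketched as an identity map keyed by (entity type, PK). The names below are illustrative, not taken from any particular O/R mapper:

```python
class IdentityMap:
    def __init__(self):
        self._instances = {}  # (entity type, PK) -> entity object

    def get_or_store(self, entity_type, pk, fresh_data):
        key = (entity_type, pk)
        existing = self._instances.get(key)
        if existing is not None:
            # Already cached: refresh its fields, return the same object.
            existing.__dict__.update(fresh_data)
            return existing
        # Not cached yet: create, store and return a new instance.
        obj = entity_type(**fresh_data)
        self._instances[key] = obj
        return obj

class Customer:
    def __init__(self, customer_id, company_name):
        self.customer_id = customer_id
        self.company_name = company_name

imap = IdentityMap()
first = imap.get_or_store(
    Customer, "CHOPS",
    {"customer_id": "CHOPS", "company_name": "Chop-suey Chinese"})
second = imap.get_or_store(
    Customer, "CHOPS",
    {"customer_id": "CHOPS", "company_name": "Chop-suey Chinese Ltd."})
assert first is second  # uniquing: one object per entity
assert first.company_name == "Chop-suey Chinese Ltd."  # refreshed fields
```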
Not all O/R mappers use a cache for uniquing however, or
they don't call it a 'cache'. You see, a central cache is
really a very generalizing mechanism. What if you need a
unique entity object within a given semantic context, while
outside that context you don't need a unique instance, or
need a different unique instance? Some O/R mappers, like
LLBLGen Pro, use Context objects which provide uniquing for a
semantic context, e.g. inside a wizard or an edit form.
All entities loaded inside that context are stored in
unique entity objects.
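A sketch of such context-scoped uniquing (illustrative names only, not the actual LLBLGen Pro Context API): each context owns its own identity map, so uniqueness holds inside a context but not across contexts.

```python
class Customer:
    def __init__(self, customer_id, company_name):
        self.customer_id = customer_id
        self.company_name = company_name

class Context:
    def __init__(self):
        self._map = {}  # (entity type, PK) -> entity object

    def get(self, entity_type, pk, data):
        key = (entity_type, pk)
        if key not in self._map:
            self._map[key] = entity_type(pk, **data)
        return self._map[key]

edit_form = Context()  # one semantic context, e.g. an edit form
wizard = Context()     # another semantic context, e.g. a wizard

row = {"company_name": "Chop-suey Chinese"}
a = edit_form.get(Customer, "CHOPS", row)
b = edit_form.get(Customer, "CHOPS", row)
c = wizard.get(Customer, "CHOPS", row)

assert a is b      # unique inside the edit form's context
assert a is not c  # a different unique instance inside the wizard's context
```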
Caches and queries: more overhead than efficiency
So, when does this efficiency the myth talks about occur
exactly? Well, almost never. In fact, using a cache
is often less efficient. I said almost, as
there are situations where a cache can help, though these
are minor or require a lot of concessions. I'll discuss
them as well, though, so you have a complete picture.
Let's say I have a cache in my O/R mapper and I want to
see, in theory, how efficient it might be. So I have
my application running for a given period of time, which
means that the cache contains a number of entity objects. My
application is a CRM application, so at a given time T the
user wants to view all customers who have placed at least 5
orders in the last month. This thus leads to a query for all
customer entities which have at least 5 related order
entities which are placed in the last month.
What to do? What would be logical and correct?
Obviously: fetching the data from the persistent storage, as
the entities live there, and only then we'll get all
known and valid customer entities matching the query.
We can consult our cache first, but we'll never know if the
entities in the cache are all the entities matching my
query: what if there are many more in the database, matching
the query? So we can't rely on the cache alone; we always
have to consult the persistent storage as well.
This thus causes a roundtrip and a query execution on
the database. As roundtrips and query executions are a big
bottleneck of the complete entity fetch pipeline, the
efficiency the myth talks about is nowhere in sight. But it
gets worse. With a cache, there's actually
more overhead. This is caused by the uniquing feature
of a cache. Every entity fetched from the database
matching the customer query has to be checked against
the cache: is there already an instance available? If so,
update the field values and return that instance, if not,
create a new instance (but that's to be done anyway) and
store it in the cache.
So effectively, it has more overhead, as it has to consult
the cache for each entity fetched, as well as store all new
entities in the cache. Storing entities inside a cache is
mostly done with hash values calculated from the PK values,
stored per type. As hash values can result in duplicates
(it's just an int32 in most cases) and compound PKs can
complicate the calculation process, it's not that
straightforward to make the lookup process of entities very
efficient.
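The bookkeeping described above can be sketched as a per-type store keyed by the (possibly compound) PK. Python's dict handles hash collisions for us; the point of the sketch is that every fetched row costs an extra hash-and-lookup against this store.

```python
from collections import defaultdict

class EntityCache:
    def __init__(self):
        # type name -> { PK tuple -> entity object }
        self._per_type = defaultdict(dict)

    def lookup(self, type_name, *pk_values):
        # Compound PKs simply become multi-element tuples, e.g. (10248, 11).
        return self._per_type[type_name].get(pk_values)

    def store(self, type_name, entity, *pk_values):
        self._per_type[type_name][pk_values] = entity

cache = EntityCache()
order_detail = {"order_id": 10248, "product_id": 11, "quantity": 12}
cache.store("OrderDetail", order_detail, 10248, 11)  # compound PK
assert cache.lookup("OrderDetail", 10248, 11) is order_detail
assert cache.lookup("OrderDetail", 10248, 42) is None  # not cached
```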
Some tricks can help... a bit and for a price
Before I get burned down to the ground, let me say that
there are some tricks to speed things up a bit. I have to
say "a bit" because it comes at a high price: you have to
store a version field in every entity, and the O/R mapper of
choice must support that version field. This means that
you have no freedom over what the entities look like or what
your datamodel looks like. This is a high price to pay, but
perhaps it's something you don't care about. The trick is
that instead of returning the full resultset in the
first query execution, only the PK values and the
version values are returned for every entity matching the
query. By checking the cache, you use the version value to
see if an entity has been changed in the db since it was
fetched and stored in the cache. If it's not changed, I
don't have to fetch the full entity, as the data is already
in the cache. If it is changed or not in the cache at all,
I've to fetch the full entity. So then I will fetch all
entities matching the PKs I've collected from my cache
investigation. The advantage of this is that the second
query might be very quick. It can however also bomb: what
if you're using Oracle and you have 3000 customers matching
your query? You then can't use a WHERE
customerid IN (:a, :b, ...) query, as you'll be exceeding
the limit of parameters to send in a single query. It also
causes a second query run, which might actually add
more time than simply doing a single fetch: first the
PK-Version fetch query has to be run, then the second full
fetch query (which might result in fewer rows, but still...).
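The two-phase trick above can be sketched as follows. The parameter limit and the query functions are hypothetical stand-ins, not a real database API: the first (cheap) query yields (PK, version) pairs, the cache answers for the fresh ones, and the stale or missing ones are refetched in IN-clause chunks.

```python
MAX_IN_PARAMS = 1000  # assumed per-query parameter budget

def fetch_with_version_check(query_pk_versions, cache, fetch_full_entities):
    """query_pk_versions: list of (pk, version) from the cheap first query.
    cache: {pk: (version, entity)} of previously fetched entities.
    fetch_full_entities: function(pks) -> {pk: entity} (the second query)."""
    results = {}
    stale_pks = []
    for pk, version in query_pk_versions:
        cached = cache.get(pk)
        if cached is not None and cached[0] == version:
            results[pk] = cached[1]  # cache hit: version unchanged in the db
        else:
            stale_pks.append(pk)     # missing or outdated: must refetch
    # Second query, chunked so no single IN (...) exceeds the parameter limit.
    for i in range(0, len(stale_pks), MAX_IN_PARAMS):
        chunk = stale_pks[i:i + MAX_IN_PARAMS]
        for pk, entity in fetch_full_entities(chunk).items():
            results[pk] = entity
    return results

# Usage with fake data: CHOPS is cached at version 3 but the db has version 4,
# and ALFKI isn't cached at all, so both are refetched.
cache = {"CHOPS": (3, {"name": "Chop-suey Chinese"})}
db = {"CHOPS": (4, {"name": "Chop-suey Chinese Ltd."}),
      "ALFKI": (1, {"name": "Alfreds Futterkiste"})}

def fake_fetch(pks):
    return {pk: db[pk][1] for pk in pks}

got = fetch_with_version_check([("CHOPS", 4), ("ALFKI", 1)], cache, fake_fetch)
assert got["CHOPS"]["name"] == "Chop-suey Chinese Ltd."
assert got["ALFKI"]["name"] == "Alfreds Futterkiste"
```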
You might wonder: what if I control all access to the
database? Then I know when an entity is saved, and
thus can keep track of when which entities are changed as
well and thus can make assumptions based on that info
whether an entity is updated or not! Well, that's true, but
that doesn't scale very well. Unlike Java, .NET doesn't have
a proper cross-appdomain object awareness system. This means
that if you have even two systems targeting the same
database (webfarm, multiple desktops), you can't use this
anymore. And even if you're in the situation where it could
help (a single appdomain targeting a single database), it
still takes time to get things up to speed: until all
entities of a given type are inside the cache, you still
have to fetch from the database.
Cache and single entity fetches
There is one situation where a cache could help and
be more efficient. That is: if you know or assume that the
data you might find in the cache is 'up to date' enough to
be used. That situation is with single entity fetches using
a PK value. Say the user of our previously mentioned CRM
application wants to see the details of the customer with PK
value "CHOPS". So all that should happen is an entity fetch
by using the PK value "CHOPS". Consulting the cache, it
appears that the entity with the PK value "CHOPS" is already
loaded and available in the cache! Aren't we lucky today!
Again, what to do? What's logical in this situation? Pick
the one from the cache, or consult the database and run the
risk of fetching the exact same entity data as is already
contained in the entity object in the cache? I'd say: go to
the database. You only know for sure you're working with the
right data by consulting the persistent storage. If you pick
the one from the cache, you might run the risk that another
user has updated the customer data and you're working with
outdated, actually wrong data, which could lead to
wrong conclusions where you would have drawn the right
ones had you looked at the current data. If
you pick the one from the database, you might run the risk
of burning a few cycles which turned out to be unnecessary.
A trick here could be that if the entity in the cache was
fetched X seconds ago, it's still considered 'fresh enough',
as all data outside the database is stale the moment
it's read anyway. But if correctness is required, you can't
be more sure than by reading from the database and bypassing
the cache.
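A sketch of that freshness policy (illustrative names, not any mapper's real API): treat a cached entity as usable if it was fetched less than max_age_seconds ago, otherwise signal that the caller has to hit the database. The clock is injected so the expiry logic is easy to demonstrate.

```python
import time

class FreshnessCache:
    def __init__(self, max_age_seconds, clock=time.monotonic):
        self._max_age = max_age_seconds
        self._clock = clock
        self._entries = {}  # PK -> (fetched_at, entity)

    def store(self, pk, entity):
        self._entries[pk] = (self._clock(), entity)

    def get_if_fresh(self, pk):
        entry = self._entries.get(pk)
        if entry is None:
            return None
        fetched_at, entity = entry
        if self._clock() - fetched_at > self._max_age:
            return None  # stale: caller must go to the database
        return entity

# Usage with a fake clock to demonstrate expiry:
now = [0.0]
cache = FreshnessCache(max_age_seconds=30, clock=lambda: now[0])
cache.store("CHOPS", {"name": "Chop-suey Chinese"})
assert cache.get_if_fresh("CHOPS") is not None  # just fetched: fresh
now[0] += 31
assert cache.get_if_fresh("CHOPS") is None      # older than 30s: stale
```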
So what's left of this efficiency?
Well, not much. We've seen that it actually adds more
overhead than efficiency, so it's even less efficient than
not using a cache. We've seen that it could be solved in a
way which could lead to more efficiency, but only in a small
group of situations, and that requires concessions you're
likely not willing to make. We've also seen that a cache could be
more efficient in single-entity fetches, but only if you're
willing to sacrifice correctness, or if you're sure
the data in your cache is valid enough.
Are caches then completely bogus? Why do O/R mappers
often have a cache?
They're not bogus, they're just used for a different feature
than efficiency: uniquing. Some O/R mappers even use the
cache for tracking changes on entity objects (and thus the
entities inside these objects). Claiming that the cache is
making the O/R mapper more efficient is simply giving the
wrong message: it could be a bit more efficient in a
small group of situations, and is often giving more overhead
than efficiency.
I hope with this article that people stop spreading this myth and realize why caches (or contexts or whatever they're called in the O/R mapper at hand) in O/R mappers are used for uniquing, and not for object fetch efficiency.