re: Create benchmarks and results that have value
Kelly Sommers wrote a blog post called 'Create benchmarks and results that have value' in which she refers to my last ORM benchmark post and basically calls it a very bad benchmark because it runs very few iterations (10) and only mentions averages (it doesn't: the raw results are available and referred to in the post I made). I'd like to address some of that, because it now looks as if what I posted is a large pile of drivel.
The core intent of the fetch performance test is to see which code is faster at materializing objects: which framework offers a fast fetch pipeline and which one doesn't, without any restrictions. There's no restriction on memory, on the number of cores used, on whether a framework utilizes magic unicorns or whether a given feature has to be present; everything is allowed, as the test simply runs a fetch query the way a developer would. If a framework spawns multiple threads to materialize objects, or uses a lot of static tables to fetch the same set faster on a second attempt (so memory spikes), that goes unnoticed. That's OK for this test. It's not a scientific benchmark set up in a lab on dedicated hardware and run for weeks; it's a test run on my developer workstation and my server. I never intended it to be anything other than an indication.
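To make that scope concrete: each iteration of a set-fetch test boils down to something like the sketch below. This is a simplified illustration, not the actual benchmark code from the github repository; the fetchAll delegate stands in for whichever fetch API the framework under test offers.

    using System;
    using System.Collections.Generic;
    using System.Diagnostics;

    static class FetchIteration
    {
        // Time one set fetch: the ORM executes the SELECT and materializes every
        // row into an entity object; the elapsed time is what the test records.
        public static long TimeFetch(Func<IList<object>> fetchAll)
        {
            var sw = Stopwatch.StartNew();
            var entities = fetchAll();
            sw.Stop();
            Console.WriteLine($"{entities.Count} entities fetched in {sw.ElapsedMilliseconds} ms");
            return sw.ElapsedMilliseconds;
        }
    }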
The thing is that what I posted, which uses publicly available code, is an indication, not a scientific benchmark for a paper. The sole idea behind it is that if code run 10 times is terribly slow, it will be slow when run 100 times too. That's a reasonable assumption, as none of the frameworks has self-tuning code (no .NET ORM does), nor does the CLR self-tune at runtime the way the JVM does. The point wasn't to measure exactly how much slower framework X is compared to framework Y, but which ones are slower than others: are the ones you think are fast really living up to that expectation? That's all, nothing more.
That Entity Framework and NHibernate are slower than Linq to SQL and LLBLGen Pro in set fetches is clear. Exactly by how much isn't determinable from the test results I posted, as it depends on the setup. In my situation they're more than 5-10 times slower; on another system with a different network setup the differences might be less extreme, but they will still be there, as will the difference between the handwritten materializer and the full ORMs.
The core reason for that is that the code used in the fetch pipelines is less optimal in some frameworks than in others, or that some frameworks perform more work with every materialization step than others. For example, for each row read from the DB, LLBLGen Pro goes through its Dependency Injection framework, checks authorizers, auditors, validators etc. to see whether the fetch of that particular row is allowed, and converts values to different types if necessary: all features which can't be skipped because the developer relies on them, but which each add a tiny bit of overhead to every row.
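To illustrate where that per-row overhead comes from, here's a purely hypothetical sketch of a full-ORM materialization step. It is not actual LLBLGen Pro code and all helper names are made up, but it shows the kind of hooks that run for every single row:

    using System.Data;

    // Illustrative only: a full ORM pays a small price per row because every
    // materialized entity passes through the framework's extension points.
    sealed class EntityMaterializer<TEntity> where TEntity : new()
    {
        public TEntity Materialize(IDataRecord row)
        {
            var entity = ResolveViaDependencyInjection();     // DI lookup per entity
            if (!AuthorizerAllowsFetch(entity, row))          // authorizer check per row
                return default;
            for (int i = 0; i < row.FieldCount; i++)
            {
                var value = ConvertIfNeeded(row.GetValue(i)); // type conversion per field
                SetField(entity, i, value);
            }
            RunValidatorsAndAuditors(entity);                 // validators/auditors per row
            return entity;
        }

        // Placeholders for the real framework machinery.
        TEntity ResolveViaDependencyInjection() => new TEntity();
        bool AuthorizerAllowsFetch(TEntity e, IDataRecord row) => true;
        object ConvertIfNeeded(object value) => value;
        void SetField(TEntity e, int index, object value) { /* assign the i-th field */ }
        void RunValidatorsAndAuditors(TEntity e) { /* validation / auditing hooks */ }
    }

A micro-ORM skips most of these steps, which is exactly why the results distinguish between the two categories.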
Fair or unfair? Because of this, my ORM will never be faster than Linq to SQL: Linq to SQL doesn't do all that and has a highly optimized fetch pipeline. Even if I pull all the tricks I know (and I already do), my pipeline will be highly optimal but will still carry the overhead of the features I support. Looking solely at the numbers, even if they come from a cleanroom test with many iterations, doesn't give you the full picture. This is also why the results distinguish between micro-ORMs and full ORMs, and between change-tracked fetches and readonly fetches, and why, for example, no memory measurement is given (I also couldn't do that reliably from code). Again, this isn't a scientific paper; it's an indication with a strictly defined, limited scope.
The averages
I posted averages, but contrary to what Kelly states, I also posted the full results. The averages were posted as an easier way to illustrate the point of the whole post, namely an indication. They were, in my opinion, acceptable because the results are practically equal across the board: one iteration is a couple of ms slower than another, but the small differences in milliseconds between iterations are irrelevant for the point of the test. 6686ms for an iteration is slower than 559ms, averaged or not. If there had been spikes, averages would have been bad; I completely agree there. One problem I had with the previous code, for example, was the individual fetch tests, which back then were run on a VM. These fetches flooded the VM so much that the results were completely unreliable, with high spikes. Averaging those would be stupid, so I made the code wait in such a way that a GC was forced and VM port flooding was avoided.
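Forcing a full collection between iterations takes only a handful of lines; something along these lines (the actual code on github may differ in detail):

    using System;

    static class BenchUtils
    {
        // Force a full collection (and let finalizers run) between iterations so
        // garbage left over from a previous run doesn't skew the next timing.
        public static void ForceFullGc()
        {
            GC.Collect();
            GC.WaitForPendingFinalizers();
            GC.Collect();
        }
    }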
Was the averaging bad? For a scientific benchmark it would indeed be bad, but this isn't a scientific benchmark: it runs just 10 iterations on my dev box, not in a test lab, and it isn't about exact numbers. So I don't think it's particularly bad: the averages given barely differ from the sets they're pulled from (the standard deviation is extremely low), so they can be used as a rough indication of whether a framework is particularly slow compared to the others. Which was precisely the point of the test, nothing more.
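For the curious, the average and standard deviation over the iteration timings amount to nothing more than the sketch below (not the benchmark code itself, just the math the paragraph above refers to):

    using System;
    using System.Linq;

    static class Stats
    {
        // Mean and standard deviation over the per-iteration timings; a very low
        // standard deviation is what makes reporting the average defensible here.
        public static (double Mean, double StdDev) Summarize(long[] timingsMs)
        {
            double mean = timingsMs.Average();
            double variance = timingsMs.Select(t => (t - mean) * (t - mean)).Average();
            return (mean, Math.Sqrt(variance));
        }
    }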
'But what about her point?'
The general point Kelly tries to make is justified, and I agree with it. The thing is that my indication test isn't something it applies to. I don't have dedicated hardware for this test (which is essential for a scientific benchmark, so you know nothing else interfered), nor do I have time to run a million iterations, as that would take months with the code running 24/7. I don't expect the code to give different results if one does so, by the way, but as it's on github, feel free to try it yourself. The whole point was to give an indication within a very strict scope, and you can verify it with the code yourself.
In closing, I'd like to address a few paragraphs from Kelly's post.
We don’t know what the various code is doing. Benchmarking helps to find what kind of trade-offs exist in the code and how they impact workload. 10 iterations is not enough to get anything from.
A million iterations (which, by the way, would take more than 67 days for Entity Framework alone) won't give you any of that either: you still wouldn't know what the ORM is doing under the surface that makes it slower than another one. Linq to SQL would still be faster; Entity Framework would still be dead last. That's simply because there are no moving parts here: after the first iteration everything is fixed. No self-optimizing runtime, no optimizer, nothing; the exact same steps are taken by every framework on every iteration.
If we are benchmarking something running on the CLR, JVM or Go, when do I care about how fast my ORM is 10 times without the GC? Never. I always care about the GC. The GC is there, and this test is too fast to get it involved. I have no idea if any ORM’s in this test are creating a really bad GC collect situation that will result in major performance drops.
This is true, but it's outside the scope of the test, which is strictly defined; I did that deliberately to avoid exactly the kind of concerns in the quote above. No memory measurement is done with the fetch test. It would have been interesting, e.g. to illustrate the string caching system LLBLGen Pro uses to avoid large amounts of redundant string instances in memory, but alas, it's outside the scope of the test (it does eat some performance though). The GC might have a harder time with the garbage left behind by framework X than with the garbage left behind by framework Y. The test was deliberately meant to measure how efficient the fetch pipeline of a given framework is; the GC therefore isn't taken into account, and is actually forced to run outside the measured code so it doesn't skew the results.
There are a lot of things not visible in a test this short. I don’t know if any ORM’s eat my RAM for lunch, or if some are extremely CPU intensive. It’s just too short.
It's too short for what Kelly thinks this test should illustrate, but that wasn't the intention of the test at all. Memory usage measurement, for example, can only be done on a dedicated test setup (dedicated server, network and client for this benchmark), where a single ORM is tested for days. The code I wrote doesn't work for that, as multiple ORMs run side by side: if the GC is allowed to run, one framework can ruin the GC generations for the others, which skews the results. A dedicated setup is also required because each ORM has to run separately, and to avoid different test situations the setup must be identical for all tests.
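If someone does want a rough memory probe in such an isolated, single-ORM run, something like the sketch below would be a starting point. This is my assumption about how one could approach it, not part of the published benchmark:

    using System;
    using System.Diagnostics;

    static class MemoryProbe
    {
        // Rough per-process snapshot; only meaningful when a single ORM runs in
        // its own process on a dedicated machine, as argued above.
        public static void Report()
        {
            long managedBytes = GC.GetTotalMemory(forceFullCollection: false);
            long workingSetBytes = Process.GetCurrentProcess().WorkingSet64;
            Console.WriteLine($"managed heap: {managedBytes / (1024 * 1024)} MB, working set: {workingSetBytes / (1024 * 1024)} MB");
        }
    }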
I have been writing ORMs for many, many years now; I know what tricks others use to get their performance and I also know why some of my code isn't as fast as theirs. This test was a fun (well, it was fun until people started bashing it as if I have no clue) way to see how the different frameworks relate to each other. I picked a strict scope to avoid as much crap as possible, and I deliberately didn't pick a situation in which my own code would be the sole winner, simply because that wasn't the point of the test to begin with. I wanted an indication of how the fetch performance of the various ORMs relates. I still think the results I posted give a proper indication of that, and that was all I wanted to do with this.
As the fun has bled out of it, this will likely be the last post about this topic. I still hope it was helpful to you all.