Reply to "What ORMs have taught me: just learn SQL"

Tuesday, August 5, 2014

ORM software engineering LLBLGen Pro

This is a reply to "What ORMs have taught me: just learn SQL" by Geoff Wozniak.

I've spent the last 12 years of my life full time writing ORMs and entity modeling systems, so I think I know a thing or two about this topic. I'll briefly address some of the things mentioned in the article.

Reading the article I got the feeling Geoff didn't truly understood the material, what ORMs are meant for and what they're not meant for. It's not the first time I've seen an article like this and I'm convinced it's not the last. That's fine; you'll find a lot of these kind of articles on many frameworks/paradigms/languages etc. in our field. I'd like to add that I don't know Geoff and therefore have to base my conclusions on the article alone.

Re: intro

The reference to the Neward article made me chuckle: sorry to say it but bringing that up always gives me the notion one has little knowledge of what an ORM does and what it doesn't do. An ORM is just a tool to translate between two projections of the same abstract entity model (class and table, which result in instances: object and table row); it doesn't magically make your crappy DB look like one designed by CELKO himself nor does it magically make your 12 level deep, 10K object wide graph persist to tables in a millisecond as if there was just 1 table. Neither will SQL for that matter, but Geoff (and Neward before him) silently ignores that.

An ORM consists of two parts: a low level system which translates between class instances and table rows to transport the entity instances (== the data) back and forth, and a series of sub-systems on top of that to provide entity services (validation, graph persistence, unit of work, lazy / eager loading etc. etc.)

It is not some sort of 'magic connector' which eats object graphs and takes care of transforming those to tabular data of some sort with which you don't want to know anything about. It also isn't a 'magic connector' which reads your insanely crappy relational model into a dense object graph as if you read the objects from memory.

Re: Attribute Creep

He mentions attribute creep (more and more attributes (==columns) per relation (==table)) and FKs in the same section, however I don't think one is related to the other. Having wide tables is a problem but it's a problem regardless of what you're using as a query system. Writing projections on top of an entity model is easy, if your ORM allows you to, but even if it doesn't, the wide tables are a problem of the way the database is set up: they'll be a problem in SQL as well as an ORM.

What struck me as odd was that he has wide tables and also a problem with a lot of joins which sounds like he either has a highly normalized model, which should have resulted to narrow tables, or uses deep inheritance hierarchies. Nevertheless, if a projection requires 14 joins, it requires 14 joins: the data itself isn't obtainable in any other way otherwise it would be doable through the ORM as well (as any major ORM allows you to write a custom projection with joins etc. to obtain the data, materialized in instances of the class type you provide). It's hard to ignore the fact the author might have overlooked easy to use features (which hibernate provides) to overcome the problems he ran into and at the same time it's a bit odd a highly normalized model is the problem of the ORM and won't be a problem when using SQL (which has to work with the same normalized tables)

He says:

Attribute creep and excessive use of foreign keys shows me is that in order to use ORMs effectively, you still need to know SQL. My contention with ORMs is that, if you need to know SQL, just use SQL since it prevents the need to know how non-SQL gets translated to SQL.

I agree with the fact that you still need to know SQL, as you need to formulate the queries in your code in such a way that it leads to more efficient SQL; an ORM can do a bit of optimization but it is almost impossible to do without statistics/data (which are not available at that stage). But you can't conclude from that to 'just use SQL', as that's like recommending to learn to write Java Bytecode because the syntax of Clojure is too hard to grasp. A better conclusion would be to learn the query system better so you can predict the SQL which will be produced.

Re: Data Retrieval

Query performance is always a concern, and anything between code and the actual execution of the DML in the DB is overhead. Hand-optimized SQL might be a good option in some areas, but in the majority of cases queries generated by ORMs are fine, even hibernate's ;). Most ORMs have a query language / system which is derived from SQL to begin with (the mentioned hibernate does: HQL) and it is predictable what SQL it will roughly produce.

Sure, if you create deep inheritance hierarchies over your tables, you might run into a lot of joins, but that's known up front: inheritance isn't free, one knows what it will do at runtime. "Know the tool you're working with". If Geoff was surprised to see a lot of joins because a 14-entity deep inheritance hierarchy was pulled from the DB, he should have known better.

He says:

From what I've seen, unless you have a really simple data model (that is, you never do joins), you will be bending over backwards to figure out how to get an ORM to generate SQL that runs efficiently. Most of the time, it's more obfuscated than actual SQL.

I find this hard to believe with the query systems I've seen and written myself, with one exception: Linq. Linq is a bit different because it has constructs (like GroupBy) which are different in Linq/code than they are in the DB which require a translation of intend from the query to SQL and thus can / will lead to a SQL query which might not be what one would expect when reading the Linq query.

The usage of Window functions and other DB specific features (like query hints) might be something not doable in an ORM query language. There are several solutions to that though, one being creating DB functions which are mapped to code methods so you can execute the constructs inside your query using those methods which will result in using the functions in the SQL query, another being DB Views. They both require actions inside the RDBMS which is less ideal, but if it helps in edge cases, why not? They're equal to adding an index to some fields to speed up data retrieval, or creating a denormalized table because the data is read-only anyway and it saves the system using it a lot of joins.

Re: Dual schema dangers

Here I saw the struggle Geoff had with the concept of ORMs. This isn't uncommon, e.g. Neward (in my opinion) expresses the same struggle in his cited essay. There are two sides with a gap between them: Classes and Table definitions. If you start with classes and try to create table definitions from these it's equal as starting with the table definitions and try to create classes from these: both are the projection result of an abstract entity model and to get one from the other requires reverse engineering the side you start with to the abstract entity model it was the projection of and then projecting that to the side you want to create: starting from classes or table definitions doesn't matter.

I do understand the pain point when you start with either side and have to bridge the gap to the other side: without the abstract entity model as the one true source of truth, it's always a problem when one side changes to update the other side.

Geoff tries to blame this on the ORM but that's not really fair: the ORM is meant to work with both sides (class and tables) at run time, not at design time; it requires a system meant for modeling an abstract entity model to manage both sides, as both sides are the result of that model, not the source of it. (I wrote one, see 'Links to my work' at the top left. I didn't want to pollute this article with references to my work)

Re: Identities

Creating new entity instances which get their PK set by a sequence in the DB are the main cause of the problem if I understand Geoff's description correctly. In memory, these entities have no real ID and referring to them is a bit of a pain, true. But that's related to working with objects in general: any object created is either identified by some sort of ID you give it or its memory location ("the instance itself"). I don't get the struggle with the cache and partial commits: if you want to refer to objects in memory, it's equal to what you would do if they weren't persisted to a DB. That they get IDs in the DB in the case of sequenced PKs is not a problem: the objects get updated after the DB transaction completes. Even hibernate is capable of doing that.

Re: Transactions

This section is a typical description of what happens when you confuse a DB transaction with a business transaction. A business transaction can span more than one DB transaction, might involve several subsystems / services, might even use message queues, might even be parked for a period of time before commit. A DB transaction is more explicit and low-level: you start the transaction, you do work, you commit (or rollback) the transaction and that's it.

Geoffs reference to scope is good, it illustrates that there's a difference between the two and therefore you shouldn't use a DB transaction when you need a business transaction. However it's too bad he misses this himself. Often developers try to implement a business transaction at the level of an ORM by using its unit of work, but it's too low level for that: a business transaction might span several systems and an ORM isn't the right system to control such a transaction; it's meant to control one DB transaction, that's it.

That doesn't mean the ORM shouldn't provide the tools to help a developer write proper business transaction code with the systems controlling the business transaction. After all, the second part of an ORM is 'entity services' and one being 'Unit of work'. Most ORMs follow the Ambler paper and combine a Unit of Work with their central Session or Context object. This leads to the problem that you can't offer a Unit of Work without the central Session or Context object and thus when you actually want a Unit of Work to pass around, collecting work for (a part of) the business transaction, you don't want to deal with a Session / Context object which also controls the DB connection / transaction; it might be that at that level / scope it's not even allowed / possible to do DB oriented work.

It's therefore essential to have an ORM which offers a separate Unit of Work object, which solves this problem. Additionally to that, the developer has to be aware that a business transaction is more than just a DB transaction and should design the code accordingly.

Re: Where do I see myself going

A highly normalized relational model (4+ normal form) which is used to retrieve denormalized sets is not likely to perform well (as the chance of a high number of joins in most queries is significant), no matter what query system you're using. I get the feeling parts of what Geoff ran into is caused by reporting requirements (which often requires denormalized sets of (aggregated) data), parts are caused by inheritance hierarchies (not mentioned but according to the # of joins which were unexpected I think this is the case) and partly caused by poorly designed relational models.

None of those are solved magically if you use SQL instead of HQL or whatever query language you're using in an ORM. Not only is 'SQL' a query language and not a query system, it also doesn't make the core problems go away. Well, perhaps the inheritance one as you can't have inheritance in SQL, but then again, you're not forced to use inheritance in your entity model either.

He says:

By moving away from thinking of the objects in my application as something to be stored in a database (the raison d'être for ORMs) and instead thinking of the database as a (large and complex) data type, I've found working with a database from an application to be much simpler.

Here Geoff illustrates clearly a misconception about ORMs: they're not there to persist object graphs into some magic box in the corner, they're a system to move entity instances(==data) across the gap between two projections of the same abstract entity model. It's no surprise it turns out to be much simpler if you see your DB as part of your application, because it is part of your application. If we ignore the difference in level of abstraction, it's equal to talk to a DB through a REST service as it is to talk to a DB through an ORM which provides you with data: both times you go through an API to work with the entity instances on the other side. The REST service isn't a bucket you throw data in, and neither is the ORM.

Re: conclusion

SQL is a query language, not a query system. It's therefore not an alternative to the functionality provided by an ORM. ORMs make some things, namely the things they're built for, very easy. They make other things, namely the things they're not built for, hard. But the same can be said about any tool, including SQL (if we see a language as a tool): SQL is set oriented, and therefore imperative logic is hard to do, so one shouldn't do imperative logic in SQL. Blaming SQL for being crap in dealing with imperative logic doesn't make it so, it merely shows the person doing the blaming doesn't understand what SQL is meant to do and what it isn't meant to do.

In closing I'd like to not that what's ignored in the article is the optimized SQL ORMs generate with respect to e.g updates and graph fetches (eager loading). Left alone the fact that to execute the SQL query and consume the results, one has to write a system which is the core of any ORM: the low-level query execution system and object materializer.

It always pains me to read an article like Geoff's about a long struggle with ORMs as it's often based on a set of misconceptions what ORMs do and what they don't do. This is partly to blame on some ORM developers (let's not name names) themselves which try to sell the image that an ORM is a magic object graph persister and will turn your RDBMS into an object store. It's also party to blame on the complexity of the systems themselves: you don't simply learn how to use all of the ORM features and quirks overnight.

And sadly, it's also party to blame on the users, the developers using the ORMs, themselves. Suggesting a query language as the answer (and with that the tools that come with it) isn't going to solve anything: the root problem, working with relational data in an OO system, i.e. bridging the cap between class and table definition, still has to be solved, and using SQL and low-level systems to execute it will only move that problem onto your own plate, where you run the risk of re-inventing the wheel, albeit poorly.

Both of you have very valid points. I think the most valid points of all is that both of you had better success using and not using ORMs respectively. Shouldn't that be the critical factor, can you produce useful code for your customer.

My take is not that either one of the camps does not understand the technology, but that technology has flaws. All of it SQL, ORM, languages and programming paradigms. The literature if filled with proofs of useful things that are impossible, we have the halting problem, Gödel's incompleteness theorems, Amdahl's Law, and sooo much more. There are all these disappointing facts that ironically make our profession harder and give us more work to do. We are always looking for local minimums. You found yours and Geoff found his.

There are several discussions that have been settled already. Goto is to be avoided, you can make big applications in a browser. But the ORM discussion does not go away, there has to be a reason that smart people can't agree. I personally think there are flaws in SQL, it does not compose so well. But I do believe that some technology can make our jobs easier, without so much disagreement.

Arturo Hernandez - Tuesday, August 5, 2014 6:56:10 PM

Learning SQL, like Geoff states, would prevent having a 'crappy' table/database design. It comes down to basics - right? You may know how to fix that ultra sexy sink, but when you're building a house yourself, it's important if not necessary to know how to lay a good foundation.

I think you just made his case.

Bob Stamos - Tuesday, August 5, 2014 8:12:08 PM

@Bob: the SQL he referred to is for querying the data in a relational database, not for defining the relational model in the relational database (DDL SQL).

FransBouma - Tuesday, August 5, 2014 8:46:53 PM

@Frans - great points. I've found myself (and my teams) using NHibernate in a very simple way - only using 20% of its feature set to get 80% of the value - just to map from SQL data types to CLR data types - and it produces VERY simple use cases where we can move extremely fast ----- WHILE inspecting every query in SQL Profiler to ensure it's the query that we WANT to be dispatched to the database.

Just like templating engines don't release one from knowing HTML, data access tools don't release one from knowing SQL. Great post.

Jeffrey Palermo - Wednesday, August 6, 2014 5:52:38 AM

I believe ORMs solve the wrong problem. The main problem with SQL is that the data type is defined in code AND in the schema, breaking DRY. If ORMs are designed to solve this problem by using a "root" data type definition, the Entity Model, it would at first seem like a really smart solution. But since RDBMSs have schemas that "own" the datatype, and OO languages certainly "own" their classes, it won't be easy to introduce the entity model to rule over these two unruly dictators.

Solution? Let the OO model be The Model. Store data in a schema less db. Manage the storing and retrieval by hand coded routines that allow changes without big time migrations. Simplify change.

Even if you really prefer RDBMSs it is worth thinking about not using ORMs. Do you really need a third language to retrieve data? Do you really need a third way to define your datatype? With ORMs you end up having to know your programming language, and SQL, and then a third language, that of the ORM. I've seen few ORMs in action and in each project I have heard fellow programmers say things like:

- "I don't really know what is going on here so I'm flushing everything to the db here again, even though I shouldn't have to".
- "We had to write native sql here to make it work".
- "It should work with this annotation/attribute but it doesn't. I'm clueless".
- "Migrations are supposedly automatic with this ORM. But it doesn't work in real life".
- "The entity should have an Id at this point. But it doesn't. We have to make a workaround".

My point is, if one is to use a framework/tool/paradigm/solution then the ROI must be higher than the investment in time to set up and learn it. With ORMs it is not necessarily so. But programmers seem to enjoy using them anyway.

Arne Evertsson - Wednesday, August 6, 2014 9:32:32 AM

@Arne
If the developer doesn't know what s/he's doing, then it's not the problem of the tools used (of with the ORM is just one) but of the developer. Using the OO model as 'the model' is as I tried to point out, the same as using the table model as 'the model': it's not correct: it's as if you're using the IL code as 'the source' instead of the C# code it is distilled from.

Storing the object graphs in an OO / Document DB surely helps in your case, as it removes a tool which requires time and investment. But it doesn't solve a problem, it only hides it: using the data you serialized into the OO/Document DB is difficult if you don't utilize the code which stored it in the first place.

Btw, any tool used by a novice will cause problems, be it a web framework, an ORM, a document DB... it all takes time and investment to make sure you use the tool properly. Unless the developer makes that investment and understands the tools used, it will always be a mess, be it an ORM, a document DB, a web framework or even a language and runtime.

FransBouma - Wednesday, August 6, 2014 10:11:20 AM

Frans is right - a developer has to avoid using an O/RM for what it was not designed: reporting or ETL.

Also, when using an O/RM there are a number of basic preconditions: that you need to load the data in a set of domain objects in memory (there are cases when a simple DataTable might be enough), that the structure of domain classes and database tables might be different to a degree, but not _very_ different (I woudn't use an O/RM directly on top of an event store).
If all these are true, I prefer any day using a good O/RM to hardcodding SQL inside the application code and to remember to update that SQL in hundred of places each time the DB structure is changed.

But any developer should remember that becoming proficient with an O/RM takes weeks or even more, not one or 2 days.

Tudor - Wednesday, August 6, 2014 2:49:28 PM

You say:
"But you can't conclude from that to 'just use SQL', as that's like recommending to learn to write Java Bytecode because the syntax of Clojure is too hard to grasp."
This is a terrible comparison. Even you agree knowing SQL is mandatory, ORM or no ORM. On the other hand, knowing bytecode is nohow mandatory.

veggen - Thursday, August 7, 2014 12:35:41 PM

I wish I read Geoff's last line first. It would've saved me from reading the entire blog post.

"If you're using an RDBMS, bite the bullet and learn SQL."

Apparently he thinks the (main) purpose of ORM is to avoid learning SQL. I have never met anyone who throught, "hmm...SQL is hard, let me learn an ORM instead". It's usually the other way around, most people know SQL, but just don't get ORM.

The claims made against ORMs are so absurd in his blog post, that I don't even wanna attempt to refute them.

kk - Thursday, August 7, 2014 2:19:49 PM

I have been using ORM for the better part of a decade. What is apparent is that most people who argue against them have never used them! And don't get started on SQL experts who simply will not accept that an algorithm could possibly generate at least comparable and efficient SQL as their stored procedures.

For me all these arguments seldom amount to anything constructive. Personally it is the huge amount of time that I save that keeps me ORMing. And the fact that the code patterns are consistent. Look at any enterprise system with even a modest number of entities/tables in the model/database and the plumbing and data access code has a disproportionate share of the code base versus the actual business end of the application. Add a few tables to your database and the amount of classes and code that has to be added to the tiers becomes significant. Code it by hand if you have the patience and budget. Put it all in stored procedures if you really want the database to hold the business logic. I have not coded a stored procedure in years.

As the owner of small consultancy I need time to add functionality and make enhancements to the applications my clients depend on. I now spend only a fraction of my time on doing the plumbing and data access code. The rest is spent on the application code which is really where the business side of the bread lies.

I tried all the well known ORM tools before I chose my favourite ORM tool. I learnt an awful lot about what works and doesn't.There is no silver bullet and no tool will solve all your problems - that is your job - to find what works and find workarounds where a pattern doesn't fit. I've actually been able to take a long holiday each year thanks to all the time I have saved ORMing!

basik - Thursday, August 7, 2014 9:07:39 PM

I'm a DBA. I find the use of a ORM ties my hands. Gone are the days when I can identify and optimize procedures that have performance issues. Gone are the days when I can review a procedure before it goes into production. I find the priorities for development and operations are often very different. Developers tend to put "ease of development" high on the list. I explain to them that, while nice to have, it's near the bottom of the list. There are far more important concerns - stability, security, supportability, performance, etc. Building a good application is hard to do, no matter what the tool. I wish the tool forced best practices, but it does not. In some cases, it just makes it more difficult.

Any chance of updating the tool to include a check of the logical reads and writes of a generated query? It's not hard to do. (BTW, this method will detect the usage within a scalar udf that is not detected with SET STATISTICS IO ON.) I have system procedures for this metric. I was creating a procedure template to include the metric, along with other goodies, for the developer to get with no additional work. What good is that if the ORM won't use procedures? In an ORM, do the developers get all the love?

DECLARE @cpu int, @elapsed bigint, @reads bigint, @writes bigint , @logical bigint

-- measure before

SELECT @cpu = cpu_time,
@elapsed = CAST(ROUND(86400000 * CAST(GETDATE() - start_time as float), 0, 0) as BIGINT), -- avoid total_elapsed_time int limit,
@reads = reads,
@writes = writes,
@logical = logical_reads
FROM sys.dm_exec_requests m WITH (NOLOCK)
WHERE m.session_id = @@SPID AND m.request_id = CURRENT_REQUEST_ID()

-- ORM query

-- put something here

WAITFOR DELAY '00:00:11.000'

-- measure after, saving difference

SELECT @cpu = cpu_time - @cpu,
@elapsed = CAST(ROUND(86400000 * CAST(GETDATE() - start_time as float), 0, 0) as BIGINT) - @elapsed, -- avoid total_elapsed_time int limit,
@reads = 8 * (reads - @reads), -- 1 page = 8 KB
@writes = 8 * (writes - @writes),
@logical = 8 * (logical_reads - @logical)
FROM sys.dm_exec_requests m WITH (NOLOCK)
WHERE m.session_id = @@SPID AND m.request_id = CURRENT_REQUEST_ID()

-- use the metric to let the developers know something is wrong - values are a bit arbitrary

DECLARE @status nvarchar(20)

IF @cpu > 10000 OR @elapsed > 10000 OR @logical > 10000 SET @status = 'you pig'
ELSE IF @cpu > 1000 OR @elapsed > 1000 OR @logical > 1000 SET @status = 'slow'
ELSE IF @cpu > 100 OR @elapsed > 100 OR @logical > 100 SET @status = 'okay'
ELSE SET @status = 'good'

RAISERROR (N'METRIC [%s]: logical %I64d KB, cpu %d ms, elapsed %I64d ms, reads %I64d KB, writes %I64d KB',
10, 1, @status, @Logical, @Cpu, @elapsed, @Reads, @Writes) WITH NOWAIT;

Randy in Marin - Monday, August 11, 2014 6:30:23 PM