Developing Linq to LLBLGen Pro, part 6

(This is part of an ongoing series of articles, started here)

I switched to 'part' posts instead of 'day' posts, as I realized the initial plan (post every day) isn't that useful in this case ("Today I stared at 20 lines of code for 3 hours before I realized the ideas I had yesterday didn't work as I thought they would, more on this tomorrow"), as the amount of work done per day is still rather limited. The main reason is that when you're somewhat in the dark, no matter what information you can pull off the internet, you will make a lot of mistakes, wrong assumptions and faulty algorithms. More often than not I thought for hours straight about how to solve a given problem, found a solution, documented it, wrote some code, had to quit because the day was over, and realized the next day that I had it all backwards and had to start over. That isn't too bad in itself, no-one gets it right the first time, though it can become frustrating if it drags on: the clue to the solution is so close you can almost smell it, yet so far away you can't see it.

So I was on the wrong track. Well, not entirely: I was actually busy trying to convince myself I could avoid the track I had to take, thinking I could outsmart it with more clever code and algorithms. In these cases it's better to simply take the famous step back and realize that even though something might look nasty and worth avoiding, that doesn't mean the road you're desperately trying to find instead is any less awful.

In LLBLGen Pro I once made such a mistake. I thought I wouldn't make that mistake again, but apparently I was wrong. What did I do? Well, I tried to outsmart the algorithm of Topological Sorting in a Directed Acyclic Graph (DAG). Yes, I was that stupid. Let me try to talk my way out of this.

In an O/R mapper, there are many things you have to do besides fetching entities. One of these things is saving piles of work done on entity objects in memory. (To read more about the difference between entity classes, entity instances, entity class instances and friends, see my essay about entity models.) This is typically done by saving a set of graphs of entities. Say for example you fetched an existing set of Customer entities, added a few new Order entities to some of these Customers, and to these new Order entities you added new Order Item entities as well. You also added some new Order Item entities to existing Order entities, and you altered some Customer entities too. So all in all you have a graph of objects (customers, containing orders, containing order items) which has to be processed in a given order (but which one?) and with different operations (insert, update).
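To make that concrete, here's a minimal sketch of such a graph being built up. The entity classes, the collection properties and the adapter's save call are hypothetical, loosely modeled after an LLBLGen Pro style API; I'm not claiming these exact signatures.

    // Hypothetical entity/adapter types, just to make the ordering problem concrete.
    CustomerEntity customer = FetchExistingCustomer();  // fetched earlier
    customer.CompanyName = "Foo Inc.";                  // existing entity, altered -> update

    OrderEntity order = new OrderEntity();              // new entity -> insert, needs customer's PK
    customer.Orders.Add(order);

    OrderItemEntity item = new OrderItemEntity();       // new entity -> insert, needs order's PK
    order.OrderItems.Add(item);

    // A recursive save of this graph has to pick an order by itself: first
    // update the customer, then insert the order, then insert the order item.
    adapter.SaveEntity(customer, false, true);          // (entity, refetch, recurse)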

The question for the O/R mapper then becomes: in which order do I have to save all those entities? This is a typical DAG-related problem: the graph is a DAG (cycles in relational models which are reflected in entity models are a Bad Thing and typically have to be redesigned), and by using Topological Sorting of the DAG, you get the order of the work handed to you on a silver platter. With compliments, even. Algorithms like Topological Sorting are an example of why Graph Theory is so incredibly important in Computer Science, and thus for your daily software engineering work as well.
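For illustration, here's a minimal depth-first topological sort of the textbook kind. This is a sketch of the algorithm, not the actual LLBLGen Pro code.

    using System;
    using System.Collections.Generic;

    static class TopologicalSorter
    {
        // Returns the nodes ordered so that every node appears after the
        // nodes it depends on.
        public static List<T> Sort<T>(IEnumerable<T> nodes, Func<T, IEnumerable<T>> dependenciesOf)
        {
            List<T> result = new List<T>();
            Dictionary<T, bool> state = new Dictionary<T, bool>();  // false=visiting, true=done
            foreach(T node in nodes)
            {
                Visit(node, dependenciesOf, state, result);
            }
            return result;
        }

        private static void Visit<T>(T node, Func<T, IEnumerable<T>> dependenciesOf,
                                     Dictionary<T, bool> state, List<T> result)
        {
            bool done;
            if(state.TryGetValue(node, out done))
            {
                if(!done)
                {
                    // Re-entered a node we're still expanding: a cycle, so no DAG.
                    throw new InvalidOperationException("Cycle detected; graph isn't a DAG.");
                }
                return;
            }
            state[node] = false;
            foreach(T dependency in dependenciesOf(node))
            {
                Visit(dependency, dependenciesOf, state, result);
            }
            state[node] = true;
            result.Add(node);   // emitted only after all its dependencies
        }
    }

Feed it the entities with 'depends on' edges (an Order depends on its Customer), and the resulting list is the save order: dependencies come out first, so inserts and updates happen in an FK-safe order.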

I had one extra problem to solve: LLBLGen Pro doesn't use a central meta-data store. So I couldn't use a DAG constructed from the mapping information to sort the object graph; I had to sort the object graph by detecting the dependency graph on the fly. In general this isn't a problem, though if you don't realize that the problem you're dealing with is more complex than you thought, it will make you opt for easy solutions which fail in corner cases (and corner cases have the Murphy gene: they always pop up at the worst possible moment). My algorithm didn't mimic a graph sorter, as that appeared too slow to me: I didn't want to visit every graph node first, and then again every node which had to be handled for save processing, as that would mean visiting a node more than once! A small voice inside me said: "If you can come up with an algorithm which visits each graph node just once, it's faster!". So I wrote an algorithm which combined the two: it visited every node in the graph and decided on the fly what to do.

It worked at first. Then the corner cases appeared on the horizon, so I adjusted the algorithm to cover those as well, until I realized I wasn't doing what I should be doing: analyze the algorithm needed and look up how it's covered by what Computer Science has produced in the last couple of decades. So I analyzed the algorithm necessary, picked my "Algorithms in C" by Sedgewick from the shelf (old book, great resource), found the algorithm in a couple of minutes, and spent an afternoon implementing it. It hasn't had a single issue since, simply because it's a proven algorithm: it works regardless of what you throw at it, corner case or not.

"I know it's friday afternoon, buddy, but could you hurry up a little and get to the point you're trying to make here, Bouma?". Sure thing. . In the previous post in this series, I was talking about special case programming and that I was running into situations were special cases could be identified and therefore the code would become pretty awful (and I remembered my encounter with my Special DAG Sorting attempt ). Matt Warren replied to that post that I should look at the methods like separate sets. Exactly what I wanted to avoid as it would lead to tree manipulation after the interpretation of the Expression tree, so in theory a different place of where the complexity would be located, not something to solve the complexity. At least that's what I thought.

But Matt is a smart guy, and I realized that right on time. So after that previous post I went back to my design and replaced my approach with a 'set per Queryable extension method' approach. In short it comes down to this: a Queryable extension method like Where or Select works on an input set, and its result is again a set. Some methods simply work on the same set and their result is that same set (so no projection), while other methods result in a new set (so with a projection).

So I defined a class, SetDefinition, and with that I now build a tree: every extension method is handled in its own routine, and per handler I decide if I have to manipulate the current set (no projection) or wrap it with a new set which uses the current one as its source (projection). This solved a big problem: it gives me the opportunity to focus on parts of the puzzle and worry about the rest later on, in an easier fashion. I still use the stack-based expression parser, as it gives good results and should be OK, though that's just a matter of taste. You can also produce subtrees of your own nodes as the result of an expression handler, which are later on reduced to something usable (which is actually the mechanism behind LR(n) parsing anyway: reduce something you've seen to something you know, and while doing that, handle the reduced elements).
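A stripped-down sketch of the idea; the real SetDefinition carries a lot more state, and the names here are illustrative:

    using System.Collections.Generic;
    using System.Linq.Expressions;

    class SetDefinition
    {
        public SetDefinition Source;          // null for the root set (the entity set itself)
        public List<LambdaExpression> WhereClauses = new List<LambdaExpression>();
        public LambdaExpression Projection;   // null means: no projection, same set
    }

    class SetBuilder
    {
        private SetDefinition _current = new SetDefinition();

        public SetDefinition Current { get { return _current; } }

        // Where filters the current set: no projection, so manipulate in place.
        public void HandleWhere(LambdaExpression predicate)
        {
            _current.WhereClauses.Add(predicate);
        }

        // Select projects: wrap the current set in a new one, with the
        // current set as the new set's source.
        public void HandleSelect(LambdaExpression projection)
        {
            SetDefinition newSet = new SetDefinition();
            newSet.Source = _current;
            newSet.Projection = projection;
            _current = newSet;
        }
    }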

Of course, there are a couple of problems still ahead. One of them is the reduction of the tree of SetDefinitions to a single SetDefinition which contains the elements for the query, or hierarchical queries, to execute. At the moment I can run Linq queries which perform simple entity fetches with where clauses and single list projections. LLBLGen Pro already contains a flexible projection framework to project resultsets onto whatever you want, so I'm using that to handle the different projection types. One challenge will be the hierarchical fetch of nested projections onto anonymous types, though as LLBLGen Pro already contains a generic system for merging sets based on a relationship (which contains one or more predicates), this shouldn't be too difficult to do.
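To give an idea, this is the kind of nested projection onto anonymous types meant here (the 'metaData' query entry point and the entity members are illustrative):

    var q = from c in metaData.Customer
            where c.Country == "Netherlands"
            select new
            {
                c.CompanyName,
                Orders = from o in c.Orders
                         select new { o.OrderId, o.OrderDate }
            };

Fetching this means running a query per nesting level and merging the child rows into the right parent objects afterwards, which is where that merging system comes in.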

It might sound a little strange to distinguish the different projections, but fetching entities is something else than fetching a set of data which is projected onto a set of class instances, at least for an O/R mapper it is: entities are filled with data using different routes, to avoid public property setters triggering field-oriented validation logic while the entity is filled; if the O/R mapper is a POCO system, subclassing may take place; and if the entity has to participate in a uniquing space, the cache or uniquing context has to be consulted, etc. etc.
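A sketch of what those 'different routes' boil down to; illustrative names, not actual LLBLGen Pro classes:

    class CustomerEntity
    {
        private string _companyName;

        public string CompanyName
        {
            get { return _companyName; }
            set
            {
                // Field-oriented validation runs when user code sets the property.
                ValidateCompanyName(value);
                _companyName = value;
            }
        }

        // The fetch pipeline writes into the field directly: the value comes
        // straight from the database, so validation must not fire here.
        internal void SetCompanyNameFromFetch(string value)
        {
            _companyName = value;
        }

        private void ValidateCompanyName(string value)
        {
            if(string.IsNullOrEmpty(value))
            {
                throw new System.ArgumentException("CompanyName can't be empty.");
            }
        }
    }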

So in short, the system is currently set up as follows: expression nodes are parsed with dedicated handlers. Per extension method a clean stack is used, and the method parameters are parsed. The expression node handlers place their results on the stack so the caller can handle them further. At the end, the extension method handler pulls the elements off the stack and builds a new SetDefinition from them, or manipulates the current one. After that, the SetDefinition tree is reduced until nothing is reducible anymore, which in my setup is the state where there's just one SetDefinition left. This works because LLBLGen Pro can fetch hierarchies in one method call, so all parameters for that hierarchical fetch, if necessary, as well as derived tables, scalar queries etc., are elements of the SetDefinition's inner elements (predicate expressions for subqueries, ScalarQueryExpressions for field elements in projections, etc.). The SetDefinition is then handled further by the QueryProvider, which means it's simply converted to a fetch on the LLBLGen Pro API, and the results are returned.
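The reduction step, again as a sketch on top of the SetDefinition class above. The real reducer does considerably more (a projected source becomes a derived table, subquery predicates and ScalarQueryExpressions get folded in); this just shows the shape of the loop:

    static class SetDefinitionReducer
    {
        public static SetDefinition Reduce(SetDefinition set)
        {
            while(set.Source != null)
            {
                SetDefinition source = set.Source;
                // Absorb the source's filters into the current set...
                set.WhereClauses.InsertRange(0, source.WhereClauses);
                // ...and detach the source once its elements are absorbed.
                set.Source = source.Source;
            }
            return set;   // nothing reducible anymore: exactly one SetDefinition left
        }
    }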

In the end, there's just hard work left: typing in all the code for the fine details, handling all extension methods, all aggregates, all the methods of types known to the database, like String, etc. However, that's the 'easy' part (it's still hard and complex, though), as it comes down to writing out what you've already thought out: the design of that code simply follows the global design of the overall framework. That's also why it's so important to get the global design of the framework right, hence my struggle.

My next stop, after fixing a little bug in the tree reducer, is adding support for Joins and Aliases. Joins introduce aliases for the rest of the elements. So do constructs like where e.Manager.Manager.Manager.Firstname=="John", which introduce joins that could be joins to self (so you have to take that into account), and joins to self always require an alias. As Joins introduce anonymous types under the covers, it will be a bit of a challenge, but I think with the approach I have now it will be manageable.
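For example (again with an illustrative 'metaData' entry point and Employee members):

    var q = from e in metaData.Employee
            where e.Manager.Manager.Manager.Firstname == "John"
            select e;

    // Conceptually, the provider has to emit something along these lines,
    // with a fresh alias per Manager step:
    // SELECT e.*
    // FROM Employee e
    //      JOIN Employee m1 ON e.ManagerId  = m1.EmployeeId
    //      JOIN Employee m2 ON m1.ManagerId = m2.EmployeeId
    //      JOIN Employee m3 ON m2.ManagerId = m3.EmployeeId
    // WHERE m3.Firstname = 'John'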

Thanks Matt for making me realize I was on the wrong track. Next time we meet IRL, your beer is on me.

3 Comments

  • man, if linq even causes *you* this much re-thinking... i worry that it's a level of abstraction that somehow won't sit quite right in the mind of the average workin dev.

    from what i've seen i love it... but as i get deeper into using it, will the love disappear and the pain begin? (much like xslt and regular expressions...)

  • lb: For the USER of Linq, it's not that difficult. I'm at the linq provider side, so I have to parse the expression trees created by the compiler from the linq queries the user types in. :) So don't worry that much about linq being difficult. It has its share of weirdness which will cause problems for a lot of people, but they're not faced with the complexity of a linq provider. :)

  • You are unquestionably one impressive SOB Frans. That last post was something else. Ironically, I was sitting in the office of my fiance's house (where I live on the weekends) and less than a foot away was one of my bookshelves with Algorithms in C++. I hadn't even thought of that book in years and next thing you know I had it cracked open. Excellent post man.
