Developing Linq to LLBLGen Pro, part 6
(This is part of an on-going series of articles, started here)
I switched to 'part' posts instead of 'day' posts, as I realized the initial plan (a post every day) isn't that useful in this case ("Today I stared at 20 lines of code for 3 hours before I realized the ideas I had yesterday didn't work as I thought they would, more on this tomorrow"), as the amount of work done per day is still rather limited. The main reason is that when you're somewhat in the dark, no matter what information you can pull off the internet, you will make a lot of mistakes, wrong assumptions and faulty algorithms. More often than not I thought for hours straight about how to solve a given problem, found a solution, documented it, wrote some code, had to quit because the day was over, and the next day realized I had it all backwards and had to start over. This isn't that bad, no-one gets it right the first time, though if it drags on it can become frustrating; that happens with everything where the clue to the solution is so close you can almost smell it, yet so far away you can't see it.
So I was on the wrong track. Well, not entirely: I was actually busy trying to convince myself I could avoid the track I had to take, thinking I could outsmart it with more clever code and algorithms. In these cases it's better to simply take the famous step back and realize that just because something looks nasty and should be avoided, it doesn't mean the road you're desperately trying to find instead is any less awful.
In LLBLGen Pro I once made such a mistake. I thought I wouldn't make that mistake again, but apparently I was wrong. What did I do? Well, I tried to outsmart the algorithm of Topological Sorting in a Directed Acyclic Graph (DAG). Yes, I was that stupid. Let me try to talk my way out of this.
In an O/R mapper there are many things you have to do besides fetching entities. One of these is saving piles of work done on entity objects in memory. (To read more about the difference between entity classes, entity instances, entity class instances and friends, see my essay about entity models.) This is typically done by saving a set of graphs of entities. Say, for example, you fetched an existing set of Customer entities, added a few new Order entities to some of these Customers, and to these new Order entities you added new Order Item entities as well. You also added some new Order Item entities to existing Order entities and altered some Customer entities. So, all in all, a graph of objects (customers, containing orders, containing order items) which has to be processed in a given order (but which one?) and with different operations (insert, update).
The question for the O/R mapper then becomes: in which order do I have to save all those entities? This is a typical DAG-related problem: the graph is a DAG (cycles in relational models which are reflected in entity models are a Bad Thing and typically have to be redesigned), and by using Topological Sorting of the DAG you get the order of the work handed to you on a silver platter. With Compliments, even. Algorithms like Topological Sorting are an example of why Graph Theory is so incredibly important in Computer Science, and thus for your daily software engineering work as well.
I had one extra problem to solve: LLBLGen Pro doesn't use a central meta-data store. So I couldn't use the DAG constructed from the mapping information to sort the object graph; I had to sort the object graph by detecting the mapping graph on the fly. In general this isn't a problem, though if you don't realize the problem you're dealing with is more complex than you thought, it will make you opt for easy solutions which fail in corner cases (which have the Murphy gene: they always pop up at the worst possible moment). My algorithm didn't mimic a graph sorter, as that appeared to me to be too slow: I didn't want to visit every graph node first, and then again every node which had to be handled for save processing; that would mean visiting a node more than once! A small voice inside me said: "If you can come up with an algorithm which visits each graph node just once, it's faster!". So I wrote an algorithm which did these two things combined: it visited every node in the graph and decided on the fly what to do.
It worked at first, then the corner cases appeared on the horizon, so I adjusted the algorithm to cover those as well, until I realized I wasn't doing what I should be doing: analyze the algorithm needed and look up how it is covered in what Computer Science has produced in the last couple of decades. So I analyzed the algorithm necessary, picked my "Algorithms in C" by Sedgewick from the shelf (old book, great resource), peeked into it, found the algorithm in a couple of minutes, and spent an afternoon implementing it. It hasn't had a single issue since, simply because it's a proven algorithm: it works regardless of what you throw at it, corner case or not.
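To make the idea concrete, here's a minimal sketch of a depth-first topological sort over an entity dependency graph. This is illustrative only, not LLBLGen Pro's actual code, and the entity names and graph shape are mine:

```python
# Sketch: depth-first topological sort to determine the save order of
# entity types in a DAG. An edge A -> B means "A must be saved before B".
def topological_sort(graph):
    visited, order = set(), []

    def visit(node):
        if node in visited:
            return
        visited.add(node)
        for dependent in graph.get(node, []):
            visit(dependent)
        order.append(node)  # post-order: node is appended after its dependents

    for node in graph:
        visit(node)
    return list(reversed(order))  # reverse post-order = valid save order

# Customer rows must exist before their Orders, Orders before Order Items.
deps = {"Customer": ["Order"], "Order": ["OrderItem"], "OrderItem": []}
print(topological_sort(deps))  # -> ['Customer', 'Order', 'OrderItem']
```

The whole point of using the textbook algorithm is exactly this: it handles any DAG you feed it, so the corner cases take care of themselves.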
"I know it's Friday afternoon, buddy, but could you hurry up a little and get to the point you're trying to make here, Bouma?". Sure thing.
In the previous post in this series, I was talking about special-case programming and how I was running into situations where special cases could be identified, and therefore the code would become pretty awful (and I remembered my encounter with my special DAG sorting attempt). Matt Warren replied to that post that I should look at the methods as separate sets. Exactly what I wanted to avoid, as it would lead to tree manipulation after the interpretation of the Expression tree: in theory a different place for the complexity to live, not something that solves the complexity. At least, that's what I thought.
But Matt is a smart guy and I realized that right on time. So after that previous post I went back to my design and replaced my approach with a 'set per Queryable extension method' approach. In short it comes down to this: if you have a Queryable extension method like Where or Select, it works on an input set and its result is again a set. Some methods simply work on the same set and their result is that same set (so no projection), while other methods result in a new set (so with a projection).
So I defined a class, SetDefinition, and with that I now build a tree: every extension method is handled in its own routine, and per handler I decide whether I have to manipulate the current set (no projection) or wrap it with a new set and use the current one as its source (projection). This solved a big problem: it gives me the opportunity to focus on parts of the puzzle and worry about the rest later on, in an easier fashion. I still use the stack-based expression parser, as it gives good results and should be OK, though that's just a matter of taste. You can also produce subtrees of your own nodes as the result of an expression handler, which are later on reduced to something usable (which is actually the mechanism behind LR(n) parsing anyway: reduce something you've seen to something you know, and while doing that, handle the reduced elements).
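The manipulate-or-wrap decision can be sketched like this. The class and handler names below are mine, invented for illustration; they are not the actual LLBLGen Pro types:

```python
# Hypothetical sketch of the 'set per Queryable extension method' idea:
# Where manipulates the current set, Select wraps it in a new one.
class SetDefinition:
    def __init__(self, source=None):
        self.source = source      # inner SetDefinition (None for the root set)
        self.filters = []         # accumulated where-clause fragments
        self.projection = None    # what this set projects its source onto

def handle_where(current, predicate):
    # Where works on the input set and yields that same set:
    # just manipulate the current SetDefinition, no new set needed.
    current.filters.append(predicate)
    return current

def handle_select(current, projection):
    # Select produces a new set: wrap the current one as its source.
    wrapper = SetDefinition(source=current)
    wrapper.projection = projection
    return wrapper

root = SetDefinition()
root = handle_where(root, "Country == 'NL'")   # same set, now filtered
top = handle_select(root, "c.CompanyName")     # new set wrapping the old one
print(top.source is root)  # -> True
```

Each extension method handler only has to answer one local question (same set or new set?), which is what makes the per-handler code so much simpler than one big special-case routine.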
Of course, there are a couple of problems still ahead. One of them is the reduction of the tree of SetDefinitions to a single SetDefinition which contains the elements for the query or hierarchical queries to execute. At the moment I can run Linq queries which perform simple entity fetches with where clauses and single list projections. LLBLGen Pro already contains a flexible projection framework to project resultsets onto whatever you want, so I'm using that to handle the different projection types. One that will be a challenge is the hierarchical fetch of nested projections onto anonymous types, though as LLBLGen Pro already contains a generic merging system of sets based on a relationship (which contains one or more predicates), this shouldn't be too difficult to do.
It might sound a little strange to distinguish between the different projections, but fetching entities is something else than fetching a set of data which is projected onto a set of class instances, at least for an O/R mapper: entities are filled with data using different routes, to avoid public property setters triggering field-oriented validation logic when an entity is filled; if the O/R mapper is a POCO system, subclassing takes place (not always); and if the entity has to participate in a uniquing space, the cache or uniquing context has to be consulted, etc. etc.
So in short, the system is currently set up as follows: expression nodes are parsed with dedicated handlers. Per extension method a clean stack is used and the method's parameters are parsed. The expression node handlers place their result on the stack so the caller can handle it further. At the end, the extension method handler pulls the elements off the stack and builds from them a new SetDefinition, or manipulates the current one. After that, the SetDefinition tree is reduced until nothing is reducible anymore, which in my setup is the state where there's just one SetDefinition left. This is possible because LLBLGen Pro can fetch hierarchies in one method call, so all parameters for that hierarchical fetch, if necessary, as well as derived tables, scalar queries etc., are elements of the SetDefinition's inner elements (predicate expressions for subqueries, ScalarQueryExpressions for field elements in projections, etc.). The SetDefinition is then handled further by the QueryProvider, which means it's simply converted to a fetch on the LLBLGen Pro API and the results are returned.
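The reduction phase, boiled down to its essence, is a loop that folds inner sets into their parent until one SetDefinition remains. Again a hypothetical sketch with invented names, and with "elements" simplified to a filter list; the real engine merges joins, derived tables, scalar queries and so on:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SetDef:
    # Hypothetical stand-in for a SetDefinition node in the tree.
    filters: list = field(default_factory=list)
    source: Optional["SetDef"] = None

def reduce_to_single(outer):
    """Fold the chain of source sets into one SetDef: the state in which
    the whole query can be executed as a single (hierarchical) fetch."""
    while outer.source is not None:
        inner = outer.source
        outer.filters = inner.filters + outer.filters  # merge inner elements
        outer.source = inner.source                    # splice the inner set out
    return outer

inner = SetDef(filters=["Country == 'NL'"])
outer = SetDef(filters=["Name LIKE 'A%'"], source=inner)
result = reduce_to_single(outer)
print(result.filters)  # -> ["Country == 'NL'", "Name LIKE 'A%'"]
```

The loop terminates precisely when nothing is reducible anymore, i.e. when `source` is None and a single set is left to hand to the query provider.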
In the end, there's just hard work left: typing in all the code for the fine details, for handling all extension methods, all aggregates, all methods of types known to the database, like String etc. However, that's the 'easy' part (it's still hard and complex though), as it comes down to writing out what you've already thought out: the design of that code simply follows the global design of the overall framework. That's also why it's so important to get the global design of the framework right, hence my struggle.
My next stop, after fixing a little bug in the tree reducer, is adding support for Joins and Aliases. Joins introduce aliases for the rest of the elements. So do things like where e.Manager.Manager.Manager.Firstname=="John", which introduces joins which could (so you have to take it into account) be joins to self, and those always require an alias. As Joins introduce anonymous types under the covers, it will be a bit of a challenge, but I think with the approach I have now it will be manageable.
Thanks Matt for making me realize I was on the wrong track.
Next time we meet IRL, your beer is on me.