Developing Linq to LLBLGen Pro, Day 5

(This is part of an on-going series of articles, started here)

Consuming Expression trees, back to Special Case programming?
I'll show you 5 different queries and what their expression tree looks like in text (tested on Linq to Sql, so you'll see Table references)

Query A:

// C#
var q = from c in nw.Customers
            select new
            {
                Name = c.ContactName,
                Orders = from o in nw.Orders
                         where o.CustomerID == c.CustomerID
                         select o
            };
gives:
// Expression tree .ToString() output
Table(Customer).Select(c => new <>f__AnonymousType0`2(Name = c.ContactName, 
Orders = value(WindowsFormsApplication1.Program+<>c__DisplayClass0).nw.Orders.Where(
    o => (o.CustomerID = c.CustomerID))))

In case you wonder... the WindowsFormsApplication1.Program namespace is the test app namespace I used to run the query with. As you can see, a bug pops up: instead of referring to Table(Order), it refers via the application to the context.Orders for the Where clause. This will be a pain to correct, I'm sure. Also there's just 1 Select call, while there are two in the query. This will be a pain too, but more on that below.

Query B:

// C#
var q = from c in nw.Customers
        let x = c.Region ?? "Undef"
        where (c.Country != "Germany")
            && x != "Undef"
        select new { CompName = c.CompanyName };
gives:
// Expression tree .ToString() output
Table(Customer).Select(c => new <>f__AnonymousType0`2(c = c, x = (c.Region ?? "Undef"))).Where(
    <>h__TransparentIdentifier0 => ((<>h__TransparentIdentifier0.c.Country != "Germany") && 
    (<>h__TransparentIdentifier0.x != "Undef"))).Select(<>h__TransparentIdentifier0 => new 
    <>f__AnonymousType1`1(CompName = <>h__TransparentIdentifier0.c.CompanyName))
Yeah, I also wondered about the transparent identifier, but it's the reference to a derived table-like sub result if you look closely. See also the C# 3.0 spec, where this is explained in detail.

Query C:

// C#
var q = from c in nw.Customers
        where c.Country != "USA"
        select new { CompName = c.CompanyName };
gives:
// Expression tree .ToString() output
Table(Customer).Where(c => (c.Country != "USA")).Select(c => 
    new <>f__AnonymousType0`1(CompName = c.CompanyName))

Query D:

// C#
var q = from c in nw.Customers
        select c;
gives:
// Expression tree .ToString() output
Table(Customer).Select(c => c) 

Query E:

// C#
var q = from c in nw.Customers
        where c.Country=="USA"
        select c;
gives:
// Expression tree .ToString() output
Table(Customer).Where(c => (c.Country = "USA"))

Confused? So am I. Yesterday I spend some time on writing code for the execution of queries, setting up the proper LLBLGen Pro elements such as live transactions and the like so they're usable with the query objects so deferred execution still would take place inside a transaction if required (so SqlServer won't deadlock on you), and I also spend some time analyzing what to do with the Expression tree nodes. Yesterday, it occured to me that it was something like LR(1) parsing: you handle an element with the lookahead info and place the results on the stack for the caller to process further. If the caller sees it has enough info to reduce the elements on the stack, it reduces all those elements and handles it further, placing the result of the reduction onto the stack again.

However, to make that work, the input has to be deterministic, i.e.: you shouldn't wonder in state N in what scope you're in, started in state N-M. Now take a look at queries D and E. D has a Select call, E doesn't. Why on earth doesn't E have a select call? This is undeterministic. What it should have been is:

// What should have been E's correct structure 
Table(Customer).Where(c => (c.Country = "USA")).Select(c=>c)

The expression tree created by the compiler is wrong, because the expression tree parser in the Linq provider has to decide what to leave out, however the compiler decides that for me now. The sad effect of this is that I now have to add code which has to guess what the intention was: when no Select call is present, apparently one still wants to do a select, with the native projection.

This is doable, although it's a form of 'special case programming': as a provider developer you now have to anticipate on a special case: namely the one where there's no Select call in the expression tree despite the fact that there is a select statement in the query (there always is). With the occurance of a special case, one has to wonder: what kind of other special cases are there, which are unknown to me now? And also: if the compiler tries to be clever, why isn't it more clever so parsing the nodes is easier and straight forward?

The main pain with writing the Linq provider isn't the code, it's the unknown voodoo which takes place inside the compiler which makes it emit calls to Queryable(Of T) extension methods which build apparently odd expression trees at runtime. To be able to write a proper Linq provider, you have to understand the intention of the query. There's no other thing to look for. You're handed an expression tree. So by parsing that tree, you try to re-build the query the developer wrote in C# (of VB.NET) which lead to this expression tree. If you solely look at elements inside the tree, you'll run into problems with things which have effect on various elements of the tree, like a join into with DefaultIfEmpty() calls to signal a left join (yes, Linq's join syntax is that cumbersome). If you know the intention, you can produce the proper query code on all databases you support via your provider and also you can optimize the query when necessary.

It's now unclear what lead to which expression tree subtree, so in other words: it's impossible to deterministically determine what the meaning is of a subtree in the expression tree, at least at this point where documentation about why the expression tree looks like the way it does is non-existent.

It's not hard to write code which can handle trees like the ones resulting from queries C, D and E. The main issues arise when someone introduces a line with 'let' like in query B or a second select (projection) in the main projection as in query A.

Take a closer look at the differences between query A and B. You'll notice that although query A has two select statements in the query, it has just one Select method call in the expression tree, while in query B, it has two Select method calls while there's just 1 select statement in the query.

Query B, with its two Select method calls gives another problem: what to do with the results? Does it have to be projected onto new types, or onto entities? Projecting onto entities is different, as shorter paths exist inside the O/R mapper core, simply because one doesn't set entity values through properties when you're fetching (to avoid triggering validation logic which shouldn't trigger at fetch time for example). However, with two (or potentially more) Select method calls exist, which one decides what is done with the end-result? Again, special case programming and not deterministic code: to handle a Select method call, one has to know the scope it is in: it can't decide based on its own parameters and the lookahead what to do: which one is the end-projection? The last one? But do you know that when you run into the first one? No, so you have to seek through the whole tree for another select. If that's not found, you know you're the only one. So doing things based on what you see in node X isn't going to work.

In LR(n) parsers, the stack top is important too. So this could help in the Select method handler to determine if it should decide the end projection or if it should reduce to a derived table. But this could lead to a shift/reduce conflict , which one normally solves by either extending the lookahead or changing the grammar (no can do) or by adding a special case (not recommended).

I'm sorry if I sound a litle bitter, but this whole Linq expression tree parsing system is overly complicated while this isn't necessary. I wrote my own LR(n) parser engine and action/goto generators, I know what parsing trees means, but with Linq expression trees it's like with a box of chocolates, isn't it, Forest...

As I said earlier, what you want with code like this is to be able to proof that the code is correct. As there are 'special cases' popping up left and right, being able to proof something is correct is impossible, unless you have deep knowledge of the inner workings and know why the tree looks like the way it does. Without that knowledge, I can't optimize query A. Query A can be done with 2 queries in generic code (or one, if you cheat, but that leads to special code for single path graphs). However to determine this situation is seen inside the query is hard. Sure, you can look at the expression tree of query A and take a guess that if you see something like that it will be optimizable.

However, Linq to Sql (and also the entity framework) will execute a myriad of queries with query A (number of customers + 1). I'll try to find a way to optimize it, but that depends on whether I can recognize the subquery/subqueries in a normal fashion. This is harder than you might think, as the subquery might or might not have a Select method call (and Microsoft, with knowledge of the inner workings of the compiler didn't know how to optimize it)

Yes, I do wonder if it wouldn't be simpler if I just got the linq query as text and did the parsing myself. Anyway, I'll start simple, with queries D, E and then C. I've the whole mechanism in place now so I can now add dispatch routines to my Expression node dispatcher for the various expression node types within the range of BinaryExpression for example, so I'll move forward with my LR(1) (so 1 lookahead) lookalike engine, which is hopefully enough to handle the cases it has to fight with. It's likely I need tree traversal code for cheating in situations where the 1-lookahead isn't enough. This is cumbersome though and makes things fragile.

3 Comments

  • Check out the section in the C# 3.0 spec on Degenerate query expressions. Essentially it says that a Degenerate query is one that trivially selects the elements of a source and so in order to ensure that the results of the query are not the source object itself, it forces the select method to be called. When a query is not a degenerate query you shouldn't see this because methods like Where, SelectMany, etc... will always ensure that the source objects aren't returned. At least this is how I interpreted it. Check out section 7.15.2.3 in the spec. Hope this helps!

  • I'm a bit confused -- why don't you just automatically remove identity Select calls (.Select(o => o)) everywhere from the call chain? I don't see why you really need them, as you should know that at the end of any call chain (e.g. .Where(...).GroupBy(...).Join(...)), values pop out. Remember that your customers might use extension method calls instead of query syntax; I often use query syntax if I'm foreaching through a list and want to apply a very simple filter.

  • Luke: at some point it has to be decided what kind of fetch method is used on our api: a general entity fetch, a general projection of entity data, a hierarchical projection, an entity fetch with excluded fields (i.e. a projection on entities with less fields fetched).

    They use different methods, as they require different ways to put the data into the object to fill. For example, setting entity field values through property setters is in general not a good idea, because it could trigger code in the properties (and what to do with readonly fields?).

    So this is important. However, where to put that code? What I want to achieve is that I can decide inside the handler of a given Expression node what the node represents. This is essentially the hard part of the linq provider. So if I see a Select method call, I have to know what this select call represents.

    If there's no select method call, what to do and if there are 2 or 3, what to do inside each of them? What I want to avoid is that I have all kinds of flags set when entering recursion of the tree traversal, as that's never a good idea. :) But I think I found a way to do so, Matt put me on the right track with considering that these method calls are in fact their own sets.

    I use a stack for the reduction results. So when a node is reduced, the result is placed on the stack, the caller pops the result and processes it further with its own data, etc.

    Matt: I indeed dismissed your examples as more useful for people who need to emit SQL directly. I'm more or less rewriting the tree, by reducing expression nodes to results which are placed on a stack, similar to LR(n) parsing where series of terminals and non-terminals are reduced to non-terminals which are again reduced to non-terminals etc. till everything is gone :).

    My biggest struggle is the way to determine what a node represents. Take for example the UnaryExpression's Convert type. These things pop up in weird places so sometimes they've to be removed (which is simply a reduction with the operand of the expression). However also sometimes they don't and have to be converted into something (DB function call). When to remove them is decided by the containing expression of the convert, and you can't decide that inside the convert expression handler.

    This opens a bit of a problem, so you have to remove them outside the handler, but that's something you might have to do at various places, but which places?.

    I saw (sorry, I peeked) in the Linq to Sql provider you indeed use your own SqlNode based tree which you modify. I'm not sure if this is 100% necessary for me, as our query api already takes care of a lot of things. So I'm a bit uncertain to proceed with my reduction code with a stack vs. a tree modification routine. What's the main issue is that with the tree modification routine I still have to decide what node I'm looking at, and when I know that, I can also create predicate objects, field objects etc. for my own api.

Comments have been disabled for this content.