Developing Linq to LLBLGen Pro, part 9

Friday, November 23, 2007

(This is part of an on-going series of articles, started here)

So, how is the state of Linq to LLBLGen Pro? Well, it's getting more and more the state I have in mind. The codebase today can handle SelectMany, GroupJoin, DefaultIfEmpty, elements at unexpected places, Let, Join, Select, Where, Take, Distinct, our own TakePage for paging, Coalesce, all expression operators, all filter operators, projections of single value lists, projections of multi-value lists and projections of entities. No hierarchical projections yet, I'll do that at the end of the project I think.

The next big hurdle to take will be first the ordering operators and after that the reference to related sets and entities, followed by aggregates and group by. The ordering should be straight forward more or less, the reference to related entities is a bigger problem: aliases. LLBLGen Pro uses pre-fab relation objects so you don't have to know which fields relate to eachother. The downside is that in a Linq expression tree, every set is aliased. If you pull a relation resulting of an access of a field which represents a related entity into the tree, you have to set the proper alias/aliases for both sides to make it fit into the rest of the relations which will be formed. This will be a challenge I think, but likely not as big as the GroupJoin - DefaultIfEmpty nasties, but more on that below.

Skipping Skip
The Skip method won't be supported. LLBLGen Pro's API has pageNumber and pageSize as values for paging, as those are easier to use, because people in general want to fetch a given page of a given size anyway. The code internally would be capable of doing Skip, however we then have to alter our own API interface, so we decide not to support Skip. Instead, we added a new extension method, TakePage(pageNumber, pageSize), which specifies the page to fetch instead. I don't expect this to be a big problem. If it is really required, we can take the hacky road to specify pageSize as the skip number while keeping pageNumber to 0, but it already hurts my eyes while reading this sentence, left alone when it is written in code.

SelectMany and its ugly head
So, let's take a look at piece of code which is actually pretty nasty, shall we?

var q = from c in nw.Customers
        from o in nw.Orders.Where(
            o => o.CustomerID == c.CustomerID).Take(10)
        select new { c.ContactName, o.OrderDate };

This query requires a cross join with a right side predicate which requires values from the left side using a function. This is what's being done with the SQLServer 2005 feature called CROSS APPLY. Not every database supports CROSS APPLY, for example executing the above code on Sqlserver 2000 with Linq to Sql will give an exception because it can't produce the proper SQL, which is logical: there's no query possible which does this without CROSS APPLY, or without analysing the query in full and re-adding extra branches. Linq to LLBLGen Pro will support SelectMany and nested from statements, even with Where clauses and projections, however the above query is also not going to work with Linq to LLBLGen Pro. It's not something you'll run into much I think, in the past 4 years LLBLGen pro is now on the market we haven't had one request for these kind of queries, so we'll stick with CROSS/INNER/LEFT/RIGHT joins instead.

GroupJoin and DefaultIfEmpty
Leave it to scientists to come up with a complex way to specify a simple thing, eh? I wrote about that earlier. The simple thing here is what's called a left outer join. To see the complexity, we'll first have a look at a simple join (equi-join or 'INNER' join) first.

var q = from c in metaData.Customer
        join o in metaData.Order on c.CustomerId equals o.CustomerId
        where c.Country == "Germany"
        select new { Company = c.CompanyName, OrderID = o.OrderId };

It's a simple join, fetch company name and orderid of all customers and their orders from Germany. The syntaxis above shows the join keyword and after it a predicate expression. A predicate expression is an expression built from predicates (1 or more) which result in true or false. A typical example is a WHERE clause in a SELECT query. The predicate expression here shows the equals operator which compares two fields, one from both sides (customer and order) of the join. No other operator is possible, you can only use equals. That's pretty inflexible. If you want to specify other predicates, you have to go for nested from clauses. This can lead to more inefficient SQL due to the cross-join nature.

So, how do we now formulate a left join, where we for example retrieve all customers without orders? With a left keyword in front of join? Or specified behind it? No, that would be easy. No, MS has cooked up a much more complex way to specify LEFT joins. Oh sorry, I should be more precise: a much more complex way to specify a similar working mechanism which seems to mimic a LEFT join. You see, the mechanism I'll show to you below (which is also documented in the MSDN), isn't really a LEFT join. It can be translated to a LEFT join, as that's what seems to come the closest to it. To see why it's not a real LEFT join, you have to look at the meaning of the key element in this, DefaultIfEmpty. In short, it's a sequence selector which returns the default value if the represented sequence is empty. The syntaxis is the way it is because, I think, this syntaxis is usable in Linq to Objects without any translation of the query. Of course for usage on relational databases, it's kind of odd, and it's likely a root cause for problems with Linq for a lot of people.

Let's look at a query with DefaultIfEmpty.

var q = from c in metaData.Customer
        join o in metaData.Order on c.CustomerId equals o.CustomerId into oc
        from x in oc.DefaultIfEmpty()
        where x.OrderId == null
        select c;

This query returns all customers which have no orders. The join result is stored into oc. Or better: the right side of the join result is stored in oc. The element oc doesn't really exist inside the database, it's there to specify the query; in the database we'll use completely different mechanisms. By selecting from the oc element by using the DefaultIfEmpty extension method on oc, x will be either an order or an empty / default value. If the predicate x.OrderId==null succeeds, the related customer can be selected, otherwise it's skipped.

The sadness begins with the into oc part of the join element: it will result into a GroupJoin call. GroupJoin is something new in Software Engineering, you won't find it in the set of operators in relational algebra. The MSDN documentation states that the group join has no direct equivalent in traditional relational database terms and that it resembles inner joins and left joins. How many, how they're setup, that's left as an exercise for the reader. So what to do? The GroupJoin appears in the Linq expression tree, we can't ignore it and we also can't translate it to a set operation in SQL.

The key is to realize that the GroupJoin's result is actually the GroupJoin, the left side isn't that important. So in the case of the query above, the GroupJoin result is used at the spot of the call to DefaultIfEmpty. At that place the GroupJoin should re-appear. At the place where it actually is located in the tree, we can simply use the left side and ignore the rest. Applying this to our expression tree handlers, the GroupJoin is handled at the spot where it's placed inside the tree, and will be seen there as its left side. At the place where the DefaultIfEmpty call is located, we'll pull the GroupJoin's expression back into the tree, as it references the right side of the GroupJoin and will use its elements there, that is: the both selectors and the right side. This can also be applied in other queries where an aggregate is used on the GroupJoin result, as that usage can then use the GroupJoin elements, selectors and right side, to form a query on its own.

Because the second from in the query above makes it a nested from and thus a SelectMany is used, we'll end up with a cross join in our tree where the right side is a DefaultIfEmpty call. If we pull at that location the GroupJoin back into the tree, use the selectors as the join's selectors we can build a left join instead of a cross join. . I wrote it a bit simpler here than the code is in real life, as you have to make sure join sides are always resolved to sets, as they can be full queries on their own, and you have to make sure you don't use aliases which aren't reachable (more on that in a minute). You also have to make sure that the code isn't something written for a special case, but a general piece of code. Of course, in the sense of relational databases, DefaultIfEmpty is a special case, but still, the tree handlers should be generic enough that they can handle every situation possible, because one day, someone will write a Linq query which will otherwise break your code, it has to be prepared for the worst.

That's one of the biggest challenges I've faced and am still facing during development of Linq to LLBLGen Pro: what possible tree nodes can end up at location X? It's easy to say "any node you can possibly think of", but if that's the case, you have to understand that node at that spot and have to do something with it, as the sole purpose of the code is to create a query which is runnable on an RDBMS. So which nodes make sense there? That's often unclear to some extend, however after some work when the nodes are defined, how do they end up there? Which query/queries create trees which place the nodes in that spot? Are these queries possible or not?

You might wonder, why all the questions? Well, because it's impossible to otherwise release the code to customers: writing unit-tests for a Linq provider is pretty meaningless other than 'it indeed has a state where it starts and works'. I had tests succeeding on a given query Q but the code would fail miserably if I swapped two lines of code in Q, which made Q still be equal for Linq, but not interpretable by the provider. That's the hard part of writing a provider which has to work always, in any occasion, especially since the amount of knowledge how to transform a given subtree into SQL isn't very big at first, you have to build that up during the project. That often results in throwing away pieces of code and re-implementing parts of the system even if that code worked perfectly fine according to the tests, simply because the algorithms the code was the projection of, changed because new knowledge was harvested during research of these trees. During the past 5 years I'm now working on the LLBLGen Pro project, I haven't stared as much at my monitors as I've done in the past 2 months without typing any code, just looking at the trees and trying to understand what they meant.

Reachable aliases
One of the things which costed me a lot of time to get right were aliases. I don't know what it is, but I always have trouble with these fellows, even when I realize that I possible will expect problems. Every recognized set in an expression tree gets an alias, called a SetAlias. A SetAlias is actually an object which refers to the real alias, a string. The SetAlias also keeps track of which fields it aliases, so if we update the inner alias string, it can quickly update its related fields as well. Let's look at a query where these aliases become a bit of a problem.

var q = from c in metaData.Customer
        where c.Country=="Germany"
        join o in metaData.Order on c.CustomerId equals o.CustomerId
        where o.EmployeeId==2
        select c.CompanyName;

The above query gives all company names from customers in Germany who have an order which is handled by employee with ID 2. Pretty basic stuff. Or is it? Take a closer look at this fragment:

var q = from c in metaData.Customer
        where c.Country=="Germany"
        join ....

This is a separate query. In SQL this is called a derived table. A derived table is a query in the FROM clause, for example as part of a JOIN set. Let's write the query above in SQL:

SELECT x.CompanyName
FROM 
(
    SELECT c.CompanyName, c.CustomerID
    FROM Customers c
    WHERE c.Country = 'Germany'
) x INNER JOIN Orders o
ON x.CustomerID = o.CustomerID
WHERE o.EmployeeID=2

As you can see, the customer table is inside a SELECT statement, the derived table. This means that the table itself isn't reachable from the outer query elements like the SELECT projection clause. Instead the derived table has to be referenced. Let's get back to our Linq query. In the select statement we see the projection: select c.CompanyName, where c is the customer element from the Customer entity set. In the expression tree, the projection will refer to a member of an anonymous type. As in general this can be rolled out to reach the deeper level (because if the first where wouldn't have been there, customer would be reachable by the select), when handling the projection of c.CompanyName, we'll get a hold of the SetAlias of metaData.Customer. However, that will go wrong as that query will cause an exception, we can't refer to that alias.

The solution is the concept of alias scopes. This concept is already implemented in LLBLGen Pro's core API when subqueries are used: aliases assigned there can't be used outside the scope. So I used that knowledge and made a new handler which traversed the tree and found all scopes. Per scope it collected all aliases. When a new scope was entered during expression tree handling, the aliases of that scope would be made 'in scope'. The object which controls the aliases and aliased objects then checks if the alias to return for a given object, e.g. a ParameterExpression, is 'in scope', and if not, it will find the scope which contains that alias and will check if the alias of that scope is 'in scope', till it finds one which is. In the query above, it will find that the alias for metaData.Customer isn't in scope for c.CompanyName in the projection. It will then check which scope has this alias, which is the left side of the join. The alias of that left side, x in our example SQL query, is in scope so that alias is used to refer to the set which contains the source for the projection. Pfew!

Bored yet? Yes? Let's have a little rant as last remark in this post to get you all awake again. The elements discussed here in this post show you that there's a big gap between what you write and the SQL structure and SQL query elements you will get. It's not similar to specifying a predicate here, a relation there, it's on a higher level. The advantage of this is that you work with the data as if you're working with data in memory, in objects, not in a relational database. This is also its biggest weakness however: there's no real tweakability with respect to the SQL produced: you can't say: "I want this part to be a subquery, not a join", you have less control over how the end SQL ends up looking. You might want to say that that's a blessing, and for some parts it is. However, because you don't have the flexibility of specifying the structure of the query in detail, as you can with our own query API, you can't optimize a query if it misbehaves with large sets for example.

Not all of us are also DBAs of the big iron boxes our software will target in production: some of us have to work with DBAs who know their big databases inside-out and know when a query is slow and why. If a Linq query produces slow SQL, because it for example ends up as a join and not as a subquery (1:n filters are a good example), and your DBA orders you to rewrite it, it's not always clear what you, as the C#/VB.NET developer working at the abstraction level of Linq, can do to make the query performant again, to make the query look like the DBAs version of the query which completes the task in a heartbeat instead of 5 minutes.

It's important to realize that to get an O/R mapper accepted as the data-access platform and to have the SQL generated for you instead of written by hand, it's key that these kind of things are solvable, that the developer can be comfortable with the tools at hand and re-write the queries to versions which perform better and have the same results. This is also the reason why we'll keep our own query API: flexibility in the hands of the developer, so the person who writes the query is in control of it, in every aspect.

One downside with skipping Skip and Take is that VB handles them as native query expressions, not simply extension moethods. If you were able to support the standard Skip and Take methods, you would potentially benefit from those of us in the VB world being customers.

Jim Wooley - Monday, November 26, 2007 5:09:41 PM

Jim: Take is supported (which results in TOP or LIMIT etc.). The thing which isn't supported (at the moment) is Skip.

I indeed saw after making the decision to skip skip, that VB has skip in the language.

Supporting Skip requires a change in the paging code, though that's not the main problem. The main problem is that our API has pageSize and pageNumber as parameters for the fetch methods (on some overloads). So the parameter isn't really addable to these methods as it will make them look weird.

There's a way perhaps, passing the skip size to the pageNumber or pageSize and leave the other one 0, but that's hacky. So it's not that easy, unfortunately.

It's indeed unfortunate that VB has the elements in the language. We did think about supporting Skip and have it getting reworked as the pageNumber, but that's not always possible, if the skip size isn't a multitude of the number passed to Take ;).

Skip+Take for paging is also likely going to cause some problems: you have to specify them in the right order: First skip, THEN take. If you do it the other way around, you get a different resultset. Perhaps that's the reason VB added them to the language.

FransBouma - Monday, November 26, 2007 5:28:14 PM

How long before this can be used in a released app without worry?

Visual Basic - Wednesday, December 19, 2007 7:17:12 PM

Without worry? Hard to say. There will undoubtedly be bugs in the code when it goes beta, though I assume it will quickly mature because every query which fails makes a generic piece of the code more mature and a lot of the code in a linq provider is used with every query.

FransBouma - Wednesday, December 19, 2007 7:33:16 PM

4 Comments