Developing Linq to LLBLGen Pro, Day 0

Tuesday, September 11, 2007

Now that v2.5 of LLBLGen Pro is out the door and the release stress has gone away, it's time to pick up the next project: Linq support for LLBLGen Pro. It will be rolled into v2.6 of LLBLGen Pro, which is scheduled for Q4 2007.

This time around, we thought it would be fun to write a blog post for every day I work on Linq support, describing the achievements of that day. I've no idea how long this will take, but my intuition says it won't take months. There are a couple of problems to overcome though; some I'll address briefly below and others I'll discuss in due time.

As LLBLGen Pro has a lot of features and already has a true OO query system, it's logical to create an expression tree converter instead of a new SQL emitting system. The expression tree converter should simply create LLBLGen Pro query objects, like predicate expressions, relation collections, excluded fields lists, projections etc., and pass them to the appropriate fetch method when the Linq query is executed. The previous sentence already shows that 'Linq' is actually all about fetching, i.e. the 'R' in CRUD. The other 3 types of queries executed on a database aren't specified in Linq constructions; these are specified using the native O/R mapper elements.
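To make the idea of an expression tree converter concrete, here is a minimal sketch that turns one very simple lambda into a query object. The SketchPredicate and PredicateConverter types are hypothetical stand-ins invented for illustration; they are not LLBLGen Pro classes, and a real converter has to handle far more node types than this one does.

```csharp
using System;
using System.Linq.Expressions;

// Hypothetical stand-in for an O/R mapper predicate object (not an LLBLGen Pro type).
public sealed class SketchPredicate
{
    public string FieldName { get; }
    public string Operator { get; }
    public object Value { get; }

    public SketchPredicate(string fieldName, string op, object value)
    {
        FieldName = fieldName;
        Operator = op;
        Value = value;
    }

    public override string ToString() => $"{FieldName} {Operator} {Value}";
}

public class Customer
{
    public string Country { get; set; }
}

public static class PredicateConverter
{
    // Handles only the shape "entity.Property <op> literal"; anything else is rejected.
    public static SketchPredicate Convert<T>(Expression<Func<T, bool>> filter)
    {
        var binary = filter.Body as BinaryExpression;
        if (binary == null)
            throw new NotSupportedException("Only binary comparisons are handled in this sketch.");

        var member = binary.Left as MemberExpression;
        var constant = binary.Right as ConstantExpression;
        if (member == null || constant == null)
            throw new NotSupportedException("Expected 'property <op> literal'.");

        string op;
        switch (binary.NodeType)
        {
            case ExpressionType.Equal: op = "="; break;
            case ExpressionType.NotEqual: op = "<>"; break;
            case ExpressionType.GreaterThan: op = ">"; break;
            case ExpressionType.LessThan: op = "<"; break;
            default:
                throw new NotSupportedException("Operator not handled: " + binary.NodeType);
        }

        return new SketchPredicate(member.Member.Name, op, constant.Value);
    }
}

public static class Program
{
    public static void Main()
    {
        // The lambda is compiled to an expression tree, not to IL, so it can be inspected.
        SketchPredicate predicate = PredicateConverter.Convert<Customer>(c => c.Country == "Germany");
        Console.WriteLine(predicate); // prints: Country = Germany
    }
}
```

The point of the sketch is the direction of the translation: the tree is mapped onto query objects that already exist, rather than onto SQL text.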

This is also one of the drawbacks of the whole Linq system: you can't really rely on it as a truly generic O/R mapper query language. You'll always have to fall back on O/R mapper specific elements to use it to its maximum potential or, worse, to be able to perform a given action on the data in the database at all. There are other drawbacks to the current state of Linq: the syntax lacks some constructs, which again makes it impossible to create a truly generic O/R mapper query language. One of the main pain points I'm referring to is the lack of prefetch path specifications (eager loading) inside the query text. There's always a single select command in any given Linq query (it always results in a single set), so there's no room for any graph oriented fetching whatsoever, unless an extension method is created to specify prefetch paths/spans. That is precisely the thing you don't want, because it ties the query to the O/R mapper used, and the big question then becomes: why bother with Linq if you can also use the native O/R mapper syntax for querying entities?

Let's look at an example: say we want to fetch two graphs: A) all customers from Germany with their orders and order details, and B) all products and their orders (m:n). These graphs are separate at the moment, but you can of course see that product entities from graph B could end up in graph A, for example if we add a new order detail entity to a given order. Linq gives us no way to specify inside the query that we want to fetch these two graphs; we have to specify the graph layout outside the query. Linq to Sql, for example, does this via LoadOptions, which are specified on the context used. And that's precisely where it goes wrong, because of the deferred execution of Linq queries: if you specify the query for graph A together with the LoadOptions for that graph, you can't also specify the LoadOptions for graph B (let's pretend Linq to Sql can load m:n relations), even though the graphs might contain the same entities, as graph B might be fetched using a filter which is relevant for B but not for A. The issue is that the LoadOptions are used when the query is executed, which can be somewhere else.

This gives headaches which are unnecessary: which entities to load eagerly (via paths, spans or whatever you want to call them) together with the entities you're fetching with the query is an element which should be part of that query, as it only has value in the context of that query, not another query. So any O/R mapper which wants to allow its users to specify entity graphs for fetching using Linq has to come up with some sort of extension method. That's of course doable, it's just that it defeats the whole purpose of Linq: it's then 'just another querying method' to achieve the same thing as with the native O/R mapper constructs. The more a user has to fall back onto the native O/R mapper query elements, the more it will become clear that Linq is actually sugar you can live without when it comes to fetching entities from the database.
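The deferred execution problem can be shown with plain Linq to Objects. The sketch below is only an analogy: the filterCountry variable stands in for context-wide state such as load options, and no Linq to Sql API is used.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class Program
{
    public static void Main()
    {
        var countries = new List<string> { "Germany", "France", "Germany" };

        // Illustrative stand-in for state that lives outside the query text.
        string filterCountry = "Germany";

        // Defining the query does not run it. The lambda captures the variable, not its value.
        IEnumerable<string> query = countries.Where(c => c == filterCountry);

        // The outside state changes before the query is enumerated...
        filterCountry = "France";

        // ...so execution sees "France", not the value in effect when the query was written.
        Console.WriteLine(query.Count()); // prints: 1
    }
}
```

Anything the query depends on but does not contain is read at execution time, which is why settings that belong to one query are safest when they travel with that query.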

As Linq is meant to be used on entities in a database, objects in memory, XML and whatever else you can cook up on a rainy Sunday afternoon, it carries the limitations needed to be useful on all these targets. One of the things which will cause a lot of trouble is the lack of the 'left/right' join keywords: you have to go the route of join into and then select again. Say you want to fetch all employees from Northwind who haven't filed an order yet (let's say there are employees in Northwind who haven't filed an order yet, ok?). That's typically a left join query where you test on NULL:

-- TSQL, using '*' for simplicity
SELECT E.* FROM Employees E LEFT JOIN Orders O
ON E.EmployeeID = O.EmployeeID
WHERE O.OrderID IS NULL

So, in Linq, how would you specify this? Intuition tells us to try this first:

 // C#
var q = from e in nw.Employee
	left join o in nw.Order on e.EmployeeID equals o.EmployeeID
	where o.OrderID == null
	select e;

Looks fine? It doesn't compile: 'left' isn't a valid keyword there. You should use a GroupJoin. A what? A group join. Enter the wonderful world where scientists design APIs, enter the world of 'doing it the hard way just because life is already easy enough'. Perhaps the same query can be used on XML, but I don't give a hoot: I'm talking to an RDBMS, which is pretty clear, as I specified an IQueryable implementing object as the source of the query. In databases, we use ANSI joins (well, of course, the poor sods who are still on Oracle 7 or 8 don't) and because the rest of the query has similarities with SQL all the way, why are 'left' and 'right' left out? (pun intended).

So how should this simple query be formulated instead? Check it out:

var q = from e in nw.Employee
	join o in nw.Order on e.EmployeeID equals o.EmployeeID into oe
	from o in oe.DefaultIfEmpty()
	where o.OrderID == null
	select e;

Now, you'll find little info about this on the net. There's a forum post written by Anders Hejlsberg where he explains how super cool this group join construct is in relation to consuming XML. But here it's totally artificial: the whole DefaultIfEmpty() vehicle is simply thrown away and never used, and the hierarchical tuples in 'oe' are never really created, as the whole query is simply executed as a single statement on the database, where 'oe' is actually the hash matcher result in the relational algebra executed on the RDBMS to match elements from both sources. If you look at the expression tree produced for this Linq query, you'll see that optimizing this away isn't that trivial. A 'left' or 'right' keyword would have been much simpler.
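To see what a provider actually receives, it helps to write the query as the method calls the compiler roughly translates it into. The sketch below runs against in-memory lists, so the Employee and Order classes and the sample data are assumptions for illustration. In memory the test has to be 'o == null', because DefaultIfEmpty() yields a null reference for employees without orders.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Employee
{
    public int EmployeeID { get; set; }
    public string Name { get; set; }
}

public class Order
{
    public int OrderID { get; set; }
    public int EmployeeID { get; set; }
}

public static class Program
{
    public static void Main()
    {
        var employees = new List<Employee>
        {
            new Employee { EmployeeID = 1, Name = "Nancy" },
            new Employee { EmployeeID = 2, Name = "Andrew" }
        };
        var orders = new List<Order>
        {
            new Order { OrderID = 10, EmployeeID = 1 }
        };

        var withoutOrders = employees
            // 'join ... into oe': pair every employee with the group of matching orders.
            .GroupJoin(orders, e => e.EmployeeID, o => o.EmployeeID, (e, oe) => new { e, oe })
            // 'from o in oe.DefaultIfEmpty()': flatten, keeping employees with an empty group.
            .SelectMany(x => x.oe.DefaultIfEmpty(), (x, o) => new { x.e, o })
            // 'where ... == null': keep only the rows where no order matched.
            .Where(x => x.o == null)
            .Select(x => x.e);

        foreach (Employee e in withoutOrders)
        {
            Console.WriteLine(e.Name); // prints: Andrew
        }
    }
}
```

An expression tree for the database version contains this same GroupJoin/SelectMany/DefaultIfEmpty chain, which is the pattern a provider has to recognize and collapse into a single LEFT JOIN.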

There are other things, like transaction management so you can execute updates and selects inside a transaction without deadlocks, which will likely lead to extension methods which tie the code to the O/R mapper used, but as there are already necessary extension methods to add, one more isn't going to hurt. However, the question why you want to use Linq for DB queries as it doesn't really bring that much to the table which wasn't available in the native O/R mapper query language, will stay up in the air for some time to come I think.

Since Linq was introduced to the public, people have discussed how to create extensible queries: queries to which you can append constructs to refine the outcome, for example based on user input. One of the ways to do so is by using originalQuery.SelectMany(func).Where... which simply behaves as a new select around the select of the original query. An example can be found in the Linq to Sql article Ian Griffiths posted yesterday, where he appends a query to find the Max price from a set of entities defined by another query. Typically, this leads to a derived table select, where the original query is used as one of the elements in the FROM clause of the actual select. However, not all databases on the planet support derived tables; Firebird 1.5, for example, doesn't (v2.0 does though). Of course, Linq to Sql isn't bothered by this, but Linq isn't all about Linq to Sql. So optimizing this away will be the real challenge, also because LLBLGen Pro doesn't support the specification of a derived table in its interface to developers (not a lot of O/R mappers do; it's almost always only supported internally to produce a given query result).
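Appending to an existing query looks like this in code. The sketch uses an in-memory array wrapped with AsQueryable() so it is self-contained; the Product class, the InCategory helper and the data are made up for illustration.

```csharp
using System;
using System.Linq;

public class Product
{
    public string Category { get; set; }
    public decimal Price { get; set; }
}

public static class Program
{
    // Appends a refinement to whatever query the caller has already built.
    public static IQueryable<Product> InCategory(IQueryable<Product> source, string category)
    {
        return source.Where(p => p.Category == category);
    }

    public static void Main()
    {
        IQueryable<Product> all = new[]
        {
            new Product { Category = "Beverages", Price = 18m },
            new Product { Category = "Beverages", Price = 4m },
            new Product { Category = "Condiments", Price = 22m }
        }.AsQueryable();

        IQueryable<Product> expensive = all.Where(p => p.Price > 10m);
        IQueryable<Product> refined = InCategory(expensive, "Beverages");

        // Each appended operator wraps the expression of the query it was applied to.
        Console.WriteLine(refined.Expression);

        // The aggregate is appended last and triggers execution.
        Console.WriteLine(refined.Max(p => p.Price)); // prints: 18
    }
}
```

Because every appended operator nests the previous expression, a naive translation produces one nested select per step, which is exactly the derived table shape discussed above.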

A day 0 in a project is typically about setting up the environment, getting started with reading docs etc. I've already read parts of the C# 3.0 specification to see what kind of keywords are possible in a Linq query. This helps to find out what could be constructed as a Linq query and thus what can be expected as an expression tree: covering every possible scenario with unit tests would take years if not longer, so it's key to have a different way to prove that the code is correct, and I like to use syntax specifications for that. Also, as there's not that much information available at the moment (the MSDN docs aren't complete yet), a lot of trial/error programming will likely follow, but with the proper tools it should be kept to a minimum I think. One key element is a good expression tree viewer. Luckily, in VS.NET 2008 b2 there's a debugger viewer for expression trees in the CSharpExamples archive. With a little hacking, the source code can be transformed into a normal viewer which can be embedded in the test code to see what a query construct looks like in an expression tree.

A Linq supporting layer has to implement IQueryable and supply a provider which actually consumes the expression trees and produces the output. Microsoft has done a good job by providing a lot of extension methods on IQueryable<T> (in the Queryable class) which handle passing the input to the provider to produce a new IQueryable object. This makes life easier, so we, O/R mapper developers, don't have to write all these extension methods ourselves.
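The division of labour between IQueryable and its provider can be seen in a bare-bones implementation. The sketch below is an assumption-laden toy: LoggingProvider and LoggingQuery<T> are invented names, and instead of translating the expression tree the provider only prints it and returns a default value.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

// Hypothetical minimal provider: it only logs the expression tree it is asked to run.
public sealed class LoggingProvider : IQueryProvider
{
    // Called by the Queryable extension methods (Where, Select, ...) to wrap the new tree.
    public IQueryable<TElement> CreateQuery<TElement>(Expression expression)
    {
        return new LoggingQuery<TElement>(this, expression);
    }

    public IQueryable CreateQuery(Expression expression)
    {
        throw new NotSupportedException("Non-generic queries are out of scope for this sketch.");
    }

    // A real provider would translate the tree and run it; this one returns a default value.
    public TResult Execute<TResult>(Expression expression)
    {
        Console.WriteLine("Would translate: " + expression);
        return default(TResult);
    }

    public object Execute(Expression expression)
    {
        return Execute<object>(expression);
    }
}

public sealed class LoggingQuery<T> : IQueryable<T>
{
    public LoggingQuery(IQueryProvider provider, Expression expression = null)
    {
        Provider = provider;
        // The root of every tree is a constant node referring to the query source itself.
        Expression = expression ?? Expression.Constant(this);
    }

    public Type ElementType { get { return typeof(T); } }
    public Expression Expression { get; }
    public IQueryProvider Provider { get; }

    // Enumerating the query is what hands the accumulated tree to the provider.
    public IEnumerator<T> GetEnumerator()
    {
        IEnumerable<T> result = Provider.Execute<IEnumerable<T>>(Expression);
        return (result ?? Enumerable.Empty<T>()).GetEnumerator();
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}

public static class Program
{
    public static void Main()
    {
        var customers = new LoggingQuery<string>(new LoggingProvider());

        // Queryable.Where builds a method call node and passes it to CreateQuery.
        IQueryable<string> query = customers.Where(c => c.StartsWith("A"));

        // Queryable.Count appends its own node and calls Execute<int>.
        int count = query.Count();
        Console.WriteLine(count);
    }
}
```

All the operator plumbing comes from the framework's Queryable methods; the only parts a mapper author writes are the two small types above plus, of course, the real translation inside Execute.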

Ok, the initial analysis is done. I'm starting today on the first code, to see how things behave and to check whether what I think is done is actually done. Stay tuned for more posts in this series.

"However, the question why you want to use Linq for DB queries as it doesn't really bring that much to the table which wasn't available in the native O/R mapper query language, will stay up in the air for some time to come I think."

I'm a complete newbie, but how about compiler checked syntax? If I change my field name I don't have to go hunt down where that field appears.

I'll be following your progress; it's always interesting for me to hear people from outside MS talk about LINQ as they usually are more critical -- and I need that dose of criticism :-)

PBZ - Tuesday, September 11, 2007 12:40:38 PM

PBZ: LLBLGen Pro already has a compile-time checked Query system :)
The example query I gave with the left join:
EntityCollection<EmployeeEntity> employees = new EntityCollection<EmployeeEntity>();
RelationPredicateBucket filter = new RelationPredicateBucket();
// Left join Employee -> Order, then keep only the employees without a matching order.
filter.Relations.Add(EmployeeEntity.Relations.OrderEntityUsingEmployeeId, JoinHint.Left);
filter.PredicateExpression.Add(OrderFields.OrderId == DBNull.Value);
using(DataAccessAdapter adapter = new DataAccessAdapter())
{
	adapter.FetchEntityCollection(employees, filter);
}

This system allows you to extend it to no end, without falling back to string-based queries.

FransBouma - Tuesday, September 11, 2007 1:49:17 PM

What about the "var" keyword used with LINQ?

Why would I want an "unknown" return type if I can't use it as the return type of a method?

i.e.:
public var GetEmployeeName(int employeeId)
{

var q = from e in nw.Employee
where e.EmployeeId== employeeId
select new {e.EmployeeId, e.Name};

return q;
}

This won't compile. It is not possible to have an "unknown" type as the return type of a method. But I just DON'T know what “q” is, so what should I use to replace the "var" in the method signature? Before you answer, remember that it must be strongly typed.

So it seems that having my method return something like EntityCollection is still the better approach.

Dals - Tuesday, September 11, 2007 6:07:55 PM

Frans, judging from what you say, would it be totally false to say that there's not much to gain from adding LINQ support, apart from the obvious marketing value of saying that LLBL is "Linq compatible"? :)

Renaud Martinon - Wednesday, September 12, 2007 8:59:12 AM

I'm a fully-paid up owner and user of LLBLGen 2, so I think I'm in a position to throw my 2 cents in.

LLBLGen is definitely far, far more powerful and feature-packed than LINQ. LLBLGen is also targeted towards RDBMS, rather than being generic.

BUT - building queries with LLBLGen syntax and methods is a whole order of magnitude harder than building them in LINQ, which is far more natural to look at and simpler to understand. LLBLGen has a steep learning curve, LINQ doesn't.

For a lot of people, like me, LINQ is a huge bonus. The big benefit for me of LINQ + LLBLGen is targeting of DBs other than MSSQL.

Boris Yeltsin's Zombie - Wednesday, September 12, 2007 9:57:09 AM

p.s. Keep up the good work. I really appreciate how passionate you are about O/R and how insanely knowledgeable you are. There are only a handful of folks I can see out there in the blogosphere who are pushing back against Microsoft and keeping them honest in the development of things like LINQ which are damned complicated under the covers. It needs people who can annoy Microsoft enough to fix all the scenarios, like the one mentioned above, and make their products better for everyone.

Boris Yeltsin's Zombie - Wednesday, September 12, 2007 10:00:01 AM

@BY's Zombie: (hehe) I agree that there's a learning curve with LLBLGen Pro's query system, but I don't think it is much smaller with Linq. The thing is that a simple select * from table query is easier, but often you have more complex queries to execute, which makes it not that easy to jump right into the Linq syntax, and there are many details you have to learn while working with it, so I think the curve isn't that much smaller.

What can be helpful is if someone already has Linq experience and switches to another O/R mapper which supports Linq: then it pays off somewhat. However, the developer still has to learn the specifics of the O/R mapper and the Linq extensions specific to that O/R mapper.

Thanks for the kind words :)

@Dals: Ok, but there you're defining a new type which will contain the projection result. To be able to use that in a typed way, you have to either define the type up front or consume the query inside the method, there's no other way, I'm afraid. It's not ruby ;)

FransBouma - Wednesday, September 12, 2007 2:12:03 PM
