Now that v2.5 of LLBLGen Pro is out the door and the release stress has gone away, it's time to pick up the next project: Linq support for LLBLGen Pro. It will be rolled into v2.6 of LLBLGen Pro, which is scheduled for Q4 2007.
This time around, I thought it would be fun to write a blog post for every day I've worked on Linq support, describing the achievements of that day. I've no idea how long this will take, but my intuition says it won't take months. There are a couple of problems to overcome though, some of which I'll address briefly below and others I'll discuss in due time.
As LLBLGen Pro has a lot of features and already a true OO query system, it's logical to create an expression tree converter instead of a new SQL emitting system. The expression tree converter should simply create LLBLGen Pro query objects, like predicate expressions, relation collections, excluded fields lists, projections etc. etc. and should pass that to the appropriate fetch method when the Linq query is executed. The previous sentence already shows that 'Linq' is actually all about fetching, i.e. the 'R' in CRUD. The other 3 types of queries executed on a database aren't specified in Linq constructions, these are specified using the native O/R mapper elements.
This is also one of the drawbacks of the whole Linq system: you can't really rely on it as a truly generic O/R mapper querying language. You'll always have to fall back on the O/R mapper specific elements to use it to its maximum potential or worse: to even be able to perform a given action on the data in the database. There are other drawbacks to the current state of Linq: the syntax lacks some constructs, which again makes it impossible to create a truly generic O/R mapper query language. One of the main pain points I'm referring to is the lack of prefetch path specifications (eager loading) inside the query text: there's always a single select command in any given Linq query (it always results in a single set), so there's no room for any graph oriented fetching whatsoever, unless an extension method is created to specify prefetch paths/spans. That is precisely the thing you don't want, because it ties the query specified to the O/R mapper used, and the big question then becomes: why bother with Linq if you can also use the native O/R mapper syntax for querying entities?
Let's look at an example: say we want to fetch two graphs: A) all customers from Germany and their orders and their order details, and B) all products and their orders (m:n). These graphs are separate at the moment, but you can of course see that we could add product entities from graph B to graph A if we, for example, add a new Order detail entity to a given order. If we look at what Linq provides us, we have no way to specify inside the query that we want to fetch these two graphs: we have to specify the graph layout outside the query. Linq to Sql for example does this via LoadOptions, which are specified on the context used. And that's precisely where it goes wrong, because of the deferred execution of Linq queries: if you specify the query for graph A with the LoadOptions for that graph, you can't also specify the LoadOptions for graph B (let's pretend Linq to Sql can load m:n relations), even though both graphs might contain the same entities, as graph B might be fetched using a filter which is relevant for B but not for A. The issue is that the LoadOptions are used when the query is executed, which can be somewhere else entirely.
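To make the timing problem concrete, here's a minimal in-memory sketch. It doesn't use Linq to Sql at all: `SketchContext`, `LoadPath` and `Customers()` are hypothetical names I made up as stand-ins for a context with context-wide load settings. The only thing it shows is that a setting which lives outside the query is read when the query runs, not when the query is written.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a context whose eager-load settings live outside the query.
public sealed class SketchContext
{
    // Mutable, context-wide setting, comparable in spirit to LoadOptions.
    public string LoadPath { get; set; } = "(none)";

    private readonly string[] _customers = { "ALFKI", "BLAUS" };

    // Deferred: the iterator body only runs when the result is enumerated,
    // so LoadPath is read at execution time, not at definition time.
    public IEnumerable<string> Customers()
    {
        foreach (var c in _customers)
            yield return c + " loaded with path: " + LoadPath;
    }
}

public static class LoadOptionsSketch
{
    public static List<string> Run()
    {
        var ctx = new SketchContext();

        // Setting meant for graph A, followed by the (deferred) query for graph A.
        ctx.LoadPath = "Orders.OrderDetails";
        var graphA = ctx.Customers().Where(c => c.StartsWith("A"));

        // Somewhere else in the code, the setting is changed for graph B.
        ctx.LoadPath = "Products";

        // Graph A is executed only now, and it sees the setting meant for graph B.
        return graphA.ToList();
    }
}
```

Calling `LoadOptionsSketch.Run()` gives a single element, "ALFKI loaded with path: Products": the query for graph A picked up the path that was intended for graph B.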
This gives headaches which are unnecessary: which entities to load eagerly (via paths, spans or whatever you want to call them) together with the entities you're fetching with the query is an element which should be part of that query, as it only has value in the context of that query, not another query. So any O/R mapper which wants to allow its users to specify entity graphs for fetching using Linq has to come up with some sort of extension method. That's of course doable, it's just that it defeats the whole purpose of Linq: it's then 'just another querying method' to achieve the same thing as with the native O/R mapper constructs. The more a user has to fall back onto the native O/R mapper query elements, the more it will become clear that Linq is actually sugar you can live without when it comes to fetching entities from the database.
As Linq is meant to be used on entities in a database, objects in memory, XML and whatever else you can cook up on a rainy Sunday afternoon, it embeds the limitations needed to be useful on all these targets. One of the things which will cause a lot of trouble is the lack of the 'left/right' join keywords: you have to go the route of join into and then select again. Say you want to fetch all employees from Northwind who haven't filed an Order yet (let's say there are employees in Northwind who haven't filed an order yet, ok?). That's typically a left join query where you test on NULL:
-- TSQL, using '*' for simplicity
SELECT E.*
FROM Employees E LEFT JOIN Orders O
    ON E.EmployeeID = O.EmployeeID
WHERE O.OrderID IS NULL

So, in Linq, how would you specify this? Intuition tells us to try this first:

// C#
var q = from e in nw.Employee
        left join o in nw.Order on e.EmployeeID equals o.EmployeeID
        where o.OrderID == null
        select e;
Looks fine? It doesn't compile: 'left' isn't a valid keyword there. You should use a GroupJoin. A what? A group join. Enter the wonderful world where scientists design APIs, enter the world of 'doing it the hard way just because life is already easy enough'. Perhaps the same query can be used on XML, but I don't give a hoot: I'm talking to an RDBMS, which is pretty clear, as I specified an IQueryable implementing object as the source of the query. In databases, we use ANSI joins (well, of course, the poor sods who are still on Oracle 7 or 8 don't) and because the rest of the query has similarities with SQL all the way, why are 'left' and 'right' left out? (pun intended).
So how should this simple query be formulated instead? Check it out:
var q = from e in nw.Employee
        join o in nw.Order on e.EmployeeID equals o.EmployeeID into oe
        from o in oe.DefaultIfEmpty()
        where o.OrderID == null
        select e;
Now, you'll find little info about this on the net. There's a forum post written by Anders Hejlsberg where he explains how super cool this group join construct is in relation to consuming XML. But for a database it's totally artificial: the whole DefaultIfEmpty() vehicle is simply thrown away and never used, and the hierarchical tuples in 'oe' are never really created either, as the whole query is simply executed as a single statement on the database, where 'oe' is actually the hash matcher result in the relational algebra executed on the RDBMS to match elements from both sources. If you look at the expression tree produced for this Linq query, you'll see that optimizing this away isn't that trivial. A 'left' or 'right' keyword would have been much simpler.
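For those who want to see the group join construct actually run, here's a small sketch against in-memory lists instead of a database. The `Employee` and `Order` classes and the sample rows are made up for the illustration. Note one difference with the database version: in memory, DefaultIfEmpty() really does hand you a null for an employee without orders, so the test is on `o == null` instead of on `o.OrderID`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Made-up in-memory stand-ins for the Northwind entities.
public sealed class Employee { public int EmployeeID; public string Name; }
public sealed class Order { public int OrderID; public int EmployeeID; }

public static class LeftJoinSketch
{
    public static List<string> EmployeesWithoutOrders()
    {
        var employees = new List<Employee>
        {
            new Employee { EmployeeID = 1, Name = "Nancy" },
            new Employee { EmployeeID = 2, Name = "Andrew" },
            new Employee { EmployeeID = 3, Name = "Janet" }
        };
        var orders = new List<Order>
        {
            new Order { OrderID = 10248, EmployeeID = 1 },
            new Order { OrderID = 10249, EmployeeID = 3 },
            new Order { OrderID = 10250, EmployeeID = 3 }
        };

        // Group join + DefaultIfEmpty(): the Linq spelling of a left join.
        var q = from e in employees
                join o in orders on e.EmployeeID equals o.EmployeeID into oe
                from o in oe.DefaultIfEmpty()   // yields null when 'oe' is empty
                where o == null                 // keep only employees without a match
                select e.Name;

        return q.ToList();
    }
}
```

`LeftJoinSketch.EmployeesWithoutOrders()` returns a list with just "Andrew", the only employee in the sample data without an order.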
There are other things, like transaction management so you can execute updates and selects inside a transaction without deadlocks, which will likely lead to extension methods which tie the code to the O/R mapper used, but as there are already necessary extension methods to add, one more isn't going to hurt. However, the question why you want to use Linq for DB queries as it doesn't really bring that much to the table which wasn't available in the native O/R mapper query language, will stay up in the air for some time to come I think.
Since Linq was introduced to the public, people have discussed how to create extensible queries: queries to which you can append constructs to refine their outcome, for example based on user input. One of the ways to do so is by using originalQuery.SelectMany(func).Where... which simply behaves as a new select around the select of the original query. An example can be found in the Linq to Sql article Ian Griffiths posted yesterday, where he appends a query to find the Max price from a set of entities defined by another query. Typically, this leads to a derived table select, where the original query is used as one of the elements in the FROM clause of the actual select. However, not all databases on the planet support derived tables, for example Firebird 1.5 doesn't (v2.0 does though). Of course, Linq to Sql isn't bothered by this, but Linq isn't all about Linq to Sql. So optimizing this away will be the real challenge, also because LLBLGen Pro doesn't support the specification of a derived table in its interface to developers (not a lot of O/R mappers do, it's almost always only supported internally to produce a given query result).
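As an illustration of what 'appending to a query' looks like in code, here's a sketch which uses an in-memory list wrapped with AsQueryable() as a stand-in for a database source. The `Product` class, the sample data and the method names are assumptions for the example, and it appends a plain Where and Max rather than the SelectMany form. A real provider would have to translate the composed expression tree into SQL, possibly with a derived table.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Made-up entity for the illustration.
public sealed class Product { public string Category; public decimal Price; }

public static class ComposeSketch
{
    // In-memory stand-in for a database source.
    public static IQueryable<Product> SampleData()
    {
        return new List<Product>
        {
            new Product { Category = "Beverages", Price = 18m },
            new Product { Category = "Beverages", Price = 19m },
            new Product { Category = "Beverages", Price = 263.5m },
            new Product { Category = "Condiments", Price = 10m }
        }.AsQueryable();
    }

    // The 'original' query, defined in one place.
    public static IQueryable<Product> BaseQuery(IQueryable<Product> source)
    {
        return source.Where(p => p.Category == "Beverages");
    }

    // Appends to the original query, optionally refined by user input.
    public static decimal MaxPrice(IQueryable<Product> source, decimal? priceCap)
    {
        IQueryable<Product> q = BaseQuery(source);
        if (priceCap.HasValue)
        {
            decimal cap = priceCap.Value;
            q = q.Where(p => p.Price <= cap);   // nothing is executed yet
        }
        return q.Max(p => p.Price);             // executes the composed query
    }
}
```

With the sample data, `ComposeSketch.MaxPrice(ComposeSketch.SampleData(), null)` gives 263.5 and passing a cap of 20 gives 19. The point is that the Max is wrapped around whatever the original query was.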
A day 0 in a project is typically setting up the environment, getting started with reading docs etc. etc. I've already read parts of the C# 3.0 specification to see what kind of keywords are possible in a Linq query. This helps to find out what could be constructed as a Linq query and thus what can be expected as an Expression tree: to cover every possible scenario with unit-tests would take years if not longer, so it's key to have a different way to prove that the code is correct, and I like to use syntax specifications for that. Also, as there's not that much information available at the moment (the MSDN docs aren't complete yet) a lot of trial/error programming will likely follow, but with the proper tools it should be kept to a minimum I think. One key element is a good expression tree viewer. Luckily in VS.NET 2008 b2, there's a debugger viewer for expression trees in the CSharpExamples archive. With a little hacking the source code can be transformed into a normal viewer which can be embedded in the test code to see what a query construct looks like in an expression tree.
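This is not the viewer from the CSharpExamples archive, but a rough sketch of the same idea in a few lines: walk an expression tree and write down the node types you run into. It assumes a framework version in which `ExpressionVisitor` is a public class (it is from .NET 4.0 onwards); `NodeTypeCollector` and `TreeDumpSketch` are names made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq.Expressions;

// Records the node type of every node it visits.
public sealed class NodeTypeCollector : ExpressionVisitor
{
    public readonly List<string> Nodes = new List<string>();

    public override Expression Visit(Expression node)
    {
        if (node != null)
            Nodes.Add(node.NodeType.ToString());
        return base.Visit(node);   // continue into the child nodes
    }
}

public static class TreeDumpSketch
{
    public static List<string> Dump()
    {
        // The compiler turns this lambda into an expression tree, not into IL.
        Expression<Func<int, bool>> filter = x => x > 5;

        var collector = new NodeTypeCollector();
        collector.Visit(filter);
        return collector.Nodes;
    }
}
```

For the lambda above, the list starts with "Lambda" and also contains "GreaterThan", "Parameter" and "Constant". Embedding something like this in test code is a cheap way to see what a query construct turns into.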
A Linq supporting layer has to implement IQueryable and supply a provider which actually consumes the expression trees and produces the output. Microsoft has done a good job by providing a lot of extension methods for IQueryable<T> (in the static Queryable class) which handle the passing of the input to the provider to produce a new IQueryable object. This makes life easier so we, O/R mapper developers, don't have to write all these extension methods ourselves.
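To give an idea of how little is needed to get the expression trees handed to you, here's a bare-bones sketch of an IQueryable<T> plus provider pair. It has nothing to do with how LLBLGen Pro will do it: `RecordingProvider`, `RecordingQuery<T>` and `ProviderSketch` are hypothetical names, and the provider doesn't translate anything, it only remembers the last expression tree it was asked to execute and returns empty results.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

// Hypothetical provider: records the tree instead of translating it to SQL.
public sealed class RecordingProvider : IQueryProvider
{
    public Expression LastExecuted;

    // Called by the Queryable extension methods (Where, Select, ...) to wrap the grown tree.
    public IQueryable<TElement> CreateQuery<TElement>(Expression expression)
    {
        return new RecordingQuery<TElement>(this, expression);
    }

    public IQueryable CreateQuery(Expression expression)
    {
        throw new NotSupportedException("Only the generic overload is sketched here.");
    }

    // Called for scalar results such as Count() or Max().
    public TResult Execute<TResult>(Expression expression)
    {
        LastExecuted = expression;
        return default(TResult);   // a real provider would run a query here
    }

    public object Execute(Expression expression)
    {
        LastExecuted = expression;
        return null;
    }
}

public sealed class RecordingQuery<T> : IQueryable<T>
{
    private readonly RecordingProvider _provider;

    public RecordingQuery(RecordingProvider provider, Expression expression = null)
    {
        _provider = provider;
        // The root of every tree is a constant which refers to the query source itself.
        Expression = expression ?? Expression.Constant(this);
    }

    public Type ElementType { get { return typeof(T); } }
    public Expression Expression { get; private set; }
    public IQueryProvider Provider { get { return _provider; } }

    // Enumerating the query is the moment of execution.
    public IEnumerator<T> GetEnumerator()
    {
        _provider.LastExecuted = Expression;
        return Enumerable.Empty<T>().GetEnumerator();
    }

    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
}

public static class ProviderSketch
{
    // Builds a query, enumerates it and reports the outermost method in the recorded tree.
    public static string RecordedMethodName()
    {
        var provider = new RecordingProvider();
        var query = new RecordingQuery<string>(provider).Where(s => s.Length > 3);
        query.ToList();
        return ((MethodCallExpression)provider.LastExecuted).Method.Name;
    }
}
```

`ProviderSketch.RecordedMethodName()` returns "Where": the Where extension method built the tree and passed it to the provider's CreateQuery, and the provider got to see it when the query was enumerated. All the actual work, converting that tree into something the database understands, starts from there.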
Ok, the initial analysis is done, I'm starting today on developing the first code to see how things behave and to see if what I think is done is also done. Stay tuned for more posts in this series.