Developing Linq to LLBLGen Pro, day 2

(This is part of an on-going series of articles, started here)

Adding Linq support to an O/R mapper like LLBLGen Pro is a matter of choice: either you implement new SQL engines or you convert the expression trees to native query language components. The former is a lot more work and the latter will probably cause problems here and there. We decided to create a converter and I think that's the only real option for any O/R mapper developer out there who wants to support Linq queries. It's not without problems though. During my second day of working on the project I already hit a major roadblock. I'll describe that in a bit.
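
To give an idea of what such a converter boils down to: below is a minimal sketch, not our actual code, of dissecting an expression tree and mapping the nodes it recognizes onto native query elements. The NativeFieldComparePredicate type is made up for illustration.

// C#
using System;
using System.Linq.Expressions;

// hypothetical native query element, for illustration only.
public class NativeFieldComparePredicate
{
    public NativeFieldComparePredicate(Expression left, Expression right) { }
}

public class ExpressionToNativeConverter
{
    // walks an expression tree fragment and translates recognized nodes
    // into native query elements.
    public object Convert(Expression expression)
    {
        switch(expression.NodeType)
        {
            case ExpressionType.Call:
                MethodCallExpression call = (MethodCallExpression)expression;
                if(call.Method.Name == "Where")
                {
                    // Queryable.Where wraps the predicate lambda in a Quote node.
                    LambdaExpression lambda =
                            (LambdaExpression)((UnaryExpression)call.Arguments[1]).Operand;
                    return Convert(lambda.Body);
                }
                break;
            case ExpressionType.Equal:
                BinaryExpression equality = (BinaryExpression)expression;
                // e.g. c.Country == "USA" becomes a field-compare predicate.
                return new NativeFieldComparePredicate(equality.Left, equality.Right);
        }
        throw new NotSupportedException("Unhandled node: " + expression.NodeType);
    }
}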

In my previous posts I explained that the source of the Linq query is the most important part, as that's where everything starts and ends. To be able to specify the source in the context of LLBLGen Pro, I generate a class which simply returns a DataSource<T> instance for each known entity, where T is the type of the entity, e.g. CustomerEntity. The Linq oriented classes, like the QueryProvider, are placed in a separate assembly. If it turns out we can't keep things separated, they'll be merged into our runtime library at release (the .NET 3.5 build, that is); otherwise we'll ship them as a separate dll. With the roadblock I hit this morning, things look a bit more complicated: a separate assembly alone isn't going to cut it, changes to the runtime library will be needed as well.
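
A hedged sketch of what such a generated class could look like (the class name and exact shape below are illustrative; DataSource<T> and the entity types come from the framework and the generated code):

// C#
public class LinqMetaData
{
    // one property per known entity; each returns the query source the
    // Linq provider starts from.
    public DataSource<CustomerEntity> Customer
    {
        get { return new DataSource<CustomerEntity>(); }
    }

    public DataSource<OrderEntity> Order
    {
        get { return new DataSource<OrderEntity>(); }
    }
}

// usage:
// LinqMetaData metaData = new LinqMetaData();
// var q = from c in metaData.Customer select c;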

So which problem did I run into? It's actually based on a little Linq to Sql query posted in the C# newsgroup this morning. The query looked something like this:

// C#
var q = from c in nw.Customers
        select new { ContactName = c.ContactName, TotalOrders = 
            c.Orders.Sum(o => o.Order_Details.Sum(
                od => (double?)od.Quantity * (double?)od.UnitPrice)) };

This query retrieves, for every Northwind customer, the overall total of all the orders that customer has placed. I looked at it for a while and wondered: is that 'Sum' executed in memory or in the database? So I added it to a little Linq to Sql project, and it resulted in this query:

-- T-SQL
SELECT [t0].[ContactName], (
    SELECT SUM([t4].[value])
    FROM (
        SELECT (
            SELECT SUM([t3].[value])
            FROM (
                SELECT (CONVERT(Float,[t2].[Quantity])) * 
                       (CONVERT(Float,[t2].[UnitPrice])) AS [value], [t2].[OrderID]
                FROM [dbo].[Order Details] AS [t2]
                ) AS [t3]
            WHERE [t3].[OrderID] = [t1].[OrderID]
            ) AS [value], [t1].[CustomerID]
        FROM [dbo].[Orders] AS [t1]
        ) AS [t4]
    WHERE [t4].[CustomerID] = [t0].[CustomerID]
    ) AS [value]
FROM [dbo].[Customers] AS [t0]

Yes, I was also surprised to see that at first, but of course the nested Sum() aggregates leave little room for a different approach. However, if I had to write this in plain SQL, I'd do it like this:

-- T-SQL
SELECT   c.contactname, SUM(quantity * unitprice) AS TotalOrders
FROM     customers c
         INNER JOIN orders o ON c.customerid = o.customerid
         INNER JOIN [order details] od ON o.orderid = od.orderid
GROUP BY c.contactname

A completely different approach. The reason I picked this one and not the nested selects is that it's very efficient (check the execution plans if you want). So I wondered: how can I rewrite my Linq to Sql query to get this group by query over the joined set? But... I don't know! I tried everything I could think of, but I couldn't get a compilable query which gave me the right results!

The thing is that LLBLGen Pro doesn't support derived tables in query specifications (the FROM (SELECT ...) constructs), because it determines the FROM clause of a select from the elements passed to the fetch method; this is easier for the developer using the API and it leads to fewer mistakes. For example, the group by query posted earlier can be written in LLBLGen Pro (using Adapter) as:

// C#
// define the two fields in the resultset: the contact name and a
// SUM() aggregate over quantity * unitprice, aliased 'Total'.
ResultsetFields fields = new ResultsetFields(2);
fields.DefineField(CustomerFields.ContactName, 0);
fields.DefineField(new EntityField2("Total",
        (OrderDetailsFields.Quantity * OrderDetailsFields.UnitPrice), AggregateFunction.Sum), 1);
// the relations added here determine the FROM clause: customers joined
// with orders, joined with order details.
RelationPredicateBucket filter = new RelationPredicateBucket();
filter.Relations.Add(CustomerEntity.Relations.OrderEntityUsingCustomerId);
filter.Relations.Add(OrderEntity.Relations.OrderDetailsEntityUsingOrderId);
// group on the first field, the contact name.
GroupByCollection groupBy = new GroupByCollection();
groupBy.Add(fields[0]);
DataTable table = new DataTable();
using(DataAccessAdapter adapter = new DataAccessAdapter())
{
    adapter.FetchTypedList(fields, table, filter, 0, null, true, groupBy);
}

As it fetches the set into a DataTable, it's of course not really 'typed', but for the rest it gets the data out as expected, with the query as expected. (I can also fetch it as a datareader and project it onto a class if I want to; the query given is just for illustration purposes, so I used a DataTable.) The net result is that I have a Linq query I can't convert to LLBLGen Pro constructs, and a query which is specifiable in LLBLGen Pro constructs but which I can't formulate in Linq. And it's not even Monday!
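
As an aside, projecting the fetched DataTable onto a class, as mentioned above, could look like the sketch below. It's plain .NET, nothing LLBLGen Pro specific; CustomerTotal is a hypothetical DTO, and the column names match the fields defined in the previous snippet.

// C#
using System;
using System.Collections.Generic;
using System.Data;

public class CustomerTotal
{
    public string ContactName { get; set; }
    public decimal TotalOrders { get; set; }
}

public static class Projections
{
    // copies each row of the fetched resultset into a CustomerTotal instance.
    public static List<CustomerTotal> ProjectTotals(DataTable table)
    {
        List<CustomerTotal> results = new List<CustomerTotal>();
        foreach(DataRow row in table.Rows)
        {
            results.Add(new CustomerTotal
            {
                ContactName = (string)row["ContactName"],
                TotalOrders = row["Total"] == DBNull.Value
                        ? 0m
                        : Convert.ToDecimal(row["Total"])
            });
        }
        return results;
    }
}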

Clash of the Paradigms
So I wondered... why is it so hard to write that SQL query in Linq (or why am I so stupid not to understand the C# 3.0 spec)? And then it hit me: Linq isn't set oriented, but SQL is. At least, that's my conclusion. The group by approach is logical from a math / set oriented viewpoint, but it's an odd approach if you look at it from an imperative / functional viewpoint. In imperative / functional code, you specify what has to be done, and at the end of your set of statements you arrive at your result.

I have no idea if my group by SQL query is even possible with Linq, but possible or not, the conclusion can only be: developers using the Linq extensions to C# and VB.NET will think in an imperative way. They won't bother with the fact that they're now supposed to switch their minds into Mode.SetOriented (poke &H6EF8, &H0E to be exact); they'll just want to write the query as if they were writing it in C#, as that's what they're actually doing. For the people who think Linq will therefore have a small learning curve, I'm afraid I have to disappoint you: the same problems arise as with every other O/R mapper query language: 'how do I formulate this [insert complex SQL written by a 70 year old Oracle DBA here] SQL statement in [insert O/R mapper query language constructs here]?'. That won't go away, simply because there's no 1:1 mapping between Linq and SQL.

So the problem for me then becomes: what to do? It's not as if the query presented is one only some bloke in South East Alaska would run once a year. However, it's also not the end of the world: the developer won't be able to avoid all O/R mapper constructs anyway. There will be O/R mapper specific language elements in the final Linq query or around it, and that's not even mentioning the CUD (Create, Update, Delete) operations. So if a given required set of data isn't specifiable in Linq, it's likely the native O/R mapper query language constructs will offer a way to obtain the data in that form. The option to not support given Linq constructs therefore isn't as big a deal as it seems.

However, the one true reason we're doing this Linq implementation in the first place is marketing and strategy: to provide an upgrade path for users of Linq to Sql towards our framework, and to offer people who have enjoyed Linq courses or have experience with Linq a way to leverage that experience on our framework. It doesn't add any functionality; the framework already supports almost every way you'd want to fetch data, in whatever wacky format you can think of. So the only really satisfying solution is to solve the problem of the derived tables and be able to support as many Linq queries as possible.

My initial research shows that of our supported databases, only Firebird 1.5 doesn't support derived tables (even MySql does, who would have thought!). I can live with that, as there's a v2.0 of Firebird which does support them. The only thing I have to solve is: how do I add this to our API elegantly, without breaking a lot of code already out there, and in such a way that the amount of code to change is minimal? This is the root aspect of maintainable software, and one we all can't avoid sooner or later. So my next stop is to alter our runtime library API in such a way that a derived table specification is possible, so I can formulate the Linq constructs in our own query language elements without much conversion. It's of course (in theory) also possible to transform the group of nested selects into a joined set with a group by; after all they lead to the same result, and the nested selects actually represent a join + group by. However, my head already starts to hurt when I think about that, so I leave it to the set-theory junkies over at MIT.
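
To make that direction a bit more concrete: a derived table specification could end up looking something like the sketch below. The DerivedTableDefinition type and its constructor are assumptions about the API-to-be, not the final design; the other types are the existing ones used earlier.

// C#
// the inner select: SELECT OrderId, Quantity * UnitPrice AS LineTotal
//                   FROM [Order Details]
ResultsetFields innerFields = new ResultsetFields(2);
innerFields.DefineField(OrderDetailsFields.OrderId, 0);
innerFields.DefineField(new EntityField2("LineTotal",
        (OrderDetailsFields.Quantity * OrderDetailsFields.UnitPrice)), 1);

// hypothetical: wrap the select so it can act as a table in the FROM
// clause, i.e. FROM ( SELECT ... ) AS [od]
DerivedTableDefinition derivedTable = new DerivedTableDefinition(innerFields, "od");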

5 Comments

  • Try the following:

    var q2 = from det in nw.Order_Details
             group det by det.Order.Customer into g
             select new
             {
                 g.Key.ContactName,
                 TotalOrders = g.Sum(d => (double)d.Quantity * (double?)d.UnitPrice)
             };

    It generates the following, which actually gives a better execution plan (at least theoretically) than your custom SQL:

    SELECT [t4].[ContactName], [t3].[value] AS [TotalOrders]
    FROM (
        SELECT SUM([t2].[value]) AS [value], [t2].[CustomerID]
        FROM (
            SELECT (CONVERT(Float,[t0].[Quantity])) * (CONVERT(Float,[t0].[UnitPrice])) AS [value], [t1].[CustomerID]
            FROM [dbo].[Order Details] AS [t0]
            INNER JOIN [dbo].[Orders] AS [t1] ON [t1].[OrderID] = [t0].[OrderID]
            ) AS [t2]
        GROUP BY [t2].[CustomerID]
        ) AS [t3]
    LEFT OUTER JOIN [dbo].[Customers] AS [t4] ON [t4].[CustomerID] = [t3].[CustomerID]

    Curiously enough, I just ran all three versions with SQL Profiler running.
    Original LINQ SQL: 38 reads
    My LINQ SQL: 194 reads
    Your SQL: 242 reads

    Very interesting!

  • Luke: I doubt these read numbers match. Northwind contains over 800 rows in Order Details, and every one of them has to be read to calculate each order's total; all customers have to be read as well, so that's way more than 38 reads. Or I don't understand the SQL Profiler read numbers (could be, I don't play with it that much ;))

    The query you posted is indeed interesting, thanks for that :). Let's wait for a SQL Server DBA to tell us which one is really faster :)

  • You could try something like this?

    from c in Pm_customers
    join o in Pm_orders on c.Id equals o.Customerid
    join od in Pm_order_products on o.Id equals od.Orderid
    group od by c.Id into groups
    select new { TotalAmount = groups.Sum(od => od.Qty * od.Price) }

    This is obviously for a different db than you are using, but this generates this sql...

    SELECT SUM((CONVERT(Float,[t2].[qty])) * [t2].[price]) AS [TotalAmount]
    FROM [pm_customers] AS [t0]
    INNER JOIN [pm_orders] AS [t1] ON [t0].[id] = [t1].[customerid]
    INNER JOIN [pm_order_products] AS [t2] ON [t1].[id] = [t2].[orderid]
    GROUP BY [t0].[id]

  • Luke: Still odd that it can read much less data in one query than in another while it still has to access the same # of rows... :)

    Justing: cool! :) Thanks for showing the query; I already wondered if it would be possible at all. The odd group by statement (IMHO) is something Microsoft should really add documentation on, I think; it's not trivial.

    Anyway, I'll still proceed with the derived table implementation as well, as there will be users who will write the query the way I described it in the blog post, so it has to be possible to emit it as derived tables. :)

  • hi Todd,
    It is a trade-off: writing the stuff in the O/R mapper's native query system can be a bit tedious sometimes, and it includes compromises. The advantage is that you have a single system to access the DB. If you branch out to hand-coded SQL, you'll probably run into a problem when you want to target a different DB with your code: you can do that with the O/R mapper queries, you can't with the SQL.

    It also gets pretty complex to mimic things like prefetch paths (eager loading) with handwritten SQL, as these queries are optimized per node based on the data of their parent node, and that's pretty hard to do in hard-coded SQL ;)
