Developing Linq to LLBLGen Pro, part 10
(This is part of an on-going series of articles, started here)
Whoa, almost a month without an update! The truth is that I wanted to finish GroupBy support before posting another article in this ongoing series, and it took almost 3 weeks to get it right. But more on that later, first some easy stuff to get things started up again.
OrderBy, ThenBy, OrderByDescending, ThenByDescending
Linq has four different sort clause extension methods. When
you use a normal query in C# or VB.NET without using
extension methods, you won't notice that there are actually
four methods. These four methods are a little weird, because
Linq expression trees and also extension method usage in
Linq to Objects do have an order in which the
statements/expressions are defined and to be used. But
perhaps I'm overlooking a tiny detail which made Microsoft
include four methods instead of one with a sort direction
parameter.
Sorting is pretty easy with expression trees. You simply convert OrderBy and ThenBy to ascending operations and the other two to descending operations. Each operation is its own 'query', with no projection nor anything else but a sort operation in a single field in the projection. In the previous article I talked about aliases and alias scopes. A sort clause isn't a scope on its own, so you can merge them all back together afterwards, by merging the parent in front of the source to keep the right order. As scope determination earlier in the process already defines which 'query' definitions can be merged and which don't (and thus become a derived table in their parent's source (FROM clause)), you won't have any problem with misplaced order by statements. Misplaced order by statements are order by statements before a where for example: in SQL an ORDER BY is always placed after a WHERE. The main reason is that the ORDER BY is executed on the projection result of the rest.
Aggregates without group by
After I implemented the sort operations in a couple of
hours, I hoped the rest would be as easy as these, after I
had seen the problems
Matt Warren
had in
his last post about order by operators
in his Linq provider example. Two of the biggest hurdles yet
to take in our Linq provider were yet to be taken and the
sooner these were out of the way the better. One of them
were Aggregates, the other its ugly step-brother
GroupBy.
Linq defines 5 (actually 6) aggregate functions: Average,
Count, Sum, Min and Max. It also contains a LongCount, which
is actually simply Count but with an Int64 typed result
value. SQL Jedi's will immediately spot that this list is
rather small. Where are the Standard Deviation and Variance
aggregates? And where are the distinct variations of these
aggregate functions? And why is there just a Count(filter),
but not a Count which is usable on a field or expression?
Questions which popped in my head pretty quickly. For
standard deviation and variance, two simply extension
methods would do the trick, so I added
Queryable.StandardDeviation and Queryable.Variance, both
accepting a field specification and an optional boolean for
Distinct usage.
First a word about distinct aggregate functions. Say you
want a list of all CustomerIDs from Northwind and added to
that in a second column the number of different employees
these customers have handled their orders. This requires a
distinct count. To do that in linq you'd do:
// Linq to SQL
NorthwindContext nw = new NorthwindContext();
var q = from c in nw.Customers
select new { c.CustomerID, NumberOfDifferentEmployees = c.Orders.Select(
o => o.EmployeeID).Distinct().Count() };
This gives this SQL: -- SQL from Linq to SQL SELECT
[t0].[CustomerID], ( SELECT COUNT(*) FROM ( SELECT DISTINCT
[t1].[EmployeeID] FROM [dbo].[Orders] AS [t1] WHERE
[t1].[CustomerID] = [t0].[CustomerID] ) AS [t2] ) AS
[NumberOfDifferentEmployees] FROM [dbo].[Customers] AS [t0]
A COUNT(DISTINCT EmployeeID) would have been easier. It's
not to say this is less efficient, but it's a little
different from what you'd expect. It took me a while before
I realized how to write the query with Count(). Of course
after I already added Queryable.CountColumn which
accepts a field/expression specification and an optional
boolean for Distinct usage.
With Sum and Average it could get more complicated so I
added another overload to Queryable.Sum and to
Queryable.Average for applying the aggregate on a Distinct
set, by adding an optional boolean for Distinct usage to my
versions of Sum and Average.
Aggregates aren't difficult to implement in Linq per se.
That is, in most very straight forward cases. However, there
are cases where things get out of hand. The thing is that
aggregate functions are typical scalar functions which
represent a single value but the source of what they
aggregate is always a set. In the query above you see a
typical example of it: a scalar query is placed inside the
projection and the scalar query itself fetches a set and
applies the aggregate (Count(*) in this case) on that set.
Let's look at a tough example of this. I'll first give you
the
LLBLGen Pro
query using Linq and then the produced SQL query. The main
element to see is that aggregates applied on aggregated
values need a set to aggregate, so this set has to be
created from the nested aggregate, otherwise the RDBMS will
complain that you're on crack and should go home to get some
sleep.
Ready?
// Linq to LLBLGen Pro nasty multi-aggregate query
[Test]
public void GetAllCustomersWithAnOrderTotalHigherThan5000()
{
using(DataAccessAdapter adapter = new DataAccessAdapter())
{
LinqMetaData metaData = new LinqMetaData(adapter);
var q = from c in metaData.Customer
where c.Orders.Sum(o => o.OrderDetails.Sum(
od => od.Quantity * od.UnitPrice)) > 5000
select c;
int counter = 0;
foreach(var v in q)
{
counter++;
}
Assert.AreEqual(54, counter);
}
}
Looks pretty simple? The query is pretty small, most of the
code given is test code or initialization stuff. As you can
see, we have disconnected meta-data for query construction
and the target the query is executed on. So with Linq to
LLBLGen Pro you can write db generic queries without
knowning up front which db the query is executed on later
on.
So, what SQL does this baby produce?
SELECT DISTINCT [LPLA_1].[CustomerID] AS [CustomerId], [LPLA_1].[CompanyName],
[LPLA_1].[ContactName], [LPLA_1].[ContactTitle], [LPLA_1].[Address], [LPLA_1].[City],
[LPLA_1].[Region], [LPLA_1].[PostalCode], [LPLA_1].[Country], [LPLA_1].[Phone],
[LPLA_1].[Fax]
FROM [Northwind].[dbo].[Customers] [LPLA_1]
WHERE
(
SELECT SUM([LPLA_5].[LPAV_]) AS [LPAV_]
FROM
(
SELECT DISTINCT [LPLA_2].[CustomerID] AS [CustomerId],
(
SELECT SUM([LPLA_4].[LPAV_]) AS [LPAV_]
FROM
(
SELECT DISTINCT [LPLA_3].[OrderID] AS [OrderId],
[LPLA_3].[Quantity] * [LPLA_3].[UnitPrice] AS [LPAV_]
FROM [Northwind].[dbo].[Order Details] [LPLA_3]
WHERE [LPLA_2].[OrderID] = [LPLA_3].[OrderID]
) LPLA_4
) AS [LPAV_]
FROM [Northwind].[dbo].[Orders] [LPLA_2]
WHERE [LPLA_1].[CustomerID] = [LPLA_2].[CustomerID]
) LPLA_5
) > @LPFA_31
(The DISTINCT keywords are sometimes still misplaced, this is a small glitch still left to be fixed, but not important now) As you can see, the aggregate functions are converted into scalar queries. One thing to notice is that the reference to a related set e.g. c.Orders, doesn't result in a join in this case. It results internally into what I dubbed a correlation relation. This relation is the connection of a set reference to another entity reference somewhere else in the query. In this particular context, the correlation relations end up in a WHERE clause inside the scalar queries in the projections of the derived tables which are the sources of the two SUM aggregates.
The mechanism behind this is quite complicated at first, but if you start with a couple of simple queries and build them out you see the pattern pretty quickly and it then also makes perfect sense. It's then not hard to write code for it which handles all cases. That is, if the aggregate targets a non-grouped set. Yes, people, aggregates like in the query above are first-grader material compared to what's unleashed when GroupBy enters the room.
GroupBy
Let's start with GroupBy first and then go back to
Aggregates, as GroupBy deserves its own introduction.
GroupBy is Linq's weakest link (pun intented). What's the
problem? Well, grouping operators in SQL have a nasty
side-effect: they have a fixed location where they can be.
Furthermore, they have a limitation: boolean values. SQL
normally doesn't understand boolean values in projections.
Take this example query
from the MSDN documentation:
// Linq to Objects query
var booleanGroupQuery =
from student in students
group student by student.Scores.Average() >= 80; //pass or fail!
Now try to rewrite this in a Linq to
db system variant, e.g. Linq to Sql. You won't get
the result the Linq to Objects query will produce. This
isn't the fault of Linq to Sql, it's the ease of which a
user can write a query in Linq without thinking what the
target will be when the query is executed. This is a weak
spot in Linq, as it shows that the developer
may never ever forget what the target is of the
query: a set of in-memory objects or a database system. So
the matra 'a query is the same, no matter what the source
is', isn't really something you should live by.
GroupBy has some overloads, some useful in a database
scenario, some not. With the various overloads, you can
create pretty awkward queries which result into garbage on a
database system. So if you're using GroupBy in a Linq query
targeting a database system, pay attention to
what you want to achieve, what your
intention is with the query. This is valid for Linq
to Sql but also for Linq to LLBLGen Pro and I'd be surprised
if it wasn't true for other Linq implementations to the
various O/R mappers out there. So a query which works on
in-memory objects could probably not work on a database even
though it's not really clear why.
Aggregates with group by
The last three weeks I've spend on making aggregates work
with GroupBy, make ordering and WHERE clauses work with
GroupBy keys (which can be anonymous types with multiple
fields!) and it wasn't a picknick. What's causing the delay?
Well, when an aggregate is targeting a groupby it simply has
a parameter which represents the GroupBy expression as its
source. This sounds like a dull detail, but it's essential.
The earlier mentioned non-groupby targeting aggregates above
could be handled and placed at the spot where they appeared:
simply create a scalar query at the spot the aggregate was
seen and apply some algorithm on the source of the query the
aggregate is placed in to handle aggregates on aggregates,
and that was it. But not with aggregates targeting GroupBy
expressions: these aggregates have to be
injected into the projection of the GroupBy query. At
the same time, at the spot where the aggregate appears in
the expression tree (e.g. in the projection, in a filter) a
field referencing the field in the GroupBy query has
to be placed. It makes sense actually: aggregates targeting
a GroupBy have to be executed on the grouped set so they
have to appear in the SELECT clause of the same query scope
where the GROUP BY is located as well.
But you're not there yet. Because multiple times an
aggregate in the query, targeting the same GroupBy requires
for example different sets these aggregate functions are
applied on. As an example of such a complicated monster,
I'll show you a bogus query which nevertheless illustrates
the point: it has multiple aggregates on the same GroupBy
and all operate on their own subset of data, so, you guessed
it, we need a similar algorithm for folding the source of
the GROUP BY clause into a derived table. This algorithm is
quite more complicated as the algorithm has to take into
account correlation relations, which might target fields on
a derived table which are actually from a table inside the
derived table.
var q = from o in metaData.Order
group o by o.Customer.Country into g
orderby g.Key
where g.Sum(n => n.OrderDetails.Count()) > 10
select new { Country = g.Key,
Num = g.Average(n => n.OrderDetails.Count(od => od.ProductId == 3)) };
Looks complicated? Wait till you see the SQL
The complicated elements of this query aren't the misplaced
orderby statement. The complicated elements are the g.Sum
and the g.Average at two different locations, both targeting
a different set of data to work on. For kicks, I also threw
in a filter on the Count, plus a grouping on a related
entity's field.
-- Linq to LLBLGen Pro query output
SELECT [LPLA_3].[Country], [LPLA_3].[LPAV_1] AS [Num]
FROM
(
SELECT [LPLA_11].[Country], SUM([LPLA_11].[LPAV_]) AS [LPAV_],
AVG([LPLA_11].[LPAV_1]) AS [LPAV_1]
FROM
(
SELECT [LPLA_10].[OrderId], [LPLA_10].[Country], [LPLA_10].[LPAV_],
(
SELECT COUNT(*) AS [LPAV_]
FROM [Northwind].[dbo].[Order Details] [LPLA_4]
WHERE [LPLA_10].[OrderID] = [LPLA_4].[OrderID]
AND [LPLA_4].[ProductID] = @ProductId1
) AS [LPAV_1]
FROM
(
SELECT [LPLA_1].[OrderID] AS [OrderId], [LPLA_2].[Country],
(
SELECT COUNT(*) AS [LPAV_]
FROM [Northwind].[dbo].[Order Details] [LPLA_4]
WHERE [LPLA_1].[OrderID] = [LPLA_4].[OrderID]
) AS [LPAV_]
FROM [Northwind].[dbo].[Customers] [LPLA_2]
RIGHT JOIN [Northwind].[dbo].[Orders] [LPLA_1]
ON [LPLA_2].[CustomerID]=[LPLA_1].[CustomerID]
) LPLA_10
) LPLA_11
GROUP BY [LPLA_11].[Country]
) LPLA_3
WHERE [LPLA_3].[LPAV_] > @LPAV_2
ORDER BY [LPLA_3].[Country] ASC
The query isn't really interesting for the result it gives, but it's interesting to see what has to be done to make it valid SQL. If you try the query in Linq to Sql you'll get a similar looking query. The complicated issue is, at least it was for me, the alias rewriting when a query part was folded into a derived table during the evaluation of the expression tree. Don't make the mistake of doing that at a point in time which is too late, because you'll then have no other option than to revert to re-aliasing the complete query, however that gives a problem with joins with self (Employee join Employee, which one are you referring to?).
I must say that I'm quite pleased with the result, even though it was quite a struggle to get everything lined up to do their thing in every situation thinkable: the main point was to fully understand which elements had to be moved around in the query and why, so you'll get some context for reasoning when it might fail.
So am I done now? No, there's still a lot of work to be
done, however the biggest hurdles are behind me: I've
covered all parts of a SQL query now. There are still some
nasty Queryable extension methods which might cause some
headscratching, like the All() method which results in a NOT
EXISTS query, and the vast amount of database functions
which can be called by mapping extension methods of .NET
classes onto these functions, as well as, converting some of
these functions to predicates like IN and LIKE. One thing
that already strikes me is that Microsoft didn't opt for
making the extension method -> DB function mapping an
extensible system, while it is so simple to do, as the query
is evaluated at runtime anyway. That's definitely something
I'll do differently: make it pluggable which DB function is
called when an extension method is used in the Linq query.
Anyway, let's first enjoy the holidays and a few days with
low work pressure.
.