Developing Linq to LLBLGen Pro, part 14
Update: I made a mistake in the first Linq to Sql query benchmark. It's not as slow as I previously posted: I didn't filter on country, which made it pull the rows of all 91 customers into memory instead of just the 11 from Germany. Fetching 91 customer rows, 818 order rows and 2015 order detail rows took Linq to Sql over 900ms; fetching the 11 German customers and their related rows took 165ms. The text has been corrected accordingly.
(This is part of an on-going series of articles,
started here)
We're getting closer to the goal: a full-featured
Linq provider for
LLBLGen Pro. Hopefully next week the Linq CTP for LLBLGen Pro is ready
for release to customers, fingers crossed!
Last time
I talked about the function mapping support for database
functions and in-memory function/method/property access
extravaganza, today I'll discuss one of the cornerstones of
using an O/R mapper in the first place: fetching trees of
objects using hierarchical fetches.
Eager-loading of related data using prefetch paths
One of the two ways of fetching hierarchical data is by
fetching a tree of entity instances. For example: fetch all
customers from Germany and their Order entities and for
every Order entity the related Order Detail entities as
well. Using an O/R mapper should make this simple and efficient: only the right entities should be fetched, and all entities should be merged into a tree that reflects their relations to each other. This means that all the
Order entity instances are stored in
Orders collections inside the Customer entity which
owns the Order instances. All Order Detail instances should
be placed inside OrderDetails collections, namely the
ones of the owning Order entity.
Doing this efficiently takes at most 3 queries: one for the Customers, one for the Orders of these Customers and one for the OrderDetails of these Orders. LLBLGen Pro has a built-in mechanism called
Prefetch Paths, which lets you specify the paths
along a tree, and every node which should be fetched
together with the main entity / entities to fetch. In this
particular case, we're fetching Customers from Germany, and
our path therefore is pretty simple: Customer - Order -
OrderDetails. If you want, you can specify a filter per
node, a sort expression per node and a limitation (so only
the last 5 orders for example). It also allows you to
specify multiple branches in the graph. This means that the
path doesn't have to be linear, you specify a tree with
multiple branches, e.g. Customer - Employee (m:n via
Orders), Customer - Order - OrderDetails and Order -
Employee.
Prefetch paths are also polymorphic. So you can specify a
branch in the tree which is only available in certain
subtypes. For example if you take the dreaded inheritance
hierarchy Employee - Manager - BoardMember and BoardMember
has a relation with CompanyCar (as they are the only ones
allowed to have a company car, the lucky ********!), you can
define a path for fetching Employees and BoardMember -
CompanyCar: for every Employee which is a BoardMember, the
related CompanyCar is fetched. Yes, it's pretty neat.
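A hypothetical sketch in plain Python of what 'polymorphic' means here (the class names mirror the example; none of this is LLBLGen Pro code): the CompanyCar branch is only evaluated for the fetched entities that actually are BoardMembers.

```python
# Sketch: a polymorphic path node only applies to instances of the subtype
# the branch was defined on; other entities in the set are skipped.
class Employee:
    def __init__(self, employee_id):
        self.employee_id = employee_id

class Manager(Employee):
    pass

class BoardMember(Manager):
    def __init__(self, employee_id):
        super().__init__(employee_id)
        self.company_car = None

def fetch_company_car_branch(employees, fetch_cars_for_ids):
    # Only BoardMember instances participate in this branch of the path.
    board_members = [e for e in employees if isinstance(e, BoardMember)]
    cars = fetch_cars_for_ids([b.employee_id for b in board_members])
    for b in board_members:
        b.company_car = cars.get(b.employee_id)
    return employees

# Stubbed 'database': every requested id has a car.
employees = [Employee(1), Manager(2), BoardMember(3)]
fetch_company_car_branch(employees, lambda ids: {i: "car-%d" % i for i in ids})
```

Only the BoardMember instance ends up with a CompanyCar; the plain Employee and Manager instances are untouched.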
It's key that these eager-loading specifications, fetch
plans, prefetch paths or whatever you want to call them, are
specified at the query level and are executed
with the query. So when you're fetching the customers
together with the prefetch path, you are fetching the
complete tree defined by the specified path. Now,
it's pretty clear that the complete path has to be fetched
at once, but why is it important that the path is specified
with the query? After all, doesn't Linq to Sql use a model
where the tree to fetch is specified via LoadOptions on the
DataContext? The main issue with defining it elsewhere is
that if you execute multiple queries on the DataContext,
you'll have to make sure the LoadOptions for that particular
query are set to the correct values. The Entity Framework uses a different approach, if I'm not mistaken: it uses an extension method to specify the related entities, yet that method takes strings, which I find particularly weird considering that a key selling point of Linq over embedded SQL strings is that queries are compile-time checked.
Linq to LLBLGen Pro will have a Queryable extension method
called WithPath, which allows you to specify a
complete tree inside the query, together with filters per
node, sorting per node, exclusion of fields per node etc.,
i.o.w. the usual Prefetch Path stuff. Everything is specifiable using compile-time checked constructs, so no string mess. Let's get back to our little example of
Customers from Germany and their Orders and OrderDetails
entities. This is what the query looks like:
// Query A
var q = (from c in metaData.Customer
         where c.Country=="Germany"
         select c).WithPath(
            new PathEdge&lt;OrderEntity&gt;(CustomerEntity.PrefetchPathOrders,
                new PathEdge&lt;OrderDetailsEntity&gt;(OrderEntity.PrefetchPathOrderDetails)));
You simply define the path edges to the related entities to fetch together with the main entities and you're done. As edges can be defined inside other edges, it's possible to specify multi-branched trees inside the query, and every PathEdge object can contain a lambda-based filter, a sort expression, a limiter and a list of fields to exclude. As Linq allows you to specify lambda expressions wherever you want, the filter is specifiable using lambdas, and as the type to define the lambda on is specified with the PathEdge, you'll get full intellisense. The sort expression is a different matter: sorting is specified in Linq using 'orderby', but that's only accepted at specific places inside a query, not at every spot. So we have to fall back onto normal LLBLGen Pro sort expressions for that, but that's OK, they're easy to specify. For example, if we want to sort the orders on OrderDate, descending, we would instead write:
// Query B
var q = (from c in metaData.Customer
         where c.Country=="Germany"
         select c).WithPath(
            new PathEdge&lt;OrderEntity&gt;(CustomerEntity.PrefetchPathOrders, null,
                new SortExpression(OrderFields.OrderDate | SortOperator.Descending), 0,
                new PathEdge&lt;OrderDetailsEntity&gt;(OrderEntity.PrefetchPathOrderDetails)));
The code is still compile-time checked. Our prefetch path
code has been in the framework since 2004 and it has gone
through some revisions where it received more and more
tweaks. One of the main tweaks, which is developer-controllable, is the setting that determines when LLBLGen Pro should switch
from a subquery based filter to an IN (values...) based
filter for fetching related entities. LLBLGen Pro uses 1
query per path node and merges sets through hash-value
comparisons. It doesn't use joins between parent/child in a
tree, as that leads to problems: with 1:n relations it leads
to a lot of duplicates and with multi-branched trees it
leads to complicated scenarios to fetch the data out of the
resultset and to very wide resultsets. So using 1 query per
path node is much more efficient, and as a bonus: the fetch
code for the node itself is the same as you'd normally use
so easier to implement as well. The Entity Framework uses
joins for related entities, which I don't think is a wise
thing to do.
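To make the one-query-per-node approach concrete, here's a minimal sketch in plain Python (illustrative only, not the actual LLBLGen Pro internals, which work on entity objects and hash values of key fields): child rows are fetched in a single query and then merged into their parents by key.

```python
# Sketch: merge child rows into their parent objects by matching key values,
# instead of joining parent and child in SQL (which would duplicate parents).
def merge_children(parents, children, parent_key, child_fk, collection_name):
    # Index the parents by key so each child row finds its owner in O(1).
    by_key = {}
    for p in parents:
        p[collection_name] = []
        by_key[p[parent_key]] = p
    for c in children:
        owner = by_key.get(c[child_fk])
        if owner is not None:
            owner[collection_name].append(c)
    return parents

# Two queries' worth of rows: one for Customers, one for their Orders.
customers = [{"CustomerID": "ALFKI"}, {"CustomerID": "ANATR"}]
orders = [{"OrderID": 1, "CustomerID": "ALFKI"},
          {"OrderID": 2, "CustomerID": "ALFKI"}]
merge_children(customers, orders, "CustomerID", "CustomerID", "Orders")
```

A 1:n merge like this never duplicates parent rows, which is exactly the problem joins introduce.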
Say in the above path, the engine has fetched the Customers from Germany and is now going to fetch their Orders. This could
be done in SQL with:
SELECT o.OrderID, o.CustomerID, ...
FROM Orders o
WHERE o.CustomerID IN
(
    SELECT CustomerID
    FROM Customers
    WHERE Country=@country
)
A clever Irish developer, Marcus Mac Innes, the owner of pix.ie, said to me: "Isn't that optimizable with an IN (customerID, customerID...) query in certain conditions?", and he wrote a proof of concept. In Northwind, there are 11 customers from Germany. As these are fetched before the Orders, and you therefore know the CustomerIDs to filter the Orders on, you can instead do:
SELECT o.OrderID, o.CustomerID, ... FROM Orders o WHERE o.CustomerID IN (@customerID1, @customerID2, ...., @customerID11)
This is faster, as the RDBMS just has to read one table. There's a catch: the number of parameters grows with the number of parent objects already fetched, so the question arises: is this always faster? Marcus did some extensive testing and the sweet spot for average-sized tables was near 100 parameters. Above that number, the subquery as shown in the first SQL snippet was faster. This is excellent tweaking material, so LLBLGen Pro sports a per-query threshold which allows you to optimize this setting, as it can be that your tables require a lower number, or perhaps in some cases a higher one. We set the threshold initially to 50 to be on the safe side for all queries.
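The threshold decision itself is easy to sketch (plain Python; the function name and parameter markers are made up for illustration, only the default of 50 comes from the text):

```python
# Sketch: below the threshold, filter the child query on the already-known
# parent key values; above it, fall back to the correlated subquery.
def build_child_filter(parent_ids, threshold=50):
    if len(parent_ids) <= threshold:
        params = ", ".join("@p%d" % i for i in range(len(parent_ids)))
        return "o.CustomerID IN (%s)" % params
    return ("o.CustomerID IN (SELECT CustomerID FROM Customers "
            "WHERE Country=@country)")

# 11 German customers stay well below the threshold, so the
# parameter-list variant is chosen.
print(build_child_filter(["ALFKI", "ANATR"]))
```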
Performance of prefetch paths and Expression tree
parsing
So, how does this all perform? We first have to measure how
fast our Linq provider parses Expression trees. As our
provider doesn't have a compiled query equivalent like Linq
to Sql, it's key to know if using Linq instead of our normal
API is of any value. Our tests show that a medium complex
query in Linq is fully handled by our provider in under 1
millisecond on a core2quad system with XP. So the overhead
is pretty minimal. This also shows that a compiled query feature isn't really necessary: the overhead isn't noticeable when you take into account network latency / db roundtrips, object materialization etc.
Let's run the first query, Query A, 100 times on this
system and let's see how fast it is. After that it's
compared to the same query using Linq to Sql instead. Now, I
want to stress that benchmarks in general have to be taken
with a grain of salt: it's a snapshot of a moment in time
under the conditions of that moment, not a fact of life. So
it's an indication of what the performance will be.
With all tests I've done, I've first warmed up the CLR a bit
with fetching a query or two, so all assemblies are loaded,
database connections are live etc. Tracing/logging was
switched off, the networked database server was idle
(SqlServer 2000).
As we'll see later on in the article, the performance of
LLBLGen Pro in hierarchical fetches is pretty good and Linq
to Sql, well... let's say it doesn't do that well. The fact
that Linq to Sql has a hard time with hierarchical fetches
isn't new: several people have blogged about this in the
past:
Patrik Löwendahl,
Roger Jennings
and
David Hayden.
Here's my simple LLBLGen Pro test code. I use an empty
foreach instead of calling ToList() on the query, because
ToList() in every provider copies the results into a new
List object, which costs performance without adding anything
to the story, so I left that out. Our queries are typed with
an interface which allows you to obtain the results directly
in a list, so you can avoid this performance penalty. With
Linq to Sql, you either have to call ToList() or traverse
the set.
Stopwatch sw = new Stopwatch();
sw.Start();
int amount = 100;
for(int i = 0; i < amount; i++)
{
    FetchNestedSetTest();
}
sw.Stop();
Console.WriteLine("Fetching the nested set {2} times took: {0}ms. On average this took per fetch: {1}ms",
    sw.ElapsedMilliseconds, sw.ElapsedMilliseconds / amount, amount);
//...
// the routine FetchNestedSetTest:
public static void FetchNestedSetTest()
{
    using(DataAccessAdapter adapter = new DataAccessAdapter())
    {
        LinqMetaData metaData = new LinqMetaData(adapter);
        var q = (from c in metaData.Customer
                 where c.Country=="Germany"
                 select c).WithPath(
                    new PathEdge&lt;OrderEntity&gt;(CustomerEntity.PrefetchPathOrders,
                        new PathEdge&lt;OrderDetailsEntity&gt;(OrderEntity.PrefetchPathOrderDetails)));
        foreach(var v in q)
        {
        }
    }
}
The output: Fetching the nested set 100 times took: 1906ms.
On average this took per fetch: 19ms
That's 3 queries per call, 19ms total for setting up the query, adapter and db connection, query parsing etc. This is a debug
build for the complete stack (Linq provider, O/R runtime,
SQL engine) and connects to a SqlServer 2000 database over
the network.
Ok, now let's see what Linq to Sql does in this case (same
machine, same target database, ran after LLBLGen Pro's
version).
Stopwatch sw = new Stopwatch();
sw.Start();
int amount = 100;
for(int i = 0; i < amount; i++)
{
    FetchNestedSetTest();
}
sw.Stop();
Console.WriteLine("Fetching the nested set {2} times took: {0}ms. On average this took per fetch: {1}ms",
    sw.ElapsedMilliseconds, sw.ElapsedMilliseconds / amount, amount);
//...
// the routine FetchNestedSetTest:
public static void FetchNestedSetTest()
{
    NorthwindDataContext nw = new NorthwindDataContext();
    DataLoadOptions loadOptions = new DataLoadOptions();
    loadOptions.LoadWith&lt;Customer&gt;(c => c.Orders);
    loadOptions.LoadWith&lt;Order&gt;(o => o.Order_Details);
    nw.LoadOptions = loadOptions;
    var q = from c in nw.Customers
            where c.Country=="Germany"
            select c;
    foreach(var v in q)
    {
    }
}
Ready? This is the output: Fetching the nested set 100 times
took: 16558ms. On average this took per fetch: 165ms
Yes, I've checked it a couple of times, and before you ask
if my code indeed fetches data, it does. 19ms vs. 165ms.
That's a big difference. How is this possible? The
main reason is that MS' code isn't clever in this case. It
doesn't do any fancy optimization of the graph fetch, while
it has all the information at hand to do so. Instead it
executes a lot of queries: one for the Customers, one per Customer to fetch that Customer's Orders, and one per Order to fetch that Order's OrderDetail entities. That's a lot of
queries, which take a lot of time. Not only that, it will
burn your server to the ground if you have many people
accessing your database and executing these queries. So if
you're considering Linq to Sql, be sure you know this
serious performance bottleneck. See also the 3 articles by Patrik, Roger and Dave I've linked to above, which explain the issue in more detail. In a bit, we'll see that this approach can also lead to wrong results in some cases.
Microsoft has put a lot of effort in optimizing the Linq to
Sql runtime via IL code generation at runtime (see
Rico Mariani's posts about this subject) to get fast set fetches. This code indeed does its work
well, simple sets are fetched very quickly, a couple of
milliseconds per query faster than our code. But, I'm sorry to say it, Rico and friends: isn't this work all in vain when the optimization in the end turns out to be a micro-optimization (which is considered a bad thing)?
Shouldn't this work have been applied to the bigger picture
which clearly shows that the Linq to Sql framework isn't
equipped with solid, tweakable, clever logic to perform
all queries very fast, especially the queries for
hierarchical fetches, one of the core reasons to use an O/R
mapper in the first place? Optimizing frameworks is a delicate matter, and optimizing O/R mappers is something you
have to work hard for: the classic 'it depends' is
applicable to a lot of situations and to get the best
performance in all cases, it's key that the framework
notices which situation it is in and adapts to that
situation. Code to quickly do hierarchical fetches with
tweakable features is somewhat complex to write, especially
the merging of the sets, but Linq to Sql doesn't even
support m:n relations, so the hardest part of hierarchical
fetches was already off the table!
Let's look at the other way of doing hierarchical fetches
with Linq.
Fetching hierarchical sets of non-entity projections
Linq allows you to specify a projection inside the
projection of your query, or more than one. I'll show you
two examples. The first is a complex hierarchical fetch touching 5 entities, with multiple paths. The other is a simpler one which fails to return the proper set in Linq to Sql, caused by their way of handling nested sets in the projection.
Linq to LLBLGen Pro treats nested queries inside the
projection the same way as prefetch paths: nested queries
are fetched with a separate query per set, the sets are
merged using hashvalue matching and the queries for the
nested sets are optimized using the same bag of tricks used
for prefetch paths (yes, we're using more tricks than just Marcus' trick). The nested queries in a Linq query have an additional
problem though: if the end result is a set of anonymous
types you can only materialize the objects if you have all
the data: anonymous types don't have setter properties, only
getters. This requires a 2-phase projection: first the data
is fetched from the db using a simple projection, and stored
into a temporary storage in-memory. Then all nested queries
are fetched, and the results of each query are placed in the
slot per row of the set to materialize. Then that rowset is
projected again, but this time in-memory with the actual projectors, so in-memory method calls are run and anonymous type instantiation works OK.
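A minimal sketch of such a 2-phase projection in plain Python (the names and the single 'Orders' slot are assumptions for illustration; the real provider handles arbitrary nested sets):

```python
# Phase 1: fetch the flat parent rows with an empty slot for the nested set.
# Then fill the slots from one extra query, and only run the final projector
# once all data is present, so immutable result types can be constructed.
def two_phase_project(fetch_parents, fetch_orders_for, projector):
    rows = [dict(r, Orders=None) for r in fetch_parents()]       # phase 1
    nested = fetch_orders_for([r["CustomerID"] for r in rows])   # one query
    for r in rows:
        r["Orders"] = nested.get(r["CustomerID"], [])            # fill slot
    return [projector(r) for r in rows]                          # phase 2

result = two_phase_project(
    lambda: [{"CustomerID": "ALFKI", "City": "Berlin"}],
    lambda ids: {"ALFKI": [{"OrderID": 1}]},
    lambda r: (r["City"], len(r["Orders"])))  # final, in-memory projection
```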
Ok, let's look at our phat nested query first:
var q = from c in metaData.Customer
        where c.Country == "Germany"
        select new
        {
            c.CompanyName,
            c.City,
            Orders = from o in c.Orders
                     select new
                     {
                         OrderDetails = from od in o.OrderDetails
                                        select new
                                        {
                                            od.Product,
                                            od.Quantity,
                                            od.UnitPrice
                                        },
                         o.Employee,
                         o.OrderDate
                     }
        };
This query combines fetches of anonymous types with data
projected from entity sets with fetches of entities:
od.Product and o.Employee are fetches of related entities,
while the rest are projections to anonymous types. This
query fetches data from Customer, Order, OrderDetails,
Product and Employee, and should use at most 5 queries.
Let's put this baby on the testbench to see how fast our
provider is. We use the same setup as above, only this time
FetchNestedSetTest() simply executes the query above instead
of the prefetch path query. The foreach assures proper
enumeration. The query is also in a unittest with
data-checks, so it indeed does fetch all data, so the
timings are correct.
The output is: Fetching the nested set 100 times took:
4363ms. On average this took per fetch: 43ms
This number isn't that much of a surprise, considering the
fact that the overall set of data is pretty small: just 11
customers, their orders, order details, the products used in
those order details (70 in total) and the employees who
filed the orders. With the approach of one query per node,
and with tricks like switching to an IN (value, value..)
query instead of a subquery, it makes the individual sets
pretty small, and the merger routine which has to merge the
different rows based on hashvalues doesn't have a hard time.
A debug build was used.
Now, let's see how Linq to Sql handles this query. It comes as no surprise that a nested query inside a projection is handled the same way as with the LoadOptions: the query is executed per parent, except when it is a scalar, in which case it is folded into the parent's select for obvious reasons (we do that too, btw). To avoid controversy, here's the Linq to Sql variant of this query:
var q = from c in nw.Customers
        where c.Country == "Germany"
        select new
        {
            c.CompanyName,
            c.City,
            Orders = from o in c.Orders
                     select new
                     {
                         OrderDetails = from od in o.Order_Details
                                        select new
                                        {
                                            od.Product,
                                            od.Quantity,
                                            od.UnitPrice
                                        },
                         o.Employee,
                         o.OrderDate
                     }
        };
The output: Fetching the nested set 100 times took: 30714ms.
On average this took per fetch: 307ms
So 43ms vs. 307ms. Analyzing what Linq to Sql does, you see
the same thing: a lot of queries are executed.
Clearly there's no clever code in place which makes this
more optimal.
It might be that you now think these are edge cases, but I beg to differ. Precisely because Linq makes it so easy to group sets together in projections, it's mandatory that these fetches are done optimally. Furthermore, and I already
stressed that point above: the ability to fetch trees of
objects is a big advantage of an O/R mapper over simpler
code which fills classes with results from a query: merging
sets efficiently without any effort from the developer is a
very nice feature as the data is grouped in trees, objects
in collections inside other objects, exactly as the class
model dictates. Isn't that one of the reasons people want to
work with entities, objects and the like and not with flat
tabular data?
There's another reason why the approach Linq to Sql uses
isn't that good: it can lead to wrong results. The following
Linq to Sql query results in 818 rows, one for every Order entity in the database. However, that's wrong, as 'Distinct' is used in the query: Customers often have more than one order and multiple orders are often filed by the same Employee.
The right answer has to be 472 (on our northwind test
database) instead of 818 rows. But because Linq to Sql
doesn't use 2-phase projections, it can't take into account
the Distinct at all, leading to the wrong results:
var q = (from o in nw.Orders
         select new
         {
             o.EmployeeID,
             CustomerData = (from c in nw.Customers
                             where c.CustomerID == o.CustomerID
                             select new { c.CompanyName, c.City, c.Country }).Single()
         }).Distinct();
The query is perhaps not what you'd see in your daily work, but it proves the point: an O/R mapper has to produce the right results according to the query.
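With a 2-phase projection, the Distinct can be applied to the completely merged rows, which is the only place it gives the right answer. A tiny sketch with hypothetical data:

```python
# Each row pairs an EmployeeID with the merged nested customer data.
# Distinct over the complete tuples removes the duplicates; distinct over
# the flat rows before the nested results are merged in cannot.
rows = [
    (1, ("Alfreds Futterkiste", "Berlin", "Germany")),
    (1, ("Alfreds Futterkiste", "Berlin", "Germany")),  # duplicate pair
    (2, ("Alfreds Futterkiste", "Berlin", "Germany")),
]
distinct_rows = list(dict.fromkeys(rows))  # order-preserving dedupe
```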
I'll conclude with an example of Linq to LLBLGen Pro using a nested query whose result is passed to a method which is executed in-memory, thus executed during the projection. The method is very simple but illustrates the mechanism.
// helper class which contains the in-memory method to call
public class InMemoryMethodCallsForTests
{
    public static bool IsUsefulEntity(CustomerEntity customerEntity)
    {
        return customerEntity.Country == "Germany";
    }
}

// unit test which illustrates the point of calling this method with a nested query result.
[Test]
public void InMemoryMethodCallWithNestedSetResultAsParameter()
{
    using(DataAccessAdapter adapter = new DataAccessAdapter())
    {
        LinqMetaData metaData = new LinqMetaData(adapter);
        var ordersFromGermany = from o in metaData.Order
                                where o.Customer.Country=="Germany"
                                select o.OrderId;
        List&lt;int&gt; orderIds = ordersFromGermany.ToList();
        Assert.AreEqual(121, orderIds.Count);
        var q = from o in metaData.Order
                select new
                {
                    o.OrderId,
                    Value = InMemoryMethodCallsForTests.IsUsefulEntity(o.Customer)
                };
        foreach(var v in q)
        {
            if(v.Value)
            {
                Assert.IsTrue(orderIds.Contains(v.OrderId));
            }
            else
            {
                Assert.IsFalse(orderIds.Contains(v.OrderId));
            }
        }
    }
}
For every anonymous type fetched, this test calls the IsUsefulEntity routine, passing it the result of the nested query on Customer. The result is placed in the slot for 'Value' and is used to materialize the anonymous type instances returned by the query.
So, done yet?
Not entirely, but almost. We still want to add support for full-text search to our Linq provider (our API supports it, so it's a small thing) and a way to specify fields to exclude in entity fetches (e.g. fetch all fields from Employee except the Photo field), which is something already implemented in LLBLGen Pro so it shouldn't be a lot of work to add. Then some very minor details to wrap up and it's done!
If everything goes according to plan, next week we'll release a Linq CTP of LLBLGen Pro v2.6 to our customers. This CTP includes the new runtime (v2.6), templates,
the Linq provider and the sourcecode of all unittests for
Linq so you get a lot of queries to look at to get started.
After that, we'll spend a month or so to make the changes
we've planned for v2.6 which will then be released in beta.
I think that gives us enough time to hammer out the
bugs/glitches/things we forgot from the Linq provider before
release.
Final words
This was the last article in this series. It spans 14
episodes and I think it's a great read on how my journey through Linq land went: from being in the pitch-black with just a flashlight with dead batteries, to the bright sunny daylight of today with the fully working, high-performance provider it has become. Compiled in release build, the Linq provider alone is 170KB, built from roughly 925KB of C# sourcecode (and there's still some small things left!). The first branch in SVN was created in September 2007, so roughly 5-6 months' worth of work. I was
on the edge of giving up on this Linq provider a couple of
times, mainly due to having no real information about what appears where in the expression tree; however, I'm glad I made it: the
Linq provider offers a nice way of querying all databases
supported by LLBLGen Pro and makes a nice addition to the
featureset we already provide.
I'd like to thank all people who have supported me during
this journey! I hope you all enjoyed reading these articles.
The wait is almost over, though I can assure you, it was
worth it.