Developing Linq to LLBLGen Pro, part 14

Update: I made a mistake in the first Linq to Sql query. It's not as slow as I previously posted: I didn't filter on country, which made it pull the rows of all 91 customers into memory instead of just the 11 from Germany. Fetching 91 customer rows, 818 order rows and 2015 order detail rows took Linq to Sql over 900ms; fetching the 11 customers took 165ms. The text has been corrected for that.

(This is part of an ongoing series of articles, started here)

We're getting closer to the goal: a full-featured Linq provider for LLBLGen Pro. Hopefully the Linq CTP for LLBLGen Pro will be ready for release to customers next week, fingers crossed! Last time I talked about the function mapping support for database functions and the in-memory function/method/property access extravaganza; today I'll discuss one of the cornerstones of using an O/R mapper in the first place: fetching trees of objects using hierarchical fetches.

Eager-loading of related data using prefetch paths
One of the two ways of fetching hierarchical data is by fetching a tree of entity instances. For example: fetch all customers from Germany, their Order entities, and for every Order entity the related Order Detail entities as well. Using an O/R mapper should make this simple and efficient, i.e. the right entities should be fetched and all entities should be merged into a tree so their relation to each other is visible. This means that all Order entity instances are stored in the Orders collection of the Customer entity which owns them, and all Order Detail instances are placed in the OrderDetails collection of their owning Order entity.

Doing this efficiently takes at most 3 queries: one for the Customers, one for the Orders of these Customers and one for the OrderDetails of these Orders. LLBLGen Pro has a built-in mechanism called Prefetch Paths, which lets you specify the paths along a tree: every node which should be fetched together with the main entity or entities. In this particular case, we're fetching Customers from Germany, so our path is pretty simple: Customer - Order - OrderDetails. If you want, you can specify a filter per node, a sort expression per node and a limitation (only the last 5 orders, for example). It also allows you to specify multiple branches in the graph. This means that the path doesn't have to be linear; you can specify a tree with multiple branches, e.g. Customer - Employee (m:n via Orders), Customer - Order - OrderDetails and Order - Employee.

Prefetch paths are also polymorphic, so you can specify a branch in the tree which is only available in certain subtypes. For example, take the dreaded inheritance hierarchy Employee - Manager - BoardMember, where BoardMember has a relation with CompanyCar (as they are the only ones allowed to have a company car, the lucky ********!). You can then define a path for fetching Employees with BoardMember - CompanyCar: for every Employee which is a BoardMember, the related CompanyCar is fetched. Yes, it's pretty neat.

It's key that these eager-loading specifications, fetch plans, prefetch paths or whatever you want to call them, are specified at the query level and are executed with the query. So when you're fetching the customers together with the prefetch path, you are fetching the complete tree defined by the specified path. Now, it's pretty clear that the complete path has to be fetched at once, but why is it important that the path is specified with the query? After all, doesn't Linq to Sql use a model where the tree to fetch is specified via LoadOptions on the DataContext? The main issue with defining it elsewhere is that if you execute multiple queries on the DataContext, you have to make sure the LoadOptions for each particular query are set to the correct values. The Entity Framework uses a different approach if I'm not mistaken: it indeed uses an extension method to specify the related entities, yet it uses strings, which I find particularly weird considering that one of the main selling points of Linq over embedded SQL strings is that it's compile-time checked.

Linq to LLBLGen Pro will have a Queryable extension method called WithPath, which allows you to specify a complete tree inside the query, together with filters per node, sorting per node, exclusion of fields per node etc., in other words the usual Prefetch Path stuff. Everything is specifiable using compile-time checked constructs, so no string mess. Let's get back to our little example of Customers from Germany and their Orders and OrderDetails entities. This is what the query looks like:

// Query A
var q = (from c in metaData.Customer
    where c.Country=="Germany"
    select c).WithPath(
               new PathEdge<OrderEntity>(CustomerEntity.PrefetchPathOrders, 
                    new PathEdge<OrderDetailsEntity>(OrderEntity.PrefetchPathOrderDetails)));

You simply define the path edges of the path to the related entities to fetch together with the main entities, and you're done. As edges can be defined inside other edges, it's possible to specify multi-branched trees inside the query, and every PathEdge object can contain a lambda-based filter, a sort expression, a limiter and a list of fields to exclude. As Linq allows you to specify lambda expressions wherever you want, the filter is specifiable using lambdas, and as the type to define the lambda on is specified with the PathEdge, you'll get full intellisense. The sort expression is a different matter: sorting is specified in Linq using 'orderby', but that's only accepted at specific places inside a query, not at every spot. So we have to fall back on normal LLBLGen Pro sort expressions for that, but that's OK, they're easy to specify. For example, if we want to sort the orders on OrderDate, descending, we would do this instead:

// Query B
var q = (from c in metaData.Customer
    where c.Country=="Germany"
    select c).WithPath(
          new PathEdge<OrderEntity>(CustomerEntity.PrefetchPathOrders, null, 
                new SortExpression(OrderFields.OrderDate | SortOperator.Descending), 0, 
                new PathEdge<OrderDetailsEntity>(OrderEntity.PrefetchPathOrderDetails)));

The code is still compile-time checked. Our prefetch path code has been in the framework since 2004 and has gone through some revisions in which it received more and more tweaks. One of the main tweaks, which is developer controllable, is the setting which determines when LLBLGen Pro should switch from a subquery based filter to an IN (values...) based filter for fetching related entities. LLBLGen Pro uses one query per path node and merges the sets through hash-value comparisons. It doesn't use joins between parent and child in a tree, as that leads to problems: with 1:n relations it leads to a lot of duplicates, and with multi-branched trees it leads to complicated scenarios for getting the data out of the resultset, and to very wide resultsets. So using one query per path node is much more efficient, and as a bonus the fetch code for the node itself is the same as you'd normally use, so it's easier to implement as well. The Entity Framework uses joins for related entities, which I don't think is a wise thing to do.
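The hash-based merge of parent and child sets can be sketched as follows; this is a minimal illustration with stand-in Customer/Order classes, not LLBLGen Pro's actual merge code:

```csharp
using System.Collections.Generic;

// Stand-ins for generated entity classes; illustration only.
public class Customer
{
    public string CustomerId;
    public List<Order> Orders = new List<Order>();
}

public class Order
{
    public int OrderId;
    public string CustomerId;   // FK to Customer
}

public static class MergeDemo
{
    // Merge the child set into the parent set in O(parents + children):
    // build a hash lookup on the parent key once, then route every child
    // row to its owner. No join, so no duplicated parent rows.
    public static void MergeOrdersIntoCustomers(
        IList<Customer> customers, IEnumerable<Order> orders)
    {
        var parentsByKey = new Dictionary<string, Customer>();
        foreach(var customer in customers)
        {
            parentsByKey[customer.CustomerId] = customer;
        }
        foreach(var order in orders)
        {
            Customer owner;
            if(parentsByKey.TryGetValue(order.CustomerId, out owner))
            {
                owner.Orders.Add(order);
            }
        }
    }
}
```

Each level of the path runs such a merge against the level above it, so a Customer - Order - OrderDetails path costs three queries and two merges, regardless of how many parent rows there are.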

Say, in the above path, the engine has fetched the Customers from Germany and is now going to fetch their Orders. This could be done in SQL with:

SELECT o.OrderID, o.CustomerID, ... 
FROM Orders o 
WHERE o.CustomerID IN
(
     SELECT CustomerID 
     FROM Customers
     WHERE Country=@country
)

A clever Irish developer, Marcus Mac Innes, the owner of pix.ie, said to me: "Isn't that optimizable with an IN (customerID, customerID...) query in certain conditions?", and he wrote a proof of concept. In Northwind, there are 11 customers from Germany. As these are fetched before the Orders, you already know the CustomerIDs to filter the Orders on, so you can instead do:

SELECT o.OrderID, o.CustomerID, ... 
FROM Orders o 
WHERE o.CustomerID IN
(@customerID1, @customerID2, ...., @customerID11)

This is faster, as the RDBMS just has to read one table. There's a catch though: the number of parameters increases with the number of parent objects already fetched, so the question arises: is this always faster? Marcus did some extensive testing, and the sweet spot for average sized tables was near 100 parameters. Above that number, the subquery shown in the first SQL snippet was faster. This is excellent tweaking material, so LLBLGen Pro sports a threshold which you can tune per query, as it can be that your tables require a lower number, or perhaps in some cases a higher one. We set the threshold initially to 50 to be on the safe side for all queries.
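As a rough sketch of how such a switch might work (a hypothetical helper, not the real engine code), the choice between the two filter forms boils down to comparing the number of already-fetched parent keys against the threshold:

```csharp
using System.Collections.Generic;
using System.Linq;

public static class ChildFilterDemo
{
    // Illustrative only: pick the child query's filter form based on how
    // many parent key values were already fetched, compared to a threshold.
    public static string BuildOrdersWhereClause(
        IList<string> fetchedCustomerIds, int parameterThreshold)
    {
        if(fetchedCustomerIds.Count <= parameterThreshold)
        {
            // Few parents: emit IN (@p1, @p2, ...) so the RDBMS only has
            // to read the Orders table.
            var parameterNames = fetchedCustomerIds
                .Select((id, i) => "@customerId" + (i + 1));
            return "WHERE o.CustomerID IN (" + string.Join(", ", parameterNames) + ")";
        }
        // Many parents: fall back to the subquery form, re-using the
        // parent query's own filter.
        return "WHERE o.CustomerID IN (SELECT CustomerID FROM Customers WHERE Country = @country)";
    }
}
```

With the threshold at 50, the 11 German customers fall well below the limit, so the IN-list form is chosen for their Orders query.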

Performance of prefetch paths and Expression tree parsing
So, how does this all perform? We first have to measure how fast our Linq provider parses expression trees. As our provider doesn't have a compiled query equivalent like Linq to Sql, it's key to know whether using Linq instead of our normal API costs anything. Our tests show that a medium complex query in Linq is fully handled by our provider in under 1 millisecond on a Core2Quad system with XP, so the overhead is pretty minimal. This also shows that a compiled query feature isn't really necessary: the overhead isn't noticeable when you take into account network latency / db roundtrips, object materialization etc.

Let's run the first query, Query A, 100 times on this system and see how fast it is. After that, it's compared to the same query using Linq to Sql. Now, I want to stress that benchmarks in general have to be taken with a grain of salt: a benchmark is a snapshot of a moment in time under the conditions of that moment, not a fact of life. So it's an indication of what the performance will be. With all tests I've done, I've first warmed up the CLR a bit by fetching a query or two, so all assemblies are loaded, database connections are live etc. Tracing/logging was switched off and the networked database server was idle (SqlServer 2000).

As we'll see later on in the article, the performance of LLBLGen Pro in hierarchical fetches is pretty good and Linq to Sql, well... let's say it doesn't do that well. The fact that Linq to Sql has a hard time with hierarchical fetches isn't new: several people have blogged about this in the past: Patrik Löwendahl, Roger Jennings and David Hayden.

Here's my simple LLBLGen Pro test code. I use an empty foreach instead of calling ToList() on the query, because ToList() in every provider copies the results into a new List object, which costs performance without adding anything to the story, so I left that out. Our queries are typed with an interface which allows you to obtain the results directly in a list, so you can avoid this performance penalty. With Linq to Sql, you either have to call ToList() or traverse the set.

Stopwatch sw = new Stopwatch();
sw.Start();
int amount = 100;
for(int i = 0; i < amount; i++)
{
    FetchNestedSetTest();
}
sw.Stop();
Console.WriteLine("Fetching the nested set {2} times took: {0}ms. On average this took per fetch: {1}ms",
          sw.ElapsedMilliseconds, sw.ElapsedMilliseconds / amount, amount);
          
//...
// the routine FetchNestedSetTest:
public static void FetchNestedSetTest()
{
    using(DataAccessAdapter adapter = new DataAccessAdapter())
    {
        LinqMetaData metaData = new LinqMetaData(adapter);
        var q = (from c in metaData.Customer
                where c.Country=="Germany"
                select c).WithPath(
                    new PathEdge<OrderEntity>(CustomerEntity.PrefetchPathOrders, 
                        new PathEdge<OrderDetailsEntity>(OrderEntity.PrefetchPathOrderDetails)));
        foreach(var v in q)
        {
        }
    }
}

The output: Fetching the nested set 100 times took: 1906ms. On average this took per fetch: 19ms
That's 3 queries per call: 19ms total for setting up the query, the adapter and the db connection, query parsing etc. This is a debug build for the complete stack (Linq provider, O/R runtime, SQL engine) and it connects to a SqlServer 2000 database over the network.

Ok, now let's see what Linq to Sql does in this case (same machine, same target database, ran after LLBLGen Pro's version).

Stopwatch sw = new Stopwatch();
sw.Start();
int amount = 100;
for(int i = 0; i < amount; i++)
{
    FetchNestedSetTest();
}
sw.Stop();

Console.WriteLine("Fetching the nested set {2} times took: {0}ms. On average this took per fetch: {1}ms",
     sw.ElapsedMilliseconds, sw.ElapsedMilliseconds / amount, amount);

//...
// the routine FetchNestedSetTest:
public static void FetchNestedSetTest()
{
    NorthwindDataContext nw = new NorthwindDataContext();
    DataLoadOptions loadOptions = new DataLoadOptions();
    loadOptions.LoadWith<Customer>(c => c.Orders);
    loadOptions.LoadWith<Order>(o => o.Order_Details);
    nw.LoadOptions = loadOptions;
    var q = from c in nw.Customers
            where c.Country=="Germany"
            select c;

    foreach(var v in q)
    {
    }
}

Ready? This is the output: Fetching the nested set 100 times took: 16558ms. On average this took per fetch: 165ms
Yes, I've checked it a couple of times, and before you ask whether my code indeed fetches data: it does. 19ms vs. 165ms. That's a big difference. How is this possible? The main reason is that MS' code isn't clever in this case. It doesn't do any fancy optimization of the graph fetch, while it has all the information at hand to do so. Instead it executes a lot of queries: one for the Customers, one per Customer to fetch that Customer's Orders, and one per Order to fetch that Order's OrderDetail entities. That's a lot of queries, which take a lot of time. Not only that, it will burn your server to the ground if you have many people accessing your database and executing these queries. So if you're considering Linq to Sql, be sure you know about this serious performance bottleneck. See also the 3 articles by Patrik, Roger and Dave I've linked to above, which explain the issue in more detail. In a bit, we'll see that this approach can also lead to wrong results in some cases. More on that below.

Microsoft has put a lot of effort into optimizing the Linq to Sql runtime via IL code generation at runtime (see Rico Mariani's posts about this subject) to get fast set fetches. This code indeed does its work well; simple sets are fetched very quickly, a couple of milliseconds per query faster than our code. But, I'm sorry to say it, Rico and friends, isn't this work all in vain when the optimization in the end turns out to be a micro-optimization (which is considered a bad thing)? Shouldn't this work have been applied to the bigger picture, which clearly shows that the Linq to Sql framework isn't equipped with solid, tweakable, clever logic to perform all queries very fast, especially the queries for hierarchical fetches, one of the core reasons to use an O/R mapper in the first place? Optimizing frameworks is a delicate matter, and optimizing O/R mappers is something you have to work hard for: the classic 'it depends' applies to a lot of situations, and to get the best performance in all cases it's key that the framework notices which situation it is in and adapts to it. Code to quickly do hierarchical fetches with tweakable features is somewhat complex to write, especially the merging of the sets, but Linq to Sql doesn't even support m:n relations, so the hardest part of hierarchical fetches was already off the table!

Let's look at the other way of doing hierarchical fetches with Linq.

Fetching hierarchical sets of non-entity projections
Linq allows you to specify one or more projections inside the projection of your query. I'll show you two examples. The first one is a complex hierarchical fetch with multiple paths, touching 5 entities. The other one is a simpler one which fails to return the proper set in Linq to Sql, caused by its way of handling nested sets in the projection.

Linq to LLBLGen Pro treats nested queries inside the projection the same way as prefetch paths: nested queries are fetched with a separate query per set, the sets are merged using hash-value matching and the queries for the nested sets are optimized using the same bag of tricks used for prefetch paths (yes, we're using more tricks than just Marcus' trick). The nested queries in a Linq query have an additional problem though: if the end result is a set of anonymous types, you can only materialize the objects once you have all the data: anonymous types don't have setter properties, only getters. This requires a 2-phase projection: first the data is fetched from the db using a simple projection and stored in temporary in-memory storage. Then all nested queries are fetched, and the results of each query are placed in the proper slot per row of the set to materialize. Then that rowset is projected again, but this time in-memory with the actual projectors, so in-memory method calls are run and anonymous type instantiation works OK.
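A minimal sketch of this 2-phase projection, with made-up data and object[] rows standing in for the provider's temporary storage:

```csharp
using System.Collections.Generic;
using System.Linq;

public static class TwoPhaseProjectionDemo
{
    // Returns the number of orders per materialized row, just so the
    // effect is observable; the data is made up.
    public static int[] Run()
    {
        // Phase 1: the flat database projection. Slot 2 of every row is
        // reserved for the nested set, which isn't known yet.
        var phase1Rows = new List<object[]>
        {
            new object[] { "Alfreds Futterkiste", "Berlin", null },
            new object[] { "Blauer See Delikatessen", "Mannheim", null }
        };

        // Phase 2a: every nested query is fetched separately (simulated
        // here) and its result is placed in the slot of its owning row.
        phase1Rows[0][2] = new[] { "Order A", "Order B" };
        phase1Rows[1][2] = new[] { "Order C" };

        // Phase 2b: only now is the real projector run, in memory. All
        // data is present, so getter-only anonymous types can be
        // instantiated in a single constructor call.
        var materialized = phase1Rows
            .Select(row => new
            {
                CompanyName = (string)row[0],
                City = (string)row[1],
                Orders = (string[])row[2]
            })
            .ToList();

        return materialized.Select(c => c.Orders.Length).ToArray();
    }
}
```

The key point is the ordering: the anonymous type constructor runs only after all nested slots are filled, because there is no way to set its properties afterwards.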

Ok, let's look at our phat nested query first:

var q = from c in metaData.Customer
        where c.Country == "Germany"
        select new
        {
            c.CompanyName,
            c.City,
            Orders = from o in c.Orders
                     select new
                     {
                         OrderDetails = from od in o.OrderDetails
                                        select new
                                        {
                                            od.Product,
                                            od.Quantity,
                                            od.UnitPrice
                                        },
                         o.Employee,
                         o.OrderDate
                     }
        };

This query combines projections to anonymous types with fetches of related entities: od.Product and o.Employee are fetches of related entities, while the rest is data projected into anonymous types. This query fetches data from Customer, Order, OrderDetails, Product and Employee, and should use at most 5 queries. Let's put this baby on the testbench to see how fast our provider is. We use the same setup as above, only this time FetchNestedSetTest() simply executes the query above instead of the prefetch path query. The foreach assures proper enumeration. The query is also in a unittest with data-checks, so it indeed fetches all data and the timings are correct.

The output is: Fetching the nested set 100 times took: 4363ms. On average this took per fetch: 43ms
This number isn't that much of a surprise, considering the fact that the overall set of data is pretty small: just 11 customers, their orders, their order details, the products used in those order details (70 in total) and the employees who filed the orders. With the approach of one query per node, and with tricks like switching to an IN (value, value...) query instead of a subquery, the individual sets stay pretty small, and the merger routine which has to merge the different rows based on hash values doesn't have a hard time. A debug build was used.

Now, let's see how Linq to Sql handles this query. It comes as no surprise that a nested query inside a projection is handled the same way as with the LoadOptions: the query is executed per parent, except when it is a scalar, in which case it is folded into the parent's select for obvious reasons (we do that too, btw). To avoid controversy, here's the Linq to Sql variant of this query:

var q = from c in nw.Customers
        where c.Country == "Germany"
        select new
        {
            c.CompanyName,
            c.City,
            Orders = from o in c.Orders
                     select new
                     {
                         OrderDetails = from od in o.Order_Details
                                        select new
                                        {
                                            od.Product,
                                            od.Quantity,
                                            od.UnitPrice
                                        },
                         o.Employee,
                         o.OrderDate
                     }
        };

The output: Fetching the nested set 100 times took: 30714ms. On average this took per fetch: 307ms
So 43ms vs. 307ms. Analyzing what Linq to Sql does, you see the same thing: a lot of queries are executed. Clearly there's no clever code in place which makes this more optimal.

You might now think these are edge cases, but I beg to differ. Precisely because Linq makes it so easy to group sets together in projections, it's mandatory that these fetches are done optimally. Furthermore, and I already stressed this point above: the ability to fetch trees of objects is a big advantage of an O/R mapper over simpler code which fills classes with results from a query. Merging sets efficiently without any effort from the developer is a very nice feature, as the data is grouped in trees, objects in collections inside other objects, exactly as the class model dictates. Isn't that one of the reasons people want to work with entities, objects and the like, and not with flat tabular data?

There's another reason why the approach Linq to Sql uses isn't that good: it can lead to wrong results. The following Linq to Sql query results in 818 rows, one for every Order entity in the database. However, that's wrong, as Distinct is used in the query: Customers often have more than one order, and often multiple orders are filed by the same Employee. The right answer on our Northwind test database is 472 rows, not 818. But because Linq to Sql doesn't use 2-phase projections, it can't take the Distinct into account at all, leading to the wrong results:

var q = (from o in nw.Orders
         select new
         {
             o.EmployeeID,
             CustomerData = (from c in nw.Customers where c.CustomerID == o.CustomerID select new { c.CompanyName, c.City, c.Country }).Single()
         }).Distinct();

The query is perhaps not what you'd see in your daily work, but it proves the point: an O/R mapper has to produce the right results according to the query.
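To illustrate why a 2-phase approach can honor the Distinct: once the nested results have been merged into the rows in memory, the duplicates become visible and an in-memory Distinct collapses them. A simplified sketch with made-up data; what makes this work is that C# anonymous types compare by value:

```csharp
using System.Linq;

public static class DistinctDemo
{
    // Returns the row count after Distinct; data is made up.
    public static int Run()
    {
        // Rows as they look after the nested CustomerData has been merged
        // in during phase 2: only now are the duplicates visible.
        var merged = new[]
        {
            new { EmployeeId = 5, CompanyName = "Alfreds Futterkiste" },
            new { EmployeeId = 5, CompanyName = "Alfreds Futterkiste" },  // duplicate
            new { EmployeeId = 6, CompanyName = "Alfreds Futterkiste" }
        };

        // Anonymous types implement Equals/GetHashCode over all their
        // members, so an in-memory Distinct collapses the duplicate row.
        return merged.Distinct().Count();   // 2
    }
}
```

A provider that pushes the Distinct to the database but fetches the nested sets per parent row can never see these duplicates, which is exactly why Linq to Sql returns 818 rows instead of 472.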

I'll conclude with an example of Linq to LLBLGen Pro using a nested query whose result is passed to a method which is executed in-memory, thus during projection. The method is very simple but illustrates the mechanism.

// helper class which contains the in-memory method to call
public class InMemoryMethodCallsForTests
{
    public static bool IsUsefulEntity(CustomerEntity customerEntity)
    {
        return customerEntity.Country == "Germany";
    }
}


// unit test which illustrates the point of calling this method with a nested query result.
[Test]
public void InMemoryMethodCallWithNestedSetResultAsParameter()
{
    using(DataAccessAdapter adapter = new DataAccessAdapter())
    {
        LinqMetaData metaData = new LinqMetaData(adapter);
        var ordersFromGermany = from o in metaData.Order
                                where o.Customer.Country=="Germany"
                                select o.OrderId;

        List<int> orderIds = ordersFromGermany.ToList();
        Assert.AreEqual(121, orderIds.Count);

        var q = from o in metaData.Order
                select new    {
                        o.OrderId,
                        Value = InMemoryMethodCallsForTests.IsUsefulEntity(o.Customer)
                    };

        foreach(var v in q)
        {
            if(v.Value)
            {
                Assert.IsTrue(orderIds.Contains(v.OrderId));
            }
            else
            {
                Assert.IsFalse(orderIds.Contains(v.OrderId));
            }
        }
    }
}

For every anonymous type fetched, this test calls the IsUsefulEntity routine, passing it the result of the nested query on Customer. The result is placed in the slot for 'Value' and is used to materialize the anonymous type instances returned by the query.

So, done yet?
Not entirely, but almost. We want to add support for full-text search to our Linq provider (our API supports it, so it's a small thing) and a way to specify fields to exclude in entity fetches (so: fetch all fields from Employee except the Photo field), which is already implemented in LLBLGen Pro so it shouldn't be a lot of work to add. Then there are some very minor details to wrap up, and it's done!

If everything goes according to plan, we'll release a Linq CTP of LLBLGen Pro v2.6 to our customers next week. This CTP includes the new runtime (v2.6), templates, the Linq provider and the source code of all unit tests for Linq, so you get a lot of queries to look at to get started. After that, we'll spend a month or so making the changes we've planned for v2.6, which will then be released in beta. I think that gives us enough time to hammer out the bugs/glitches/things we forgot from the Linq provider before release.

Final words
This was the last article in this series. It spans 14 episodes and I think it's a great read on how my journey through Linq land went: from being in the pitch-black with just a flashlight with dead batteries, to the bright sunny daylight of today with the fully working, high performance provider it has become. Compiled in release build, the Linq provider alone is 170KB, and it spans roughly 925KB of C# source code (and there are still some small things left!). The first branch in SVN was created in September 2007, so it's roughly 5-6 months worth of work. I was on the edge of giving up on this Linq provider a couple of times, mainly due to having no real information about when what appears where in the tree; however, I'm glad I made it: the Linq provider offers a nice way of querying all databases supported by LLBLGen Pro and makes a nice addition to the feature set we already provide.

I'd like to thank all people who have supported me during this journey! I hope you all enjoyed reading these articles. The wait is almost over, though I can assure you, it was worth it.
