Developing Linq to LLBLGen Pro, part 13

(This is part of an ongoing series of articles, started here)

Last time I talked about implementing Single. It turned out to be fairly straightforward, but as I explained in the previous episode, it's a weird method: its behavior depends on where it's used in the query. Sometimes it results in an exception and sometimes it doesn't, and when it does, it's not always the same type of exception. This same aspect of the behavior, or better: the end result, depending on where the element is used in the query shows up when using function mappings, local methods and the like in a Linq query. Let's look at an example or two:

// Example A: usage of method mapped onto db function
LinqMetaData metaData = new LinqMetaData(adapter);
var q = from c in metaData.Customer
        where String.Compare(c.Country, "USA")==0
        select c;
List<CustomerEntity> customers = q.ToList();
Assert.AreEqual(13, customers.Count);
// Example B: usage of in-memory call of method not mapped on db function
LinqMetaData metaData = new LinqMetaData(adapter);
var q = from o in metaData.Order
        where o.CustomerId=="CHOPS"
        select new
        {
            DaysInMonth = DateTime.DaysInMonth((int)o.OrderDate.Value.Year, 
                 (int)o.OrderDate.Value.Month),
            OrderId = o.OrderId,
            Constant = true
        };

Example A is a typical example of a .NET method which is mapped onto a DB function / construct. Example B, however, shows a different usage: DateTime.DaysInMonth isn't mapped onto a DB function in this case; it's executed in memory on the query result, during the materialization of the objects in q, i.e. during the projection of the raw query results onto objects. This poses a bit of a problem: if there's no mapping of a method / property onto a DB construct, the method / property apparently has to be executed in-memory after the results have been received. But what about this query instead:

// Example C: non-working query due to in-memory method call in where clause
var q = from o in metaData.Order
        where o.CustomerId=="CHOPS"
        && DateTime.DaysInMonth((int)o.OrderDate.Value.Year, 
                (int)o.OrderDate.Value.Month)==30
        select o;

Here we use the exact same DateTime.DaysInMonth construct, however now it's inside a where clause and therefore part of the query itself, which means it has to run inside the database. That doesn't work if there's no way to convert the .NET code to SQL. So we have a query with the same construct, but its fate depends on where it's used: sometimes it works, in other cases it ends in tears.

It is very important that developers understand this difference: an element can work in one area of the query but not in another, because no conversion to SQL is possible there, so it can't run on the database. The problematic thing is that you'll find this out at runtime, not at compile-time: the query is evaluated at runtime, so the code compiles fine.

The main cause of this confusion is that Linq offers a lot of features to be used inside a query, and with a lot of features comes a lot of responsibility. This means that if you're going to write a Linq query which will run in the database, be absolutely sure you know where each part is executed. Not only will that save you from the trouble of things not working at runtime, it will also save you from having to fix performance issues because what you thought would run inside the DB actually runs on the client. To add to this confusion, one can create a mapping for DateTime.DaysInMonth to a function in the database which returns the same value as the .NET version does. After that the construct is convertible to SQL and will run inside the database instead of in memory, so example C would then work properly. Later in this post we'll explore how to achieve exactly that.

One could argue whether all this is really a good thing or not. I'm a bit undecided about this ("Now, that's a first!"): after writing a lot of tests for our method/property mapping system, I can only say that being able to map a .NET method/property to a DB construct is awesome. However, I fully understand the problematic aspects illustrated in examples B and C above and the impact they'll have on people who read code written by others: they'll have a hard time understanding what's really going on. Not fully understanding what's going on is one of the main reasons bugs are introduced and not found by the developer.

For people who have never seen code for an O/R mapper, Linq seems like magic, and its fancy, shiny aura makes them forget about all the real-life problems they have to deal with: this will solve all the problems, once and for all! However, people who have worked with O/R mappers know that dropping SQL and replacing it with another query language doesn't make things easier to implement in all cases: Linq is just another query language, and it shares the same disadvantage with all query languages: it's not a 1:1 equivalent of SQL. This disadvantage will eventually become apparent to the novice O/R mapper users among us: you will still wonder from time to time how to write in Linq the same query you would write in SQL.

LLBLGen Pro was the first O/R mapper on .NET which had a compile-time checked querying system. At the same time it forced the developer to use a different style of querying than one would use with SQL. With Linq, one might be tempted to think this will be different: it won't be hard anymore to rewrite in C#, using Linq, a query originally written in SQL. But that's naive and, if I may say so, a tiny bit wrong. Linq abstracts away so many things that the learning curve to be able to write whatever query you'd want in Linq is still steep. The power Linq offers to combine, in one query, .NET code running on the client with code which will be converted to SQL is great, but it comes with a price: you have to be fully aware of what you're doing, as described in examples B and C. That's not a bad thing though. Writing software requires that you know what you're doing, and with Linq and a good O/R mapper beneath it, a lot of power is right there at your fingertips.

Now that that lecture is out of the way, let's look at a couple of nasty buggers shown in the examples above.

Bugger 1: Constants in the projection
This is a problem: do you emit the constant in the query end-result, or do you pass the constant along to the projection engine? The problem with the first approach is that if the value isn't convertible to a SQL construct, you're in trouble. So the best way is to pass it to the projection engine. The constant I'm talking about here is one similar to the 'true' boolean in examples B and C. At first I thought it would be a bit silly to add support for this, but after a while it did make sense, so adding support for it was necessary.

LLBLGen Pro has a fine-grained projection framework built in, which allows you to define projections from sets onto datatables, entity classes or custom classes. It also allows you to define a projection from a view of an entity collection onto datatables, other entity classes or custom classes. This framework works with a set of source values, a set of destination elements and a set of projectors which define how to set each destination element to a value, as in: how to convert the source values into destination values or a single value (the number of source elements doesn't have to be the same as the number of destination elements). The input of a projection is the set of rows of data to project, e.g. the rows in a resultset of a query. Onto each row the set of projectors is applied to produce the values for the set of destination elements, which for example can be a set of properties, a set of constructor arguments, or both.

I made a change to this framework so that each projector which produces a destination element's value can also carry a delegate. This delegate is then invoked with the row of data and produces a value, which becomes the result of the projector, i.e. the value to set its corresponding destination element with. A constant in the projection therefore is a simple delegate which produces... that constant!
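
A minimal sketch of the idea (the names here are illustrative, not the actual LLBLGen Pro projection API):

```csharp
// Sketch of a projector which can carry a delegate. Names are made up
// for illustration; the real framework has more state and options.
public delegate object ValueProducer(object[] rowValues);

public class Projector
{
    // When set, this delegate produces the destination value from the row.
    public ValueProducer Producer { get; set; }
    // Fallback: a plain index into the row's values.
    public int SourceIndex { get; set; }

    public object GetValue(object[] rowValues)
    {
        return Producer != null ? Producer(rowValues) : rowValues[SourceIndex];
    }
}

// A constant in the projection then becomes a delegate which ignores the row:
// projector.Producer = rowValues => true;
```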

Creating delegates in Linq is rather easy: you create a LambdaExpression and call Compile on it. If you predefine the signature of the delegate with a delegate definition, you can produce a delegate of that type from the LambdaExpression by specifying the delegate type to the static Expression.Lambda method which creates the lambda; calling Compile on the result gives you a true delegate of that delegate type, which you can then invoke with the signature of the delegate type. This same trick is used for in-memory method/property calls as well.
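
For a constant like the 'true' in example B, the trick looks roughly like the sketch below ('ValueProducer' is a made-up delegate type for illustration):

```csharp
// Hypothetical sketch: compile a constant into a delegate with a
// predefined signature, using System.Linq.Expressions.
public delegate object ValueProducer(object[] rowValues);

ParameterExpression rowParam = Expression.Parameter(typeof(object[]), "rowValues");
ConstantExpression constant = Expression.Constant(true, typeof(object));

// Expression.Lambda with an explicit delegate type, then Compile:
LambdaExpression lambda = Expression.Lambda(typeof(ValueProducer), constant, rowParam);
ValueProducer producer = (ValueProducer)lambda.Compile();

object value = producer(new object[0]);    // produces the boxed 'true', ignoring the row
```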

Bugger 2: In-memory methods/property calls in projection
To be able to execute the in-memory method/property during the projection, we have to create a delegate out of it. With constants, we've seen that we can produce a lambda expression, compile it and get a delegate. There are a couple of issues though. One is that every method can have a different number of arguments, and if we want a delegate which executes fast, we have to avoid DynamicInvoke; this implies we have to solve the problem that we can't have a gazillion different delegate signatures to deal with. Another is that the elements of the query are still inside the lambda. These have to be replaced with references to parameters. Take our DaysInMonth example:

DaysInMonth = DateTime.DaysInMonth((int)o.OrderDate.Value.Year, (int)o.OrderDate.Value.Month)

Here we see two references to elements in the query: OrderDate.Year and OrderDate.Month (OrderDate is a nullable type, hence the 'Value' reference). Year and Month are properties mapped onto DB constructs, namely YEAR(date) and MONTH(date), but more on that later. We have to pull these elements out of the lambda, as otherwise they'd end up in the compiled delegate, and replace them with parameters instead. We could rewrite the method call as: DateTime.DaysInMonth(values[indexes[0]], values[indexes[1]]). Here, values is the set of values from the input and indexes is the array of indexes into values for the parameters. This is easily done with a special visitor-like handler which traverses the lambda subtree and rewrites elements, collecting the rewritten elements as it goes. This also gives us a standard delegate signature with two input parameters: a values array and an index array. All we need to do now is compute the proper indexes for each input parameter and store them with the projector, and we're done: the compiled lambda always has the same signature, so we can use the very fast delegate.Invoke(values, indexes) instead of DynamicInvoke, no matter how many parameters the in-memory method has.
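
To make this concrete, here's a sketch of what the rewritten lambda conceptually looks like after the visitor has replaced the query elements with parameter references. Again, the names are illustrative, not the provider's actual internals:

```csharp
// Standard two-parameter signature used for all compiled projection lambdas.
public delegate object ProjectionValueProducer(object[] values, int[] indexes);

ParameterExpression values = Expression.Parameter(typeof(object[]), "values");
ParameterExpression indexes = Expression.Parameter(typeof(int[]), "indexes");

// values[indexes[0]] and values[indexes[1]], unboxed back to int:
Expression yearArg = Expression.Convert(
    Expression.ArrayIndex(values, Expression.ArrayIndex(indexes, Expression.Constant(0))),
    typeof(int));
Expression monthArg = Expression.Convert(
    Expression.ArrayIndex(values, Expression.ArrayIndex(indexes, Expression.Constant(1))),
    typeof(int));

MethodCallExpression call = Expression.Call(
    typeof(DateTime).GetMethod("DaysInMonth"), yearArg, monthArg);

ProjectionValueProducer producer = (ProjectionValueProducer)Expression.Lambda(
    typeof(ProjectionValueProducer),
    Expression.Convert(call, typeof(object)), values, indexes).Compile();

// invoked per row with the row's values and the precomputed indexes:
object days = producer(new object[] { 2008, 4 }, new int[] { 0, 1 });  // 30
```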

This kind of delegate invoke is very fast: calling Invoke 10 million times takes less than 17ms. I find that acceptable, especially since calling DynamicInvoke 10 million times takes over 4 seconds! With this system, constants and in-memory method calls / property accesses in projections could be handled without problems, no matter how complex the calls would be (an in-memory method call which takes the result of a property access on another in-memory method call's result, etc.). There's a problem with object instantiation, but I'll cover that later in the post.
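
If you want to see the difference for yourself, a rough benchmark sketch looks like this (numbers will of course vary per machine):

```csharp
// Compare a direct Invoke on a known delegate type with DynamicInvoke.
// Uses System.Diagnostics.Stopwatch.
public delegate object ProjectionValueProducer(object[] values, int[] indexes);

ProjectionValueProducer producer = (values, indexes) => values[indexes[0]];
object[] row = { 42 };
int[] idx = { 0 };

Stopwatch sw = Stopwatch.StartNew();
for(int i = 0; i < 10000000; i++)
{
    producer(row, idx);                 // direct Invoke: fast
}
sw.Stop();
Console.WriteLine("Invoke: {0}ms", sw.ElapsedMilliseconds);

sw = Stopwatch.StartNew();
for(int i = 0; i < 10000000; i++)
{
    producer.DynamicInvoke(row, idx);   // late-bound: dramatically slower
}
sw.Stop();
Console.WriteLine("DynamicInvoke: {0}ms", sw.ElapsedMilliseconds);
```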

Mapping methods and properties to DB constructs
We already saw an example of this feature with Year and Month, and also in example A with the String.Compare() method call. How does this work, and how flexible is it? From the start I wanted this system to be as flexible and extensible as possible: developers have to be able to map their own DB functions to .NET methods / properties and should also be able to overrule existing mappings if they think their own construct works better (which is possible, see the string section below). The system built into Linq for LLBLGen Pro offers exactly that: a mapping feature for mapping .NET methods and properties to database constructs. So not only plain functions, but whole snippets of SQL.

LLBLGen Pro's DbFunctionCall query element offers a way to call a function which accepts a constant value as a parameter (e.g. DATEADD() requires a parameter which is a constant, but which is placed inside the parentheses, so you can't simply map 'DATEADD' to some other name). It does this by accepting a SQL snippet with placeholders for the parameters, where LLBLGen Pro's query engine will emit the parameter names for the input values of the function. This feature is used here as well.

By default, every dynamic query engine (DQE) for each database comes with a set of pre-defined mappings for a set of methods/properties of .NET classes onto database constructs. For example the String.Compare method call is mapped as follows in SqlServer:

// Compare(2)
_functionMappings.Add(new FunctionMapping(typeof(string), "Compare", 2, 
        "CASE WHEN {0} < {1} THEN -1 WHEN {0} = {1} THEN 0 ELSE 1 END"));

These function mappings are defined in a FunctionMappingStore object inside the DQE. A developer can also define his/her own function mappings in a custom FunctionMappingStore instance. The function mappings from the DQE used in the particular query are obtained automatically by the Linq provider, so the developer only has to pass in the mappings defined in the custom FunctionMappingStore instance. This custom FunctionMappingStore is placed in front of the default one, which means that if a mapping is found in the function mappings passed in by the developer, that mapping is used; otherwise the mapping found in the defaults is used. This offers the facility to overrule method/property mappings in code and on a query-by-query basis.

The following example shows how one defines a mapping and how to pass it to the query engine. We're going to map a .NET method called 'CalculateOrderTotal' to a function defined in the Northwind database called fn_CalculateOrderTotal. This function takes two parameters, orderID and a boolean to use discount prices or not.

// defining the method in .NET. This method is empty as it isn't called in .NET code.
public class NorthwindFunctions
{
    public static decimal CalculateOrderTotal(int orderId, bool useDiscounts)
    {
        return 0.0M; // never called
    }
}

// defining the mapping in our own function mapping store, which is a class in our
// own project or in a referenced assembly, whatever you prefer.
public class NorthwindFunctionMappings: FunctionMappingStore
{
    public NorthwindFunctionMappings()
    {
        // define the mapping. SQLServer 2000 needs the schema to be present for the 
        // function call, so we specify that as well.
        this.Add(new FunctionMapping(typeof(NorthwindFunctions), "CalculateOrderTotal", 2, 
                    "fn_CalculateOrderTotal({0}, {1})", "Northwind", "dbo"));
    }
}

As you can see, you can opt to specify a catalog and a schema as well. This is handy if the function you want to call is in another schema or catalog. LLBLGen Pro offers automatic catalog / schema name overwriting at runtime through a simple config file setting, so these hard-coded names don't limit flexibility.

Now that we have our mappings defined, we can use this method in our Linq queries:

using(DataAccessAdapter adapter = new DataAccessAdapter())
{
    LinqMetaData metaData = new LinqMetaData(adapter, new NorthwindFunctionMappings());
    var q = from o in metaData.Order
            where o.CustomerId == "CHOPS"
            select new 
            { 
                o.OrderId, 
                OrderTotal = NorthwindFunctions.CalculateOrderTotal(o.OrderId, true) 
            };
    int count = 0;
    decimal totalOfOrders = 0.0M;
    foreach(var v in q)
    {
        count++;
        totalOfOrders += v.OrderTotal;
    }
    Assert.AreEqual(8, count);
    Assert.AreEqual(21045.0M, totalOfOrders);
}

The LinqMetaData class accepts an optional FunctionMappingStore instance which is then used in the queries produced with that LinqMetaData instance. As LLBLGen Pro doesn't use XML files to store meta-data, nor does it have to reflect over an assembly, instantiating a LinqMetaData class is very fast; it doesn't have any overhead to deal with.

The query above looks almost the same as the ones in the first examples. I've left the DataAccessAdapter creation in the code snippet so you have a clear picture of the complete code needed to execute the query. The SQL produced by this query looks like:

SELECT    [LPLA_1].[OrderID] AS [OrderId], 
        [Northwind].[dbo].fn_CalculateOrderTotal([LPLA_1].[OrderID], @LO11) AS [OrderTotal] 
FROM     [Northwind].[dbo].[Orders] [LPLA_1]  
WHERE     ((((([LPLA_1].[CustomerID] = @CustomerId2)))))

// parameter definitions:
Parameter: @LO11 : Boolean. Length: 0. Precision: 0. Scale: 0. Direction: Input. Value: True.
Parameter: @CustomerId2 : StringFixedLength. Length: 5. Precision: 0. Scale: 0. Direction: Input. Value: "CHOPS".

With this system, you can define wicked pieces of SQL, map a .NET method onto them, and use them in your Linq queries very easily. As mentioned, you can overrule the predefined mappings as well. The main issue is with strings. Databases in most cases have various string types: some fixed length, others variable length. In SqlServer you have a couple of functions to produce the length of a string: DATALENGTH and LEN. The difference is that DATALENGTH includes trailing spaces, while LEN doesn't. LEN is thus ideal for (N)CHAR typed fields, DATALENGTH for variable length fields. Our mappings are built with LEN, though if you require mappings based on DATALENGTH, you can easily override them in your own mappings.
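
Such an override could look roughly like the sketch below, following the pattern of the Compare mapping shown earlier. Note that the member name and the parameter-count convention for instance members are assumptions on my part here:

```csharp
// Hypothetical override: map string length onto DATALENGTH instead of LEN.
// Assumes the instance the member is called on is passed as {0}.
public class MyStringFunctionMappings : FunctionMappingStore
{
    public MyStringFunctionMappings()
    {
        this.Add(new FunctionMapping(typeof(string), "Length", 1, "DATALENGTH({0})"));
    }
}

// pass it to the LinqMetaData, as shown earlier:
// LinqMetaData metaData = new LinqMetaData(adapter, new MyStringFunctionMappings());
```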

Booleans in projections
In earlier episodes of this series I said we wouldn't support booleans in projections, because SQL doesn't support booleans in projections either. However, with the function mapping system in place, it was very easy to implement this with a CASE construct, onto which we mapped a method of our own. This method isn't called by the developer; instead, the provider treats a boolean expression in the projection as a call to that method, which produces the proper construct. As the rest of the function mapping pipeline had already been implemented, it automatically works once we pull the proper mapping out of the FunctionMappingStore instances. Take this example:

// Boolean expression in projection
LinqMetaData metaData = new LinqMetaData(adapter);
var q = from c in metaData.Customer
        select new
        {
            c.CustomerId,
            IsAmerican = (c.Country == "USA"),
        };

The SQL query produced is:

SELECT    [LPLA_1].[CustomerID] AS [CustomerId], 
    CASE 
        WHEN [LPLA_1].[Country] = @LO13651f01 THEN 1 
        ELSE 0 
    END AS [LPFA_2] 
FROM [Northwind].[dbo].[Customers] [LPLA_1] 

// parameter definition:
Parameter: @LO13651f01 : String. Length: 3. Precision: 0. Scale: 0. Direction: Input. Value: "USA".

This does cause a bit of a problem though: the query returns '1', not 'true'. Our lambda-to-delegate trick helps here: we simply produce a lambda which converts the query value to a boolean by comparing it to 1, and we're done.
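
Conceptually, that conversion delegate is as trivial as this sketch:

```csharp
// Sketch of the post-fetch conversion: the query returns 1/0, and the
// projector's delegate turns it into a real boolean during materialization.
Func<object, bool> toBool = v => Convert.ToInt32(v) == 1;

bool isAmerican = toBool(1);   // true
```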

Lambda-based function usage
As the new versions of C# and VB.NET allow us to define simple delegates as Func instances using lambda expressions, why not use them inside a query? This results in a similar approach to the one for normal in-memory calls: compile the lambda to a delegate with a known signature and use that during projections. Here's an example:

// in-memory lambda function usage in projection.
Func<string, string> stringChopper = s => s.Substring(0, 3);

LinqMetaData metaData = new LinqMetaData(adapter);
var q = (from o in metaData.Order
         select new { FirstChars = stringChopper(o.Customer.CompanyName), 
                     o.Customer.CompanyName }).Distinct();

In this example, we define a simple lambda which performs a substring call on its input, and we use it in our projection. To illustrate that this function indeed runs in-memory, the query produced looks like:

SELECT DISTINCT [LPLA_2].[CompanyName] AS [FirstChars0], [LPLA_2].[CompanyName] 
FROM [Northwind].[dbo].[Customers] [LPLA_2]

Object creation inside projections
In the previous examples, you've seen that the projection of the query produces new, anonymously typed objects, one per result row. This is done by the 'new' operator, which during expression tree processing results in an object representing a projection onto an anonymous type. However, this approach broke down on the code I'll show you in a bit. It's about creating an object in the projection with 'new', where the result of a method call or property access on this new object is the result of the projection of each row.

// class used in example
public class Concatenator
{
    private string _val1, _val2;

    public Concatenator() {}

    public Concatenator(string val1, string val2)
    {
        _val1 = val1;
        _val2 = val2;
    }

    public string ConcatMembers()
    {
        return _val1 + _val2;
    }

    public static string Concat(string val1, string val2)
    {
        return new Concatenator(val1, val2).ConcatMembers();
    }

    public string ConcatValues(string val1, string val2)
    {
        _val1 = val1;
        _val2 = val2;
        return ConcatMembers();
    }

    public string Val1
    {
        get { return _val1; }
        set { _val1 = value; }
    }

    public string Val2
    {
        get { return _val2; }
        set { _val2 = value; }
    }
}

I know the class doesn't make any sense in real life, but it's a great way to illustrate the point. Consider the following query:

LinqMetaData metaData = new LinqMetaData(adapter);
var q = from c in metaData.Customer
        where c.Country != null
        select new Concatenator(c.Country, c.City).ConcatMembers();

This query doesn't produce a list of Concatenators; it produces a list of strings, namely the result of the call to ConcatMembers on every row. This means the 'new' expression seen in the tree has to be treated as an in-memory call and therefore has to be compiled into a lambda. This took me some time to get right, but it works now. I still have a little bug to fix with a property access instead of a method call, but that should be pretty straightforward.

These kinds of constructs blur the line between what's run inside the database and what's run in-memory more and more. A developer should be careful with these powerful features, but once understood, they open up a lot of power. To illustrate what can be done with this, I'd like to show you a combination of our Concatenator fellow shown above with a boolean subquery whose result is converted to a string. The subquery is fetched twice for the unittest to fulfill its duty.

LinqMetaData metaData = new LinqMetaData(adapter);
var q = from c in metaData.Customer
        where c.Country != null
        select new
        {
            c.City,
            Contains3 = (from o in c.Orders select o.EmployeeId).Contains(3),
            ConcatenatedResult = Concatenator.Concat(
                (from o in c.Orders select o.EmployeeId).Contains(3).ToString(), c.City)
        };

This one is particularly nasty, because the boolean value returned from Contains would normally be converted with a delegate from the int returned by the CASE statement in the SQL query to a real bool value. However, the ToString method following it is a method which has a mapping in the database. This means that unless something is done under the hood, we'll get '1' instead of 'True' in our results, as it's not possible to fit the in-memory delegate call between the CASE statement and the database construct onto which ToString is mapped. The solution appears to be simple: 'ToString', like any other method or property, can have a mapping per type, so we simply map ToString() for a boolean onto "CASE WHEN {0}=1 THEN 'True' ELSE 'False' END" and it works automatically: the system finds a mapping of the method call, in this case ToString, for the type, in this case System.Boolean, and uses that mapping.
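
In the style of the mappings shown earlier, such a mapping could look roughly like this (the parameter-count convention for instance methods is an assumption on my part; the instance is assumed to be passed as {0}):

```csharp
// Hypothetical: map Boolean.ToString() onto a CASE expression,
// added in the constructor of a custom FunctionMappingStore.
this.Add(new FunctionMapping(typeof(bool), "ToString", 1,
        "CASE WHEN {0}=1 THEN 'True' ELSE 'False' END"));
```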

Example C revisited
Let's look again at our example C, which had the DaysInMonth function in the where clause and therefore couldn't work, because DaysInMonth has no mapping to a database construct. But what if we mapped it to (*drumroll*): "DATEDIFF(d, CONVERT(datetime, '' + CONVERT(NVARCHAR(4), {0}) + '-' + CONVERT(NVARCHAR(4), {1}) + '-01'), DATEADD(m, 1, CONVERT(datetime, '' + CONVERT(NVARCHAR(4), {0}) + '-' + CONVERT(NVARCHAR(4), {1}) + '-01')))"? That will result in the proper number of days in the month specified by the two parameters, which will be placed on {0} and {1} respectively. Let's see if that works:

// we add the mapping to our NorthwindFunctionMappings class ctor
// DaysInMonth(2)
this.Add(new FunctionMapping(typeof(DateTime), "DaysInMonth", 2, 
        "DATEDIFF(d, CONVERT(datetime, '' + CONVERT(NVARCHAR(4), {0}) + '-' + " +
        "CONVERT(NVARCHAR(4), {1}) + '-01'), DATEADD(m, 1, CONVERT(datetime, '' + " +
        "CONVERT(NVARCHAR(4), {0}) + '-' + CONVERT(NVARCHAR(4), {1}) + '-01')))"));
// after that we can create a test to see if this indeed works!
[Test]
public void CustomMappedVersionOfDaysInMonthUsageInWhereClauseTest()
{
    using(DataAccessAdapter adapter = new DataAccessAdapter())
    {
        LinqMetaData metaData = new LinqMetaData(adapter, new NorthwindFunctionMappings());
        var q = from o in metaData.Order
                where DateTime.DaysInMonth(o.OrderDate.Value.Year, o.OrderDate.Value.Month) == 30
                select o;

        int count = 0;
        foreach(var v in q)
        {
            count++;
            Assert.IsTrue(v.OrderDate.HasValue && 
                   (DateTime.DaysInMonth(v.OrderDate.Value.Year, 
                         v.OrderDate.Value.Month) == 30));
        }
        Assert.AreEqual(252, count);
    }
}

Which succeeds. As you can see, you can add your own mappings to overcome roadblocks on your path to success.

Linq to Sql odd-ness
It keeps surprising me how uneven the quality of the various parts of Linq to Sql is, as if multiple teams worked on those parts and one team simply dropped the ball on a lot of its tasks. While Linq to Sql is often capable of producing a proper, optimized query even if the query is very complex and awkward, it sometimes drops the ball on simple things. Take for example left-shift and right-shift operator support. Back in the days when I was hammering out MC68000 assembler on my Amiga 500, we demosceners knew that to multiply by 2, it was faster to add the value to itself than to shift 1 bit; shifting became faster if you had to multiply by a value larger than 4. Shifting is equivalent to multiplying by a power of 2 and vice versa. So shifting a value v n bits to the left can be mapped onto (v * POWER(2, n)), and shifting to the right onto (v / POWER(2, n)). Linq to Sql can't deal with left/right shift operators in code. As a programmer I find that a little odd, as bit operations and bit logic are part of the first year of every CS course. It's odd as well because they otherwise have mapped a lot of operators and methods onto SQL constructs.
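
The equivalence itself is easy to check in C# (with a caveat for negative values, which a real mapping would have to take into account):

```csharp
// Shift <-> power-of-2 arithmetic equivalence the SQL mapping relies on.
int v = 5;
int n = 3;

// left shift == multiply by 2^n
bool leftOk = (v << n) == v * (int)Math.Pow(2, n);

// right shift == divide by 2^n, for non-negative values. For negative
// values, an arithmetic right shift rounds towards negative infinity
// while integer division truncates towards zero, so the two differ.
bool rightOk = (v >> n) == v / (int)Math.Pow(2, n);
```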

One could say that bitshifts aren't that important, but sometimes values are packed in bitfields in the database. Shifting is important in that case.

Another one is the lack of support for tuple matching with Contains and Exists. Granted, this is a bit fragile, because it can go wrong with properties not being there; however, in the cases where these things are properly set up, tuple matching is very handy. Consider the following query:

LinqMetaData metaData = new LinqMetaData(adapter);
var q = (from c in metaData.Customer
     select new Pair<string, string>() { Value1 = c.Country, Value2 = c.City })
            .Contains(new Pair<string, string>("USA", "New York"));

Pair is a simple generic bucket class with two properties. The thing is: this query is actually not a clean-room testcase but a real-life usable query. Admittedly, you could also write this as a fetch and then execute a Count() over it, but you can do that with all Contains calls.

Perhaps they'll add it in the next version.

What's next?
After the bug with the property access on an in-memory object created in the projection has been fixed, the next thing to do will be defining the mappings for the functions for databases other than SqlServer. After that I want to start with the last part of the provider: hierarchical fetches. LLBLGen Pro has a powerful hierarchical fetch plan system called prefetch paths, and the design to include this in Linq queries has already been done, so writing the code should be 'fairly' straightforward (as the core of the code is already there inside the LLBLGen Pro framework). The hierarchical fetches which are less simple are the ones nested in projections. Linq to Sql delegates these queries to be executed when the property is accessed, which results in a lot of queries and slow execution. With our hierarchical fetch code already in place (so all the merging logic etc. is already written), it should be doable to connect the hierarchical query with this code and do hierarchical fetches of nested queries in a projection with one query per node per tree branch. See this example:

var q = from c in metaData.Customer
        select new
        {
            Name = c.ContactName,
            Orders = from o in metaData.Order
                     where o.CustomerId == c.CustomerId
                     select o
        };

This query results in (1 + number of customers) queries in Linq to Sql, though it's doable with just 2 queries. The trick is that there's a correlation relation inside the nested query, which can be used to tie the two queries together to drive the merge logic. But that's for the next episode of this series, which hopefully will be the last one. Stay tuned!
