Developing Linq to LLBLGen Pro, part 11

Tuesday, January 15, 2008

(This is part of an on-going series of articles, started here)

In the previous post in this series, I mentioned that I had completed the work on all the major parts of a SELECT query. SELECT is what a Linq provider is all about (as Linq queries are focussed on fetching data, not manipulating it). It was sort of a milestone for me, it gave the feeling that most of the work was done. What could possibly be the work in the rest of the Queryable extension methods, compared to nasties like GroupJoin and GroupBy / Aggregates?

All / Any
So I started with the top of the list of extension methods of the Queryable class and started implementing the ones which weren't implemented / supported yet in the code. The first one, Aggregate, could be skipped as it's unclear to me till today how on earth to even use it. I couldn't produce a single compilable piece of code which could mimic for example an aggregate function like Sum, and neither could the people I asked, so I skipped it.

Next was All. All is the opposite of Any, implementation wise: All is equal to a NOT EXISTS(query on source with negated predicate operand).Any is equal to EXISTS(query on source with predicate operand). So you can write one routine to handle them both. As with all EXISTS queries, you've to make sure you fold in the correlation relation which ties the two queries (outer query and EXISTS query) together as a predicate in the subquery. LLBLGen Pro doesn't support boolean values in projections, as databases in general don't support booleans in projections, so All() and Any() without parameters as the final methods in the query aren't supported. Using All() and Any() inside a Where clause however is of course supported.

Cast
We already covered Average when aggregates were implemented so let's move on to Cast. Cast is a silly method more or less: it's there to make the query compilable. It's not really necessary in the query. Let's look at an example.

// LLBLGen Pro Linq example
var q = from e in metaData.Employee.Where(e => e is BoardMemberEntity).Cast<BoardMemberEntity>()
	select e.CompanyCarId;

Employee, Manager and BoardMember are in an inheritance hierarchy, where each type is mapped onto its own table or view (the inheritance type Linq to Sql doesn't support). I know this particular set of types isn't worlds best example of explaining inheritance, but still it offers possibilities to test things out. BoardMember for example is the only entity which has a relation with CompanyCar. The query above fetches all CompanyCarId values from all BoardMember employees. The Where clause adds a type filter for the BoardMemberEntity type, so 'e' will only be a BoardMemberEntity. LLBLGen Pro supports inheritance since v1.0.2005.1 and has an easy facility to add a type filter so adding support for this wasn't hard. It comes down to implementing support for TypeBinaryExpression nodes in the tree and to make sure Cast worked, these TypeBinaryExpressions were required, so I added support for them first.

In the query above, the Cast call is actually not required, as the type filter on BoardMemberEntity already makes sure that the fetch will only return BoardMemberEntity instances (or instances of subtypes of BoardMemberEntity, if any). However, if you remove it, the code won't compile as Employee doesn't have a field 'CompanyCarId', as that's a field only present in its subtype, BoardMember. The Cast is therefore solely for the compiler and can be skipped completely. That is, if you've solved the polymorphic entity fetch problem elsewhere of course .

Contains
LLBLGen Pro doesn't support UNION queries, so we can skip Concat. The main problem is that LLBLGen Pro doesn't work with Query objects, but with methods which perform work for you, based on the parameters passed in. Not having Union isn't a big problem, one can always work around this without much effort. The method which is way more interesting is Contains. For everyone writing a Linq provider: Contains will tie your hands to the keyboard for at least a week or so, so be sure you calculate this in.

At first it looks pretty easy, Queryable.Contains(operand) can't be that bad. However, in the expression tree, you'll see a call to 'Contains', and then have to check which 'Contains' it is: Queryable.Contains, IList.Contains, String.Contains, EntityCollection.Contains etc. Every one of them results in a different query or query fragment. As if that's not enough, 'operand' can be a query too, but also an entity already in scope, a constant, a field etc. Yes, Contains isn't something you'll implement on a Sunday afternoon.

Let me first show you two queries which illustrate another problem I ran into with Contains:

// Query 1
var q = from c in metaData.Customer
	where new List<string>() { "USA", "UK" }.Contains(c.Country)
	select c;
	
// Query 2
var q = from c in metaData.Customer
	where someInMemoryList.Where(v=>v.StartsWith("A")).Contains(c.Country)
	select c;

What's the problem? Well, to process the expression tree into a SQL query, we've to execute parts of it as if it was C# code! The first query initializes a new List instance with two strings. This statement will appear as a ListInit in your expression tree, together with some friends to make things complicated. The second query shows a combination of Linq to Objects with a Linq query which will be executed on the database. The Linq to Objects snippet, namely someInMemoryList.Where(v=>v.StartsWith("A")) is a query on its own, it's not meant to be ran on the database, as it operates on an in-memory IEnumerable object. My personal opinion is that this kind of Linq voodoo should be avoided, because it's complicated to understand which parts of a Linq query will be ran on the database and which parts won't. But the world isn't perfect so there will be people who have to or want to create these kind of queries and thus support for this has to be added.

Funky Tizers
As this can become rather complex to handle, how to cut out the expression elements which have to be ran locally and evaluate them locally as well? Well, it's actually quite simple: you have to traverse the tree and every expression which has parameters which aren't referring to another element is a candidate. After these expressions are identified, you compile them using Expression.Compile with a trick: you make ()=>expression lambdas out of them and compile these and run these by invoking them. The result of the invoke is the result of the expression. Microsoft has dubbed this Funcletization. In .NET 3.5 you can find an internal class which does the job for you more or less, however it's internal (Duh!, you thought MS would make things easy for you? ), and also it's best to check what it does and rewrite it yourself so it matches your needs: The last thing you need is that it finds false positives. You therefor have to tailor the process to the provider you're writing. Doing a Google search on 'Funcletize' will give you some info about it. It's a key component of any Linq provider targeting a database, as otherwise it will run the risk of crashing on simple in-memory evaluations inside the Linq query, or trying to convert Linq to Objects elements to SQL.

So after I had figured out how to convert all expression nodes which could be evaluated locally into expression objects which were processed internally outside the query, I could go back to Contains. Implementing Contains is rather complex because you have to deal with the situation where the source (the element Contains works on) can be a query but also the operand (the element which is checked to see if it's in the source). Contains is also a vital part of the Linq query system, so it's key that you implement all possible uses. I'll enlist a couple below so you can understand the complexity of Contains a bit .

// Query 1, simple entity check in entity list
var q = from c in metaData.Customer
    where c.Orders.Where(o=>o.EmployeeId==3).Contains(order)
    select c;
        
// Query 2, operand is entity which is result of query
var q = from c in metaData.Customer
    where c.Orders.Contains(
        (from o in metaData.Order where o.EmployeeId == 2 select o).First())
    select c;        
    
// Query 3, operand and source are both queries.
var q = from c in metaData.Customer
    where c.Orders.Where(o => o.EmployeeId == 2).Contains(
        (from o in metaData.Order where o.EmployeeId == 2 select o).First())
    select c;
    
// Query 4, constant compare with value from query. Yes this is different.
var q = from c in metaData.Customer
    where c.Orders.Where(o => o.EmployeeId > 3).Select(o => o.ShipVia).Contains(2)
    select c;
    
// Query 5, check if a constant tuple is in the result of a query
var q = from c in metaData.Customer
    where c.Orders.Select(oc => new { EID = oc.EmployeeId, CID = oc.CustomerId }).Contains(
        new { EID = (int?)1, CID = "CHOPS" })
    select c;

// Query 6, idem as 5 but now compare with a tuple created with a query
var q = from c in metaData.Customer
    where c.Orders.Select(oc => new { EID = oc.EmployeeId, CID = oc.CustomerId }).Contains(
        (from o in metaData.Order where o.CustomerId == "CHOPS" 
            select new { EID = o.EmployeeId, CID = o.CustomerId }).First())
    select c;

// Query 7, checking if the value of a field in an entity is in a list of constants
List<string> countries = new List<string>() { "USA", "UK" };
var q = from c in metaData.Customer
    where countries.Contains(c.Country)
    select c;
    
// Query 8, as 7 but now with an IEnumerable
LinkedList<string> countries = new LinkedList<string>(new string[] { "USA", "UK"});
var q = from c in metaData.Customer
    where countries.Contains(c.Country)
    select c;
    
// Query 9, combination of 2 queries where the first is merged with the second and
// only the second is executed. (this is one of the reasons why you have to write 
// your own Funcletizer code.
var q1 = (from c in metaData.Customer
    select c.Country).Distinct();
var q2 = from c in metaData.Customer
    where q1.Contains(c.Country)
    select c;
    
// Query 10, as 7 but now with an array obtained from another array.
string[][] countries = new string[1][] { new string[] { "USA", "UK" } };
var q = from c in metaData.Customer
    where countries[0].Contains(c.Country)
    select c;
    
// Query 11, the grand finale ;)
List<Pair<string, string>> countryCities = new List<Pair<string, string>>();
countryCities.Add(new Pair<string, string>("USA", "Portland"));
countryCities.Add(new Pair<string, string>("Brazil", "Sao Paulo"));

// now fetch all customers which have a tuple of country/city in the list of countryCities.
var q = from c in metaData.Customer
    where countryCities.Contains((from c2 in metaData.Customer
          where c2.CustomerId == c.CustomerId
          select new Pair<string, string>() 
            { Value1 = c2.Country, Value2 = c2.City }).First())
        select c;

And then I skipped a couple as well, like string-based Contains calls. As String.Contains() required handling a Like predicate, I also implemented String.StartsWith and String.EndsWith in one go. The rest of the string methods will be covered by the DB Function mapping system we'll introduce, and which will allow developers to create their own extension methods and map them onto a DB function. String.Contains/String.StartsWith and String.EndsWith can for now only handle constant strings, similar to Linq to Sql, but perhaps if customers require it, support for operands which are queries will be added as well.

Now that this is done, we can look at the next method to implement, which will be ElementAt(n). Interestingly, Linq to Sql doesn't support ElementAt(n), while it just comes down to Skip(n-1).Take(1), or to use our own code, TakePage(n, 1). It's interesting because Microsoft did go to great lengths to make Linq to Sql be able to handle whatever query you throw at it, however that they don't support ElementAt(n) suggests that it actually is a problematic method, but it's at this point unclear to me why. But there's only one way to find out .

Roger, understood, however I think it has no real value in DB terms. the example in MSDN with Aggregate used on a set of strings is OK, but also not convertable to DB statements, like how to write Sum, or Max in Aggregate statements... But perhaps I miss something :)

FransBouma - Tuesday, January 15, 2008 7:51:43 PM

The LINQ Aggregate query operator is a catch-all operator that you can use to define your own aggregation algorithm. It has no translation into SQL. This operator is different than the VB Aggregate keyword. The VB keyword denotes a clause where you define your aggregate expressions, which are all predefined aggregates. VB translates this into a GroupBy method call.

Matt Warren - Friday, January 18, 2008 6:21:49 AM

Nice. I've been working on a Linq To Sql translator of my own but having read your article I see room for optimisation of how calls to ANY, ALL and DISTINCT are handled.

Jahmani Mwaura - Friday, January 25, 2008 11:31:37 AM

3 Comments