September 2007 - Posts
(This is part of an on-going series of articles, started here)
I hear you thinking... "In the Netherlands, days apparently have 168 hours"
. Well... no (really?). Today is officially my 4th day I work on Linq support. The past couple of days since the previous post I've been working on Derived Table support in LLBLGen Pro, so Linq queries are creatable without excessive re-formulating of the expression tree (for what this is all about, see the post on day 2)
Did that take 7 days? Yes, more or less. Of course, writing the final design and writing the final code didn't take a whole week, perhaps a couple of hours. The main thing is that adding such a feature which potentially affects a big deal of the inner workings of an O/R mapper can be challenging. As I discussed in the post on Day 3, LLBLGen Pro's query API has some nice things like it can figure a lot of things out for you, so you don't have to specify them so you avoid potential bugs. The downside is that it can sometimes be a challenge for the inner code if the query API has to figure out something which isn't specified. Adding Derived Table support brought one problem to the table: the select list fields have to be able to use a source alias to be able to make a projection in SQL from the Derived Table data. LLBLGen Pro's API didn't allow you to specify this alias as it was unnecessary to do so before, till now.
Changing an O/R mapper's core isn't about firing up VS.NET and hammering out unit tests and altering code here and there. It's about defining a couple of things before even looking at the code:
- What the feature is all about. This means: which changes are to be expected in the public API. This can be tested up-front with mocks and unittests. This is the easy part and is often over before you know it.
- Which parts of the design are involved in the API elements which will be changed? This requires the developer to look at the design of the whole O/R mapper pipeline to see where things happen, why they happen there and which elements will be affected if the API change will be applied to the codebase
- Which steps to take to migrate the codebase to the version with the new feature/change. This step involves careful planning what to change and why. This step takes the main chunk of the time. The main issue is that an API change in an O/R mapper is often affecting multiple elements in the system, and if you're unlucky elements on all levels of the pipeline. To be successful in this, you have to know up front what you're going to change. The reason is simple: if you don't know what you're going to change, the chances are you'll change a lot more than you initially might have expected, and in the code editor it's not always possible to overview the scope of a code change you've made right there in the editor, as the change might affect a lot of call chains.
The Agile troopers might call me a lame duck now, because they'll say they have the unit-tests to proof they made the right changes. However this is pretty meaningless in a system with long call chains like in an O/R mapper, which has several sub-systems, all involved in the complete process. Because of these call chains, the number of different call chains you can have is practically unlimited. This means that there are never enough unit-tests to test everything: even with 100% code coverage, you can never be sure the system works, simply because the number of scopes the system can be used in are unlimited (think of all the possible different queries one can write in SQL for example). You need different methods to be sure your code works: proof that your code is correct. Sure, unit-tests are necessary, and we have a truckload of them. They might spot a change you shouldn't have made. But they also might not: a unittest only tests what is in the unit-test. Testing a routine is OK, however testing a whole call chain of routines to be correct, requires for every use-case they might be used in a different test. Can you imagine how many tests that will be? That's right, an unlimited number of tests.
It's key that the changes to apply are known when you will apply them. This makes the whole migration process deterministic, and that's preferable, because you then know what the end result will be. So, we thought we had a solution to our problem, we defined the steps to take, which code to alter, and altered the code. Tests were run. They succeeded. All is well, you might think? Well.... not quite. As discussed previously, our inner code needed to find the alias to use for a field in the select list as that alias wasn't specified. However, there are queries thinkable where this is impossible because there is ambiguity: which column to pick from the total set of columns in the FROM clause of the SELECT statement if the column appears twice? So, after 5 days or so of work, which looked like they delivered the right code, we had to roll back our changes and start over. Our unit-tests proved our code was OK, but as I described above: there are so many many different queries thinkable which all will affect their own order of events inside the O/R mapper, it's undoable to write for every one of them a couple of unit-tests. So you write the tests for the more common ones and proof that the code is correct for the rest. Trying to proof the code was correct made me ran into the problem of having multiple times the same column in the FROM clause.
Is proving that the code is correct hard? No. It's sometimes tedious work, but it's not hard. It comes down to two things: a) reading the code written and checking if that code is indeed correctly implemented and b) proving that the algorithms used are the right ones for the job.
This last part might be hard, but it's not that hard. The trick is to think in 'What will make this algorithm do the wrong thing'. When you think about O/R mapper query pipelines, it's obviously in the line of 'what will make this query pipeline deliver the wrong results'. Often developers are solely focussed on 'what will make this algorithm/code do the right thing', but that's often much harder to do, as 'the right thing' might be a solution space with an unlimited number of elements. So I looked for queries which could break what we had designed. It took a while but we found at least one (in this case it's not that obvious as it might look like
). As soon as you find a way to break your design, go back to the design and reason why the design breaks. One of the biggest mistakes you can make is to 'patch' the design for the situation you found that breaks it, because what if there's another way to break it as well? If you reason why the design breaks, you will often realize what the real flaw is and will come up with the right solution. It's like scientific proof: scientific proof isn't focussed on proving that something is correct, it's focussed on proving that something can't be wrong.
It turned out what we wanted wasn't possible: the query we found was proof that our solution to the alias problem was broken. So we had to alter our design so that the developer has to specify the alias for the select. Let's go back to our migration plan step. Because this migration plan was thought out, we knew what we had to roll back and what we had to alter instead to make it reflect our design. The fun part is: this design is provable to be correct in all situations. All that's left after that, is manual labor: typing code in an editor: making the code changes and writing the tests to see if the code changes were made correctly. Mind you: the tests aren't meant to proof the design works as expected: as said before, that's impossible in code because there are an unlimited number of different call chains. The tests are meant to proof that the code works OK, i.e. if we typed it in correctly. Confusing? Perhaps I wrote it down in a not that clear way, but what I wanted to explain was that the code isn't the design, the code is the result of the design. If the design has flaws, you can test the code to be correct, and it will be correct with respect to the design, but will still be wrong with respect to reality.
Let me show the final code how Derived Tables will work in LLBLGen Pro v2.6 (the release with Linq support).
// C#
// first specify the elements in the derived table select (which is a dyn. list)
// these are: fields list, filter, relations, group by, sort expression, limiter and
// the alias for the table. No paging.
ResultsetFields dtFields = new ResultsetFields(2);
dtFields.DefineField(OrderDetailFields.OrderId, 0);
dtFields.DefineField(new EntityField2("Total",
(OrderDetailFields.Quantity * OrderDetailFields.UnitPrice), AggregateFunction.Sum), 1);
GroupByCollection dtGroupBy = new GroupByCollection(dtFields[0]);
DerivedTableDefinition dtDefinition = new DerivedTableDefinition(dtFields,
"OrderDetailTotals", null, dtGroupBy);
// then specify the relation between entity and derived table, in code.
DynamicRelation relation = new DynamicRelation(dtDefinition, JoinHint.Inner,
EntityType.OrderEntity, "O",
(new EntityField2(OrderDetailFieldIndex.OrderId.ToString(), "OrderDetailTotals") ==
OrderFields.OrderId.SetObjectAlias("O")));
// then specify the rest of the query elements
RelationPredicateBucket filter = new RelationPredicateBucket();
filter.Relations.Add(relation);
// alias for the fields which will be used for the select list to fetch orders.
filter.SelectListAlias = "O";
filter.PredicateExpression.Add(new EntityField2("Total", "OrderDetailTotals") > 5000);
// then fetch the data
EntityCollection<OrderEntity> orders = new EntityCollection<OrderEntity>();
using(DataAccessAdapter adapter = new DataAccessAdapter())
{
adapter.FetchEntityCollection(orders, filter);
}
Assert.AreEqual(38, orders.Count);
Now, this might look a little cumbersome, with the relation defined in code. However, it's a way to define a relation between unrelated elements in a typed way, compile time checked (except for the derived table alias of course). The key was the filter.SelectListAlias. I had to make just 1 tiny breaking change to make this all work, a change hardly anyone will run into. Defining the relation in code isn't normally necessary: LLBLGen Pro generates the relations for you, so it's just picking a property on a static class, however with derived tables joined to entities, these aren't generated up front of course so the relation has to be written out in code. The upside is that as it's an object structure, everything is very flexible and there aren't any limits on how the derived table looks like, how many relations are in the join list etc.
Linq providers and the Visitor Pattern
So, now these Derived Table fellows are on board, we can again start focusing on Linq code! After setting up the projects properly, I was ready to dig deeper into Linq territory. It turned out that, no surprise there, the key object to focus on is the IQueryProvider implementing class. Some sites already reported that the pattern to use is the Visitor pattern. The visitor pattern is a pattern which is normally waved away with the words "I understand all the GoF patterns and their usage, except this visitor pattern, I never saw a use case scenario for it". The main reason for this is of course that the scenario to use it in is uncommon for most of the developers out there: traversing trees with nodes of various types where both the node type and the visitor (the node traverser of the tree) are under control of the developer. Linq's Expression trees however seem like the perfect candidate for the Visitor pattern.
Looking at the pattern as defined in the GoF book, you'll quickly see that what should be used with Expression trees isn't really the visitor pattern: the visitor isn't passed to the node: the dispatching of which handler to call based on the type of the node is outside the tree, not inside the node. This seems odd at first and one could quickly dive into the seemingly infinite "Microsoft again didn't understand a pattern and cooked up its own hocuspocus-pattern instead" way of thinking. However that's not really fair: Microsoft couldn't do any other thing here: the thing is: if they add a new node type, the main visitor class also has to be adjusted, but that class isn't Microsoft's, its the class of the provider writer. So to avoid this problem, one could argue that Microsoft should have provided a generic Visitor class so it would be their class. They decided not to do that apparently so the only other way is to force the provider writer to have a dispatch routine outside the nodes.
The outside dispatcher has to check which Expression type it deals with and call the right handler. Now, MS has created several Expression classes so you could use polymorphism here. However, there are more expression tree node types than there are expression class types: the BinaryExpression for example handles a lot of node types: all of them are BinaryExpression instances. So what should be done instead is having a big switch statement on the node type and call the right handler routine for the expression class. Big switch statements are often a red flag, a code smell.
The reason for this is: the expression node type is an enum. This means that if you want to invent your own Expression classes, you can't extend the mechanism used here, as you can't inherit from an enum to extend it. I don't quite understand why they went the enum route instead of creating subclasses of the main expression classes for the expression node types, as that would have made the system more extensible. However how many people will write their own Expression classes? Not that many I think (only provider writers). So for this, one could argue it doesn't really matter much.
So I'm currently looking at a nice switch statement which calls handler routines, all that's left is filling in those handler routines. The fun can finally begin!
(This is part of an on-going series of articles, started here)
In the previous post in this series, I discussed the problem of a select with aggregates using derived tables vs. a query which resulted in the same resultset but used a group by clause and my problem with formulating that simple group by query using Linq. A clever reader, Justin Etheredge, solved that mistery with the following Linq query:
// C#
var q = from c in nw.Customers
join o in nw.Orders on c.CustomerId equals o.Customerid
join od in nw.Order_Details on o.OrderId equals od.OrderId
group od by c.CustomerId into groups
select new { TotalAmount = groups.Sum(od => od.Quantity * od.UnitPrice) };
The problem I had was that the set which was the result of the two joins doesn't have a name, so how do you specify that you want to group that set? It turns out, you don't. You simply tell Linq to group one of the entities in the set, and you can specify any field in the total set of the join to group on. I have to say: that's not really intuitive, and I can imagine that grouping on multiple fields from multiple entities will cause another headache, but let's not get overboard here
We decided it was best for LLBLGen Pro that our API would get support for derived tables, so our Linq provider will have it a bit easier to convert the Linq expression tree to our query elements. Furthermore, it fills a little gap we still had with respect to SQL coverage with our API. SQL coverage is the amount of SQL constructs you can define in the O/R mapper specific API. If the O/R mapper has a low coverage, the chance is high you have to revert to stored procedures or hard-coded SQL strings inside your application to perform a given query. As soon as that happens, you have to maintain two different paradigms to use the database in your application which can lead to problems in the long run. Derived tables are necessary for a small group of queries, like consumption of aggregated sets inside the main query, not your every day SQL stuff, but for some reports or other edge-case queries, it can be helpful if derived table support is there, so by adding support for it, we're squashing two flies with one stone. With derived tables we're pretty close to total SQL coverage: the only things we don't support after that are table / join hints (other than NOLOCK), UNION and the OLAP statements.
Software engineering all the way
Simply adding support for this isn't tic-tac-toe: our API is well established and has been used in many thousands of software projects all around the world in the last 4 years. Simply changing it by refactoring the public API isn't going to give you a lot of happy users because they'll have to refactor their code because of that as well before they can upgrade to this new version. So we're talking about major maintenance here which is the heart of the Software Engineering field. I'd like to refer to the various articles written about this topic by my good friend Jeroen van den Bos at his blog at http://jvdb.org/blog for the details about maintainability of software and how it affects our every day software engineering work.
There are two different scenario's in our case: a) we're fetching entities and the source for the entities is a derived table or a joined set with one or more derived tables and b) we're fetching a list of fields from one or more entities (a list). The scenario b) isn't the problem, everything has to be specified in that scenario anyway. However for scenario a) there's a difficulty. Let's illustrate this with an example, using our Adapter paradigm (we also support another paradigm called SelfServicing which has a different API). Say I want to fetch all Customer entities which have an order which was filed by employee with ID 2.
// C#
EntityCollection<CustomerEntity> customers = new EntityCollection<CustomerEntity>();
RelationPredicateBucket filter = new RelationPredicateBucket();
filter.Relations.Add(CustomerEntity.Relations.OrderEntityUsingCustomerId);
filter.PredicateExpression.Add(OrderFields.EmployeeId==2);
using(DataAccessAdapter adapter=new DataAccessAdapter())
{
adapter.FetchEntityCollection(customers, filter);
}
The routine which fetches the customers is 'FetchEntityCollection', from the database belonging to the adapter object the method is called on. What it has to fetch and from which table(s) / view(s) is specified by the target of the fetch, the collection. The filter defines the set boundaries. The reason this is done this way is that it leads to less errors: you can't specify fields from a table you're not targeting, you can't forget an alias etc. etc. The FetchEntityCollection has a couple of overloads which accepts more objects like excluded fields, a sort expression, prefetch path etc. The API is simply build like you would do it by hand: you have an object which is the adapter to your database to target and you call the method, an action, to do what you want it to do for you: fetch things and you specify the parameters like with any other method: as parameters of the method.
It has a downside however: you can't specify an alias for the entities to fetch. This isn't designed in, because it's not needed. That is, up till now: with a derived table, the fields in the SELECT clause have to target a derived table instead of a table / view in the database, if the source of the data is a derived table and not the table(s) / view(s) the entity is mapped on. So this alias has to be determined automatically, or you have to specify the alias somewhere. The latter would break the code or would require a group of new overloads of FetchEntityCollection. Not that great. However, automatic alias determination isn't a picknick either. Sure, with one derived table, it's OK, however what if the join is deep and there are many derived tables? Fortunately, we found a way to do so, with a pre-process step which is completely transparent to the outside code. This way the existing code and method signatures can be used without problems. So far. We still have to solve the problem with how to specify a derived table.
LLBLGen Pro doesn't use a string-based query system, but a set of different objects which together define the information used by the fetch logic. In the simple example above, you see it in action abit: it adds the relation object between customer.CustomerID and order.CustomerID to the relations collection in the filter bucket and it defines a predicate object and together with the relation, it passes that to the fetch method. The fun thing is that you can write your own predicate classes and combine them in the filter if you like. At runtime they're then evaluated and the object gets a call to produce its SQL fragment using DB specific producers for aliases, fields, functions etc. it gets injected at runtime. This system is flexible and also compile-time checked. However, where to fit in a derived table definition? Because the relation objects have to be compile time checked as well, they're generated into the code as properties, as illustrated in the example above. Though the derived table is specified in code by the developer, and has to be usable inside join expressions with existing generated relations. Oh, and changing any of this must not break any existing code in production so the changed API is backwards compatible with what's already there and can offer the flexibility of adding a derived table or tables.
So refactoring the public API is out of the question. We can add new code though, which behind the scenes could lead to the same core: the collection of relation objects will be responsible of producing the FROM segment of the select, so if we can make that utilize the derived tables and also manage to make it join derived tables with tables / views, we're set!
How to approach this? Well, use a page from the Agile books: write the code first how you want to write it if the API was developed by someone else. The thing we should look at is: how can we define a relation object in code (like the generated code does it too, btw) so we can add it to the relation collection passed to the fetch method and by doing that also automatically make sure a derived table is joinable to tables/views which are the target of entities? A normal LLBLGen Pro EntityRelation object is focussed on fields: it defines the relation between attributes from one side and from the other side, one is the PK side and one is the FK side, exactly how you would define an FK constraint in the database. However, a derived table isn't a field, so we need a different class. Sure we can refactor the interface of the EntityRelation class, but that too will break code or make it look like an object which is a combination of several things (and that's also considered a Bad Thingtm)
Let's first look at a mockup of the code we think is reasonable for this:
// C#
// first specify the elements in the derived table select (which is a dyn. list)
// these are: fields list, filter, relations, group by, sort expression, limiter and
// the alias for the table. No paging.
ResultsetFields dtFields = new ResultsetFields(2);
dtFields.DefineField(OrderDetailFields.OrderId, 0);
dtFields.DefineField(new EntityField2("Total",
(OrderDetailFields.Quantity * OrderDetailFields.UnitPrice),
AggregateFunction.Sum), 1);
GroupByCollection dtGroupBy = new GroupByCollection(dtFields[0]);
DerivedTableDefinition dtDefinition =
new DerivedTableDefinition(dtFields, "OrderDetailTotals", null, dtGroupBy);
// then specify the relation.
DynamicRelation relation =
new DynamicRelation(dtDefinition, JoinHint.Inner, EntityType.OrderEntity, string.Empty,
(new EntityField2("Total", null).SetObjectAlias("OrderDetailTotals") == OrderFields.OrderId));
// then specify the rest of the query elements
RelationPredicateBucket filter = new RelationPredicateBucket();
filter.Relations.Add(relation);
filter.PredicateExpression.Add(
new EntityField2("Total", null).SetObjectAlias("OrderDetailTotals") > 5000);
// then fetch the data
EntityCollection<OrderEntity> orders = new EntityCollection<OrderEntity>();
using(DataAccessAdapter adapter = new DataAccessAdapter())
{
adapter.FetchEntityCollection(orders, filter);
}
Assert.AreEqual(38, orders.Count);
We need to define a class which holds the elements for the derived table: the DerivedTableDefinition. We then simply define a DynamicRelation. This is a different type of relation object, namely a relation between two sides, left and right, which can be different things than fields from entities and which should be joined together based on the expression specified. This can be: derived table join entity targets (views, tables), derived table join derived table or entity targets join entity targets (a right join is equal to the swap of the left join). If we refactor the inner workings of the relation collection a bit so it can handle DynamicRelation objects as well in combination of EntityRelation objects, we're there: the outside API would be untouched and only enriched with newer possibilities, and we will achieve what we needed: full, flexible support of derived tables without compromises.
Sure, the question now arises: what's the need for EntityRelation again? But that's not something you can get rid of: there's code out there, many many lines of code, which relies on that class. You can't simply remove it because you thought of something much more appropriate and which can replace what you already had. That's the main difference between refactoring APIs during the development of the first version and refactoring APIs after the development of the first version, AKA production code. So remember: if some Agile / XP guru tells you that refactoring is great and you should do it all the time, don't forget that after v1 is out, you are bound to keep the interfaces and classes around, because stuff depends on it from then on so you have to be very very careful. The whole aspect of 'adaptive to change' (which is the main thing about Agile) is actually meant to be applied to the application as a whole: migrate it to the next version, within the boundaries defined by the reality you're in. So focus on that main aspect, and not on blind refactoring because you can: after v1, you are in a total different world with different rules and different laws.
Today I came as far as that the inner workings of the Relation collection are left for the API functionality migration. As all Dynamic Query Engines (the engines per database which produce SQL specific for the database they represent) already have a facility to create a subquery from query fragments (so alias scoping etc. is abstracted away inside the engines), connecting everything together is pretty straight forward. It's still a bit of code to write and test, but the tough part is over: once we're inside the library, behind the public API, you can refactor and add and change whatever you want, as long as the functionality of the public API doesn't change.
Tomorrow I hope to have this running so I can go back to my Linq code.
.
(This is part of an on-going series of articles, started here)
Adding Linq support to an O/R mapper like LLBLGen Pro is a matter of choice: either you implement new SQL engines or you convert the expression trees to native query language components. The former is a lot more work and the latter will probably cause problems here and there. We decided to create a converter and I think that's the only real option for any O/R mapper developer out there who wants to support Linq queries. It's not without problems though. During my second day of working on the project I already hit a major roadblock. I'll describe that in a bit.
In my previous posts I've explained that the source of the Linq query is the most important part as there it all starts and ends. To be able to specify the source in the context of LLBLGen Pro, I generate a class which simply returns for each known entity a DataSource<T> instance, where T is the type of the entity, e.g. CustomerEntity. The Linq oriented classes like the QueryProvider etc. are placed in a separate assembly, and will be merged into our runtime library at release (the .NET 3.5 build that is), if it's not possible to keep things separated, otherwise we'll release a separate dll. With the roadblock I hit this morning, things look a bit complicated so I think having a separate assembly isn't going to cut it alone, runtime library changes will be needed.
So which problem did I ran into? It's actually based on a little Linq to Sql query posted in the C# newsgroup this morning. The query looked something like this:
// C#
var q = from c in nw.Customers
select new { ContactName = c.ContactName, TotalOrders =
c.Orders.Sum(o => o.Order_Details.Sum(
od => (double?)od.Quantity * (double?)od.UnitPrice)) };
This query retrieves for every northwind customer its overall total for all orders ordered. I looked at it for a while and wondered: is that 'Sum' executed in memory or in the database? So I added it to a little Linq to Sql project and it resulted in this query:
-- T-SQL
SELECT [t0].[ContactName], (
SELECT SUM([t4].[value])
FROM (
SELECT (
SELECT SUM([t3].[value])
FROM (
SELECT (CONVERT(Float,[t2].[Quantity])) *
(CONVERT(Float,[t2].[UnitPrice])) AS [value], [t2].[OrderID]
FROM [dbo].[Order Details] AS [t2]
) AS [t3]
WHERE [t3].[OrderID] = [t1].[OrderID]
) AS [value], [t1].[CustomerID]
FROM [dbo].[Orders] AS [t1]
) AS [t4]
WHERE [t4].[CustomerID] = [t0].[CustomerID]
) AS [value]
FROM [dbo].[Customers] AS [t0]
Yes, I was also surprised to see that at first, but of course, the nested Sum() aggregates leave little room to enhance this. However, if I would have to write this in plain SQL, I'd do it like this:
-- T-SQL
SELECT c.contactname, SUM(quantity * unitprice) AS TotalOrders
FROM customers c INNER JOIN orders o
ON c.customerid = o.customerid
INNER JOIN [order details] od
ON o.orderid = od.orderid
GROUP BY c.contactname
A complete different approach. The reason I picked this one and not the nested selects is because it's very efficient (check the execution plans if you want). So I wondered, how can I rewrite my Linq to Sql query to get the group by query over the joined set? But... I don't know!
. I tried everything I could think of but I couldn't get a compilable query which gave me the right results!
The thing is that LLBLGen Pro doesn't support derived tables in query specifications (the FROM ( SELECT ... ) constructs), because it determines the FROM clause of a select from the elements passed to the fetch method, as this is easier for the developer using the API and it leads to less mistakes. For example the query with the group by posted earlier can be written in LLBLGen Pro (using Adapter) as:
// C#
ResultsetFields fields = new ResultsetFields(2);
fields.DefineField(CustomerFields.ContactName, 0);
fields.DefineField(new EntityField2("Total",
(OrderDetailsFields.Quantity * OrderDetailsFields.UnitPrice), AggregateFunction.Sum), 1);
RelationPredicateBucket filter = new RelationPredicateBucket();
filter.Relations.Add(CustomerEntity.Relations.OrderEntityUsingCustomerId);
filter.Relations.Add(OrderEntity.Relations.OrderDetailsEntityUsingOrderId);
GroupByCollection groupBy = new GroupByCollection();
groupBy.Add(fields[0]);
DataTable table = new DataTable();
using(DataAccessAdapter adapter = new DataAccessAdapter())
{
adapter.FetchTypedList(fields, table, filter, 0, null, true, groupBy);
}
As it fetches the set into a datatable, it's of course not really 'typed', but for the rest it gets the data out as expected with the query as expected (I can also fetch it as a datareader and project it onto a class if I want to, the query given is just for illustrational purposes, so I used a datatable). The result of this is that I have a Linq query I can't convert to LLBLGen Pro constructs and I have a query which is specifyable in LLBLGen Pro constructs but which I can't formulate in Linq. And it's not even monday!
Clash of the Paradigms
So I wondered... why is it so hard to write the SQL query in Linq (or why am I so stupid not to understand the C# 3.0 spec
)? And then it hit me: Linq isn't set oriented, but SQL is. At least, that's my conclusion. The group by approach is logical from a math / set oriented viewpoint, but it's an odd approach if you look at it from an imperative / functional viewpoint. In an imperative / functional executed piece of code, you want to specify what has to be done and at the end of your set of statements you arrive at your result.
I have no idea if my Group By SQL query is even possible with Linq, however possible or not, the conclusion can only be: developers using the Linq extensions to C# and VB.NET will think in an imperative way, they don't bother with the fact that they're now suppose to switch their minds into Mode.SetOriented (poke &H6EF8, &H0E to be exact), they just want to write the query as if they were writing it in C# as they actually are doing that. For the people who think Linq will therefore have a small learning curve: I'm afraid I have to disappoint you: the same problems arise as with every other O/R mapper query language: 'how do I formulate this [insert complex SQL written by a 70 year old Oracle DBA here] SQL statement in [insert O/R mapper query language constructs here]?'. That won't go away, simply because there's no 1:1 mapping between Linq and SQL.
So the problem then arises for me: what to do? It's not as if the query presented is one only some bloke in South East Alaska would run once a year. However it's also not the end of the world: the developer won't be able to avoid all O/R mapper constructs anyway: there will be O/R mapper specific language elements in the final Linq query or around it and all CUD (Create, Update, Delete) operations aren't even mentioned. So if a given required set of data isn't specifyable in Linq, it's likely the native O/R mapper query language constructs will offer a way to obtain the data in that form using that particular query. So the option to not support given Linq constructs isn't that big of a deal as it seems.
However, the one true reason we're doing this Linq implementation in the first place is for marketing and strategic reasons: to be able provide an upgrade path for users of Linq to Sql towards our framework and to be able to offer people who have enjoyed Linq courses or have experience with Linq a way to leverage that experience on our framework. It doesn't add any functionality, the framework already supports almost all the ways you want to fetch any data in what kind of wacky format you can think of. So the only really satisfying solution is to solve the problem of the derived tables and be able to support as much Linq queries as possible.
My initial research shows that of our supported databases only Firebird 1.5 doesn't support derived tables (even MySql does, who would have thought!). I can live with that, as there's a v2.0 for firebird which does support derived tables. The only thing I have to solve is: how do I add this to our API elegantly and also without breaking a lot of code already out there (and of course also in such a way that the amount of code to change is minimal)? This is the root aspect of maintainable software, though it's one we call can't avoid sooner or later. So my next stop is first to alter our runtime library API in such a way that the derived table specification is possible so I can formulate the Linq constructs in our own query language elements without much conversion. Because, it's of course (in theory) also possible to transform the group of nested selects into a join set with group by, after all they lead to the same result, and the nested selects actually represent a join + group by. However my head already starts to hurt when I think about that so I leave that to the set-theory junkies over at MIT.
(This is part of an on-going series of articles, started here)
I didn't have that much time today to work on our Linq to LLBLGen Pro layer, but nevertheless there are a couple of interesting things to mention.
.
It's all about the Source, Luke
Let's look at a skeleton of a very simple Linq query:
// C#
var q = from c in source
select c;
There are a couple of fuzzy things to see in the query above: a) what the **** is 'q' and b) why is 'source' italic printed?
Let's first address b). Linq is actually a Domain Specific Language (DSL), embedded inside C# or VB.NET. This means that it's not really part of the language, it's a different language. Compiling the complete code in C# and VB.NET will result in MSIL, so what to do with the Linq code inside the C# or VB.NET code? There are two options:
- If the source implements IQueryable<T>, convert the query to an expression tree and pass it to the implementing source for processing and execution
- If the source implements solely IEnumerable<T>, convert the query to a set of calls to extension methods on source which will be ran in-memory using normal generated MSIL.
The expression tree created when the source implements IQueryable<T> is similar to a parse tree, and contains all information from the outer-scope (the C# / VB.NET code) including the query / Linq code itself. This gives an interesting problem: what to use for source? If you look at Linq to Sql, it uses Table<T> classes. However for collections inside entities (e.g. Customer.Orders) it uses EntitySet<T>. The reason is simple: if you have a set of entities loaded into memory and you want to perform a Linq to Objects query on them in-memory, you don't want the C# / VB.NET compilers to create an expression tree which is sent to the IQueryable<T> implementation in the source of the query. To avoid that, the source has to implement IQueryable<T> but also shouldn't be used as a collection / container which could also be the source of an in-memory query.
I initially thought that our EntityCollection<T> class would be a good source class for the IQueryable<T> implementation but I then would run into a problem as described above: an in-memory collection would never be usable as the source in a Linq to Objects query. So I will need a separate class for solely this purpose. I dubbed it DataSource<T>. This class will implement IEnumerable<T> and IQueryable<T>. It's unclear at this point if it has to implement IQueryProvider as well. Linq to Sql's Table<T> class does, but it seems unnecessary and also a combination of concerns you don't want (the combination of IEnumerable<T> and IQueryable<T> is already too much IMHO, but I'm not designing the Linq API
)
So, the next question then is: how is the expression tree, created at runtime by code emitted by the C# / VB.NET compilers, converted to an object which is able to produce a result? To answer that, let's go back to our example query above. The statement has to result in an object which can be placed in the 'q' variable. We'll get to q in a second. The source (it's him again) has a property called Provider. This provider implements IQueryProvider, and that interface has a method called CreateQuery() which accepts ... an Expression object which is our expression tree.
The fuzzyness isn't over though. CreateQuery() results in an IQueryable<T> implementing object. If this sounds rather recursive and a bit odd, it is: an IQueryable<T> (the source) is asked for a provider to create ... an IQueryable<T>, but not the same kind of IQueryable<T> as the source, as the source is just a placeholder, a stand-in to be able to take part in the expression tree.
Still with me? Good.
So, now we're arriving at that mysterious 'q'. In the James Bond movies, 'Q' was already a somewhat mysterious fellow, and now in Linq queries it seems his nephew, little 'q', stepped in to spoil the party! In the query above, we used 'var' as the type specification, because we don't know at the time of writing what kind of type q's value has. This isn't really true for the above query, but for queries which create new anonymous types, the 'T' in the IQueryable<T> implementing return value of IQueryProvider.CreateQuery() is anonymous, and therefore unknown.
The class which implements the IQueryable<T> and which instance is returned from IQueryProvider.CreateQuery() is typically native to the O/R mapper which is handling the execution of the query. The main reason is that the instance placed in 'q' is the port to the O/R mapper and the actual execution of the query in expression tree form. In Linq to Sql this is DataQuery<T>, and in our code it will be LLBLGenProQuery<T>. The class automatically implements IEnumerable<T> through IQueryable<T>, and when the enumerator is requested, the expression tree has to be 'executed' and the result has to be returned. This way, deferred execution of a query is accomplished.
It is also key to store any information to make the execution of the query possible inside the object placed in 'q'.
This can be done for example under the hood by the source which has access to the provider and can feed the provider with objects it has received when the source itself was created.
The source has to be an IEnumerable<T> implementing type (as it seems), where the T is the type you're actually interested in (i.e. the Entity type). Trying to feed the query a normal class which implemented IQueryable<T> didn't result in a proper expression tree in my tests.
So enough information for some serious programming!
.
Now v2.5 of LLBLGen Pro is out the door and the release-stress has gone away, it's time to pick up the next project, which is Linq support for LLBLGen Pro, which will be rolled into v2.6 of LLBLGen Pro, which is scheduled for Q4 2007.
This time around, we thought it would be fun to blog a post on every day I've worked on Linq support with the achievements of that day. I've no idea how long this will take, but my intuition says it won't take months. There are a couple of problems to overcome though, some I'll be addressing briefly below and others I'll discuss in due time.
As LLBLGen Pro has a lot of features and already a true OO query system, it's logical to create an expression tree converter instead of a new SQL emitting system. The expression tree converter should simply create LLBLGen Pro query objects, like predicate expressions, relation collections, excluded fields lists, projections etc. etc. and should pass that to the appropriate fetch method when the Linq query is executed. The previous sentence already shows that 'Linq' is actually all about fetching, i.e. the 'R' in CRUD. The other 3 types of queries executed on a database aren't specified in Linq constructions, these are specified using the native O/R mapper elements.
This is also one of the drawbacks of the whole Linq system: you can't really rely on it to have a true generic O/R mapper querying language: you'll always have to fall back on the O/R mapper specific elements to use it to its maximum potential or worse: to even be able to perform a given action on the data in the database. There are other drawbacks to the current state of Linq: the syntaxis lacks some constructs which makes it again impossible to create a true generic O/R mapper query language. One of the main painpoints I'm refering to is the lack of prefetch path specifications (eager loading) inside the query text: there's always a single select command in any given Linq query (it always results in a single set), there's no room for any graph oriented fetching whatsoever, unless an extension method is created to specify prefetch paths/spans, which is precisely the thing you don't want, because that ties the query specified to the O/R mapper used, and the big question then becomes: why bother with Linq if you can also use the native O/R mapper syntaxis for querying entities?
Let's look at an example: say we want to fetch two graphs: A) all customers from Germany and their orders and their order details, and B) all products and their orders (m:n). These graphs are at the moment separated but you can of course see that we could add product entities from graph B to graph A if we for example add a new Order detail entity to a given order. If we look at what Linq provides us, we have no possibility to specify that we want to fetch these two graphs with the query specified, we have to specify the graph layout outside the query. Linq to Sql for example does this via LoadOptions, which are specified on the context used. And that's precisely where it goes wrong. Because the problem is the deferred execution of Linq queries: if you specify the query for graph A with the LoadOptions for the graph, you can't also specify the LoadOptions for the graph B (let's pretend Linq to Sql can load m:n relations) as well, even though the graph might contain the same entities, as it might be the graph B is fetched using a filter which is relevant for B but not for A. The issue is that the LoadOptions are used when the query is executed, which can be somewhere else.
This gives headaches which are unnecessary: which entities to load eagerly, via paths, spans or whatever you want to call them, together with the entities you're fetching with the query is an element which should be part of that query, as it has only value in the context of that query, not another query. So any O/R mapper which wants to allow its users to specify entity graphs for fetching using Linq has to come up with some sort of extension method. That's of course doable, it's just that it mitigates the whole purpose of Linq: it's then 'just another querying method' to achieve the same thing as with the native O/R mapper constructs. The more a user has to fall back onto the native O/R mapper query elements, the more it will become clear that Linq is actually sugar you can live without, when it comes to fetching entities from the database.
As Linq is meant to be used on entities in a database, objects in memory, XML and whatever else you can cook up on a rainy Sunday afternoon, it embeds the limitations to be useful on all these targets. One of the things which will cause a lot of trouble is the lack of the 'left/right' join keywords: you have to go the route of join into and then select again. Say you want to fetch all employees from Northwind who don't have filed an Order yet (let's say there are employees in Northwind who haven't filed an order yet, ok?
). That's typically a left join query where you test on NULL:
-- TSQL, using '*' for simplicity
SELECT E.* FROM Employees E LEFT JOIN Orders O
ON E.EmployeeID = O.EmployeeID
WHERE O.OrderID IS NULL
So, in Linq, how would you specify this? Intuition tells us to try this first:
// C#
var q = from e in nw.Employee
left join o in nw.Order on e.EmployeeID equals o.EmployeeID
where o.OrderID == null
select e;
Looks fine? It doesn't compile: 'left' isn't a valid keyword there. You should use a GroupJoin. A what? A group join. Enter the wonderful world where scientists design APIs, enter the world of 'doing it the hard way just because life is already easy enough'. Perhaps, the same query can be used on XML, but I don't give a hoot, I'm talking to a RDBMS, which is pretty clear, as I specified an IQueryable implementing object as the source of the query. In databases, we use ansi joins (well, of course, the poor sods who are still on Oracle 7 or 8 don't) and because the rest of the query has similarities with SQL all the way, why is 'left' and 'right' left out? (pun intended).
So how should this simple query be formulated instead? Check it out:
var q = from e in nw.Employee
join o in nw.Order on e.EmployeeID equals o.EmployeeID into oe
from o in oe.DefaultIfEmpty()
where o.OrderID == null
select e;
Now, you'll find little info about this on the net. There's some forums thread post written by Anders Hejlsberg where he explains how super cool this group join construct is in relation to consuming XML. But it's totally artificial: the whole DefaultIfEmpty() vehicle is simply thrown away and never used, as well as the hierarchical tuples created in 'oe' which are never really created, as the whole query is simply executed as a single statement on the database, where 'oe' is actually the hash matcher result in the relational algebra executed on the RDBMS to match elements from both sources. If you look at the expression tree produced for this Linq query, you'll see that optimizing this away isn't that trivial. A 'left' or 'right' keyword would have been much simpler.
There are other things, like transaction management so you can execute updates and selects inside a transaction without deadlocks, which will likely lead to extension methods which tie the code to the O/R mapper used, but as there are already necessary extension methods to add, one more isn't going to hurt. However, the question why you want to use Linq for DB queries as it doesn't really bring that much to the table which wasn't available in the native O/R mapper query language, will stay up in the air for some time to come I think.
Since Linq was introduced to the public, people have discussed how to create extensible queries, queries where you can append constructs to to refine its outcome, for example based on user input. One of the ways to do so is by using originalQuery.SelectMany(func).Where... which simply behaves as a new select around the select of the original query. An example can be found in the Linq to Sql article Ian Griffiths posted yesterday, where he appends a query to find the Max price from a set of entities defined by another query. Typically, this leads to a derived table select, where the original query is used as one of the elements in the FROM clause of the actual select. However, not all databases on the planet support derived tables, for example Firebird 1.5 doesn't (v2.0 does though). Of course, Linq to Sql isn't bothered by this, but Linq isn't all about Linq to Sql. So to optimize this away will be the real challenge, also because LLBLGen Pro doesn't support the specification of a derived table in its interface to developers (not a lot of O/R mappers do, it's almost always only supported internally to produce a given query result).
A day 0 in a project is typically setting up the environment, getting started with reading docs etc. etc. I've already read parts of the C# 3.0 specification to see what kind of keywords are possible in a Linq query. This helps to find out what could be constructed as a Linq query and thus what can be expected as an Expression tree: to cover every possible scenario with unit-tests would take years if not longer, so it's key to have a different way to proof that the code is correct and I like to use syntax specfications for that. Also, as there's not that much information available at the moment (the MSDN docs aren't complete yet) a lot of trial/error programming will likely follow, but with the proper tools it should be taken to a minimum I think. One key element is a good expression tree viewer. Luckily in VS.NET 2008 b2, there's a debugger viewer for expression trees in the CSharpExamples archive. With a little hacking the sourcecode can be transformed into a normal viewer which can be embedded in the test code to see what a query construct looks like in an expression tree.
A Linq supporting layer has to implement IQueryable and provide a provider which actually consumes the expression trees and produces the output. Microsoft has done a good job by providing a lot of extension methods on Queryable<T> which handle the passing of the input to the provider to produce a new Queryable object. This makes life easier so we, O/R mapper developers, don't have to write all these extension methods ourselves.
Ok, the initial analysis is done, I'm starting today on developing the first code to see how things behave and to see if what I think is done is also done. Stay tuned for more posts in this series.
More Posts