Developing Linq to LLBLGen Pro, day 3

(This is part of an ongoing series of articles, started here)

In the previous post in this series, I discussed the problem of a select with aggregates using derived tables vs. a query which resulted in the same resultset but used a group by clause, and my problem with formulating that simple group by query using Linq. A clever reader, Justin Etheredge, solved that mystery with the following Linq query:

// C#
var q = from c in nw.Customers
        join o in nw.Orders on c.CustomerId equals o.CustomerId
        join od in nw.Order_Details on o.OrderId equals od.OrderId 
        group od by c.CustomerId into groups     // group the joined set on the customer's id
        select new { TotalAmount = groups.Sum(od => od.Quantity * od.UnitPrice) };

The problem I had was that the set which is the result of the two joins doesn't have a name, so how do you specify that you want to group that set? It turns out, you don't. You simply tell Linq to group one of the entities in the set, and you can specify any field in the total set of the join to group on. I have to say: that's not really intuitive, and I can imagine that grouping on multiple fields from multiple entities will cause another headache, but let's not go overboard here.
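To see where that headache would start, here's a sketch of grouping on fields from two entities at once, using an anonymous type as the composite key (same hypothetical Northwind context as above):

// C#
// grouping on a composite key which spans two entities, via an anonymous type
var q2 = from c in nw.Customers
         join o in nw.Orders on c.CustomerId equals o.CustomerId
         join od in nw.Order_Details on o.OrderId equals od.OrderId
         group od by new { c.CustomerId, o.EmployeeId } into groups
         select new
         {
             groups.Key.CustomerId,
             groups.Key.EmployeeId,
             TotalAmount = groups.Sum(od => od.Quantity * od.UnitPrice)
         };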

We decided it was best for LLBLGen Pro that our API would get support for derived tables, so our Linq provider will have it a bit easier converting the Linq expression tree to our query elements. Furthermore, it fills a little gap we still had with respect to SQL coverage in our API. SQL coverage is the amount of SQL constructs you can define in the O/R mapper specific API. If the O/R mapper has low coverage, the chance is high you have to revert to stored procedures or hard-coded SQL strings inside your application to perform a given query. As soon as that happens, you have to maintain two different paradigms to use the database in your application, which can lead to problems in the long run. Derived tables are necessary for a small group of queries, like the consumption of aggregated sets inside the main query; not your everyday SQL stuff, but for some reports or other edge-case queries it can be helpful if derived table support is there. So by adding support for it, we're killing two birds with one stone. With derived tables we're pretty close to total SQL coverage: the only things we don't support after that are table / join hints (other than NOLOCK), UNION and the OLAP statements.
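A query which consumes an aggregated set inside the main query illustrates this; in Linq it shows up as a nested query, which maps to a derived table in the generated SQL (a sketch, same hypothetical Northwind context as above):

// C#
// the nested, aggregating query becomes a derived table in the generated SQL
var totals = from od in nw.Order_Details
             group od by od.OrderId into g
             select new { OrderId = g.Key, Total = g.Sum(od => od.Quantity * od.UnitPrice) };
var q3 = from o in nw.Orders
         join t in totals on o.OrderId equals t.OrderId
         where t.Total > 5000
         select o;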

Software engineering all the way
Simply adding support for this isn't trivial: our API is well established and has been used in many thousands of software projects all around the world over the last 4 years. Simply changing it by refactoring the public API isn't going to give you a lot of happy users, because they'll have to refactor their code as well before they can upgrade to this new version. So we're talking about major maintenance here, which is the heart of the Software Engineering field. I'd like to refer to the various articles written about this topic by my good friend Jeroen van den Bos at his blog at http://jvdb.org/blog for the details about maintainability of software and how it affects our everyday software engineering work.

There are two different scenarios in our case: a) we're fetching entities and the source of the entities is a derived table or a joined set with one or more derived tables, and b) we're fetching a list of fields from one or more entities (a list). Scenario b) isn't the problem: everything has to be specified in that scenario anyway. However, for scenario a) there's a difficulty. Let's illustrate this with an example, using our Adapter paradigm (we also support another paradigm called SelfServicing, which has a different API). Say I want to fetch all Customer entities which have an order filed by the employee with ID 2.

// C#
EntityCollection<CustomerEntity> customers = new EntityCollection<CustomerEntity>();
RelationPredicateBucket filter = new RelationPredicateBucket();
// join Customer to Order, and filter on the order's employee id
filter.Relations.Add(CustomerEntity.Relations.OrderEntityUsingCustomerId);
filter.PredicateExpression.Add(OrderFields.EmployeeId == 2);
using(DataAccessAdapter adapter=new DataAccessAdapter())
{
    adapter.FetchEntityCollection(customers, filter);
}

The routine which fetches the customers is 'FetchEntityCollection', called on the adapter object for the database to fetch from. What it has to fetch, and from which table(s) / view(s), is specified by the target of the fetch: the collection. The filter defines the set boundaries. The reason it's done this way is that it leads to fewer errors: you can't specify fields from a table you're not targeting, you can't forget an alias, etc. FetchEntityCollection has a couple of overloads which accept more objects, like excluded fields, a sort expression, a prefetch path etc. The API is simply built like you would do it by hand: you have an object which is the adapter to the database to target, and you call a method, an action, to do what you want it to do for you: fetch things; and you specify the parameters like with any other method: as parameters of the method.
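For example, a sketch of one of those overloads, fetching the first 10 matching customers sorted by company name (the exact overload signature and the sort clause syntax are assumptions based on the v2-era Adapter API):

// C#
// sketch: limit and sort via an overload; the signature is an assumption
ISortExpression sorter = new SortExpression(CustomerFields.CompanyName | SortOperator.Ascending);
using(DataAccessAdapter adapter = new DataAccessAdapter())
{
    adapter.FetchEntityCollection(customers, filter, 10, sorter);
}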

It has a downside however: you can't specify an alias for the entities to fetch. That capability wasn't designed in, because it wasn't needed. That is, up till now: with a derived table, the fields in the SELECT clause have to target the derived table instead of a table / view in the database, if the source of the data is a derived table and not the table(s) / view(s) the entity is mapped on. So this alias has to be determined automatically, or you have to specify the alias somewhere. The latter would break existing code or would require a group of new overloads of FetchEntityCollection. Not that great. However, automatic alias determination isn't a picnic either. Sure, with one derived table it's OK, but what if the join is deep and there are many derived tables? Fortunately, we found a way to do it, with a pre-process step which is completely transparent to the outside code. This way the existing code and method signatures can be used without problems. So far, so good. We still have to solve the problem of how to specify a derived table.

LLBLGen Pro doesn't use a string-based query system, but a set of different objects which together define the information used by the fetch logic. In the simple example above, you see it in action a bit: it adds the relation object between customer.CustomerID and order.CustomerID to the relations collection in the filter bucket, it defines a predicate object, and together with the relation it passes that to the fetch method. The fun thing is that you can write your own predicate classes and combine them in the filter if you like. At runtime they're evaluated and each object gets a call to produce its SQL fragment, using DB-specific producers for aliases, fields, functions etc. which are injected at runtime. This system is flexible and also compile-time checked. However, where to fit in a derived table definition? Because the relation objects have to be compile-time checked as well, they're generated into the code as properties, as illustrated in the example above. The derived table, though, is specified in code by the developer, and has to be usable inside join expressions together with existing generated relations. Oh, and changing any of this must not break any existing code in production, so the changed API has to be backwards compatible with what's already there while offering the flexibility of adding a derived table or tables.
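To give an idea of that extensibility, here's a minimal sketch of such a custom predicate class; the member names are based on my recollection of the v2-era API and should be treated as assumptions:

// C#
// minimal sketch of a custom predicate; member names are assumptions.
[Serializable]
public class FieldInYearPredicate : Predicate
{
    private string _fieldName;
    private int _year;

    public FieldInYearPredicate(string fieldName, int year)
    {
        _fieldName = fieldName;
        _year = year;
    }

    // called by the engine at runtime to produce this predicate's SQL fragment.
    public override string ToQueryText(ref int uniqueMarker)
    {
        // a real implementation would use the injected DatabaseSpecificCreator
        // to emit DB-specific field names and parameters; hardcoded for brevity.
        uniqueMarker++;
        return string.Format("YEAR([{0}]) = {1}", _fieldName, _year);
    }
}

Such a predicate would then be added like any other: filter.PredicateExpression.Add(new FieldInYearPredicate("OrderDate", 1997));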

So refactoring the public API is out of the question. We can add new code though, which behind the scenes leads to the same core: the collection of relation objects is responsible for producing the FROM segment of the SELECT, so if we can make it utilize derived tables and also make it join derived tables with tables / views, we're set!

How to approach this? Well, take a page from the Agile books: first write the code the way you'd want to write it if the API had been developed by someone else. The thing we should look at is: how can we define a relation object in code (like the generated code does it too, btw) so we can add it to the relation collection passed to the fetch method, and by doing that also automatically make sure a derived table is joinable to the tables / views which are the target of entities? A normal LLBLGen Pro EntityRelation object is focused on fields: it defines the relation between attributes from one side and the other, one being the PK side and one the FK side, exactly how you would define an FK constraint in the database. However, a derived table isn't a field, so we need a different class. Sure, we could refactor the interface of the EntityRelation class, but that too would break code, or make it look like an object which is a combination of several things (and that's also considered a Bad Thing™).

Let's first look at a mockup of the code we think is reasonable for this:

// C#
// first specify the elements in the derived table select (which is a dyn. list)
// these are: fields list, filter, relations, group by, sort expression, limiter and 
// the alias for the table. No paging. 
ResultsetFields dtFields = new ResultsetFields(2);
dtFields.DefineField(OrderDetailFields.OrderId, 0);
dtFields.DefineField(new EntityField2("Total", 
    (OrderDetailFields.Quantity * OrderDetailFields.UnitPrice), 
    AggregateFunction.Sum), 1);
GroupByCollection dtGroupBy = new GroupByCollection(dtFields[0]);
DerivedTableDefinition dtDefinition = 
    new DerivedTableDefinition(dtFields, "OrderDetailTotals", null, dtGroupBy);
// then specify the relation. 
DynamicRelation relation = 
    new DynamicRelation(dtDefinition, JoinHint.Inner, EntityType.OrderEntity, string.Empty,
        (new EntityField2("OrderId", null).SetObjectAlias("OrderDetailTotals") == OrderFields.OrderId));

// then specify the rest of the query elements
RelationPredicateBucket filter = new RelationPredicateBucket();
filter.Relations.Add(relation);
filter.PredicateExpression.Add(
    new EntityField2("Total", null).SetObjectAlias("OrderDetailTotals") > 5000);

// then fetch the data
EntityCollection<OrderEntity> orders = new EntityCollection<OrderEntity>();
using(DataAccessAdapter adapter = new DataAccessAdapter())
{
    adapter.FetchEntityCollection(orders, filter);
}
Assert.AreEqual(38, orders.Count);

We need to define a class which holds the elements for the derived table: the DerivedTableDefinition. We then simply define a DynamicRelation. This is a different type of relation object, namely a relation between two sides, left and right, which can be other things than entity fields, and which should be joined together based on the expression specified. This can be: derived table join entity targets (views, tables), derived table join derived table, or entity targets join entity targets (a right join is equal to the swap of the left join). If we refactor the inner workings of the relation collection a bit so it can handle DynamicRelation objects as well, in combination with EntityRelation objects, we're there: the outside API is untouched and only enriched with new possibilities, and we achieve what we needed: full, flexible support of derived tables without compromises.
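For completeness, a sketch of the derived table join derived table case under this mocked API; dtDefinition2 is a hypothetical second DerivedTableDefinition, built the same way as dtDefinition above and aliased "OtherTotals":

// C#
// sketch against the mocked API above; dtDefinition2 is hypothetical.
DynamicRelation dtJoinDt = 
    new DynamicRelation(dtDefinition, JoinHint.Inner, dtDefinition2,
        (new EntityField2("OrderId", null).SetObjectAlias("OrderDetailTotals") ==
         new EntityField2("OrderId", null).SetObjectAlias("OtherTotals")));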

Sure, the question now arises: what's the need for EntityRelation again? But that's not something you can get rid of: there's code out there, many many lines of code, which relies on that class. You can't simply remove it because you thought of something more appropriate which can replace what you already had. That's the main difference between refactoring APIs during the development of the first version and refactoring APIs after the first version has shipped, AKA production code. So remember: if some Agile / XP guru tells you that refactoring is great and you should do it all the time, don't forget that after v1 is out, you are bound to keep the interfaces and classes around, because from then on stuff depends on them, so you have to be very, very careful. The whole aspect of 'adaptive to change' (which is the main thing about Agile) is actually meant to be applied to the application as a whole: migrate it to the next version, within the boundaries defined by the reality you're in. So focus on that main aspect, and not on blind refactoring because you can: after v1, you are in a totally different world with different rules and different laws.

Today I got as far as that only the inner workings of the Relation collection are left to do for the API functionality migration. As all Dynamic Query Engines (the engines per database which produce SQL specific to the database they represent) already have a facility to create a subquery from query fragments (so alias scoping etc. is abstracted away inside the engines), connecting everything together is pretty straightforward. It's still a bit of code to write and test, but the tough part is over: once you're inside the library, behind the public API, you can refactor, add and change whatever you want, as long as the functionality of the public API doesn't change.

Tomorrow I hope to have this running so I can go back to my Linq code.

2 Comments

  • "The problem I had was that the set which was the result of the two joins doesn't have a name, so how do you specify that you want to group that set? It turns out, you don't. You simply tell Linq to group one of the entities in the set, and you can specify any field in the total set of the join to group on."

    The set you describe doesn't exist, or exists as the combination of { c, od, o }, depending on how you look at it. You can still think of this construct relationally, in that you can group any tuple by any tuple. (In LINQ, a tuple consists of a single preexisting type or an anonymous type.) In your LINQ query, you could group new { c, od, o } by c.CustomerId. Or, you could group od.Quantity * od.UnitPrice by c.CustomerId, in which case you could call the parameterless Sum() in your select statement.
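    In code, that second variant would look something like this (a sketch against the same query):

    // C#
    // grouping the expression itself, so the parameterless Sum() can be used
    var q = from c in nw.Customers
            join o in nw.Orders on c.CustomerId equals o.CustomerId
            join od in nw.Order_Details on o.OrderId equals od.OrderId
            group od.Quantity * od.UnitPrice by c.CustomerId into groups
            select new { TotalAmount = groups.Sum() };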

  • "The set you describe doesn't exist, or exists as the combination of { c, od, o }, depending on how you look at it."
    I indeed look at it as the combination of c, od and o as the new set which is then grouped, then selected and then ordered in a SQL pipeline (so writing SQL is done with that in mind, and I projected that onto Linq, which wasn't the right choice ;)).

    Thanks for clarifying. :)

    Thanks also for the VB.NET heads up. That will be interesting, together with the NULL issue related to VB.NET Linq queries which look the same as C# ones but behave differently depending on nullable types / null values (at least that's what I've heard; I haven't looked at a concrete example yet).
