Imagine, you're sitting at your desk and you're using the Linq to Sql designer in VS.NET 2008 and you have, say, 50 entities in your model. You're happy about how things are progressing. It took a while to get the model set up, considering the wicked table and field names they cooked up in the DBA dungeon, but after some swearing and too much caffeine, it's done. The model which is smiling back at you is what you had in mind.
Rain slams against the window and the DBA you have heard about but never had the chance to meet in person, walks into your office and with her whiskey-shaped voice she almost whispers into your left ear:
"Son, I've updated a few tables in your catalog schema, hope you don't mind. See ya!"
Before you can ask her why she found it necessary to do such a terrible thing to you, she's dissapeared, back to her basement.
In shock, you stare at your shiny Linq to Sql designer, in it your 50 entities in full glory. Which tables were changed? What has she changed? WHAT!? You start telling yourself not to panic, when the idea of checking manually each and every of those 50 tables for changes, manually rebuilding the Linq to Sql model, re-doing your inheritance hierarchies, renaming the fields is causing a slight increase in sweat production. Why did you decide to wear this grey shirt to work, today!
Sounds familiar? Well, every Linq to Sql user will sooner or later get into this position. The problem is caused by the fact that the Linq to Sql designer doesn't have a model refreshing feature: when the database schema changes, you can't simply say to the Linq to Sql designer: "Hey buddy, I heard from our DBA overlords that you've some work to do. Ping me when you're done, OK?" and you get your model refreshed with the newest meta-data from the updated catalog schema and migrated to that updated schema.
Luckily, we've a solution for you. Yes, honestly, we do. You see, the designer of our O/R mapper system LLBLGen Pro has model refreshing and migration built in, together with sophisticated name construction schemes and inheritance hierarchy migration. It also packs a state of the art, task based code generator engine. So we wrote, in a single day, a couple of templates for our code generator engine, which can now produce Linq to Sql classes and mapping files out of a SqlServer project.
This gives you the option to update your model with the updated database schema meta-data without blinking an eye. On top of that, the code it produces is generated by templates. These templates are extensible and as our code generator engine is task based, you can add whatever you want to the pipeline: checking out code, compile the end result, you name it. Of course, command line tools are provided so refreshing, migrating and re-generating the code is completely controllable from the command line. We already chopped up the single code file the Linq to Sql designer produces for you, so you don't have to deal with a very large file with the model: every entity gets its own class. The only downside is that you can't update your model in the Linq to Sql designer anymore. But, why would you?
The templates are currently in beta, and available for our customers. If you're interested in giving them a testdrive, but not a customer of LLBLGen Pro, no worries: leave me a message via the contact form or below in the comments and I'll mail you the template package so you can use them with our 30-day trial installation.
A small note for the people who see conspiracy theories in every corner: no, we're not abandoning our own framework. On the contrary
. Linq to Sql users already use Linq to Sql, but as we can provide some additional tooling for those people, why not?
In the latest public preview of Microsoft's Dynamic Data, they've added something else besides what's publicly advertised: support for 3rd party O/R mappers!
A couple of weeks ago, Bryan Reynolds mailed me about LLBLGen Pro support in MS Dynamic Data. I initially hadn't payed much attention to Microsoft's upcoming Dynamic Data initiative, as it's more geared towards web developers and was Linq to Sql / Entity Framework only as it seemed. He gave me a demo of Dynamic Data using Shared View (which works pretty well) and it got my attention. It's a clean, easy to use technology to get smaller websites which have to deal with data-entry, up and and running in no time.
There was a problem though: it seemed to be using Microsoft's own O/R mapping attempts only. So I mailed Scott Guthrie and explained that my enthousiasm was triggered by a demo of Dynamic Data and if they would consider opening up their code to support 3rd party O/R mappers as well. Within a day Scott hooked me up with some of the team members (David Ebbo and Marcin Dobosz) to see what could be done to support 3rd party O/R mappers. They provided me with a private build with some internal classes made public so we could check whether these were enough to provide meta data to the Dynamic Data engine so it could build the pages with LLBLGen Pro code as its O/R layer instead of Linq to Sql or the Entity framework.
It took a few iterations to get things right, but in the end it payed off: the meta-data provider API is sufficient enough to provide all the meta-data necessary and also to get back at runtime additional meta-data the O/R mapper added to the providers. The API is generic enough for 3rd party O/R mappers, if they use a context or not. Microsoft has decided to make this API public in the latest public CTP release of Dynamic Data, so any 3rd party can implement the metaprovider classes required.
Can every 3rd party O/R mapper work with Dynamic Data?
In theory: yes, I think it can. It needs two essential building blocks though:
- Linq support. The Linq support is essential as it is used to produce filters on related entities.
- A Datasource control which has SelectParameters support
Both elements required are time consuming projects to build, depending on how far you will go of course. A simple Linq provider might work OK for Dynamic Data and a simple datasource control might do too. Combined the LLBLGenProDataSource control and the Linq to LLBLGen Pro provider took almost a year to build in full. Looking back I now know it was well worth the investment: a Linq provider is essential for an O/R mapper to succeed in the .NET world of tomorrow, and a datasource control which works nicely with your O/R mapper is too.
For LLBLGen Pro customers who want to try out the dynamic data support, please contact us via email or our forums so we can email you our package with templates and DynamicData support classes. Our Linq support is available in beta in the beta of LLBLGen Pro v2.6, which was released yesterday.
Microsoft has awarded me again this year with an MVP award for C#! I won't be at the MVP summit this year though (held in 2 weeks).
Today I received through the contact form on this blog a hate-mail from a guy who called himself 'Ryan'. Ryan used a fake, non-existing email address so the only way to respond to him is via my blog, hence this post. Let's look at the email first:
From: ryan@notanoob.org
Subject: (Frans Bouma's blog) : anti-agile
Your post on Jeffrey Palmero's blog is laughable. He is a smart and successful person and is involved in practices that you do not understand. Your post makes you look like a moron. You obviously have a lot to learn about agile development.
The ONLY thing that works on large .net projects is the platform independent knowledge that the java / C++ / small talk community has learned over the past 20 years. I know because I work on a 3 million LOC, 200+assembly .net product every day. MS built the best development platform, but what MS teaches is crap. In the end, the culture of the development shop is what makes or breaks it. Agile practices focus on that culture. Tools and processes are secondary.
One day when you work on a real app that is more than 100K lines of code maybe you'll understand. You should not venture outside of the realm of your fan base, which is newbie developers that have a background in ASP.
I would love to see your product choke on several of our 100+GB databases.
Thanks.
My post about stored procedures back in 2003 generated the fair share of hate-mail in which angry DBAs and other stored procedure supporters wished me all kind of bad things and told me I had no clue whatsoever. I think it's related to speaking up in public and some people apparently can't deal with another person's opinion very well.
This particular email illustrates something I can't understand: why does someone get so angry about a random reply I've written to someone's weblog that that person goes to this particular blog, and types in the text above? What's the goal? To tell me how the world really works? It's never a pleasure receiving this kind of email, and it surprises me every time why a person thinks 10 minutes of his/her precious time is better spend on writing a hate mail than writing a mail which could open up a mature discussion so ideas/thoughts/arguments can be exchanged and both parties participating in the discussion can learn from eachother.
Let's address the various topics which are mentioned in the email. To start, I have no idea who this 'Ryan' person is, and I definitely don't know which post on Jeffrey's blog he's referring to. I don't doubt Jeffrey is smart and succesful, and I have no problem whatsoever with Jeffrey at all. Too bad Ryan didn't include a link so I could have elaborated what I wrote.
Ryan then goes into show-off mode and claims he works on a 200+ assembly, 3 million lines of code project every day. I'm sure he wrote that all by himself, but let's not focus on his work. The gem is in the line 'MS built the best development platform'. I'm sorry to bring it to you, Ryan, but VS.NET and the .NET framework are build with hard-core waterfall practises: cleanroom-design, hard-core specs are written out to the letter with every tiny detail explained, then the development of code is done, and after that the long process of testing and weeding out bugs starts, and after that: shipping.
Not to propagate waterfall, as for the gazillionth time: I don't like waterfall, but it's simply true. Because Microsoft uses this practise (and most other big corporates do who ship large products), it's very hard to get things changed once the hard-core spec phase is over. Most MVP's who have ever tried to give feedback to a product team, even more than a year before shipment know this the hard way: feedback is never accepted once the specs are finalized.
What Ryan apparently doesn't understand, is that I'm not anti Agile, why would I be 'anti'?. The 'Agile' movement, which first was meant to be called 'Adaptive', is actually simply a movement which propagates 'being adaptive to change'. As a software engineer I can't understand how someone can be against that: being adaptive to change is what's it all about. However, if I state that I think (thus which is a personal opinion) something is an oversight, e.g. that with just a bunch of unittests you create a false sense of correctness, does that make me anti-agile? If so, I would be anti-'adaptive to change', which isn't the case.
If I look at how I wrote LLBLGen Pro, which spans more than 350,000 lines of C# code btw, I always tried to focus on that: be adaptive to change: make it modular, make it flexible. We're a small company. I didn't have time to write out the full design document. So I didn't: everything is build with vertical slices, designed per feature, not BDUF. You have to, if you have to design an entity object model to store mapping data, entity definition data, database design data, the full designer to manipulate all that, for every database a driver to obtain meta data, an engine to merge changes into object graphs, the task-based engine who can execute tasks for you to generate code, create folders etc., the DSL to write the templates in, the LL(1) parser for the DSL, the code generator engines (there are more than 1), the runtime libraries and of course the templates. Ryan, do you really think I designed that all up-front to the fine details and then started coding? Think again.
Ryan then explains that once I'll work on a large 100K+ project, I'll understand. Sure, I did that many years ago already and till today, I'm glad I did so. So I'm not entirely sure Ryan really means 'Agile', I think he meant 'TDD' and that one (in this case: me) shouldn't speak about the fact that having solely unittests is not always a good idea if you want to have correct code.
He then goes on to bash the readers of this blog and the fans of our work, and with that, I again wonder...: why? What's the point, do you, Ryan, really think the world gets better with writing these below-the-belt-flames to someone? Aren't we all software engineers working on software, doing the best we can? If so, then why the hatred?
The last thing I want to address is why he apparently would, like to see LLBLGen Pro choke on databases with a lot of data? It's not hard to make any system choke on large volumes of data, as fetching large volumes of data takes time. Fortunately some large companies use our work on very large databases (1000-2500+ tables) and it does fine. Would that be because it's written with 'being adaptive to change' in mind? Not sure. But as you, Ryan, seem to think I have a lot to learn about software engineering, it must be luck, right?
Come on, Ryan, whoever you are. Why not have a good discussion with arguments instead of flames? The world knows already enough hate and anger, we don't need to create more.
. The 'agile' movement sometimes get negative press of being full of hate. I don't think Agile, the agile movement, TDD are about hate nor do I think the persons participating in that movement are full of hate. I think they're passionate about the movement they're in, like other people are also passionate about the movement they're in. So, Ryan, do yourself and the movement you're passionate about a favor: next time, please start the discussion like an adult: with proper arguments so we can have a proper debate between two professionals, OK?
Today we released the beta of Linq to LLBLGen Pro to our customers so they can dig in and check if we provided the right code, if everything works allright etc.!
If you're an LLBLGen Pro v2.x customer and you want to check out our Linq implementation, please check the customer area to download the beta package.
As I said in my last episode of the (long running) series 'Developing Linq to LLBLGen Pro', it took near 6 months to get this far, and it wasn't a walk in the park. However the last 3-4 months have been pretty OK, once everything was clear and I knew what to expect and what to do. Looking back I'm very happy we did take this step to provide full Linq support for LLBLGen Pro.
Linq for LLBLGen Pro is part of LLBLGen Pro v2.6 which will go into beta in a couple of weeks.
Update: I made a mistake in the first Linq to Sql query. It's not that slow as I previously posted. I didn't filter on country, which made it pull the rows of all 91 customers into memory instead of the 11. Fetching 91 customer rows, 818 order rows and 2015 order details rows took Linq to Sql over 900ms, fetching 11 customers took 165ms, not over 900. The text has been corrected for that.
(This is part of an on-going series of articles, started here)
We're getting closer to the goal: a full-featured Linq provider for LLBLGen Pro. Hopefully next week the Linq CTP for LLBLGen Pro is ready for release to customers, fingers crossed! Last time I talked about the function mapping support for database functions and in-memory function/method/property access extravangansa, today I'll discuss one of the cornerstones of using an O/R mapper in the first place: fetching trees of objects using hierarchical fetches.
Eager-loading of related data using prefetch paths
One of the two ways of fetching hierarchical data is by fetching a tree of entity instances. For example: fetch all customers from Germany and their Order entities and for every Order entity the related Order Detail entities as well. Using an O/R mapper should make this simple, and efficient, i.e.: the right entities should be fetched and all entities should be merged into a tree so their relation to each other is shown. This means that all the Order entity instances are stored in Orders collections inside the Customer entity which owns the Order instances. All Order Detail instances should be placed inside OrderDetails collections, namely the ones
of the owning Order entity.
To do this efficiently, it would take at most 3 queries to do this: one for the Customers, one for the Orders of these Customers and one for the OrderDetails of these Orders. LLBLGen Pro has a build-in mechanism which is called Prefetch Paths, which lets you specify the paths along a tree, and every node which should be fetched together with the main entity / entities to fetch. In this particular case, we're fetching Customers from Germany, and our path therefore is pretty simple: Customer - Order - OrderDetails. If you want, you can specify a filter per node, a sort expression per node and a limitation (so only the last 5 orders for example). It also allows you to specify multiple branches in the graph. This means that the path doesn't have to be linear, you specify a tree with multiple branches, e.g. Customer - Employee (m:n via Orders), Customer - Order - OrderDetails and Order - Employee.
Prefetch paths are also polymorphic. So you can specify a branch in the tree which is only available in certain subtypes. For example if you take the dreaded inheritance hierarchy Employee - Manager - BoardMember and BoardMember has a relation with CompanyCar (as they are the only ones allowed to have a company car, the lucky ********!), you can define a path for fetching Employees and BoardMember - CompanyCar: for every Employee which is a BoardMember, the related CompanyCar is fetched. Yes, it's pretty neat
.
It's key that these eager-loading specifications, fetch plans, prefetch paths or whatever you want to call them, are specified at the query level and are executed with the query. So when you're fetching the customers together with the prefetch path, you are fetching the complete tree defined by the specified path. Now, it's pretty clear that the complete path has to be fetched at once, but why is it important that the path is specified with the query? After all, doesn't Linq to Sql use a model where the tree to fetch is specified via LoadOptions on the DataContext? The main issue with defining it elsewhere is that if you execute multiple queries on the DataContext, you'll have to make sure the LoadOptions for that particular query are set to the correct values. The Entity Framework uses a different approach if I'm not mistaken, they indeed use an extension method to specify the related entities, yet it uses strings which I find particular weird considering the fact that the sole purpose of Linq being better than embedded SQL strings was that it's compile-time checked.
Linq to LLBLGen Pro will have a Queryable extension method called WithPath, which allows you to specify a complete tree inside the query, together with filters per node, sorting per node, exclusion of fields per node etc., i.o.w. the usual Prefetch Path stuff. Everything is specifyable using compile-time checked constucts, so no string mess. Let's get back to our little example of Customers from Germany and their Orders and OrderDetails entities. This is how the query looks like:
// Query A
var q = (from c in metaData.Customer
where c.Country=="Germany"
select c).WithPath(
new PathEdge<OrderEntity>(CustomerEntity.PrefetchPathOrders,
new PathEdge<OrderDetailsEntity>(OrderEntity.PrefetchPathOrderDetails)));
You simply define the path edges of the path to the related entities to fetch together with the main entities and you're done. As edges can be defined inside other edges, it's possible to specify multi-branched trees inside the query, and every PathEdge object can contain a lambda-based filter, a sort expression, a limiter and a list of
fields to exclude. As Linq does allow you to specify lambda expressions wherever you want, the filter is specifyable using lambdas, and as the type to define the lambda
on is specified with the PathEdge, you'll get full intellisense. The sort expression is a different matter: sorting is specified in Linq using 'orderby', but that's only accepted at a specific place, namely inside a query, but not at every spot. So we've to fall back onto normal LLBLGen Pro sort expressions for that, but that's ok, they're easy to specify. For example if we want to sort the orders on the OrderDate, descending, we would do instead:
// Query B
var q = (from c in metaData.Customer
where c.Country=="Germany"
select c).WithPath(
new PathEdge<OrderEntity>(CustomerEntity.PrefetchPathOrders, null,
new SortExpression((OrderFields.OrderDate | SortOperator.Descending), 0,
new PathEdge<OrderDetailsEntity>(OrderEntity.PrefetchPathOrderDetails)));
The code is still compile time checked. Our prefetch path code has been in the framework since 2004 and it has gone through some revisions where it received more and more tweaks. One of the main tweaks, which is developer controllable, is the setting when LLBLGen Pro should switch from a subquery based filter to an IN (values...) based filter for fetching related entities. LLBLGen Pro uses 1 query per path node and merges sets through hash-value comparisons. It doesn't use joins between parent/child in a tree, as that leads to problems: with 1:n relations it leads to a lot of duplicates and with multi-branched trees it leads to complicated scenarios to fetch the data out the resultset and to very wide resultsets. So using 1 query per path node is much more efficient, and as a bonus: the fetch code for the node itself is the same as you'd normally use so easier to implement as well. The Entity Framework uses joins for related entities, which I don't think is a wise thing to do.
Say in the above path, the engine has fetched the Customers from Germany is now going to fetch their Orders. This could be done in SQL with:
SELECT o.OrderID, o.CustomerID, ...
FROM Orders o
WHERE o.CustomerID IN
(
SELECT CustomerID
FROM Customers
WHERE Country=@country
)
A clever Irish developer, Marcus Mac Innes, the owner of pix.ie, said to me: "Isn't that optimizable with an IN (customerID, customerID...) query in certain conditions? and he wrote a proof of concept. In Northwind, there are 11 customers from Germany. As these are fetched before the Orders, and therefore you know the CustomerIDs to filter the Orders on, you can instead do:
SELECT o.OrderID, o.CustomerID, ...
FROM Orders o
WHERE o.CustomerID IN
(@customerID1, @customerID2, ...., @customerID11)
This is faster, as the RDBMS just has to read one table. There's a catch: as the # of parameters increases with the increased number of parent objects already fetched, the question arises: is this always faster? Marcus did some extensive testing and the sweetspot for average sized tables was near a 100 parameters. Above that number, and the subquery as shown in the first SQL snippet was faster. This is excellent tweaking material, so LLBLGen Pro sports a threshold which allows you per query to optimize this setting, as it can be that your tables require a lower number, or perhaps in some cases a higher number. We set the threshold initially to 50 to be on the safe side for all queries.
Performance of prefetch paths and Expression tree parsing
So, how does this all perform? We first have to measure how fast our Linq provider parses Expression trees. As our provider doesn't have a compiled query equivalent like Linq to Sql, it's key to know if using Linq instead of our normal API is of any value. Our tests show that a medium complex query in Linq is fully handled by our provider in under 1 millisecond on a core2quad system with XP. So the overhead is pretty minimal. This also shows that the necessity of a compiled query feature isn't necessary: the overhead isn't really noticable when you take into account network latency / db roundtrips, object materialization etc. etc.
Let's run the first query, Query A a 100 times on this system and let's see how fast it is. After that it's compared to the same query using Linq to Sql instead. Now, I want to stress that benchmarks in general have to be taken with a grain of salt: it's a snapshot of a moment in time under the conditions of that moment, not a fact of life. So it's an indication of what the performance will be. With all tests I've done, I've first warmed up the CLR a bit with fetching a query or two, so all assemblies are loaded, database connections are live etc. Tracing/logging was switched off, the networked database server was idle (SqlServer 2000).
As we'll see later on in the article, the performance of LLBLGen Pro in hierarchical fetches is pretty good and Linq to Sql, well... let's say it doesn't do that well. The fact that Linq to Sql has a hard time with hierarchical fetches isn't new: several people have blogged about this in the past: Patrik Löwendahl, Roger Jennings and David Hayden.
Here's my simple LLBLGen Pro test code. I use an empty foreach instead of calling ToList() on the query, because ToList() in every provider copies the results into a new List object, which takes performance without adding anything to the story, so I left that out. Our queries are typed with an interface which allows you to obtain the results directly in a list, so you can avoid this performance penalty. With Linq to Sql, you either have to call ToList() or traverse the set.
Stopwatch sw = new Stopwatch();
sw.Start();
int amount = 100;
for(int i = 0; i < amount; i++)
{
FetchNestedSetTest();
}
sw.Stop();
Console.WriteLine("Fetching the nested set {2} times took: {0}ms. On average this took per fetch: {1}ms",
sw.ElapsedMilliseconds, sw.ElapsedMilliseconds / amount, amount);
//...
// the routine FetchNestedSetTest:
public static void FetchNestedSetTest()
{
using(DataAccessAdapter adapter = new DataAccessAdapter())
{
LinqMetaData metaData = new LinqMetaData(adapter);
var q = (from c in metaData.Customer
where c.Country=="Germany"
select c).WithPath(
new PathEdge<OrderEntity>(CustomerEntity.PrefetchPathOrders,
new PathEdge<OrderDetailsEntity>(OrderEntity.PrefetchPathOrderDetails)));
foreach(var v in q)
{
}
}
}
The output: Fetching the nested set 100 times took: 1906ms. On average this took per fetch: 19ms
That's 3 queries per call, 19ms total for setup the query, adapter, db connection query parsing etc. This is a debug build for the complete stack (Linq provider, O/R runtime, SQL engine) and connects to a SqlServer 2000 database over the network.
Ok, now let's see what Linq to Sql does in this case (same machine, same target database, ran after LLBLGen Pro's version).
Stopwatch sw = new Stopwatch();
sw.Start();
int amount = 100;
for(int i = 0; i < amount; i++)
{
FetchNestedSetTest();
}
sw.Stop();
Console.WriteLine("Fetching the nested set {2} times took: {0}ms. On average this took per fetch: {1}ms",
sw.ElapsedMilliseconds, sw.ElapsedMilliseconds / amount, amount);
//...
// the routine FetchNestedSetTest:
public static void FetchNestedSetTest()
{
NorthwindDataContext nw = new NorthwindDataContext();
DataLoadOptions loadOptions = new DataLoadOptions();
loadOptions.LoadWith<Customer>(c => c.Orders);
loadOptions.LoadWith<Order>(o => o.Order_Details);
nw.LoadOptions = loadOptions;
var q = from c in nw.Customers
where c.Country=="Germany"
select c;
foreach(var v in q)
{
}
}
Ready? This is the output: Fetching the nested set 100 times took: 16558ms. On average this took per fetch: 165ms
Yes, I've checked it a couple of times, and before you ask if my code indeed fetches data, it does. 19ms vs. 165ms. That's a big difference. How is this possible? The main reason is that MS' code isn't clever in this case. It doesn't do any fancy optimization of the graph fetch, while it has all the information at hand to do so. Instead it executes a lot of queries: 1 for the Customers, one to fetch for every Customer's Orders and per Order, one to fetch the Order's OrderDetail entities. That's a lot of queries, which take a lot of time. Not only that, it will burn your server to the ground if you have many people accessing your database and executing these queries. So if you're considering Linq to Sql, be sure you know this serious performance bottleneck. See also the 3 articles by Patrik, Roger and Dave I've linked to above, which explain the issue more in detail. In a bit, we'll see that this approach could also lead to wrong results in some cases. More on that below.
Microsoft has put a lot of effort in optimizing the Linq to Sql runtime via IL code generation at runtime (see Rico Mariani's posts about this subject) to get fast set fetches. This code indeed does its work well, simple sets are fetched very quickly, a couple of milliseconds per query faster than our code. But, I'm sorry to say it Rico and friends, isn't this work all in vain when the optimization in the end turns out to be a micro-optimization (which is considered a bad thing)? Shouldn't this work have been applied to the bigger picture which clearly shows that the Linq to Sql framework isn't equipped with solid, tweakable, clever logic to perform all queries very fast, especially the queries for hierarchical fetches, one of the core reasons to use an O/R mapper in the first place? Optimizing frameworks is a delegate manner, and optimizing O/R mappers is something you have to work hard for: the classic 'it depends' is applicable to a lot of situations and to get the best performance in all cases, it's key that the framework notices which situation it is in and adapts to that situation. Code to quickly do hierarchical fetches with tweakable features is somewhat complex to write, especially the merging of the sets, but Linq to Sql doesn't even support m:n relations, so the hardest part of hierarchical fetches was already off the table!
Let's look at the other way of doing hierarchical fetches with Linq.
Fetching hierarchical sets of non-entity projections
Linq allows you to specify a projection inside the projection of your query, or more than one. I'll show you two examples. The first one I'll show you is a complex, 5 entity touching hierarchical fetch with multiple paths. The other one is a simpler one which fails to return the proper set in Linq to Sql which is caused by their way of handling nested sets in the projection.
Linq to LLBLGen Pro treats nested queries inside the projection the same way as prefetch paths: nested queries are fetched with a separate query per set, the sets are merged using hashvalue matching and the queries for the nested sets are optimized using the same bag of tricks used for prefetch paths (yes, we're using more tricks than just Marcus' trick
). The nested queries in a Linq query have an additional problem though: if the end result is a set of anonymous types you can only materialize the objects if you have all the data: anonymous types don't have setter properties, only getters. This requires a 2-phase projection: first the data is fetched from the db using a simple projection, and stored into a temporary storage in-memory. Then all nested queries are fetched, and the results of each query are placed in the slot per row of the set to materialize. Then that rowset is projected again, but this time in-memory with the actual projectors so in-memory method calls are ran and anonymous type instantiation works OK.
Ok, let's look at our phat nested query first:
var q = from c in metaData.Customer
where c.Country == "Germany"
select new
{
c.CompanyName,
c.City,
Orders = from o in c.Orders
select new
{
OrderDetails = from od in o.OrderDetails
select new
{
od.Product,
od.Quantity,
od.UnitPrice
},
o.Employee,
o.OrderDate
}
};
This query combines fetches of anonymous types with data projected from entity sets with fetches of entities: od.Product and o.Employee are fetches of related entities, while the rest are projections to anonymous types. This query fetches data from Customer, Order, OrderDetails, Product and Employee, and should use at most 5 queries. Let's put this baby on the testbench to see how fast our provider is. We use the same setup as above, only this time FetchNestedSetTest() simply executes the query above instead of the prefetch path query. The foreach assures proper enumeration. The query is also in a unittest with data-checks, so it indeed does fetch all data, so the timings are correct.
The output is: Fetching the nested set 100 times took: 4363ms. On average this took per fetch: 43ms
This number isn't that much of a surprise, considering the fact that the overall set of data is pretty small: just 11 customers, their orders, order details, the products used in those order details (70 in total) and the employees who filed the orders. With the approach of one query per node, and with tricks like switching to an IN (value, value..) query instead of a subquery, it makes the individual sets pretty small, and the merger routine which has to merge the different rows based on hashvalues doesn't have a hard time. A debug build was used.
Now, let's see how Linq to Sql handles this query. It comes as no surprise that a nested query inside a projection is handled the same way as with the loadoptions: the query is executed per parent, except when it is a scalar, then it is folded into the parent's select for obvious reasons (we do that too, btw). To avoid controversy, here's the Linq to Sql variant of this query:
var q = from c in nw.Customers
where c.Country == "Germany"
select new
{
c.CompanyName,
c.City,
Orders = from o in c.Orders
select new
{
OrderDetails = from od in o.Order_Details
select new
{
od.Product,
od.Quantity,
od.UnitPrice
},
o.Employee,
o.OrderDate
}
};
The output: Fetching the nested set 100 times took: 30714ms. On average this took per fetch: 307ms
So 43ms vs. 307ms. Analyzing what Linq to Sql does, you see the same thing: a lot of queries are executed. Clearly there's no clever code in place which makes this more optimal.
It might be that you now think these are edge cases., but I beg to differ. Just because you can group sets so easily together in projections in Linq, it's mandatory that these fetches are done optimally. Futhermore, and I already stressed that point above: the ability to fetch trees of objects is a big advantage of an O/R mapper over simpler code which fills classes with results from a query: merging sets efficiently without any effort from the developer is a very nice feature as the data is grouped in trees, objects in collections inside other objects, exactly as the class model dictates. Isn't that one of the reasons people want to work with entities, objects and the like and not with flat tabular data?
There's another reason why the approach Linq to Sql uses isn't that good: it can lead to wrong results. The following Linq to Sql query results in 818 rows, one for every Order entity in the database. However, that's wrong, as 'Distinct' is used in the query: Customers have often more than 1 order and often multiple orders are filed by the same Employee. The right answer has to be 472 (on our northwind test database) instead of 818 rows. But because Linq to Sql doesn't use 2-phase projections, it can't take into account the Distinct at all, leading to the wrong results:
var q = (from o in nw.Orders
select new
{
o.EmployeeID,
CustomerData = (from c in nw.Customers where c.CustomerID == o.CustomerID select new { c.CompanyName, c.City, c.Country }).Single()
}).Distinct();
The query is perhaps not what you'd see in your daily work, but it proves the point: an O/R mapper has to produce the right results according to the query.
I'll conclude with an example of Linq to LLBLGen Pro using a nested query which result is passed to a method which is executed in-memory, thus executed during projection. The method is very simple but illustrates the mechanism.
// helper class which contains the in-memory method to call
public class InMemoryMethodCallsForTests
{
public static bool IsUsefulEntity(CustomerEntity customerEntity)
{
return customerEntity.Country == "Germany";
}
}
// unit test which illustrates the point of calling this method with a nested query result.
[Test]
public void InMemoryMethodCallWithNestedSetResultAsParameter()
{
using(DataAccessAdapter adapter = new DataAccessAdapter())
{
LinqMetaData metaData = new LinqMetaData(adapter);
var ordersFromGermany = from o in metaData.Order
where o.Customer.Country=="Germany"
select o.OrderId;
List<int> orderIds = ordersFromGermany.ToList();
Assert.AreEqual(121, orderIds.Count);
var q = from o in metaData.Order
select new {
o.OrderId,
Value = InMemoryMethodCallsForTests.IsUsefulEntity(o.Customer)
};
foreach(var v in q)
{
if(v.Value)
{
Assert.IsTrue(orderIds.Contains(v.OrderId));
}
else
{
Assert.IsFalse(orderIds.Contains(v.OrderId));
}
}
}
}
This test calls for every anonymous type fetched the IsUsefulEntity routine by passing the result of the nested query on the Customer to it. The result is placed in the slot for 'Value' and is used to materialize the anonymous type instances returned by the query.
So, done yet?
Not entirely but almost. We want to add support for Full text search to our Linq provider (our API supports it, so it's a small thing), a way to specify fields to exclude in entity fetches (so fetch all fields from Employee except the Photo field) which is something already implemented in LLBLGen Pro so it shouldn't be a lot of work to add it and some very minor details to wrap up and then it's done!
If everything goes according to plan, we'll release next week a Linq CTP of LLBLGen Pro v2.6 to our customers. This CTP includes new runtime (v2.6), templates, the Linq provider and the sourcecode of all unittests for Linq so you get a lot of queries to look at to get started. After that, we'll spend a month or so to make the changes we've planned for v2.6 which will then be released in beta. I think that gives us enough time to hammer out the bugs/glitches/things we forgot from the Linq provider before release.
Final words
This was the last article in this series. It spans 14 episodes and I think it's a great read how my journey through Linq land went: from being in the pitch-black with just a flashlight with dead batteries, to the bright sunny dailylight of today with the fully working, high performance provider it has become. Compiled in release build, the Linq provider alone is 170KB, and the provider alone spans roughly 925KB of C# sourcecode (and there's still some small things left!). The first branch in SVN was created in September 2007, so roughly 5-6 months worth of work. I was on the edge of giving up on this Linq provider a couple of times, mainly due to having no real information when what appears where in the tree, however I'm glad I made it: the Linq provider offers a nice way of querying all databases supported by LLBLGen Pro and makes a nice addition to the featureset we already provide.
I'd like to thank all people who have supported me during this journey! I hope you all enjoyed reading these articles. The wait is almost over, though I can assure you, it was worth it.
(This is part of an on-going series of articles, started here)
Last time I talked about implementing Single. It turned out to be fairly straightforward, but as I explained in the previous episode, it's a weird method and has different behavior related to where it's used in the query: sometimes it does and sometimes it doesn't result in an exception and also it doesn't always result in the same type of exception. This same aspect of not having the same behavior, or better: end result, related to where the element is used in the query is seen with using function mappings, local methods etc. in a Linq query. Let's look at an example or two:
// Example A: usage of method mapped onto db function
LinqMetaData metaData = new LinqMetaData(adapter);
var q = from c in metaData.Customer
where String.Compare(c.Country, "USA")==0
select c;
List<CustomerEntity> customers = q.ToList();
Assert.AreEqual(13, customers.Count);
// Example B: usage of in-memory call of method not mapped on db function
LinqMetaData metaData = new LinqMetaData(adapter);
var q = from o in metaData.Order
where o.CustomerId=="CHOPS"
select new
{
DaysInMonth = DateTime.DaysInMonth((int)o.OrderDate.Value.Year,
(int)o.OrderDate.Value.Month),
OrderId = o.OrderId,
Constant = true
};
Example A is a typical example of a .NET method which is mapped onto a DB function / construct. However, with example B we see a different usage: DateTime.DaysInMonth isn't mapped onto a DB function in this case, it's executed in memory on the query result produced by the query during the materialization of the objects in q, so during the projection of the raw query results onto objects. This gives a bit of a problem: if there's no mapping of a method / property onto a DB construct, the method / property apparently has to be executed in-memory after the results have been received. Though, what about this query instead:
// Example C: non-working query due to in-memory method call in where clause
var q = from o in metaData.Order
where o.CustomerId=="CHOPS"
&& DateTime.DaysInMonth((int)o.OrderDate.Value.Year,
(int)o.OrderDate.Value.Month)==30
select o;
Here, we use the exact same construct of DateTime.DaysInMonth, however now it's inside a where clause and therefore it is part of the query which means it will
run inside the database. But that doesn't work if there's no way to convert the .NET code to SQL. So we have a query with the same construct but it has different results where it is used: sometimes it works, in other cases it ends in tears.
It is very important that developers understand this difference: that it's possible that an element does work in one area of the query but doesn't work in another area, because there's no conversion possible to SQL, so it can't run on the database. The problematic thing about this is that you'll find this out at runtime, not at compile-time, as the query is evaluated at runtime so the code compiles fine.
The main cause of this confusion is that Linq offers a lot of features to be used inside a query and with a lot of features comes a lot of responsibility. This means that if you're going to write a Linq query which is will run in the database, be absolutely sure you know where which part is executed. Not only will that save you from the trouble that things won't work at runtime, it also will safe you from the trouble fixing performance issues because what you thought would be running inside the DB is actually ran on the client. To add to this confusion, one can create a mapping for DateTime.DaysInMonth to a function in the database which returns the same value as the .NET version does. After that the construct will be convertable to SQL and will run inside the database and not in memory, so example C would then work properly. Later in this post we'll explore how to achieve exactly that.
One could argue if all this is really a good thing or not. I'm a bit undecided about this ("Now, that's a first!"): after writing a lot of tests for our method/property mapping system, I can only say: being able to map a .NET method/property to a DB construct is awesome. However I fully understand the problematic aspects illustrated in the examples B and C above and what the impact will be for people who read code written by others: they have a hard time understanding what's really going on. Not fully understanding what's going on is one of the main causes why bugs are introduced and not found by the developer.
For the people who have never seen code for an O/R mapper, Linq seems like magic and the fancy, shiny aura of it makes them forget about all the real life problems they've to deal with: this will solve all the problems, once and for all!. However the people who have worked with O/R mappers know that dropping SQL to replace it with another query language doesn't make things in all cases easier to implement: Linq is just another querying language and with all querying languages, it shares the same disadvantage: it's not a 1:1 equivalent of SQL. This is a disadvantage that will eventually become apparent for the novice O/R mapper users among us: you will still wonder from time to time how to write the same query in Linq as you would do in SQL.
LLBLGen Pro was the first O/R mapper on .NET which had a compile-time checked querying system. At the same time it forced the developer to use a different style of querying than one would use with SQL. With Linq, one might be tempted to think this will be different: it won't be hard anymore to rewrite a query originally written in SQL in C# using Linq. But that's naive and, if I may say so, a tiny bit wrong. Linq abstracts away so many things, that the learning curve to be able to write whatever query you'd want in Linq is still steep. The power offered by Linq to combine .NET code running on the client with code which will be converted to SQL into one query is great, but it also comes with a price: you've to be fully aware of what you're doing, as described in examples B and C. That's not a bad thing though. Writing software requires that you know what you're doing and with Linq and a good O/R mapper beneath it, a lot of power is right there at your fingertips.
Now that that lecture is out of the way, let's look at a couple of nasty buggers shown in the examples above.
Bugger 1: Constants in the projection
This is a problem: do you emit the constant in the query end-result or do you pass the constant along to the projection engine? The problem with the first approach is that if the value isn't convertable to a SQL construct, you're in trouble. So the best way is to pass it to the projection engine. The constant I'm talking about here is a constant similar to the 'true' boolean in example B and C. I first thought this would be a bit silly to add support for this, but after a while it did make sense, so adding support for this was necessary.
LLBLGen Pro has a fine-grained projection framework build-in, which allows you to define projections from sets onto datatables, entity classes or custom classes. It also allows you to define a projection from a view of an entity collection onto datatables, other entity classes or custom classes. This framework works with a set of source values, a set of destination elements and a set of projectors which define how to set each destination element to a value, as in: how to convert the source values into destination values or value (the number of source elements doesn't have to be the same as the number of destination elements). The input of a projection is the set of rows of data to project, e.g. the rows in a resultset of a query. Onto each row, the set of projectors are applied to produce the values for the set of destination elements, which for example can be a set of properties, or a set of constructor arguments or both.
I made a change to this framework where each projector to produce a destination element's value could have a delegate as well. This delegate would then be invoked with the row of data and it would produce a value. That value then would be the result of the projector to set its corresponding destination element with. A constant in the projection therefore is a simple delegate which produces... that constant!
Creating delegates in Linq is rather easy: you create a LambdaExpression and call Compile on it. If you predefine the signature of the delegate with a delegate definition, you can produce a delegate from the LambdaExpression by specifying the delegate type to the Expression.Lambda static method which creates the lambda and you just call Compile onto it to make it a true delegate of that delegate type. After that, you can invoke it with the signature of the delegate type. This same trick is used with in-memory method/property calls as well.
Bugger 2: In-memory methods/property calls in projection
To be able to execute the in-memory method/property during the projection, we've to create a delegate out of it. With constants, we've seen that we can produce a lambda expression from it, compile it and we'll get a delegate. There are a couple of issues though. One of them is that every method can have a different number of arguments, and if we want to have a delegate which executes fast, we have to avoid DynamicInvoke however this implies we have to solve the problem that we can't have a gazillion different delegate signatures to deal with. Another is that the elements of the query are still inside the lambda. These have to be replaced with references to parameters. Take our DaysInMonth example:
DaysInMonth = DateTime.DaysInMonth((int)o.OrderDate.Value.Year, (int)o.OrderDate.Value.Month)
Here we see two references to elements in the query: OrderDate.Year and OrderDate.Month (it's a nullable type, hence the 'Value' reference). Year and Month are properties mapped onto a DB construct, namely YEAR(date) and MONTH(date), but more on that later. We've to pull these elements out of the lambda, as otherwise they'd end up in the compiled delegate, and instead have them replaced with parameters. We could rewrite the method call as: DateTime.DaysInMonth(values[indexes[0]], values[indexes[1]]). Here, values are the set of values from the input and indexes is the array of indexes into values for the parameters. This is easily done with a special visitor-like handler which traverses the lambda subtree and rewrites elements, and at the same time collects the elements rewritten. This also offers a way to use a standard delegate signature with two input parameters: a values array and an index array. Because of this, all we need to do now is compute the proper indexes for each input parameter, store them with the projector and we're done, as the compiled lambda always has the same signature and therefore we can use the very fast delegate.Invoke(values, indexes) instead of using DynamicInvoke, no matter how many parameters the in-memory method has
.
These kind of delegate invokes are very fast: 10 million times calling Invoke takes less than 17ms. I find that acceptable, especially since 10 million times calling DynamicInvoke takes over 4 seconds! With this system, constants and in-memory method calls/property accesses in projections could be handled without problems, no matter how complex the calls would be (in-memory method call which takes in-memory method call's result's property access etc.
). There's a problem with object instantiation, but I'll cover that later in the post.
Mapping methods and properties to DB constructs
We already saw an example of this feature with Year and Month and also in Example A, with the String.Compare() method call. How does this work, and how flexible is it? From the start I wanted to have this system as flexible and as extensible as possible: developers have to be able to map their own db functions to .NET methods / properties and also should be able to overrule existing mappings because they think their construct works better (which is possible, see the string section below). The system build into Linq for LLBLGen Pro offers that: a mapping feature for mapping .NET methods and properties to database constructs. So not only plain functions, but whole snippets of SQL.
LLBLGen Pro's DbFunctionCall query element offers a way to call a function which accepts a constant value as a parameter (e.g. DATEADD() requires a parameter which is a constant, but is placed inside the (), so you can't map 'DATEADD' to some other name). It does this by accepting a SQL snippet with placeholders for the parameters where LLBLGen Pro's query engine will emit the parameter names for the input values of the function. This feature is used here as well.
By default, every dynamic query engine (DQE) for each database comes with a set of pre-defined mappings for a set of methods/properties of .NET classes onto database constructs. For example the String.Compare method call is mapped as follows in SqlServer:
// Compare(2)
_functionMappings.Add(new FunctionMapping(typeof(string), "Compare", 2,
"CASE WHEN {0} < {1} THEN -1 WHEN {0} = {1} THEN 0 ELSE 1 END"));
These function mappings are defined in a FunctionMappingStore object inside the DQE. A developer can also define his/her own function mappings in an own
FunctionMappingStore instance. Internally, the function mappings retrieved from the DQE used in the particular query, are obtained automatically by the Linq provider, so the developer only has to pass mappings defined in an own FunctionMappingStore instance. This FunctionMappingStore is placed in front of the default one, which means that if a mapping is found in the function mappings passed in by the developer, that mapping is used, otherwise the mapping found in the default ones. This offers the facility to overrule method/property mappings in-code and on a query by query basis.
The following example shows how one defines a mapping and how to pass it to the query engine. We're going to map a .NET method called 'CalculateOrderTotal' to a function defined in the Northwind database called fn_CalculateOrderTotal. This function takes two parameters, orderID and a boolean to use discount prices or not.
// defining the method in .NET. This method is empty as it isn't called in .NET code.
public class NorthwindFunctions
{
public static decimal CalculateOrderTotal(int orderId, bool useDiscounts)
{
return 0.0M; // never called
}
}
// defining the mapping in our own function mapping store, which is a class in our
// own project or in a referenced assembly, whatever you prefer.
public class NorthwindFunctionMappings: FunctionMappingStore
{
public NorthwindFunctionMappings()
{
// define the mapping. SQLServer 2000 needs the schema to be present for the
// function call, so we specify that as well.
this.Add(new FunctionMapping(typeof(NorthwindFunctions), "CalculateOrderTotal", 2,
"fn_CalculateOrderTotal({0}, {1})", "Northwind", "dbo"));
}
}
As you can see, you could opt for specifying a catalog and a schema as well. This is handy if the function you want to call is in another schema or catalog. LLBLGen Pro offers automatic catalog / schema name overwriting at runtime by setting a simple config file setting, so this hard-coded names aren't limiting flexibility.
Now that we have our mappings defined, we can use this method in our Linq queries:
using(DataAccessAdapter adapter = new DataAccessAdapter())
{
LinqMetaData metaData = new LinqMetaData(adapter, new NorthwindFunctionMappings());
var q = from o in metaData.Order
where o.CustomerId == "CHOPS"
select new
{
o.OrderId,
OrderTotal = NorthwindFunctions.CalculateOrderTotal(o.OrderId, true)
};
int count = 0;
decimal totalOfOrders = 0.0M;
foreach(var v in q)
{
count++;
totalOfOrders += v.OrderTotal;
}
Assert.AreEqual(8, count);
Assert.AreEqual(21045.0M, totalOfOrders);
}
The LinqMetaData class accepts an optional FunctionMappingStore instance which is then used inside the queries produced with that LinqMetaData instance. As LLBLGen Pro doesn't use xml files to store meta-data nor does it have to reflect over an assembly, instantiating a LinqMetaData class is very fast, it doesn't have any overhead to deal with.
In the query above, you'll see it looks almost the same as the ones in the first examples. I've left the DataAccessAdapter creation in the code snippet so you have a clear picture of the complete code needed to execute the query. The SQL produced by this query looks like:
SELECT [LPLA_1].[OrderID] AS [OrderId],
[Northwind].[dbo].fn_CalculateOrderTotal([LPLA_1].[OrderID], @LO11) AS [OrderTotal]
FROM [Northwind].[dbo].[Orders] [LPLA_1]
WHERE ((((([LPLA_1].[CustomerID] = @CustomerId2)))))
// parameter definitions:
Parameter: @LO11 : Boolean. Length: 0. Precision: 0. Scale: 0. Direction: Input. Value: True.
Parameter: @CustomerId2 : StringFixedLength. Length: 5. Precision: 0. Scale: 0. Direction: Input. Value: "CHOPS".
With this system, you can define wicked pieces of SQL and map a .NET method onto it and use it in your Linq queries very easily. As mentioned you can overrule the predefined mappings as well. The main issue is with strings. Databases have various string types in most cases: some being fixed length, others being variable length. In SqlServer you have a couple of functions to produce the length of a string: DATALENGTH and LEN. The difference is that with DATALENGTH you'll include trailing spaces, with LEN you don't. LEN is thus ideal for (N)CHAR typed fields, DATALENGTH for variable length fields. Our mappings are build with LEN, though if you require mappings based on DATALENGTH, you can easily override them in your own mappings.
Booleans in projections
In earlier episodes of this series I said we wouldn't support booleans in projections because SQL doesn't support booleans in projections either. However, with the function mapping system in place, it was very easy to implement this with a CASE construct onto which we mapped our own method. This method isn't used by the developer though we see a boolean expression in the projection as a method call to that method to produce the proper construct. As the rest of this function mapping pipeline has already been implemented, it automatically works once we pull the proper mapping out of the FunctionMappingStore instances. Take this example:
// Boolean expression in projection
LinqMetaData metaData = new LinqMetaData(adapter);
var q = from c in metaData.Customer
select new
{
c.CustomerId,
IsAmerican = (c.Country == "USA"),
};
The SQL query produced is:
SELECT [LPLA_1].[CustomerID] AS [CustomerId],
CASE
WHEN [LPLA_1].[Country] = @LO13651f01 THEN 1
ELSE 0
END AS [LPFA_2],
FROM [Northwind].[dbo].[Customers] [LPLA_1]
// parameter definition:
Parameter: @LO13651f01 : String. Length: 3. Precision: 0. Scale: 0. Direction: Input. Value: "USA".
Though this does cause a bit of a problem: the query returns '1', not 'true'. Though our lambda-to-delegate trick helps here. We simply produce a lambda which converts the query value to boolean by comparing it to 1 and we're done.
Lambda-based function usage
As the new versions of C# and VB.NET allow us to define simple delegates as Func instances using lambda expressions, why not use them inside a query? This results in a similar approach as with normal in-memory calls: compile the lambda to a delegate with a known signature and use that during projections. Here's an example:
// in-memory lambda function usage in projection.
Func<string, string> stringChopper = s => s.Substring(0, 3);
LinqMetaData metaData = new LinqMetaData(adapter);
var q = (from o in metaData.Order
select new { FirstChars = stringChopper(o.Customer.CompanyName),
o.Customer.CompanyName }).Distinct();
In this example, we define a simple lambda which performs a substring call on the input and we use it in our projection. To illustrate that this function indeed does run in-memory, the query produced looks like:
SELECT DISTINCT [LPLA_2].[CompanyName] AS [FirstChars0], [LPLA_2].[CompanyName]
FROM [Northwind].[dbo].[Customers] [LPLA_2]
Object creation inside projections
In the previous examples, you've seen that the projection of the query produces new, anonymous typed objects, one per result row. This is done by the 'new' operator, which results during the expression tree processing in an object representing a projection onto an anonymous type. However this approach broke down on the following code I'll show you in a bit. It's about creating an object in the projection with 'new' but the result of a method call or property access on this method is the result of the projection of each row.
// class used in example
public class Concatenator
{
private string _val1, _val2;
public Concatenator() {}
public Concatenator(string val1, string val2)
{
_val1 = val1;
_val2 = val2;
}
public string ConcatMembers()
{
return _val1 + _val2;
}
public static string Concat(string val1, string val2)
{
return new Concatenator(val1, val2).ConcatMembers();
}
public string ConcatValues(string val1, string val2)
{
_val1 = val1;
_val2 = val2;
return ConcatMembers();
}
public string Val1
{
get { return _val1; }
set { _val1 = value; }
}
public string Val2
{
get { return _val2; }
set { _val2 = value; }
}
}
I know the class doesn't make any sense in real life, but it's a great way to illustrate the point: consider the following query:
LinqMetaData metaData = new LinqMetaData(adapter);
var q = from c in metaData.Customer
where c.Country != null
select new Concatenator(c.Country, c.City).ConcatMembers();
This query doesn't produce a list of Concatenators, it produces a list of strings, namely the result of the call to ConcatMembers on every row. This means that the 'new' expression seen in the tree has to be seen as an in-memory call, and therefore has to be compiled into a lambda. This took me some time to get right but it works now. I still have a little bug to fix with a property access instead of a method call, but that should be pretty straight forward.
These kind of constructs blurr more and more the line between what's ran inside the database and what's ran in-memory. A developer should be careful with these powerful features but once understood, it opens up a lot of power. To illustrate what can be done with this, I'd like to show you a combination of our Concatenator fellow shown above with a boolean subquery which result is converted to string. The subquery is fetched twice for the unittest to fulfill its duty.
LinqMetaData metaData = new LinqMetaData(adapter);
var q = from c in metaData.Customer
where c.Country != null
select new
{
c.City,
Contains3 = (from o in c.Orders select o.EmployeeId).Contains(3),
ConcatenatedResult = Concatenator.Concat(
(from o in c.Orders select o.EmployeeId).Contains(3).ToString(), c.City)
};
This is particular nasty, because the boolean value returned from Contains normally would have been converted from the int returned from the CASE statement in the SQL query to a real bool value with a delegate. However, the ToString method following it is a method which has a mapping in the database. This means that unless there's something done under the hood, we'll get '1' instead of 'True' in our results as it's not possible to fit the in-memory delegate call between the CASE statement and the database construct onto which ToString is mapped. The solution appears to be simple: 'ToString', as any other method or property, can have a mapping per-type, so we simply map ToString() for a boolean onto "CASE {0}=1 THEN 'True' ELSE 'False' END" and it works automatically, as the system finds a mapping of the method call, in this case ToString, for the type, in this case System.Boolean, and uses that mapping.
Example C revisited
Let's look again at our example C, which had the DaysInMonth function in the where clause and therefore couldn't work because it has no mapping to a database construct. But what if we would map it to (*drumroll*):
"DATEDIFF(d, CONVERT(datetime, '' + CONVERT(NVARCHAR(4), {0}) + '-' + CONVERT(NVARCHAR(4), {1}) + '-01'), DATEADD(m, 1, CONVERT(datetime, '' + CONVERT(NVARCHAR(4), {0}) + '-' + CONVERT(NVARCHAR(4), {1}) + '-01')))" ? That will result in the proper # of days in the month specified by the two parameters, which will be placed on {0} and {1} resp. Let's see if that works:
// we add the mapping to our NorthwindFunctionMappings class ctor
// DaysInMonth(2)
this.Add(new FunctionMapping(typeof(DateTime), "DaysInMonth", 2, "DATEDIFF(d, CONVERT(datetime, '' + CONVERT(NVARCHAR(4), {0}) + '-' + CONVERT(NVARCHAR(4), {1}) + '-01'), DATEADD(m, 1, CONVERT(datetime, '' + CONVERT(NVARCHAR(4), {0}) + '-' + CONVERT(NVARCHAR(4), {1}) + '-01')))"));
// after that we can create a test to see if this indeed works!
[Test]
public void CustomMappedVersionOfDaysInMonthUsageInWhereClauseTest()
{
using(DataAccessAdapter adapter = new DataAccessAdapter())
{
LinqMetaData metaData = new LinqMetaData(adapter, new NorthwindFunctionMappings());
var q = from o in metaData.Order
where DateTime.DaysInMonth(o.OrderDate.Value.Year, o.OrderDate.Value.Month) == 30
select o;
int count = 0;
foreach(var v in q)
{
count++;
Assert.IsTrue(v.OrderDate.HasValue &&
(DateTime.DaysInMonth(v.OrderDate.Value.Year,
v.OrderDate.Value.Month) == 30));
}
Assert.AreEqual(252, count);
}
}
Which succeeds. As you can see, you can add your own mappings to overcome roadblocks on your path to success.
Linq to Sql odd-ness
It keeps surprising me how different the quality is of some parts of Linq to Sql. As if multiple teams have worked on those parts and one team simply dropped the ball in a lot of its tasks. While Linq to Sql is often capable in producing a proper, optimized query even if the query is very very complex and awkward, it sometimes drops the ball on simple things. Take for example left-shift and right-shift operator support. Back in the days when I was hammering out MC68000 assembler on my Amiga 500, we demosceners knew that to multiply with a power of 2, it was faster to add the values together than to multiple by 2 or to shift 1 bit. Shifting became faster if you had to multiply with a value larger than 4. Shifting is equal to multiplying with values of a power of 2 and vice versa. So shifting a value v n bits to the left can be mapped onto (v * POWER(2, n)). Shifting to the right is equal to: (v / POWER(2, n)). Linq to Sql can't deal with left/right shift operators in code. As a programmer I find that a little odd as bit operations and bit logic is a part of every first year of every CS course. It's odd as well as they otherwise have mapped a lot of operators and methods onto SQL constructs.
One could say that bitshifts aren't that important, but sometimes values are packed in bitfields in the database. Shifting is important in that case.
Another one is the lack of support for tuple matching with Contains and Exist. Granted, this is a bit fragile because it can go wrong with properties not being there, however in the cases where these things are properly set up, tuple matching is very handy. Consider the following query:
LinqMetaData metaData = new LinqMetaData(adapter);
var q = (from c in metaData.Customer
select new Pair<string, string>() { Value1 = c.Country, Value2 = c.City })
.Contains(new Pair<string, string>("USA", "New York"));
Pair is a simple generic bucket class which has two properties. The thing is: this query is actually not a clean-room testcase, but a real-life usable query. Admitted, you could also write this as a fetch and then execute a Count() over it, but you can do that with all Contains calls.
Perhaps they'll add it in the next version.
What's next?
After the bug with the property access on an in-memory object created in the projection has been fixed, the next thing to do will be defining the mappings for the functions for other databases than SqlServer. After that I want to start with the last part of the provider: hierarchical fetches. LLBLGen Pro has a powerful hierarchical fetch plan system called prefetch paths and the design to include this into Linq queries has been done already, so writing the code should be 'fairly' straight forward (as the core of the code is already done inside the LLBLGen Pro framework). The hierarchical fetches which are less simple are the ones which are nested in projections. Linq to Sql delegates these queries to be executed when the property is accessed, which results in a lot of queries and slow execution, and with our hierarchical fetch code already in place (so all the merging logic etc. is already written) it should be doable to connect the hierarchical query with this code to do hierarchical fetches of nested queries in a projection with one query per node per tree branch. See this example:
var q = from c in metaData.Customer
select new
{
Name = c.ContactName,
Orders = from o in metaData.Order
where o.CustomerId == c.CustomerId
select o
};
This query results in (1 + #of customers) amount of queries in Linq to Sql, though it's doable with just 2 queries. The trick is that there's a correlation relation inside the nested query, which can be used to tie the two queries together to drive the merge logic. But that's for the next episode of this series, which hopefully will be the last one. Stay tuned!
From the comments in 'Portraint of a n00b' by Yegge:
I program in English. After reviewing the English, I comment in Java after each sentence to let the computer know how to do it. In essence, I program in dual languages. IMHO the approach is what's important.
and:
Static typing is like saying there is this box and you can only put this kind of thing in it. The box is labeled.
Dynamic typing is like saying we have these generic transparent boxes that you have to look into to find out what is in there. or just remember where you put everything.
(Updated Wednesday 30-jan-2008). It was mentioned that we would implement 'Skip' as well, although we already had a paging method added, TakePage(). After carefull analysis, we decided not to implement Skip for now. The reason is that it can lead to confusing queries, while paging is what the developer wants. We believe TakePage() serves the developer better than a combination of Skip / Take (Take is still supported separately) which won't work in a lot of cases if Skip is specified alone.
(This is part of an on-going series of articles, started here)
Cast again
The last episode in this series contained a remark about Queryable.Cast, and that it can be ignored. I've to correct myself there, this isn't entirely correct. Let's look at the example query at hand:
// LLBLGen Pro Linq example
var q = from e in metaData.Employee.Where(e => e is BoardMemberEntity).Cast<BoardMemberEntity>()
select e.CompanyCarId;
Here, the Cast is actually irrelevant because the Where already filters on the BoardMemberEntity type. However in the following small query, it's not:
var q = from e in metaData.Employee.Cast<BoardMemberEntity>()
select e.CompanyCarId;
Here, the Cast is actually a type filter. Well... sort of. The thing is: there's no real 'good' way to handle this, similar to 'as', which is discussed below. Imagine the situation where the employee instance is of type ManagerEntity (supertype of BoardMemberEntity), and not of type BoardMemberEntity. What should the query above do in that case? The instance 'e' in that case doesn't have a CompanyCarId field, as that's a field only available in the BoardMemberEntity type. The specification of Queryable.Cast doesn't reveal any help here: it doesn't say what should be done when the Cast can't be performed. So, I decided it emits a type filter instead. This means that the query above always succeeds if there are BoardMemberEntity instances in the database, it simply filters out any types which aren't of the specified type, in this case BoardMemberEntity.
OfType
Queryable.OfType is a bit of a weird method, in that it actually does more or less what Cast does in that it filters out any elements which aren't of the type specified. Of course, 'Cast' by definition isn't a filter, but in database queries, you have little choice here: SQL isn't imperative, at least not in the way SELECT queries work. OfType therefore is implemented similar to Cast: it emits a type filter into the query.
The 'as' keyword
I already had implemented support for 'is', which resulted also in a typefilter (if you now have the feeling there are a couple of redundant ways to specify the same thing in Linq, you're correct), but 'as' is a little different. It again is more or less a keyword which is actually not really usable in database queries. Let's look at a query which illustrates this:
var q = from e in metaData.Employee
where (e as BoardMemberEntity).CompanyCarId==1
select e;
This in itself looks pretty straight forward, but look closely to what it represents: what if 'e' is a ClerkEntity (not a supertype of BoardMemberEntity) ? The 'as' keyword should then result in null/Nothing, so accessing the property CompanyCarId on it should result in an exception, at least in .NET code it does. But how are you going to convert this into SQL? One could argue, and I agree with that, that the query above is pretty poor code and constructions like that should be discouraged. Though the thing is that just because it's possible, it will cause some people to write code just like that.
Fortunately, the code above is translatable to SQL in general form, if the 'as' keyword is seen as a type conversion, so 'e' is simply seen as a BoardMemberEntity and the fact that a 'null' can occur is ignored as the filter on CompanyCarId will weed out any types without that field in the query anyway (no matter what inheritance mapping strategy is used). The only difference is that if you expect an error, you won't get one. Let's look at a more nastier version of the one above which actually is more correct:
var q = from e in metaData.Employee
let x = e as BoardMemberEntity
where (x!=null) && (x.CompanyCarId == 1)
select x;
Here, across two statements, a type filter is written: first the 'as' conversion in the let statement, and then the test for null on the result. You can generally write this as a handling of type specification equal/not equal null and by adding a special handler for that in the general binary expression handler, you can convert that situation to a type filter!
There's something not really great about the query above though. Linq to Sql has no problems, because it only supports single-table inheritance. If the O/R mapper supports multi-table/view inheritance, like LLBLGen Pro does, you have a bit of a problem with field names. Consider a hierarchy where the root has three branches of subtypes, and in two branches a field Foo is present, though not in the third branch. If you now fetch all entities of the root type, you will end up with multiple times the same field in the projection. The fix for that is field aliasing, namely you give every field an artificial alias, e.g. Fx_n, where x is some index to specify the entity and n is the index of the field in the entity.
The problem begins when 'let' entered the room: a 'let' is effectively a wrapping SELECT statement. (You can rewrite some queries by moving the query to other parts of the query but that's not always possible). The query wraps the original query in its FROM clause as a Derived Table and simply selects from that source, making the projection required for let. Now, imagine you have your fields aliased with artificial aliases inside that source. Your Derived Table fields now have names like F0_1, F3_4 etc., but not 'CompanyCarId'. This means that the filter on CompanyCarId, which is outside the Derived Table, as let wrapped what's before 'let' in the query, will fail if it targets 'CompanyCarId'. Normally it would work, as the JOIN statements which joined every subtype's Table/View would be accessable for the WHERE clause, but in this case that's not possible because the joins are inside the Derived Table.
As elements are processed in different areas of the Linq provider (you're not going to write a big routine which handles every situation), the handler for Where has no clue if the field should be re-aliased, at least not at that point. So I wrote a traverser, a class which crawls over all LLBLGen Pro query api objects and finds the elements sought. Basicly it's a base class which simply visits all object elements and if you in a derived class override a given method you can tap into this process and do what you need to do, e.g. collect Derived Table definitions, fields targeting derived tables etc. I wrote a couple of derived classes which collected information for me this way and with that I could easily fix every field alias and target in the entire query, so CompanyCarId references in filters outside the Derived Table would then be fixed to target the Fx_n field which represented CompanyCarId. The crawlers are also handy to find places where inheritance relations have to be injected into the query, for example because the query is folded into a subquery inside a Derived Table. The classes will be public in the upcoming runtime library of LLBLGen Pro v2.6 so our customers can use them as well for their own fine-grained query voodoo magic
.
Except and Intersect
Except and Intersect are two methods which are actually each other complement: you can implement both with a single construct and just a flag to add 'NOT' to the query fragment. Except and Intersect can be implemented as an EXISTS query construct, similar to the work done for Contains: a lot of code can be re-used for these two methods.
Except and Intersect can have an IEnumerable as argument. Although Linq to Sql doesn't support it (I don't know why) it's perfectly doable, and in Linq to LLBLGen Pro you can pass an in-memory set of elements to Except or Intersect. What's particularly weird with Linq to Sql not supporting it is that the code to support in-memory sets as argument for these two methods is also usable for Contains. Oh well...
.
There's some pitfall to be noticed with these two methods however. Consider this query:
// won't work
var q = from c in metaData.Customers
where someCollection.Except(c.Orders).First().EmployeeId==3
select c;
This query might look like a bit far-fetched, but the general idea is this: it's not possible to use an in-memory collection in a database-targeting query where Except is called on that in-memory collection with a database set as argument. Do you see why? It took me 3 days to realize this, so don't worry if you don't see it right away. The thing is: Except filters the set it is called on. But that set isn't in the database, you've to pass it to the database in the query. With PK fields, that might be possible, but not with complete entities, that's undoable. With Intersect it could be done though, however I haven't implemented that for now, as it's easy to work around it (swap the operands of Intersect
).
Single
Now here's some method I have no idea why it is in the API: Queryable.Single. Let me first quote the specification of the method:
Returns the only element of a sequence, and throws an exception if there is not exactly one element in the sequence.
There's an overload which accepts a predicate and which simply means that Single(predicate) will return the only element in the sequence which matches this predicate. What I find odd is the remark about throwing an exception: why would anyone call such a method? There is a rule about exceptions: "Don't ever use exceptions for control flow in your application". Exceptions aren't expressions for if/else constructs, they're serious business: they mean something was definitely wrong and needs handling to avoid a total crash. In database scenario's it's even weirder: which exception should you expect? And more importantly: what does it mean? If you don't know what the true meaning of an exception is, you'll never be able to handle it.
The example in the MSDN to illustrate Queryable.Single() is pretty bad actually, because it uses exceptions as control flow. Not only that, the example fails to illustrate a valid case where the exception would be something you would want. This is important, because... I can't think of a use case for Queryable.Single() where you would want the exception. After all, exceptions aren't meant for control flow, so I don't expect an exception to drive my code as if an if-statement resolved to false: it's not meant to be a test if a set contains more than one element.
The thing with Single is that it's redundant, at least for database queries. You can also use First(). First() also returns a single element, but it doesn't throw an exception when there are more than one element in the sequence. With database queries, using Single() has the same behavior as First(). Sure, I could add code which flags the resulting QueryExpression object that the result should be checked if it has more than one element and if so, an exception should be thrown, as Linq to Sql does with this query:
// Linq to Sql code
var q = (from c in nw.Customers select c).Single();
However, do I get the same exception that there are more than one element in the sequence with this one:
// Linq to Sql code
var q = from c in nw.Customers
where c.Orders.Single().EmployeeId==3
select c;
No. With this query I get a hardcore severity level 16 error from SqlServer that the query is wrong because the subquery returns more than one element which can't be used with the operator (=) specified. This isn't the fault of Linq to Sql, what else can it do? Call RAISERROR (someone at Sybase still feels ashamed about that typo I bet
) with a Count check? Why? Would that help the caller of the query? I'm not convinced it necessary will help. The 'Single' method is simply not useful in database queries: for the Single() overload, use First(), for the Single(predicate) overload use Where(predicate).First(). Though, as the requirement of the method is that it should throw an (not specified which one) exception, these usable synonym statements aren't completely covering what Single() represents. Though in my opinion, the exception requirement is a big mistake: if you need behavior to be called when a set has more than one element, you should test on that specification and call the behavior if the test succeeds, not by issuing a query which in the end fails with whatever error takes place, and count on that to handle things further.
I'll add support for it, though under protest. The reason I'll add support for it is to be compatible with queries which target other O/R mapper frameworks.
Linq to Sql issue(s)
During development of the Linq to LLBLGen Pro provider I often check what Linq to Sql does with a given query I use for testing and to see if for example my SQL is more efficient or falls flat on its face. Sometimes you run into unexpected things. When I tried an Except or Intersect query, I saw that 'DISTINCT' was emitted into the query. Everyone who knows SQL knows that DISTINCT is a keyword you have to be careful with, not all databases support every type of field with DISTINCT. In SqlServer for example, (n)text and image fields aren't supported in a DISTINCT projection. Not sure if this was a small glitch or that there was logic which would prevent DISTINCT in queries where it's not allowed I tried:
// Linq to Sql, gives crash with NotSupportedException.
var q = (from e in nw.Employees select e).Distinct();
It too emitted DISTINCT into the query, which was caught by its validation checker. However what's worse: Except and Intersect therefore also aren't usable with Linq to Sql and any image/(n)text containing entity type: DISTINCT is always emitted into the query. This query for example gives the same exception, though no DISTINCT was specified in the Linq query:
var q = from e in nw.Employees.Except(
from e in nw.Employees where e.Country == "USA" select e)
select e;
One could argue: "But if DISTINCT isn't possible, how to weed out the duplicates?". Well, you do that on the client in the routine which consumes the datareader and constructs the objects to return. You keep hashtables with hashes calculated from identifying attributes like PK fields and with that you filter out duplicates. At least with entity fetches like this one. Not supporting situations where DISTINCT can't be emitted into a SQL query is a typical error one could make in a v1 O/R mapper, it's only a bit sad for Linq to Sql users that Microsoft is so generous with releasing fixes for their framework on a regular basis.
Writing a Linq provider is a lot about true software engineering, I've written about that several times before: some things are known but the biggest part is unknown territory: what is constructed in which order, when to expect what, is it safe to ignore this or that? Unclear things one can only find out when it is used, i.e. by trial/error approaches. This is actually a bit sad, because it's now easy to overlook mistakes or miss corner cases, as the bigger picture isn't always clear. For example: what does VB.NET emit into the expression tree for string concatenations if C# emits 'Add' operations ?
With the DISTINCT keyword popping up in the Linq to Sql query for Except I was immediately alarmed: why is it there? Linq to Sql never emits DISTINCT when it's not told to do so. You then start thinking about it: is it something which is a left-over from their tree manipulations? Or is it hardcoded set to be emitted? It turns out it is. In QueryConverter.VisitExcept and QueryConverter.VisitIntersect it sets 'IsDistinct' to true. I couldn't think of a reason to have it default to DISTINCT, so I didn't add a requirement for that to our tree for Except and Intersect, also because it uses an EXISTS query, which doesn't care if DISTINCT is there or not.
Both routines are also a clone of eachother. Clones are easily created and often overlooked, however this one is particularly obvious, especially because the behavior of the two routines is closely related, so the implementation of the handlers is then too closely related.
What's next?
Implementing 'Single', probably tomorrow, then on to implementing the database function framework I have in mind: it has to be a framework where developers can add their own function mappings to the provider so they can map their own extension methods to database functions easily. When that's done, Queryable.Convert gets another look as some scenario's require it not to be stripped off but handled instead, though that relies on the function-mapping framework, so it has to wait till then. After that, hierarchical projections and prefetch path support are on the menu (prefetch paths are more a small addition as the core functionality is already in the runtime for quite some time, hierarchical projections require the prefetch path merge code already in the framework to be opened up to the Linq provider) and then I'm done with the Linq part. Finally. Stay tuned.
First of all: welcome. Now, as you all might know, this blog site, http://weblogs.asp.net, has a grouped RSS feed (a couple actually), which is called the 'main feed'. If you place your post in a category which is in the default list of this site, your post will automatically end up on the main feed. This is a nice feature, but as it is used now it kills the site.
At the moment, the feed is flooded with completely useless posts: posts which link to articles from 2 years ago, copied texts from manuals, very poor code copied from other websites etc. etc. More and more people are giving up on this main feed because of this. While what you post on your blog is your business, you have to realize that by placing the post on the main feed, you also affect others on this site. For example, I don't want people to unsubscribe from the main feed and I bet a lot of others here think the same.
So, dear new blogger at weblogs.asp.net, do yourself and us a big favor: write new content. We all know about a couple of search engines, and we all know how to lookup things in MSDN. We don't need random posts without any red line to point us to these articles. Instead, what we don't know (yet) is what your opinion is about topics related to .NET programming, what your experiences are with .NET programming, what your ideas are about how things could get better. Those articles are interesting to read and will attract new readers and existing subscribers to come back.
History learns that this flood of useless posts is temporary, but what's sad is that often people who unsubscribe to a feed don't come back.
More Posts
Next page »