Frans Bouma's blog

Generator.CreateCoolTool();

Syndication

News




    Add to Technorati Favorites

About me

Fun stuff I created

My work

October 2007 - Posts

Developing Linq to LLBLGen Pro, part 8

(This is part of an on-going series of articles, started here)

Today I managed to arrive back at the point I stopped with my current code base a couple of weeks ago to re-implement the expression tree reduction code. I'm not totally done with re-connecting the wires of the outer interface code to the inner execution engine, but that's a minor detail. A lot of work is still ahead of me, but I've more confidence in finishing it with a satisfying framework than I have had before during this project so far.

To illustrate what expression tree reduction looks like, I'll show you some screenshots from my expression tree visualizer, which is a reworked and enhanced form based on the expression tree visualizer in the C# examples shipped with VS.NET 2008. I've added to this visualizer a pseudo-SQL engine from our own debugger visualizers shipped with LLBLGen Pro so it's easier to see what the relation objects and predicate objects result in. (I use a custom Windows XP theme, so the screen might look a little unorthodox, so no, it's not linux )

The query which is shown as expression trees is the following one, which is pretty bogus but is solely used to check roughly if the code I had in mind even works out the way I thought it would. As you can see from the screenshots, it ends up in a nice QueryExpression, which is the last Expression based class a tree will end up in in my code so I can hand over this object with all its internal data to the execution engine which will simply grab the data inside it and pass it on to the LLBLGen Pro runtime library.

// C#
string city = "Munchen";
var q = from c in metaData.Customer
	join o in metaData.Order on c.CustomerId equals o.CustomerId
	join od in metaData.OrderDetail on o.OrderId equals od.OrderId
	where c.City == city && c.Country == "Germany"
	&& o.EmployeeId == 2 && (od.Quantity * od.UnitPrice) > 500
	select (o.EmployeeId > 10);

My current code can recognize predicates, field expressions, joins (not group joins yet), can evaluate local variables into values used into predicates, projections of single fields and entity projections. This is all embedded into the query above. The screenshots are below. I've made thumbnails of the images so the post stays rather small in size. Click a thumbnail to enlarge the image.

State 0, start state
State 1, Aliases have been assigned to sources, joins have been given their own expressions
State 2, Entities and entity fields have been recognized, as well as constants referring to variables
State 3, LLBLGen Pro field expressions have been recognized
State 4, Predicates and predicate expressions have been recognized
State 5, Joins and selects have been evaluated and combined
State 6, The final tree has been reduced by combining select expressions

There will follow more states and likely this won't be the final order of events. However you can see clearly how the tree gets reduced by recognizing subtrees and replacing these subtrees with a single expression object or other subtree which is easier to handle later on.

Also, for kicks, please pay attention to the expression in text form in the start state's screenshot. 10 to 1 you won't recognize the original query in that.

Posted Tuesday, October 30, 2007 6:58 PM by FransBouma | 7 comment(s)

Developing Linq to LLBLGen Pro, part 7

(This is part of an on-going series of articles, started here)

Last time I talked about the switch to the approach where most Queryable extension methods should be seen as sets on their own. What happened after that? Well, initially, I continued on the path I had taken a few weeks ago: a stack based term evaluator which was used by routines which handled the bigger terms. I created a couple of classes which were usable as a set-tree, but a single-path tree: a Where would modify the top of the tree, a Select would wrap the top of the tree with a new Set, using the original top of the tree as its source.

This approach went fairly well, I could write several unit tests which could fetch entity data, with or without filters on fields in these entities and create single-field projections, using LLBLGen Pro's own projection framework, though it all stayed a little simple. Doing the simple things first isn't a bad approach per se, after all, the complete scope of expression trees and what to convert into what is pretty complex and to grasp it in full takes a lot of time. Software Engineering, as its name implies, is a form of engineering so not all variables are known: you have to do experiments, work with the elements you do know and work from there towards a goal, solving the problems along the way. Have to create a bridge across a big stream but you don't know the exact strength of iron? Do experiments to find out. In Software Engineering things might look totally different but actually they're not.

With implementing a full Linq provider it's the same thing: there are many, many fine details inside the tree which can either be ignored or be very important to you, depending on the goal you want to reach, thus which provider you want to create. This means that if you write a provider which is actually a translator between two query languages, like I'm doing, you have a different set of problems than when you write a provider which has to generate SQL.

I hit a wall, when I wanted to add Joins to the code I had. The problem was that I tried to convert way too many elements at once into new objects. I also made the mistake that the single path tree wasn't really helpful because the Join expressions could contain select, where etc. so in a way it was a multi-branched tree. However, if you try to combine a lot of elements at once, this problem doesn't exist, after all handling the Join would mean handling the inner Select and Where, if they were present. However, to do that in practise, you have to do so many things at once, it's too complex.

During my engineering efforts on this code, I learned a lot about the Expression trees and how not to approach them, as you can read in these articles. However, I think these mistakes are unavoidable simply because it's unknown territory: what do handle first? How many elements in the Expression tree should you convert into new elements in one pass? Questions which are only answerable if you have seen the different approaches fail and succeed. This is in general true for many software engineering tasks. Often we're faced with a problem which is far too complex to understand in full from the beginning, we have to take smaller steps and solve the puzzle by solving smaller puzzles instead. However, the big question: which smaller steps to take and what's the best size per step is often a question which looks easy but is actually the hardest and also the most important part of the whole project.

To be able to handle the multi-branched tree and also to avoid converting way too much elements at once, I decided to modify the Expression tree in-place, exactly how Matt Warren explained in his series about writing a Linq provider. Matt uses a single pass routine (2 passes actually), and in the end it might be that a single pass or 2 passes is all that it takes, but when you're writing the thing and taking one step at a time, it's best to create a lot more passes, so a lot more handlers, which all change one thing in the tree, so instead of 1 big step in 1 pass, do a lot of smaller steps in a lot more passes.

Finding the right order
The key point is to find the right order. In general this isn't that hard, but if you have the wrong order, or if you're trying to handle too many nodes at once, you probably will run into problems later on, for example if you have already recognized entity field member access points, and converted them into field expressions, but you didn't assign any aliases yet to the tree sources. It's then hard(er) to find back to which alias the containing entity of the field accessed belongs to.

Rule of thumb is that in pass n+1, all nodes required for handling the nodes in that pass have to be there. For example, if you are going to replace BinaryExpression nodes containing a boolean operator (equal, lesser than etc.) with predicate expressions, it's best if you have already found entity field access nodes, recognized them and converted them into new expression nodes. This way, you can easier form predicates when handling BinaryExpression nodes because you can handle the left and right side and check what the handle action of these two sides resulted in, e.g. a predicate expression, a field expression etc. Though it's not tic-tac-toe: an expression like 'field < 10' can also be inside a projection or field expression inside a predicate. So it's best to do this in 2 passes: first the field expressions, then the predicates. This is an example how taking two passes (which are combinable perhaps later on) to solve the problem is not a hard puzzle: however to solve it directly in 1 pass might be a big challenge.

At the moment, my general order is to first traverse the tree and convert any Join method call expressions into JoinExpression objects, and assign aliases to the left and right side, plus to build some lookup tables in an Alias storage with all kinds of key entry points, so the rest of the tree handling can always find back the alias source to refer to based on the info known at that point. LLBLGen Pro only needs aliases if there are multiple times the same entity joined into a join set, so only when a join is present (or access to members which represent related sets, but more on that in a later article), aliases are necessary. Ironically, the last pass to process is the pass which actually handles JoinExpressions and converts them into relation objects for our API (together with the flattening of SelectExpression nodes, which are nodes which result in a set, with or without a projection, e.g. Where method calls, Select method calls etc.)

The reason why JoinExpressions are handled last is that by handling them the whole query is actually handled but doing it earlier, e.g. placing the handling of JoinExpressions earlier in the pass-stack, will run into a problem where the join is between an entity set and for example the result of a full other query, with a projection, where and for example also a join. It can be the JoinExpression handler isn't the last pass in the final result, however for the queries currently understood by the handler system, it is. Because everything is chopped up into smaller handlers, it's not a big deal to place a handler somewhere else in the stack, as long as the dependencies allow it.

To cut the process up into tiny fragments, you at first do a bit more work: you'll write a lot of handler classes which sometimes don't process much data. However, that's ok, you can always combine passes later on. To have the smaller, single item focussed passes, you can schedule them easily and if you start handling more complex queries, you can add more passes between other passes without having to worry you have to refactor a lot.

The last good point is also the best: it's provable code. If you have ran pass n, you know for sure the tree consumed in pass n+1 won't run into nodes handled in pass n and will see 0 or more nodes produced in pass n. This does include dependencies, but I think that's unavoidable. It's like a state machine: at state S, you run pass S, and that results in state S+1, where you run pass S+1. The state machine is pretty linear, as there aren't any events occuring inside the pass routines which could make the state machine go into another branch, but theoretically it could. For example, you could have a handler which simply checks the tree and if it finds a given situation, it might decide it's OK to run pass P first, and then pass S+1. Due to the nature of the small steps taken, it's flexible in doing that.

There are a couple of things I don't know how to handle them yet (I haven't looked at all extension methods yet like GroupJoin etc.), which are the Convert and the Quote expressions. For now I strip them out when necessary (and if it's even possible, sometimes you need to leave them in the tree to keep type equality), though I couldn't find a lot of information what these fellows do (Convert seem to be there to overcome the type problems in a BinaryExpression between a nullable type and a value typed constant) or when they've to stay. But time will tell, I guess.

Posted Sunday, October 28, 2007 1:46 PM by FransBouma | with no comments

Developing Linq to LLBLGen Pro, part 6

(This is part of an on-going series of articles, started here)

I switched to 'part' posts instead of 'day' posts, as I realized the initial plan (post every day) isn't that useful in this case ("Today I stared at 20 lines of code for 3 hours before I realized the ideas I had yesterday didn't work as I thought they would, more on this tomorrow") as the amount of work done per day is still rather limited. The main reason is that if you're in the dark somewhat, no matter what you can pull off the internet in the form of information in this case, you will make a lot of mistakes, wrong assumptions and faulty algorithms. More than often I thought for hours straight about how to solve a given problem, then found a solution, documented it, wrote some code, had to quit because the day was over and the next day I realized I had it all backwards and had to start over. This isn't that bad, no-one gets it right the first time, though if it drags on it can become frustrating, but that's with everything where a clue how to solve it is so close you can almost smell it but yet so far away you can't see it.

So I was on the wrong track. Well, not entirely, I was actually more busy trying to convince myself I could avoid the track I had to go by thinking I could outsmart it with more clever code and algorithms. In these cases, it's better to simply take the famous step back and realize that despite the fact that something might look nasty and should be avoided, it doesn't mean the road you're desperately trying to find is less awful.

In LLBLGen Pro I once made such a mistake, I thought I wouldn't make that mistake again, but apparently I was wrong. What did I do? Well, I tried to outsmart the algorithm of Topological Sorting in a Directed Acyclic Graph (DAG). Yes, I was that stupid. Let me try to talk my way out of this.

In an O/R mapper, there are many things you have to do, besides fetching entities. One of these things is saving piles of work done on entity objects in memory. (To read more about the difference between entity classes, entity instances, entity class instances and friends my essay about entity models). This is typically done by saving a set of graphs of entities. Say for example you fetched an existing set of Customer entities, added a few new Order entities to some of these Customers and to these new Order entities you added new Order Item entities as well. You also added some new Order Item entities to existing Order entities and you altered some Customer entities as well. So all in all a graph of objects (customers, containing orders, containing order items) which has to be processed, in a given order (but which one?) and with different operations (insert, update).

The question for the O/R mapper then becomes: in which order do I have to save all those entities? This is a typical DAG related problem: the graph is a DAG (cycles in relational models which are reflected in entity models is a Bad Thing and typically have to be redesigned) and by using Topological Sorting of the DAG, you get the order of the work handed to you on a silver platter. With Compliments even. Algorithms like Topological Sorting are an example why Graph Theory is so incredibly important in Computer Science and thus for your daily software engineering work as well.

I had one extra problem to solve: LLBLGen Pro doesn't use a central meta-data store. So I couldn't use the DAG constructed from the mapping information to sort the object graph, I had to sort the object graph by detecting the mapping graph on the fly. In general, this isn't a problem, though if you don't realize the problem you're dealing with is more complex than you might have thought it was, it will make you opt for easy solutions which will fail in corner cases (which have the Murphy-gene: they always pop up at most worst possible moment). My algorithm didn't mimic a graph sorter, as that appeared to me as being too slow, as I didn't want to visit every graph node first, and then again every node which had to be handled for save processing, that would mean visiting a node more than once! A small voice inside me said: "If you can come up with an algorithm which can visit a graph node just once, it's faster!". So I wrote an algorithm which did these two combined: it visited every node in the graph and on the fly could decide what to do.

It worked at first, then the corner cases appeared on the horizon, so I adjusted the algorithm to cover these as well, until I realized I wasn't doing what I should be doing: analyze the algorithm needed and look up how that is covered in what Computer Science has produced in the last couple of decades. So I analyzed the algorithm necessary, picked my "Algorithms in C" by Sedgewick from the shelve (old book, great resource) and peeked into it, found the algorithm in a couple of minutes, and I spend an afternoon implementing it and it never had a single issue since, simply because it's a proven algorithm, it should work regardless what you throw at it, corner case or not.

"I know it's friday afternoon, buddy, but could you hurry up a little and get to the point you're trying to make here, Bouma?". Sure thing. . In the previous post in this series, I was talking about special case programming and that I was running into situations were special cases could be identified and therefore the code would become pretty awful (and I remembered my encounter with my Special DAG Sorting attempt ). Matt Warren replied to that post that I should look at the methods like separate sets. Exactly what I wanted to avoid as it would lead to tree manipulation after the interpretation of the Expression tree, so in theory a different place of where the complexity would be located, not something to solve the complexity. At least that's what I thought.

But, Matt is a smart guy and I realized that right on time. So after that previous post I went back to my design and replaced my approach with a 'set per Queryable extension method' approach. In short it comes down to the fact that if you have a Queryable extension method like Where or Select, these work on an input set, and their result is again a set. Some methods simply work on the same set and the result of them is the same set (so no projection) and other methods result in a new set (so with a projection).

So I defined a class, SetDefinition, and with that I now build a tree: every extension method is handled in its own routine and per handler I decide if I have to manipulate the set (no projection) or wrap it with a new set and use the current one as its source (projection). This solved a big problem: it gives me the oppertunity to focus on parts of the puzzle and worry on the rest of them later on, in an easier fashion. I still use the stack based expression parser, as it gives good results and should be OK, though it's just a matter of taste. You can also produce subtrees of own nodes as the result of an expression handler, which is then later on reduced to something usable (which is actually the mechanism behind LR(n) parsing anyway: reduce something you've seen to something you know and while doing that handle the reduced elements).

Of course, there are a couple of problems still ahead. One of them is the reduction of the tree of SetDefinitions to a SetDefinition which contains the elements for the query or hierarchical queries to execute. At the moment I can run Linq queries which perform simple entity fetches with where clauses and single list projections. LLBLGen Pro contains a flexible projection framework already to project resultsets onto whatever you want, so I'm using that to handle the different projection types. One which will be a challenge is the hierarchical fetch of nested projections onto anonymous types, though as LLBLGen Pro already contains a generic merging system of sets based on a relationship (which contains one or more predicates) this shouldn't be too difficult to do.

It might sound a little strange to distinguish the different projections but fetching entities is something else than fetching a set of data which is projected onto a set of class instances, at least for an O/R mapper it is: entities are filled with data using different routes to avoid getting public properties being triggering field-oriented validation logic when an entity is filled; if the O/R mapper is a POCO system, subclassing takes place (not always) and for example if the entity has to participate in a uniquing space, the cache or uniquing context has to be consulted.. etc. etc.

So in short, the system is currently setup as: parse expression nodes with dedicated handlers. Per extension method a clean stack is used and the method parameters are parsed. The expression node handlers place their result on the stack so the caller can handle it further. At the end, the extension method handler pulls the stack elements and builds from that a new SetDefinition, or manipulates the current one. After that, the SetDefinition tree is reduced till nothing is reducable anymore, which in my setup is the state where there's just 1 SetDefinition left. This is the case because LLBLGen Pro can fetch hierarchies in 1 method call, so all parameters for that hierarchical fetch, if necessary, as well as derived tables, scalar queries etc. are elements of the SetDefinition's inner elements (predicate expressions for subqueries, ScalarQueryExpressions for field elements in projections etc.). The SetDefinition is then handled further by the QueryProvider which means it's simply converted to a fetch on the LLBLGen Pro api and the results are returned.

In the end, there's just hard work left: typing in all the code for the fine details, for handling all extension methods, all aggregates, all extension methods of types known to the database, like String etc. However that's the 'easy' part (it's still hard and complex though), as it comes down to writing out what you've already thought out: the design of that code is just following the global design of the overall framework. That's also why it's so important to get the global design of the framework right, hence my struggle.

My next stop is, after fixing a little bug in the tree reducer, adding support for Joins and Aliases. Joins introduce aliases for the rest of the elements. So do things like where e.Manager.Manager.Manager.Firstname=="John", which introduces joins which could (so you have to take it into account) be joins to self, which always require an alias. As Joins introduce anonymous types under the covers, it will be a bit of a challenge, but I think with the approach I have now it will be manageable.

Thanks Matt for making me realize I was on the wrong track. Next time we meet IRL, your beer is on me.

Posted Friday, October 12, 2007 1:29 PM by FransBouma | 3 comment(s)

Alternative Rock

Ever heard the term 'Alternative rock' ? It's a term for rock music which isn't mainstream. Or something. Anyway, read the wikipedia page for the fine print. I'm a metal fan (despite the pile of trance house music I've created in a dark past ) and once in a while I listen to alternative rock to ease the eardrums a little. Every time I do so, I get the same thought: Why is it called alternative?. I mean: isn't it just rock music like all that other rock music?

The reason is that 'alternative' suggests something like 'different from mainstream'. Duh... Beyonce's mainstream pop music is different from Foo Fighters'. So, and now we're arriving at why this is about software and not about music, a movement with the name 'Alt.NET', does that suggest the movement is about 'different from mainstream' too? And if so, what is that mainstream?

In the past months several people, which accidently all happen to be blogging on Codebetter.com, have tried to explain what they personally found the moniker 'Alt.net' stood for. I, as an outsider who wanted to learn what Alt.net stood for and what it would mean to fellow software engineers like me and the .NET community as a whole, did receive different messages from these people: some said it stood for ABC, others said it stood for DEF, and again others said it was XYZ.

The message became blurred, it became unclear what it stood for. Well, one message became clear: if you critisized the people who called themselves 'Alt.NET' about this unclear message, you were in for a lot of flak. I find that a little strange. If you want to change something, you have to have a clear message so everyone out there has a clear understanding what you stand for so potential followers (and that includes myself) could say "yeah, I think they have a point" and join your movement. If you, as a movement, get a response that your message might be a bit unclear, you then should do one thing only: get a clear message. There's no other thing possibly more important. That is, if you want your movement to succeed.

Why?, you wonder? Well, what's the point of having a movement with a mission if no-one really gets the mission statement? Selling an idea is similar to selling a product, mind you: if the potential 'buyer' doesn't get the advantage of what you're selling, why would the potential buyer buy it?

I said a couple of times in the past few months, in blog comments, that I didn't get the Alt.NET movement's mission statement, because, as explained above, it was unclear to me. I got as replies a couple of times that I didn't get it because I pretended that I didn't get it because it would hurt my company. How would that be, if all I ask for is a clear mission statement for the movement so I can decide to jump in or not? Or did I miss something vital about Alt.NET, like they are planning to come up with a free O/R mapper system which could do everything LLBLGen Pro could do? I didn't understand that from the communication put out by the Alt.NET fellows. I also got as replies that of course I didn't get it because I hated Agile and TDD. Err... what? So Alt.NET is about the Agile manifesto and Alt.NET is a church, preaching TDD? All I understood was that Agile and TDD was a common theme among followers of Alt.NET.

In short: I didn't get a good feeling about this movement. Not because of what Alt.NET's mission statement said it stood for, but because of what people around the movement thought it stood for and their friendlyness towards people who apparently weren't in 'the movement'. Apparently, if you didn't find Alt.NET a great thing (I really wished I could say that, but I don't know what it stands for and one shouldn't support something one doesn't understand what it stands for, do you?), you are a person who decided it's wrong, odd, stupid or other negative remark and therefore you should be critisized. Be it through indirect namecalling, low-level insults or other lameness.

I perfectly understand that the people who started Alt.NET, at least a couple, aren't that way and are always open for debates and exchange of thoughts, ideas, simply because debating with arguments gives better ideas and insights, no matter what the other's opinion is. I just find it a bit sad that around the Alt.NET movement, a couple of people find it necessary to behave like a fanatic who sympathizes with a given idea or product. You know, the guys with the "You're either With us or Against us" attitude.

Sam Gentile found it something not worth to be associated with anymore, and I can't blame him, reading his post and adding my own experiences. His post has started a good debate around the blogs and I'd like to link to a couple of them, simply because they discuss some things I mentioned above much better than I do.

First, Jdn's blog entry: Bad Analogy Time: Alt.NET Ex-drinkers.

And as second, Colin Ramsay's blog entry: Abandon ALT.NET. I'd like to quote the 'nail-on-the-head' remark from Colin below:

If they really wanted to change things then they should be writing about their techniques in detail, coming up with introductory guides to DDD, TDD, mocking, creating screencasts, or giving talks at mainstream conferences, or producing tools to make the level of entry to these technologies lower than it is.

Simply brilliantly worded and IMHO a great starting point for the Alt.NET guys (gals?) for their mission statement, at least if they're interested in reaching millions of people of course. Oh, you thought that by having a blog, a place where you can put your personal ramblings, has any relevance in the .NET community? Sorry, but that's not true. There are millions and millions of .NET developers out there. Look around at local user groups and ask who uses unit-testing, who uses an o/r mapper, who uses a code generator, who even understands what inheritance is and what polymorphism can get you? You'll likely stare into blank faces who ask you in return what the heck you're talking about. That is the state of our union, fellow blog readers. If you want to change that, you have to bring a clear, easy to grasp message and it will likely take years before you have any real significant result.

Too many discussions in the blog-o-sphere are never reaching the subjects of these discussions: the average developer who doesn't read blogs and just knows what MS tells them. I think it's a good idea to educate these people of other ideas, and let's start with just ideas. However you only achieve that by having a clear message. Not a cheer of joy that you're 'Alt.NET'. Woohoo...

Posted Monday, October 08, 2007 11:31 AM by FransBouma | 13 comment(s)

More on the .NET sourcecode and its 'Reference License'

Today I read an interesting post by Arne Vajhøj in the C# newsgroup. He brought up the point that in the Java research license, which was the license the Java sourcecode was released under before it was released under the GPL, a clause was added to prevent that the reader of the code was 'brandmarked' or 'tainted':

Before Java went GPL there was a Java Research License which allows it.
http://java.net/jrl.csp

Quote:
***
B. Residual Rights. If You examine the Technology after accepting this License and remember anything about it later, You are not "tainted" in a way that would prevent You from creating or contributing to an independent implementation, but this License grants You no rights to Sun's copyrights or patents for use in such an implementation.
***

And FAQ answer from same link:

***
18. Does the JRL prevent me from being able to create an independent open source implementation of the licensed technology?

The JRL is not a tainting license and includes an express 'residual knowledge' clause which says you're not contaminated by things you happen to remember after examining the licensed technology. The JRL allows you to use the source code for the purpose of JRL-related activities but does not prohibit you from working on an independent implementation of the technology afterwards. Obviously, if your intention is to create an ÒindependentÓ implementation of the technology then it is inappropriate to actively study JRL source while working on such an implementation. It is appropriate, however, to allow some decent interval of time (e.g. two weeks) to elapse between working on a project that involves looking at some JRL source code and working on a project that involves creating an independent implementation of the same technology.
***

MS could do something similar.

IMHO, an excellent idea.

Addition: I just remembered something. Microsoft has the policy that a developer of Microsoft (unless excluded from this policy) isn't allowed to look at GPL-ed software. Now, isn't the reason for that policy exactly the same as what I mentioned yesterday in my 'Don't look at the source' post? So, if I have no clue about IP, apparently neither does Microsoft.

Posted Friday, October 05, 2007 11:13 AM by FransBouma | 7 comment(s)

Don't look at the sourcecode of .NET licensed under the 'Reference license'

Update: If you think I should be shouting 'awesome' and similar words like most of the .NET community members, please take a walk down the path of 'licenses', something you all should be familiar with in every cell in your body, but by the look of all the different posts about this source release I can only conclude: hardly anyone has any clue whatsoever what licensing, copyright, software patents and related material really mean to a software developer. You didn't really think that by copying a class from the internet you owned the code, did you?

If you do framework development, take my advice and don't look at the .NET 3.5 BCL sourcecode. Using reflector isn't the same, you're then looking at IL being reversed engineered. Looking at sourcecode is different: it has comments, it has other layout, it has real variable names etc. etc. Also, use reflector only when you really have to. Using it is technically breaking the EULA.

Why you shouldn't look at that sourcecode
The reason is simple: software patents. People in the EU, where software patents are, fortunately, still not valid, should still realize that in other countries they do exist, and if you're writing software which could be sold in the US, don't make the mistake your code is liable to this. The jurisprudence on 'reverse engineering' is based on the fact that the people who are allowed to reverse engineer code have never layed eyes on the real code. As soon as they do, they can't reverse engineer the code anymore to their own benefit (whatever that may be, even rewriting code because it's internal in the BCL) because their case would fall outside the jurisprudence: it can be assumed they might have just copied the code instead of reverse engineered it.

Do realize that you're not allowed to rebuild that code, you're not allowed to modify it, copy it etc. You're only allowed to look at it, but also because you're not allowed to copy it, you have to forget what you saw, because if you don't forget it, and borrow the ideas in that code, you likely will step on some patent.

Take for example the new ReaderWriterLockSlim class introduced in .NET 3.5. It's in the System.Threading namespace which will be released in the pack of sourcecode-you-can-look-at. This class is a replacement for the flawed ReaderWriterLock in the current versions of .NET. This new lock is based on a patent, which (I'm told) is developed by Jeffrey Richter and sold to MS. This new class has its weaknesses as well (nothing is perfect). If you want to bend this class to meet your particular locking needs by writing a new one based on the ideas in that class' sourcecode, you're liable for a lawsuit as your code is a derivative work based on a patented class which is available in sourcecode form.

Software patents are evil and the release of this sourcecode doesn't help one bit, on the contrary. If you take your profession seriously, if you are interested in learning ideas about software engineering, instead of looking at that code, read scientific papers and learn from these. Dull crap which is out of touch with reality? Yeah right (just a few examples to get your appetite fired up ).

It's a red herring
Yes, I'm negative about this move. The main reason is that doesn't solve real problems at all. Take for example the case where you detect a bug in the BCL. You plow through the source-you-can't-touch and you'll discover the place where the bug originates and see how to fix it. You can't do a thing about it. You can't fix it yourself because you can't rebuild it. You can only report it back to MS and wait for a fix. Well, dear reader, good luck with that: even if Microsoft is willing to honor your request and patch it in a short time interval (read: month or so), you'll be handed a hotfix you can't distribute to your customers. They individually have to call PSS to get the fix, or you have to hope MS will release it publically which they've done with some fixes but hardly all of them.

Sure, be happy with your shiny sourcecode-you-can't-touch, but it's not something you would want: if Microsoft would have done this for the users of the code, namely the developers, they would have made the license less restrictive, at least that you could re-compile the code to include your own bugfixes till Microsoft would release them themselves.

Posted Thursday, October 04, 2007 11:18 AM by FransBouma | 50 comment(s)

Deferred execution in Linq pitfall(s)

Say you have this query in Linq to Sql

// C#
int id = 10254;
var q = from o in nw.Orders
        where o.OrderID = id
        select o;

// some other code
id++;

foreach(var o in q)
{
// process o.
}

What order is fetched: 10254 or 10255? That's right, 10255! The 'id' used in the query is added as a member access node to the expression tree. As Linq expression trees are converted to Sql when they're executed, it means that at that time, the value of id is evaluated and used inside the query.

So I then wondered, what if 'id' is a local variable in a method which creates the query while the query is executed outside the method? That would mean the query holds a reference to a member which doesn't exist anymore in theory, as 'id' is allocated in the stackframe of the method.

// C#
IQueryable q = Foo();
foreach(var o in q)
{
// process o
}

//...
private IQueryable Foo()
{
    NorthwindDataContext nw = new NorthwindDataContext();
    int id = 10254;
    var q = from o in nw.Orders
             where o.OrderID = id
            select o;

    return q;    
}

As 'id' is inaccessable from outside the method Foo, I can't increase it. However the query does result in a normal query, where the order 10254 is fetched. To me, this is a little odd, as 'id' isn't known at the time of execution of q, as Foo's stackframe has been cleaned up. We've seen in our previous test that 'id's value isn't inserted into the query, but the reference to the variable. This either is a low-level CLR trick, or something fishy with stackframes which are already cleaned up. Am I overlooking something here or is this indeed strange? (the expression tree refers to a weird type (c__displayClass0) )....

Posted Wednesday, October 03, 2007 1:59 PM by FransBouma | 12 comment(s)

Developing Linq to LLBLGen Pro, Day 5

(This is part of an on-going series of articles, started here)

Consuming Expression trees, back to Special Case programming?
I'll show you 5 different queries and what their expression tree looks like in text (tested on Linq to Sql, so you'll see Table references)

Query A:

// C#
var q = from c in nw.Customers
            select new
            {
                Name = c.ContactName,
                Orders = from o in nw.Orders
                         where o.CustomerID == c.CustomerID
                         select o
            };
gives:
// Expression tree .ToString() output
Table(Customer).Select(c => new <>f__AnonymousType0`2(Name = c.ContactName, 
Orders = value(WindowsFormsApplication1.Program+<>c__DisplayClass0).nw.Orders.Where(
    o => (o.CustomerID = c.CustomerID))))

In case you wonder... the WindowsFormsApplication1.Program namespace is the test app namespace I used to run the query with. As you can see, a bug pops up: instead of referring to Table(Order), it refers via the application to the context.Orders for the Where clause. This will be a pain to correct, I'm sure. Also there's just 1 Select call, while there are two in the query. This will be a pain too, but more on that below.

Query B:

// C#
var q = from c in nw.Customers
        let x = c.Region ?? "Undef"
        where (c.Country != "Germany")
            && x != "Undef"
        select new { CompName = c.CompanyName };
gives:
// Expression tree .ToString() output
Table(Customer).Select(c => new <>f__AnonymousType0`2(c = c, x = (c.Region ?? "Undef"))).Where(
    <>h__TransparentIdentifier0 => ((<>h__TransparentIdentifier0.c.Country != "Germany") && 
    (<>h__TransparentIdentifier0.x != "Undef"))).Select(<>h__TransparentIdentifier0 => new 
    <>f__AnonymousType1`1(CompName = <>h__TransparentIdentifier0.c.CompanyName))
Yeah, I also wondered about the transparent identifier, but it's the reference to a derived table-like sub result if you look closely. See also the C# 3.0 spec, where this is explained in detail.

Query C:

// C#
var q = from c in nw.Customers
        where c.Country != "USA"
        select new { CompName = c.CompanyName };
gives:
// Expression tree .ToString() output
Table(Customer).Where(c => (c.Country != "USA")).Select(c => 
    new <>f__AnonymousType0`1(CompName = c.CompanyName))

Query D:

// C#
var q = from c in nw.Customers
        select c;
gives:
// Expression tree .ToString() output
Table(Customer).Select(c => c) 

Query E:

// C#
var q = from c in nw.Customers
        where c.Country=="USA"
        select c;
gives:
// Expression tree .ToString() output
Table(Customer).Where(c => (c.Country = "USA"))

Confused? So am I. Yesterday I spend some time on writing code for the execution of queries, setting up the proper LLBLGen Pro elements such as live transactions and the like so they're usable with the query objects so deferred execution still would take place inside a transaction if required (so SqlServer won't deadlock on you), and I also spend some time analyzing what to do with the Expression tree nodes. Yesterday, it occured to me that it was something like LR(1) parsing: you handle an element with the lookahead info and place the results on the stack for the caller to process further. If the caller sees it has enough info to reduce the elements on the stack, it reduces all those elements and handles it further, placing the result of the reduction onto the stack again.

However, to make that work, the input has to be deterministic, i.e.: you shouldn't wonder in state N in what scope you're in, started in state N-M. Now take a look at queries D and E. D has a Select call, E doesn't. Why on earth doesn't E have a select call? This is undeterministic. What it should have been is:

// What should have been E's correct structure 
Table(Customer).Where(c => (c.Country = "USA")).Select(c=>c)

The expression tree created by the compiler is wrong, because the expression tree parser in the Linq provider has to decide what to leave out, however the compiler decides that for me now. The sad effect of this is that I now have to add code which has to guess what the intention was: when no Select call is present, apparently one still wants to do a select, with the native projection.

This is doable, although it's a form of 'special case programming': as a provider developer you now have to anticipate on a special case: namely the one where there's no Select call in the expression tree despite the fact that there is a select statement in the query (there always is). With the occurance of a special case, one has to wonder: what kind of other special cases are there, which are unknown to me now? And also: if the compiler tries to be clever, why isn't it more clever so parsing the nodes is easier and straight forward?

The main pain with writing the Linq provider isn't the code, it's the unknown voodoo which takes place inside the compiler which makes it emit calls to Queryable(Of T) extension methods which build apparently odd expression trees at runtime. To be able to write a proper Linq provider, you have to understand the intention of the query. There's no other thing to look for. You're handed an expression tree. So by parsing that tree, you try to re-build the query the developer wrote in C# (of VB.NET) which lead to this expression tree. If you solely look at elements inside the tree, you'll run into problems with things which have effect on various elements of the tree, like a join into with DefaultIfEmpty() calls to signal a left join (yes, Linq's join syntax is that cumbersome). If you know the intention, you can produce the proper query code on all databases you support via your provider and also you can optimize the query when necessary.

It's now unclear what lead to which expression tree subtree, so in other words: it's impossible to deterministically determine what the meaning is of a subtree in the expression tree, at least at this point where documentation about why the expression tree looks like the way it does is non-existent.

It's not hard to write code which can handle trees like the ones resulting from queries C, D and E. The main issues arise when someone introduces a line with 'let' like in query B or a second select (projection) in the main projection as in query A.

Take a closer look at the differences between query A and B. You'll notice that although query A has two select statements in the query, it has just one Select method call in the expression tree, while in query B, it has two Select method calls while there's just 1 select statement in the query.

Query B, with its two Select method calls gives another problem: what to do with the results? Does it have to be projected onto new types, or onto entities? Projecting onto entities is different, as shorter paths exist inside the O/R mapper core, simply because one doesn't set entity values through properties when you're fetching (to avoid triggering validation logic which shouldn't trigger at fetch time for example). However, with two (or potentially more) Select method calls exist, which one decides what is done with the end-result? Again, special case programming and not deterministic code: to handle a Select method call, one has to know the scope it is in: it can't decide based on its own parameters and the lookahead what to do: which one is the end-projection? The last one? But do you know that when you run into the first one? No, so you have to seek through the whole tree for another select. If that's not found, you know you're the only one. So doing things based on what you see in node X isn't going to work.

In LR(n) parsers, the stack top is important too. So this could help in the Select method handler to determine if it should decide the end projection or if it should reduce to a derived table. But this could lead to a shift/reduce conflict , which one normally solves by either extending the lookahead or changing the grammar (no can do) or by adding a special case (not recommended).

I'm sorry if I sound a litle bitter, but this whole Linq expression tree parsing system is overly complicated while this isn't necessary. I wrote my own LR(n) parser engine and action/goto generators, I know what parsing trees means, but with Linq expression trees it's like with a box of chocolates, isn't it, Forest...

As I said earlier, what you want with code like this is to be able to proof that the code is correct. As there are 'special cases' popping up left and right, being able to proof something is correct is impossible, unless you have deep knowledge of the inner workings and know why the tree looks like the way it does. Without that knowledge, I can't optimize query A. Query A can be done with 2 queries in generic code (or one, if you cheat, but that leads to special code for single path graphs). However to determine this situation is seen inside the query is hard. Sure, you can look at the expression tree of query A and take a guess that if you see something like that it will be optimizable.

However, Linq to Sql (and also the entity framework) will execute a myriad of queries with query A (number of customers + 1). I'll try to find a way to optimize it, but that depends on whether I can recognize the subquery/subqueries in a normal fashion. This is harder than you might think, as the subquery might or might not have a Select method call (and Microsoft, with knowledge of the inner workings of the compiler didn't know how to optimize it)

Yes, I do wonder if it wouldn't be simpler if I just got the linq query as text and did the parsing myself. Anyway, I'll start simple, with queries D, E and then C. I've the whole mechanism in place now so I can now add dispatch routines to my Expression node dispatcher for the various expression node types within the range of BinaryExpression for example, so I'll move forward with my LR(1) (so 1 lookahead) lookalike engine, which is hopefully enough to handle the cases it has to fight with. It's likely I need tree traversal code for cheating in situations where the 1-lookahead isn't enough. This is cumbersome though and makes things fragile.

Posted Wednesday, October 03, 2007 1:27 PM by FransBouma | 8 comment(s)

More Posts