Sam Gentile, among others, pointed us to two articles at Microsoft about the speed of managed code: how to optimize your code, which tools can help you, and low-level insight into what goes on in the CLR when it comes to data storage and the GC. The two articles are 'Know what things cost' and a primer on how to write fast managed code. I first read the 'Know what things cost' article. It is well written and contains good info if you are interested in that sort of detail. However, it's not that useful for optimizing your managed code, if you ask me; I'll explain that in a second. The other article struck me as very useful, because it describes problems and gives you hints to think about.
Why is the in-depth article not that useful while the other is? Because the primer is focused on algorithms, on code at the abstraction level of a 3GL like C#, while the other article is focused on how much an object instantiation costs, how much an int multiplication costs and other low-level things. I find that low-level info largely irrelevant. The reason is that I program in a highly abstract language, C#, and do not know how the JIT will translate it into machine code, nor do I know that machine code or the meaning of its instructions. When I compile my code and it runs on a CLR, that CLR is ordered to run the code as fast as possible; I can't do a thing about it. I can't force it to inline a given method, I can't order it to pre-JIT some code, I can't give runtime hints so the JIT can do a better job. It's out of my hands, just like C++ code is out of your hands once you compile it: it's up to the compiler how fast your C++ code runs.
However, there is something you can do: optimize your algorithms written in that highly abstract language. I'll try to explain this using an example from my 'old days'.
In 1989-1996 I was an active member of a sub-culture called 'the demoscene', which meant that I was involved in programming graphical routines and composing music for graphical demos. Each year there were a couple of competitions over who could create the best demo (meaning the greatest demo to look at, the most jaw-dropping routines, best graphics etc.), usually held at demoparties, which today we would call 'lan-parties', only we didn't have a lan with our Amigas :). During 1990-1992 there was a trend to plot as many calculated pixels per frame as possible on your 7.15 MHz Amiga: 3D dot-balls, dotted tunnels, rotating dotted 3D figures, you name it. We used assembler, because then we could squeeze every cycle out of the code we wrote. We knew for each MC68000 instruction how many cycles it would cost, and how to do things the fast way (multiplying by 4 was faster done with adds; multiplying by 5, however, was slower with 5 adds, so a mul instruction was faster then). In the end, every demo-coder knew you needed at least 4 or 5 instructions per pixel: the X-Y coordinates started at the top left, and you had to first calculate the start position of the screen line in the 1-bitplane framebuffer, then the byte holding the group of 8 pixels your pixel fell into, and then you had to convert the pixel number, because the third pixel in a byte was not bit 3 but bit 5, since bits are counted from the right (a pixel with X coordinate 18 was in byte 3 from the left ((18/8)+1), and you then had to plot the pixel 3 pixels into that byte: bit 5). That extra conversion instruction cost 6 or 8 cycles if I recall correctly. But since everybody had to do it, it was a limit you had to deal with.
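The per-pixel addressing described above can be sketched as follows. This is an illustration in Python, not the original MC68000 assembler; the function and parameter names are made up:

```python
def plot_pixel(framebuffer, x, y, width=320):
    """Set pixel (x, y) in a 1-bitplane framebuffer, 8 pixels per byte.

    Bits are numbered from the right (bit 7 is the leftmost pixel of a
    byte), so the x coordinate needs the '7 - ...' conversion; that
    subtraction is the extra per-pixel work described in the text.
    """
    row_start = y * (width // 8)       # start of the screen line, in bytes
    byte_index = row_start + x // 8    # which byte holds the pixel
    bit = 7 - (x % 8)                  # leftmost pixel of a byte is bit 7
    framebuffer[byte_index] |= 1 << bit
    return byte_index, bit

# A pixel with x = 18 lands in the third byte of the row, bit 5:
fb = bytearray(320 // 8)
plot_pixel(fb, 18, 0)
```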
In Denmark there was a guy who called himself Hannibal. He programmed a demo, called 3D demo, and won second place at 'The Party', in 1991 I believe. The most stunning part of the demo was a dot ball rotating in 3D with an amazing number of dots; everybody knew that was impossible, at least at the framerate (FPS) he managed to squeeze out of the routine. Geeks as we were, we disassembled the routine and discovered his brilliant but oh so simple solution: he placed the (0,0) position not at the top left but at the top right. That way he saved himself the 6 to 8 cycles of the bit-number conversion, per pixel plotted, which gave him enough headroom to add more dots to his effect than anybody else ever could.
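The trick can be illustrated in the same Python sketch, again an illustration rather than Hannibal's actual code: mirror the x axis so the origin sits at the top right, and the bit number falls straight out of the modulo:

```python
def plot_pixel_mirrored(framebuffer, x, y, width=320):
    """Plot with (0, 0) at the top RIGHT of a 1-bitplane framebuffer.

    Counting bytes from the right end of the screen line makes the bit
    number simply x % 8: the '7 - (x % 8)' conversion disappears from
    the inner loop. The whole image comes out mirrored, which nobody
    notices on a rotating dot ball.
    """
    bytes_per_row = width // 8
    byte_index = y * bytes_per_row + (bytes_per_row - 1) - x // 8
    framebuffer[byte_index] |= 1 << (x % 8)
    return byte_index, x % 8

# Mirrored x = 18 sets the same bit as x = 301 in the left-origin version:
fb = bytearray(320 // 8)
plot_pixel_mirrored(fb, 18, 0)
```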
It opened my eyes. For years we had tried to gain as many cycles as possible by optimizing the code in assembler, yet we were beaten by a wholly different kind of optimization: optimization of the algorithm. By simply using another algorithm that was more efficient in terms of actions and calculations, Hannibal gained more time than we ever could.
Since that day I have tried to optimize algorithms instead of lines of code, and with success, both with graphics code and with day-to-day business software. The thing I learned from programming demos in the demoscene was to ask yourself 10,000 times a day: "that action or calculation over there, can I re-use the result? Can I do it beforehand so I don't repeat it inside that loop? Do I have to traverse that list this way, or should I add a lookup table to find entries much faster?". When you teach yourself to do that, your algorithms become very clean: you know it's perhaps wiser to allocate an extra buffer for a lookup table so that in a tight loop you do a simple lookup instead of 10,000 multiplications of 4 floats.
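A minimal sketch of that trade-off (in Python for brevity, though the article's context is C#; the table size is an arbitrary choice): spend one buffer on a precomputed sine table so the tight loop does lookups instead of repeated calls to sin():

```python
import math

# Precompute once: one pass of real work up front ...
TABLE_SIZE = 1024
SINE_TABLE = [math.sin(2 * math.pi * i / TABLE_SIZE) for i in range(TABLE_SIZE)]

# ... so the tight loop only does cheap indexed lookups.
def fast_sine(step):
    """sin() for the angle step * (2*pi / TABLE_SIZE); wraps around the circle."""
    return SINE_TABLE[step % TABLE_SIZE]

# e.g. rotating 10,000 dots reads the table instead of calling math.sin:
values = [fast_sine(step) for step in range(10_000)]
```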
I hear you thinking: "but you only know that if you know the cost of allocating that buffer compared to those multiplications". In a way that's true, but see it as a metaphor for everyday object-oriented programming: doing action X once followed by a cheap lookup is much faster than doing X 10,000 times without the lookup, and instantiating 10,000 objects once is faster than creating, cleaning up and recreating one object 10,000 times, since the cleanup is then also done 10,000 times. Whether X is instantiating an object, traversing a list, calculating a sine value, reading from the database or whatever, it is necessary to ask yourself with every action X: do I need to do that there, or can I do it somewhere else? Is that action really necessary in this form, or can I do it differently? Using bubblesort on a long list is not that useful, even if you optimize your implementation. It's better to change the algorithm to quicksort.
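The bubblesort point can be made concrete by counting comparisons (a Python sketch; the numbers, not the language, are the point here):

```python
import random

def bubble_sort(items):
    """O(n^2): this naive version always does n*(n-1)/2 comparisons."""
    items = list(items)
    comparisons = 0
    for i in range(len(items)):
        for j in range(len(items) - 1 - i):
            comparisons += 1
            if items[j] > items[j + 1]:
                items[j], items[j + 1] = items[j + 1], items[j]
    return items, comparisons

random.seed(1)
data = [random.randrange(1000) for _ in range(500)]
result, n_comparisons = bubble_sort(data)
# 500 items: 124,750 comparisons, where an O(n log n) sort needs on the
# order of 500 * log2(500), roughly 4,500. No amount of tuning the inner
# loop closes that gap: changing the algorithm does.
```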
There is another reason why I think knowledge of the low-level instructions is OK to have but not very helpful: if I do my best to write the most clever algorithm, I don't want to be bothered with how it will run on a JIT, since that's the JIT's job. If I wanted it to be as fast as possible at that level, I would have written it in MSIL, with the table of cycles per instruction next to my keyboard. But I program in a highly abstract language; is it my problem if the JIT can't create better assembler output from my highly abstract code? No. The compiler should hint the JIT, and the JIT should use those hints and the MSIL to create the fastest code possible. If it can't, if the JIT falls short, the JIT isn't good enough, since my algorithm is the one I feel is the most efficient. If I have to break up the algorithm to make execution faster, I'm doing dangerous things: what if the JIT is later optimized, so the reworked algorithm that is in theory slower but was faster on the previous JIT is now suddenly slower than it should be? Do you think John Carmack still programs his 3D engines in assembler because that's faster? Why do so many people program data-intensive operations in SQL, a language that is interpreted? Could it be that the algorithms the programmers write in these languages are very efficient, and the execution of these algorithms is out of their hands? I think it is.
Much more time is won by, for example, using connection pooling in a database application, using caching in ASP.NET, using additional tables in the database (for trees, for example) to use the power of SQL to retrieve a tree in one query, using hashtables instead of array traversals, using a temporary sorted list with references to objects in an array or other structure to access them quickly at a given point in an algorithm, using Dispose when necessary and avoiding a finalizer when you don't need one. There are so many good algorithms out there. Use them; do a search on Google and you'll find tons of research papers on almost any algorithm. Learn what O(n) means and why it can help you.
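To pick one item from that list, hashtables versus array traversal looks like this in a sketch (Python for brevity; the data and names are invented for illustration):

```python
# 10,000 (id, name) pairs, as you might get back from a database query.
customers = [("id-%d" % i, "Customer %d" % i) for i in range(10_000)]

def find_linear(customer_id):
    """Array traversal: every lookup walks the list, O(n) per call."""
    for cid, name in customers:
        if cid == customer_id:
            return name
    return None

# Build the hashtable once; every lookup afterwards is O(1) on average.
customer_index = dict(customers)

def find_hashed(customer_id):
    return customer_index.get(customer_id)
```

Doing many lookups, the one-time cost of building the dictionary is repaid almost immediately: the traversal cost is paid once, at build time, instead of once per lookup.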
That's optimizing code, and thus also how you should optimize managed code, since it doesn't matter which platform you run the compiled output on: if the algorithm is efficient, the execution will be efficient.