Windows with C++: Exploring High-Performance Algorithms

My latest Windows with C++ column in the October 2008 issue of MSDN Magazine is now online:

Windows with C++: Exploring High-Performance Algorithms

In the concurrency space, issues such as coordination, asynchronous behavior, responsiveness, and scalability get the lion's share of the attention. These are the high-level topics that developers think about when designing applications. But, perhaps due to inexperience or the lack of good performance tools, some equally important topics are often overlooked. A case in point is high-performance algorithms.

Enjoy!

If you’re looking for one of my previous articles here is a complete list of them for you to browse through.

Produce the highest quality screenshots with the least amount of effort! Use Window Clippings.

Published Monday, September 22, 2008 8:34 AM by KennyKerr

Comments

# re: Windows with C++: Exploring High-Performance Algorithms

Monday, September 22, 2008 3:06 PM by Leo Davidson

Only having heard of, and not used, OpenMP before, I didn't realise it was quite that easy to take advantage of it. Very handy.

Do you know of a good page/reference explaining how the OpenMP threads are initialised and cleaned-up? In particular I'm wondering how OpenMP in Visual C++ deals with CRT initialisation -- i.e. the reason one should almost never use CreateThread() rather than _beginthreadex() in a C/C++ app -- and things like CoInitializeEx and other per-thread library init/cleanup.

I tried a search but didn't find anything. (Maybe I looked for the wrong phrases.)

# re: Windows with C++: Exploring High-Performance Algorithms

Tuesday, September 23, 2008 3:12 AM by KennyKerr

Leo Davidson: this is still the best introduction to OpenMP with Visual C++.

OpenMP is very handy for small, tactical, improvements to concurrency. It can in some circumstances provide dramatic performance improvements for very little effort but it is not a complete solution. In particular there is no way currently to handle errors gracefully. I would also not use it if the threads required any particular initialization for COM or the CRT. The new native concurrency libraries expected in VC10 should provide a more complete solution.

The warning about using CreateThread is dependent on your reliance on the CRT. Care should be taken but I would not say that using CreateThread is taboo. That would imply that the other thread (pool) functions are also off limits. Rather I would be careful about what global initialization my application depends on. Naturally this is very application specific.

# re: Windows with C++: Exploring High-Performance Algorithms

Wednesday, October 08, 2008 10:01 AM by John Schroedl

Good article Kenny, as usual.

Any tips on determining when data locality is bad?

I wish all our data structures were as evident as a bitmap's but they're not. In particular, we have statisticians attempting to create parallel algorithms who are most-likely unaware of these sorts of optimizations and I'd like to help them identify situations where this optimiztion can help.

I assume the info is buried in perfmon counters but which ones? These *seem* like candidates:

- Memory/Cache Faults per Sec

- Memory/Page Faults per Sec

- Cache/Data Flush Pages per Sec

Any tips?

John

# re: Windows with C++: Exploring High-Performance Algorithms

Thursday, October 16, 2008 4:57 PM by Scott VanSickle

Thanks for a good article.  I never really thought much about the implications of padding WRT contention; this is something that I will be considering much more because of your article.

I think that there are some more very basic considerations WRT high performance algorithms that you are not considering though (not that these are really the primary focus of the article, but very important nonetheless):

1- don't do unnecessary operations, especially in a loop with many iterations.  In all of your code samples you are doing unnecessary multiples on every pixel.  In Figure 1 you calculate x * sizeof(Pixel) unnecessarily for every y; in Figure 5 you calculate y * stride unnecessarily for every x.  Removal of many thousands of multiples should show a measurable speed improvement in any piece of code.

2- use pointer incrementing in pixel walking loops instead of an offset calculation from the beginning of the bitmap on every pixel.  Init a "Pixel *pPixel" variable at the beginning and do pPixel++, etc. within the loops.  You would have to adjust for any padding at the end of each scanline too where necessary of course.  I would think that being able to move the reinterpret_cast out of the loop may save some machine instructions also.  You could do this once you were using pointer offsetting instead of your current pixel offset calculatiion technique.

Scott V

# re: Windows with C++: Exploring High-Performance Algorithms

Friday, October 17, 2008 9:07 AM by Scott VanSickle

sorry for the typos in my 10/16 comments;  2 instances of "multiples" should be "multiplies"

Scott V

# re: Windows with C++: Exploring High-Performance Algorithms

Monday, October 20, 2008 3:32 PM by KennyKerr

John Schroedl: Good question, although I haven’t really looked into any particular perf counters for this. I’m not sure how much of it tools could really identify automatically due to the abstractions that are present and the great degree of difference among different chipsets. Careful analysis tends to be inevitable. But if I find some automated analysis tools/tips I’ll pass them along.

Scott VanSickle: Thanks for your comments, however you need to be careful not to optimize prematurely. The compiler is a lot smarter than most developers think. For (1), typically the compiler can optimize this away. For (2), the compiler can also make the difference you’re describing irrelevant. As for reinterpret_cast, this is to suppress compiler type checking and doesn’t add runtime overhead. It’s good to question the code but you should start by measuring the actual performance and tweak the code only if it will actually make a measureable difference.

Hope that helps.

# re: Windows with C++: Exploring High-Performance Algorithms

Tuesday, October 21, 2008 10:25 AM by Scott VanSickle

Kenny,

I would be very interested in receiving info on performance analysis tools and techniques.  Please post anything that you come across in a prominent place.

re: item (1) - I agree that compilers have gotten a lot smarter, but there's still no good reason to do huge numbers of unnecessary operations unless possibly if you are able to make a maintainability justification based on shorter or clearer code.  I don't at all consider my code improvement suggestion to be a premature optimization.  I do agree though that premature optimization is a bad idea; it's something that many developers need to keep more in mind (myself included). It is generally a bad practice to write inefficient code for no good reason and to rely on the compiler to "fix" it for you.  It is true that optimizations will greatly improve the situation, but when running debug builds they are usually turned off so you lose developer productivity in cases where the unoptimized debug version is much slower. I have personally experienced cases where I have not been able to test with debug builds because unoptimized versions of inefficient algorithms were intolerably slow and I was losing significant productivity.  The images that we work with are commonly 12M pixels; that's a lot of iterations for any algorithm.

re: item (2) - in my experience with VS 2005 the difference is not irrevelant.  We write a lot of image processing code and have very often observed a measurable difference between using something like buffer[nStart + nOffset] and pBuffer++ when walking through millions of pixels.  Because of this, as a practice we try to use pointer incrementing as much as possible when addressing data in large loops like your example. The difference is obviously greatly increased when a "[]" operator is involved (std::vector, e.g.).  In that case many extra function calls are in play in debug builds. In release builds they would be inlined, but the inlined code is unnecessary much slower than if pointer offsetting ws used. Also I know what reinterpret_cast is for; I wasn't questioning your use of reinterpret_cast for suppressing type checking (you couldn't compile the code without doing that), I was questioning the repeated assignment to Pixel& in the loop.  Pixel* incrementing would be much faster than recalculating Pixel& from (bitmap + offset) on every loop iteration and would not make the code any more complex.

Scott V

# re: Windows with C++: Exploring High-Performance Algorithms

Tuesday, October 21, 2008 2:36 PM by KennyKerr

Scott V: Thanks for your great comment. I can see you’ve got more experience with image processing!  :)  

I also like to write high performance code by default but err on the side of readability and then measure, analyze, optimizing as necessary. For articles in particular I try to keep things as simple as possible.

But your experience is appreciated. I’m interested to see what perf improvements I’ll get with your suggestions.