Windows with C++: Exploring High-Performance Algorithms

My latest Windows with C++ column in the October 2008 issue of MSDN Magazine is now online:

Windows with C++: Exploring High-Performance Algorithms

In the concurrency space, issues such as coordination, asynchronous behavior, responsiveness, and scalability get the lion's share of the attention. These are the high-level topics that developers think about when designing applications. But, perhaps due to inexperience or the lack of good performance tools, some equally important topics are often overlooked. A case in point is high-performance algorithms.

Enjoy!

If you’re looking for one of my previous articles here is a complete list of them for you to browse through.

Produce the highest quality screenshots with the least amount of effort! Use Window Clippings.

4 Comments

  • Only having heard of, and not used, OpenMP before, I didn't realise it was quite that easy to take advantage of it. Very handy.

    Do you know of a good page/reference explaining how the OpenMP threads are initialised and cleaned-up? In particular I'm wondering how OpenMP in Visual C++ deals with CRT initialisation -- i.e. the reason one should almost never use CreateThread() rather than _beginthreadex() in a C/C++ app -- and things like CoInitializeEx and other per-thread library init/cleanup.

    I tried a search but didn't find anything. (Maybe I looked for the wrong phrases.)

  • Good article Kenny, as usual.

    Any tips on determining when data locality is bad?

    I wish all our data structures were as evident as a bitmap's but they're not. In particular, we have statisticians attempting to create parallel algorithms who are most-likely unaware of these sorts of optimizations and I'd like to help them identify situations where this optimiztion can help.

    I assume the info is buried in perfmon counters but which ones? These *seem* like candidates:
    - Memory/Cache Faults per Sec
    - Memory/Page Faults per Sec
    - Cache/Data Flush Pages per Sec

    Any tips?
    John

  • Thanks for a good article. I never really thought much about the implications of padding WRT contention; this is something that I will be considering much more because of your article.

    I think that there are some more very basic considerations WRT high performance algorithms that you are not considering though (not that these are really the primary focus of the article, but very important nonetheless):

    1- don't do unnecessary operations, especially in a loop with many iterations. In all of your code samples you are doing unnecessary multiples on every pixel. In Figure 1 you calculate x * sizeof(Pixel) unnecessarily for every y; in Figure 5 you calculate y * stride unnecessarily for every x. Removal of many thousands of multiples should show a measurable speed improvement in any piece of code.
    2- use pointer incrementing in pixel walking loops instead of an offset calculation from the beginning of the bitmap on every pixel. Init a "Pixel *pPixel" variable at the beginning and do pPixel++, etc. within the loops. You would have to adjust for any padding at the end of each scanline too where necessary of course. I would think that being able to move the reinterpret_cast out of the loop may save some machine instructions also. You could do this once you were using pointer offsetting instead of your current pixel offset calculatiion technique.

    Scott V

  • sorry for the typos in my 10/16 comments; 2 instances of "multiples" should be "multiplies"
    Scott V

Comments have been disabled for this content.