Commenting on ChrisAn's reliability posting because he BlogX'ed himself into a no comment corner
This'll be an interesting post because I can talk about more than one thing. First, ChrisAn added some new features to his blogging engine and either turned off comments using the handy config option, or his security word feature simply isn't working. http://www.simplegeek.com/permalink.aspx/f085404d-00e3-47b4-be83-c73a682bc1f3. He also posted something I'm even more interested in, the picture of the CLR reliability story. I've been talking about this a lot lately, and I had the following comments that I wanted to post on Chris's blog, but I'll do it here and he can follow the link. Here is his original posting, http://www.simplegeek.com/permalink.aspx/1cdd3f95-bc89-4bfb-8a80-8ff66b6b1c7e, and here are the comments I was trying to make. The comment starts off weird because I'm redirecting his question of “What should the component do?”
No, what should the platform do? If you need a five-nines reliability platform, then you allow for that type of platform to exist. Mark enough pages to gracefully shut down an application when in a particular mode, or use the extra pages to gracefully exit a soft OOM (by soft, I mean not entirely real, because we actually have pages we are holding in reserve). As the platform, you have the power to allocate a special region just for this type of usage.
I think as the developer, I should also get an opportunity to mark pages for this reason. I might mark a single page, so that I can open a new file handle and write out some save data while I'm gracefully crashing.
I guess large platform changes are out of the question though, even though CERs (constrained execution regions) were created. Wiping your butt with tree leaves is going out of style though, so maybe we'll get some toilet paper soon enough ;-)
I think what I hit on above is actually pretty interesting. Within the realm of the CLR, all memory allocations are controlled by the CLR itself. It has a managed heap, its own private place to store information. The CLR normally controls this region, but others can walk over it using API calls. Barring users munging the managed heap, I think the ability to mark pages for use in special circumstances would be a great idea. Mark, say, 4K of memory that can be used to allocate enough objects to walk out of an OOM. I pose this in two parts really.
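To make the reserved-region idea a bit more concrete, here is a rough user-level sketch in Python. It is not a CLR feature and not anything Chris described: `EmergencyReserve` and `run_with_reserve` are hypothetical names, and the 4096-byte size just mirrors the 4K figure above. The block is allocated while memory is plentiful and released only when an allocation fails, so the recovery path has some headroom.

```python
class EmergencyReserve:
    """Holds a block allocated up front that can be given back during an OOM."""

    def __init__(self, size=4096):
        self.size = size
        # Allocate while memory is still plentiful.
        self._block = bytearray(size)

    @property
    def available(self):
        return self._block is not None

    def release(self):
        """Drop the reserve so cleanup code has headroom; returns bytes released."""
        if self._block is None:
            return 0
        self._block = None
        return self.size


def run_with_reserve(work, on_oom, reserve):
    """Run work(); on MemoryError, release the reserve and run on_oom() instead."""
    try:
        return work()
    except MemoryError:
        reserve.release()
        return on_oom()
```

A runtime that owned the heap could guarantee that the released pages go to the recovery path. A user-level version like this one cannot: the allocator is free to hand that memory to anyone, which is a good argument for the platform doing this rather than the component.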
Part 1 - The CLR Tear-down mode: The CLR tear-down mode is when a really solid OOM is on the table and the CLR has to gracefully shut down or recover. In a normal OOM, the CLR can't allocate any more memory. With the new concept of a reserved region, the CLR can now allocate memory as it steps out of the OOM, possibly granting more abilities to the GC, because the GC can allocate special trees for storing compaction data, or whatever else it might need.
Part 2 - The Developer Tear-down mode: If they give the ability to the CLR, I want it as well. Word currently has a feature where it saves off your file every now and then in case your computer tanks on you somehow. This could be a Word crash, or the power going out, or you forgetting to save before walking away from your laptop while on battery. The developer tear-down mode is similar to this, but it says the following: you are being torn down and the process is exiting, so use the 4K of memory you specially allocated to save your settings and come back up fighting.
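Here is a minimal sketch of what the developer side might look like, again in Python and purely illustrative. `TeardownSaver`, the `cursor` and `dirty` fields, and the record layout are all made up for the example. The point is that everything expensive (the file handle, the 4K buffer) is set up ahead of time, so the tear-down path only packs a few bytes into memory it already owns and writes them out.

```python
import os
import struct
import tempfile

BUDGET = 4096  # the 4K set aside for tear-down, per the text above


class TeardownSaver:
    """Opens the save file and allocates the buffer before trouble starts."""

    RECORD = "<QB"  # hypothetical state: 8-byte cursor position, 1-byte dirty flag

    def __init__(self, path, budget=BUDGET):
        flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | getattr(os, "O_BINARY", 0)
        self._fd = os.open(path, flags, 0o600)
        self._buf = bytearray(budget)
        self._view = memoryview(self._buf)

    def save(self, cursor, dirty):
        """Pack state into the reserved buffer, write it, and close the handle."""
        struct.pack_into(self.RECORD, self._buf, 0, cursor, 1 if dirty else 0)
        size = struct.calcsize(self.RECORD)
        os.write(self._fd, self._view[:size])
        os.close(self._fd)
        return size
```

Python still allocates small temporaries behind the scenes here, so this only shows the shape of the contract. A real tear-down mode would need the runtime to guarantee that nothing on this path allocates outside the reserved region.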
I'd assume the types of things you can do in this mode would be highly restricted, so that you don't just throw the machine into another OOM by calling, say, some property to get state that in turn allocates a huge representation tree. I'd be happy if the entire process was just a service of the CLR, maybe with me marking some special objects that are guaranteed to be serialized in the case of such an event. Later on, as the CLR invades the OS more and more, it could start to take advantage of the various reliability hardware that exists: taking events and signals from power devices in order to ensure state gets saved, or maybe even storing state into a persistent memory device if one exists on the machine. That starts to give you more options.
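The "mark some special objects" version could be sketched like this, as a guess at the shape rather than a proposal for an actual API. `TeardownRegistry`, `mark`, and `snapshot` are hypothetical names, and JSON stands in for whatever serializer the runtime would really use. Each marked object supplies a small state callback, and anything that would blow the byte budget is skipped instead of being allowed to trigger another OOM.

```python
import json


class TeardownRegistry:
    """Tracks objects marked for saving and enforces a byte budget at tear-down."""

    def __init__(self, budget=4096):
        self.budget = budget
        self._entries = {}

    def mark(self, name, get_state):
        """Register a callback that returns a small, JSON-friendly state."""
        self._entries[name] = get_state

    def snapshot(self):
        """Serialize marked state in registration order, skipping what won't fit."""
        saved, skipped, used = {}, [], 0
        for name, get_state in self._entries.items():
            blob = json.dumps(get_state(), sort_keys=True)
            cost = len(blob.encode("utf-8"))
            if used + cost > self.budget:
                skipped.append(name)
                continue
            saved[name] = blob
            used += cost
        return saved, skipped, used
```

Usage would be something like `registry.mark("settings", lambda: {"font": "mono"})` at startup, with the runtime calling `snapshot()` during tear-down. Note that this sketch measures the cost after serializing, so it does not stop a callback from allocating too much in the first place; enforcing that is exactly the kind of restriction the platform would have to provide.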