Commenting on ChrisAn's reliability posting because he BlogX'ed himself into a no comment corner

This'll be an interesting post because I can talk about more than one thing.  First, ChrisAn added some new features to his blogging engine and either turned off comments using the handy config option, or his security word feature simply isn't working: http://www.simplegeek.com/permalink.aspx/f085404d-00e3-47b4-be83-c73a682bc1f3.  He also posted something I'm even more interested in, the picture of the CLR reliability story.  I've been talking about this a lot lately and had some comments that I wanted to post on Chris's blog, but I'll do it here and he can follow the link.  Here is his original posting, http://www.simplegeek.com/permalink.aspx/1cdd3f95-bc89-4bfb-8a80-8ff66b6b1c7e, and here are the comments I was trying to make.  The comment starts off weird because I'm redirecting his question of “What should the component do?”

No, what is the platform to do?  If you need a five-nines reliability platform, then you allow for that type of platform to exist.  Mark enough pages to gracefully shut down an application when in a particular mode, or use the extra pages to gracefully exit a soft OOM (by soft, I mean not entirely real, because we actually still have pages held in reserve).  As the platform, you have the power to allocate a special region just for this type of usage.

I think as the developer, I should also get an opportunity to mark pages for this reason.  I might mark a single page, so that I can open a new file handle and write out some save data while I'm gracefully crashing.

I guess large platform changes are out of the question though, even though CER regions were created.  Wiping your butt with tree leaves is going out of style though, so maybe we'll get some toilet paper soon enough ;-)

I think what I hit on above is actually pretty interesting.  Within the realm of the CLR, all memory allocations are controlled by the CLR itself.  It has a managed heap, its own private place to store information.  It normally controls this region, but others can walk over it using API calls.  Barring users munging the managed heap, I think the ability to mark pages for use in special circumstances would be a great idea.  Mark, say, 4K of memory that can be used to allocate enough objects to walk out of an OOM.  I pose this in two parts really.

Part 1 - The CLR Tear-down mode: The CLR tear-down mode is when a really solid OOM is on the table and the CLR has to gracefully shut down or recover.  In a normal OOM, the CLR can't allocate any more memory.  With the new concept of a reserved region, the CLR can now allocate memory as it steps out of the OOM, possibly granting more abilities to the GC, because the GC can allocate special trees for storing compaction data or whatever else it might need.

Part 2 - The Developer Tear-down mode: If they give the ability to the CLR, I want it as well.  Word currently has a feature where it saves off your file every now and then in case your computer tanks on you somehow.  This could be a Word crash, or the power going out, or you forgetting to save before walking away from your laptop while on battery.  The developer tear-down mode is similar to this, but it says the following: "You are being torn down and the process is being exited; use the 4K of memory you specially allocated so you can save your settings and come back up fighting."
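
To make the developer tear-down idea a little more concrete, here is a minimal sketch. It is written in Python purely for illustration and is not something the CLR offers. `EmergencyReserve` and `run_with_reserve` are names I'm making up, and note the big assumption: releasing a block in a real runtime doesn't guarantee the freed memory is what the next allocation gets.

```python
class EmergencyReserve:
    """Holds a small block of memory that is released only during tear-down."""

    def __init__(self, size=4096):
        # Hypothetical stand-in for "marked pages": memory paid for up front.
        self._block = bytearray(size)

    @property
    def available(self):
        return self._block is not None

    def release(self):
        # Dropping the reference is meant to give the allocator some headroom.
        # Assumption: a real runtime would need a stronger guarantee than this.
        self._block = None


def run_with_reserve(work, save_state, reserve):
    """Run work(); on MemoryError, free the reserve, then save state."""
    try:
        return work()
    except MemoryError:
        reserve.release()
        save_state()
        return None
```

The point of the shape is that the reserve is allocated while memory is plentiful, and `save_state` only runs after it has been handed back.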

I'd assume the types of things you can do in this mode would be highly restrictive, so that you don't just throw the machine into another OOM by calling, say, some property to get state that in turn allocates a huge representation tree.  I'd be happy if the entire process was just a service of the CLR, maybe with me marking some special objects that are guaranteed to be serialized in the case of such an event.  Later on, as the CLR invades the OS more and more, it could start to take advantage of the various reliability hardware that exists: taking events and signals from power devices in order to ensure state gets saved, or maybe even storing state into a persistent memory device if one exists on the machine.  That starts to give you more options.
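
The "mark some special objects" part could look something like the sketch below, again in Python and again hypothetical: `mark_critical` and `serialize_critical` are invented names, not a CLR service. A real implementation would have to serialize without allocating much, which `json.dumps` does not promise, so treat this as the shape of the API only.

```python
import json

# name -> zero-argument callable returning JSON-friendly state (hypothetical registry)
_critical = {}


def mark_critical(name, get_state):
    """Register state that should be written out during tear-down."""
    _critical[name] = get_state


def serialize_critical():
    """Return one JSON document holding every marked object's current state."""
    snapshot = {name: get_state() for name, get_state in _critical.items()}
    return json.dumps(snapshot, sort_keys=True)
```

Registering a getter rather than the object itself keeps the snapshot current at tear-down time instead of at registration time.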

Published Sunday, June 13, 2004 7:51 AM by Justin Rogers

Comments

Sunday, June 13, 2004 6:02 PM by Pavel Lebedinsky

# re: Commenting on ChrisAn's reliability posting because he BlogX'ed himself into a no comment corner

I'm not sure if hardening the framework against OOM and other similar conditions really makes sense.

With a lot of work it would probably be possible to harden the CLR and libraries so that it's theoretically possible to write apps that can recover from OOM, or do some cleanup before terminating. But even then you'd have to spend lots of time and effort testing each "hardened" application to make sure running out of memory doesn't cause corruption in any of the components.

For most server applications it's easier and more reliable to simply recycle the process. This can even be done before the process actually runs out of memory (the IIS/COM+ worker process model).

Client apps like Word are where it gets more interesting. If I paste a huge Visio drawing into a Word document and Word runs out of memory, should this result in process termination? Probably not (however, one could imagine a model where UI and actual processing are implemented in separate processes, so the UI process cannot easily run out of memory and the background worker process can be recycled without the user even noticing it).

However, if the data in the document was really important (like some financial data), then I would take no chances: restart the process, then load the last known good version of the data.

What is needed in my opinion is some kind of configurable policy that describes how an application wants to handle critical failures (OOM being just one example, others could include unexpected SEH and managed exceptions, stack overflows etc). The default should be to kill the process and submit a Watson crash report, but individual apps should be able to override it.
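
A rough sketch of what such a policy could look like, in Python for illustration only. The `Action` values, the failure names, and `FailurePolicy` are all hypothetical; nothing here corresponds to an existing CLR or Windows setting.

```python
from enum import Enum


class Action(Enum):
    KILL_AND_REPORT = "kill_and_report"  # default: terminate and file a crash report
    RECYCLE = "recycle"
    SAVE_AND_EXIT = "save_and_exit"
    TRY_RECOVER = "try_recover"


class FailurePolicy:
    """Maps a critical failure kind to the action the application asked for."""

    def __init__(self, overrides=None):
        self._overrides = dict(overrides or {})

    def action_for(self, failure):
        # Anything the app did not override falls back to kill-and-report.
        return self._overrides.get(failure, Action.KILL_AND_REPORT)
```

An app that cares about OOM but nothing else would pass `{"oom": Action.SAVE_AND_EXIT}` and inherit the default for stack overflows and unexpected exceptions.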

Sunday, June 13, 2004 7:56 PM by Justin Rogers

# re: Commenting on ChrisAn's reliability posting because he BlogX'ed himself into a no comment corner

Recycling, in and of itself, is a possible source of corruption. You have to make sure to properly save out state and do many other things when this happens. There are currently hundreds of ASP.NET applications that don't properly save out state during an application shutdown and recycle. Users performing multi-page actions all get their chance to lose data in this scenario.

If, as you say, it were some financial data, who can say that the last known good (LKG) version is the best version of the data to load? What if the LKG misses approximately 2-3 seconds of important information that I typed in right before I loaded the Visio document? Or should the app save before doing any paste operation? That sounds costly. I'd venture that hardening against OOMs is crucial, and allowing me the ability to save out the version of the document that I have in memory, containing my 2-3 seconds of important data, is better than loading an LKG. Hell, possibly recovering that large amount of memory and continuing to run Word may be even more important.

Some processes just can't be terminated when something goes wrong. You can't just terminate someone's game client in the middle of their game; that's crap. Games, though, tend to handle OOMs quite gracefully. They have a true reliability story and they provide fallback functions for cleaning memory, such as refreshing and reloading the entire graphics set currently being used with lower resolution versions to conserve memory.
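
The fallback pattern is simple enough to sketch. This is a Python illustration under my own assumptions, with `allocate_with_fallbacks` as a made-up helper and not any particular game engine's API: each time an allocation fails, run the next memory-saving fallback and retry.

```python
def allocate_with_fallbacks(allocate, fallbacks):
    """Try allocate(); after each MemoryError run the next fallback and retry.

    Re-raises the MemoryError once every fallback has been used.
    """
    remaining = list(fallbacks)
    while True:
        try:
            return allocate()
        except MemoryError:
            if not remaining:
                raise
            # Fallbacks run in the order given, cheapest sacrifice first.
            remaining.pop(0)()
```

A fallback here would be something like "swap the current textures for lower resolution versions", after which the original load gets another chance.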

I'm not advocating that every application take care of this process, and as I noted, this would be a special mode, a form of reliability mode. Most of the CLR could benefit from the creation of the special mode and be more hardened to OOMs, but only with special consideration could you set up the ability to handle an OOM and recover or save your state.

As for spending a bunch of time testing my hardened application, that is a matter of opinion. In many cases the CLR itself will better recover when an OOM strikes, meaning I get that much hardening for free and they do the testing. My app being hardened against OOMs and other critical failures is nothing more than a feature of my app. I'd spend just as much time working on this feature as I would any other feature. Which of the following two apps would you buy: "Excel that recycles if your Workbook gets too big", or "Excel that gracefully recovers from a large Workbook and offers recommendations for solutions, one of which is to shut down and reopen the workbook"?
