I hate rebooting. Most people do. All too often, though, it's the only way to get running again. Think it's a problem only for Microsoft Windows? Imagine having to do 130 reboots before you could get working again. That’s what NASA’s Mars rover Spirit had to do when it ran out of memory on January 21, 2004.
Is Spirit running on Windows? No, it’s running VxWorks from Wind River.
In the case of Spirit, the problem was that 289MB of flash memory had filled up, causing various faults in the software. The software attempted to resolve the problem by rebooting the on-board computer. But that didn’t fix the underlying problem, so the system would eventually reboot again. The limited window for communicating with the rover, along with the extreme slowness of the connection (11KBps on a good day), meant that it wasn’t “cured” until February 6, 2004.
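To see why rebooting alone could not cure this kind of fault, here is a toy model in Python. It is not how Spirit's flight software works; the `Rover` class, its fields, and the capacity numbers are all made up for illustration. The only point is that a reboot resets volatile state while the contents of flash survive it.

```python
class Rover:
    """Toy model (hypothetical): flash survives a reboot, RAM state does not."""

    def __init__(self, flash_capacity):
        self.flash_capacity = flash_capacity
        self.flash_used = 0     # persistent across reboots
        self.running = False    # volatile, re-evaluated on every boot
        self.reboots = 0

    def boot(self):
        # Assumption: startup only succeeds if some flash is still free.
        self.running = self.flash_used < self.flash_capacity
        return self.running

    def reboot(self):
        self.reboots += 1
        return self.boot()      # note: flash_used is left untouched

    def delete_files(self, amount):
        # The real fix: free persistent storage.
        self.flash_used = max(0, self.flash_used - amount)


def reboot_until_healthy(rover, max_attempts):
    """Keep rebooting, up to max_attempts, and report whether it helped."""
    for _ in range(max_attempts):
        if rover.reboot():
            return True
    return False


if __name__ == "__main__":
    rover = Rover(flash_capacity=100)
    rover.flash_used = 100                  # flash is completely full
    print(reboot_until_healthy(rover, 5))   # rebooting alone never helps
    rover.delete_files(40)                  # free some flash first
    print(rover.reboot())                   # now the boot succeeds
```

However many times the loop runs, the condition that caused the failure is still there on the next boot. Only changing the persistent state breaks the cycle.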
I’ve long been interested in software for spacecraft. It stems from building a Super Elf computer back in 1980 and using it in various science fair projects. The Super Elf was based on the RCA 1802 COSMAC microprocessor, which flew as the brains of many spacecraft, including Voyager I and II. The 1802 processor was made using a CMOS technology that made it naturally hardened against radiation in space, but it is terribly slow by today's microprocessor standards.
Another side interest of mine is software failures. Software engineering isn’t the same type of discipline that, say, civil engineering is. Anyone can claim to be a programmer, and the relative infancy of the field ensures that many will try.
I wrote a sidebar in my first book about a famous software failure on January 15, 1990 that left much of AT&T’s toll-free network inoperable for several hours. The failure was caused by a misplaced C-language break statement, introduced when the software for the 4ESS switching machines was being updated. That programming flaw would corrupt data if two calls were received within 1/100th of a second.
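The actual 4ESS source is not reproduced here, and it was written in C. As a loose analogy only, the Python sketch below shows the general trap with a break statement: it leaves the enclosing loop, not just the if branch it sits in, so work the programmer expected to happen is silently skipped. The function names and message strings are invented for the example.

```python
def process_buggy(messages):
    """Intended behavior: skip 'busy' messages, handle everything else."""
    handled = []
    for msg in messages:
        if msg == "busy":
            break               # BUG: exits the whole loop, not just this branch
        handled.append(msg)
    return handled


def process_fixed(messages):
    """Same logic with the statement the programmer actually meant."""
    handled = []
    for msg in messages:
        if msg == "busy":
            continue            # skip this one message, keep processing the rest
        handled.append(msg)
    return handled


if __name__ == "__main__":
    calls = ["call-1", "busy", "call-2"]
    print(process_buggy(calls))   # call-2 is never handled
    print(process_fixed(calls))   # both real calls are handled
```

Both versions look plausible at a glance, which is part of why this kind of bug survives code review.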
The software was designed to handle corrupted data – by rebooting. When restarted, the switch would announce to all the other switches in the network that it was once again available. Each switch kept track of available switches, and when it got the okay from the rebooted switch, it would have to spend a little time updating its status map. While it was busy doing that, it was more likely to get hit with two or more calls at nearly the same time. It only took 4 seconds to reboot, so the 4ESS switches were going down, coming back up and overloading the rest of the network. Thus the cascade began, and as more and more switches went down and came back up, the problem spread further throughout the system. It only took 10 minutes to bring down the entire network.
Most software failures are a result of unintended effects. The trigger is usually a set of conditions nobody anticipated, so the lowest form of repair kicks in – start over from scratch.
This brings me back to the Spirit rover. According to Glenn Reeves, flight software architect at NASA’s Jet Propulsion Laboratory, the flash memory on Spirit became an “incredibly full file system that now contains more information than we ever thought it would.” That raises a couple of questions that I haven’t seen addressed yet. First, how come the memory filled up so quickly? The rover was only on the surface for a few days. Presumably it collects data, stores it, and transmits it to Earth. Second, why can’t the system recover sensibly from what is essentially a “disk full” error?
I know that software testers like to use tools to artificially fill up memory in order to see how the software reacts. I can’t imagine that this kind of testing wasn’t done for the Mars rovers.
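As a sketch of what that kind of test might look like, the Python below fills a small fake file store to capacity and then checks that a write still succeeds by freeing the oldest files first. Everything here is hypothetical: `TinyStore`, `write_with_recovery`, and the delete-oldest policy are illustrations, not anything taken from the rover software, and deciding which data is safe to discard is a policy question a real mission would have to answer.

```python
import errno


class TinyStore:
    """Hypothetical stand-in for a flash file system with a hard size limit."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.files = {}         # name -> size; dicts keep insertion order

    def used(self):
        return sum(self.files.values())

    def write(self, name, size):
        # Fail the same way a real file system reports a full disk.
        if self.used() + size > self.capacity:
            raise OSError(errno.ENOSPC, "No space left on device")
        self.files[name] = size

    def delete_oldest(self):
        oldest = next(iter(self.files))
        del self.files[oldest]


def write_with_recovery(store, name, size):
    """Try to write; if the store is full, drop the oldest files and retry."""
    if size > store.capacity:
        return False            # can never fit, so don't destroy data trying
    while True:
        try:
            store.write(name, size)
            return True
        except OSError as exc:
            if exc.errno != errno.ENOSPC:
                raise           # some other failure; don't hide it
            store.delete_oldest()


if __name__ == "__main__":
    store = TinyStore(capacity=10)
    for i in range(5):
        store.write(f"img{i}", 2)       # artificially fill the store
    print(write_with_recovery(store, "new", 4))
    print(sorted(store.files))
```

The interesting part of such a test is not the happy path but the full one: the store starts at 100% and the code under test has to do something better than fall over and restart.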
Now, I’m sure there is plenty of detail here I’m not aware of, and I’m not passing judgment on the quality of someone else’s code. The rover software is amazing in that it has a lot of redundancy built in and can update itself with software sent from millions of miles away. I’d like to know some more of those details.