Fail Gracefully


We were working on a subsystem component for Openvista when we noticed that an issue occurred every so often with a third party's application we were testing against. It is the type of problem you dread as an engineer: the frustrating intermittent issue. After some debugging, we determined that it could occur given the "right" sequence of connectivity issues, and that we needed to plan for it.

Once we knew what could cause the issue, we started working on the user interface so that the software would tell users what to do if the issue occurred and let them resolve it with one click. We then linked the user interface to our software engine and put it into the testing process.
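The idea can be sketched in a few lines. This is only an illustration, not the Openvista code: the `ConnectionDropped` exception, the `UserNotice` type, and the `run_with_notice` helper are all hypothetical names, and the sketch assumes the failure surfaces as an exception that the engine can catch and hand to the interface along with a recovery action.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


class ConnectionDropped(Exception):
    """Hypothetical error raised when the third-party link fails mid-request."""


@dataclass
class UserNotice:
    """What the interface shows the user when something goes wrong."""
    message: str                  # plain-language explanation of what happened
    action_label: str             # text for the one-click recovery button
    action: Callable[[], str]     # what the button does when clicked


def run_with_notice(
    operation: Callable[[], str],
    recover: Callable[[], str],
) -> Tuple[Optional[str], Optional[UserNotice]]:
    """Run an operation; on a known failure, return a notice instead of crashing."""
    try:
        return operation(), None
    except ConnectionDropped:
        notice = UserNotice(
            message=(
                "The connection to the external system was interrupted. "
                "Click below to reconnect and try again."
            ),
            action_label="Reconnect and retry",
            action=recover,
        )
        return None, notice


def flaky_request() -> str:
    # Stand-in for the intermittent failure (assumption for the example).
    raise ConnectionDropped()


def reconnect_and_retry() -> str:
    # Stand-in for the real recovery step (assumption for the example).
    return "request completed"


result, notice = run_with_notice(flaky_request, reconnect_and_retry)
if notice is not None:
    print(notice.message)
    print(f"[{notice.action_label}]")
    result = notice.action()  # simulates the user clicking the button
print(result)
```

The design choice worth noting is that the engine returns the explanation and the recovery action together, so the interface never has to show an error it cannot help the user resolve.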

It reminded me that knowing what consumers expect should drive decisions about how to handle failure conditions. With some systems, users know that failures can and often do happen without anything actually being broken. Email is a good example. If an email "bounces," users generally don't conclude that their email system is broken. Typically, they check whether they have the right email address or whether they made a typo when they entered the address in the "To" box.

However, with other systems, like a television set or movie player, users tend to be unhappy and unforgiving if a movie or show doesn't play correctly. There is a totally different expectation about how long users will wait for a show to play correctly before they worry that something is fundamentally wrong. A lot of this has to do with how TV, movies and email fail. Email fails gracefully while television and movies do not. When an email bounces, the bounce message usually explains why it failed in a concise and generally understandable sentence or two. When a television show or a movie doesn't play, there generally isn't a message at all, and the auditory and visual capabilities of the system aren't used to explain to the consumer what they should do, if anything.

With IT systems, the more error checking and handling you build into the system, the longer it takes to build and the more complex it becomes. However, consumers tend to accept very few failures from applications or IT systems, because the better applications have made failures less scary, more recoverable and less damaging to the user's experience. That's why Word, which still crashes on me, isn't as painful as it used to be: it saves a recoverable version of the document every so often, just in case. Of course, my behavior has changed to accommodate failures as well, and I tend to make sure I manually save copies often.

This brings me to the article at Joel on Software about Wolfram Alpha and the actual article it points to here. Like many other technologists, I took a look at Wolfram Alpha and, based on some queries, really appreciated the user experience. However, other queries as simple as "average cat life span" (which Google answers) got the "Wolfram Alpha isn't sure what to do with your input." message. The problem, as those articles point out, is that Wolfram Alpha appears to accept any input but really only handles a narrow set of them. Users don't expect this disconnect because, when they use Google or another search engine, they almost always get some type of answer. Thus, to the user, Wolfram Alpha fails in an ungraceful way. That is always going to be a turnoff for consumers.

I think an important takeaway for us as we continue to develop Openvista, and for others who build systems, is that you should design your systems to fail in the way users expect, based on their experiences with similar systems. Otherwise, your product's failures may not be graceful.