Choosing the right error handling strategy

07 Sep 2016 by Jonathan

To quote a previous post: “Sometimes things aren’t working.” If something isn’t working, you have to deal with it. But how?

There are two fundamental kinds of strategies: recoverable error handling (exceptions, error return codes, handler functions) and unrecoverable error handling (assert(), abort()). When do I use which one?

Note: This is marked as part 1 of the series, but chronologically it is the second part, simply because I hadn’t planned a series when I wrote the other one. As a new reader, it makes more sense to read the posts in part order, though.

Kinds of errors

Errors can have a variety of causes: the user enters weird input, the operating system cannot give you a file handle, or some code dereferences a nullptr. Each of these errors is different and needs different treatment. The three main categories of error sources are:

User errors: “user” here means the human sitting in front of the computer and actually “using” the program, not some programmer who is using your API. User errors happen when the user does something wrong.
System errors: System errors happen when the OS cannot fulfill your request. In a nutshell, everything that fails because a call to the system API has failed is a system error. System errors have a gray zone, though: some of them happen because the programmer passed bad parameters to the system call, which is more of a programming error than a system error.
Programming errors: The programmer hasn’t looked at the precondition of the API or the language. If the API specifies that you must not call foo() with 0 as the first parameter and you do - this is the fault of the programmer. Even if the user has entered the 0 that was passed to foo(), the programmer has not written code to check that and it is thus his fault.

Each category is different and each requires special treatment, so let’s look at them.

User error

I’m going to make a very bold statement: A user error isn’t actually an error.

All users are stupid and don’t follow instructions.

A programmer dealing with human input should expect that the input is bad - the first thing the code should do is check its validity, report any mistakes back to the user and request new input.

Thus it doesn’t really make sense to deal with user errors using any form of error handling strategy. Input should be validated as soon as possible to simply prevent user errors from happening.

This isn’t possible every time, of course. Sometimes it is very expensive to validate the input, sometimes code design and separation of concerns prevent you from doing it properly. But then error handling should definitely be recoverable - imagine if your office program crashed because you hit backspace in an empty document or if your game aborted because you tried to shoot with an empty weapon.

And if exceptions are your preferred recoverable handling strategy, be careful: Exceptions are for exceptional situations only - most bad user input isn’t exceptional; all the programs I use would even argue that it is the norm. Use exceptions only when the user error is detected deep inside the call stack of possibly external code, occurs only rarely and is very severe. Otherwise return codes are the appropriate way of reporting the error.
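
As a minimal sketch of reporting bad user input through the return value, assume a hypothetical function parse_percentage() that validates a number typed by the user. The name and the 0 to 100 range are made up for illustration; the only library facility used is std::from_chars from C++17.

```cpp
#include <charconv>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical helper: parses a percentage (0 to 100) typed by the user.
// Bad input is the expected case, so it is reported through the return
// value (an empty optional) instead of an exception.
std::optional<int> parse_percentage(std::string_view input)
{
    int value = 0;
    const char* first = input.data();
    const char* last  = first + input.size();
    auto result = std::from_chars(first, last, value);

    if (result.ec != std::errc{} || result.ptr != last)
        return std::nullopt; // not a number, too large, or trailing garbage
    if (value < 0 || value > 100)
        return std::nullopt; // a number, but outside the accepted range
    return value;
}
```

The caller can then simply loop: ask for input, call parse_percentage(), and ask again as long as the result is empty.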

System errors

System errors cannot be predicted (usually). Furthermore, they are not deterministic and can occur on a program that worked on a previous run. Unlike user errors which solely depend on the input, they are true errors.

But do you use a recoverable or unrecoverable error handling strategy?

It depends.

Some argue that out-of-memory is an unrecoverable error. Often you do not even have the memory to handle the error! Thus you should just terminate the program immediately.

But crashing because the OS could not give you a socket isn’t really user-friendly. So then it would be nicer if you threw an exception and let some catch exit the program cleanly.
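
A rough sketch of that idea, under the assumption of a hypothetical open_connection() wrapper that always fails here to simulate the OS refusing the request. run_program() is also a made-up name for whatever main() would delegate to.

```cpp
#include <cstdlib>
#include <iostream>
#include <system_error>

// Hypothetical stand-in for a system call wrapper; in this sketch it always
// fails as if the process had run out of file descriptors.
void open_connection()
{
    throw std::system_error(
        std::make_error_code(std::errc::too_many_files_open),
        "open_connection");
}

// What main() could delegate to: a single catch near the top turns the
// failure into a message and a non-zero exit code instead of a crash.
int run_program()
{
    try
    {
        open_connection();
        // ... the rest of the program would go here ...
        return EXIT_SUCCESS;
    }
    catch (const std::system_error& error)
    {
        std::cerr << "fatal: " << error.what() << '\n';
        return EXIT_FAILURE;
    }
}
```

Because the exception unwinds the stack, destructors between the throw and the catch still run, which is what makes the exit clean.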

Throwing an exception isn’t always the right recoverable strategy to choose.

Some would even argue “never”.

If you want to retry the operation after it failed, wrapping the call in a try-catch inside a loop is slow. In that case, returning an error code and looping until the return value is okay is the right choice.
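
The retry loop could look roughly like this. try_connect() is a hypothetical operation that fails twice with a transient error before succeeding; std::error_code is just one possible choice for the return value.

```cpp
#include <system_error>

// Hypothetical operation: fails on the first two attempts with a transient
// error and succeeds afterwards. The returned error_code is the error channel.
std::error_code try_connect(int attempt)
{
    if (attempt < 3)
        return std::make_error_code(std::errc::resource_unavailable_try_again);
    return {}; // a default-constructed error_code means success
}

// Retries up to max_attempts times. Returns the number of the attempt that
// succeeded, or -1 if every attempt failed.
int connect_with_retry(int max_attempts)
{
    for (int attempt = 1; attempt <= max_attempts; ++attempt)
    {
        std::error_code ec = try_connect(attempt);
        if (!ec)
            return attempt;
    }
    return -1;
}
```

No exception is thrown or caught on the failing attempts; the loop only inspects a return value.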

If you write the API call just for yourself, you can simply pick the way needed for your situation and roll with it. But if you write a library, you do not know what the user wants. In part 2 I mentioned a strategy to deal with it. For potentially unrecoverable errors, you can use the “exception handler”; for the others you have to provide both variants.

Note that you should not use assertions that are only enabled in debug mode, obviously. System errors can happen in release builds, too!

Programming errors

Programming errors are the worst kind of errors. For the purpose of error handling I’m going to restrict myself to programming errors happening at a function call, i.e. bad parameters. Other kinds of programming errors can only be caught at runtime with the help of (debug) assertion macros sprinkled through your code.

There are two strategies for dealing with bad parameters: give them defined behavior or undefined behavior.

If the precondition of a function states that you must not pass in a bad parameter, doing so is “undefined behavior”, and does not need to be checked by the function itself but by the caller - the function should merely do a debug assertion.

If on the other hand a bad parameter is not part of the precondition, but instead the function documentation specifies that it will throw a bad_parameter_exception if you pass a bad parameter, passing a bad parameter has well-defined behavior (throwing an exception or some other recoverable error handling strategy) and the function does need to check it always.
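
To make the two strategies concrete, here is a small sketch with two hypothetical functions that differ only in how they treat a null pointer. The names are made up, and std::invalid_argument stands in for the bad_parameter_exception mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <stdexcept>

// Strategy 1: "str != nullptr" is a precondition. Passing nullptr is
// undefined behavior; the function only checks it when assertions are
// enabled (NDEBUG not defined).
std::size_t length_unchecked(const char* str)
{
    assert(str != nullptr && "precondition violated: str must not be null");
    return std::strlen(str);
}

// Strategy 2: passing nullptr is documented, defined behavior. The function
// checks in every build mode and throws.
std::size_t length_checked(const char* str)
{
    if (str == nullptr)
        throw std::invalid_argument("str must not be null");
    return std::strlen(str);
}
```

With the first variant the check disappears in release builds; with the second one every caller pays for it, even those that already know the pointer is valid.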

As an example, consider the std::vector<T> accessor functions: operator[] requires that the index is in the valid range, while at() is specified to throw an exception (std::out_of_range) if it is not. Furthermore, most standard library implementations provide a debug mode that checks the index of operator[], but technically an out-of-range index there is undefined behavior and does not need to be checked.
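
The difference can be sketched with two small helper functions. Both helpers are hypothetical and only wrap the real std::vector members.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// at() has defined behavior for a bad index: it throws std::out_of_range.
// Returns true if that happened for the given index.
bool at_throws_for(const std::vector<int>& values, std::size_t index)
{
    try
    {
        (void)values.at(index);
        return false;
    }
    catch (const std::out_of_range&)
    {
        return true;
    }
}

// operator[] has a precondition instead, so the caller does the range check
// before indexing.
int get_or(const std::vector<int>& values, std::size_t index, int fallback)
{
    return index < values.size() ? values[index] : fallback;
}
```

Note that get_or() never calls operator[] with a bad index; doing so would not throw, it would simply be undefined.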

Note: You do not necessarily need to throw an exception to make it defined behavior. As long as it is not listed in the function precondition, it is defined. Nothing stated in the preconditions needs to be checked by the function; violating a precondition is UB.

When do you make a parameter defined, when undefined behavior? In other words: When do you only check it with a debug assertion, when do you check it always?

Sadly, there is no satisfying answer; this is highly dependent on the situation. I only have a rule of thumb I follow when designing APIs. It is based on the observation that it is the caller’s responsibility to check the preconditions, not the callee’s. Thus a precondition should be “checkable” by the caller. A precondition is also “checkable” if it is easy to do an operation that always makes the parameter value correct. If this is possible for a parameter, it is a precondition and thus only checked via a debug assertion (or not at all if the check is expensive).
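
One way to picture a “checkable” precondition, using made-up functions: quadrant() requires an angle in [0, 360), and the caller can always establish that by calling normalize_degrees() first. Both names are hypothetical and only illustrate the rule of thumb.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: maps any finite angle in degrees into [0, 360).
// Calling it always makes the value valid for quadrant() below.
double normalize_degrees(double angle)
{
    double result = std::fmod(angle, 360.0);
    if (result < 0.0)
        result += 360.0;
    if (result >= 360.0) // rounding can push a tiny negative value up to 360
        result = 0.0;
    return result;
}

// Precondition: 0 <= angle < 360. It is cheap for the caller to guarantee,
// so the function only uses a debug assertion.
int quadrant(double angle)
{
    assert(angle >= 0.0 && angle < 360.0);
    return static_cast<int>(angle / 90.0) + 1;
}
```

A caller that gets the angle from an untrusted source writes quadrant(normalize_degrees(angle)); a caller that already knows the range skips the normalization and pays nothing.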

But the decision depends on a lot of other factors, so it is very difficult to make a general decision. By default, I tend to make it UB and only use an assertion. And sometimes it might even make sense to provide both versions like the standard library does with operator[] and at().

I consider it a mistake for this specific case though.

A note about the `std::exception` hierarchy

If you are using exceptions as your recoverable error handling strategy, it is recommended to create a new class and inherit it from one of the standard library exception classes.

Of the various classes, I suggest that you only inherit from one of these four:

std::bad_alloc: for allocation failures
std::runtime_error: for general runtime errors
std::system_error (derived from std::runtime_error): for system errors with an error code
std::logic_error: for programming errors that have defined behavior

Note that the standard library distinguishes between logic (i.e. programming) errors and runtime errors. Runtime errors are broader than system errors. To quote the standard, std::runtime_error is used for errors “detectable only when the program executes”. This doesn’t really help a lot. I personally use it for bad parameters that are not solely programming errors, but can also happen because of a user error that is only detected deep inside the call stack. For example, bad comment formatting in standardese results in a parsing exception derived from std::runtime_error; this is later caught at the appropriate level and results in a log output. But I wouldn’t use this class much otherwise, nor std::logic_error.
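
A sketch of that pattern, with made-up names: comment_parse_error, parse_comment() and process() are hypothetical and are not the actual classes or functions used in standardese.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical exception type derived from std::runtime_error that also
// carries the line number where the problem was found.
class comment_parse_error : public std::runtime_error
{
public:
    comment_parse_error(const std::string& message, int line)
    : std::runtime_error(message), line_(line) {}

    int line() const noexcept { return line_; }

private:
    int line_;
};

// Hypothetical low-level parser: rejects comments not starting with "///".
void parse_comment(const std::string& text, int line)
{
    if (text.rfind("///", 0) != 0)
        throw comment_parse_error("comment must start with '///'", line);
}

// Higher level: catches the error and turns it into a log message.
std::string process(const std::string& text, int line)
{
    try
    {
        parse_comment(text, line);
        return "ok";
    }
    catch (const comment_parse_error& error)
    {
        return "line " + std::to_string(error.line()) + ": " + error.what();
    }
}
```

Because the class derives from std::runtime_error, code that only knows about std::exception can still catch it and call what().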

Final guideline

There are two ways of handling errors:

a recoverable strategy uses exceptions or return values (depending on situation/religion)
a non-recoverable strategy logs an error and aborts the program

Assertions are a special kind of non-recoverable strategy that is only active in debug mode.

And there are three main sources of errors, each should be dealt with differently:

user errors shouldn’t be treated as errors in higher-level program parts; everything from the user should be checked and handled appropriately. Only in low-level parts that do not directly interact with the user can they be handled with an appropriate recoverable error handling strategy.
system errors can be handled with both a recoverable and a non-recoverable error handling strategy, depending on the kind of error and severity. Libraries should strive to be as flexible as possible, possibly using techniques outlined in part 2 of the series.
programming errors, i.e. bad parameters, can either be prohibited by preconditions, in which case the function should only use debug assertions to check them, or be given fully defined behavior, in which case the function should signal the error in an appropriate way. I’d go with making it UB by default and only define that the function checks the parameter if it is very difficult for the caller to check.

What’s next?

This was a very dry part without much concrete advice - but being more concrete isn’t really possible at this level of generality. I still thought it made sense to write down my thoughts as an introduction to the posts that follow.

Or I have already written.

In those posts I will outline concrete strategies for dealing with errors.

Part 2 - which is already published - describes techniques to handle system errors as flexibly as possible. The chronologically next part - part 3 - is going to talk about the implementation of assertions. And part 4 is going to talk about designing your interfaces in order to minimize preconditions, so look forward to those!

This blog post was written for my old blog design and ported over. If there are any issues, please let me know.