Organization of application errors in the production environment

Error handling is an inseparable part of the software development process. We have established ways to prevent, detect and fix errors, but despite everything, they keep happening in production systems.

I have recently found an interesting quote that inspired me to write a few words related to the subject of errors:

The best error message is the one that never shows up.

Thomas Fuchs

This quote can be considered on two levels: reducing the frequency of errors - and handling them gracefully when they occur. I will focus on the second aspect.

When designing solutions, we often forget that things may go wrong. As a result, when they do, the entire guts of the system are spilled to the world through stack traces and the like. I have seen many examples of how such errors 'came back' to haunt the end users. This is the reason why I want to share simple methods of how to 'cover them up'.

Let’s look at the type of software that we all deal with every day - web applications, accessed via web browsers.

No errors

In perfect conditions, error messages don’t appear because there are no errors in the application that would cause them to be displayed. Sounds so simple...

As people responsible for software development, we should strive to ensure that our code has the lowest number of defects possible. The time and cost of repairing them increase the later we detect them and the more complexity we have in the code (e.g. strong dependencies or duplication of the wrong code). In the worst-case scenario, some errors never get fixed, because the cost of doing so exceeds the value of the application. In such cases the users are typically forced to live with them.

But, Dear Reader, I have a question for you: Do you know of any meaningful piece of software (so not Hello World) which is free of errors?

We have often witnessed huge consequences that resulted from minor mistakes. The list with those types of bugs goes on and on. But it only proves that even with large budgets, complicated procedures and repeated testing, errors make it to production releases anyway.

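A harsh example of this is the maiden launch of the Ariane 5 rocket. An attempt to convert a 64-bit floating-point number to a 16-bit signed integer caused an overflow error that led to a change in the flight trajectory and, ultimately, to the self-destruction of the rocket. In an instant, a rocket and payload worth hundreds of millions of dollars, the product of a development program reported to have cost around 7 billion (!) dollars, went up in smoke.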
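To make this class of bug concrete, here is a minimal Python sketch of a narrowing conversion. The function names and the sample value are invented for illustration, and this is in no way the rocket's actual flight code; it only shows how a value that does not fit into 16 bits can either be silently mangled or be rejected explicitly.

```python
import ctypes

INT16_MIN, INT16_MAX = -32768, 32767


def unchecked_to_int16(value: float) -> int:
    """Narrow without any range check: only the low 16 bits survive."""
    return ctypes.c_int16(int(value)).value


def checked_to_int16(value: float) -> int:
    """Refuse to convert values that do not fit into a signed 16-bit integer."""
    if not INT16_MIN <= value <= INT16_MAX:
        raise OverflowError(f"{value} does not fit into a signed 16-bit integer")
    return int(value)


reading = 40000.0  # hypothetical measurement, larger than INT16_MAX
wrapped = unchecked_to_int16(reading)  # silently becomes a negative number

try:
    checked_to_int16(reading)
except OverflowError as error:
    # The caller at least gets a chance to react to the problem.
    print(f"Conversion rejected: {error}")
```

The checked variant does not make the error disappear, but it turns a silent corruption into a condition that the calling code can handle deliberately.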

Avoiding all mistakes is next to impossible. Even when devoting huge resources to software development and applying strong verification techniques, counters may unexpectedly turn, buffers may overflow and interfaces may fail.

Consequences of errors are usually not as catastrophic as described above, but nevertheless, it's always a good idea to give feedback and leave the end user feeling that this great team remains in control of the situation!

Errors disclosed to the end user

There are certain types of information that the user can perceive as an error. This group includes: lacking permissions, missing subpages, and links with expired tokens.

In many cases, we intentionally inform the user that something has gone wrong - but is this always perceived in accordance with our intentions?

Our error messages are the first point of contact between the user and a potential error. That is why we should make sure that the user feels that everything is under control. We can achieve this by presenting clear messages.

StackOverflow not only informs about the lack of a page but also gives four useful links.

The first step here is not only to choose the errors that we will inform the users about, but also to decide what should be hidden from them. Maybe pages that are only accessible to administrators should not return a 403 status? How will an ordinary user know from such a message that they spelled the admin/news address in our application correctly, but are just not entitled to see it?
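One possible policy, sketched below in Python, is to report "forbidden" results for hidden areas as "not found". The function name, the set of masked statuses and the /admin prefix are assumptions made for this illustration; your own policy may well differ.

```python
from http import HTTPStatus

# Hypothetical policy: results we do not want to confirm to the public.
MASKED_AS_NOT_FOUND = {HTTPStatus.FORBIDDEN}


def public_status(internal: HTTPStatus, path: str,
                  hidden_prefixes=("/admin",)) -> HTTPStatus:
    """Return the status the visitor sees for a given internal result."""
    if internal in MASKED_AS_NOT_FOUND and path.startswith(hidden_prefixes):
        # Do not reveal that the hidden page exists at all.
        return HTTPStatus.NOT_FOUND
    return internal


# An ordinary user probing the admin area gets a plain "not found".
print(public_status(HTTPStatus.FORBIDDEN, "/admin/news"))
```

Keeping the decision in one small function makes the policy easy to write down, review and explain to the team.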

I encourage you to think about it and adopt a targeted policy for dealing with specific errors. Write it down and educate your team in this area.

Then, prepare website-specific error pages with appropriate messages to inform the users about what happened. There is a ton of design advice online for 404 error pages. What’s most important is that the default error pages from Apache/nginx are replaced with views explicitly prepared for this occasion - and that they return the HTTP response code that matches the prepared subpage.
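As a rough illustration of the last point, the Python sketch below serves a custom error page through a tiny WSGI application and makes sure that the status line matches the page. The messages and the single known route are placeholders; a real site would render branded templates.

```python
from http import HTTPStatus

# Hypothetical friendly messages for the custom error pages.
ERROR_MESSAGES = {
    HTTPStatus.NOT_FOUND: "We could not find that page.",
    HTTPStatus.INTERNAL_SERVER_ERROR: "Something went wrong on our side.",
}


def error_response(status: HTTPStatus):
    """Build (status line, headers, body) for a custom error page."""
    message = ERROR_MESSAGES.get(status, "The request could not be completed.")
    body = f"<html><body><h1>{message}</h1></body></html>".encode("utf-8")
    status_line = f"{status.value} {status.phrase}"
    headers = [("Content-Type", "text/html; charset=utf-8"),
               ("Content-Length", str(len(body)))]
    return status_line, headers, body


def app(environ, start_response):
    """Tiny WSGI app: one known page, everything else gets the custom 404."""
    if environ.get("PATH_INFO") == "/":
        body = b"<html><body>Home</body></html>"
        start_response("200 OK", [("Content-Type", "text/html; charset=utf-8"),
                                  ("Content-Length", str(len(body)))])
        return [body]
    status_line, headers, body = error_response(HTTPStatus.NOT_FOUND)
    start_response(status_line, headers)
    return [body]
```

The important detail is that the friendly page is still sent with a 404 status line rather than 200, so browsers, crawlers and monitoring tools see the same truth as the visitor.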

Remember that the default 4xx sub-page (e.g. in the Apache server) can be valuable for pentesters. It reveals technical details that can be used to look for suitable exploits.

Errors hidden from the end user

What to do when the application fails because, for example, an exception is not caught correctly?

One thing is sure: do not display the view of the framework's so-called exception handler. It is the most revealing view that can be presented to the user. It reveals code architecture as well as variable and method names. Even worse, it may display the data passed to methods, or even the full configuration of the application.

I will use the example of the Whoops library, very popular among applications written in PHP. Its purpose is to help developers deal with errors and exceptions.

You can also preview the demo version

It’s surely helpful in development, but imagine what would happen if this way of presenting errors ended up in a production environment. Each error would hand out a treasure trove of information that should be inaccessible to outsiders. Non-technical people will not know what happened and where those hieroglyphics came from. A technical person, however, may use this information to your disadvantage.

Therefore, in production environments or those made public during beta testing with end users, you have to ensure that details are excluded from the exception handler! Period. Usually, this can be done by using a flag in the configuration.
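The idea of such a flag can be sketched in a few lines of Python. The APP_DEBUG variable name and both function names are assumptions made for this example, not the settings of any particular framework; the point is that details stay hidden unless they are switched on explicitly.

```python
import os
import traceback

GENERIC_MESSAGE = "Something went wrong. Please try again later."


def debug_enabled(env=os.environ) -> bool:
    """Details are off unless APP_DEBUG (a hypothetical variable) is exactly '1'."""
    return env.get("APP_DEBUG", "0") == "1"


def render_error(exc: BaseException, debug: bool) -> str:
    """Return the text shown to the visitor for an unhandled exception."""
    if debug:
        # Development only: the full traceback with file and function names.
        return "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    return GENERIC_MESSAGE


try:
    {}["db_password"]  # simulate a failure deep inside the application
except KeyError as error:
    public_page = render_error(error, debug=False)
    developer_page = render_error(error, debug=True)
```

Because the default is "off", forgetting to set the variable on a new server results in a generic message rather than a leaked stack trace.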

You can typically limit access to the full error view, e.g. to selected IP addresses, but it’s easy to comment out and forget such a setting during fast-paced development. I definitely discourage this approach. It's much better to leave a blank page (though you should get rid of it as well) than to reveal full information, such as that presented by the Whoops library mentioned above!

The perfect solution, however, is the dedicated error page discussed in the previous section. This type of error page shows that something has actually failed on the application side, but in a brief and humorous way.

Source: daxiongmao.eu

Whatever the problem is - establishing a connection to an external service or an uncaught exception in the business logic - is irrelevant. There’s no point in revealing what caused the application to fail when the end user can do nothing about it.

With an error page constructed like this, the only thing we communicate is – we have foreseen errors! A bit of an ironic statement, when you think about it.

An additional trick for sneaking in some information that may help to investigate an incident can be seen in the clever Netflix error page above. As you’ll notice, there is extra information at the bottom of the page, such as the build ID, but at first glance it is invisible to the user - the text has the same color as the background.
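A related approach, sketched below in Python, is to generate a short incident reference, write it to the log together with the full exception, and print only the reference on the error page. The BUILD_ID value, the logger name and the CSS class are placeholders invented for this example.

```python
import logging
import uuid

logger = logging.getLogger("app.errors")  # hypothetical logger name

BUILD_ID = "build-unknown"  # placeholder; normally injected at deploy time


def handle_unexpected_error(exc: Exception) -> str:
    """Log the full details and return a page with only an opaque reference."""
    incident_id = uuid.uuid4().hex[:12]
    # The stack trace goes to the log, keyed by the incident reference.
    logger.error("incident=%s build=%s", incident_id, BUILD_ID, exc_info=exc)
    return (
        "<h1>Something went wrong</h1>"
        f'<p class="incident-ref">Reference: {incident_id} ({BUILD_ID})</p>'
    )


page = handle_unexpected_error(ValueError("boom"))
```

A user who contacts support can quote the reference, and the team can find the matching log entry without the page ever exposing technical details.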

Errors visible for the team

So, after we hide our error details, the user will not know what actually happened - only that something went wrong. Naturally, this doesn’t mean that we can now forget about the error. As a team, we are responsible for supporting the application as it runs in production. We have to know what is going on with it, verify undesirable behaviors and eliminate them.

Technical aspects of the error are hidden from the user, but the full knowledge of what happened has to be available to the development team.

To avoid fishing blindly for errors, we collect error information in various forms, including stack trace files and external logs collected using aggregation tools (like ELK Stack) or SaaS solutions.
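As a minimal sketch of what such collection can look like, the Python example below writes each error as one JSON object per line, including the stack trace, which is a shape that log aggregation tools can usually ingest. The logger name and the field names are assumptions for this example, and an in-memory stream stands in for a real log file or shipper.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Format each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # Keep the full stack trace for the team, not for the end user.
            entry["stack_trace"] = self.formatException(record.exc_info)
        return json.dumps(entry)


stream = io.StringIO()  # stands in for a log file or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("shop.checkout")  # hypothetical logger name
logger.addHandler(handler)
logger.propagate = False

try:
    1 / 0  # simulate a failure in the business logic
except ZeroDivisionError:
    logger.exception("Order total could not be computed")

logged = json.loads(stream.getvalue())
```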

One of the solutions worth mentioning is Sentry. It offers, among other things, support for the most popular programming platforms (backend & frontend) as well as integration with external services.

Sentry’s well-designed interface makes it easy to search, analyse and view the details of the intercepted errors. You can see example errors collected from a web frontend on the screen above.

Different methods of notifying the team through a selected communication channel

Going a step further, setting up error notifications over messenger, e-mail, or text messages is also worth considering - but be careful with the quantity and validity of the messages that are sent.

I remember that in one of my projects all uncaught exceptions and errors, down to every single PHP notice, got sent by e-mail to all team members. As time passed, this entirely desensitized us to those messages. In the end we simply configured filters in our e-mail clients to send those error messages straight to the trash bin. I won’t justify this behavior, but be warned: redundant error messages cause development teams to act along the lines of the broken windows theory.
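One way to keep the volume in check is to throttle repeated notifications for the same error. The Python sketch below is only an illustration under assumed names: the ThrottledNotifier class, the fingerprint strings and the one-hour interval are made up for this example, and the send callback stands for whatever channel the team uses.

```python
import time


class ThrottledNotifier:
    """Forward a notification only if the same error was not reported recently."""

    def __init__(self, send, min_interval_seconds=3600.0, clock=time.monotonic):
        self._send = send
        self._min_interval = min_interval_seconds
        self._clock = clock
        self._last_sent = {}  # fingerprint -> time of the last notification

    def notify(self, fingerprint: str, message: str) -> bool:
        """Return True if the message was sent, False if it was suppressed."""
        now = self._clock()
        last = self._last_sent.get(fingerprint)
        if last is not None and now - last < self._min_interval:
            return False
        self._last_sent[fingerprint] = now
        self._send(message)
        return True


outbox = []  # stands in for e-mail, chat or text messages
notifier = ThrottledNotifier(outbox.append)
notifier.notify("checkout:ZeroDivisionError", "Checkout failed")
notifier.notify("checkout:ZeroDivisionError", "Checkout failed again")  # suppressed
```

With this in place the second call does not reach the outbox, so the team hears about a new kind of failure once instead of hundreds of times.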

Summary

As the key takeaways from this article, let me advise you to do the following:

Prepare dedicated error subpages.
Hide technical details of the error from the end users.
Make sure the team knows what errors occur in the application.

Logging errors makes no difference by itself. Use the suggested solutions to achieve a better user experience and to improve your application.

Finally, if you’d like to read more about good practices for creating error messages, the following two articles are worth reading: