Dealing with production incidents

A software incident is defined as an unplanned interruption to an IT service or a reduction in the quality of an IT service. Any deviation from its normal or usual way of operation is an incident. The process to handle these incidents is called the incident management process.

This is the official definition. In most cases in my career, though, incidents could be described as the software version of a real-life fire in a bordello: something unexpected, potentially dangerous, and all of a sudden all the fun is replaced by panic. Oh, and those senior managers suddenly remembering your name. In the worst case they’ll misspell it, but they will start noticing you. Just to get out of the burning bordello.

There are multiple ways of approaching incidents, and many steps to get to a controllable incident. The first is always having some sort of organization. A tool. Or at least a document. Somewhere. If this is not made available, resolving incidents all too often means relying on a group of volunteers. People who care. That is really not sustainable long-term, especially as your organization grows.

Try harder next time… is not a problem management method mentioned in ITIL.

Denis Matte


Image: ❗️ TOP 5 Software Failures of 2018–2019 ❕ (#5 is pretty alarming) 😧, by Katerina Sand, CheckiO Blog (source: https://blog.checkio.org/%EF%B8%8F-top-5-software-failures-of-2018-2019-5-is-pretty-alarming-2a5400b01658)

Before we proceed, it would be good to understand what we are facing. What causes incidents? Understanding what you might be dealing with is important in building a solid incident response. We can distinguish a few types of underlying issues behind incidents. For simplicity, here is how I categorize them (in order of importance):

  • Signals (most of what can be easily monitored, covered by the SRE four golden signals; see the sketch right after this list)
  • Exceptions (most code bugs fall into this category, as long as they are raised properly)
  • Silent assassins (the catch{empty;} blocks in your code; and the situations where – oops we paid 300mln on a non-deliverable date, 3 months ago and nothing went through)
  • Fake kingpins (everything listed as error or fatal, but in fact being ignorable; the most despicable practice of using exceptions as logs)
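
To make the signals category a bit more tangible, here is a minimal sketch, purely my own illustration rather than any particular tool's code, of how the four golden signals could be boiled out of a window of request records. The Request fields and the golden_signals function are assumptions made for the example.

```python
from dataclasses import dataclass
from statistics import quantiles

@dataclass
class Request:
    duration_ms: float      # how long the request took
    status: int             # HTTP status code
    cpu_utilisation: float  # CPU utilisation sampled while serving it

def golden_signals(requests: list[Request], window_seconds: float) -> dict:
    """Condense one monitoring window into the four golden signals."""
    latencies = [r.duration_ms for r in requests]
    return {
        # Latency: 99th percentile response time
        "latency_p99_ms": quantiles(latencies, n=100)[98],
        # Traffic: requests per second in the window
        "traffic_rps": len(requests) / window_seconds,
        # Errors: share of requests that failed
        "error_rate": sum(r.status >= 500 for r in requests) / len(requests),
        # Saturation: how "full" the service is (peak CPU here)
        "saturation_cpu": max(r.cpu_utilisation for r in requests),
    }
```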

The first two are something you should worry about daily, each time they appear. Silent assassins can be made less likely by forbidding the suppression of exceptions and errors, or simply by introducing good coding practices. The same goes for the fake kingpins. But c’mon, who would do that to their colleagues? I just added it there so I know when it runs… ten years ago … and I have since left the company, leaving no documentation behind.
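
The catch{empty;} block mentioned above has an exact analogue in most languages. Below is a hedged Python sketch of a silent assassin and a fake kingpin next to a variant that fails loudly; the payment example and the send_to_bank call are made up for illustration.

```python
import logging

logger = logging.getLogger("payments")

def send_to_bank(order):
    """Stand-in for an external call that can fail (hypothetical)."""
    raise TimeoutError("bank gateway did not respond")

def submit_payment_silent_assassin(order):
    try:
        send_to_bank(order)
    except Exception:
        pass  # swallowed: nothing paged, nothing logged, the payment silently never happened

def submit_payment_fake_kingpin(order):
    logger.error("submit_payment called")  # not an error at all, just noise burying the real ones
    send_to_bank(order)

def submit_payment(order):
    try:
        send_to_bank(order)
    except Exception:
        # raised properly: it surfaces in monitoring and can wake someone up
        logger.exception("payment submission failed for order %s", order)
        raise
```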

With the above, you should have a sufficient understanding of what constitutes step two in building a successful incident response.


When you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meager and unsatisfactory kind. It may be the beginning of knowledge, but you have scarcely, in your thoughts, advanced to the stage of science.

William Thomson, Lord Kelvin (1824-1907)

The next, third step revolves around metrics. I am not suggesting ITIL or any other specific approach. OK, I might be leaning towards DORA metrics (which I have yet to write about), but again, any metrics will work, as long as you start using them in a reasonable way.

In the first step – the process/document – you would have laid out the basics. How to deal with stuff. The second part tells you what kind of stuff you might be dealing with. The third step will tell you how good or bad you are at it.

In my work I highly recommend treating metrics as a guide, not as a KPI. They help us formulate questions. Imagine you are measuring the number of production incidents you have. You might make the mistake of setting a goal to have fewer incidents. This would be wrong.

The right approach would be asking questions such as: What practices are we missing? What are we not allowing our engineers to do? Are we making the right choices in our hosting vendor selections? Maybe we are pushing teams too hard, so they drop quality? Maybe we don’t understand quality?

Answers and actions, triggered by the metrics and intended to improve them, can work wonders. Also, numbers staring you in the face will help communicate that there might be a problem.
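
As a small illustration of numbers in your face, here is a minimal sketch, assuming you keep an incident log with opened and resolved timestamps, of how a monthly incident count and a mean time to restore could be computed. The Incident record and its fields are my own assumption, not a reference to any specific tool.

```python
from dataclasses import dataclass
from datetime import datetime
from collections import Counter

@dataclass
class Incident:
    opened: datetime
    resolved: datetime

def incidents_per_month(incidents: list[Incident]) -> Counter:
    """Count incidents by the month in which they were opened."""
    return Counter(i.opened.strftime("%Y-%m") for i in incidents)

def mean_time_to_restore_hours(incidents: list[Incident]) -> float:
    """Average time from opening an incident to restoring service, in hours."""
    total = sum((i.resolved - i.opened).total_seconds() for i in incidents)
    return total / len(incidents) / 3600

log = [
    Incident(datetime(2022, 3, 1, 9, 0), datetime(2022, 3, 1, 11, 0)),
    Incident(datetime(2022, 3, 14, 22, 0), datetime(2022, 3, 15, 2, 0)),
    Incident(datetime(2022, 4, 2, 8, 0), datetime(2022, 4, 2, 9, 0)),
]
print(incidents_per_month(log))           # Counter({'2022-03': 2, '2022-04': 1})
print(mean_time_to_restore_hours(log))    # roughly 2.3 hours
```

Neither number is a target in itself; both are prompts for the questions above.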

I’d also recommend considering some manufacturing metrics. Manufacturing is by far the most advanced field in terms of metrics and management methodologies; puny software science and empirical evidence is still a bit behind, but it gets better year on year :).


The last step is giving your engineers clear expectations. I classify them into two simple preconditions and four simple rules.

The preconditions:

  • Provide your teams with clear guidance or documentation of what the systems being monitored do
  • You built it, you own it. End to end. No intermediaries.

The simple rules:

  • Before starting any work, ask the question loud and clear: what has changed in code, infra, or external factors? Focus on answering this question first, before you get sidetracked by the fake kingpins. Ask your vendors for service status pages.
  • Then look for anomalies in your monitoring stats. Yes, this means you need reliable and easily accessible monitoring tools. Note: monitoring through chats is not an option nowadays.
  • Provide a simple guide on how to check whether the issue is severe. In simple steps (a minimal sketch follows this list).
  • Make sure you have listed the right escalation contacts in your process document: people who are eligible to wake other people up, or at least get others’ attention right away.
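
To show what the severity guide from the list above could look like in its simplest form, here is a hedged sketch; the questions, thresholds, and severity labels are entirely my own assumptions and would have to be adapted to your organization.

```python
def assess_severity(customers_affected: int, money_at_risk: bool, workaround_exists: bool) -> str:
    """Toy severity guide: three questions, one clear next step. Thresholds are illustrative."""
    if money_at_risk or customers_affected > 1000:
        return "SEV-1: wake up the escalation contacts now"
    if customers_affected > 50 and not workaround_exists:
        return "SEV-2: fix within business hours, inform stakeholders"
    return "SEV-3: log it and handle it through normal prioritisation"

# A responder answers three questions and immediately knows what to do next:
print(assess_severity(customers_affected=200, money_at_risk=False, workaround_exists=False))
```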

The above should then be sufficient for any engineer to respond and assess the risk. It does not mean you will always get it right. You will need to adjust as you learn. But you increase the chance of limiting impact or preventing incidents altogether.

What also helps is performing game days often: throwing a problem into the system, observing how your teams react, and then learning and adjusting from there.
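
A game day does not need heavyweight chaos tooling to begin with. Below is a minimal, hypothetical sketch of injecting failures into one dependency; the flaky decorator and the fetch_exchange_rates function are made up for the example.

```python
import functools
import random

def flaky(failure_rate: float):
    """Decorator that randomly fails a call, simulating a misbehaving dependency."""
    def wrap(func):
        @functools.wraps(func)
        def inner(*args, **kwargs):
            if random.random() < failure_rate:
                raise ConnectionError(f"game day: injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return inner
    return wrap

@flaky(failure_rate=0.3)
def fetch_exchange_rates():
    """Stand-in for a real downstream call."""
    return {"EUR/USD": 1.08}

# Run normal traffic through the flaky dependency and observe:
# do the alerts fire, does the runbook hold up, who gets paged?
for _ in range(10):
    try:
        fetch_exchange_rates()
    except ConnectionError as exc:
        print(exc)
```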


To summarize today, dealing with incidents means:

  • Having a document or a process, so that it is clear how to handle incidents
  • Knowing your enemy, in other words understanding what kind of underlying issues you might be dealing with
  • Having metrics, so that you can start asking the right questions
  • Setting clear expectations for your engineers and continuously training them in incident resolution
  • Writing defect-free code (there will always be bugs and incidents; good code may limit the chance though)
