Handling a Systems Meltdown: 12 Steps

Your systems are melting down, your customer data is going down the pan, and everyone is panicking. What do you do?

Step 1: Take a Breath

It’s OK. You got this. It’s going to be OK.

Right, let’s get to it…

Step 2: Communicate, Communicate, Communicate

It’s tempting to jump in and start tackling the situation. But before you do, take a moment to think about who you need to inform about the ongoing situation.

People to consider:

  • Customers
  • The Service Desk
  • Managers

Who is best placed to handle that communication?

  • Not you. You’re managing the crisis.
  • Not the people most qualified to solve the problem. They’ll be too busy solving the problem.
  • No, pick the person who is least qualified to fix the issue, but who still has enough understanding that they can communicate what’s going on.

If you need to, borrow someone from another team to take on the communications role.

Step 3: Shut the Door

You and your team are going to have to concentrate.

There are lots of things that you could be doing that won’t solve the immediate crisis. If people start talking about them, kill the conversation.

  • This isn’t the time to do a root cause analysis
  • This isn’t the time to speculate about things you’ll do differently in the future
  • Your colleague’s personal crisis can (probably) wait
  • Planned work can be postponed
  • You can say sorry later

If you need to, post someone at the door to keep people out.

Step 4: Get Help

Who is going to help you solve the problem?

You don’t want too many people in your crisis team. Too many voices will make it hard to focus on solutions. Don’t be afraid of (politely) kicking people out if they’re not actively helping right now.

At the same time, you want your best people around you. Call them in, and ask them to get their tools ready.

As people arrive, don’t stop to explain everything. Say you’re in crisis, and give them enough information to start being useful. Then give them something to do.

Get people to play to their strengths:

  • Who are your best trouble-shooters? – Get them doing diagnostics.
  • Who is fast and accurate? – Get them at the coal face, ready to act.
  • Who is slower, but deals better with complexity? – Listen to them.
  • Who are your all-rounders? – Get them to help everyone else.
  • Who is the expert on the thing that’s broken? – That’s your technical lead.

Do you need help from other teams?

Step 5: Don’t Make Things Worse

In a crisis, it is very easy to start doing things that will either make the current situation worse, or which will destroy the evidence trail and make the clean-up operation harder.

Take steps to ensure this won’t happen.

For example:

  • Take a backup
  • Don’t act until you’re sure
  • Get someone else to double-check
  • Don’t rush
  • Do one thing at a time

If you’re working on live systems, don’t work alone. Have someone look over your shoulder, double-checking everything you do.
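As one illustration of "take a backup" and "double-check" from the list above, here is a minimal Python sketch that copies a single file before anyone touches it, then verifies the copy against the original. It assumes the thing at risk is an ordinary file on disk; a live database would need its own backup tooling. The function names and the `orders.db` file are hypothetical, chosen only for this example.

```python
import hashlib
import shutil
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def sha256(path: Path) -> str:
    """Return the SHA-256 checksum of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def backup_file(path: Path) -> Path:
    """Copy `path` to a timestamped .bak file beside it and verify the copy."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    copy = path.with_name(f"{path.name}.{stamp}.bak")
    shutil.copy2(path, copy)  # copy2 also preserves timestamps where it can
    # Double-check: the backup is only useful if it matches the original.
    if sha256(copy) != sha256(path):
        raise RuntimeError(f"backup of {path} does not match the original")
    return copy


# Demonstration on a throwaway file (hypothetical name and contents).
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "orders.db"
    target.write_bytes(b"pretend this is important data")
    saved = backup_file(target)
    print(f"Backed up {target.name} to {saved.name}")
```

Only start making changes once the backup call has returned without raising.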

Step 6: Keep Notes

As you work to solve the problem:

  • Keep track of what you do, and what the outcomes are
  • Create an action list for later
  • Keep key people informed
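
The three points above could be supported by something as simple as the following Python sketch: a timestamped list of actions and outcomes, a separate list of things to do later, and a plain-text summary to hand to whoever is doing communications. The `IncidentLog` class and the example entries are hypothetical; a shared document or chat channel does the same job.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentLog:
    """In-memory incident notes: what was done, what happened, what to do later."""
    entries: list = field(default_factory=list)
    follow_ups: list = field(default_factory=list)

    def record(self, action: str, outcome: str) -> str:
        """Record one action and its outcome with a UTC timestamp."""
        stamp = datetime.now(timezone.utc).strftime("%H:%M:%SZ")
        line = f"{stamp} | {action} | {outcome}"
        self.entries.append(line)
        return line

    def later(self, task: str) -> None:
        """Park a task for after the crisis instead of discussing it now."""
        self.follow_ups.append(task)

    def summary(self) -> str:
        """Build a plain-text update for the people who need to be kept informed."""
        lines = ["Actions so far:"] + [f"  {e}" for e in self.entries]
        lines += ["For later:"] + [f"  {t}" for t in self.follow_ups]
        return "\n".join(lines)


# Example entries (hypothetical).
log = IncidentLog()
log.record("Took orders service off-line", "writes stopped")
log.later("Review alert thresholds")
print(log.summary())
```

The "for later" list also gives you somewhere to put the off-topic conversations from Step 3 without losing them.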

Step 7: Clarify What’s Going On

If you don’t know what’s going on, you can’t fix it.

  • What is failing?
  • What is the impact on the business?
  • What are your key sources of information?
  • What has changed recently?
  • Is someone in another team already working on this?
  • What is your biggest risk?

If you don’t know, how can you find out?

Step 8: Restore Stability

Your first action isn’t to fix the underlying issue. It is to stop any data corruption from getting worse.

  • If you need to, take things off-line
  • If things have changed recently, consider rolling them back

Don’t try to be too clever. Take simple steps that are low risk, and easy to understand. This isn’t the time to write elegant, highly optimised code. Rather, focus on clear, effective code that will get the job done.

Do one thing at a time, and check the results after each step.
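
If your recovery steps are scripted, "one thing at a time, check after each step" might look like the Python sketch below. Each step pairs an action with a check, and the runner stops at the first check that fails rather than ploughing on. The `run_steps` helper, the `state` dictionary and the two example steps are all hypothetical stand-ins for your real systems.

```python
from typing import Callable, List, Tuple

# A step is a name, an action to perform, and a check that confirms it worked.
Step = Tuple[str, Callable[[], None], Callable[[], bool]]


def run_steps(steps: List[Step]) -> List[str]:
    """Run steps in order, checking after each one; stop at the first failed check."""
    done = []
    for name, action, check in steps:
        action()
        if not check():
            print(f"STOP: check failed after '{name}'")
            break
        print(f"ok: {name}")
        done.append(name)
    return done


# Stand-in for the real system, so the sketch can run anywhere.
state = {"online": True, "version": 2}


def take_offline() -> None:
    state["online"] = False


def roll_back() -> None:
    state["version"] = 1


steps = [
    ("take service off-line", take_offline, lambda: state["online"] is False),
    ("roll back last change", roll_back, lambda: state["version"] == 1),
]
completed = run_steps(steps)
```

Keeping the actions this plain is deliberate: anyone looking over your shoulder should be able to follow exactly what each step does.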

Step 9: Clean Up

Now that things are stable, take a moment. If you need to, go use the loo, and make a cup of tea.

Your next task is to restore anything that is still broken. Focus on data, then functionality.

Break the whole problem down into manageable chunks. Group similar problems together, and tackle them one group at a time.

Step 10: Reflect

Right, everything’s back up and running.

Take a longer break, but not too long. You need to do the following as soon as possible after the crisis, while things are fresh in your mind:

Root Cause Analysis

This is where you figure out what went wrong. Chances are there was more than one cause, and that each cause was in turn caused by something else.

Make a plan:

  • to reduce the likelihood of something like this happening again
  • to reduce the impact of something like this happening again

Lessons Learned

What have you learned from the way the crisis was managed?

  • What went well?
  • What will you do differently next time?
  • How will you remember?

Step 11: Appreciate Your People

Now that the crisis is over, take time to understand the impact the crisis had on people, and how people contributed to resolving it.

Thank People

  • Who helped
  • Who were ready to help

Say Sorry to People

  • Who lost functionality
  • Whose work was delayed
  • Who you ignored

Step 12: Rest

Take time to charge your batteries, and get back to normal. Tomorrow is another day.
