Your systems are melting down, your customer data is going down the pan, and everyone is panicking. What do you do?
Step 1: Take a Breath
It’s OK. You got this. It’s going to be OK.
Right, let’s get to it…
Step 2: Communicate, Communicate, Communicate
It’s tempting to jump in and start tackling the situation. But before you do, take a moment to think about who you need to inform about the ongoing situation.
People to consider:
- Customers
- The Service Desk
- Managers
Who is best placed to handle that communication?
- Not you. You’re managing the crisis.
- Not the people most qualified to solve the problem. They’ll be too busy solving the problem.
- No, pick the person who is least qualified to fix the issue, but who still has enough understanding that they can communicate what’s going on.
If you need to do so, borrow someone else from another team to take on the communications role.
Step 3: Shut the Door
You and your team are going to have to concentrate.
There are lots of things that you could be doing that won’t solve the immediate crisis. If people start talking about them, kill the conversation.
- This isn’t the time to do a root cause analysis
- This isn’t the time to speculate about things you’ll do differently in the future
- Your colleague’s personal crisis can (probably) wait
- Planned work can be postponed
- You can say sorry later
If you need to, post someone at the door to keep people out.
Step 4: Get Help
Who is going to help you solve the problem?
You don’t want too many people in your crisis team. Too many voices will make it hard to focus on solutions. Don’t be afraid of (politely) kicking people out if they’re not actively helping right now.
At the same time, you want your best people around you. Call them in, and ask them to get their tools ready.
As people arrive, don’t stop to explain everything. Say you’re in crisis, and give them enough information to start being useful. Then give them something to do.
Get people to play to their strengths:
- Who are your best troubleshooters? – Get them doing diagnostics.
- Who is fast and accurate? – Get them at the coal face, ready to act.
- Who is slower, but deals better with complexity? – Listen to them.
- Who are your all-rounders? – Get them to help everyone else.
- Who is the expert on the thing that’s broken? – That’s your technical lead.
Do you need help from other teams?
Step 5: Don’t Make Things Worse
In a crisis, it is very easy to start doing things that will either make the current situation worse or destroy the evidence trail and make the clean-up operation harder.
Take steps to ensure this won’t happen.
For example:
- Take a backup
- Don’t act until you’re sure
- Get someone else to double-check
- Don’t rush
- Do one thing at a time
If you’re working on live systems, don’t work alone. Have someone look over your shoulder, double-checking everything you do.
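As a rough illustration of "take a backup" before touching anything, here is a minimal Python sketch. It assumes the thing you are about to change is a single file on disk; the helper name backup_before_change and the ".bak" naming scheme are made up for this example, and a real system (a database, say) will need its own backup tooling.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path


def backup_before_change(path: Path) -> Path:
    """Copy `path` to a timestamped sibling file and return the copy's location.

    Hypothetical helper: the naming scheme is an assumption for illustration.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    backup = path.with_name(f"{path.name}.{stamp}.bak")
    if backup.exists():
        # Never overwrite an earlier backup: it may be the only good copy left.
        raise FileExistsError(f"refusing to overwrite existing backup: {backup}")
    shutil.copy2(path, backup)  # copy2 also tries to preserve file metadata
    return backup
```

Call it first, note down the path it returns, and only then make your change. If the change goes wrong, you still have the evidence and a way back.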
Step 6: Keep Notes
As you work to solve the problem:
- Keep track of what you do, and what the outcomes are
- Create an action list for later
- Keep key people informed
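A notebook or a shared chat channel is usually enough for this. If you would rather keep the notes in a script, the sketch below shows one possible shape: a timestamped record of each action and its outcome, plus a separate list of things to deal with later. The class name IncidentLog and its methods are invented for this example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentLog:
    """Hypothetical in-memory log: what was done, what happened, what to do later."""

    entries: list = field(default_factory=list)
    follow_ups: list = field(default_factory=list)

    def record(self, action: str, outcome: str) -> None:
        # Store the UTC time alongside the action so the sequence can be rebuilt later.
        now = datetime.now(timezone.utc).isoformat(timespec="seconds")
        self.entries.append((now, action, outcome))

    def defer(self, task: str) -> None:
        # Park anything that does not help right now on the action list.
        self.follow_ups.append(task)

    def summary(self) -> str:
        # One line per action, followed by the deferred tasks.
        lines = [f"{t}  {a} -> {o}" for t, a, o in self.entries]
        lines += [f"TODO  {task}" for task in self.follow_ups]
        return "\n".join(lines)
```

The summary text can be pasted straight into an update for the people you are keeping informed, and it gives you a starting timeline when you come to reflect afterwards.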
Step 7: Clarify What’s Going On
If you don’t know what’s going on, you can’t fix it.
- What is failing?
- What is the impact on the business?
- What are your key sources of information?
- What has changed recently?
- Is someone in another team already working on this?
- What is your biggest risk?
If you don’t know, how can you find out?
Step 8: Restore Stability
Your first action isn’t to fix the underlying issue. It is to stop any data corruption from getting worse.
- If you need to, take things off-line
- If things have changed recently, consider rolling them back
Don’t try to be too clever. Take simple steps that are low risk, and easy to understand. This isn’t the time to write elegant, highly optimised code. Rather, focus on clear, effective code that will get the job done.
Do one thing at a time, and check the results after each step.
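To make "one thing at a time, check after each step" concrete, here is a small Python sketch. It assumes each recovery step can be expressed as an action plus a check that returns True when the step worked; the function name run_steps and the tuple layout are assumptions for this example. It deliberately does nothing clever: it stops at the first failed check so that a person can decide what happens next.

```python
from typing import Callable, List, Tuple

# Each step is (name, action, check). The layout is an assumption for this sketch.
Step = Tuple[str, Callable[[], None], Callable[[], bool]]


def run_steps(steps: List[Step]) -> List[str]:
    """Run each step in order, verify it, and stop at the first failed check."""
    completed = []
    for name, action, check in steps:
        action()
        if not check():
            # Do not carry on past an unverified step; hand back to a human.
            print(f"check failed after '{name}'; stopping")
            break
        completed.append(name)
    return completed
```

The returned list tells you which steps were verified, which is also worth writing into your notes.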
Step 9: Clean Up
Now that things are stable, take a moment. If you need to, go use the loo, and make a cup of tea.
Your next task is to restore anything that is still broken. Focus on data, then functionality.
Break the whole problem down into manageable chunks. Group similar problems together, and tackle them one group at a time.
Step 10: Reflect
Right, everything’s back up and running.
Take a longer break, but not too long. You need to do the following as soon as possible after the crisis, while things are fresh in your mind:
Root Cause Analysis
This is where you figure out what went wrong. Chances are there was more than one cause, and that each cause was in turn caused by something else.
Make a plan:
- to reduce the likelihood of something like this happening again
- to reduce the impact of something like this happening again
Lessons Learned
What have you learned from the way the crisis was managed?
- what went well?
- what will you do differently next time?
- how will you remember?
Step 11: Appreciate Your People
Now that the crisis is over, take time to understand the impact the crisis had on people, and how people contributed to resolving it.
Thank People
- who helped
- who were ready to help
Say Sorry to People
- who lost functionality
- whose work was delayed
- who you ignored
Step 12: Rest
Take time to charge your batteries, and get back to normal. Tomorrow is another day.