Handling a Systems Meltdown: 12 Steps

Your systems are melting down, your customer data is going down the pan, and everyone is panicking. What do you do?

Step 1: Take a Breath

It’s OK. You got this. It’s going to be OK.

Right, let’s get to it…

Step 2: Communicate, Communicate, Communicate

It’s tempting to jump in and start tackling the situation. But before you do, take a moment to think about who you need to inform about the ongoing situation.

People to consider:

  • Customers
  • The Service Desk
  • Managers

Who is best placed to handle that communication?

  • Not you. You’re managing the crisis.
  • Not the people most qualified to solve the problem. They’ll be too busy solving the problem.
  • No, pick the person who is least qualified to fix the issue, but who still has enough understanding that they can communicate what’s going on.

If you need to do so, borrow someone else from another team to take on the communications role.

Step 3: Shut the Door

You and your team are going to have to concentrate.

There are lots of things that you could be doing that won’t solve the immediate crisis. If people start talking about them, kill the conversation.

  • This isn’t time to do a root cause analysis
  • This isn’t time to speculate about things you’ll do differently in the future
  • Your colleague’s personal crisis can (probably) wait
  • Planned work can be postponed
  • You can say sorry later

If you need to, post someone at the door to keep people out.

Step 4: Get Help

Who is going to help you solve the problem?

You don’t want too many people in your crisis team. Too many voices will make it hard to focus on solutions. Don’t be afraid of (politely) kicking people out if they’re not actively helping right now.

At the same time, you want your best people around you. Call them in, and ask them to get their tools ready.

As people arrive, don’t stop to explain everything. Say you’re in crisis, and give them enough information to start being useful. Then give them something to do.

Get people to play to their strengths:

  • Who are your best trouble-shooters? – Get them doing diagnostics.
  • Who is fast and accurate? – Get them at the coal face, ready to act.
  • Who is slower, but deals better with complexity? – Listen to them.
  • Who are your all-rounders? – Get them to help everyone else.
  • Who is the expert on the thing that’s broken? – That’s your technical lead.

Do you need help from other teams?

Step 5: Don’t Make Things Worse

In a crisis, it is very easy to start doing things that will either make the current situation worse, or which will destroy the evidence trail and make the clean-up operation harder.

Take steps to ensure this won’t happen.

For example:

  • Take a backup
  • Don’t act until you’re sure
  • Get someone else to double-check
  • Don’t rush
  • Do one thing at a time

If you’re working on live systems, don’t work alone. Have someone look over your shoulder, double-checking everything you do.

Step 6: Keep Notes

As you work to solve the problem:

  • Keep track of what you do, and what the outcomes are
  • Create an action list for later
  • Keep key people informed

Step 7: Clarify What’s Going On

If you don’t know what’s going on, you can’t fix it.

  • What is failing?
  • What is the impact on the business?
  • What are your key sources of information?
  • What has changed recently?
  • Is someone in another team already working on this?
  • What is your biggest risk?

If you don’t know, how can you find out?

Step 8: Restore Stability

Your first action isn’t to fix the underlying issue. It is to stop any data corruption from getting worse.

  • If you need to, take things off-line
  • If things have changed recently, consider rolling them back

Don’t try to be too clever. Take simple steps that are low risk, and easy to understand. This isn’t the time to write elegant, highly optimised code. Rather, focus on clear, effective code that will get the job done.

Do one thing at a time, and check the results after each step.
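
As a rough illustration of working one step at a time, here is a minimal JavaScript sketch. The runSteps helper and the shape of the step objects are hypothetical, invented for this example; real recovery steps would act on your own systems rather than on a local object.

```javascript
// Hypothetical helper: run recovery steps one at a time,
// checking the result of each step before moving on to the next.
function runSteps(steps) {
  const log = [];
  for (const step of steps) {
    step.action();
    const ok = step.check();
    log.push({ name: step.name, ok });
    if (!ok) {
      // Stop at the first failed check rather than piling more changes on top.
      return { completed: false, failedAt: step.name, log };
    }
  }
  return { completed: true, failedAt: null, log };
}

// Usage with made-up steps that only change a local object.
const system = { online: true, version: 2 };
const result = runSteps([
  {
    name: 'take offline',
    action: () => { system.online = false; },
    check: () => system.online === false,
  },
  {
    name: 'roll back',
    action: () => { system.version = 1; },
    check: () => system.version === 1,
  },
]);
console.log(result.completed, result.log);
```

The log doubles as a simple record of what was done and what the outcome was, which is useful when you come to write things up later.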

Step 9: Clean Up

Now that things are stable, take a moment. If you need to, go use the loo, and make a cup of tea.

Your next task is to restore anything that is still broken. Focus on data, then functionality.

Break the whole problem down into manageable chunks. Group similar problems together, and tackle them one group at a time.

Step 10: Reflect

Right, everything’s back up and running.

Take a longer break, but not too long. You need to do the following as soon as possible after the crisis, while things are fresh in your mind:

Root Cause Analysis

This is where you figure out what went wrong. Chances are there was more than one cause, and that each cause was in turn caused by something else.

Make a plan:

  • to reduce the likelihood of something like this happening again
  • to reduce the impact of something like this happening again

Lessons Learned

What have you learned from the way the crisis was managed?

  • what went well?
  • what will you do differently next time?
  • how will you remember?

Step 11: Appreciate Your People

Now that the crisis is over, take time to understand the impact the crisis had on people, and how people contributed to resolving it.

Thank People

  • who helped
  • who were ready to help

Say Sorry to People

  • Who lost functionality
  • Whose work was delayed
  • Who you ignored

Step 12: Rest

Take time to charge your batteries, and get back to normal. Tomorrow is another day.

Eliminate Waste

Eliminate Waste is one of the core tenets of Lean Software Development.

Waste Spotting

By definition, the most efficient process is one in which there are no wasteful activities.

By learning to spot different types of waste, it is possible to start eliminating the activities that make our software development processes inefficient.

The lists below are intended as a starting point, to help us spot waste in our own software development processes.

Common Sources of Waste

Mary Poppendieck identified 7 common sources of waste in software production. She mapped each of these to one of the 7 “muda” wastes in lean manufacturing.

  1. Partially done work (work in process)
  2. Extra features (over production)
  3. Relearning (extra processing)
  4. Task switching (motion)
  5. Delays (waiting)
  6. Handoffs (transportation)
  7. Defects (defects)

I have noticed several other wastes that are common in the field:

  1. Unnecessary complexity (a form of over production, one that is especially common in software systems)
  2. Missed automation opportunities (a form of extra processing)
  3. Over-management (another form of extra processing)

Faster Roads and Developer Productivity

Today, I’ve been trying a thought experiment. It is designed to help us find creative solutions to software production bottlenecks.

The Scenario

Imagine that you’ve landed a job in traffic management, and that you’ve been assigned the task of reducing average journey times. What are your basic options?

Examples

In a couple of minutes, one of my colleagues and I came up with a few ideas:

  • Get everyone to drive faster
  • Increase the number of lanes
  • Remove obstacles, e.g. traffic lights, roundabouts
  • Add additional routes
  • Reduce congestion by reducing the number of vehicles on the road
  • Encourage people to avoid rush-hour by spreading their journeys throughout the day
  • Stop-starts create traffic jams, so slow traffic so that it flows better
  • Add safety features to cars so they’re less likely to have accidents and cause delays

What else can you think of?

Shifting Perspective

Now, imagine that delivering software works in a similar way. The features are cars, and the road is the delivery pipeline that takes feature requests and turns them into deployed features that are used by our customers.

Based on this analogy, what kinds of changes could we make to the development pipeline so that we could deliver features quicker?

For example, “get everyone to drive faster” could map to, “get everyone to code quicker”, and “increase the number of lanes” could map to “add more developers”.

Follow-Up Questions

  • What are the flaws in this analogy? In what ways does getting cars from A to B differ from getting features delivered?
  • In what ways is the analogy helpful?

Personal Observations

Personally, I found this thought experiment helpful, as it helped to generate different ways of looking at the software development life-cycle. It helped me to come up with ideas that could help increase our productivity.

The downside of this approach is that, like any analogy, it breaks down if you push it too far.

Drawing a Scar on an HTML Canvas

A simple approach to drawing a “scar” with “stitches” on an HTML canvas.

The Requirements

The team was developing a simple drawing tool for use in a hospital, to replace a 3rd party application that was due to be decommissioned. One of the requirements was that the users should be able to draw a line that represents a scar. The customer wanted these lines to look similar to the ones in the old tool, something like this:

Example Scar Line

Rejected Solution

Our initial thoughts were that we would need to take the equation for the original line, and use some kind of algorithm to figure out the location and angle of the stitches. This would undoubtedly involve some complex mathematics.

The Final Solution

My final solution is simple and fast, and works perfectly for lines, quadratic curves and Bézier curves. It hardly involves any calculation at all!

How It Works

To see how it works, follow these steps:

Step 1: Draw the Initial Line

Step 2: Make the Line Wider

Step 3: Make the Line Dashed

Step 4: Draw the Line Twice, First Thin, Then Wide and Dashed

Step 5: Adjust the Dash Spacing and Dash Widths

Implementation

The code below will draw the scar depicted above:

<!DOCTYPE HTML>
<html>
<head>
	<style>
	body {
		margin: 0px;
		padding: 0px;
	}
	</style>
</head>
<body>
<canvas id="myCanvas" width="578" height="200"></canvas>
<script>
	// Dimensions in pixels.
	var scarThickness = 3;   // width of the solid scar line
	var stitchThickness = 1; // length of each dash along the line
	var stitchSpacing = 20;  // gap between one stitch and the next
	var stitchWidth = 25;    // line width of the dashed pass, i.e. how far each stitch spans across the scar

	var canvas = document.getElementById('myCanvas');
	var context = canvas.getContext('2d');

	// We'll draw the scar in 2 passes over the same line.
	// (pass is declared with var so that it does not leak as an implicit global.)
	for (var pass = 0; pass < 2; pass++) {

		// Both passes will draw this squiggle.
		context.beginPath();
		context.moveTo(100, 20);
		context.lineTo(200, 160);
		context.quadraticCurveTo(230, 200, 250, 120);
		context.bezierCurveTo(290, -40, 300, 200, 400, 150);
		context.lineTo(500, 90);

		switch (pass) {
			case 0:
				// Normal line for 1st pass, which draws the base scar.
				context.lineWidth = scarThickness;
				break;
			case 1:
				// Wide dashed lines for 2nd pass to achieve stitch effect.
				context.lineWidth = stitchWidth;
				context.setLineDash([stitchThickness, stitchSpacing]);
				break;
		}

		context.strokeStyle = 'red';
		context.stroke();
	}
</script>
</body>
</html> 
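
If you need to draw more than one scar, the same two-pass idea can be wrapped in a function. What follows is a minimal sketch rather than the code used in the hospital tool: the drawScar name, the tracePath callback and the options object are all assumptions made for this example. It relies only on standard 2D canvas context calls (save, restore, beginPath, setLineDash, stroke).

```javascript
// Hypothetical helper wrapping the two-pass technique.
// tracePath is a caller-supplied function that issues the
// moveTo / lineTo / curve calls describing the scar.
function drawScar(context, tracePath, options = {}) {
  const {
    scarThickness = 3,
    stitchThickness = 1,
    stitchSpacing = 20,
    stitchWidth = 25,
    colour = 'red',
  } = options;

  context.save();
  context.strokeStyle = colour;

  // Pass 1: thin solid line for the scar itself.
  context.beginPath();
  tracePath(context);
  context.lineWidth = scarThickness;
  context.setLineDash([]);
  context.stroke();

  // Pass 2: same path, wide and dashed, so each short dash reads as a stitch.
  context.beginPath();
  tracePath(context);
  context.lineWidth = stitchWidth;
  context.setLineDash([stitchThickness, stitchSpacing]);
  context.stroke();

  // Restore the saved state so the dash pattern does not leak into later drawing.
  context.restore();
}

// In a browser, usage might look like this:
//   const context = document.getElementById('myCanvas').getContext('2d');
//   drawScar(context, (ctx) => {
//     ctx.moveTo(100, 20);
//     ctx.lineTo(500, 90);
//   });
```

Passing the path in as a callback means the helper never needs to know whether the scar is a straight line or a curve, which is the whole appeal of the technique.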

The Lean / Agile Hero’s Journey

A reflection on my agile journey.

Stage 1: The Ordinary World

Doing what we’ve always done.

Stage 2: The Call to Adventure

The feeling that there must be a better way.

Stage 3: Refusal of the Call

There are many reasons to turn back:

  • Uncertainty and self-doubt.
  • Fear of change.
  • Too busy chopping down trees to take the time to sharpen the axe.
  • Lack of support.

Stage 4: Meeting the Mentor

Seeing that other people have taken this journey before you, and taking inspiration from them.

Stage 5: Crossing the First Threshold

Formally adopting Lean / Agile. Often involves adopting a framework such as Scrum.

You may have to defeat the threshold guardians, e.g. colleagues who are resistant to change, managers etc.

Stage 6: Tests, Allies and Enemies

Making changes within your team to make Lean / Agile work for you.

Stage 7: Approach To The Inmost Cave

A sense of foreboding. An inkling that you’ve missed something important. A feeling that Lean / Agile is flawed, or that you’re doing it wrong, or that it has been over-sold.

Stage 8: Ordeal

Facing up to the fact that Lean / Agile hasn’t lived up to its promise.

Stage 9: Reward

A deeper understanding of Lean / Agile; not just its practices, but its deeper principles.

Stage 10: The Road Back

Adopting a Lean / Agile mind-set.

Stage 11: Resurrection

A complete re-evaluation of what you’re doing from your new vantage point. A new enthusiasm and energy. Addressing long-standing issues.

Stage 12: Return with the Elixir

Encouraging others to take the Lean / Agile journey for themselves.

The Real Problem

When finding solutions, it is essential to get at the real problem, not the thing that people say is the problem.

You can’t always get what you want
You can’t always get what you want
You can’t always get what you want
But if you try sometimes
Well, you might find
You get what you need

The Rolling Stones

The XY Problem

The XY Problem is known to anyone who tries to solve problems for people, even if they don’t use that name for it. It happens when people present their problem in terms of their idea about a solution rather than the actual problem. Unfortunately, this leads to enormous amounts of wasted time and effort, both on the part of people asking for help, and on the part of those providing help.

It goes like this:

  1. Someone has a problem, X.
  2. They think that Y will help them solve their problem, X.
  3. They ask about Y.
  4. Someone helps them to do Y.
  5. It turns out that Y doesn’t solve their real problem, X.

On a good day, the person with problem X will realize that they’re solving the wrong problem and shift their focus to X. All too often, however, they either continue to believe that Y is the answer, and go back to step 2, or they pick a different Y and try that instead.

Example: The Paper Problem

It isn’t uncommon for people to talk about “the paper problem”. The thinking goes that if an organization can go digital and get rid of all the paper, the organization will be more efficient.

What happens is this:

  1. A customer is frustrated by their administrative processes.
  2. They assume that the problem is that they’re using paper (Y), rather than computers.
  3. The IT supplier, keen for business, offers to write an application for them.
  4. On a good day, the application reduces the administrative overhead, and the customer is happy.
  5. On a bad day, the underlying problems (X) with the administrative process have not been addressed.

The customer may now be in a worse position:

  • IT acts as an amplifier, making good processes better, but bad processes worse. They can now do the wrong thing quicker.
  • IT systems are more expensive to change than paper systems. The problems are now baked in.

And all too often, both customer and supplier think that the problem is better IT, when the real problem is fixing bad admin processes. Both spend time and resource on improving Y, when the real problem is X.

The Presenting Problem

A similar issue occurs when people ask about the symptoms of their problems rather than their underlying causes. The initial symptoms are sometimes called the “presenting problem”. Relieving these symptoms may make the help-seeker feel better, and where the underlying problem isn’t serious, that can be sufficient. However, if the underlying problem is more serious, then a “cure” which simply masks the underlying issue is likely to do more harm than good.

Example: The Beeping Smoke Alarm

  1. My smoke alarm keeps beeping (presenting problem)
  2. I take the battery out to prevent the beeping (relieving symptoms)
  3. I fail to put in a new battery (which would solve the underlying problem)
  4. My home, family and my life are at greater risk

Solutions

There are various ways to overcome these issues:

As Someone with a Problem

  • Choose the best forum to seek solutions. In particular, recognize that some people have a vested interest in you solving the presenting problem rather than the underlying problem.
  • Treat people who are trying to help you with respect and patience. First, because it is the right thing to do. Second, because doing so means that they’re more likely to help you in return.
  • Be open to new ways of approaching your problem.
  • When explaining your problem, try to be concise, precise and informative.
  • Describe your goal, and what you’ve tried already.
  • Focus on your observations, not just your guesses.
  • Accept that a better question is progress, even if you don’t feel any closer to a solution.
  • Accept that the problem solver may perceive a deeper problem than you can see.

As a Problem Solver

  • When someone comes to you with a problem (even if that someone is yourself), begin with the assumption that there is an underlying problem that they’re not asking about.
  • Use the 5 Whys technique where appropriate.
  • Learn to recognize common occurrences of these problems. As you gain experience in your field, you’ll find that people often ask about Y when they really want to know about X.
  • Ask about other symptoms.
  • If you carry a hammer, beware of assuming that every problem is a nail. (If you’re in IT, for example, don’t assume that technology is the answer to everything. Both people-problems and systemic issues often masquerade as technical issues.)
  • Accept that you can’t solve everyone’s problems. Sending someone elsewhere (or even doing nothing) is better than making things worse.

SQL Server Index Analysis

A query for the level of fragmentation of indexes in the current database, with usage since the last restart of SQL Server.

I can’t take credit for this code, but it has proven useful, so I’m posting it here.

Almost Implementing a Standard

Several of our suppliers claim to implement standards, but in reality they implement a subset of those standards, and only under certain circumstances.

To make matters worse, they don’t actually document the places where their implementation diverges from the standard. You get what you expect most of the time, but not quite always. And when you don’t, it can be a problem… a big problem.

What if other businesses worked that way? Imagine you regularly ordered a taxi to take you to the airport. Imagine that, most of the time, a taxi turns up, where you want it, when you want it. But just occasionally, instead of a taxi, they send a small, golden-feathered chicken. But they insist you pay for it, even if you miss your flight.

The point of a standard is that it is… standard. So, almost implementing a standard is another way of saying that you don’t implement it. Except that your marketing people won’t be able to sell your product unless you claim to implement it.

</rant>

When Disaster Recovery is a Disaster Waiting to Happen

We were ready for our Disaster Recovery exercise. The plan was simple: to fail-over to our DR site, and then to fail-back again.

But, during our last-minute safety checks, my colleague and I spotted a problem. We realised that, if we proceeded with the planned fail-over, there was a significant risk of data-loss.

Some of the team wanted to continue anyway. We were, after all, trying to simulate a real disaster situation. If a real disaster were to occur, they argued, we would have to deal with similar consequences.

Debating the merits of this position, I suggested that:

There is a difference between falling off a bridge and throwing yourself off. The first is an unfortunate accident, the second is downright recklessness.

In agreement, my colleague replied that:

Continuing with the exercise would be like throwing yourself off a bridge now just in case you fall off one later!

A deeper analysis of the risk suggested that, if we were careful, we could throw ourselves off this particular bridge safely. Although there was likely to be data loss, there was a manual process available that would recover the missing records.

So, off we jumped…

The Liskov Substitution Principle (LSP)

A simple introduction to the Liskov Substitution Principle, a design guideline that helps us with inheritance.

Introduction

The LSP was first described by Barbara Liskov in a 1987 conference address entitled Data Abstraction and Hierarchy.

Despite the fancy name, it is actually quite a simple idea.

One way to explain it is this:

If A is a type of B, then everything we can say about B is also true of A.

Which means that:

If S is a subtype of T, then objects of type T in a program may be replaced with objects of type S without breaking anything.

Examples

A Snake is an Animal

If Snake is a subtype of Animal then you can replace Animal with Snake and nothing should break.

In other words, anything that is true of Animal should also be true of Snake.

A Programmer is a Human

If we model a Programmer as a Type of Human, then (within the domain of our program) everything we can say about Human can also be said about Programmer.

To say this another way, we can take any instance of Human and replace it with a Programmer.

Challenges

The challenge here is that we intuitively model like this. As a result, we tend not to think about it… and so occasionally we make a mistake.

Code Examples

For example, imagine a program that models Square as a sub-type of Rectangle:

Bad Design

A programmer has defined a Rectangle class as follows:

Later, someone else adds the Square class to the application. In this new code, Square is implemented as a sub-class of Rectangle:

At first sight, this seems reasonable; after all, every school child knows that a Square is just a special kind of Rectangle. Unfortunately for the second programmer, however, there are circumstances when you can’t substitute a Square for a Rectangle.

Consider a (single-threaded) client of the Rectangle class that has a method that does this:

  1. Set the value of Height to 200
  2. Set the value of Width to 400
  3. Read back the value of Height

The result is obviously 200. This is the value expected, because changing the Width of a rectangle normally has no effect on its height.

Now consider what would happen if the same method is passed a Square. This is reasonable, because we have said that a Square is a Rectangle:

  1. Set the value of Height to 200
  2. Set the value of Width to 400
  3. Read back the value of Height

Now the result is 400! Obviously, this is a different result from the same code.
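
The scenario can be reproduced in a few lines of JavaScript. This is a sketch under assumptions: the class shapes, the property names and the setThenReadHeight client are invented here to match the prose, and are not the original listings.

```javascript
// Sketch only: names are assumptions based on the description in the text.
class Rectangle {
  constructor(width = 0, height = 0) {
    this._width = width;
    this._height = height;
  }
  get width() { return this._width; }
  set width(value) { this._width = value; }
  get height() { return this._height; }
  set height(value) { this._height = value; }
}

// Square keeps its sides equal by overriding both setters.
// The getters are repeated because, in JavaScript, overriding only
// the setter of an accessor would hide the inherited getter.
class Square extends Rectangle {
  get width() { return this._width; }
  set width(value) { this._width = value; this._height = value; }
  get height() { return this._height; }
  set height(value) { this._width = value; this._height = value; }
}

// The client: written against Rectangle, with no knowledge of Square.
function setThenReadHeight(rectangle) {
  rectangle.height = 200;
  rectangle.width = 400;
  return rectangle.height;
}

console.log(setThenReadHeight(new Rectangle())); // 200
console.log(setThenReadHeight(new Square()));    // 400
```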

We can see, therefore, that the second programmer violated the LSP, in that their specialist form of Rectangle doesn’t behave like a rectangle in all circumstances.

Of course, the second programmer could argue that changing the Width of a Square would obviously change its Height. However, other classes that use the Rectangle class have no special knowledge about Squares. They expect to be passed Rectangles, so they have every right to expect that the things passed will behave like Rectangles… not like Squares.

The issues caused by violating LSP often manifest themselves in a different part of the code from the one that has changed. In this case, the change was to add the Square class as a sub-class of Rectangle, but the issue came to light in a client of the Rectangle class. In complex systems, issues like this can easily go unnoticed, and can be very hard to debug, and even harder to fix.

Implementing LSP

It is important to think carefully about hierarchies of classes in your application.

Some ways to avoid violations of LSP include:

  • Avoid inheritance altogether – favour composition over inheritance
  • Keep class hierarchies shallow
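
As one possible illustration of favouring composition, the sketch below gives Rectangle and Square a shared Size object instead of a parent-child relationship. It is only a sketch: the Size class, the method names and the totalArea function are hypothetical, and other designs would work equally well.

```javascript
// Sketch only: one possible composition-based design; all names are hypothetical.
class Size {
  constructor(width, height) {
    this.width = width;
    this.height = height;
  }
  area() { return this.width * this.height; }
}

// Rectangle has a Size, and lets width and height change independently.
class Rectangle {
  constructor(width, height) { this.size = new Size(width, height); }
  setWidth(width) { this.size = new Size(width, this.size.height); }
  setHeight(height) { this.size = new Size(this.size.width, height); }
  area() { return this.size.area(); }
}

// Square also has a Size, but only exposes a single side.
// It is not a Rectangle, so it never promises independent width and height.
class Square {
  constructor(side) { this.size = new Size(side, side); }
  setSide(side) { this.size = new Size(side, side); }
  area() { return this.size.area(); }
}

// Clients that only need the area work with either shape.
function totalArea(shapes) {
  return shapes.reduce((sum, shape) => sum + shape.area(), 0);
}

console.log(totalArea([new Rectangle(2, 3), new Square(4)])); // 22
```

Because Square has no setWidth or setHeight, code that depends on changing one dimension without the other cannot be handed a Square by mistake.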

Summary

If a Snake is a type of Animal, then anything we can say about an Animal is something we can also say about a Snake.

Violations of LSP:

  • Can be difficult to spot, especially because we generally don’t think about LSP because it is “obvious”
  • Can result in subtle bugs in code
  • Can cause bugs that manifest themselves in different parts of the code than the ones that have changed