Job Description

My job is to orchestrate the rapid on-and-off of millions of tiny little switches.

I am the wizard, the spell master, the weaver of webs.

I make worlds from pure thought, and as my worlds collide with yours, your worlds are changed, and so are mine.

My worlds are contained in little black boxes, boxes that I will never actually see.

They are not lost, these boxes full of switches, but I do not know where they are, and I never will.

For hours I sit in silence, staring at the object before me: an artificial glass that glows with the light of yesterday’s sun.

Arcane symbols march across the glass, just below its surface. They come and go at my whim.

And as I work, I remain quite still, still but for the endless dance of my fingers on yet more switches.

And then, all at once, the stillness is broken.

I curse; I wring my hands in frustration.

And one of my fellows intones the spell to end all spells:

“Have you tried switching it all off and then back on again?”

Yes, indeed; yet more switches.

Handling a Systems Meltdown: 12 Steps

Your systems are melting down, your customer data is going down the pan, and everyone is panicking. What do you do?

Step 1: Take a Breath

It’s OK. You got this. It’s going to be OK.

Right, let’s get to it…

Step 2: Communicate, Communicate, Communicate

It’s tempting to jump in and start tackling the situation. But before you do, take a moment to think about who you need to inform about the ongoing situation.

People to consider:

  • Customers
  • The Service Desk
  • Managers

Who is best placed to handle that communication?

  • Not you. You’re managing the crisis.
  • Not the people most qualified to solve the problem. They’ll be too busy solving the problem.
  • No, pick the person who is least qualified to fix the issue, but who still has enough understanding that they can communicate what’s going on.

If you need to do so, borrow someone else from another team to take on the communications role.

Step 3: Shut the Door

You and your team are going to have to concentrate.

There are lots of things that you could be doing that won’t solve the immediate crisis. If people start talking about them, kill the conversation.

  • This isn’t the time to do a root cause analysis
  • This isn’t the time to speculate about things you’ll do differently in the future
  • Your colleague’s personal crisis can (probably) wait
  • Planned work can be postponed
  • You can say sorry later

If you need to, post someone at the door to keep people out.

Step 4: Get Help

Who is going to help you solve the problem?

You don’t want too many people in your crisis team. Too many voices will make it hard to focus on solutions. Don’t be afraid of (politely) kicking people out if they’re not actively helping right now.

At the same time, you want your best people around you. Call them in, and ask them to get their tools ready.

As people arrive, don’t stop to explain everything. Say you’re in crisis, and give them enough information to start being useful. Then give them something to do.

Get people to play to their strengths:

  • Who are your best trouble-shooters? – Get them doing diagnostics.
  • Who is fast and accurate? – Get them at the coal face, ready to act.
  • Who is slower, but deals better with complexity? – Listen to them.
  • Who are your all-rounders? – Get them to help everyone else.
  • Who is the expert on the thing that’s broken? – That’s your technical lead.

Do you need help from other teams?

Step 5: Don’t Make Things Worse

In a crisis, it is very easy to start doing things that will either make the current situation worse, or which will destroy the evidence trail and make the clean-up operation harder.

Take steps to ensure this won’t happen.

For example:

  • Take a backup
  • Don’t act until you’re sure
  • Get someone else to double-check
  • Don’t rush
  • Do one thing at a time

If you’re working on live systems, don’t work alone. Have someone look over your shoulder, double-checking everything you do.

Step 6: Keep Notes

As you work to solve the problem:

  • Keep track of what you do, and what the outcomes are
  • Create an action list for later
  • Keep key people informed

Step 7: Clarify What’s Going On

If you don’t know what’s going on, you can’t fix it.

  • What is failing?
  • What is the impact on the business?
  • What are your key sources of information?
  • What has changed recently?
  • Is someone in another team already working on this?
  • What is your biggest risk?

If you don’t know, how can you find out?

Step 8: Restore Stability

Your first action isn’t to fix the underlying issue. It is to stop any data corruption from getting worse.

  • If you need to, take things off-line
  • If things have changed recently, consider rolling them back

Don’t try to be too clever. Take simple steps that are low risk, and easy to understand. This isn’t the time to write elegant, highly optimised code. Rather, focus on clear, effective code that will get the job done.

Do one thing at a time, and check the results after each step.

Step 9: Clean Up

Now that things are stable, take a moment. If you need to, go use the loo, and make a cup of tea.

Your next task is to restore anything that is still broken. Focus on data, then functionality.

Break the whole problem down into manageable chunks. Group similar problems together, and tackle them one group at a time.

Step 10: Reflect

Right, everything’s back up and running.

Take a longer break, but not too long. You need to do the following as soon as possible after the crisis, while things are fresh in your mind:

Root Cause Analysis

This is where you figure out what went wrong. Chances are there was more than one cause, and that each cause was in turn caused by something else.

Make a plan:

  • to reduce the likelihood of something like this happening again
  • to reduce the impact of something like this happening again

Lessons Learned

What have you learned from the way the crisis was managed?

  • what went well?
  • what will you do differently next time?
  • how will you remember?

Step 11: Appreciate Your People

Now that the crisis is over, take time to understand the impact the crisis had on people, and how people contributed to resolving it.

Thank People

  • Who helped
  • Who were ready to help

Say Sorry to People

  • Who lost functionality
  • Whose work was delayed
  • Who you ignored

Step 12: Rest

Take time to charge your batteries, and get back to normal. Tomorrow is another day.

Eliminate Waste

Eliminate Waste is one of the core tenets of Lean Software Development.

Waste Spotting

By definition, the most efficient process is one in which there are no wasteful activities.

By learning to spot different types of waste, it is possible to start eliminating the activities that make our software development processes inefficient.

The lists below are intended as a starting point, to help us spot waste in our own software development processes.

Common Sources of Waste

Mary Poppendieck identified 7 common sources of waste in software production. She mapped each of these to one of the 7 “muda” wastes in lean manufacturing.

  1. Partially done work (work in process)
  2. Extra features (over production)
  3. Relearning (extra processing)
  4. Task switching (motion)
  5. Delays (waiting)
  6. Handoffs (transportation)
  7. Defects (defects)

I have noticed several other wastes that are common in the field:

  1. Unnecessary complexity (a form of over production, one that is especially common in software systems)
  2. Missed automation opportunities (a form of extra processing)
  3. Over-management (another form of extra processing)

Faster Roads and Developer Productivity

Today, I’ve been trying a thought experiment. It is designed to help us find creative solutions to software production bottlenecks.

The Scenario

Imagine that you’ve landed a job in traffic management, and that you’ve been assigned the task of reducing average journey times. What are your basic options?

Examples

In a couple of minutes, one of my colleagues and I came up with a few ideas:

  • Get everyone to drive faster
  • Increase the number of lanes
  • Remove obstacles, e.g. traffic lights, roundabouts
  • Add additional routes
  • Reduce congestion by reducing the number of vehicles on the road
  • Encourage people to avoid rush-hour by spreading their journeys throughout the day
  • Stop-starts create traffic jams, so slow the traffic down so that it flows more smoothly
  • Add safety features to cars so they’re less likely to have accidents and cause delays

What else can you think of?

Shifting Perspective

Now, imagine that delivering software works in a similar way. The features are cars, and the road is the delivery pipeline that takes feature requests and turns them into deployed features that are used by our customers.

Based on this analogy, what kinds of changes could we make to the development pipeline so that we could deliver features quicker?

For example, “get everyone to drive faster” could map to “get everyone to code quicker”, and “increase the number of lanes” could map to “add more developers”.

Follow-Up Questions

  • What are the flaws in this analogy? In what ways does getting cars from A to B differ from getting features delivered?
  • In what ways is the analogy helpful?

Personal Observations

Personally, I found this thought experiment helpful, as it helped to generate different ways of looking at the software development life-cycle. It helped me to come up with ideas that could help increase our productivity.

The downside of this approach is that, like any analogy, it breaks down if you push it too far.

Drawing a Scar on an HTML Canvas

A simple approach to drawing a “scar” with “stitches” on an HTML canvas.

The Requirements

The team was developing a simple drawing tool for use in a hospital, to replace a 3rd party application that was due to be decommissioned. One of the requirements was that the users should be able to draw a line that represents a scar. The customer wanted these lines to look similar to the ones in the old tool, something like this:

Example Scar Line

Rejected Solution

Our initial thought was that we would need to take the equation for the original line, and use some kind of algorithm to work out the location and angle of each stitch. This would undoubtedly involve some complex mathematics.

The Final Solution

My final solution is simple and fast, and works perfectly for lines, quadratic curves and Bézier curves. It hardly involves any calculation at all!

How It Works

To see how it works, follow these steps:

Step 1: Draw the Initial Line

Step 2: Make the Line Wider

Step 3: Make the Line Dashed

Step 4: Draw the Line Twice, First Thin, Then Wide and Dashed

Step 5: Adjust the Dash Spacing and Dash Widths

Implementation

The code below will draw the scar depicted above:

<!DOCTYPE HTML>
<html>
<head>
	<style>
	body {
		margin: 0px;
		padding: 0px;
	}
	</style>
</head>
<body>
<canvas id="myCanvas" width="578" height="200"></canvas>
<script>
	var scarThickness = 3;    // Line width of the base scar.
	var stitchThickness = 1;  // Dash length: how thick each stitch looks along the scar.
	var stitchSpacing = 20;   // Gap between dashes: the distance between stitches.
	var stitchWidth = 25;     // Line width of the dashed pass: how far each stitch reaches across the scar.

	var canvas = document.getElementById('myCanvas');
	var context = canvas.getContext('2d');

	// We'll draw the scar in 2 passes over the same line.
	// Declaring pass with var avoids creating an accidental global variable.
	for (var pass = 0; pass < 2; pass++) {

		// Both passes will draw this squiggle.
		context.beginPath();
		context.moveTo(100, 20);
		context.lineTo(200, 160);
		context.quadraticCurveTo(230, 200, 250, 120);
		context.bezierCurveTo(290, -40, 300, 200, 400, 150);
		context.lineTo(500, 90);

		switch (pass) {
			case 0:
				// Normal line for 1st pass, which draws the base scar.
				context.lineWidth = scarThickness;
				break;
			case 1:
				// Wide dashed line for 2nd pass to achieve the stitch effect.
				context.lineWidth = stitchWidth;
				context.setLineDash([stitchThickness, stitchSpacing]);
				break;
		}

		context.strokeStyle = 'red';
		context.stroke();
	}
</script>
</body>
</html> 
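
If you need to draw more than one scar, the same two-pass trick could be wrapped in a small helper. The sketch below is only one way to do it, not the code we shipped: the function name `drawScar`, the `tracePath` callback and the `options` object are all hypothetical names made up for this example. It assumes the caller supplies a function that traces the path (the `moveTo`, `lineTo` and curve calls), and it uses `save()` and `restore()` so that the dash pattern doesn't leak into later drawing.

```javascript
// Hypothetical helper: strokes the same path twice to get the
// scar-and-stitches effect. tracePath is a caller-supplied function
// that issues the moveTo / lineTo / curve calls on the context.
function drawScar(context, tracePath, options = {}) {
	const {
		scarThickness = 3,   // line width of the base scar
		stitchThickness = 1, // dash length along the path
		stitchSpacing = 20,  // gap between dashes along the path
		stitchWidth = 25,    // line width of the dashed pass
		color = 'red',
	} = options;

	// Save the drawing state so line width, dash pattern and colour
	// are put back to whatever they were before we started.
	context.save();
	context.strokeStyle = color;

	// Pass 1: thin solid line for the scar itself.
	context.beginPath();
	tracePath(context);
	context.setLineDash([]);
	context.lineWidth = scarThickness;
	context.stroke();

	// Pass 2: wide dashed line over the same path for the stitches.
	context.beginPath();
	tracePath(context);
	context.setLineDash([stitchThickness, stitchSpacing]);
	context.lineWidth = stitchWidth;
	context.stroke();

	context.restore();
}

// Example usage in a browser, assuming a canvas with id "myCanvas" exists.
if (typeof document !== 'undefined') {
	const canvas = document.getElementById('myCanvas');
	if (canvas) {
		drawScar(canvas.getContext('2d'), (ctx) => {
			ctx.moveTo(100, 20);
			ctx.lineTo(200, 160);
			ctx.quadraticCurveTo(230, 200, 250, 120);
		}, { stitchSpacing: 15 });
	}
}
```

Passing the path in as a callback means the helper never needs to know whether it is drawing a straight line or a curve, which is the whole point of the technique.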

The Lean / Agile Hero’s Journey

A reflection on my agile journey.

Stage 1: The Ordinary World

Doing what we’ve always done.

Stage 2: The Call to Adventure

The feeling that there must be a better way.

Stage 3: Refusal of the Call

There are many reasons to turn back:

  • Uncertainty and self-doubt.
  • Fear of change.
  • Too busy chopping down trees to take the time to sharpen the axe.
  • Lack of support.

Stage 4: Meeting the Mentor

Seeing that other people have taken this journey before you, and taking inspiration from them.

Stage 5: Crossing the First Threshold

Formally adopting Lean / Agile. Often involves adopting a framework such as Scrum.

You may have to defeat the threshold guardians, e.g. colleagues who are resistant to change, managers etc.

Stage 6: Tests, Allies and Enemies

Making changes within your team to make Lean / Agile work for you.

Stage 7: Approach To The Inmost Cave

A sense of foreboding. An inkling that you’ve missed something important. A feeling that Lean / Agile is flawed, or that you’re doing it wrong, or it has been over-sold.

Stage 8: Ordeal

Facing up to the fact that Lean / Agile hasn’t lived up to its promise.

Stage 9: Reward

A deeper understanding of Lean / Agile; not just its practices, but its deeper principles.

Stage 10: The Road Back

Adopting a Lean / Agile mind-set.

Stage 11: Resurrection

A complete re-evaluation of what you’re doing from your new vantage point. A new enthusiasm and energy. Addressing long-standing issues.

Stage 12: Return with the Elixir

Encouraging others to take the Lean / Agile journey for themselves.

The Real Problem

When finding solutions, it is essential to get at the real problem, not the thing that people say is the problem.

You can’t always get what you want
You can’t always get what you want
You can’t always get what you want
But if you try sometimes
Well, you might find
You get what you need

The Rolling Stones

The XY Problem

The XY Problem is known to anyone who tries to solve problems for people, even if they don’t use that name for it. It happens when people present their problem in terms of their idea about a solution rather than the actual problem. Unfortunately, this leads to enormous amounts of wasted time and effort, both on the part of people asking for help, and on the part of those providing help.

It goes like this:

  1. Someone has a problem, X.
  2. They think that Y will help them solve their problem, X.
  3. They ask about Y.
  4. Someone helps them to do Y.
  5. It turns out that Y doesn’t solve their real problem, X.

On a good day, the person with problem X will realize that they’re solving the wrong problem and shift their focus to X. All too often, however, they either continue to believe that Y is the answer, and go back to step 2, or they pick a different Y and try that instead.

Example: The Paper Problem

It isn’t uncommon for people to talk about “the paper problem”. The thinking goes that if an organization can go digital and get rid of all the paper, the organization will be more efficient.

What happens is this:

  1. A customer is frustrated by their administrative processes.
  2. They assume that the problem is that they’re using paper (Y), rather than computers.
  3. The IT supplier, keen for business, offers to write an application for them.
  4. On a good day, the application reduces the administrative overhead, and the customer is happy.
  5. On a bad day, the underlying problems (X) with the administrative process have not been addressed.

The customer may now be in a worse position:

  • IT acts as an amplifier, making good processes better, but bad processes worse. They can now do the wrong thing quicker.
  • IT systems are more expensive to change than paper systems. The problems are now baked in.

And all too often, both customer and supplier think that the answer is better IT, when the real task is fixing bad admin processes. Both spend time and resources on improving Y, when the real problem is X.

The Presenting Problem

A similar issue occurs when people ask about the symptoms of their problems rather than their underlying causes. The initial symptoms are sometimes called the “presenting problem”. Relieving these symptoms may make the help-seeker feel better, and where the underlying problem isn’t serious, that can be sufficient. However, if the underlying problem is more serious, then a “cure” which simply masks the underlying issue is likely to do more harm than good.

Example: The Beeping Smoke Alarm

  1. My smoke alarm keeps beeping (presenting problem)
  2. I take the battery out to prevent the beeping (relieving symptoms)
  3. I fail to put in a new battery (which would solve the underlying problem)
  4. My home, family and my life are at greater risk

Solutions

There are various ways to overcome these issues:

As Someone with a Problem

  • Choose the best forum to seek solutions. In particular, recognize that some people have a vested interest in you solving the presenting problem rather than the underlying problem.
  • Treat people who are trying to help you with respect and patience. First, because it is the right thing to do. Second, because doing so means that they’re more likely to help you in return.
  • Be open to new ways of approaching your problem.
  • When explaining your problem, try to be concise, precise and informative.
  • Describe your goal, and what you’ve tried already.
  • Focus on your observations, not just your guesses.
  • Accept that a better question is progress, even if you don’t feel any closer to a solution.
  • Accept that the problem solver may perceive a deeper problem than you can see.

As a Problem Solver

  • When someone comes to you with a problem (even if that someone is yourself), begin with the assumption that there is an underlying problem that they’re not asking about.
  • Use the 5 Whys technique where appropriate.
  • Learn to recognize common occurrences of these problems. As you gain experience in your field, you’ll find that people often ask about Y when they really want to know about X.
  • Ask about other symptoms.
  • If you carry a hammer, beware of assuming that every problem is a nail. (If you’re in IT, for example, don’t assume that technology is the answer to everything. Both people-problems and systemic issues often masquerade as technical issues.)
  • Accept that you can’t solve everyone’s problems. Sending someone elsewhere (or even doing nothing) is better than making things worse.

SQL Server Index Analysis

A query that reports the level of fragmentation of the indexes in the current database, along with their usage since the last restart of SQL Server.

I can’t take credit for this code, but it has proven useful, so I’m posting it here.

Almost Implementing a Standard

Several of our suppliers claim to implement standards, but in reality they implement a subset of those standards, and only under certain circumstances.

To make matters worse, they don’t actually document the places where their implementation diverges from the standard. You get what you expect most of the time, but not quite always. And when you don’t, it can be a problem… a big problem.

What if other businesses worked that way? Imagine you regularly ordered a taxi to take you to the airport. Imagine that, most of the time, a taxi turns up where you want it, when you want it. But just occasionally, instead of a taxi, they send a small, golden-feathered chicken. And they insist you pay for it, even if you miss your flight.

The point of a standard is that it is… standard. So, almost implementing a standard is another way of saying that you don’t implement it. Except that your marketing people won’t be able to sell your product unless you claim to implement it.

</rant>

When Disaster Recovery is a Disaster Waiting to Happen

We were ready for our Disaster Recovery exercise. The plan was simple: to fail-over to our DR site, and then to fail-back again.

But, during our last-minute safety checks, my colleague and I spotted a problem. We realised that, if we proceeded with the planned fail-over, there was a significant risk of data loss.

Some of the team wanted to continue anyway. We were, after all, trying to simulate a real disaster situation. If a real disaster were to occur, they argued, we would have to deal with similar consequences.

Debating the merits of this position, I suggested that:

There is a difference between falling off a bridge and throwing yourself off. The first is an unfortunate accident, the second is downright recklessness.

In agreement, my colleague replied that:

Continuing with the exercise would be like throwing yourself off a bridge now just in case you fall off one later!

A deeper analysis of the risk suggested that, if we were careful, we could throw ourselves off this particular bridge safely. Although there was likely to be data loss, there was a manual process available that would recover the missing records.

So, off we jumped…