Last week we were scheduled to replace a critical component in a complex, mission-critical hospital system. About two-thirds of the way through the deployment, it became clear that I had missed something during the preparatory work for the change (security, always check security). Additional work would be needed before we could complete the upgrade, and it was very likely that we wouldn’t finish the deployment on time…
Lessons from Previous Implementation Projects
Given the critical nature of this change, we knew we needed to do things “properly”. Previous experience suggested that we needed to:
- Test the new solution thoroughly (we put 2,000,000 transactions through the new component and compared the results to the old solution).
- Write a sufficiently detailed implementation plan
  - Include prep-work required prior to implementation
  - Include enough detail so you don’t have to think during implementation. This helps under pressure, and ensures that energy is available to tackle the unexpected.
  - Outline post-implementation work required
- Test the implementation plan (this was not possible for us due to differences between our test and live environments; rectifying this would have cost hundreds of thousands of pounds).
- Write a sufficiently detailed roll-back plan
- Test the roll-back plan (Again, not possible).
- Keep users and stakeholders informed… allowing plenty of time for them to make necessary arrangements for down-time.
- Define a change window
  - When you’ll cause least disruption during the change
  - When failure of the new component will cause least chaos
  - When you have enough support from others
  - When you’ll have enough time for post-implementation testing
- Get approval from stakeholders… in writing
  - Explain the purpose of the thing you’re changing
  - Explain why you’re making the change
  - Say how things are at the moment
  - Say how things will be in the future
  - Explain how you will monitor the new solution
- Prepare your implementation and roll-back plans in advance
- Check the state of the system before changing it (so we could be sure that any faults were due to our changes and not existing faults)
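The testing step above (replaying transactions through both components and comparing results) can be sketched as a parallel run. This is only an illustration: `process_old` and `process_new` are hypothetical stand-ins for the two components, not anything from the real system.

```python
# Parallel-run ("shadow") test: replay the same transactions through the
# old and new implementations and report any mismatches.
# process_old / process_new are hypothetical placeholders.

def process_old(txn):
    return txn["amount"] * 2          # placeholder legacy behaviour

def process_new(txn):
    return txn["amount"] * 2          # the replacement must match exactly

def parallel_run(transactions):
    """Return a list of (index, old_result, new_result) mismatches."""
    mismatches = []
    for i, txn in enumerate(transactions):
        old, new = process_old(txn), process_new(txn)
        if old != new:
            mismatches.append((i, old, new))
    return mismatches

transactions = [{"amount": n} for n in range(1000)]
print(len(parallel_run(transactions)))  # 0 mismatches: outputs agree
```

With millions of transactions you would stream rather than build a list, but the principle is the same: any divergence between old and new is a defect to investigate before go-live.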
Well, we had done all that, but I had made a minor mistake during prep, so things were going badly.
So, my team leader made the call: to roll back.
Lesson 1: Be Prepared to Roll Back
I don’t just mean having a written plan, although that was extremely useful. I mean psychologically. It is sometimes hard to admit defeat. However, it is better to roll back than to either (1) upset customers by breaching the change window or (2) make mistakes whilst working under pressure. It just isn’t worth it.
Lesson 2: A Successful Roll-Back Is Not a Failure
… it is a tactical retreat. As we had a good roll-back plan we were able to revert to the old module without loss of data, and to do so within the change window. We had maintained the status quo.
Lesson 3: Roll Back Completely
You really don’t want to leave a system in an indeterminate state. As it was, we left some things in place ready for our next roll-out attempt. This was a mistake: it caused some minor confusion, and had we forgotten about the leftovers and run more testing, we could have corrupted live data.
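One way to make “roll back completely” checkable rather than a matter of memory is to snapshot the system state you care about before the change, then diff it against the state after the roll-back. A minimal sketch, with entirely hypothetical state keys:

```python
# Snapshot-and-diff check for a complete roll-back: record the observable
# state before the change, then verify the post-roll-back state matches it.
# The state keys below are hypothetical examples.

def snapshot(state):
    """Capture a copy of the state we care about before changing anything."""
    return dict(state)

def rollback_is_complete(before, after):
    """Complete only if nothing was left behind and nothing was altered."""
    leftovers = set(after) - set(before)
    changed = {k for k in before if after.get(k) != before[k]}
    return not leftovers and not changed

before = snapshot({"module": "v1", "config": "stable"})
# A leftover artefact from the abandoned roll-out:
after = {"module": "v1", "config": "stable", "staging_table": "v2-prep"}
print(rollback_is_complete(before, after))  # False: leftover artefact found
```

In our case the check would have flagged exactly the situation we created: the module was back to v1, but items staged for the next attempt were still in place.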
Lesson 4: Communicate
We explained the reasons for the roll-back and the steps we had taken to make sure that we wouldn’t experience the same difficulties again. This was trust-building, and others were supportive of our action.
Lesson 5: Rally and Retry
Once the roll-back was verified, the others went home. I stayed late to fix the problem that had caused the implementation issues.
This was a great learning experience for me. Today we did the implementation again. But this time, it went smoothly.