Disasters Will Happen
Management is about looking around the corners to anticipate what’s coming next: enormous new business, well-funded competition, new platforms, departures of key workers. Anything at all that could disrupt the ability of your business to function.
Fundamentally, management is about preparing your team for an uncertain future, not just anticipating potential disruptions.
In software development, in a SaaS model, a worst-case operational scenario is often a major data center outage. Or, in AWS terminology, a full region outage. There are various technical strategies for disaster recovery, but they are beyond the scope of this article.
I want to be clear on what disaster is. It is an absolute certainty.
Disasters do not require bombs, tornados, or earthquakes. They may merely require some human pressing the wrong button. Or an AWS software update (they roll them out every day) that fails. Or a hardware failure. Or a hacker. Disaster will happen. Knowing that disaster is a certainty makes it more palatable to focus on preparing for it.
The very simple point here is that management of business-critical software systems requires proving that the necessary level of fault tolerance actually works. And that it will work immediately when needed. Which could be RIGHT NOW! (But hopefully not, so you can keep reading).
No fault tolerance works and no disaster recovery works unless it’s proven to be working consistently. By way of actual testing. Which requires a time commitment.
Just like any other feature of a product, disaster recovery is not something you can think about once a year and ignore for the rest of time. It needs constant validation as the system evolves.
Management is about compelling others to remain focused on what matters. If your system cannot go down, then disaster recovery matters. In order for it to be there for you when you need it, test every inch of it on a regular cadence. We test every system every 6 weeks now.
Testing means exercising everything that you assume works but haven't proven works: every database failover, every web service failover, every piece of web site resilience. Everything that matters. Some things may not matter. At least identify what those things are and ignore them intentionally.
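One lightweight way to keep that inventory honest is to track, per system, the date of the last successful drill and flag anything that has drifted past the cadence. The sketch below assumes a 6-week cadence as described above; the system names and the inventory structure are hypothetical, purely for illustration.

```python
from datetime import date, timedelta

# Testing cadence: anything not proven within this window is considered untested.
CADENCE = timedelta(weeks=6)

def overdue_systems(last_tested, today):
    """Return systems whose last successful DR drill is missing or older than CADENCE.

    last_tested maps a system name to the date of its last *successful* drill,
    or None if the failover has never actually been exercised.
    """
    overdue = []
    for system, tested_on in last_tested.items():
        if tested_on is None or today - tested_on > CADENCE:
            overdue.append(system)
    return sorted(overdue)

# Hypothetical inventory of everything that matters.
inventory = {
    "primary-db-failover": date(2024, 5, 1),
    "web-service-failover": None,          # assumed to work, never proven
    "site-static-fallback": date(2024, 6, 10),
}

print(overdue_systems(inventory, today=date(2024, 6, 20)))
# → ['primary-db-failover', 'web-service-failover']
```

The key design choice is treating "never tested" the same as "overdue": in this model, an unexercised failover gets no credit for existing.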
Remember, you’re not simply testing the software, you’re also confirming that the engineers understand what to do and how to do it when the pressure is on.
Here’s how it works: If you didn’t test it, then it doesn’t work. And that’s how it works.
So if you’re not able to test regularly, then it’s probably more cost effective to have no DR strategy at all, rather than a zombie DR environment staged somewhere, many months out of date.
I have found that this diagram is persuasive to people who should be concerned with DR planning and testing but are distracted by other things. Use it freely.