Up until about 10 years ago, I played a lot of basketball — three times a week, sometimes more.
One day, something went very, very wrong. I don’t know what happened, but I more or less had to drag my right leg off the court at the end of the game.
Happily, within just a few days, I was able to walk normally again. But when I went to the doctor, he diagnosed me with a hip injury. My basketball career ended right then and there.
Last month, after giving up skiing, long hikes, and pain-free sitting over the course of the preceding decade, I finally got a hip replacement (I’m a slow medical interventionist). The procedure went perfectly and I am well on the road to recovery.
But there was one hiccup. On the morning of the surgery, the doctor walked over to my hospital bed and informed me that the sterilizing machine was broken!
Fortunately, and in that calm, “nothing can ruffle me” voice that doctors, airline pilots, and kindergarten teachers seem to have mastered, he let me know that he had already sent someone to a sister facility to sterilize the necessary equipment. Sure enough, 90 minutes later, we were underway.
What’s Your Cloud Provider Continuity Plan?
I have no idea how often sterilization machines break, what it takes to repair them, or how close we came to rescheduling my surgery that day. But I do know this: my doctor had a backup plan, a way to maintain business continuity in the face of unforeseen obstacles.
How about you?
If the equivalent of a broken sterilization machine occurs with your cloud provider, does your operation grind to a halt? Do you have an action plan in place for responding? Are you even clear regarding which types of problems could cause an immediate and/or extended shutdown?
For example, many of our clients have their SaaS products hosted with Amazon Web Services (AWS). Fortunately, AWS doesn’t go down. Much. Recently. Well, other than on December 7, 2021, December 15, 2021, and December 22, 2021. Actually, that does seem like a lot!
Under these circumstances, what are the options?
For starters, and as discussed in a previous article, you can choose between “hot” and “warm” standby.
“Hot” standby means having redundant systems running all of the time. This approach maximizes uptime, but it can be expensive to operate. “Warm” standby means that the infrastructure is provisioned but requires manual cutover. It’s less expensive to maintain, but since the cutover does not happen automatically, there is typically a loss of service for some period of time.
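To make the difference concrete, here is a minimal Python sketch of the cutover decision. It is illustrative only: the `Site` class, the `pick_active_site` function, and the `operator_approved` flag are hypothetical names, not part of any cloud provider's API, and real failover involves DNS, load balancers, and data synchronization that this leaves out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Site:
    """A hypothetical deployment site (for example, a primary or a standby)."""
    name: str
    healthy: bool = True


def pick_active_site(primary: Site, standby: Site, mode: str,
                     operator_approved: bool = False) -> Optional[Site]:
    """Return the site that should serve traffic, or None if service is down.

    mode="hot":  the standby is already running, so traffic moves automatically.
    mode="warm": the standby is provisioned, but a person must approve cutover.
    """
    if mode not in ("hot", "warm"):
        raise ValueError(f"unknown standby mode: {mode}")
    if primary.healthy:
        return primary
    if not standby.healthy:
        return None
    if mode == "hot":
        return standby
    # Warm standby: service stays down until an operator triggers the cutover.
    return standby if operator_approved else None
```

With the primary down, the hot configuration returns the standby right away, while the warm one returns nothing until someone approves the cutover. That gap is the loss of service you accept in exchange for the lower cost.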
With the recent failures at AWS, there has been some discussion in the press about a different option: running in multiple cloud environments, such as Microsoft Azure and Google Cloud Platform (GCP). While we do have clients that do this, we advise against it for most medium-sized companies.
Why? Keeping a backup of your data on a different cloud platform is a great idea. But to get the most out of AWS, Azure, or GCP, you need to use the features specific to that platform. Maintaining a full redundant operation on a second platform can have major cost implications, because it requires building, running, and understanding two different technology stacks.
And, of course, the cost for any of these options comes at the expense of other strategies. For instance, could you make your single-cloud solution more resilient? (The answer is often yes.)
There Are No Universal Remedies
There is no single recipe for maintaining business continuity; the options and their tradeoffs will vary based on the particulars of your business and circumstances.
Here, though, are three things to consider for your cloud-based operations…
1. Run your infrastructure as code.
This means that instead of having a human manually configure the servers, services, and network devices in your cloud, you have code that can set everything up from scratch. Hypothetically, you could back up your data, destroy your infrastructure, and then run the code. From there, everything would be restored as it was.
Our clients use tools such as Terraform to accomplish this. From a resiliency standpoint, this ability allows companies to either bring everything down and back up quickly when there is a problem or run the infrastructure as code in another cloud region.
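Terraform has its own configuration language, so what follows is not Terraform code. It is a small Python sketch of the idea behind such tools: you declare the infrastructure you want as data, and code works out what to create, change, or remove to get there. The resource names and the `plan` and `apply` functions (loosely named after the Terraform commands) are hypothetical, and nothing here provisions real resources.

```python
# Desired infrastructure, declared as data. All names are hypothetical.
DESIRED = {
    "web-server": {"type": "vm", "size": "small"},
    "app-db": {"type": "database", "size": "medium"},
    "load-balancer": {"type": "lb", "size": "small"},
}


def plan(desired: dict, actual: dict) -> dict:
    """Compare desired and actual state and list the changes needed."""
    return {
        "create": sorted(n for n in desired if n not in actual),
        "update": sorted(
            n for n in desired if n in actual and actual[n] != desired[n]
        ),
        "delete": sorted(n for n in actual if n not in desired),
    }


def apply(desired: dict, actual: dict) -> dict:
    """Return a new actual state that matches the desired state."""
    changes = plan(desired, actual)
    result = {name: dict(spec) for name, spec in actual.items()}
    for name in changes["delete"]:
        del result[name]
    for name in changes["create"] + changes["update"]:
        result[name] = dict(desired[name])
    return result
```

Calling `apply(DESIRED, {})` models the "destroy everything and rebuild" case: starting from nothing, the result matches the declaration. Running `plan` again afterward reports no changes, which is the property that makes it safe to rerun the same code in another region.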
2. Improve your testing.
Why wait for a problem to occur? Test your solution with “chaos engineering” tools such as Chaos Monkey. These tools kill services in your implementation to see how the system responds. Does it spin up new resources or does it fail? A resilient solution will be able to handle failures of various components.
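The sketch below shows the shape of such an experiment in Python. It does not use Chaos Monkey or any real tool; the `Cluster` class and `run_chaos_experiment` function are hypothetical stand-ins that simulate killing one random instance per round and then checking whether the system recovers.

```python
import random


class Cluster:
    """A toy cluster that may or may not replace instances that die."""

    def __init__(self, desired: int, self_healing: bool):
        self.desired = desired
        self.self_healing = self_healing
        self.instances = [f"instance-{i}" for i in range(desired)]
        self._next_id = desired

    def kill(self, name: str) -> None:
        self.instances.remove(name)

    def reconcile(self) -> None:
        # A self-healing system spins up replacements; a brittle one does not.
        while self.self_healing and len(self.instances) < self.desired:
            self.instances.append(f"instance-{self._next_id}")
            self._next_id += 1


def run_chaos_experiment(cluster: Cluster, rounds: int, seed: int = 0) -> bool:
    """Kill one random instance per round; return True if service survives."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    for _ in range(rounds):
        if not cluster.instances:
            return False
        victim = rng.choice(cluster.instances)
        cluster.kill(victim)
        cluster.reconcile()
        if not cluster.instances:
            return False
    return True
```

A cluster built with `self_healing=True` survives any number of rounds, while one built with `self_healing=False` runs out of instances once the number of rounds reaches the number of instances. The point of the real tools is the same: find out which of the two you have before an outage tells you.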
3. Store data across multiple availability zones.
If there is a significant outage in a given availability zone and you are without access to the data needed for your service, having backup infrastructure doesn’t help. But, if you have the data replicated into different availability zones, you don’t need to wait for one zone to be repaired in order to restore service.
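As a rough illustration, here is a toy Python key-value store that copies every write to several zones and reads from whichever zone is still up. The `ReplicatedStore` class and the zone names are hypothetical, and the sketch skips what managed services do for you, such as catching a zone up on the writes it missed once it returns.

```python
class ReplicatedStore:
    """A toy key-value store that copies every write to several zones."""

    def __init__(self, zones):
        self.replicas = {zone: {} for zone in zones}
        self.down = set()

    def fail_zone(self, zone: str) -> None:
        """Simulate an outage in one zone."""
        self.down.add(zone)

    def write(self, key: str, value: str) -> None:
        # Copy the write to every zone that is currently reachable.
        for zone, data in self.replicas.items():
            if zone not in self.down:
                data[key] = value

    def read(self, key: str) -> str:
        # Serve the read from the first reachable zone that has the key.
        for zone, data in self.replicas.items():
            if zone not in self.down and key in data:
                return data[key]
        raise RuntimeError(f"no reachable zone holds {key!r}")
```

For example, after `store = ReplicatedStore(["zone-a", "zone-b", "zone-c"])` and a write, calling `store.fail_zone("zone-a")` does not stop `store.read(...)` from succeeding. Only when every zone is down does the read fail.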
Don’t Wait to Devise a Plan
As mentioned previously, every business is different. For example, maybe you are prepared to be down for 10 hours and deal with the consequences. That’s a business decision and for you it may be a perfectly reasonable one.
If, however, you are simply counting on the reliability of your cloud provider in the hope that they will “come back online soon,” you are rolling the dice.
Business resiliency is about identifying your vulnerabilities, determining your risk tolerance, and specifying your response options – all considered in the cool, calm light of day, prior to something going wrong. (Hint: “Oh no, what just happened and what do we do now?” is not a plan!)
Gotta run! I mean walk… it’s time for a slow walk.