Essential disaster recovery tips: lessons from ServerBee

ServerBee Blog
3 min read · Sep 26, 2024


Image by awesomecontent on Freepik

Disaster recovery is all about finding the right balance between the cost of the resources needed to save and restore important data, configurations, and databases, and the value of that data. It’s wise to invest in disaster recovery resources to reduce downtime and minimize impact, because losing data or infrastructure functionality, even for a short time, can be disastrous or cause major losses.

Our ServerBee team has extensive experience setting up reliable disaster recovery systems and automating them (auto disaster recovery). Along the way, we have identified the most common issues that make it hard to set up and run disaster recovery effectively, and now we’re ready to share them.

Budgets and resources: Reliable disaster recovery, especially auto disaster recovery mechanisms, requires significant resources. You need:

a) people to develop and update documentation and the disaster recovery plan, and to maintain automation processes;

b) technical capacity to ensure that creating and deploying backups doesn’t slow down the main infrastructure;

c) backup resources for testing infrastructure and data backups.

If the budget is insufficient or resources are too scarce, disaster recovery measures may be implemented too slowly or incompletely, and may lose relevance.

No IaC: Some teams rely too heavily on configuring infrastructure through a web interface because it provides a visual representation and reduces every action to a mouse click. However, without a ready IaC script for disaster recovery, you will lose a lot of time and incur losses, especially if the infrastructure is large. It’s better to spend a couple of hours upfront creating an IaC configuration (using tools like Ansible for on-premises or Terraform in the cloud) than to waste time on manual operations during an emergency.
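As an illustration, the difference between clicking through a console and replaying code can be sketched in a few lines of Python. The workspace and playbook names below are hypothetical; the point is that the whole recovery sequence lives in version-controlled commands rather than in someone’s memory:

```python
import shlex

def recovery_commands(env: str, workspace: str = "dr") -> list[str]:
    """Build the IaC commands a recovery runbook would execute,
    instead of recreating resources by hand in a web console.
    The workspace and playbook names here are illustrative."""
    if env == "cloud":
        # Terraform recreates cloud resources from versioned .tf files
        return [
            "terraform init",
            f"terraform workspace select {shlex.quote(workspace)}",
            "terraform apply -auto-approve",
        ]
    if env == "on-prem":
        # Ansible replays the recorded configuration on bare-metal hosts
        return ["ansible-playbook -i inventory/dr.ini site.yml"]
    raise ValueError(f"unknown environment: {env}")

print(recovery_commands("cloud"))
```

In a real setup, a wrapper like this (or a plain shell script) is what an on-call engineer runs at 3 a.m., so the exact commands are reviewed and tested long before the disaster.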

Lack of modular design in infrastructure configurations: Some companies set up their infrastructure in a way that makes it hard to break it into separate, independently recoverable parts. This complicates backups and recovery and slows down maintenance: instead of handling multiple tasks at once, the team has to wait for dependent processes, wasting time. It’s better to split configurations into separate modules (IaC, network settings, DNS records, Persistent Volume data, static files, and databases) so you can work on and recover these parts independently and, if needed, in parallel.
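To sketch the payoff of modular configurations, here is a minimal Python example. The restore steps are stubbed out (real ones would call your own tooling: Terraform, a DNS API, pg_restore, rsync, and so on), but it shows independent modules being recovered in parallel rather than one blocking the next:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-module restore step; a real one would invoke
# the tooling specific to that module (terraform, pg_restore, ...).
def restore(module: str) -> str:
    return f"{module}: restored"

MODULES = ["iac", "network", "dns", "persistent-volumes",
           "static-files", "databases"]

# Because the modules are independent, they can be recovered in parallel
with ThreadPoolExecutor(max_workers=len(MODULES)) as pool:
    results = list(pool.map(restore, MODULES))

for line in results:
    print(line)
```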

A ready disaster recovery plan and regular backup checks: This is good advice in a stable environment. Sometimes, however, the state of a project’s or company’s infrastructure changes too quickly, and any effort to create a disaster recovery plan becomes useless due to inconsistencies between the primary and backup environments, such as Kubernetes versions, container images, and dependencies. Dividing configurations into modules and storing them in a version control system (like Git) offers many advantages and forms the foundation of a disaster recovery plan that stays functional as the infrastructure evolves.
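One way to catch such drift early is to regularly compare recorded component versions between the two environments. The sketch below is a hypothetical Python check with made-up version records; in practice the dictionaries would be populated from your cluster and image registries:

```python
def drift(primary: dict, backup: dict) -> dict:
    """Return components whose recorded versions differ between
    the primary and backup environments."""
    return {k: (primary.get(k), backup.get(k))
            for k in primary.keys() | backup.keys()
            if primary.get(k) != backup.get(k)}

# Illustrative version records for each environment
primary = {"kubernetes": "1.30", "app-image": "api:2.4.1", "postgres": "16"}
backup  = {"kubernetes": "1.28", "app-image": "api:2.4.1", "postgres": "15"}

print(drift(primary, backup))
# e.g. {'kubernetes': ('1.30', '1.28'), 'postgres': ('16', '15')}
```

Running a check like this on a schedule turns "the backup environment silently fell behind" into an alert you can act on.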

Additional advice:

Separate your infrastructure nodes into different zones: fire, industrial, provider, or geographical. In reality, database replicas, network setups for container clusters, monitoring systems, and storage might all sit on the same floor of a data center, and a fire, sabotage, or natural disaster could ruin your best fault tolerance and high availability plans. By spreading your nodes across different zones, you comply with many security standards (like PCI DSS) and protect parts of your infrastructure if something happens in one or more nearby zones. Also, store backups with different providers so you can still access them if you lose the connection to one provider or it becomes unreliable.
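A quick sanity check along these lines can be automated. The placement records below are made up, but a script like this, run against your real inventory, would flag node groups that share a single provider-and-zone pair:

```python
# Hypothetical placement records: node -> (provider, zone)
placements = {
    "db-primary":  ("provider-a", "zone-1"),
    "db-replica":  ("provider-b", "zone-3"),
    "monitoring":  ("provider-a", "zone-2"),
    "backup-copy": ("provider-b", "zone-3"),
}

def colocated(nodes: list[str]) -> bool:
    """True if all the given nodes share one (provider, zone) pair,
    meaning a single fire or outage could take them all out."""
    return len({placements[n] for n in nodes}) == 1

# The primary and its replica live in different zones: good
assert not colocated(["db-primary", "db-replica"])
# The replica and the backup copy share a zone: worth fixing
assert colocated(["db-replica", "backup-copy"])
```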

If you’re setting up auto disaster recovery for your infrastructure, these tips will be helpful. If you have any specific questions, feel free to ask.


ServerBee Blog

We specialize in scalable DevOps solutions. We help companies support critical software applications and infrastructure on AWS, GCP, Azure, and even bare metal.