ServerBee Blog
4 min read · Oct 10, 2024

Backups, snapshots, and replication: avoiding common mistakes in data recovery

Image by freepik

Enterprise IT infrastructure has become technologically heterogeneous and fast-changing, which complicates backup strategies for data, databases, and infrastructure configurations, as well as their recovery. Whether IT companies use cloud services or on-premises storage systems for backups, most of our clients face similar pain points caused by failed backup strategies. Here are some common issues:

  1. How are the backup plan and the recovery process for data and infrastructure documented? Many IT companies do have a plan that explains what gets backed up, where it’s stored, the keys needed for recovery, and who’s responsible. During an incident, however, this documentation may turn out to be outdated or unhelpful: in an emergency you need clear, straightforward steps to restore the system quickly. It’s better to have a step-by-step guide that details each action and how long it will take.
  2. Are backup and recovery scripts tested? Backup creation and restoration are usually automated with scripts, and you should regularly check how these scripts are performing. If you don’t review backup metrics, logs, integrity and size, the validity of secret keys, and available storage space, you can’t be sure your recovery will be quick and successful. Although this takes some resources and time, it helps you avoid significant delays during emergencies by ensuring that: a) you don’t waste time trying to recover broken backups; b) your databases and data are consistent, which you confirm through selective test restores and checks of key indicators.
  3. Backups vs. Replication. Some IT companies prefer to replicate live data and sync it at set intervals. This approach allows for easy retrieval and rollback of small amounts of data when needed, but only within a certain time window before the data syncs again. If there’s an issue, this window is crucial: you must receive a warning (which may not come in time) and stop the sync. If that window is missed, errors in the primary replica are copied to the standby. Replication therefore isn’t a true backup and isn’t meant for recovery in case of data loss or disaster, though many companies use it this way despite the risks.
  4. Backups vs. Snapshots. For large infrastructures, it’s tough to create a solid backup strategy that works both at the cluster level and at the application level. Copying and archiving (not to mention test-restoring) tens or hundreds of terabytes of data, databases, and configurations of thousands of containers and clusters is difficult or impractical due to the time and resources needed. Instead of traditional backups, comprehensive data storage systems are used to take “instant” snapshots of data. On-premises environments use systems like DAS/NAS, SAN, and Unified Storage to create snapshots that save only the changes in data and sync them with remote replicas at intervals. This saves time and money for large infrastructures. However, cloud storage systems are often more advanced and offer more features, better convenience, higher security, and compliance with certain requirements, along with unique technologies. Examples include NetApp technologies such as FlexArray, FlexClone, FlexVol, MetroCluster, SyncMirror, SnapRestore, SnapCenter, and others. Many companies find these cloud solutions more cost-effective for backups than buying their own storage systems and hiring staff to maintain them and develop automated backup and disaster recovery software.
  5. Is troubleshooting conducted? Failures can happen unexpectedly in any system. If you don’t regularly rehearse troubleshooting, you won’t know: a) how quickly your team can fix issues; b) whether your disaster recovery plan is still relevant; c) whether your backups, snapshots, and replicas are working properly; or d) what risks certain failures could create. This makes troubleshooting important. In development or testing environments, run regular planned and surprise team drills, not only to get answers and predict what will happen in real situations, but also to refine recovery procedures until they are fast and well-practiced.
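The routine checks from point 2 can be sketched in a short script. This is a minimal example, not a complete solution: the file names, size threshold, and age limit are illustrative assumptions you would replace with your own values.

```python
import gzip
import hashlib
import time
from pathlib import Path

MAX_AGE_HOURS = 26     # assumed daily cycle: alert if the newest backup is older
MIN_SIZE_BYTES = 1024  # a suspiciously small dump usually means a failed export

def sha256_of(path: Path) -> str:
    """Stream the file through SHA-256 so large dumps aren't loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_backup(dump: Path, expected_sha256: str) -> list[str]:
    """Return a list of problems found; an empty list means the backup looks sane."""
    problems: list[str] = []
    if not dump.exists():
        return [f"{dump} is missing"]
    if dump.stat().st_size < MIN_SIZE_BYTES:
        problems.append(f"{dump} is only {dump.stat().st_size} bytes")
    age_hours = (time.time() - dump.stat().st_mtime) / 3600
    if age_hours > MAX_AGE_HOURS:
        problems.append(f"{dump} is {age_hours:.1f} hours old")
    if sha256_of(dump) != expected_sha256:
        problems.append(f"{dump} checksum mismatch")
    if dump.suffix == ".gz":
        # Decompress end to end to catch truncated or corrupted archives.
        try:
            with gzip.open(dump, "rb") as fh:
                while fh.read(1 << 20):
                    pass
        except OSError:
            problems.append(f"{dump} is not a valid gzip archive")
    return problems
```

Checks like these cover only the mechanics of the backup file itself; the consistency of the restored database still needs the selective test restores described above.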

We’ve identified five major issues with poor backup strategies:

1) lack of a backup plan or clear step-by-step instructions;

2) failure to check the results of backup and recovery scripts;

3) using replication instead of backups and snapshots, risking serious problems and revenue loss from extended downtime while fixing synced errors;

4) budget overruns due to poor choice of tools and storage services — cloud storage might be cheaper, or on-premises solutions might be better;

5) lack of regular troubleshooting, leaving the team unprepared for failures, leading to significant time loss during real incidents.
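On the last point, a recovery drill is easier to repeat if it is scripted: time each recovery step and fail the drill when the total exceeds the agreed recovery time objective. The sketch below is a hypothetical harness; the step names and the RTO value are assumptions, and each callable would wrap a real action such as fetching the backup or restoring the database.

```python
import time
from dataclasses import dataclass
from typing import Callable

RTO_SECONDS = 15 * 60  # assumed recovery time objective for the drill

@dataclass
class DrillResult:
    step: str
    seconds: float
    ok: bool

def run_drill(
    steps: dict[str, Callable[[], bool]], rto: float = RTO_SECONDS
) -> tuple[list[DrillResult], bool]:
    """Time each recovery step in order; the drill passes only if every step
    succeeds and the total time stays within the RTO."""
    results: list[DrillResult] = []
    total = 0.0
    for name, step in steps.items():
        start = time.monotonic()
        ok = step()
        elapsed = time.monotonic() - start
        total += elapsed
        results.append(DrillResult(name, elapsed, ok))
    passed = all(r.ok for r in results) and total <= rto
    return results, passed
```

Recording the per-step timings over successive drills also gives you the data to keep the step-by-step recovery guide from point 1 honest about how long each action takes.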

If we missed something important or if you have interesting experiences, we’d love to hear your thoughts.

ServerBee Blog

We specialize in scalable DevOps solutions, helping companies support critical software applications and infrastructure on AWS, GCP, Azure, and even bare metal.