Last updated on 25 May 2022
A typical question we ask new customers is: do you have proper backups in place? Some of them have some form of backups, but answer no. The others give a wrong answer. Here’s a checklist of things you should have in place.
If you want to have reliable, rock-solid backups, ask yourself these questions:
- Do you have clear, measurable performance and availability goals?
- Are your backups automated?
- Do you take your backups frequently enough?
- In case of a failure, can you restore a backup quickly?
- In case the latest backup is not available, can’t be restored or contains corrupted data, can you recover an older one?
- Do you have an automated restore procedure?
- Is it regularly, automatically tested?
- Do you monitor backups?
- Do you need to compress backups?
- Do you need to encrypt backups?
- Do you monitor restore procedures?
- When a backup fails, is another backup ready to be used?
- Do you understand that “the cloud” is not magic?
The bullet points try to follow a logical order, not an importance order. The list is incomplete. The exact bullets and their importance vary depending on your organisation.
Performance and availability goals
Do you have clear, measurable performance and availability goals? In other words, do you have SLOs (Service Level Objectives)? You can (and probably should) even have different SLOs for different services, because services/pages don’t have the same importance. For example, the user registration page is more critical than the About Us page.
Objectives look like this:
- The page must load in < 1 second.
- Slowdowns that affect > 2% of users must be no longer than 1 hour, and occur no more than twice a month, with a minimum distance of 7 days from each other.
- Unplanned outages must not last more than 30 minutes.
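An objective written this way can be checked mechanically once incidents are recorded. The following is a minimal sketch for the slowdown objective, assuming that every slowdown affecting more than 2% of users in a given month is logged as a (start, end) pair. The function name, the data layout and the default thresholds are hypothetical and only mirror the example above.

```python
from datetime import datetime, timedelta

def slowdowns_meet_slo(slowdowns, max_duration=timedelta(hours=1),
                       max_per_month=2, min_gap=timedelta(days=7)):
    """Check one month of slowdowns against the example objective.

    `slowdowns` is a list of (start, end) datetime pairs, one per slowdown
    that affected more than 2% of users. The thresholds are assumptions
    taken from the example objective; adapt them to your own SLOs.
    """
    ordered = sorted(slowdowns)
    if len(ordered) > max_per_month:
        return False
    if any(end - start > max_duration for start, end in ordered):
        return False
    # Distance is measured from the end of one slowdown to the start of the next.
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start - prev_end < min_gap:
            return False
    return True
```

If an objective cannot be expressed as a check like this one, it is probably not measurable enough yet.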
SLOs should determine our choices when it comes to backups, as our decisions will determine how much recent data we could lose in case of an incident, and how much time is needed to restore a backup.
Objectives must be realistic, and they must take into account the actual losses that performance problems and outages cause. This means that you should know how many sales you will lose in a 30-minute slowdown, and how many users will leave your website and never come back (farewell, CLS). You should quantify these losses, and then you’ll know how much you can invest in proper interventions that aim to avoid them.
Really, calculate these numbers. If you can’t, use some guesstimates and plan to collect proper metrics in the future. Your decisions will be more rational.
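As a rough illustration of this kind of calculation, here is a sketch that estimates what a single incident costs. Every input is a hypothetical guesstimate, and the split between immediate and long-term losses is an assumption made for this example, not a standard formula.

```python
def incident_cost(duration_minutes, orders_per_minute, average_order_value,
                  lost_order_fraction, users_lost_forever=0,
                  value_per_lost_user=0.0):
    """Estimate the money lost in one slowdown or outage.

    Immediate loss: orders that would have been placed but were not.
    Long-term loss: users who leave and never come back.
    """
    lost_orders = duration_minutes * orders_per_minute * lost_order_fraction
    immediate = lost_orders * average_order_value
    long_term = users_lost_forever * value_per_lost_user
    return immediate + long_term

# Hypothetical 30-minute slowdown: 4 orders per minute, 50.0 per order,
# a quarter of the orders lost, 20 users gone for good, each worth 300.0.
cost = incident_cost(30, 4, 50.0, 0.25, users_lost_forever=20,
                     value_per_lost_user=300.0)
print(cost)  # 7500.0
```

Multiplying a figure like this by the number of incidents you expect in a year gives an upper bound on what is worth spending to prevent them.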
Are your backups automated? Manual backups should never be relied on. People don’t take them regularly, make mistakes, etc. Seriously… backups should be automated.
Do you take your backups frequently enough? This depends on how much recent data you can afford to lose. A common decision is to take them every 24 hours. That is fine only if you can lose up to one day of data in case of failure.
Sure, backups are heavy operations. But they can (should!) be taken from a replica or an unused node of a cluster, to avoid slowing down production. Sure, they also take space. But you can probably use incremental backups from a replica or even from the master.
Short term availability
In case of a failure, can you restore a backup quickly? The time to recover includes the time that is necessary to make a backup available, so a copy of the latest backup should probably be kept on the database server. Maybe even on the same disk, although this means that, if the disk is damaged, you will have to copy a backup from elsewhere. Using a Network Attached Storage could be a good compromise.
If your database is a cloud instance, you can at least make sure that the latest backup is in the same private network.
The time to recover also includes the interval of time after the corruption happens and before you take action. So, make sure your monitoring and alerting systems are adequate.
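The components of the time to recover can be added up to see whether an availability objective is reachable at all. This is a back-of-the-envelope sketch: the function name is hypothetical, and it assumes that transfer and restore speeds are roughly constant, which is rarely exactly true.

```python
def estimated_recovery_minutes(detection_minutes, backup_size_gb,
                               transfer_gb_per_minute, restore_gb_per_minute,
                               backup_is_local=False):
    """Rough time to recover: detect the problem, fetch the backup, restore it."""
    if backup_is_local:
        # The backup is already on the database server: nothing to copy.
        transfer = 0.0
    else:
        transfer = backup_size_gb / transfer_gb_per_minute
    restore = backup_size_gb / restore_gb_per_minute
    return detection_minutes + transfer + restore

# Hypothetical 200 GB backup, 10 minutes to notice the incident,
# 5 GB/minute over the network, 4 GB/minute to restore.
print(estimated_recovery_minutes(10, 200, 5, 4))                        # 100.0
print(estimated_recovery_minutes(10, 200, 5, 4, backup_is_local=True))  # 60.0
```

With these made-up numbers, neither option fits an objective of 30 minutes for unplanned outages, which suggests that restoring a backup alone cannot meet it.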
Long term availability
In case the latest backup is not available, can’t be restored or contains corrupted data, can you recover an older one? An older backup should be archived somewhere safe. This means that recovering may take time. This will only happen in extraordinary cases, so a longer recovery time should be acceptable. Normally you won’t rely on archived backups. Yet, you should check that restore won’t take too much time.
Do you have an automated restore procedure? Is it regularly, automatically tested?
Automated backups should be tested with an automated procedure. This can easily be combined with another need that your organisation surely has: feeding staging databases. Backups can be restored into staging database servers every night.
Having a restore script means that you can run it in case of need. This will make restore faster, documented, and will avoid human mistakes.
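A restore test can be as simple as a wrapper that runs the restore, runs a sanity check, and reports the outcome and the duration. The sketch below does not assume any particular database tool: `restore_cmd` and `verify_cmd` are hypothetical argument lists that you would replace with your own commands.

```python
import subprocess
import time

def run_restore_test(restore_cmd, verify_cmd, timeout_seconds=3600):
    """Run a restore, then a sanity check, and report outcome and duration.

    `restore_cmd` and `verify_cmd` are argument lists for whatever tools
    your database uses. Each step gets its own timeout.
    """
    started = time.monotonic()
    try:
        subprocess.run(restore_cmd, check=True, timeout=timeout_seconds)
        subprocess.run(verify_cmd, check=True, timeout=timeout_seconds)
        ok = True
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired, OSError):
        # Non-zero exit code, timeout, or a command that could not be started.
        ok = False
    return {"ok": ok, "seconds": time.monotonic() - started}
```

A nightly job that feeds a staging server could call this function and send the returned dictionary to your monitoring system, so that both failures and slow restores become visible.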
Do you monitor backups? You should monitor that they exist, they’re not empty, their size, and how much time they take.
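Most of these checks can be run against the backup file itself. A minimal sketch, assuming that a backup is a single file and that the size of the previous backup is known; the function name and the thresholds are hypothetical. The time a backup takes cannot be read from the file, so the backup job has to record it.

```python
import os
import time

def check_backup(path, previous_size=None, max_age_hours=24,
                 max_shrink_ratio=0.5):
    """Return a list of problems found with a backup file (empty if none)."""
    if not os.path.isfile(path):
        return ["backup file is missing"]
    problems = []
    stat = os.stat(path)
    if stat.st_size == 0:
        problems.append("backup file is empty")
    age_hours = (time.time() - stat.st_mtime) / 3600
    if age_hours > max_age_hours:
        problems.append("backup is older than expected")
    # A backup much smaller than the previous one often means missing data
    # or an interrupted backup job.
    if previous_size and stat.st_size < previous_size * max_shrink_ratio:
        problems.append("backup is much smaller than the previous one")
    return problems
```

An empty list means that no problem was detected by these checks. It does not prove that the backup can be restored.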
Do you monitor restore procedures? You should check that they don’t fail, and how much time they take.
Do you need to compress backups?
Backup compression will help reduce the costs of storing and moving around your backups. This includes the resources you have to pay for, but also the time taken by backup-related operations (archiving, restoring).
But remember: everything you do with your backups will add some complexity and increase the likelihood of a backup failure, or a restore failure. Monitor your backup sizes as mentioned above, and therefore monitor the cost of keeping those backups. If compression is not necessary, you may prefer to leave backups uncompressed. This is especially true for the latest backup.
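If you do compress, one way to contain the extra risk is to verify each compressed backup right after creating it. The following sketch uses gzip from the Python standard library and compares checksums before and after; the function names are hypothetical, and many backup tools offer their own compression and verification options that may be a better fit.

```python
import gzip
import hashlib
import shutil

def sha256_of(stream):
    """Hash a binary stream in chunks, so large backups are never fully in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(1024 * 1024), b""):
        digest.update(chunk)
    return digest.hexdigest()

def compress_and_verify(source_path, target_path):
    """Gzip a backup, then confirm it decompresses to identical bytes."""
    with open(source_path, "rb") as src, gzip.open(target_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    with open(source_path, "rb") as src:
        original = sha256_of(src)
    with gzip.open(target_path, "rb") as packed:
        restored = sha256_of(packed)
    return original == restored
```

The uncompressed file should be deleted only after this function returns `True`.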
Do you need to encrypt backups?
Applicable regulations, like GDPR, and your organisation’s policies determine if your backups should be encrypted. If so, monitor that they are encrypted and can successfully be decrypted. Make sure you use secure algorithms, and make sure you’ll receive an alert if a vulnerability is discovered in the encryption software you use.
When a backup fails, is another backup ready to be used?
Even if the restore automation, the monitoring, and the alerting are perfect, the only thing they do is let you know that a backup failed. Hopefully you can fix the problem before the next scheduled backup, but in the meantime you don’t have a valid backup.
For this reason, you should have more than one backup strategy in place. When needed, you will try to restore your fastest, most reliable backup. If that fails, you will try to restore progressively slower or less reliable backups. You may have snapshots as a primary strategy.
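The fallback order can be encoded in the restore automation itself. A minimal sketch, assuming that each strategy is wrapped in a function that raises an exception when its restore fails; the function and strategy names are hypothetical.

```python
def restore_with_fallback(strategies):
    """Try restore strategies in order, fastest and most reliable first.

    `strategies` is a list of (name, restore_function) pairs. Each function
    must raise an exception if its restore fails.
    Returns the name of the strategy that worked and the recorded failures.
    """
    failures = []
    for name, restore in strategies:
        try:
            restore()
        except Exception as error:
            # Remember why this strategy failed, then try the next one.
            failures.append((name, str(error)))
            continue
        return name, failures
    raise RuntimeError(f"all restore strategies failed: {failures}")
```

A call could look like `restore_with_fallback([("snapshot", restore_snapshot), ("dump", restore_dump)])`, where both functions are yours to write. The recorded failures are worth sending to your alerting system even when a later strategy succeeds.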
Do you understand that “the cloud” is not magic? By “you” I mean your team, your management, and whoever is involved. Everyone needs to understand that your favourite cloud provider fails often, and cannot guarantee that the snapshots it provides will always work.
Your vendor has written somewhere, somehow, what they do or don’t guarantee. Did you read those warnings? This should save their a**e if you take them to court. Saving yours in case of a backup failure is up to you.
We tried to summarise the features that your database infrastructure should have. If they’re missing, you have some degree of technical debt.
If you need help with backup automation and backup testing automation, consider our database automation service.