Do you remember the infamous GitLab database incident in 2017? Three backup strategies in place… but no test. When they needed to quickly restore a backup there was none. You don’t want to be in that situation.
Here’s a high level discussion about how to test database backups. This article is not technical, and it’s not related to a specific technology.
NOTE: Sorry for often mentioning GitLab incident. I use it as an example to explain people that backups are important. I have no intention of trash-talking about GitLab. I admire their service, and the honesty they shown when they explained all the incident details in a public port-mortem.
Whichever backup and test strategy you are going to use, it needs to be automated. This has several benefits:
- Testing will actually happen. Unless your team is far less busy than all teams I’ve seen, finding time to do things is an actual problem. Too often one has to postpone important but not urgent tasks.
- You’ll test if your latest backup is valid.
- You’ll test if your restore procedure is valid.
- The restore procedure is scripted (a Bash script, an Ansible playbook, whatever) so it can be repeated quickly and with confidence.
- Your restore procedure is documented, even if you spend zero seconds to document it (though writing some explanation may be a good idea).
- Proper monitoring is done.
Several aspects of the backup and restore procedures should be monitored:
- Backup exists.
- Backup is not empty.
- Backup size. You want to know the costs of storing it and moving it.
- Backup time. This is important for the RPO (Recovery Point Objective).
- Restore succeeds:
- DBMS can be started.
- Data is readable afterwards (by someone other than root, so you can test that permissions are properly restored).
- Recovery time. Important for the RTO (Recovery Time Objective).
When a backup fails, a restore fails, or any of the other metrics becomes worrying, an alert should be triggered.
Backup compression and encryption
If you compress and/or encrypt your backups, you should do that before the test happens. The restore procedure will have to uncompress and decrypt backups.
You will monitor that these fundamental parts of the restore work properly, and how much time they take. You will also monitor size of the database before and after compression, to make sure that compressing it is worth the trouble.
The target server/instance
To test a backup, you’ll need to restore it somewhere. But maintaining a server just to test backups once a day is a waste of resources, and many organisations can’t afford such wastes.
If your database servers run in the cloud or in virtualised environments, a possible solution is to spin an instance (or VM, or container) once a week to test the restore procedure, and then delete it. This is not a big waste.
A more efficient solution would be to restore a backup into a server that exists for other reasons, when it is not used. But the point of testing backups is that, if something goes wrong, the restore procedure will fail. So this server should not be vital.
A great example of such servers are the database servers in your test environment. Here are some hints:
- Make sure that they are as similar as possible to production. They will probably have less resources, but that shouldn’t be a problem. Vital aspects that should be identical include the DBMS version, the DBMS configuration, and the versions of tools involved in the restore.
- Automating servers deployment and avoiding manual changes will help a lot.
- The test should happen when the usage is low. Typically, this means during the night. If restoring a backup methods takes too many hours (this could very well happen with a logical backup), it could only be restored during the week end.
- Do you have multiple backup strategies? Use each backup type to feed different servers. For example a ZFS snapshot can be restored on certain servers, and a dump can be restored into other servers.
- Restore a backup into one server at the time. If restoring the backup on the first server fails, don’t try restoring it into the next servers. So you will still have working servers.
Sometimes, when developers deploy their changes in testing, they may find out that some database servers are down. If they have some working databases, this will not be a huge problem. In any case, a database can always be restored. But this takes time, right? You can do something about it, tho. For example, your test procedure could take a snapshot from the test database, before trying to restore a backup.
What we left out
I didn’t mention any specific DBMS or backup tool. This was intentional.
We talked about using test database servers to test production backups. As far as the backup testing is concerned, the discussion is probably complete. But there are two big problems to consider:
- Data anonymising. Some regulations require it, including our beloved GDPR. Developers shouldn’t have access to users personal information.
- Data reduction. Some production databases are very big, and restoring them completely into the test environment is not reasonable.
There are several approaches to data anonymising and data reduction, but that is out of the scope of this article. I might write about these topics in the future.
If you need help with backup automation and backup testing automation, consider our database automation service.