Validating a Backup Strategy with Restore

Jeff Atwood has recently lost two of his blog sites: Blog @ Stackoverflow and Coding Horror. He managed to recover the contents of both websites, but what lessons are to be learned from this event?

In a January 2008 post on his blog about why and how to back up one's content, Jeff concludes:

One thing's for sure: until you have a backup strategy of some kind, you're screwed, you just don't know it yet. If backing up your data sounds like a hassle, that's because it is. Shut up. I know things. You will listen to me. Do it anyway.

Jeff did have a backup strategy for Stackoverflow’s blog and Coding Horror: both sites were hosted by CrystalTech, which regularly backed up the entire sites. How, then, was it possible to lose the websites' content? The websites ran inside a virtual machine, and CrystalTech was backing up the VM images, but those images were corrupted. The backup therefore consisted of VM images that were useless, because the VM could not be started from them. At first, Jeff blamed both himself and the host:

Jeff Atwood: ugh, server failure at CrystalTech. And apparently their normal backup process silently fails at backing up VM images

Jeff Atwood: for the record, I blame 50% hosting provider, 50% myself (don't trust the hosting provider, make your OWN offsite backups, too!)

With the help of some users, Jeff managed to recover the lost contents of his blogs. Rich Skrenta helped him get the text back:

I was able to get a static HTML version of Coding Horror up almost immediately thanks to Rich Skrenta. He kindly provided a tarball of every spidered page on the site. Some people have goals, and some people have big hairy audacious goals. Rich's is especially awe-inspiring: taking on Google on their home turf of search. That's why he just happened to have a complete text archive of Coding Horror at hand. Rich, have I ever told you that you're my hero? Anyway, you're viewing the static HTML version of Coding Horror right now thanks to Rich. Surprisingly, there's not a tremendous amount of difference between a static HTML version of this site and the live site. One of the benefits of being a minimalist, I suppose.

The images, meanwhile, came from Carmine Paolino, a user who happened to have a “nearly complete mirror of the site backed up on his Mac! Thanks to his mirror, we've now recovered nearly 100% of the missing images and content.” Although relieved after recovering the sites, Jeff ended up blaming himself:

Because I am an idiot, I didn't have my own (recent) backups of Coding Horror. Man, I wish I had read some good blog entries on backup strategies!

He concluded:

What can we all learn from this sad turn of events?

  1. I suck.
  2. No, really, I suck.
  3. Don't rely on your host or anyone else to back up your important data. Do it yourself. If you aren't personally responsible for your own backups, they are effectively not happening.
  4. If something really bad happens to your data, how would you recover? What's the process? What are the hard parts of recovery? I think in the back of my mind I had false confidence about Coding Horror recovery scenarios because I kept thinking of it as mostly text. Of course, the text turned out to be the easiest part. The images, which I had thought of as a "nice to have", were more essential than I realized and far more difficult to recover. Some argue that we shouldn't be talking about "backups", but recovery.
  5. It's worth revisiting your recovery process periodically to make sure it's still alive, kicking, and fully functional.
  6. I'm awesome! No, just kidding. I suck.

Joel Spolsky presented several other situations in which a backup strategy fails unless a restore has been validated:

  • The backed-up files were encrypted with a cryptographically-secure key, the only copy of which was on the machine that was lost
  • The server had enormous amounts of configuration information stored in the IIS metabase which wasn’t backed up
  • The backup files were being copied to a FAT partition and were silently being truncated to 2GB
  • Your backups were on an LTO drive which was lost with the data center, and you can’t get another LTO drive for three days
  • And a million other things that can go wrong even when you “have” “backups.”

The minimum bar for a reliable service is not that you have done a backup, but that you have done a restore. If you’re running a web service, you need to be able to show me that you can build a reasonably recent copy of the entire site, in a reasonable amount of time, on a new server or servers without ever accessing anything that was in the original data center. The bar is that you’ve done a restore.

Let’s stop asking people if they’re doing backups, and start asking if they’re doing restores.
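The "do a restore" bar can be automated in a small way. The following is a minimal Python sketch, not anything Joel or Jeff describe using: it assumes the backup is a gzip-compressed tarball of a directory tree, restores it into a scratch directory, and compares SHA-256 digests against the live files. The function names `make_backup` and `verify_restore` are hypothetical and chosen only for illustration.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def make_backup(source: Path, archive: Path) -> None:
    """Write every file under source into a gzip-compressed tarball.

    The archive must live outside source, or it would try to include itself.
    """
    with tarfile.open(archive, "w:gz") as tar:
        for p in sorted(source.rglob("*")):
            if p.is_file():
                tar.add(p, arcname=p.relative_to(source).as_posix())


def verify_restore(source: Path, archive: Path) -> list[str]:
    """Restore the archive into a scratch directory and compare it to source.

    Returns the relative paths that are missing on either side or whose
    contents differ. An empty list means the restore matched the source.
    """
    expected = manifest(source)
    with tempfile.TemporaryDirectory() as scratch:
        # The archive was produced by make_backup above, so it is trusted here.
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(scratch)
        restored = manifest(Path(scratch))
    return sorted(
        name
        for name in expected.keys() | restored.keys()
        if expected.get(name) != restored.get(name)
    )
```

A check like this would only catch a backup that cannot be unpacked or that no longer matches the source. It does not prove that the whole site can be rebuilt on a new server, which is the stronger bar set in the quote above.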

Jeff has moved the blogs in question to the data center that hosts Stackoverflow and the other servers in the family, which is more reliable and has a better backup strategy:

  1. We take full database backups of all databases at 4 AM, 4 PM, and 12 AM. (some databases are backed up more aggressively, but this is typical.) These full database backups are stored on our NAS RAID-6 device on the rack at the PEAK datacenter.
  2. We have a 500 GB USB hard drive attached to the database server. There is a C# script which copies the latest backups from the NAS to the USB hard drive every night at around 1 AM. The oldest files are deleted to make room for the new files as necessary. (The current Stack Overflow full backup is about 7 GB compressed, and the other databases are perhaps 2 GB compressed.) new: we’ll have two USB hard drives connected and do identical copies in parallel in case one of the drives develops problems.
  3. One of our team members, Geoff Dalgas, lives a mile from the PEAK data center. He drops by and physically swaps out the USB hard drive every few weeks. He holds four 500 GB USB drives at his home, while the other two are at the data center. They continually get cycled back and forth over time.
  4. new: Fog Creek will FTP in and transfer the most current database backups to their hosting facility every week, during low traffic periods on Saturday.
  5. We do Creative Commons data dumps of all sites (Stack Overflow, Server Fault, Super User) every month. This is a subset of the data, but a sizable one, and it’s available on Legal Torrents. These data dumps are physically hosted on and seeded by Legal Torrents.
  6. Our Subversion source control repository is copied to the NAS every day and also gets copied to the USB external drive, etc, through the same script.
  7. We also run a few VM images — for Linux helper services, mostly — and they are backed up through the same process. As our other host learned the hard way, backing up live VMs can be tricky, so this is definitely something you need to be careful about.
  8. We regularly download the latest database backups and restore them locally (we develop against live data all the time), so we know our backups work.
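Step 2 of the list above, copying the latest backups to the USB drive and deleting the oldest files to make room, could look roughly like the sketch below. The original is described as a C# script whose code is not shown, so this Python version is an illustration under assumptions: the function names, the flat directory layout, and the byte budget standing in for the drive's capacity are all hypothetical. It also checks the copied file's size, a cheap guard against the silent truncation Joel mentions.

```python
import shutil
from pathlib import Path


def make_room(usb_dir: Path, limit: int) -> None:
    """Delete the oldest files in usb_dir until at most limit bytes remain."""
    existing = sorted(
        (p for p in usb_dir.iterdir() if p.is_file()),
        key=lambda p: p.stat().st_mtime,
    )
    used = sum(p.stat().st_size for p in existing)
    for old in existing:
        if used <= limit:
            break
        used -= old.stat().st_size
        old.unlink()


def copy_latest_backups(nas_dir: Path, usb_dir: Path, budget_bytes: int) -> list[str]:
    """Copy backups newer than anything already on the USB drive.

    Files are copied oldest first, and the oldest copies on the drive are
    evicted so the total stays within budget_bytes. Returns the names copied.
    """
    usb_dir.mkdir(parents=True, exist_ok=True)
    newest_on_usb = max(
        (p.stat().st_mtime for p in usb_dir.iterdir() if p.is_file()),
        default=float("-inf"),
    )
    pending = sorted(
        (
            p
            for p in nas_dir.iterdir()
            if p.is_file()
            and not (usb_dir / p.name).exists()
            and p.stat().st_mtime > newest_on_usb
        ),
        key=lambda p: p.stat().st_mtime,
    )
    copied = []
    for src in pending:
        size = src.stat().st_size
        if size > budget_bytes:
            raise OSError(f"{src.name} is larger than the whole budget")
        make_room(usb_dir, budget_bytes - size)
        dst = usb_dir / src.name
        shutil.copy2(src, dst)  # copy2 keeps the modification time
        if dst.stat().st_size != size:
            raise OSError(f"{dst} was truncated during the copy")
        copied.append(src.name)
    return copied
```

Because `shutil.copy2` preserves modification times, the age of a file on the drive reflects the age of the backup rather than the time of the copy, which is what makes "delete the oldest" meaningful here.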

This strategy sounds much better than the original setup. The weak link in this case is “Geoff”. What if Geoff does not show up to swap the drives? Or drops one? Or a thief steals them from his home?

Jeff Atwood is not really to blame. It could happen to anyone, even with better backup strategies. The point is: what are the lessons to be learned here?  
