
Fatigue, Spam, and Lack of Backups Take Down GitLab.com

by David Iffland on Feb 03, 2017. Estimated reading time: 3 minutes

The production data loss and hours of downtime at GitLab is an unfortunate and fascinating story about how little things, from spam to engineer fatigue, can coalesce into something more catastrophic.

Anecdotes started to trickle in on January 31st, but a single tweet confirmed that something was amiss at GitLab.com:

We accidentally deleted production data and might have to restore from backup. Google Doc with live notes https://t.co/EVRbHzYlk8

— GitLab.com Status (@gitlabstatus) February 1, 2017

"Deleted production data" are not words any IT worker wants to hear, but it happens and that's why backups are so crucial to the operation of any production service. Unfortunately, as the team toiled through the night to restore service, the bad news got worse.

In a post outlining what happened, GitLab explained that the trouble started when spammers caused replication issues by "hammering the database by creating snippets, making it unstable". Three hours later, the database couldn't keep up anymore and the site crashed.

Working late into the evening hours, an engineer attempting to resolve the problem made an unfortunate mistake and accidentally deleted the data on the primary cluster machine:

At 2017/01/31 11pm-ish UTC, team-member-1 thinks that perhaps pg_basebackup is refusing to work due to the PostgreSQL data directory being present (despite being empty), decides to remove the directory. After a second or two he notices he ran it on db1.cluster.gitlab.com, instead of db2.cluster.gitlab.com.

At 2017/01/31 11:27pm UTC, team-member-1 terminates the removal, but it’s too late. Of around 300 GB only about 4.5 GB is left.
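The root of the mistake was a destructive command executed on the wrong host. One generic safeguard against this class of error is a wrapper that refuses to proceed unless the current hostname matches the one the operator intended. The sketch below is purely illustrative and is not GitLab's actual tooling; the host names echo those in the incident notes.

```shell
# confirm_host: succeed only if we are on the host we intend to be on.
# Usage: confirm_host EXPECTED_HOSTNAME [ACTUAL_HOSTNAME]
# (the second argument defaults to the machine's own hostname and
# exists mainly to make the guard testable).
confirm_host() {
    expected="$1"
    actual="${2:-$(hostname)}"
    if [ "$actual" != "$expected" ]; then
        echo "refusing: on '$actual', expected '$expected'" >&2
        return 1
    fi
}

# Example (hypothetical path): only wipe the data directory if the
# guard passes, so a shell on db1 cannot destroy db2's cleanup target.
# confirm_host db2.cluster.gitlab.com && rm -rf /var/opt/gitlab/postgresql/data
```

A guard like this trades a few seconds of friction for protection against exactly the "right command, wrong terminal" failure mode described above.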

As the team worked to discover what backups were available to restore, each option ended in a dead end.

  • LVM (Logical Volume Management) snapshots only run once every 24 hours by default
  • Regular backups only occurred once every 24 hours and they weren't working
  • Disk snapshots weren't running on the Azure machines running the databases
  • Backups to S3 in AWS were empty
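The last item is the most insidious: the backup job appeared to run, but produced empty artifacts. A common mitigation is to verify each backup after it is written, treating a missing or suspiciously small file as a hard failure. The following is a minimal sketch under assumed paths and thresholds, not GitLab's configuration:

```shell
# verify_backup: fail loudly if a backup artifact is missing or
# smaller than a sanity threshold (default 1024 bytes, illustrative).
verify_backup() {
    file="$1"
    min_bytes="${2:-1024}"
    if [ ! -f "$file" ]; then
        echo "backup missing: $file" >&2
        return 1
    fi
    size=$(wc -c < "$file")
    if [ "$size" -lt "$min_bytes" ]; then
        echo "backup suspiciously small ($size bytes): $file" >&2
        return 1
    fi
}

# Example: run right after the backup job, before pruning older copies.
# verify_backup /backups/db-$(date +%F).tar.gz 1048576 || alert_oncall
```

An even stronger check is a periodic test restore into a scratch database, since a non-empty archive can still be unrestorable; the size check above is only the cheapest first line of defense.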

By chance, an engineer had made an LVM snapshot six hours prior to the deletion. Without this serendipity, even more data would have been gone forever.

Throughout the entire event, the GitLab team was completely transparent, posting live updates to a Google doc so the community could follow along. In addition, they had a live video stream of the engineers working through the restore process.

Roughly 18 hours after the database went down, GitLab.com was back online:

https://t.co/r11UmmDLDE should be available to the public again.

— GitLab.com Status (@gitlabstatus) February 1, 2017

The community was both supportive and critical of the team. Some posted messages of condolence, praising GitLab for its transparency. Hacker News user js2 said the feeling was familiar: "If you're a sys admin long enough, it will eventually happen to you that you'll execute a destructive command on the wrong machine." Others were less charitable.

Despite the loss for GitLab, the community used the incident as a reminder to test backups, said David Haney, Engineering Manager at Stack Overflow:

GitLab got this part right, and are being heralded as a great example and learning experience in the industry instead of spited for mysterious downtimes and no communication. I promise you that this week, many disaster recovery people are doing extra backup tests that they wouldn’t have thought to do otherwise – all as a direct result of the GitLab incident.

Others teased that February 1st should become Check Your Backups Day.

GitLab started in 2011 as an open source alternative to the dominant player, GitHub. It has a hosted version at GitLab.com as well as self-hosted community and enterprise editions. Only the GitLab.com hosted service was impacted by the failure.


Going to check all my backups on Monday by Rory McKenna

Great, now I won't sleep well all weekend until I check my backups. This is a good article to remind everyone that checking that backups are actually working is a great idea. I like the idea of Feb 1st as "Check Your Backups Day". Think I will put it on the calendar as a recurring event.
