Removing Binary Files from git using Roberto Tyley’s BFG Repo-Cleaner
Source control systems are not black boxes that just work. If you don’t properly maintain your repositories, they will eventually become a major burden to the developers who are trying to use them. Tools can help with this, such as Roberto Tyley’s BFG Repo-Cleaner for Git. This tool is used to remove large binary files that were accidentally checked into a Git branch.
InfoQ: For our readers who are not familiar with git, why are large binary files a problem?
Roberto: In Git, pretty much everyone you share your code with gets the whole history of your project when they download it. Normally this is no problem - actual source code takes up scant space, and having full history makes possible distributed cooperation, fast diffs, and more - but project history only gets bigger. If you start committing big files - say 100MB+ - your repository grows rapidly, and when people clone it, they've got a lot more data to download. Eventually you'll realize you've made a mistake, but even if you decide to make a commit deleting those files, they're still there, in your Git history, and the repo will take just as long to download.
As well as large files, accidentally committed passwords and other private data have the same problem - they're still there in your history after you delete them. You made a mistake, and Git won't let you forget. What do you do now?
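A toy model can make this concrete. The sketch below is not how Git stores data internally; it only assumes, as described above, that a clone carries every version of every file ever committed. The file names, contents and sizes are invented for illustration.

```python
# Toy history: each commit is a snapshot mapping path -> file content.
# All paths and contents here are made up for illustration.
SOURCE = b"print('hi')\n"
BIG_FILE = b"\0" * 1_000_000  # small stand-in for a 100MB+ binary

history = [
    {"app.py": SOURCE},                        # initial commit
    {"app.py": SOURCE, "demo.mp4": BIG_FILE},  # big file committed by mistake
    {"app.py": SOURCE},                        # later commit deletes demo.mp4
]


def checkout_size(snapshot):
    """Bytes needed for the files of a single commit."""
    return sum(len(content) for content in snapshot.values())


def clone_size(commits):
    """Bytes needed for every distinct file version in the whole history."""
    unique = {content for snap in commits for content in snap.values()}
    return sum(len(content) for content in unique)


print(checkout_size(history[-1]))  # 12
print(clone_size(history))         # 1000012
```

The latest commit is tiny, but anyone who clones the history still pays for the deleted file.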
InfoQ: The basic way to remove a binary or other improper file from git is to use the git-filter-branch command. Can you explain what that actually does and why you felt that it wasn’t good enough on its own?
Roberto: Yep, the git-filter-branch command is a powerful tool - a magnificent Swiss-army chainsaw - for scripting the rewriting of Git history, right from the very first commit. You give it a bash script, and it runs that script on every single commit from the start of the project to the end, one after the other, rebuilding history as it goes. Your script can do whatever you like - delete a file, remove unwanted text (e.g. passwords), or replace every .jpg with a photo of a baked potato. Once the git-filter-branch run has finished, your Git history will probably look superficially the same - the same dates, commit messages, etc. - but this new history is not the same as it was before - you've fixed it.
The git-filter-branch tool became part of the standard Git suite back in 2007, two years after the initial release of Git, so it's long been the canonical way to solve tricky problems in Git history. There are two problems with git-filter-branch:
1. It's difficult to use
2. It's slow.
The difficult-to-use thing comes from a few different places - git-filter-branch is not opinionated software; it doesn't nudge you towards solutions. It's flexible enough to do literally anything, but you've got to know what you're doing. The simplest thing you might want to do is just delete a file from every commit - this is how you do it:
git filter-branch --index-filter 'git rm --cached --ignore-unmatch big.mp4' --tag-name-filter cat -- --all
...which is not too friendly. But what if you want to do something more complicated - like delete a password wherever it occurs? Or just delete all your ridiculously large files? There's not even any built-in support for finding those things with git-filter-branch - you've got to write a clever bash script before you can even tell git-filter-branch what to delete. There are plenty of filter-branch scripts out there - I wrote a few before I came to write the BFG. At about the same time, some of my colleagues at the Guardian were trying to delete passwords from one of our largest codebases, and after several days they came up with a filter-branch script that took literally hours to run - they had to carefully schedule it to run overnight to minimize the inconvenience to other developers on the team, and ran laborious practice runs beforehand to make sure they'd got all the bugs straightened out.
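To give a feel for the kind of helper that has to exist before git-filter-branch can be told what to delete, here is a minimal sketch of a large-file finder. It is written in Python rather than bash for readability and is not one of the scripts mentioned above. It assumes the `git` executable is on the PATH; the function names and the size threshold are hypothetical.

```python
import subprocess


def parse_batch_check(output, min_bytes):
    """Pick blobs of at least min_bytes out of `git cat-file --batch-check`
    output formatted as '<type> <id> <size> <path>'. Largest first."""
    big = []
    for line in output.splitlines():
        parts = line.split(" ", 3)
        if len(parts) < 3 or parts[0] != "blob":
            continue  # skip commits, trees and malformed lines
        size = int(parts[2])
        path = parts[3] if len(parts) > 3 else ""
        if size >= min_bytes:
            big.append((size, path))
    return sorted(big, reverse=True)


def find_big_blobs(repo_dir, min_bytes):
    """List (size, path) for large blobs reachable from any ref in repo_dir."""
    # Every object in history, each line being '<id> <path>'.
    objects = subprocess.run(
        ["git", "rev-list", "--objects", "--all"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    ).stdout
    # Ask Git for the type and size of each object, keeping the path.
    sizes = subprocess.run(
        ["git", "cat-file",
         "--batch-check=%(objecttype) %(objectname) %(objectsize) %(rest)"],
        cwd=repo_dir, input=objects, check=True, capture_output=True, text=True,
    ).stdout
    return parse_batch_check(sizes, min_bytes)
```

Calling `find_big_blobs(".", 100_000_000)` inside a repository would list candidate files to feed into a command like the one shown earlier. Every path it reports still needs its own removal rule.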
This seemed like an awful lot of pain to be going through. Although git-filter-branch was an advanced-level tool, Git was getting more and more popular (a trend that continues to this day), and the number of people who would end up encountering it would only increase - an inexorably increasing amount of confusion and pain. It seemed a problem worth fixing.
I already had some expertise with Git, having written Agit, a Git client for Android, and understood the basics of the Git data-storage structure fairly well. As I thought about why git-filter-branch was so slow, I began to realize that if you were willing to tweak the problem requirements just slightly, you could create a seriously faster solution.
InfoQ: According to your project site, BFG Repo-Cleaner is “10 - 720x faster” than git-filter-branch. How is that possible?
Roberto: Actually, it's more like 10-1000x, and that's when doing the simplest tasks (like deleting a known file) that favor git-filter-branch. As an example of an independent timing, Elliot Glaysher (a Google Software Engineer on Chrome) benchmarked the BFG when looking at migrating the Chromium codebase from SVN to Git - the BFG took just 10 minutes on a task that took git-filter-branch 3 days - a speedup of around 430x.
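The Chromium figure can be checked with quick arithmetic, taking "3 days" and "10 minutes" at face value (the exact timings are not given here):

```python
MINUTES_PER_DAY = 24 * 60

filter_branch_minutes = 3 * MINUTES_PER_DAY  # "3 days" of git-filter-branch
bfg_minutes = 10                             # "10 minutes" of the BFG

speedup = filter_branch_minutes / bfg_minutes
print(speedup)  # 432.0
```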
You can see a video speed-comparison of git-filter-branch (running on a quad-core Mac) and the BFG (running on a Raspberry Pi) here: https://www.youtube.com/watch?v=Ir4IHzPhJuI
It's possible because the BFG is not trying to be git-filter-branch.
It's possible because we're considering the popular use-case, and the way data is stored in Git itself. The use-case is getting rid of unwanted data - at least, I'm going to say that's the most popular use-case for git-filter-branch, and it's the one I'm interested in. For this particular problem, there's one fact about Git that completely changes the approach you want to take: in Git all data for files and folders is stored precisely once and given a unique id - its git-id. If you have a hundred commits that all have the same version of a file, that file is stored only once. If you have two versions of the file, and just switch back and forth between them, then the file is stored just twice, once for each version. So if each unique file is stored just once, then cleaning the same file one hundred times over, just because it appears in 100 commits, is crazy. The BFG cleans every object in the Git repo one time only - remembering the cleaned git-id for every dirty git-id it encounters, and never calculating it again. By constraining ourselves - by removing some of the freedom to do anything that git-filter-branch gives you - we get a massive speed increase.
There are also some pretty good speedups that come from the lack of process switching (everything is handled in the JVM, without the need to switch between C and bash) and the ability to fully utilize your multi-core machine. That's something I'm really happy about: the BFG will make all of your processor cores glow hot, each core cleaning any file or folder it can lay its hands on for pretty much all of the duration of the run - whereas unfortunately git-filter-branch is inherently limited to a sequential approach, cleaning one commit after the other - unable to start the next before finishing with the former.
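The idea of remembering the cleaned git-id for every dirty git-id can be sketched in a few lines. This is an illustration of the memoization described above, not the BFG's actual code (which is written in Scala on top of JGit). It assumes Git's default SHA-1 object format for blob ids; the `MemoCleaner` class, the `scrub` function and the sample password are hypothetical.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Id Git assigns to file content: SHA-1 of 'blob <size>\\0' + content."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


class MemoCleaner:
    """Cleans each distinct blob once and remembers dirty id -> cleaned id."""

    def __init__(self, clean_fn):
        self.clean_fn = clean_fn
        self.store = {}    # git-id -> content (each content stored once)
        self.cleaned = {}  # dirty git-id -> cleaned git-id
        self.calls = 0     # how many times clean_fn actually ran

    def add(self, content: bytes) -> str:
        oid = git_blob_id(content)
        self.store[oid] = content
        return oid

    def clean(self, oid: str) -> str:
        if oid not in self.cleaned:
            self.calls += 1
            self.cleaned[oid] = self.add(self.clean_fn(self.store[oid]))
        return self.cleaned[oid]


def scrub(content: bytes) -> bytes:
    # Hypothetical cleaning rule: blank out one known password.
    return content.replace(b"hunter2", b"***REMOVED***")


cleaner = MemoCleaner(scrub)
dirty_id = cleaner.add(b"password = hunter2\n")

# The same file version appears in 100 commits...
commits = [dirty_id] * 100
cleaned_ids = [cleaner.clean(oid) for oid in commits]

# ...but the cleaning function only ran once.
print(cleaner.calls)  # 1
```

A commit-by-commit rewrite would run the cleaning step 100 times for the same content. Here the second and later lookups are dictionary hits.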
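Because each unique object can be cleaned independently of the others, the work can be handed to a pool of workers instead of being processed commit by commit. The sketch below shows only that structure. It is not the BFG's implementation, and CPython threads will not give the multi-core speedup that JVM threads do for CPU-bound work. The blob ids, contents and `scrub` rule are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def scrub(content: bytes) -> bytes:
    # Hypothetical cleaning rule: blank out one known password.
    return content.replace(b"hunter2", b"***REMOVED***")


def clean_unique_blobs(blobs, workers=4):
    """Clean every unique blob once, in parallel.

    blobs maps an object id to its content; the result maps the same
    ids to cleaned content. No blob depends on any other blob, so the
    order in which workers pick them up does not matter.
    """
    ids = list(blobs)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        cleaned = list(pool.map(lambda oid: scrub(blobs[oid]), ids))
    return dict(zip(ids, cleaned))


# Made-up ids standing in for real git-ids.
blobs = {
    "id-1": b"password = hunter2\n",
    "id-2": b"print('hi')\n",
    "id-3": b"token = hunter2 # old\n",
}
result = clean_unique_blobs(blobs)
```

A commit-by-commit rewrite cannot be split up this way, since each rewritten commit needs the id of its rewritten parent before it can be created.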
The immediacy of running the BFG is - obviously - a great virtue. It makes it much easier for people to experiment and test the rewrite with a safe copy of their repo before going ahead with the Big Event: cleaning the repo for real, pushing up the result, and getting all of your fellow developers to delete old copies of the repo and pull down the fresh new history - a significant communication task for any big development team.
InfoQ: Why did you choose to write BFG Repo-Cleaner in Scala instead of straight Java?
Roberto: The Guardian's been an enthusiastic adopter of Scala, and given the choice of either Java or Scala for almost any task, I'd pick Scala, every time. Scala was actually an inherently good choice for this particular project: being JVM-based, I could leverage the excellent high-performance JGit library. Being a functional programming language, it was well suited for both handling the immutable data structures of Git, and taking advantage of the embarrassingly parallel workload. It's also just a pleasure to code in. I talk about the advantages of Scala in a bit more detail in my talk for ScalaDays: Git Going Faster... with Scala.
InfoQ: Have you considered offering a GUI version of BFG Repo-Cleaner, alone or as a plugin for another git tool?
Roberto: Although I think the BFG works pretty well as a command-line tool, I've made some inroads into interesting visualizations. Back in June I created an experimental 3D visualization of git-filter-branch and the BFG acting on Git history for Hack The Space at the Tate Modern - as well as being quite beautiful, it also shows very clearly why the speed difference is so great! The potential is also there for explaining more clearly what the BFG is going to do to your repo - people are understandably cautious about letting the BFG take care of culling old files (even though it never touches your current ones, on the design assumption that although the user has made mistakes in the past, they're no longer so unwise, and whatever they currently have is stuff they actually do want). Sometimes in requirements there's a tricky gap between what a user needs and what they want, and reconciling those two is something that stronger visual representations could help with.
Regarding plugins, I'd really like to make it easier for other people to script custom actions with the BFG - getting close to the flexibility of git-filter-branch, but retaining the speed. Christian Hoffmeister recently did some good work incorporating the BFG into a custom application (git-timeshift), but it still required more labor than I'd like.
As it stands, the command-line version of the BFG has been pretty well adopted - there are lots of great tweets from users talking about using it, and the referrer logs for the BFG's documentation website show a wide range of users, from major investment banks, to research laboratories, mobile phone manufacturers, even terrifying companies that write avionics software for military aircraft. Making a wild guesstimate based on the number of downloads, I would surmise that by this point, the BFG has saved roughly 30 developer-years since it was first released.
It's open-source, totally free, and I hope your readers enjoy using it next time they need to clean up Git history!
About the Interviewee
Roberto Tyley is a developer at The Guardian, Technical Lead on the Guardian Membership project, creator of the BFG, gu-who, and Agit, and a contributor to many open-source projects. He's worked at GitHub, 'invented' animated diffs, and loves explaining things.