Facebook makes Mercurial faster than Git

by Alex Blewitt on Jan 09, 2014

Earlier this week, Facebook posted "Scaling Mercurial at Facebook" on their engineering blog, covering how they have modified Mercurial to scale with their repository.

Facebook stores all of its code in a single repository, which two years ago was based on Subversion. Instead of having separate repositories for independent projects (and sharing using a binary repository), the entire codebase is in a single large source repository.

Unfortunately for Facebook, neither Git nor Mercurial was designed to support a single huge repository of all projects. Since a distributed version control system stores all history in every clone, a repository holding all of a company's projects, along with its history, can be prohibitively large. In comparison, centralised version control systems like CVS and Subversion are able to handle single repositories for multiple projects, because clients can check out just the latest version or just a subset of the repository.

Although the Facebook team investigated modifying Git to support their needs, their conclusion was that Mercurial had more appropriate programming APIs that could be hooked into in order to support their requirements. (Git has a well-defined object structure which is interpreted by tools, as opposed to Mercurial which has low-layer APIs that are used by the codebase.) Many changes have been contributed back to the upstream Mercurial project, including new graph algorithms and rewriting code in C.

Two specific changes have enabled Facebook to use Mercurial at their repository size: modifying the status calculation to check only the files reported as changed rather than examining the content of every file (by hooking into the operating system's list of file changes), and modifying the checkout to give a lightweight or shallow clone without needing the full history state.

Normally, a distributed version control system will generate hashes based on the content of data, rather than timestamps. As a result, computing whether a repository has changes often involves scanning through every file and calculating a hash for each to determine whether the file's content is different. By limiting the set of files to check to the ones that the operating system has reported as having changed since the last scan, the cost becomes proportional to the number of files reported as changed, instead of the number of files in the current workspace. Git tries to reduce this cost by running lstat to obtain file-specific information, but it still has to walk through every file in the repository in order to determine whether any have changed.
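The difference between the two approaches can be sketched as follows. This is a minimal illustration in Python following the description above, not the hgwatchman implementation: the working tree is modelled as an in-memory dictionary, the set of paths reported by the operating system is passed in by hand, and the function names are hypothetical. Deleted files are ignored for brevity.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Hash file content, as a content-addressed VCS would."""
    return hashlib.sha1(data).hexdigest()


def full_scan_status(worktree, recorded):
    """Hash every file in the working tree (hypothetical helper).

    Returns (changed paths, number of files hashed); the work done
    grows with the total number of files.
    """
    changed = [
        path for path, data in worktree.items()
        if content_hash(data) != recorded.get(path)
    ]
    return sorted(changed), len(worktree)


def watched_status(worktree, recorded, reported_changed):
    """Hash only the paths the OS reported as touched (hypothetical helper).

    `reported_changed` stands in for what a file-watching service
    would supply; the work done grows with the size of that set.
    """
    candidates = [p for p in reported_changed if p in worktree]
    changed = [
        path for path in candidates
        if content_hash(worktree[path]) != recorded.get(path)
    ]
    return sorted(changed), len(candidates)


# Example data: three files, one edited, one merely touched.
worktree = {"a.py": b"print(1)\n", "b.py": b"print(2)\n", "c.py": b"print(3)\n"}
recorded = {path: content_hash(data) for path, data in worktree.items()}
worktree["b.py"] = b"print(22)\n"   # real content change
reported = {"b.py", "c.py"}         # c.py was touched but not modified

print(full_scan_status(worktree, recorded))
print(watched_status(worktree, recorded, reported))
```

Both calls report the same changed file, but the second only examines the two paths in `reported`, however many files the working tree contains.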

The other problem Facebook tried to solve is minimising the amount of data needed for either a pull or clone operation. By storing all projects in the same repository, the size of the repository is proportional to the entire history of everything, which ultimately leads to scaling problems. Since the repository could not be partitioned in an efficient way, the solution was to download only the latest version of the files, along with the commit log.

This allows developers to quickly get the current set of files (in much the same way as Subversion and CVS do) as well as iterate through the log for the set of commits that led to that point. However, if older revisions of the repository (or older revisions of files) are required, the local clone will not have that information. (Git provides a limited option with a 'shallow clone' which can get a single revision from a repository, but without the commit history.)

By extending the Mercurial APIs, file contents that are missing locally can 'fault' when they are requested and be downloaded from the remote server at that point, rather than during the initial checkout. Of course this means that if the central server goes down or is unavailable, developers will not be able to check out older versions of code; but since the commit log is available, new commits can be created and pushed to the server. (By comparison, Git shallow clones have the same content, but at the time of writing they were restricted in how they could push and fetch, which in practice limited them largely to read-only use.)

These improvements, along with a memcached layer, have sped up Facebook's use of Mercurial to outperform Git for both pull/clone and working directory status calculation by a factor of five. Both changes are available via Facebook's Mercurial repositories as remotefilelog and hgwatchman. Whilst this setup and approach won't be suitable for every Mercurial user, it gives the DVCS a boost in an open-source world increasingly dominated by Git.


They seem to have solved the wrong problem by Ryan Gardner

What could possibly require all of their stuff to be in the same repo?

Re: They seem to have solved the wrong problem by Szymon Dembek

Facebook philosophy ;)

Re: They seem to have solved the wrong problem by Andrew Arnott

For a web services company, having everything in one repo makes a lot of sense. Since everything that is online is by definition the same version, being able to simulate that in your VCS makes for easier managing, IMO.

Re: They seem to have solved the wrong problem by Foo Bar

Most probably technical debt

Imagine rebasing a whole feature... by Steven Spencer

If you've ever had to rebase a lot of files during a branch merge (very common when you have teams working and committing a build back to a main branch) then you would know that saving a few seconds per merge will add up very quickly.

Re: They seem to have solved the wrong problem by Hunt Pogroth

With Subversion, it could be preferable, because then you only have one Subversion server config to deal with.

But, when you move to git, there's no benefit. You should use separate repositories. Shallow clones are not the way to solve the problem. One of the best things about git is that you always have all of the history.

I'm sure they justify this on the basis of cross-project dependencies, but this is not the way to go.

Editorial and all content copyright © 2006-2014 C4Media Inc.