Facebook makes Mercurial faster than Git
Earlier this week, Facebook posted "Scaling Mercurial at Facebook" on their engineering blog, covering how they have modified Mercurial to scale with their repository.
Facebook stores all of its code in a single repository, which two years ago was based on Subversion. Instead of having separate repositories for independent projects (and sharing between them through a binary repository), the entire codebase is in a single large source repository.
Unfortunately for Facebook, neither Git nor Mercurial was designed to support a single huge repository containing every project. Because a distributed version control system stores the full history in every clone, putting all of a company's projects into one repository can make the repository and its history prohibitively large. In comparison, centralised version control systems such as CVS and Subversion can handle a single repository for multiple projects, because clients can check out just the latest version or just a subset of the repository.
Although the Facebook team investigated modifying Git to support their needs, they concluded that Mercurial had more appropriate programming APIs to hook into. (Git has a well-defined object structure which is interpreted by tools, whereas Mercurial has low-level APIs that are used throughout its codebase.) Many changes have been contributed back to the upstream Mercurial project, including new graph algorithms and code rewritten in C.
Two specific changes have enabled Facebook to use Mercurial at their repository size: modifying the status check to look only at files that the operating system reports as changed, rather than comparing the content of every file, and modifying the checkout to give a lightweight or shallow clone that does not need the full history.
Normally, a distributed version control system generates hashes based on the content of data rather than on timestamps. As a result, working out whether a repository has changes often involves scanning every file and calculating a hash for each to determine whether its content is different. If the set of files to check is limited to those the operating system has reported as changed since the last scan, the cost of a status check is proportional to the number of changed files instead of the number of files in the workspace. Git tries to reduce the cost by running lstat to read per-file information, but it still has to walk through every file in the repository to determine whether any have changed. By asking the operating system for the list of changes, the status check only needs to examine the files on that list.
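The difference between the two approaches can be shown with a small Python sketch. It follows the simplified description above rather than Mercurial's or hgwatchman's actual code: the function names are hypothetical, the set of "reported" paths is passed in by hand where a real file-watching service would supply it, and added or deleted files are ignored.

```python
import hashlib
import os
import tempfile


def file_hash(path):
    """Return the SHA-1 hex digest of a file's content."""
    with open(path, "rb") as fh:
        return hashlib.sha1(fh.read()).hexdigest()


def snapshot(root, tracked):
    """Record the content hash of every tracked file (the 'last known' state)."""
    return {rel: file_hash(os.path.join(root, rel)) for rel in tracked}


def full_scan_status(root, recorded):
    """Check every tracked file: cost grows with the size of the workspace."""
    return sorted(
        rel for rel, digest in recorded.items()
        if file_hash(os.path.join(root, rel)) != digest
    )


def watched_status(root, recorded, reported_changed):
    """Check only the paths reported as changed: cost grows with the change set.

    `reported_changed` stands in for the list a file-watching service would
    provide; here the caller supplies it (an assumption of this sketch).
    """
    return sorted(
        rel for rel in reported_changed
        if rel in recorded and file_hash(os.path.join(root, rel)) != recorded[rel]
    )


def make_demo_repo():
    """Create a throwaway directory with three tracked files."""
    root = tempfile.mkdtemp()
    tracked = ["a.txt", "b.txt", "c.txt"]
    for rel in tracked:
        with open(os.path.join(root, rel), "w") as fh:
            fh.write("version 1 of " + rel)
    return root, snapshot(root, tracked)


def edit_file(root, rel, text):
    """Overwrite one file to simulate a developer's edit."""
    with open(os.path.join(root, rel), "w") as fh:
        fh.write(text)


if __name__ == "__main__":
    root, recorded = make_demo_repo()
    edit_file(root, "b.txt", "edited")
    print(full_scan_status(root, recorded))            # hashes all three files
    print(watched_status(root, recorded, {"b.txt"}))   # hashes one file
```

Both functions report the same modified file, but the second only reads the paths it was told about, so a report of a file that was touched but not actually altered is filtered out by the content check.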
The other problem Facebook tried to solve is minimising the amount of data needed for a pull or clone operation. With all projects stored in the same repository, the size of the repository is proportional to the entire history of everything, which ultimately leads to scaling problems. Since the repository could not be partitioned efficiently, the solution was to download only the latest version of the files, along with the commit log.
This allows developers to get the current set of files quickly (in much the same way as Subversion and CVS) and to iterate through the log of commits that led to that point. However, if older revisions of the repository (or older revisions of files) are required, the local clone will not have that information. (Git provides a limited option with a 'shallow clone' which can get a single revision from a repository, but without the commit history.)
By extending the Mercurial APIs, content that is missing locally can 'fault': it is left out of the initial checkout and downloaded from the remote server the first time it is requested. Of course, this means that if the central server goes down or is unavailable, developers will not be able to check out older versions of code; but since the commit log is available, new commits can still be created and pushed to the server. (By comparison, Git shallow clones have the same content but differing commits, which means they can only be used for read-only purposes.)
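The fault-on-demand behaviour can be sketched in a few lines of Python. This is an illustration of the idea only, not remotefilelog's implementation: the class names, the `(path, revision)` keys and the in-memory "server" are all assumptions made for the example.

```python
class RemoteStore:
    """Stand-in for the central server that holds every file revision."""

    def __init__(self, blobs):
        self._blobs = dict(blobs)
        self.available = True
        self.requests = 0  # counts round trips, to show when a fault happens

    def fetch(self, key):
        if not self.available:
            raise ConnectionError("central server is unavailable")
        self.requests += 1
        return self._blobs[key]


class LazyClone:
    """Local clone holding the commit log and the latest file contents only."""

    def __init__(self, commit_log, tip_contents, remote):
        self.commit_log = list(commit_log)   # full log is always local
        self._local = dict(tip_contents)     # (path, revision) -> bytes
        self._remote = remote

    def read(self, path, revision):
        """Return file content, downloading it on first access if missing."""
        key = (path, revision)
        if key not in self._local:
            # The 'fault': content was not part of the initial checkout.
            self._local[key] = self._remote.fetch(key)
        return self._local[key]


if __name__ == "__main__":
    server = RemoteStore({
        ("README", 1): b"first draft",
        ("README", 2): b"second draft",
    })
    clone = LazyClone(
        commit_log=[1, 2],
        tip_contents={("README", 2): b"second draft"},
        remote=server,
    )
    print(clone.read("README", 2), server.requests)  # served locally
    print(clone.read("README", 1), server.requests)  # fetched on demand
    print(clone.read("README", 1), server.requests)  # now held locally
```

The sketch also shows the trade-off described above: once `server.available` is set to `False`, reading a revision that has never been fetched raises an error, while the latest revision and the commit log remain usable.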
These improvements, along with a memcached layer, have sped up Facebook's use of Mercurial to the point where it outperforms Git by a factor of five for both pull/clone and working directory status calculation. Both changes are available from Facebook's Mercurial repositories as remotefilelog and hgwatchman. Whilst this setup and approach won't be suitable for every Mercurial user, it gives the DVCS a boost in an open-source world increasingly dominated by Git.
They seem to have solved the wrong problem
Re: They seem to have solved the wrong problem
Imagine rebasing a whole feature...
Re: They seem to have solved the wrong problem
But, when you move to git, there's no benefit. You should use separate repositories. Shallow clones are not the way to solve the problem. One of the best things about git is that you always have all of the history.
I'm sure they justify this on the basis of cross-project dependencies, but this is not the way to go.
Srini Penchikala Aug 21, 2014