How Microsoft Solved Git’s Problem with Large Repositories

Feb 08, 2017 7 min read

While Git is often considered to be the best version control software in wide adoption, it is far from perfect. Some problems can be solved with third-party tools, but the way entire repositories are copied onto the developer’s machine can be a deal breaker. Microsoft discovered this when they tried to migrate a 300 GB repository from their internal system to Git. The end result was the creation of the Git Virtual File System (GVFS).

The story begins around 2000, when Microsoft was primarily using an internal system known as “Source Depot”, which is a fork of Perforce. Over time, many of the teams migrated to TFVC, the original version control system in Team Foundation Server (TFS), but the largest teams couldn’t justify the effort to move off of Source Depot. Likewise, most teams used parts of TFS, but which parts varied a lot from team to team, with lots of third-party and internally developed tools mixed in.

In an effort to simplify this complicated environment, Microsoft decided to standardize most teams around Visual Studio Team Services (a.k.a. TFS in the cloud) for work planning, automated builds, and source control. For source control, that meant VSTS with Git.

Why Git? According to Microsoft employee and Reddit user jeremyepling, there are three reasons:

Git and public repos on GitHub are the defacto standard for OSS development. Microsoft does a lot of OSS development and we want our DevOps tools, like TFS and Team Services, to work great with those workflows.

We want a single version control system for all of Microsoft. Standardization makes it easy for people to move between projects and build deep expertise. Since OSS is tied to Git and we do a lot of OSS development, that made Git the immediate front runner.

We want to acknowledge and support where the community and our DevOps customers are going. Git is the clear front-runner for modern version control systems.

But as Brian Harry explains, that has some problems:

There were many arguments against choosing Git but the most concrete one was scale. There aren’t many companies with code bases the size of some of ours. Windows and Office, in particular (but there are others), are massive. Thousands of engineers, millions of files, thousands of build machines constantly building it- quite honestly, it’s mind boggling. To be clear, when I refer to Window in this post, I’m actually painting a very broad brush – it’s Windows for PC, Mobile, Server, HoloLens, Xbox, IOT, and more. And Git is a distributed version control system (DVCS). It copies the entire repo and all its history to your local machine. Doing that with Windows is laughable (and we got laughed at plenty). TFVC and Source Depot had both been carefully optimized for huge code bases and teams. Git had *never* been applied to a problem like this (or probably even within an order of magnitude of this) and many asserted it would *never* work.

Reddit user Ruud-v-A offers some context:

The Linux kernel repository is 1.2 GiB, with almost 12 years of history, and 57k files. The initial 2005 commit notes that the full imported history would be 3.2 GiB. Extrapolating 4.4 GiB for 57k files to 3.5M files gives 270 GiB indeed.

The Chromium repository (which includes the Webkit history that goes back to 2001) is 11 GiB in size, and has 246k files. Extrapolating that to 20 years and 3.5M files yields 196 GiB.
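The extrapolation in the quote above is plain linear scaling, and it can be checked with a few lines of Python. This is only a sketch of that arithmetic: the helper name `scale_repo_size` is made up for illustration, and the 16 years of Chromium history is an assumption (2001 to 2017) that the quote does not state outright.

```python
def scale_repo_size(size_gib, files, target_files, years=None, target_years=None):
    """Linearly scale a repository size by file count and, optionally, by years of history."""
    scaled = size_gib * target_files / files
    if years is not None and target_years is not None:
        scaled *= target_years / years
    return scaled


TARGET_FILES = 3_500_000  # file count used in the quoted estimate

# Linux: 1.2 GiB repository + 3.2 GiB imported history = 4.4 GiB for 57k files.
linux_estimate = scale_repo_size(1.2 + 3.2, 57_000, TARGET_FILES)

# Chromium: 11 GiB for 246k files; history assumed to span 16 years (2001 to 2017).
chromium_estimate = scale_repo_size(11, 246_000, TARGET_FILES, years=16, target_years=20)

print(f"Linux-based estimate:    {linux_estimate:.0f} GiB")
print(f"Chromium-based estimate: {chromium_estimate:.0f} GiB")
```

Both figures land in the same range as the 300 GB repository mentioned earlier, which is the point of the comparison; real repository size does not necessarily grow linearly with file count or age.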

Splitting the Windows repository into suitably small sub-repositories wasn’t really feasible. Perhaps it would have been if they had started that way, but the sheer size and age of the code base meant going back and dividing it up now simply wasn’t tenable. Brian continues:

That meant we had to embark upon scaling Git to work on codebases that are millions of files, hundreds of gigabytes and used by thousands of developers. As a contextual side note, even Source Depot did not scale to the entire Windows codebase. It had been split across 40+ depots so that we could scale it out but a layer was built over it so that, for most use cases, you could treat it like one. That abstraction wasn’t perfect and definitely created some friction.

Why not just dump the history and start over? SuperImaginativeName poses a theory:

The NT kernel, its drivers, subsystems, APIS, hardware drivers, Win32 API, are all relied on by other systems including customers. Why do you think you can almost always run a 30-year-old application on Windows? Without the history, the kernel team for example wouldn't remember that 15 years ago a particular flag has to be set on a particular CPU because its ISA has a silicon bug that stops one customers legacy application running correctly. As soon as you remove history, you remove a huge collective amount of knowledge. You cant expect every developer to remember why a particular system works one way. Imagine noticing some weird code that doesn't look right, but that weird code actually prevents file corruption? The consequences of not having the history and fixing it in a new commit with "fixed weird bug, surprised this hadn't been noticed before" would be a disaster. Compare that to viewing the code's history and even realising it's actually correct. Windows isn't some LOB; everything is auditied.

After a couple of failed attempts, including trying to use Git submodules, Microsoft began to develop the Git Virtual File System:

We tried an approach of “virtualizing” Git. Normally Git downloads *everything* when you clone. But what if it didn’t? What if we virtualized the storage under it so that it only downloaded the things you need? So a clone of a massive 300GB repo becomes very fast. As I perform Git commands or read/write files in my enlistment, the system seamlessly fetches the content from the cloud (and then stores it locally so future accesses to that data are all local). The one downside to this is that you lose offline support. If you want that, you have to “touch” everything to manifest it locally but you don’t lose anything else – you still get the 100% fidelity Git experience. And for our huge code bases, that was OK.

Saeed Noursalehi adds:

With GVFS, this means that they now have a Git experience that is much more manageable: a clone now takes a few minutes instead of 12+ hours, checkout takes 30 seconds instead of 2-3 hours, and status takes 4-5 seconds instead of 10 minutes. And we’re working on making those numbers even better (of course, the tradeoff is that their first build takes a little longer because it has to download each of the files that it is building, but subsequent builds are no slower than normal).

Microsoft’s Investment in Git

Getting this to work correctly required improving the way Git accesses files. While not usually noticeable for locally stored repositories, previous versions of Git would scan more files than strictly necessary. If you’ve noticed Microsoft submitting performance enhancements to the Git OSS project over the last year, this is why.

jeremyepling writes:

We - the Microsoft Git team - have actually made a lot of contributions to git/git and git-for-windows to improve the performance on linux, mac, and windows. In git 2.10, we did a lot of work to make interactive rebase faster. The end result is an interactive rebase that, according to a benchmark included in Git’s source code, runs ~5x faster on Windows, ~4x faster on MacOSX and still ~3x faster on Linux.

Some of the enhancements to Git include:

sha1: use openssl sha1 routines on mingw https://github.com/git-for-windows/git/pull/915
preload-index: avoid lstat for skip-worktree items https://github.com/git-for-windows/git/pull/955
memihash perf https://github.com/git-for-windows/git/pull/964
add: use preload-index and fscache for performance https://github.com/git-for-windows/git/pull/971
read-cache: run verify_hdr() in background thread https://github.com/git-for-windows/git/pull/978
read-cache: speed up add_index_entry during checkout https://github.com/git-for-windows/git/pull/988
string-list: use ALLOC_GROW macro when reallocing string_list https://github.com/git-for-windows/git/pull/991
diffcore-rename: speed up register_rename_src https://github.com/git-for-windows/git/pull/996
fscache: add not-found directory cache to fscache https://github.com/git-for-windows/git/pull/994

Git Virtual File System

The prototype of the Git Virtual File System (GVFS) is implemented on the client side as a file system driver and a GVFS-enabled version of Git. It requires “Windows 10 Anniversary Update” or later. Once a Git repository is set up with GVFS, your normal Git commands work as usual. The GVFS subsystem sits underneath the file system, downloading any files you need from the server in the background.
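The fetch-on-first-access idea can be illustrated with a toy model in Python. This is a minimal sketch of the general pattern only, not how the GVFS driver is implemented: the class `LazyWorkingTree`, the `fake_fetch` function, and the in-memory `SERVER` dictionary are all hypothetical stand-ins.

```python
from typing import Callable, Dict, List


class LazyWorkingTree:
    """Toy model of on-demand hydration: contents are fetched on first read, then cached."""

    def __init__(self, paths: List[str], fetch: Callable[[str], bytes]) -> None:
        self._paths = set(paths)            # the file listing is known up front
        self._fetch = fetch                 # hypothetical stand-in for a server request
        self._cache: Dict[str, bytes] = {}  # contents already stored locally

    def read(self, path: str) -> bytes:
        if path not in self._paths:
            raise FileNotFoundError(path)
        if path not in self._cache:
            # First access: download the content and keep a local copy.
            self._cache[path] = self._fetch(path)
        return self._cache[path]

    def hydrated(self) -> List[str]:
        """Paths whose contents are available locally."""
        return sorted(self._cache)


# In-memory stand-in for the remote repository (illustrative data only).
SERVER = {
    "src/kernel.c": b"int main(void) { return 0; }\n",
    "docs/readme.txt": b"hello\n",
}
fetch_log: List[str] = []


def fake_fetch(path: str) -> bytes:
    fetch_log.append(path)  # record every simulated network round trip
    return SERVER[path]


tree = LazyWorkingTree(list(SERVER), fake_fetch)
tree.read("src/kernel.c")
tree.read("src/kernel.c")  # second read is served from the local cache
print(fetch_log)
print(tree.hydrated())
```

Only the file that was actually read gets downloaded, and it is downloaded once. The same property explains the offline tradeoff: anything not yet touched is simply not on the local disk.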

Since the GVFS client is open source, this is also a good opportunity for anyone to see how virtual file systems are implemented in Windows as drivers.

On the server side, you need something that implements the GVFS protocol. Currently, that means Visual Studio Team Services, but the protocol is open source so that other vendors can offer the same capabilities. The GVFS protocol itself is quite simple, consisting of four REST-like endpoints.
