
How Microsoft Solved Git’s Problem with Large Repositories

by Jonathan Allen on Feb 08, 2017. Estimated reading time: 7 minutes

While Git is often considered the best version control software in wide adoption, it is far from perfect. Some problems can be solved with third-party tools, but the way Git copies an entire repository, history included, onto each developer’s machine can be a deal breaker. Microsoft discovered this when they tried to migrate a 300 GB repository from their internal system to Git. The end result was the creation of the Git Virtual File System (GVFS).

The story begins around 2000, when Microsoft was primarily using an internal system known as “Source Depot”, which is a fork of Perforce. Over time, many of the teams migrated to TFVC, the original version control system in Team Foundation Server (TFS), but the largest teams couldn’t justify the effort to move off of Source Depot. Likewise, most teams used parts of TFS, but which parts varied widely from team to team, with many third-party and internally developed tools mixed in.

In an effort to simplify this complicated environment, Microsoft decided to standardize most teams around Visual Studio Team Services (a.k.a. TFS in the cloud) for work planning, automated builds, and source control. For source control, that meant VSTS with Git.

Why Git? According to Microsoft employee and reddit user jeremyepling, there are three reasons:

Git and public repos on GitHub are the defacto standard for OSS development. Microsoft does a lot of OSS development and we want our DevOps tools, like TFS and Team Services, to work great with those workflows.

We want a single version control system for all of Microsoft. Standardization makes it easy for people to move between projects and build deep expertise. Since OSS is tied to Git and we do a lot of OSS development, that made Git the immediate front runner.

We want to acknowledge and support where the community and our DevOps customers are going. Git is the clear front-runner for modern version control systems.

But as Brian Harry explains, that has some problems:

There were many arguments against choosing Git but the most concrete one was scale. There aren’t many companies with code bases the size of some of ours. Windows and Office, in particular (but there are others), are massive. Thousands of engineers, millions of files, thousands of build machines constantly building it- quite honestly, it’s mind boggling. To be clear, when I refer to Window in this post, I’m actually painting a very broad brush – it’s Windows for PC, Mobile, Server, HoloLens, Xbox, IOT, and more. And Git is a distributed version control system (DVCS). It copies the entire repo and all its history to your local machine. Doing that with Windows is laughable (and we got laughed at plenty). TFVC and Source Depot had both been carefully optimized for huge code bases and teams. Git had *never* been applied to a problem like this (or probably even within an order of magnitude of this) and many asserted it would *never* work.

Reddit user Ruud-v-A offers some context:

The Linux kernel repository is 1.2 GiB, with almost 12 years of history, and 57k files. The initial 2005 commit notes that the full imported history would be 3.2 GiB. Extrapolating 4.4 GiB for 57k files to 3.5M files gives 270 GiB indeed.

The Chromium repository (which includes the Webkit history that goes back to 2001) is 11 GiB in size, and has 246k files. Extrapolating that to 20 years and 3.5M files yields 196 GiB.
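The arithmetic behind both estimates is a simple linear extrapolation, and a short sketch makes it easy to check. The function name and parameters below are illustrative only. The 16-year figure for Chromium is an assumption (2001 to the article's 2017 date), since the quote does not state the history length it used.

```python
def extrapolate_repo_size(size_gib, files, target_files, years=None, target_years=None):
    """Scale a repository size linearly by file count and, optionally, by years of history."""
    scaled = size_gib * target_files / files
    if years is not None and target_years is not None:
        scaled *= target_years / years
    return scaled


# Linux: 1.2 GiB current repository plus 3.2 GiB of imported history, 57k files.
linux_estimate = extrapolate_repo_size(1.2 + 3.2, 57_000, 3_500_000)

# Chromium: 11 GiB, 246k files; 16 years of history is an assumption (2001 to 2017).
chromium_estimate = extrapolate_repo_size(
    11, 246_000, 3_500_000, years=16, target_years=20
)

print(f"Linux-based estimate:    {linux_estimate:.0f} GiB")
print(f"Chromium-based estimate: {chromium_estimate:.0f} GiB")
```

Under those assumptions the two estimates come out at roughly 270 GiB and 196 GiB, matching the figures quoted above. Repository size does not really grow linearly with file count, so this is a plausibility check rather than a measurement.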

Splitting the Windows repository into suitably small sub-repositories wasn’t really feasible. Perhaps it would have been if they had started that way, but the sheer size and age of the code base meant going back and dividing it up now simply wasn’t tenable. Brian continues:

That meant we had to embark upon scaling Git to work on codebases that are millions of files, hundreds of gigabytes and used by thousands of developers. As a contextual side note, even Source Depot did not scale to the entire Windows codebase. It had been split across 40+ depots so that we could scale it out but a layer was built over it so that, for most use cases, you could treat it like one. That abstraction wasn’t perfect and definitely created some friction.

Why not just dump the history and start over? SuperImaginativeName poses a theory:

The NT kernel, its drivers, subsystems, APIS, hardware drivers, Win32 API, are all relied on by other systems including customers. Why do you think you can almost always run a 30-year-old application on Windows? Without the history, the kernel team for example wouldn't remember that 15 years ago a particular flag has to be set on a particular CPU because its ISA has a silicon bug that stops one customers legacy application running correctly. As soon as you remove history, you remove a huge collective amount of knowledge. You cant expect every developer to remember why a particular system works one way. Imagine noticing some weird code that doesn't look right, but that weird code actually prevents file corruption? The consequences of not having the history and fixing it in a new commit with "fixed weird bug, surprised this hadn't been noticed before" would be a disaster. Compare that to viewing the code's history and even realising it's actually correct. Windows isn't some LOB; everything is auditied.

After a couple of failed attempts, including trying to use Git submodules, Microsoft began to develop the Git Virtual File System:

We tried an approach of “virtualizing” Git. Normally Git downloads *everything* when you clone. But what if it didn’t? What if we virtualized the storage under it so that it only downloaded the things you need? So a clone of a massive 300GB repo becomes very fast. As I perform Git commands or read/write files in my enlistment, the system seamlessly fetches the content from the cloud (and then stores it locally so future accesses to that data are all local). The one downside to this is that you lose offline support. If you want that, you have to “touch” everything to manifest it locally but you don’t lose anything else – you still get the 100% fidelity Git experience. And for our huge code bases, that was OK.

Saeed Noursalehi adds:

With GVFS, this means that they now have a Git experience that is much more manageable: a clone now takes a few minutes instead of 12+ hours, checkout takes 30 seconds instead of 2-3 hours, and status takes 4-5 seconds instead of 10 minutes. And we’re working on making those numbers even better (of course, the tradeoff is that their first build takes a little longer because it has to download each of the files that it is building, but subsequent builds are no slower than normal).
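The fetch-on-first-access behavior described in these two quotes can be illustrated with a toy model. This is a minimal sketch, not how GVFS is implemented (the real client is a file system driver). The class name `LazyFileStore`, its methods, and the dictionary standing in for the server are all hypothetical.

```python
from typing import Callable, Dict, Iterable


class LazyFileStore:
    """Toy model of a virtualized working copy: file contents are fetched on
    first access and served from a local cache afterwards."""

    def __init__(self, paths: Iterable[str], fetch_remote: Callable[[str], bytes]):
        self._paths = set(paths)           # the "clone" knows file names only
        self._fetch_remote = fetch_remote  # hypothetical call to the server
        self._local: Dict[str, bytes] = {}
        self.remote_fetches = 0

    def read(self, path: str) -> bytes:
        if path not in self._paths:
            raise FileNotFoundError(path)
        if path not in self._local:
            # First access: download the content and keep it locally.
            self._local[path] = self._fetch_remote(path)
            self.remote_fetches += 1
        return self._local[path]

    def hydrate_all(self) -> None:
        """'Touch' every file so that later reads need no network access."""
        for path in sorted(self._paths):
            self.read(path)


# A dictionary stands in for the remote repository in this sketch.
server = {"kernel/main.c": b"int main(void) {}", "docs/readme.txt": b"hello"}
store = LazyFileStore(server.keys(), lambda p: server[p])

store.read("kernel/main.c")   # first read goes to the "server"
store.read("kernel/main.c")   # second read is served from the local copy
print(store.remote_fetches)   # one fetch so far

store.hydrate_all()           # fetch everything that is still missing
print(store.remote_fetches)   # every file has now been fetched exactly once
```

The model shows both trade-offs mentioned above: the first access to each file pays the download cost, which is why a first build is slower, and working offline requires hydrating every file in advance.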

Microsoft’s Investment in Git

Getting this to work correctly required improving the way Git accesses files. While not usually noticeable for locally stored repositories, previous versions of Git would scan more files than strictly necessary. If you’ve noticed Microsoft submitting performance enhancements to the Git OSS project over the last year, this is why.

Jeremyepling writes:

We - the Microsoft Git team - have actually made a lot of contributions to git/git and git-for-windows to improve the performance on linux, mac, and windows. In git 2.10, we did a lot of work to make interactive rebase faster. The end result is an interactive rebase that, according to a benchmark included in Git’s source code, runs ~5x faster on Windows, ~4x faster on MacOSX and still ~3x faster on Linux.

Some of the enhancements to Git include:

Git Virtual File System

The prototype of the Git Virtual File System (GVFS) is implemented on the client side as a file system driver and a GVFS-enabled version of Git. It requires “Windows 10 Anniversary Update” or later. Once a Git repository is set up with GVFS, your normal Git commands work as usual. The GVFS subsystem sits underneath the file system, downloading any files you need from the server in the background.

Since the GVFS client is open source, this is also a good opportunity for anyone to see how virtual file systems are implemented in Windows as drivers.

On the server side, you need something that implements the GVFS protocol. Currently, that means Visual Studio Team Services, but the protocol is open source so that other vendors can offer the same capabilities. The GVFS protocol itself is quite simple, consisting of four REST-like endpoints.
