
Distributed Version Control Systems: A Not-So-Quick Guide Through


May 07, 2008 20 min read


Since Linus Torvalds' presentation at Google about git in May 2007, adoption of and interest in Distributed Version Control Systems have been rising steadily. We will introduce the concept of Distributed Version Control, see when to use it and why it may be better than what you're currently using, and look at three players in the area: git, Mercurial and Bazaar.

What?

A Version Control System (VCS, also called SCM) is responsible for keeping track of successive revisions of the same unit of information. It is commonly used in software development to manage a project's source code. The historical VCS of choice for projects was CVS, started in 1986. Since then many other SCMs have flourished, each with specific advantages over CVS: Perforce (1995), CVSNT (1998), Subversion (2000), ...

In February 2002, in order to manage the mainline kernel sources, Linus chose BitKeeper, which he described as "the best tool for the job". Prior to this, Linus was integrating each patch manually. While all its predecessors worked in a client-(central) server model, BitKeeper was the first VCS to allow a truly distributed system in which everybody owns their own master copy. Due to licensing conflicts, BitKeeper was later abandoned in favor of git (Apr, 2005). Other systems following the same model are available: Mercurial (Apr, 2005), Bazaar (Mar, 2005), darcs (Nov, 2004), Monotone (Apr, 2003).

Why?

Or, to ask a more precise question: why are central VCSs (and notably Subversion) not satisfying?
Several things are blamed on Subversion:

The major reason is that branching is easy but merging is a pain (and one does not go without the other). Any project of significant size that you work on will likely need easy gymnastics with split, dev and test branches. Subversion has no history-aware merge capability, forcing its users to track manually exactly which revisions have been merged between branches, which is error-prone.
There is no way to push changes to another user (without submitting to the central server).
Subversion fails to merge changes when files or directories are renamed.
The trunk/tags/branches convention can be considered misleading.
Offline commits are not possible.
.svn files pollute your local directories.
svn:externals can be painful to handle.
Performance.
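
To make the first complaint concrete, here is a minimal sketch of what "history-aware" merging means. The revision graph and the function names are hypothetical and do not reflect how any of the tools discussed here is implemented. When every revision records its parents, the tool can compute by itself which revisions a branch still lacks, instead of leaving that bookkeeping to the user.

```python
# Hypothetical revision graph: each revision id maps to the ids of its parents.
HISTORY = {
    "a": [],
    "b": ["a"],
    "c": ["b"],        # a commit on trunk
    "d": ["b"],        # first commit on the feature branch
    "e": ["d"],        # feature branch tip
    "m": ["c", "d"],   # trunk tip: a merge that already integrated "d"
}


def ancestors(history, tip):
    """Return every revision reachable from tip, tip included."""
    seen = set()
    stack = [tip]
    while stack:
        rev = stack.pop()
        if rev not in seen:
            seen.add(rev)
            stack.extend(history[rev])
    return seen


def revisions_to_merge(history, source_tip, target_tip):
    """Revisions of the source branch that the target has not integrated yet."""
    return ancestors(history, source_tip) - ancestors(history, target_tip)


# "d" was merged earlier, so only "e" is still missing from trunk.
print(sorted(revisions_to_merge(HISTORY, "e", "m")))
```

Because the earlier merge `m` is itself part of the history, a second merge from the same branch does not need anyone to remember that `d` was already integrated.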

Modern DVCSs fixed those issues, both through their own implementation tricks and through the very fact that they are distributed. But as we will see in the conclusion, Subversion has not given up yet.

How?

Decentralization

Distributed Version Control Systems take advantage of the peer-to-peer approach. Clients can communicate with each other and maintain their own local branches without having to go through a central server or repository. Synchronization then takes place between the peers, who decide which changesets to exchange.

This results in some striking differences and advantages from a centralized system:

No canonical, reference copy of the codebase exists by default; only working copies.
Disconnected operations: common operations such as commits, viewing history, diff, and reverting changes are fast, because there is no need to communicate with a central server. Even if a central server exists (for a stable, reference or backup version), when distribution is used well it should be queried far less than in a CVCS setup.
Each working copy is effectively a remote backup of the codebase and change history, providing natural security against data loss.
Experimental branches: creating and destroying branches are simple and fast operations.
Collaboration between peers is made easy.
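
The peer-to-peer exchange described above can be sketched with a toy model. The `Repo` class and its methods are hypothetical and only illustrate the idea (real tools also transfer file contents and verify integrity): each peer commits locally, and a pull copies only the changesets the other side has and the local side lacks.

```python
class Repo:
    """Toy repository: a mapping from changeset id to parent ids."""

    def __init__(self, name):
        self.name = name
        self.changesets = {}

    def commit(self, cs_id, parents=()):
        # Commits are purely local: no server is contacted.
        self.changesets[cs_id] = tuple(parents)

    def pull(self, other):
        """Copy the changesets that `other` has and we lack; return their ids."""
        missing = {
            cs_id: parents
            for cs_id, parents in other.changesets.items()
            if cs_id not in self.changesets
        }
        self.changesets.update(missing)
        return sorted(missing)


alice = Repo("alice")
alice.commit("c1")

bob = Repo("bob")
bob.pull(alice)               # bob starts from a full copy of alice's history

alice.commit("c2", ["c1"])    # both keep working while disconnected
bob.commit("c3", ["c1"])

print(alice.pull(bob))        # peers exchange directly, no central server
print(bob.pull(alice))
```

After the two pulls both peers hold the same three changesets, and each of them still has to merge the two heads `c2` and `c3` locally.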

For an introduction to DVCS collaboration practices, you might have a look at the Intro to Distributed Version Control (Illustrated) or at possible Collaboration workflows.

You should also be aware that opting for a DVCS has some disadvantages, notably in terms of complexity: the decentralized view is very different from the centralized world, and your developers might need some time to get used to it. Changeset tracking instead of file tracking can also be confusing, even though it is very powerful and makes it theoretically possible to track a method moving across files.

Who?

The battle rages on! Some of the Good and the Bad.

The good and the bad points come essentially from an updated compilation of blogs (updated because some old arguments are no longer true) and from my personal experience.
Note that this is a very short list of features (e.g. git has more than 150 commands), and some issues might be more critical than others.

	git	Mercurial	Bzr

Project
Maintainer	Junio C Hamano	Matt Mackall	Canonical Ltd. - Became GNU project
Concurrency model	Merge	Merge	Merge
License	GPL	GPL	GPL
Platforms supported	POSIX, Windows, Mac OS X	Unix-like, Windows, Mac OS X	Unix-like, Windows, Mac OS X
Cost	Free	Free	Free
Maturity
Version	> 1.0 (1.5.5)	> 1.0 (1.0)	> 1.0 (1.3.1)
Project Start	Apr, 2005	Apr, 2005	Mar, 2005
Implementation
SLOC (without Test src)
SLOC Count	130550	38172	79864
Test Suites	~20% of sources dedicated to Tests	~25% of sources dedicated to Tests	~50% of sources dedicated to Tests
History model	Snapshot	Changeset	Snapshot
Repo. growth	O(patch)	O(patch)	O(patch)
Network protocols	HTTP, FTP, email bundles, custom, ssh, rsync	HTTP, ssh, email	HTTP, SFTP, FTP, ssh, custom, email bundles
Basic Features
Atomic commits
File renames	implicit
Merge file renames
Symbolic links
Pre/post-event hooks
Signed revisions			Partial / Manual verification
Merge tracking
End of line conversions			Planned (1.6)
Tags
International Support		Planned
Partial checkout	Use submodules instead	Planned	Planned
Model / Architecture
File	Single top-level `.git` directory	Single top-level `.hg` directory	Single top-level `.bzr` directory
Model		Simple branch model (a clone is a branch)	Simple branch model (a clone is a branch)
Repository Specificities			Shared repositories for sharing revisions between branches. Supposed-to-be Better Storage Model
Directories versionable
Submodules	Submodule support via `git-submodule`	Submodule support via the Forest extension (as used by OpenJDK)	Workaround with 3rd Party tool ConfigManager
Per file commit	Goes against architecture	Goes against architecture	Goes against architecture
Rebase / Queue	`rebase`	Mercurial Queues	Rebase plugin, Loom plugin (comp. with Quilt)
Web Access
	Note: Repository can also be shared read-only via static files over HTTP.	Note: Repository can also be shared read-only via static files over HTTP.	Not as good as the 2 others. Faster Smart Server now available.
	gitweb, wit, cgit	hgweb (single rep), hgwebdir (multi rep)	webserve, trac, Loggerhead
Integration
Integration-ability	git is more scriptable than integrable through API (even if there are some frontend api like Ruby/Git)		Rich API
Migration	Good. `git-svn` is also a very powerful and easy to put in place bi-directional gateway between Subversion and Git allowing you to use Git over an existing Subversion Repository.	Good. `hgsvn` not as polished as `git-svn`.	Well covered but slow.
Issue Tracker Integration	Trac Versioning System Backend Plugin avail. Bugzilla workaround. No JIRA Plugin.	Trac Versioning System Backend Plugin avail. Bugzilla avail. JIRA.	Trac Versioning System Backend Plugin avail. Bugzilla avail. No JIRA Plugin.
IDE Plugins	Existing dev versions: Idea, Eclipse, NetBeans	Existing dev versions: Idea, Eclipse, NetBeans	Existing dev versions: Idea, Eclipse. Missing: NetBeans
Plugins	Emacs / Vim / ...	Emacs / Vim / ...	Emacs / Vim / ...
Performance
	git has historically been faster than its competitors		bzr has historically been the slowest of the 3.
Advanced Features
	With more than 150 binaries, it's hard not to find the killer command you always dreamt of (even if this increases complexity)
Complexity
			Bzr claims to hide complexity by keeping a clean user interface while adapting to the different collaboration workflows and their evolution within a team.
Revision Naming	Git revisions are SHA-1 hashes, which makes them less user-friendly when doing a diff between two revisions. This was chosen to guarantee the safety and integrity of data, and it also happens to avoid collisions when merging with other peers.	Simple naming	Simple revision id naming: r1, r2, etc.
Commands	Familiar, with some specificities such as the `rename` command, which differs from other SCMs (it won't be changed because of backward compatibility). git is the most advanced SCM in terms of commands, but if you add up all possible commands and their options, you end up with a huge number of possibilities that is hard to master. The fact that a tool like Easy Git exists means that Git can be considered quite complex.	Familiar (not far from Subversion)	Familiar
UserBase
	Large userbase / Numerous (and large) Projects running git and interest in user feedback	Large userbase / Numerous (and large) Projects running hg.	Smallest Market Share: Apart from Canonical projects (Ubuntu, Launchpad), no big names are using it yet. Bazaar is also less well-known.
	Linux kernel, Cairo, Wine, X.Org, Rails, Rubinius, Compiz Fusion	Xine, OpenJDK, OpenSolaris, NetBeans, (Part of) Mozilla	(Part of) Ubuntu, Launchpad
Documentation	Good documentation. Very good man pages (with a lot of examples)	Good documentation.	Good documentation.
Platforms
	Poor Windows support	Cross Platform	Cross Platform
Additional Misc. Good Points
	git is scriptable over pluginable which is a good and bad point (easy entry point through script, all tests are done in bash script by the way). Very tunable for advanced administration: staging area, dangling objects, detached heads, plumbing vs porcelain, reflogs. Local branches are possible.	Robust Renaming.	Robust Renaming. Supports lightweight checkouts (without history). Bound branches. Local branches are possible. Patch Queue Manager (manages several branches, performing merges for developers)
Some very cool commands / extensions	`git-stash` (when interrupted for a quick bug-fix on another project), `git-cherry-pick` (for picking only single commits, rather than complete branches), `git-rebase` (to forward-port your local commits; Quilt-like changesets, like Mercurial Queues), `git add -i` (equivalent of the Mercurial RecordExtension). There's hardly a command you dreamt of that git doesn't have.	RecordExtension (it lets you choose which parts of the changes in a working directory you'd like to commit). Hg Shelve Extension (same as git-stash: to interactively select changes to set aside).	Shelf plugin (same as git-stash: when interrupted for a quick bug-fix on another project; latest release in Jan. 2007). bzr-dbus (for broadcasting hooks and revisions).
Additional Misc. Bad points
	Renaming not handled as well as in bzr (Test Case). Read-only static HTTP setup is a bit obtuse (--bare and update-server-info). Handling of Unicode (UTF-16 encoded) files. Storage model: Git stores each newly created object as a separate file; these can later be packed into a single file, delta-compressed against each other. This forces you to do administration and run the pack command on a regular basis. Mixture of C, Perl and bash scripts, which makes it far less obvious to port to other systems while maintaining the same feature set.	~~Renaming not handled as well as in bzr. [FIXED]~~ Local branches are not possible; clone is used instead. To avoid wasting space, Hg uses hardlinks, which causes problems on pull (and also under Windows). Forest extension (submodules) not native and not well documented.
Gui
Windows		TortoiseHg	Complicated. TortoiseBzr (no submit on launchpad project since Aug 2007, but the project is still active). WildcatBZR.
Linux	gitk, git-gui, tig, ...	TortoiseHg, Hgtk, hgct	bzr-gtk, ...
Installation
	You'll need either cygwin installed or alternative git installation like Git on MSys
Free Hosting Available
	GitHub, gitorious	FreeHg	Launchpad
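
As an illustration of the "Revision Naming" row, here is a minimal sketch of how git derives a SHA-1 name from content. It covers only file contents ("blob" objects); commit ids are computed the same way over the text of the commit object. The function name `git_blob_id` is hypothetical; the header format is the one `git hash-object` uses.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """SHA-1 id git gives to a file's content (a "blob" object)."""
    # The object is hashed as: "blob <size in bytes>" + NUL byte + content.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# The empty file always gets the same well-known id in every repository.
print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Since the name depends only on the content, two peers that create the same object independently give it the same name, and any corruption of stored data changes the name it would hash to. This is the integrity and collision-avoidance property the table refers to, at the price of ids that are harder to type than r1, r2, etc.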

Some debates remain open, such as the fact that in bzr directories are branches, not branch containers as in git.
~~Also the fact that Mercurial is using external tools to do merges is also criticized by Bzr~~. This is not true anymore as of Mercurial v1.0.
You'll find other biased comparisons made by the Bazaar team: Bazaar vs Git, Bazaar vs Mercurial, and the associated reply from Mercurial.

Some User Statistics from Git Survey 2007

Note that the survey offered no option to choose Ruby as a language of proficiency. It would be interesting to add it to the 2008 survey.

It's also funny to see that about 1/3 of people use a Distributed VCS (here git) in collaboration with ... 0 or 1 person!

Guis


gitk on Linux	TortoiseHg on Windows	OliveGtk on Linux

The GUIs look nearly the same, though I have a preference for gitk because of its effectiveness. TortoiseHg (with folder watch activated) was really slow with a big repository like Mozilla.

A quick and non-exhaustive look at performance

Conditions of the bench

git is still leading the performance battle, but Hg and Bzr have made great improvements in the past year.

Note that Mercurial doubles the number of files in your repository (the history is kept per file in .hg/store/data). This does not seem to be a good choice for Windows systems running on NTFS.
It is also interesting to see that git takes great advantage of the operating system when executing commands. While Hg and Bzr do not spend a large proportion of their time in the system, git can spend 10-40% of its CPU time in system calls, which raises the question of how it will perform on Windows, where the git developers won't have access to all the system performance tricks they are used to on Linux.
Single merges and merge queues should also be tested; this is a tedious part to benchmark.

Benchmarks should also be run on Windows as:

Even if your server is running on *nix, many developers still have a Windows environment at work, and a DVCS transfers more processing to the developer's workstation.
Performance might be really different on a Windows machine.

When?

Experience stories.

I had the chance to catch up with Kelly O'Hair from Sun about Sun's choice of Hg for OpenJDK.

Sebastien Auvray: I read the reasons for migrating from TeamWare to Mercurial but had remaining questions. Did you simply follow OpenSolaris choice?

Kelly O'Hair: To some degree yes, but the OpenSolaris choice also became the Sun wide choice to any Sun Software teams having to convert. The OpenSolaris investigation was pretty complete and they had all the exact requirements we had. We had to convert for OpenJDK, because TeamWare was unacceptable for an open source project, the answer of Mercurial was pretty obvious for us.

Or did you run a fresh tournament and try the other DVCSs again (git, ...)?

We did not do a detailed re-investigation; that seemed like a waste of time. The only other possible choice in my view was git, and git wasn't giving Windows a priority, which we needed. Again the choice was obvious.

The OpenSolaris reports date from April 2006, which is 2 years ago.

Understood. Some things may have changed, git has improved, but the ball was rolling, and Mercurial was improving too.

Also did you encounter any specific problems in the migration?

File permissions and ownership can be a problem when sharing a repository via an NFS or UFS file system, so we finally set up a server to handle the shared repositories, which is the better way. That could be made easier.
The other issue is that using hooks to roll back or filter pushes creates a window where someone could accidentally pull changes that will be rolled back, so you have to use a pair of repositories, one for pushes and one for pulls, with an automatic sync after the hooks run to sync them up.
Using forests also introduces a problem because a forest push is just a set of individual pushes, and if one push fails, technically you would want to roll back all the other pushes. Nobody is doing this; people are just taking their chances. If the repositories in the forest are fairly independent, this is not a real problem.

In the day-to-day usage?

Remains to be seen. Change like this is easy for some, harder for others. Given time, I think most people have and will adapt and learn to love it.
The concept of "working set files" (having to do 'hg update') and having to merge changesets that don't seem to merge anything is confusing to people. Also, the idea that they are pushing changesets and not files is something people have a problem with, "Why can't I just bringover this one file?".

What is better than TeamWare?

Much much much faster than TeamWare. Our teams in China and Russia are looking forward to full deployment because they don't need to keep mirrors of integration areas. Refreshes (pulls) are very fast over slow connections.
The state of the repository in Mercurial is well-defined, unlike TeamWare which allowed for partial workspaces, TeamWare was just a loose bag of individually managed files (SCCS files).
The changeset concept was missing in TeamWare, along with the concept of well known simple state of the entire repository (a simple changeset id).

Is there anything you're missing from TeamWare?

People are missing the email notifications and putback/bringover transaction history, but the changeset provides much of that.
What may be missing is some kind of repository transaction history, but again, email archives of Mercurial events could provide this.

Is Hg becoming the VCS of choice for Sun including internal projects? Or is Sun using it only for public projects that need openness?

Both internal and external projects are converting, where it makes sense.
I've seen a big increase in interest from internal projects that are taking the plunge.

I also caught up with Pierre d'Herbemont from VLC to get their opinion about git.

Sebastien Auvray: Firstly what was the version control system you were using prior to using Git?

Pierre d'Herbemont: SVN and a git-svn mirror.

When did you migrate?

We opened a git mirror of the svn tree, to ease VLC Google Summer of Code projects. So that was back then. Then we totally migrated to git on March 1st-2nd 2008.

Why did you choose Git over its competitors?

Over SVN: Git is fast. Branching is cheap. Atomic commits. Rebasing on top of another tree.
Over other distributed systems: a proven user base (Linux kernel). I have been successfully using it while working on Wine. Git is sexy. And some core developers had experience with Git, whereas no one had any with Mercurial and such. Nothing technical there.

Also did you encounter any specific problems in the migration? In the day-to-day usage?

We encountered some trouble with Trac and Buildbot. Their support for Git is really minimal, especially in their released versions. We had to check out the latest Buildbot trunk. For Trac we are using a crippled Git plugin. The Trac Git Plugin needs Trac 0.11, but Trac 0.11 isn't stable and has some known memory leaks that prevent us from switching. So basically we are waiting for them to fix that...
It took some time for some committers to get accustomed to Git. But after two days, everything seemed fine. And some Git beginners are starting to really enjoy Git.

So what ?

Choosing between a distributed VCS and a central VCS is far from easy. A DVCS will definitely change the way you work and collaborate. Subversion, one of the leading central VCSs, has not yet given up the performance and features battle, and version 1.5 should come up with good compromises. It can count on its existing userbase and on its simplicity (at the cost of some pain). In very specific cases, such as projects dealing with large opaque binary files, Subversion would be better than a DVCS because the client-side space usage stays constant. Also, if you use partial checkouts heavily, svn will perform better (but when used massively, this reveals a problem in the way your modules are set up).
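
The point about large binary files can be sketched with some rough arithmetic. This is an illustration under stated assumptions, not a measurement: every revision of the file is assumed to be stored in full (no useful delta or compression), and the centralized client is assumed to keep the working file plus one pristine copy of the latest revision. The function name `client_side_bytes` is hypothetical.

```python
def client_side_bytes(file_size, revisions, distributed):
    """Rough client-side disk usage for one binary file, in bytes.

    Assumptions (not measurements): every revision is stored in full, and a
    centralized client keeps the working file plus one pristine copy.
    """
    working_copy = file_size
    if distributed:
        # A clone carries the full history next to the working copy.
        return working_copy + revisions * file_size
    # Only the latest revision lives on the client; history stays on the server.
    return working_copy + file_size


MB = 1024 * 1024
for revisions in (1, 10, 100):
    central = client_side_bytes(50 * MB, revisions, distributed=False) // MB
    dvcs = client_side_bytes(50 * MB, revisions, distributed=True) // MB
    print(f"{revisions:>3} revisions: central {central} MB, distributed {dvcs} MB")
```

Under these assumptions the centralized client stays at 100 MB however long the history gets, while a full clone grows linearly with the number of revisions.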

Once you have chosen either a distributed or a central solution, it will also be hard to compare the competitors in that area, as implementations, commands and ultimately performance can be very different. And there are no real existing benchmarks for the common operations.
In this hard battle, Bazaar lost many influential early adopters (Mozilla, Solaris, OpenJDK) because of its poor early performance. It also has to be said that the Bazaar website is a lot more marketing-oriented: it publicizes not-entirely-true differences with its competitors, and it publishes benchmark comparisons only about space efficiency, while there is no timing comparison of daily commands: diff, add, ...
I feel that even though the 3 projects started at nearly the same time, bzr faced many more performance and design problems early on, making it a bit less mature than its competitors now.
A phenomenon not seen before: some choices seem to have emerged based on the language used by the communities. Java / Sun related developments seem more interested in Mercurial, while C / Linux / Ruby / Rails related projects are attracted by git.

I hope this article enlightened you; your experiences and feedback are always welcome!

Credits:
People who kindly accepted my interview: Kelly O'Hair, Pierre d'Herbemont.
Ian Clatworthy for his help and responsiveness on the conversion of the Mozilla Hg Repository to Bzr.
#git, #mercurial, #bzr on Freenode IRC, #mozilla on Mozilla IRC.
Athletism Picture by Antonio Juez

Random quotes:
Linus Torvalds: "Subversion has been the most pointless project ever started". "If you like using CVS, you should be in some kind of mental institution or somewhere else".
Mark Shuttleworth (Ubuntu / Canonical Ltd.): "Merging is the key to software developer collaboration."
Ian Clatworthy (Canonical / Bazaar): "By 2011-2012, I predict this technology will be widely adopted and many teams will wonder how they once managed without it."
Assaf Arkin in Git forking for fun and profit originally: "Apache built a great infrastructure around SVN, lots of sweat and tears went into making it happen, and at first I felt like we’re circumventing all of that. But the longer I thought about it, the more I realized that Git is just more social than SVN, and that’s exactly what Apache is about."

[Article updated on 20080512 according to the comments here and from Ian Clatworthy and reedit]:

Bzr plugins and Windows Gui added: rebase, ..., Wildcat BZR, ...
Hg Shelve added.
SLOC for Hg updated (HTML doc used to be counted, I kept contrib which is responsible for the presence of Lisp and Tcl/Tk).
Repository size for git updated after doing proper repack command (git repack -a -f -d --window=100 --depth=100 until size becomes constant) (Thanks to the comment by dhamma vicaya).

Apologies:
darcs and Monotone were not taken into account in this comparison because gathering all this information and actually testing these 3 DVCSs was already hard work. Strangely, even though they are the oldest on the DVCS scene, the focus is more on the DVCSs I reviewed here (which doesn't help shift the focus, I admit, but darcs and Monotone users/developers are welcome to post comments and advertise here!).

References:
The very exhaustive Wikipedia page about Git.
Distributed Revision Control Wikipedia page.
Comparison of Revision Control Software Wikipedia page.

Distributed Version Control - Why and How by Ian Clatworthy, Canonical (Bazaar).
Intro to Distributed Version Control (Illustrated) by Kalid Azad.
Distributed Version Control Systems by Dave Dribin (who finally chose Mercurial).
Why Distributed Version Control by Wincent Colaiuta.
Source Code Management for OpenSolaris. OpenSolaris SCM Project History (2005).
Mercurial OpenJDK Questions by Kelly O'Hair, Sun.
Why I chose git by Aristotle Pagaltzis.
Distributed SCM by Gnome crew.
FreeBSD SCM Requirements.
Open Office Requirements.
Mozilla VCS Requirements.
Use Mercurial you git! by Ian Dees.
What a DVCS gets you (maybe) by Bill de hÓra.
The Differences Between Mercurial and Git.
And all URLs referenced in this article.

Cheat Sheets:
Git Cheat Sheet
Mercurial Cheat Sheet
Bazaar Quick Start Card

Benchmark conditions.
The benchmark was run on an AMD Athlon(tm) 64 Processor 3500+ with 1GB RAM, on Linux Kubuntu 6.10 Edgy x86_64 with an ext3 filesystem.
Each command was run 8 times (and the best and worst times were discarded). Commands were run locally through the filesystem (tests over other protocols should definitely be done: even if DVCSs are not coupled to a central server, badly implemented network communications can lower user performance).
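
The measurement procedure above can be sketched as follows. This is not the script used for the article: the function names are hypothetical, and averaging the remaining runs is an assumption, since the text only says that the best and worst times were cut out.

```python
import statistics
import subprocess
import time


def trimmed_mean(timings):
    """Drop the best and the worst timing, then average the rest (assumed)."""
    if len(timings) < 3:
        raise ValueError("need at least 3 timings to drop best and worst")
    kept = sorted(timings)[1:-1]
    return statistics.mean(kept)


def time_command(argv, runs=8):
    """Run a command `runs` times and return its trimmed mean wall time."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(argv, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return trimmed_mean(timings)


# Eight made-up timings in seconds; the 5.0 outlier and the 0.9 best run are dropped.
print(trimmed_mean([1.2, 0.9, 1.0, 1.1, 5.0, 1.0, 1.0, 1.0]))
```

A call such as `time_command(["git", "status"])`, run from inside a repository, would then time one of the daily commands. Dropping the extremes keeps a single cold-cache or disturbed run from dominating the result.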

Versions used:

Mercurial 1.0 released on 2008-03-24 (with Issue 1050 patch for Edgy inotify option)
Git 1.5.5 released on 2008-04-08
Bazaar 1.3.1 released on 2008-04-10

The repository consists of a snapshot of 12456 changesets (from 20080303, 70853 total revisions from the hg repository) and ~30000 files from the Mozilla repository (originally in hg format, and translated into a git repository with hg-fast-export.sh, and into a bazaar repository with hg-fast-export.sh coupled with the fast-import plugin).
Default file formats were used, and the git repository size remained the same after running git-gc (which can be considered normal for a freshly migrated repository). One file was modified (dom/src/base/nsDOMClassInfo.cpp), just like in a benchmark test done by Jst 1.5 years ago.

About the Author

Sébastien Auvray is a senior software designer and technology enthusiast. After being forced to use CVS and then svn, he now has to suffer the daily use of Perforce at work. Sébastien is also one of the Ruby editors of InfoQ.
