Distributed Version Control Systems in the Enterprise

Posted by Pablo Santos on May 17, 2012

Every major open source project worldwide has already embraced Distributed Version Control Systems (DVCS). Will enterprises be next?

Intro

Circa 2004, the world of version control (also called Software Configuration Management, or SCM; the two are not exactly the same, but the pragmatists among us will say they’re quite close) lived in a status quo. Version control had become a commodity: the goal was simply to have one, no matter which.

Application Lifecycle Management (ALM) packages that wrapped version control were the ones delivering the core value to projects, especially in the enterprise. The trend was management-oriented, process-driven and control-centric.

And then everything turned upside down.

The Linux kernel hackers needed a new version control system and they begged their über-guru, Torvalds, for a solution. The result was Git, a hacker-oriented tool built to deal with the fast-changing, collaboratively developed Linux kernel sources.

With this, version control switched from commodity to productivity booster. The revolution had begun.

What is distributed all about?

Simply put: DVCS moves the project’s versioned repositories from a central location onto each developer’s machine.

Take a look at Figure 1 which describes what the conventional "centralized version control" is about: one central server handles the repository (or repositories) and the developers access it in order to send changes, retrieve changes and so on. Each developer has one or many "working copies": the directories containing a single working version of the source code.

Figure 1. Centralized version control

Figure 2 shows what distributed version control is about: each developer has a repository on their own machine, pushing changes to and pulling changes from the other developers.

Figure 2. Distributed version control

The "central repository" is gone, the central server is gone too, and each developer can work with total independence, unrestricted by potential network slowdowns.

Why is the central server gone?

The scenario for the Linux kernel developers did not exactly match Figure 1; it was closer to Figure 3. Developers were widely distributed, working at different companies or even from home, and all contributing to the same project over the Internet.

Figure 3. Centralized version control through the Internet

Back in 2005, it was not uncommon for developers to be slowed down by Internet connection issues: slow connections, service outages and so on. Surprisingly, the same holds true in 2012. Developers hate waiting: making them wait causes frustration and a great loss of productivity. If an operation stalls because a large amount of data has to be sent over the wire synchronously before it completes, developers’ productivity suffers.

Hence, for open source projects, getting rid of the need to contact the central server was a major breakthrough:

  • Developers were able to check in their changes at light speed, into their own local repository. Slowness was gone.
  • Dependency on a central site was gone too: the repository of each project contributor contained a "clone" of the project, so there was no way for the project to be fully destroyed either.
  • Developers were free to experiment, making their own changes locally and privately. Changes are published only when a developer feels they’re ready to be shared, and all the while they remain under the control of the local repository. Previously, the only options for a change were to stay uncontrolled in the developer’s working copy or to be checked in on the central server (which was not always convenient).

One repo to rule them all

At the end of the day, even in the purest distributed open source projects (including the Linux Kernel) there must be some sort of centralization.

DVCS technology allows projects to switch from a centralized working mode to a pure "peer to peer" one. But still, real life imposes the need for a "master copy" to be used as the "good one".

Continuing with the Linux Kernel example, Linus himself is the one ruling the "master copy" which receives changes from all the contributors; he approves them and they become part of the main source for developers to clone. See Figure 4.

Figure 4. A master repository is always welcome

Daily workflow changes

In the move from centralized to distributed, there are some noticeable changes in the way a developer works. Check Figure 5.

Figure 5. Daily workflow in centralized and distributed

The basic operations in centralized mode are checking in changes to the central repository and retrieving new changes made by others from it.

In distributed mode, checkin and update are fast and frequent since they’re performed against the local repository, which doesn’t involve the network. To "deliver" the changes to the central master repo, a push operation is performed. Retrieving changes from the master is called a pull operation.

While checkin is always a "synchronous" operation, push and pull are decoupled from it: they can run whenever it is convenient, so waiting on the network no longer interrupts everyday work.
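
The workflow described above can be pictured with a toy model. The sketch below is illustrative only: the `Repo` class, its method names and the changeset labels are hypothetical and do not correspond to the API of Git or any other tool. It shows checkins touching only the local repository, with push and pull being the only steps that move changesets between repositories.

```python
class Repo:
    """Toy repository: just an ordered list of changeset ids (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.changesets = []

    def checkin(self, changeset):
        # Purely local: nothing is sent anywhere.
        self.changesets.append(changeset)

    def push(self, remote):
        # Deliver the changesets the remote does not have yet.
        missing = [c for c in self.changesets if c not in remote.changesets]
        remote.changesets.extend(missing)
        return missing

    def pull(self, remote):
        # Retrieve the changesets this repository does not have yet.
        missing = [c for c in remote.changesets if c not in self.changesets]
        self.changesets.extend(missing)
        return missing


master = Repo("master")
alice = Repo("alice")
bob = Repo("bob")

# Alice checks in twice; the master repo is not contacted at all.
alice.checkin("a1")
alice.checkin("a2")
master_before_push = list(master.changesets)

# Only now do the changes travel: Alice pushes, Bob pulls.
pushed = alice.push(master)
pulled = bob.pull(master)
```

In this model the only operations that would need a network are `push` and `pull`, which is why frequent checkins stay cheap.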

The enterprise question: why should I care?

So far the main benefit of DVCS seems to be getting rid of the central server and hence the potential slow networks across the Internet. What’s the benefit here for the enterprise?

Most enterprise developers will think, "Well, I work on a 1Gbps LAN connected to the main version control server. The connection speed is so high that the network time is not even in the equation, and I benefit from the server’s huge processing power to make all my version control operations fast." They’re right, since that’s the main reason behind client-server architectures and even the cloud, ultimately.

But the situation described above doesn’t always match reality, at least not for many software development teams in enterprises. The reasons are:

  • Having teams in different locations collaborating with each other is more and more common. This doesn’t just happen in big companies: even a 30-developer team can be distributed across different cities, countries or even continents. It sounded close to science fiction decades ago, but it is today’s reality.
  • Companies struggle to find talent, and more often than not they hire the best developers wherever they live. Working from home is also a reality enterprises face.
  • Outsourced teams work from separate locations. They share some of the requirements of internal distributed teams, but normally call for strict access policies on the repositories to keep projects coordinated.

What can DVCS do here? Let’s consider the case of a team with members in different locations as Figure 6 shows. There is a main site, where the central version control server is located, probably the first existing location in the company. Developers have a fast network there, and they’re happy working centralized.

Then there’s a second site, with more developers, connected to the main site through a VPN across the Internet. While their local network can be even faster than the one at the central site, they’re frustrated by the time version control operations take since each roundtrip has to go to the distant central server.

Then there’s the developer working from home, with a network connection that can be at times slow and unreliable.

Two thirds of the sites are losing productivity and motivation due to the centralized setup.

Figure 6. A distributed team using a centralized solution. Only one site is happy

The distributed alternative is all smiles, as you can see in Figure 7. The figure describes a scenario best known as distributed multi-site, with servers in different locations, rather than the pure DVCS alternative with one repository on each developer’s machine. The initial problem can also be solved with a pure DVCS setup.

Figure 7. Distributed team using a multi-site solution. Everyone is happy

It is important to note that there are different ways to solve the described scenario:

Solution: Centralized
Description: A single server, with teams accessing the central site through the network.
Result: Only the site where the server is located will be happy. All the other sites will suffer from slow access.

Solution: Proxy based
Description: Optimize network traffic at the secondary sites by installing version-control-specific proxies that cache part of the network traffic.
Result: Connectivity is improved compared to the pure centralized scenario, but there are still issues for satellite sites:

  • Write operations (checkin) still go to central
  • Branching and merging operations still go to central
  • Can’t simply disconnect the wire and continue working
  • Developers working from home would need to set up a proxy server, which is far from optimal

Solution: Multi-site
Description: Different servers in different locations, but imposing some "restrictions" such as mastership. It was a well-known enterprise option in the 90s.
Result: Speed is even better than in "proxy based" but:

  • Mastership-based "replication" imposes artificial limitations such as: only one site can modify a branch at a time
  • Multi-site approaches can be very expensive in hardware requirements, administration overhead and price
  • Not feasible for developers working from home

Solution: Distributed
Description: Potentially one repository on each developer’s machine.
Result: All problems get solved at a lower cost.

Wasn’t multi-site already invented?

Absolutely. In the mid-90s there were already some expensive solutions that allowed big enterprises with vast budgets to deal with multi-site environments. They are still in use today, but they’re losing the game against DVCS for these reasons:

  • They are extremely expensive
  • They are hard and cumbersome to set up and maintain
  • They offer low performance, being based on aging technologies
  • They impose restrictions on the "distribution", like masterships: only one site can modify a branch at a time

DVCS suffers from none of these old multi-site issues, which is what makes it a game changer.

Proxy servers are not DVCS

It is important to keep in mind that "not everything that shines is distributed".

DVCS and distributed development are buzzwords nowadays. They’re the new fashion among developers and, as such, traditional centralized solution vendors are trying to jump on the bandwagon without doing their homework.

It is important to note that distributed means you can switch off the network and still continue working against your local repository (or your local site repository). Working means doing everything: checkin, checkout, create branches, merge, anything the version control can do.

The poor man’s approach, which is really "fake distributed" and not DVCS, is to implement a "proxy" able to cache some data and avoid calling the central site. It can be helpful in some circumstances, but creating branches, checking in and many other operations remain out of its reach.
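
The difference between a caching proxy and a truly local repository can be sketched in a few lines. This is a simplified illustration under stated assumptions: `CentralServer`, `CachingProxy` and `LocalRepo` are hypothetical classes invented for the example, not the interface of any real product. The proxy can answer a read it has already cached, but every write still depends on the central server being reachable.

```python
class CentralServer:
    """Hypothetical central server holding every branch."""

    def __init__(self):
        self.online = True
        self.branches = {"main": ["r1"]}

    def _require_network(self):
        if not self.online:
            raise ConnectionError("central server unreachable")

    def read(self, branch):
        self._require_network()
        return list(self.branches[branch])

    def checkin(self, branch, revision):
        self._require_network()
        self.branches.setdefault(branch, []).append(revision)


class CachingProxy:
    """Caches reads, but forwards every write to the central server."""

    def __init__(self, server):
        self.server = server
        self.cache = {}

    def read(self, branch):
        try:
            self.cache[branch] = self.server.read(branch)
        except ConnectionError:
            if branch not in self.cache:
                raise  # never cached this branch: nothing to serve
        return list(self.cache[branch])

    def checkin(self, branch, revision):
        self.server.checkin(branch, revision)


class LocalRepo:
    """A full local clone: reads and writes need no network."""

    def __init__(self, server):
        self.branches = {b: list(r) for b, r in server.branches.items()}

    def checkin(self, branch, revision):
        self.branches.setdefault(branch, []).append(revision)


server = CentralServer()
proxy = CachingProxy(server)
local = LocalRepo(server)       # clone while the network is up
proxy.read("main")              # warm the proxy cache

server.online = False           # "switch off the network"

cached = proxy.read("main")     # a cached read still works
try:
    proxy.checkin("main", "r2")
    proxy_write_ok = True
except ConnectionError:
    proxy_write_ok = False      # writes cannot bypass the central site

local.checkin("task-42", "r2")  # new branch plus checkin, fully offline
```

With the network off, the proxy read succeeds from its cache while the proxy checkin fails, whereas the local repository accepts both a new branch and a checkin.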

So, it is important to be careful with misleading marketing material: only a handful of open source and commercial solutions, all created after 2005, are capable of true DVCS.

The added values of DVCS

I just described some key advantages related to network topology and how they can benefit speed in daily version control operations.

These advantages are the ones putting the "D" in DVCS.

Being able to have a local repository alone isn’t what is turning distributed version control into the next software development wave. There’s much more to it:

  • DVCSs are fast by design. They’re the result of new engineering applying new tech and hard-won lessons. They’re orders of magnitude faster than previous generations of SCMs, not only because they work locally but because they’re faster by design even under the same conditions.
  • DVCSs are good at branching and excellent at merging. Branching and merging became "evil" in the late 90s and early 00s because most mainstream version control systems were awful at them. Branching and merging were nightmares: slow and error-prone. An entire generation of young developers grew up as true branch/merge haters. DVCS changed it all by implementing lightning-fast branches and true merge tracking. Branching a huge codebase went down from 30 minutes to 2 seconds, while merging became reliable.

The real benefit behind any distributed version control system is its capability to boost parallel development through branching and merging. Well-known best practices like feature branches, task branches, release branches and many more enable great ways of working that were simply forgotten due to the inability of previous SCM generations to deal with proper branching and merging.
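
One reason merging can be made reliable is that a tool that knows the common ancestor of two branches can tell which side actually changed each file. The sketch below illustrates that general three-way idea at whole-file granularity. It is an assumption-laden simplification, not the algorithm of any particular DVCS: real tools find the ancestor from the commit history and merge file contents line by line, and the function name and file names here are hypothetical.

```python
def three_way_merge(base, ours, theirs):
    """Merge two snapshots (dicts of path -> content) using their common ancestor.

    Returns (merged, conflicts). A missing path means the file does not exist.
    """
    merged, conflicts = {}, []
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:
            result = o            # both sides agree
        elif o == b:
            result = t            # only "theirs" changed the file
        elif t == b:
            result = o            # only "ours" changed the file
        else:
            conflicts.append(path)
            result = o            # keep "ours" until a human resolves it
        if result is not None:
            merged[path] = result
    return merged, conflicts


base = {"a.c": "v1", "b.c": "v1", "c.c": "v1"}
ours = {"a.c": "v2", "b.c": "v1", "c.c": "v2-ours"}
theirs = {"a.c": "v1", "b.c": "v2", "c.c": "v2-theirs", "d.c": "new"}

merged, conflicts = three_way_merge(base, ours, theirs)
# merged    -> {"a.c": "v2", "b.c": "v2", "c.c": "v2-ours", "d.c": "new"}
# conflicts -> ["c.c"]
```

Without the `base` snapshot, `a.c` and `b.c` would look like conflicts too, since the two sides differ. Knowing the ancestor leaves only `c.c`, which both sides really changed, for manual resolution.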

The challenges of DVCS in the enterprise

Having a repo on each developer machine raises some questions:

  • Security: is it feasible for each developer to have a full clone of the code repository?
  • Auditability: is it possible to track each access to the source code for compliance reasons?
  • Policy enforcement: are the DVCSs created for open source projects able to enforce the policies required in enterprise environments for security, methodology and compliance reasons?
  • Performance: the new DVCSs claim to be orders of magnitude faster than former systems, but are they able to cope effectively with 300k-1500k files in a single working copy?
  • Huge files: are they able to deal with the huge files (binaries, art) present in aeronautics, electronics, gaming and many other industries?

These questions are being addressed on a daily basis, both by new customizations on top of open source DVCS tools that make it possible to integrate with LDAP or Active Directory systems, and by brand new commercial DVCSs designed to cope with Access Control Lists, big binaries, partial replicas and more.

It is worth remembering the key values introduced by DVCS:

  • Distributed: the ability for team members to continue working without a direct connection to the central server, enabling many different scenarios.
  • Strong branching and merging: enables teams to adopt powerful branching patterns such as feature and task branches.
  • Speed: operations are orders of magnitude faster than in the centralized counterparts, by design.

And then, the points to consider when evaluating an enterprise-ready DVCS are:

  • Security: partial replica to restrict what is replicated and what is not, or from which point, plus ACL controls to enforce policies.
  • Auditability: systems able to track every access for compliance reasons such as SOX.
  • Policy enforcement: ACLs, triggers and strong access controls.
  • Performance: able to deal with huge working copies effectively by optimizing access to their data structures. Huge means more than 300,000 files in a working copy.
  • Huge files: systems able to deal effectively with files in the gigabyte range.
  • Able to work centralized as an option: many teams will need to work centralized (even while heavily relying on branching) at some sites, to optimize network and hardware resources.
  • Enterprise-ready user interfaces: graphical user interfaces help flatten the adoption curve. As such, this is one of the key features to count on when moving towards a DVCS.
  • Professional support: while open source development can rely on community-driven support, projects within enterprises are normally under tight deadlines and as such will need instant response from professional support teams.
  • Tools: the DVCS is just one part of the picture. Many surrounding tools will be needed in the enterprise, from diff and merge tools to interfaces for setting up security, replicas, backups and more.

Conclusion - the way ahead

DVCS adoption by the enterprise seems unstoppable. The early adopters will be able to outperform their competitors by optimizing their working methods before others do.

DVCS developers have to start focusing on enterprise needs, which are sometimes less flashy and social than their open source counterparts. By doing so, they’ll help enterprise teams adopt the new tech and benefit from it.

About the Author

Pablo Santos - President and co-founder of Codice Software. Prior to entering start-up mode to launch Codice Software and Plastic SCM back in 2005, Pablo worked as an R&D engineer in fleet control software development (GMV, Spain) and later on a digital television software stack (Sony, Belgium). Then he moved to a project management position (GCC, Spain), leading the evolution of an ERP software package for industrial companies. During these years he became an SCM expert, working as a consultant and participating in several events as a speaker. Pablo co-founded Codice Software in 2005 and has since played several roles ranging from core engineering to marketing, business development, advertising and sales operations. Together with David, he secured an initial VC round in 2009.

Pablo stepped down as CEO when Francisco Monteverde joined the company, to put more focus on his role as chief engineer, although he remains involved in marketing and key sales operations. Pablo acts as a technical evangelist: main contributor to the Codice blog, speaker at software events, and occasional writer for different technical magazines. Back in 2004 Pablo joined the regional professional association of computing engineers as vice-dean and moved to dean in 2008, a position he still holds. Pablo is an associate professor at the University of Burgos, where he is very involved in the training of young engineers, teaching project management including important areas such as agile methods and software configuration management.

Pablo got his Master’s degree in Computing Engineering back in 2000 (University of Valladolid, Spain). Pablo has a deep passion for software development, but when not coding or designing new features he loves motorbikes, especially track days but also long touring trips.

DVCS isn't for every project by peter lin

I contribute to open source and honestly GIT clients are still painful to use. In theory DVCS is good, but in practice it still has the same old problems as every other source control system. The worst part of a DVCS like GIT is fork/merge. DVCS works best when the delta between forks is small. Once the delta is huge, the fork/merge process often ends up becoming a gigantic overhead. The problem is inherent to the fork/merge model for software development. Even in a non-distributed tool like SVN, a project can use the fork/merge model. The biggest example of this is IBM Clearcase, which historically is a complete pain.

DVCS is a useful tool, but it's not a magic bullet. Source control systems will never be a magic bullet. The biggest factor is the users.

Re: DVCS isn't for every project by Evtim Papushev

No solution will ever be a magic bullet.

On the branch/merge functionality - I totally disagree. Having had the opportunity to use a couple of centralized VCSs, I can honestly say they are all crap - branching might take forever, even updating to the latest revision could be slow (even on a LAN). Then, you have the problem with false changes when merging back, i.e. the VCS decides you've changed some files just because. That usually leads to introducing old code back into the dev/main/trunk/whatever-it-is-called (i.e. some form of revert). Those are way too common issues for a couple of well-known VCSs to just ignore. DVCS systems introduce a different way of handling revisions, and thus rarely mishandle merges (I never had a single issue).

Another issue with centralized systems is linearity of the history. That forces features' commits/check-ins to interleave with one another. Branching solves the issue, but brings up the aforementioned problems. Deleting a branch deletes its history as well. None of those problems exist in GIT - once a branch is merged, its whole history becomes part of the master.

The speed and flexibility of the DVCS is unmatched by any of the centralized systems.

Now, GIT/Hg are not magic/silver bullets, but they are not plastic pellets either.

Re: DVCS isn't for every project by peter lin

I agree GIT is not a plastic pellet, but merge/fork approach has limitations. They're different than centralized VCS. The key point I was trying to make is the cost of merge/fork is directly related to the size of the delta. As long as delta is small, merge/fork is ok. Once delta is huge, the cost of merging multiple conflicting branches grows exponentially.

My preference is to not merge/fork at all and keep teams small with well defined responsibilities. Once dozens of people are touching the same file on a weekly basis, the cost of managing deltas grows exponentially regardless of the tool. DVCS tools "might" make it "easier", but it is far from easy.

Re: DVCS isn't for every project by Johnny FromCanada

DVCS tools/philosophies seem to follow a more "optimistic" approach to code/semantic collisions. It presumes/leverages an underlying assumption that the code is well architected/designed with the usual best practices (like high cohesion, loose coupling, etc.), thus allowing high levels of parallel dev. If that context is not true, no tool will save you anyway.

Rarely Noted Virtue of DVCS by Steve Eckhardt

In my work, I often work for several days on a given task before all of the pieces come together. On a centralized VCS, this means I can't check in code without breaking the build. But if at the end of day 3, I decide that I need to go back to where I was on day 2, I'm sunk because I haven't been able to check code in. With DVCS, I can check into my repo whenever I want but only push onto the build machine when everything is working. For me this is huge.

+1 on Git Clients are Painful by Steve Eckhardt

I look forward to the day when there is a git GUI client for Windows that works as well as WinCVS. The killer feature of WinCVS is its ability to make a filtered list of every file in a source tree. For example, I'm working on a project with several thousand files of many types in a large tree. If I want a complete list of every modified (or unknown, ignored...) file, it takes 2 mouse clicks. I can then sort this list by all of the usual criteria. Would someone please put this in a git client?

The other thing that's really hard with git is switching from CVS (and possibly other VCSs). I want to add everything to the git repo except the obj files, CVS-related files and some others. If there were a git GUI client that would allow me to sort the uncommitted files by name or extension, this job would be easy. Why can't I find one that does?

In summary, I guess what I want is the WinCVS file manager in a git client. Please?

InfoQ.com and all content copyright © 2006-2014 C4Media Inc. InfoQ.com hosted at Contegix, the best ISP we've ever worked with.