Distributed Version Control Systems in the Enterprise

May 17, 2012 14 min read

Every major open source project worldwide has already embraced Distributed Version Control Systems (DVCS). Will enterprises be next?

Intro

Circa 2004, the world of version control lived in a status quo. (We can also call it Software Configuration Management (SCM); the two are not exactly the same, but the pragmatists among us will say they’re quite close.) Version control had reached the point where it was just a commodity, and the goal was to have one, no matter which one.

ALM Packages (Application Lifecycle Management) that wrapped version control were the ones delivering the core value to the projects, especially in the enterprise. The trend was management oriented, process driven and control centric.

And then everything turned upside down.

The Linux Kernel hackers needed a new version control system and they begged their über-guru, Torvalds, for a solution. The result was Git, a hacker-oriented tool built to deal with the fast-changing, collaboratively developed Linux Kernel sources.

With this, version control switched from commodity to productivity booster. The revolution had begun.

What is distributed all about?

Simply put: DVCS moves the project versioned repositories from a central location into each developer’s machine.

Take a look at Figure 1 which describes what the conventional "centralized version control" is about: one central server handles the repository (or repositories) and the developers access it in order to send changes, retrieve changes and so on. Each developer has one or many "working copies": the directories containing a single working version of the source code.

Figure 1. Centralized version control

Figure 2 shows what distributed version control is about: each developer has a repository on their own machine and pushes changes to, and pulls changes from, the other developers.

Figure 2. Distributed version control

The "central repository" is gone, the central server is gone too and each developer can work with total independence, not restricted by potential network slowdowns.

Why is the central server gone?

The scenario for the Linux Kernel developers was not exactly as in Figure 1 but closer to Figure 3. Developers were widely distributed, working from different companies or even from home, and all contributing to the same project over the Internet.

Figure 3. Centralized version control through the Internet

Back in 2005, it was not uncommon for developers to be slowed down by Internet connection issues: slow connections, service down issues and so on. Surprisingly, the same holds true in 2012: developers hate waiting. Making them wait causes frustration and a great productivity loss. If an operation is slowed down because a large amount of data has to be sent over the wire synchronously before completing, developers’ productivity will be gone.

Hence, for open source projects, getting rid of the need to contact the central server was a major breakthrough:

Developers were able to check in their changes at light speed, into their own local repository. Slowness was gone.
Dependency on a central site was gone too: each contributor’s repository contained a "clone" of the project, so there was no single copy whose loss could destroy it.
Developers were free to experiment, making their own changes locally and privately. Changes are published only when a developer feels they’re ready to be shared, and in the meantime they stay under the control of the local repository. Previously, the only options for any change were to stay uncontrolled in the developer’s working copy or to be checked in on the central server (which was not always convenient).

One repo to rule them all

At the end of the day, even in the purest distributed open source projects (including the Linux Kernel) there must be some sort of centralization.

The technology of DVCS allows them to switch from a centralized working mode to a pure "peer to peer" one. But still, real life imposes the need to have a "master copy" to be used as the "good one".

Continuing with the Linux Kernel example, Linus himself is the one ruling the "master copy" which receives changes from all the contributors; he approves them and they become part of the main source for developers to clone. See Figure 4.

Figure 4. A master repository is always welcome

Daily workflow changes

In the move from centralized to distributed, there are some noticeable changes in the way a developer works. Check Figure 5.

Figure 5. Daily workflow in centralized and distributed

The basic operations in centralized mode are checking in changes to the central repository and retrieving new changes made by others from it.

In distributed mode, checkin and update are fast and frequent because they’re performed against the local repository, which doesn’t involve the network. To "deliver" the changes to the central master repo, a push operation is performed. Retrieving changes from the master is called a pull operation.

While checkin is always a "synchronous" operation that the developer waits for, pull and push are decoupled from it: they can be run whenever it is convenient, so a slow network no longer holds up every checkin.
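The daily workflow described above can be sketched with a toy model. The `Repo` class and its `checkin`, `push` and `pull` methods are hypothetical names invented for illustration, not the API of any real DVCS, and merging of concurrent changes is deliberately left out.

```python
class Repo:
    """Toy repository: an ordered list of changeset ids (illustrative only)."""

    def __init__(self, name):
        self.name = name
        self.changesets = []

    def checkin(self, changeset):
        # Purely local operation: no network is involved.
        self.changesets.append(changeset)

    def push(self, remote):
        # Deliver the changesets the remote repository does not have yet.
        missing = [c for c in self.changesets if c not in remote.changesets]
        remote.changesets.extend(missing)
        return missing

    def pull(self, remote):
        # Retrieve the changesets this repository does not have yet.
        missing = [c for c in remote.changesets if c not in self.changesets]
        self.changesets.extend(missing)
        return missing


master = Repo("master")
alice = Repo("alice")
bob = Repo("bob")

# Alice checks in twice against her local repository, then delivers once.
alice.checkin("a1")
alice.checkin("a2")
pushed = alice.push(master)

# Bob retrieves Alice's work from the master repository when convenient.
pulled = bob.pull(master)
```

In this sketch the two checkins never touch `master`; only the single `push` and `pull` would cross the network in a real setup.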

The enterprise question: why should I care?

So far the main benefit of DVCS seems to be getting rid of the central server and hence the potential slow networks across the Internet. What’s the benefit here for the enterprise?

Most enterprise developers will think, "Well, I work on a 1Gbps LAN connected to the main version control server. The connection speed is so high that the network time is not even in the equation, and I benefit from the server’s huge processing power to make all my version control operations fast." They’re right, since that’s the main reason behind client-server architectures and even the cloud, ultimately.

But, the situation described above doesn’t always match reality, at least not for many software development teams in enterprises. The reasons are:

Teams in different locations collaborating with each other are more and more common. It doesn’t just happen in big companies: even a 30-developer team can be distributed across different locations in different cities, countries or even continents. It sounded close to science fiction decades ago, but it is today’s reality.
Companies struggle to find talent, and more often than not they hire the best developers wherever they live. Working from home is also a reality enterprises face.
Outsourced teams work from separate locations. They share some of the requirements of internal distributed teams, but normally come with strict access policies on the repositories to keep projects coordinated.

What can DVCS do here? Let’s consider the case of a team with members in different locations as Figure 6 shows. There is a main site, where the central version control server is located, probably the first existing location in the company. Developers have a fast network there, and they’re happy working centralized.

Then there’s a second site, with more developers, connected to the main site through a VPN across the Internet. While their local network can be even faster than the one at the central site, they’re frustrated by the time version control operations take since each roundtrip has to go to the distant central server.

Then there’s the developer working from home, with a network connection that can be at times slow and unreliable.

Two thirds of the sites are losing productivity and motivation due to the centralized setup.

Figure 6. A distributed team using a centralized solution. Only one site is happy
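A rough back-of-the-envelope sketch shows why the remote sites in Figure 6 suffer. The round-trip count and the latencies below are assumed, illustrative figures rather than measurements, and `operation_time_ms` is a hypothetical helper.

```python
def operation_time_ms(round_trips, latency_ms):
    """Time spent waiting on the network for one version control operation."""
    return round_trips * latency_ms


# Assumption: one operation needs 20 round trips to the central server.
ROUND_TRIPS_PER_OPERATION = 20

# Assumption: illustrative round-trip latencies to the central server.
ASSUMED_LATENCY_MS = {
    "main site (LAN)": 1,
    "second site (VPN)": 80,
    "home": 150,
}

waiting_ms = {
    site: operation_time_ms(ROUND_TRIPS_PER_OPERATION, latency)
    for site, latency in ASSUMED_LATENCY_MS.items()
}

for site, ms in waiting_ms.items():
    print(f"{site}: {ms} ms waiting per operation")
```

Under these assumptions the same operation costs far more outside the main site, even if the remote site’s own LAN is faster.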

The distributed alternative is all smiles, as you can see in Figure 7. The figure describes a scenario best known as distributed multi-site, with servers in different locations, instead of the pure DVCS alternative with one repository on each developer’s machine. The initial problem can also be solved with a pure DVCS setup.

Figure 7. Distributed team using a multi-site solution. Everyone is happy

It is important to note that there are different ways to solve the described scenario:

Solution	Description	Result
Centralized	A single server, with teams accessing the central site through the network	Only the site where the server is located will be happy. All the other sites will suffer from slow access.
Proxy based	Optimize network traffic in the secondary sites by installing version-control-specific proxies to cache part of the network traffic	Connectivity is improved compared to the pure centralized scenario. There are still issues for satellite sites: write operations (checkin) still go to central; branching and merging operations still go to central; you can’t simply disconnect the wire and continue working; developers working from home would need to set up a proxy server, which is far from optimal.
Multi-site	Different servers in different locations, but imposing some "restrictions" such as mastership. It was a well-known enterprise option in the 90s.	Speed is even better than in "proxy based", but: mastership-based "replication" imposes artificial limitations, such as only one site being able to modify a branch at a time; multi-site approaches can be very expensive in hardware requirements, administration overhead and price; it is not feasible for developers working from home.
Distributed	Potentially one repository on each developer’s machine	All problems get solved at a lower cost.

Wasn’t multi-site already invented?

Absolutely. In the mid-90s there were already some expensive solutions that allowed big enterprises with vast budgets to deal with multi-site environments. They are still in use today, but they’re losing the game against DVCS for these reasons:

They are extremely expensive
They are hard and cumbersome to set up and maintain
They offer low performance based on aging technologies
They impose restrictions on the "distribution" like masterships: only one site can modify a branch at a time.

DVCS is not restricted by any of these old multi-site issues, which is what makes it a game changer.
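The mastership restriction mentioned above (only one site can modify a branch at a time) can be sketched as follows. This is a minimal illustration under stated assumptions: `MasteredBranch`, `MastershipError` and the site names are hypothetical and do not reproduce any specific multi-site product.

```python
class MastershipError(Exception):
    """Raised when a site without mastership tries to modify a branch."""


class MasteredBranch:
    """Hypothetical multi-site branch: only the mastering site may check in."""

    def __init__(self, name, master_site):
        self.name = name
        self.master_site = master_site
        self.changes = []

    def checkin(self, site, change):
        if site != self.master_site:
            raise MastershipError(
                f"{site} cannot modify {self.name}; mastered by {self.master_site}"
            )
        self.changes.append((site, change))

    def transfer_mastership(self, new_site):
        # Administrative step required before another site can write.
        self.master_site = new_site


branch = MasteredBranch("main", master_site="site-a")
branch.checkin("site-a", "fix-1")

# A second site is rejected until mastership is handed over.
try:
    branch.checkin("site-b", "fix-2")
    rejected = False
except MastershipError:
    rejected = True

branch.transfer_mastership("site-b")
branch.checkin("site-b", "fix-2")
```

In a DVCS there is no equivalent of `transfer_mastership`: every site checks in to its own repository and the changes are reconciled later through merging.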

Proxy servers are not DVCS

It is important to keep in mind "not everything that shines is distributed".

DVCS and distributed development are buzzwords nowadays. They’re the new fashion among developers and, as such, traditional centralized solution vendors are trying to jump on the bandwagon without doing their homework.

It is important to note that distributed means you can switch off the network and still continue working against your local repository (or your local site repository). Working means doing everything: checkin, checkout, create branches, merge, anything the version control can do.

The poor man’s approach, which is really "fake distributed" and not DVCS, is to implement a "proxy" able to cache some data and avoid calling the central site. It can be helpful in some circumstances, but creating branches, checking in and many other operations are out of its scope.
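The "switch off the network" test can be sketched with two toy clients. Both classes are hypothetical models written for this illustration, assuming a proxy that serves cached reads but forwards every write to the central site; they are not the behavior of any specific product.

```python
class Offline(Exception):
    """The operation needs the central server, but the network is down."""


class ProxyClient:
    """Hypothetical caching proxy in front of a central server."""

    def __init__(self, central_files):
        self.central_files = central_files  # stands in for the remote server
        self.cache = {}
        self.online = True

    def get(self, path):
        if path in self.cache:  # cache hit: works even offline
            return self.cache[path]
        if not self.online:
            raise Offline(f"cannot fetch {path}")
        self.cache[path] = self.central_files[path]
        return self.cache[path]

    def checkin(self, path, content):
        if not self.online:  # writes always go to the central site
            raise Offline(f"cannot check in {path}")
        self.central_files[path] = content
        self.cache[path] = content


class LocalRepoClient:
    """Hypothetical DVCS client: every operation hits the local repository."""

    def __init__(self):
        self.files = {}
        self.branches = {"main"}

    def checkin(self, path, content):
        self.files[path] = content

    def create_branch(self, name):
        self.branches.add(name)


proxy = ProxyClient({"readme.txt": "v1"})
proxy.get("readme.txt")  # warm the cache while online
proxy.online = False     # switch off the network

cached_read = proxy.get("readme.txt")
try:
    proxy.checkin("readme.txt", "v2")
    proxy_checkin_ok = True
except Offline:
    proxy_checkin_ok = False

# The local repository keeps working with no network at all.
local = LocalRepoClient()
local.checkin("readme.txt", "v2")
local.create_branch("task-001")
```

Only the cached read survives on the proxy side, while the local repository still accepts checkins and new branches.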

So, it is important to be careful with misleading marketing material: only a handful of open source and commercial solutions, all created after 2005, are capable of true DVCS.

The added values of DVCS

I just described some key advantages related to network topology and how they can benefit speed in daily version control operations.

These advantages are the ones putting the "D" in DVCS.

Being able to have a local repository alone isn’t what is turning distributed version control into the next software development wave. There’s much more to it:

DVCSs are fast by design. They’re the result of new engineering applying new tech and hard-won lessons. They’re orders of magnitude faster than previous generations of SCMs, not only because they work locally but because they’re faster by design even under the same conditions.
DVCSs are good at branching and excellent at merging. Branching and merging became "evil" in the late 90s and early 00s because most of the mainstream version control systems were awful at them. Branching and merging were nightmares: slow and error prone. An entire generation of young developers grew up as true branch/merge haters. DVCS changed it all by implementing lightning-fast branches and true merge tracking. Branching a huge codebase went down from 30 minutes to 2 seconds, while merging became reliable.

The real benefit behind any distributed version control is its capability to boost parallel development through branching and merging. Well-known best practices like feature branches, task branches, release branches and many more enable great ways of working that were simply forgotten due to the inability of previous SCM generations to deal with proper branching and merging.

The challenges of DVCS in the enterprise

Having a repo on each developer machine opens up some questions:

Security: is it feasible that each developer has a full clone of the code repository?
Auditability: is it possible to track each access to the source code for compliance reasons?
Policy Enforcement: are the DVCS systems created for open source projects able to enforce the policies required on enterprise environments for security, methodology and compliance reasons?
Performance: the new DVCSs claim to be orders of magnitude faster than former systems, but are they able to cope effectively with 300k-1500k files in a single working copy?
Huge files: are they able to deal with the huge files (binaries, art) present in aeronautics, electronics, gaming and many other industries?

These questions are being solved on a daily basis: new customizations on top of open source DVCS tools make it possible to integrate with LDAP or Active Directory systems, and brand new commercial DVCSs are designed to cope with Access Control Lists, big binaries, partial replicas and much more.

It is worth remembering the key values introduced by DVCS:

Value	Description
Distributed	Ability to enable team members to continue working without direct connection to the central server, enabling many different scenarios.
Strong branching and merging	Enabling teams to adopt powerful branching patterns such as feature and task branches.
Speed	Operations are orders of magnitude faster than in the centralized counterparts, by design.

And then, the points to consider when evaluating an enterprise-ready DVCS are:

Question	Feature to check
Security	Partial replica to restrict what is replicated and what is not or from which point, plus ACL controls to enforce policies.
Auditability	Systems able to track every access for compliance reasons such as SOX.
Policy Enforcement	ACLs, triggers and strong access controls.
Performance	Able to deal with huge working copies effectively by optimizing the access to their data structures. Huge is bigger than 300,000 files on a working copy.
Huge files	Systems able to deal effectively with files in the GB range.
Able to work centralized as an option	Many teams will need to work centralized (even heavily relying on branching) on some sites, to optimize network and hardware resources.
Enterprise ready user interfaces	Graphical user interfaces help shorten the adoption curve. As such, this is one of the key features to count on when moving towards a DVCS.
Professional support	While open source development can rely on community driven support, projects within enterprises are normally under tight deadlines and as such will need instant response from professional support teams.
Tools	The DVCS is just a part of the picture, but many surrounding tools will be needed on the enterprise: from diff and merge tools to interfaces to set up the security, replicas, backups and many more.

Conclusion - the way ahead

DVCS adoption by the enterprise seems unstoppable. The early adopters will be able to outperform their competitors by optimizing their working methods before others do.

DVCS developers have to start focusing on the enterprise needs, sometimes less flashy and social than the open source counterparts, but by doing that they’ll be able to help enterprise teams to adopt the new tech and benefit from it.

About the Author

Pablo Santos - President and co-founder of Codice Software. Prior to entering start-up mode to launch Codice Software and Plastic SCM back in 2005, Pablo worked as an R&D engineer in fleet control software development (GMV, Spain) and later on a digital television software stack (Sony, Belgium). He then moved to a project management position (GCC, Spain), leading the evolution of an ERP software package for industrial companies. During these years he became an SCM expert, working as a consultant and participating in several events as a speaker. Pablo co-founded Codice Software in 2005 and has since played several roles ranging from core engineering to marketing, business development, advertising and sales operations. Together with David, he secured an initial VC round in 2009.

Pablo stepped down as CEO when Francisco Monteverde joined the company, to put more focus on his role as chief engineer, although he remains involved in marketing and key sales operations. Pablo contributes as a technical evangelist: he is the main contributor to Codice’s blog, a speaker at software events, and an occasional writer for different technical magazines. Back in 2004 Pablo joined the regional professional association of computing engineers as vice-dean and moved to dean in 2008, a position he still holds. Pablo is an associate professor at the University of Burgos, where he is very involved in the training of young engineers, teaching project management including important areas such as agile methods and software configuration management.

Pablo got his Master’s degree in Computing Engineering back in 2000 (University of Valladolid, Spain). Pablo has a deep passion for software development, but when not coding or designing new features he loves motorbikes, especially track days but also long touring trips.
