The Wide Range of DevOps
This article is based on a talk I gave at DevOpsDays in Sweden titled “DevOps is not an absolute. It’s a range.” A video of the talk is available online, but you don’t need to watch it before reading this article.
For the past few years, DevOps has been a term we’ve seen or heard practically non-stop in articles, presentations, keynotes, and general conversation. DevOps claims to create a faster feedback loop and lower the cost of product iteration, all while improving the overall stability of your systems. As with any movement making impressive claims, it was easy to ignore or dismiss DevOps for its immaturity or lack of evidence. But time has passed, companies have continued to show real-world gains, and various processes for adopting DevOps in organizations have emerged. There has never been a better time to investigate this movement and bring it into your own work environment.
For the uninitiated, it’s easy to view DevOps as a single change, much like a single switch controls the power to a light. Looking at it this way, adopting such a change can seem like a daunting -- perhaps impossible -- task. And just like general engineering, trying to build something complex as a single unit of change typically results in failure. Luckily, DevOps isn’t a single switch, and it can be broken down into a series of changes. The deployment and timing of these changes can be tightly controlled and fine-tuned based on what is right for your organization.
Conveniently, the changes necessary for DevOps can be plotted on a timeline-style graph, where the extreme left represents traditional ops culture and practices, and the extreme right represents a newer DevOps style. In this view of the world, the question is not “Is your company practicing DevOps?” but the more accurate “How strong a DevOps culture has your company adopted?”
As a quick disclaimer, the ideas and examples put forth in this article are shaped to a certain organizational structure. These assumptions are based on my own personal experience of working in companies with in-house dev and ops teams, having ops in charge of development environments, working with a limited number of projects, and so on. The downside of this is that if your organization doesn’t fit these assumptions, the ideas presented may not be right for you. However, the upside is that these ideas are based on real world experience in multiple work environments that matched similar situations.
The Range of DevOps
Looking at this range, it’s important to firmly establish what exactly the far left and far right represent, so that we can better understand what it means as this range is traversed.
The far left side represents traditional ops culture and practices.
A generalized description of this extreme can be “black-box ops.” In this culture, the ops team is siloed away from the dev team, and interaction is either avoided or reluctantly forced. The defining trait of this side of the range is that dev and ops inherently have opposing goals. The dev team is tasked with and praised for shipping new features and moving the product forward. The goal of the ops team is to maintain stability above all else. Without proper communication, these exist in conflict with each other, since it is in the best interest of ops to not ship new features, and it is in the best interest of dev to ship new features as quickly as possible. Because introducing any kind of change into a stable system can potentially introduce unexpected instabilities, ops avoids this if at all possible.
A concrete example: an application developer introduces a bug in the code which causes an infinite loop in a certain edge case not caught by QA or tests. If such a change were deployed by ops, suddenly certain servers would spin to 100% CPU, causing instability. If ops simply avoided deploying this change, there would've been no issue, or at least no new issues. This is the point of view of this side of the range.
The far right side represents a fully embraced DevOps culture where dev and ops are one and the same. Here, devs do ops, ops do dev, and both teams have a mutual goal of shipping features together while maintaining a certain level of reliability.
Even knowing these two extremes -- and stressing that both sides are indeed extreme -- getting from one side to the other may seem intimidating. And it is intimidating, as long as you view it as a single step. Once the timeline is broken down into a series of manageable chunks, the task becomes approachable, the benefits become clear, and results suddenly seem within reach.
Cultural vs. Technical Changes in DevOps
DevOps requires both cultural and technical change in an organization. Culturally, barriers need to be broken down so that ops teams and development teams communicate more openly and share common goals. Technically, developers need to better understand how ops teams work and have a good knowledge of system architectures. Ops engineers need to know how the development process works and have a better understanding of the code itself.
When DevOps is broken down into chunks, I’ve found it easier to introduce by alternating between cultural and technical change. You’ll notice that the coming sections follow this pattern. This is done for good reason: change is hard, and radical change is nearly impossible. By alternating what is changing, each change is introduced more gradually. Instead of one big change, there is one small cultural change followed by one small technical change followed by another small cultural change and so on. With this approach, teams never wake up feeling as though everything has changed beneath them. Instead, it feels like change occurred organically and at a more natural pace, increasing the likelihood that such change sticks within an organization.
Metrics, Metrics Everywhere
The first chunk in moving from the left to the right is to enable metric aggregation across your organization at both an infrastructure and application level. Or, as I prefer to call it: metrics, metrics everywhere. There are many great talks on this subject, but it ultimately comes down to answering a single crucial question: What does my code do?
Development will happily answer this question by showing you the code. Unfortunately, code only describes what it should do, not what it actually does. Code is like a cooking recipe: it describes the steps to reach a certain tasty outcome, but doesn’t guarantee the actual real-world result. We’ve all at some point in our lives attempted a recipe with less-than-edible results. Likewise, code may describe a process to achieve some desirable effect, but the actual consequence of the code on a real-world system is unpredictable from the code itself. For example, a developer may change a cache timeout from 3600 to 1800 seconds. Looking at the code, you can see this change, but it is hard to predict its overall effect on the system.
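To make the cache timeout example concrete, here is a minimal Python sketch. The constant name, the helper function, and both traffic patterns are made up for illustration; they are assumptions, not code from any real application. The sketch counts how often a single cached entry has to be recomputed under each timeout.

```python
# Hypothetical setting: the value was 3600 before the developer's change.
CACHE_TIMEOUT_SECONDS = 1800


def count_cache_misses(request_times, timeout):
    """Count cache misses for one key, given request timestamps in seconds."""
    misses = 0
    expires_at = None  # nothing is cached yet
    for now in request_times:
        if expires_at is None or now >= expires_at:
            misses += 1  # entry absent or expired: recompute and re-cache
            expires_at = now + timeout
    return misses


# Made-up traffic over one day: a busy key and a rarely requested key.
busy = range(0, 24 * 3600, 600)     # one request every 10 minutes
sparse = range(0, 24 * 3600, 7200)  # one request every 2 hours

print(count_cache_misses(busy, 3600), count_cache_misses(busy, CACHE_TIMEOUT_SECONDS))
print(count_cache_misses(sparse, 3600), count_cache_misses(sparse, CACHE_TIMEOUT_SECONDS))
```

With the busy pattern, halving the timeout doubles the number of recomputations; with the sparse pattern it changes nothing. The same one-line diff has a different effect depending on real traffic, which the code alone cannot tell you.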
Ops will answer this question by logging into a machine and getting some data out of the running system such as memory or CPU utilization. This is the right answer! This shows the effect of code in the real world. Ops has access to a lot more data, too. This data provides answers to important questions such as “What is the system-wide effect of this change?” or “Why did service Y slow down after service X was deployed?” Historically, developers could only answer these questions by speculating about how the code will run. While this sometimes works, having access to actual data is undoubtedly more powerful. The image below shows an example of what ops can see: data for a running production system.
It is important to remember where we are on our range of DevOps at this point. We’re just to the right of the far left, so we’re still very much in a traditional ops environment. Because of this, giving developers access to production systems is not going to work. Most developers are not comfortable in this environment, and when people get uncomfortable, it is natural to retreat to a comfortable environment. Attempting any sort of change in an organization without maintaining a certain level of comfort along the way will leave the change poorly supported, and the organization will ultimately return to its old ways.
Surfacing this data in a developer-friendly way is quite simple: graphs. Graphing technology has been around for years, but has become particularly popular recently with the emergence of tools such as Graphite and Statsd. By hooking system metrics into Graphite and exposing the API to developers, the best of both worlds is achieved: ops can expose system metrics and developers can expose application metrics. Suddenly, developers have access to memory, CPU usage, etc. in addition to statistics about application events such as logins, logouts, and so on.
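On the ops side, hooking a system metric into Graphite can be as small as formatting lines in Graphite's plaintext protocol, where each sample is a metric path, a value, and a Unix timestamp. The sketch below assumes a Unix-like host; the metric naming scheme and the hostname `web01` are hypothetical.

```python
import os
import time


def graphite_line(path, value, timestamp):
    """Format one sample as a newline-terminated 'path value timestamp' line."""
    return f"{path} {value} {int(timestamp)}\n"


def load_average_lines(hostname, loadavg, timestamp):
    """Turn a (1min, 5min, 15min) load average tuple into Graphite lines."""
    one, five, fifteen = loadavg
    prefix = f"servers.{hostname}.load"  # hypothetical naming scheme
    return [
        graphite_line(f"{prefix}.1min", one, timestamp),
        graphite_line(f"{prefix}.5min", five, timestamp),
        graphite_line(f"{prefix}.15min", fifteen, timestamp),
    ]


if __name__ == "__main__":
    # os.getloadavg() is only available on Unix-like systems.
    lines = load_average_lines("web01", os.getloadavg(), time.time())
    print("".join(lines), end="")
```

In a real setup these lines would be written to the Graphite plaintext listener (commonly TCP port 2003) on a schedule; printing them here keeps the sketch free of network dependencies.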
For a developer, implementing a metric is a single line of code:
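A minimal Python sketch of what such a line could look like, assuming a Statsd daemon listening for UDP on localhost port 8125 (the usual default). The helper names, the `handle_login` function, and the metric name `app.logins` are hypothetical; real projects would more likely use an existing Statsd client library.

```python
import socket


def counter_payload(name, count=1):
    """Build a Statsd counter packet: '<name>:<count>|c'."""
    return f"{name}:{count}|c".encode("ascii")


def increment(name, host="127.0.0.1", port=8125):
    """Fire-and-forget a counter increment to Statsd over UDP."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(counter_payload(name), (host, port))


def handle_login(username):
    """Hypothetical application code path for a successful login."""
    increment("app.logins")  # the single line the developer adds
    return f"welcome, {username}"
```

Because the metric travels over UDP, the call does not wait for a reply, so instrumenting a code path this way adds very little overhead to the application.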
Which results in a graph that looks like the following in Graphite:
Setting up these new graphing systems is normal work for a traditional ops team, and the interface to Statsd and Graphite is so simple that developers can start graphing with only a few lines of code. With these low-friction technical changes, developers now have insight into performance, system-wide effects of code, and ops in general. And now, even at this point, you can say that your organization is doing some amount of DevOps, because dev and ops are now interacting in at least a small way. Moving forward, the interactions will become greater, but this is a comfortable starting position.
With a general insight into the performance and health of production systems, it is natural for developers to become curious about what comprises the underlying system. To many developers, a large-scale production system is a black box: a request goes in and a response comes out, but the various systems it touches in between are unknown.
To address this, infrastructure should be documented. This can begin with very basic high-level diagrams of a request flow and what software is hit at what point. As this matures, documentation should address what certain pieces of the architecture do and why they were chosen over potentially competing solutions. In addition to specific software packages, the documentation can cover points such as how new servers come online, potential failure cases and resolutions, intros to Unix tools, and so on. The point of this documentation is to give developers a resource for becoming more comfortable with the architecture of a production system at a high level.
Once these resources are made available, developers can freely learn more about the system architecture if they are interested. And interest in the infrastructure comes from the graphing system we implemented earlier, since that gives developers a simple way to look at a running system. With metrics and documentation, the black box behind ops is beginning to disappear. There is still not a lot of collaboration between the two teams, but the barriers to this becoming a reality are quickly disappearing.
Production-Mirror Development Environments
Up to this point, developers have interacted with ops mainly through systems instrumentation and written documentation. Equipped with this basic knowledge, it would be great if developers could actually experiment with and interact with the internals of the underlying ops. Doing this to a production environment at this point is not only unrealistic, but poses a threat to the stability of your systems. Instead, it is preferable to give developers a sandbox to play in.
Made specifically for this purpose, Vagrant is a tool for packaging and distributing development environments in the form of VirtualBox virtual machines. These virtual machines are built up using standard configuration management such as Chef, Puppet, or even just basic shell scripts. Because of this, ops can use the same production setup scripts to set up portable development environments. Developers are expected to do all work in these environments, because they match production as closely as possible. Additionally, developers no longer need to worry about manually setting up their machines, because ops handles this via properly configured Vagrant machines.
These development environments, built on top of production ops scripts, give developers a sandbox to play with real systems. If any damage is done, the virtual machine can always be destroyed and re-created. On top of simply being a sandbox, the actual setup of the virtual machines gives developers an insight into how servers are provisioned, how ops changes are rolled out to machines, and a real world look into the architecture of their systems.
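The destroy-and-re-create cycle can be scripted. The following Python sketch wraps the `vagrant destroy -f` and `vagrant up` commands; the function names are hypothetical, and it assumes Vagrant is installed and that the project directory contains a Vagrantfile maintained by ops.

```python
import subprocess


def rebuild_commands():
    """Commands that throw away the sandbox VM and build a fresh one."""
    return [
        ["vagrant", "destroy", "-f"],  # -f skips the confirmation prompt
        ["vagrant", "up"],             # boots and provisions the machine again
    ]


def rebuild_sandbox(project_dir):
    """Run the rebuild commands in the directory holding the Vagrantfile."""
    for command in rebuild_commands():
        # check=True stops at the first command that fails.
        subprocess.run(command, cwd=project_dir, check=True)
```

Calling `rebuild_sandbox(".")` from a project checkout would give a developer a clean machine provisioned by the same scripts ops uses elsewhere.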
DevOps Office Hours
Developers now have a sandbox to tinker with a real system, documentation to learn more about the system, and metrics as a way to gather data from production. Despite all of this, ops is still new and intimidating. Luckily, we are friendly people, and it is time to begin having true interaction between the two teams. This interaction can come from forums, a help desk, or even walking over to the other person and having a real conversation.
I’ve found the best solution for introducing this new practice to be office hours. Office hours are a fixed amount of regularly scheduled time that an ops or dev engineer dedicates to answering any sort of questions. These questions can range from extremely basic such as “how do I search for files on the machine?” to relatively advanced: “can you explain the reasoning behind this HAProxy configuration parameter?” The most important quality of these office hours is that no judgment is passed, no matter what question is asked. These office hours are a time when engineers can feel safe asking anything relevant.
With this in place, an important milestone is reached: communication! Both dev and ops have an understanding of what each other does, they’re both able to see and interact with each other’s work, and they’re both talking.
Mitigating the Risk of Devs Doing Ops
Before continuing, I’d like to point out that at this point your organization has a DevOps culture healthier than most organizations. And it has been introduced in a slow, methodical, low-risk manner. Continuing forward, we begin entering the extreme right of our aforementioned timeline of DevOps. This area is still radical compared to the previous steps and is not as well defined. However, organizations have successfully integrated this and are seeing benefits from such changes.
Developers now have all the tools to begin making real ops change and taking responsibility for it. Just as everything prior, this can be introduced in smaller chunks in order to mitigate risk and make everyone more comfortable.
The first is to use a standard open source model for ops change: pull requests and code review. When a developer wants to introduce something new, he or she can make the changes and issue a pull request. They can test this change in the Vagrant-managed machines that were set up earlier. The pull request gives an actual ops team member the chance to review and sanity check the change. If anything is amiss, the reviewer can leave comments, and the developer learns to avoid those mistakes in the future. In the end, the pull request is merged, with developers feeling confident and proud that they made a change, while ops feels safe knowing that the change was vetted by them.
Second, and more experimental even at the time of this article, is to use some level of continuous integration with ops. At a very basic level, this would be a CI server such as Jenkins verifying ops scripts run without error on every commit within a sandboxed environment, possibly managed by Vagrant. Basic smoke tests such as verifying that an HTTP request can be made to the resulting infrastructure can be done as well.
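A basic smoke test of this kind could look like the following Python sketch, which a CI job might run after the ops scripts have provisioned a sandbox. The function names and the example URL are hypothetical, and treating any 2xx response as healthy is an assumption.

```python
import urllib.error
import urllib.request


def is_healthy_status(status):
    """Treat any 2xx HTTP status as a passing smoke test."""
    return 200 <= status < 300


def smoke_test(url, timeout=5.0):
    """Return True if an HTTP GET to the provisioned machine succeeds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return is_healthy_status(response.status)
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False
```

A CI step could then fail the build when, for example, `smoke_test("http://192.168.33.10/")` returns False, where the address is a placeholder for whatever the sandbox machine exposes.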
With one or both of these in place, developers are now safe to make ops changes. Ops feels safe because they are still vetting the changes. This is the first time that dev and ops are truly working together and share some level of responsibility with each other. There are still distinct dev and ops teams but the distinction is quickly dissipating.
Devs: Go Crazy!
Now, on to the true extreme right on the DevOps timeline: developers do all ops. With all the previously mentioned pieces implemented, enough technical and cultural change is in place that this becomes a real possibility. In practice, this generally works by continuing to maintain two separate teams that work together much more closely. The ops team can be smaller and more developers can be brought on. Developers actually do ops, with some supervision from the few ops people. Developers can and should be on call, with ops being the second line of defense in the case of an outage.
To reiterate, this only works because of the foundation which we’ve built piece by piece. Metrics give insight into the system-wide effect of a developer’s code. Documentation allows developers to learn more about the production architecture so that they can better understand the effect of different changes. Virtual machines and a workflow built on top of automated configuration scripts save time for ops by letting them use production tools to create development machines, while giving developers a sandbox to actually tinker with the system architecture. Office hours or forums are an outlet for any sort of questions that developers or ops may have about each other, and provide a safe learning environment. Automated infrastructure tests and code review give both ops and dev a security blanket so that the risk of ops change is mitigated. The end result of all this is that the teams communicate much more freely, they trust each other more, and the distinction between them is much more blurred.
The benefits of DevOps are numerous. First and foremost, there is more collaboration and trust within an organization. The rate at which features are delivered improves because there are more people to do ops, and ops doesn’t need to just say “no” since developers are also held responsible for any changes. Believe it or not, DevOps also improves the overall stability of your system, because there are more capable eyes on the effect of various changes. Because features can be delivered more quickly, there are fewer large upgrades that require downtime. Instead, changes are delivered in smaller, more manageable pieces that may not require downtime at all.
Where are you on the timeline? Where do you want to be? As long as you’re not on the extreme left, your organization is already practicing DevOps. This breakdown gives you the steps necessary to confidently move forward without risking too much, and if you feel that the change isn’t working properly, it is small enough that it can be reverted and tried again at a later date.
About the Author
Mitchell Hashimoto is the creator of Vagrant and is an operations engineer for Kiip. He is passionate about all things ops and open source, and enjoys spending hours of his free time each day contributing to the community. In addition to simply contributing to open source, Mitchell enjoys speaking at conferences and user groups about Vagrant. Mitchell can be found on GitHub and Twitter as @mitchellh.
InfoQ Sep 01, 2015