DevOps @ Rafter


May 31, 2013 16 min read

The Bootstrap Phase

Over the last 6 years, I’ve had the unique opportunity to watch our company grow from just a couple of fresh college grads with an idea of renting textbooks to a large, mature company. When I look back, I tend to see the growth we went through in two distinct phases – pre and post Series A funding. Unlike most startups you hear or read about, we had a rather long pre-Series A period (almost 3 years).

In this phase of our growth, we didn’t spend a lot of resources or brainpower thinking about DevOps and instead focused on building our product. For most of those first few years, I considered myself a software engineer who sometimes had to spend a few days a month performing systems administration.

We always joked that I was the one who drew the short straw and got “stuck” managing the servers, but, in all honesty, I enjoyed it. It never occurred to me at the time, however, that my love of software development could be applied to managing our servers.

Keep it Simple (while you can)

There are a couple of reasons why in the early days we were able to get away without focusing much on DevOps. A lot of it has to do with being a small company and keeping the amount of change low. We chose a simple architecture and had a small number of applications, so while our products were evolving, there was little change exerted on the underlying infrastructure.

Overinvest

Secondly, we overinvested early on in server hardware. We probably could have run our entire site on 2 physical servers, but we had 10. This allowed us to spend very little time worrying about performance or growing our infrastructure, since it took several years for us to hit the limit on the initial hardware investment. There was, of course, an up-front cost in configuring these servers, but once they were set up, we made very few changes to them.

Pick Good Tools

Lastly, because we adopted Ruby on Rails as the framework for our applications, we were exposed to tools that allowed us to adopt good practices early on for release and deployment, such as Git, Capistrano, and TeamCity. We also built on top of well-tested, stable open source solutions such as nginx, MySQL, and memcached.

The exposure to all of these tools and frameworks kept us from having to roll out complex, proprietary solutions, which I think all too often end up slowing down development as the company grows.

Growing Up

As our company entered the second growth phase, we grew our engineering and product teams substantially. The days of only two engineers were gone, and I simply had to blink and we were soon at 10 and then 20 people working on current and new products. As one can imagine, the amount of change that started to be introduced also grew substantially. The sheer number of applications we needed to host on our hardware doubled, then tripled, and developers began wanting to use new types of application frameworks, programming languages, databases, queuing systems, cache servers, etc.

Along with this added complexity, the cost of mistakes and outages also grew. It quickly became apparent that configuring our infrastructure manually was not going to scale, and that our lack of maturity and flexibility at the infrastructure layer would start to cause problems, both by slowing the release of new products and features and by hurting our stability. Because of this realization, I stopped working on our products and switched to developing automation, monitoring, and release processes full time.

Cooking with Chef

Luckily, as we were recognizing the need to develop DevOps practices at our company, we were introduced to Opscode Chef and with it, the whole philosophy of “infrastructure as code”. We initially spent several months writing all of the recipes for automating each piece of our existing infrastructure. Once we were finished and could rebuild all of our servers from these automated recipes, it was an incredible weight lifted off of our shoulders. All of our servers were now set up consistently and we finally had a place where anyone on the team could look and see exactly how a piece of infrastructure was set up and configured. Equally important, it also gave us the ability to quickly spin up additional resources with relative ease.

A DevOps Team is Born

DevOps began to have unique responsibilities in the organization, providing critical application support for our product lines and ensuring our infrastructure could continue to scale. Along with these priorities, we were building sets of tools and products for managing the infrastructure that needed to be continually supported and improved. Especially early on, every new application in the infrastructure often generated changes that needed to be made in our underlying automation. The number of DevOps related requests by internal customers (Engineers and Operations folks) also increased substantially as more people began to rely on our work.

Because our engineering organization was already split into separate product teams each focusing on distinct pieces of the business, we decided to make DevOps another team in our engineering organization. This allowed us to dedicate engineers that could solve infrastructure problems full time. In addition, availability and reliability of our application platform is extremely important to the business, as outages and issues can have a big impact on the bottom line. We needed to ensure there were always dedicated and well-trained engineers available to assist and investigate issues, especially when ownership of the issue might be shared across several teams or not immediately obvious.

The Fruits of Automation

On-Demand Provisioning

One of the first things we did after adopting Chef and automating our production infrastructure was to bring our test and staging systems up to the same level of automation. A common complaint we were hearing was that engineers could not easily demo what they were currently working on to the business people. In response, we built a 100% self-service portal that allows anyone in the company to spin up a preconfigured server running our full stack on EC2.

A user can choose which of our applications they want installed on this server at a specific revision, and they can choose a test database or use a scrubbed snapshot of the production database from a time of their choosing. One of the big wins from this system is that we simply reuse the exact same Chef recipes that we already use to build our production servers. This allows us to flush out many potential issues before they ever land in production. Our Engineering and QA teams can also feel more confident they are testing their new features on servers that are set up identically and run the same versions as our production servers.

This staging system has been immensely popular at our company. We have had many employees remark how it took weeks and lots of paperwork to get demo servers requisitioned at their previous companies. On our system, anyone can have a staging server up in as little as 15 minutes. Building such a system would be impossible without having a strong automation foundation in place.
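The kind of request such a portal accepts can be sketched as a small data structure that is translated into a Chef run list. This is a minimal illustration only, not Rafter's actual portal code: the class names, field names, and recipe names are all hypothetical. The one detail taken from Chef itself is the `recipe[cookbook::recipe]` run-list syntax.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class AppSelection:
    name: str      # application to install (hypothetical name)
    revision: str  # Git revision to check out: branch, tag, or SHA


@dataclass
class StagingRequest:
    requested_by: str
    apps: List[AppSelection]
    database: str = "test"               # "test" or "snapshot"
    snapshot_time: Optional[str] = None  # needed when database == "snapshot"


def build_run_list(request: StagingRequest, recipes: Dict[str, str]) -> List[str]:
    """Map a staging request onto the recipe names production also uses.

    `recipes` maps an application name to its (hypothetical) Chef recipe.
    """
    if not request.apps:
        raise ValueError("select at least one application")
    if request.database not in ("test", "snapshot"):
        raise ValueError(f"unknown database option: {request.database}")
    if request.database == "snapshot" and request.snapshot_time is None:
        raise ValueError("a snapshot request needs a snapshot_time")
    unknown = [app.name for app in request.apps if app.name not in recipes]
    if unknown:
        raise ValueError(f"no recipe for: {', '.join(unknown)}")
    # Chef run-list entries take the form recipe[cookbook::recipe].
    return [f"recipe[{recipes[app.name]}]" for app in request.apps]
```

Because the lookup table is the same one production would use, a staging server built this way cannot quietly drift onto a different recipe than its production counterpart.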

Datacenter Failover in Minutes

Another difficult task that DevOps automation has helped us solve at Rafter is performing datacenter failover. About two years ago, we decided that we wanted to have the ability to switch at any time to a secondary datacenter to reduce the impact and cost of outages. Since we are in the education industry, our business is seasonal, and a long outage during the back-to-school season can be costly. This also benefits our operations folks tremendously since they can perform risky datacenter maintenance when there is no live traffic going to that datacenter.

Having an automated infrastructure in place has allowed us to meet this difficult goal with relative ease. When we first started out, I would have been shocked if someone told me that one day we would be able to fail over entire datacenters in a matter of minutes. But by relying on our automation to do all of the heavy lifting, we were able to design an infrastructure that supports this goal.

Shared Deployment

Another notable area of work for our DevOps team has been around improving our deployment processes. From the beginning, we’ve used a great tool called Capistrano for managing our deployment and we were generally happy with it. One improvement we did make to it was teaching our Campfire bot (named Reginald) how to deploy using Capistrano. When our product engineers want to deploy their applications now, they simply ask Reginald to do it for them.

The biggest win here has been that it makes deployment more of a shared experience. Anyone in the Campfire chat room can see if a deploy is going on and if there’s a problem, someone can immediately jump in and help. All errors and deploy logs are also stored in our database and viewable from a web application which Reginald points developers to. This makes it much more convenient to share any potential issues. Previously, when the developers were deploying from a server, it was more of a private operation, whereas when you have to deploy publicly, it’s a lot easier for everyone on the team to know what’s going on.
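A chat-driven deploy of this kind has three parts: recognizing the request, turning it into a Capistrano invocation, and recording the outcome where everyone can see it. The sketch below shows one possible shape for them. It is an assumption-laden illustration, not how Reginald actually works: the chat phrasing, function names, and table layout are invented, and the command assumes Capistrano's multistage form (`cap <stage> deploy`). SQLite stands in for whatever database holds the deploy logs.

```python
import re
import sqlite3
from typing import List, Optional, Tuple

# Hypothetical chat syntax: "reginald deploy <app> to <stage>"
DEPLOY_PATTERN = re.compile(
    r"^reginald[,:]?\s+deploy\s+(?P<app>[\w-]+)\s+to\s+(?P<stage>[\w-]+)$",
    re.IGNORECASE,
)


def parse_deploy_request(message: str) -> Optional[Tuple[str, str]]:
    """Return (app, stage) if the chat message is a deploy request."""
    match = DEPLOY_PATTERN.match(message.strip())
    if match is None:
        return None
    return match.group("app").lower(), match.group("stage").lower()


def capistrano_command(stage: str) -> List[str]:
    """Build the argument list for a multistage Capistrano deploy."""
    return ["cap", stage, "deploy"]


def open_deploy_log(path: str = ":memory:") -> sqlite3.Connection:
    """Open (and create if needed) the shared deploy log."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS deploys ("
        "requested_by TEXT, app TEXT, stage TEXT, succeeded INTEGER, output TEXT)"
    )
    return conn


def record_deploy(conn: sqlite3.Connection, requested_by: str, app: str,
                  stage: str, succeeded: bool, output: str) -> int:
    """Store one deploy attempt and return its row id for linking from chat."""
    cursor = conn.execute(
        "INSERT INTO deploys VALUES (?, ?, ?, ?, ?)",
        (requested_by, app, stage, int(succeeded), output),
    )
    conn.commit()
    return cursor.lastrowid
```

In a real bot, the command list would be passed to something like `subprocess.run` from the application's checkout, and the returned row id would be used to build the link the bot posts back to the room.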

Self-Service Tools

One tenet that we’ve always designed for in the products our DevOps team builds and uses is being self-service whenever possible. The success of automation is all about removing manual roadblocks, especially ourselves! Giving people the tools and platform for managing parts of the infrastructure that they care about makes the entire organization perform more efficiently. Especially in areas where others can actually do the work for you (and probably do it better), your tools should empower them to do it. We try to keep this philosophy in the tools and practices we adopt.

Our Team Makeup

I’ve found that there tends to be a wide range in the industry of where a DevOps organization falls in terms of development vs. operations. Our DevOps team was formed out of the product engineering team so it has always been staffed with traditional software engineers. All of our current team members have spent time building features on our main product lines before switching to DevOps. We’ve also been blessed with a fantastic operations team that handles the care and feeding of our hardware and datacenters, so we have the luxury to focus solely on software.

Focus on Development

I believe the heavy focus on the “dev” side has worked well for us. As we build and support a more complex application infrastructure, our need for traditional software development will only grow. That said, we have found hiring to be difficult, as many software engineers do not have a lot of interest in ops/infrastructure related areas, and many operations engineers don’t have enough experience in traditional software development roles. It definitely requires a special individual to fill these shoes. At Rafter, I’ve noticed that DevOps seems to attract those with deep knowledge in certain areas as opposed to generalists. Currently, our DevOps team consists of three engineers, and we are always looking for more talented people.

One aspect that makes our DevOps team unique is that we often take on tasks that normal engineering product teams or site reliability engineers might accomplish. For example, we’ve led projects such as upgrading our applications to work with new major versions of Ruby and Ruby on Rails. If you’ve ever been involved in a major Rails upgrade on a large application, you know this is no easy undertaking and requires a lot of expertise with both Rails and the underlying codebase. We also often assist product teams in debugging and solving performance and scalability issues in their applications. This type of work has helped our team develop an exposure to all parts of the software platform that we have to support.

Day to Day

Support

A typical day for members of our DevOps team is a combination of resolving support requests, investigating alerts, and working on longer-term projects. On average, we spend about 50% of our time on support and alerts, and 50% on project work, with requests and alerts taking priority because they are usually very time sensitive. The support requests cover a wide variety of issues, but typically revolve around application support. For example, an engineer would like to get a new application added to the platform or would like to make an infrastructure change to an existing application, such as spinning up more servers. Sometimes, these requests can be large in scope and require significant changes and testing, such as hosting a new programming language or a new database.

Monitoring & Alerts

Another common set of requests is around investigating application alerts (either automated or reported by engineers). Often, DevOps acts as the coordinator for alerts, performing the initial investigation and finding the right team that can best resolve the issue. This is not to say, however, that we handle all application alerts. We investigate alerts that affect infrastructure (e.g. issues involving over-utilization of network, I/O, CPU, and memory) or areas that have shared responsibility and usage among the engineering team (e.g. shared libraries, databases, and legacy applications).

Project Work

Often, the work we do in supporting engineering requests and investigating alerts exposes areas where we need more investment. Specifically, these are areas where we aren’t providing engineers with enough tools or information to solve the problem themselves. One current example is that we don’t have the capability for engineers to deploy their own cronjob changes. So when an engineer needs to modify a cronjob for their application, they need to open a support request with us. After doing 10 of these requests a week, it quickly becomes apparent that we need to build out a tool that allows our engineers to deploy their own cronjobs. Thus, a significant portion of our project work comes from deficiencies that are discovered through our support and alert requests.

We also spend a significant amount of time upgrading our systems and performing the testing required to be confident in the upgrades. I could fill up several pages with all of the different types of servers, databases, applications, libraries, and operating systems in our platform. Keeping all of them on recent and secure versions is enough work to keep several engineers busy all year round.

Another source of project work is direct external requests. Examples include management asking that we be able to fail over to another datacenter, an auditor asking for a specific security feature, or an engineer requesting infrastructure changes that might affect many applications.

DevOps as a Platform

As our company has grown, DevOps has found a place in between our engineering product teams and operations team. I think of it as a three-layer cake. At the bottom layer is our operations team, which is responsible for acquiring and setting up the physical hardware. In the middle is our DevOps team, which is responsible for providing a platform to use these hardware resources. The top layer is our engineering product teams, which use the DevOps platform to deploy, monitor, and host their applications.

I think platform is a key concept here when talking about DevOps. When I first started in DevOps, I used to think of our tools as separate independent items. However, once you start building out these individual pieces, you realize that each component fits together in a larger platform and requires access to a common set of information. Each of these tools needs to be integrated; otherwise a lot of complexity and duplication will slowly creep in.

Building a DevOps Platform

Transforming our DevOps tools into a platform of services is a current and ongoing project for us. This might seem strange, but when you think about it, there are many different clients that need to know information about our infrastructure. For example, some of the applications described in this article might need to access the platform in the following ways:

Our staging server portal needs to know a list of applications that can be installed on staging servers
Our deployment framework needs to know which applications can be deployed and where to deploy them
To switch datacenters we need to know how to bring up applications in the other datacenter
Our Chef recipes need to know how to configure servers properly based on the applications we ask them to install

From these examples, there’s a set of common information that can be exposed via services to answer basic questions about the infrastructure.
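To make the idea of a common set of information concrete, here is a minimal in-memory sketch of a registry that could answer the questions listed above for several clients. Everything in it is hypothetical: the class names, the fields, and the example hosts are assumptions made for illustration, and a real service would sit behind an API and read from a shared data store rather than a Python dictionary.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Application:
    name: str
    stageable: bool                           # may be installed on staging servers
    deploy_hosts: Dict[str, Tuple[str, ...]]  # datacenter name -> hosts


class PlatformRegistry:
    """Single source of answers about applications and where they run."""

    def __init__(self, applications: Iterable[Application]) -> None:
        self._apps = {app.name: app for app in applications}

    def stageable_applications(self) -> List[str]:
        """What the staging portal would offer in its application picker."""
        return sorted(a.name for a in self._apps.values() if a.stageable)

    def deploy_targets(self, name: str, datacenter: str) -> Tuple[str, ...]:
        """Where the deployment framework should push a given application."""
        return self._apps[name].deploy_hosts.get(datacenter, ())

    def applications_in(self, datacenter: str) -> List[str]:
        """What has to be brought up when switching to a datacenter."""
        return sorted(
            a.name for a in self._apps.values() if a.deploy_hosts.get(datacenter)
        )
```

The point of the sketch is that the portal, the deployment framework, and the failover tooling all ask the same object, so adding an application in one place updates every client at once.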

One can take the idea even further and develop additional services that effect change in the infrastructure. Some examples might be a service for putting a server or application into maintenance mode, a service for adding new applications to the platform, etc. Some DevOps organizations might implement these features as ad-hoc scripts. This has a downside, though, since usually only a DevOps engineer can run these kinds of scripts. A service has the benefit that other tools and applications can interact with it in a common way, and we can build applications on top of these services for further flexibility. This also supports the idea of building tools that are self-service whenever possible.
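The difference between an ad-hoc script and a service can be sketched with the maintenance-mode example. The class below is a hypothetical illustration, not an existing tool: its names and its audit-log format are assumptions. What it shows is a single entry point that any caller (a chat bot, a web page, a deploy tool) could use, with every change recorded.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

VALID_KINDS = ("server", "application")


@dataclass
class MaintenanceService:
    """Tracks which servers and applications are in maintenance mode."""

    _active: Set[Tuple[str, str]] = field(default_factory=set)
    audit_log: List[str] = field(default_factory=list)

    def set_maintenance(self, kind: str, name: str, enabled: bool, actor: str) -> None:
        """Turn maintenance mode on or off and record who did it."""
        if kind not in VALID_KINDS:
            raise ValueError(f"kind must be one of {VALID_KINDS}")
        key = (kind, name)
        if enabled:
            self._active.add(key)
        else:
            self._active.discard(key)
        state = "enabled" if enabled else "disabled"
        self.audit_log.append(f"{actor} {state} maintenance for {kind} {name}")

    def in_maintenance(self, kind: str, name: str) -> bool:
        """Let other tools check state before deploying or alerting."""
        return (kind, name) in self._active
```

A deploy tool could call `in_maintenance("application", ...)` before pushing code, which is the kind of cross-tool interaction a one-off script makes difficult.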

Chef as a Database

We have also shifted our perspective in how we see Chef in our ecosystem. It is far more than just a tool for automating our servers. We see it as the data layer in our DevOps platform. By taking advantage of Chef’s platform and APIs, we can store and query information about all of the servers and applications in our infrastructure. While Chef provides the data storage layer, we are building our own set of services on top to access this data and provide a flexible means of interacting with it.

Our Job is Never Finished

All of this comes down to ensuring we can still continue to scale, and that the infrastructure is never the roadblock preventing a new product from being launched. Even after there is significant automation in place, you have to keep iterating on top of that existing automation.

One example of this readily comes to mind. Our team spent a lot of effort building out a framework for how our applications are set up on each server. The only piece required now is to write a small JSON file describing basic information about the application and its dependencies. We then patted ourselves on the back and went to tackle the next problem.

However, we started to notice that we were adding a lot of new applications and spending a decent amount of time writing these configuration files, usually after several inefficient back-and-forth conversations with the product engineers. It soon became apparent that we had become a bottleneck in our own automation. The job of the DevOps team is never finished, and what may seem like enough automation today can still slow you down tomorrow.
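One way to take the DevOps team out of that loop would be to let product engineers write the JSON descriptor themselves and have a tool check it. The sketch below is an assumption, not the format Rafter uses: the article does not show the file, so the field names (`name`, `repository`, `runtime`, `dependencies`) are invented for illustration.

```python
import json
from typing import List

# Hypothetical descriptor fields and the JSON type each must have.
REQUIRED_FIELDS = {
    "name": str,
    "repository": str,
    "runtime": str,
    "dependencies": list,
}


def validate_descriptor(text: str) -> List[str]:
    """Return a list of problems; an empty list means the file is usable."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["descriptor must be a JSON object"]
    problems = []
    for field_name, expected in REQUIRED_FIELDS.items():
        if field_name not in data:
            problems.append(f"missing field: {field_name}")
        elif not isinstance(data[field_name], expected):
            problems.append(f"field {field_name} must be of type {expected.__name__}")
    return problems
```

Run as a pre-commit check or inside a web form, a validator like this replaces a round of conversation with immediate feedback to the person who knows the application best.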

About the Author

Chris Williams is a Co-Founder of BookRenter.com, the first online textbook rental service, which became Rafter Inc. in 2012. Today, he manages the DevOps group at Rafter and oversees infrastructure automation, deployment and release processes, and platform availability.
