DevOps & Product Teams - Win or Fail?
Falling in Love with DevOps
To be honest, DevOps for me wasn’t love at first sight. My first encounter with it was in 2012 in an infrastructure team. Engineers in these teams work on building services such as authentication and image metadata storage which are used by product teams. We decided to stop outsourcing the operation of our services and start going on call ourselves. It seemed like overkill: Prezi.com was a hip new startup; we weren’t struggling with the age-old enterprise problems of silos slowing down communication between developers and operations folks. With a newborn baby daughter, my biggest problem was trying to get enough sleep at night and the newly created on-call rotation was unlikely to help with that. What we hoped it would improve was availability.
Availability is a tricky thing to measure. In the fall of 2012, Prezi gave up on outsourced operations in an effort to stall our growing Mean Time To Recovery (MTTR) as our user base exploded. MTTR is the average time it takes to repair an outage, for us around 40 minutes at the time. This may not seem like a big deal, but for someone presenting their thesis, startup pitch or lecture, not being able to present could result in a catastrophe. The other widely used availability metric is Mean Time Between Failures (MTBF), which is the average amount of time the system can operate between outages. For us this was around four days.
These very similar four-letter acronyms lead to very different engineering cultures. MTBF is maximised when few people make changes to the system in a highly controlled manner, carefully following well-proven processes. Ideal for companies working on pacemakers or brake systems in cargo trains.
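To make the two metrics concrete, here is a minimal Python sketch that computes both from a list of outages. The outage log is made up for illustration (it is not Prezi's data) and was chosen so that the results match the 40 minutes and four days mentioned above. MTBF is taken here as the average uptime between the end of one outage and the start of the next, which is one common reading of the definition.

```python
from datetime import datetime, timedelta

# Made-up outage log as (start, end) pairs; not real Prezi data.
OUTAGES = [
    (datetime(2012, 10, 1, 10, 0), datetime(2012, 10, 1, 10, 40)),
    (datetime(2012, 10, 5, 10, 40), datetime(2012, 10, 5, 11, 10)),
    (datetime(2012, 10, 9, 11, 10), datetime(2012, 10, 9, 12, 0)),
]


def mttr(outages):
    """Mean Time To Recovery: the average duration of an outage."""
    durations = [end - start for start, end in outages]
    return sum(durations, timedelta()) / len(durations)


def mtbf(outages):
    """Mean Time Between Failures: the average uptime between outages."""
    ordered = sorted(outages)
    # Uptime runs from the end of one outage to the start of the next.
    uptimes = [nxt[0] - prev[1] for prev, nxt in zip(ordered, ordered[1:])]
    return sum(uptimes, timedelta()) / len(uptimes)


print(mttr(OUTAGES))  # 0:40:00
print(mtbf(OUTAGES))  # 4 days, 0:00:00
```

Note that the two numbers move independently: shipping fewer changes tends to raise MTBF, while faster diagnosis lowers MTTR.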
MTTR on the other hand creates the perfect incentive for DevOps: the fastest way to find the cause of an outage and fix it is to bring together the engineers who wrote the component with those who operate it. Going one step further, Prezi put the software engineers writing the code on call. Why? In most cases, backend outages were caused by bugs or performance issues introduced during recent changes to services. Who would be better qualified to fix them than the people who introduced those changes in the first place?
DevOps delivered on our expectations. One year later, our user count doubled from 16 to 32 million, while MTTR and MTBF improved to around thirty minutes and one week, respectively. There were additional benefits: less specialisation in the team meant more tasks could be performed by any engineer, leading to fewer bottlenecks and faster development. Software engineers learned about the intricacies of debugging in production with strace. Sysadmins started writing web services in Python. It was a love story of epic proportions. How did it end? I’m happy to say that infrastructure teams at Prezi are still well served by the DevOps way today. The DevOps nirvana ended for me personally when I joined a product team - a situation where very different rules and incentives are at play, as I soon found out.
What makes product teams special?
You ain't special, so who you foolin'? Don't try to give me a line
Guns N' Roses: Bad Obsession
Infrastructure teams build services for product teams. Product teams build user-facing code on top of these services. Some product teams are responsible for the Prezi presentation editor and viewer, while others work on the prezi.com website. Most teams at Prezi contain 3 - 8 people. Unlike product teams, teams working on infrastructure contain engineers only. The most refreshing thing for me about adopting DevOps in such a team was the small set of axiomatic statements which guided our decisions:
- Everything we do is going to improve our primary metric (MTTR) at some point in the future.
- Specialisation is occasionally unavoidable, but always frowned upon. Anyone in the team should be able to write the code and operate the resulting service.
- Every service has a clear owner. It’s the owner’s responsibility to guarantee availability. Teams should not modify services belonging to other teams.
- Write lots of tests to ensure bugs are caught before they are deployed.
Even if these directives aren’t followed by all teams which claim to practice DevOps, they’re fairly popular ideas in the DevOps community. Much to my surprise, none of these - to me, sacred - laws held true in their original form once I joined the product team.
My new team is responsible for the new user experience at Prezi.com. We’re happy when people follow a link to a presentation on Prezi.com and stick around because they like what they see. To achieve this, we’re mostly using the playbook outlined in The Lean Startup: run cheap experiments to test ideas, don’t build anything complex unless it’s proven. It’s a solid way to build a product, which incidentally invalidates all the DevOps rules I’ve grown to love on the infrastructure side.
Tests and ownership
Much to my initial surprise, a cheap experiment doesn’t necessarily include tests. It’s always a good idea to write unit tests, but complex acceptance tests can often wait until the hypothesis of the experiment is validated (or invalidated, in which case we save the time needed to write the tests). We learned this the hard way when we spent over four weeks building a content recommendation experiment which produced disappointing results: none of the algorithms we tried was a hit with our users, who ignored most of the recommended prezis. Prior to releasing the experiment, we spent almost a week writing top-notch Cucumber tests for this functionality. We decided to try something completely different, and the code - despite being wonderfully tested - was just bloat waiting to be removed, tests included.
I also had to seriously rethink my concept of service ownership. In the infrastructure teams, each critical service belongs to one team which is on call, reviews all pull requests from other teams, and generally assumes responsibility for the availability of the service. There is no such thing as an orphaned infrastructure service. Libraries shared between these services also have owners who oversee the overall architecture of the code and communicate API changes.
In our product team, some experiments require small modifications to many different services or libraries, none of them owned by us. Outside Prezi’s infrastructure community, ownership rules have traditionally been more lax, so some of these have no clear owner (or their owner has no plans to work on our feature requests). In these cases it makes sense for us to do whatever is needed to get our experiment or feature out the door. As long as we’re not causing any regressions which could negatively impact other teams’ metrics, there’s no harm done in slightly loosening up the rigid infrastructure definition of ownership. For example, one of the experiments we ran required us to log information in a way which the log collection service didn’t support. After countless redirects between different teams, we took matters into our own hands and decided to send logs to a deprecated but functional endpoint which we patched to accommodate our needs.
In a product team, making an impact on metrics is just as important as it is for infrastructure teams, but there can be several metrics. Choosing which to optimize for can be a daunting challenge (and may need to be periodically re-evaluated with shifting company priorities).
Outage times are relatively easy to measure, but the most relevant metrics for a product team can sometimes be quite difficult to calculate. For example, Prezi is a good match for marketing agencies who need to stand out and impress clients with their presentations. Our team would like to cater to their needs, providing an experience which wins their hearts. As a result, the most challenging part of an experiment may be to guess whether a satisfied visitor was a marketing professional or not.
MTTR, despite all its advantages, is rarely a good measure of a product team’s success. A product team which identifies an unfulfilled user need to be addressed is victorious. Teams which achieve this by building insanely buggy experiments held together by luck and duct tape are no exception. In fact, as long as it’s an experiment, the cheaper the better if the resulting data is sufficiently reliable. As long as it doesn’t affect more than 2% of users and doesn’t prevent customers from presenting their work, anything is fair game.
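One way to keep a scrappy experiment within the 2% limit is to assign users to it deterministically and refuse any larger exposure. The Python sketch below shows the idea under stated assumptions: the function name, the experiment name and the hashing scheme are all hypothetical, and this is not a description of how Prezi actually buckets its users.

```python
import hashlib

# The ceiling mentioned in the text: an experiment may reach at most 2% of users.
EXPERIMENT_CAP = 0.02


def in_experiment(user_id, experiment, fraction=EXPERIMENT_CAP):
    """Decide whether a user sees the experiment (hypothetical helper).

    The same user always gets the same answer for a given experiment,
    and different experiments pick different slices of the user base.
    """
    if not 0 <= fraction <= EXPERIMENT_CAP:
        raise ValueError("experiments may not exceed the 2% cap")
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < fraction


# Example: roughly 2% of 100,000 made-up user ids end up in the experiment.
exposed = sum(in_experiment(f"user-{n}", "bigger-signup-button") for n in range(100_000))
print(exposed / 100_000)
```

Because the assignment depends only on the user id and the experiment name, no extra state needs to be stored, which suits code that may be thrown away in a few weeks.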
Once the experiment is over, and the product team is ready to release something, quality (and measurements like MTTR) becomes more important. Still, how much energy should be poured into improving availability is primarily a product decision. I recently attended a talk by a senior engineer at Gilt, who mentioned that an outage outside of Gilt’s peak hours is not considered a big problem because very few visitors use the site before its daily opening at noon. Gilt’s backend engineers are better served by a specialised metric which takes this aspect of their business into account than the vanilla MTTR.
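As an illustration of what such a specialised metric might look like, the Python sketch below weights every minute of downtime by how busy the site is expected to be at that hour. The traffic profile is invented for the example (quiet before a noon opening, a peak right after it), and nothing here reflects the metric Gilt actually uses.

```python
from datetime import datetime, timedelta


def hourly_weight(hour):
    """Hypothetical traffic profile: near-empty before noon, peak at opening."""
    if hour < 12:
        return 0.05
    if hour < 14:
        return 1.0
    return 0.3


def weighted_downtime(start, end):
    """Sum the traffic weight of every full minute the outage lasted."""
    minutes = int((end - start) / timedelta(minutes=1))
    return sum(
        hourly_weight((start + timedelta(minutes=m)).hour) for m in range(minutes)
    )


# Two outages of the same length score very differently.
night = weighted_downtime(datetime(2014, 3, 1, 3, 0), datetime(2014, 3, 1, 3, 30))
noon = weighted_downtime(datetime(2014, 3, 1, 12, 0), datetime(2014, 3, 1, 12, 30))
print(night < noon)  # True
```

Plain MTTR would treat both half-hour outages as equal; a weighted score lets the product side decide how much each one really matters.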
At Prezi, product teams clean up or sometimes even completely discard the “scrappy” experiment code prior to releasing a feature for all users. This is the right time to finish up those acceptance tests nobody had time to write earlier. Since management evaluates product teams based on what they actually release for all users, product teams are highly motivated not to neglect this final step which is crucial to keeping the codebase maintainable.
Specialisation (DevOps’ arch-enemy) is key in speeding up the experimental cycle. A daunting spectrum of skills is required to properly and efficiently execute experiments, which favors well-defined non-overlapping roles instead of generalists.
To start with, the team needs to identify the areas where they can have the highest impact. For example, our team had to decide whether we concentrate on content recommendation or simplifying the user interface. This is generally the task of the Product Manager or Product Owner, often working together with a User Experience Researcher, who conducts user interviews and gathers qualitative input from customers. Once the hypothesis for the experiment is ready (for example, “A larger sign-up button will increase registration”), a designer can create the layout and assets necessary. Finally, engineers build the minimum amount of software necessary to validate the hypothesis. Depending on the product, it could make sense to have specialisation even among engineers. This doesn’t make sense in our team, but for example the Android team has engineers who specialise in writing the server-side features used by their client.
Despite striving to be cross-functional, one of the thornier problems product teams often face is lacking some necessary competence. For example, one of the product teams working on website personalization built a service to keep track of per-user settings such as the preferred language. This backend service needs some kind of database, but it probably doesn’t make sense to have a dedicated DBA for the team. This won’t be a problem until the database goes down because of an obscure leap-second bug at 4 AM. The traditional DevOps solution in this case would be to have the team’s phones ringing off the hook until the problem is solved (hopefully at least one team member would know what’s going on). With a product team, this is not so clear: most iOS developers wouldn’t know what hit the DB and would have a hard time trying to figure out what’s going on. One solution would be to have a dedicated team operating databases for product teams, but this sounds a lot like the developers versus operations divide of the bad old days before DevOps.
The problem with dedicated operations teams
Every now and then I’ll meet someone who is nostalgic for separate developer and operations teams which happened to get along just fine at their company. Even the classic DevOps novel The Phoenix Project ends this way, with separate teams on amicable terms. Be warned: even if this does occasionally happen, it’s exceedingly rare. The reason is opposed incentives.
Developers are expected to deliver product features or user stories, preferably in a predictable way. When unforeseen problems cause delays, developers - keeping the release date in sight - struggle frantically to compensate, releasing incomplete features (although some would argue that there’s no such thing as releasing too early).
Operations is usually judged on availability. MTTR may be more DevOps-friendly than MTBF, but regardless of how it's measured, outages are more difficult to prevent in the face of constant change. This can cause engineers in operations to be over-cautious and too conservative. If lots of new product features are deployed to production, the developers get the credit, but if any of those shiny new features cause an outage, the operations guys will be waking up to fix it.
Needless to say, the company needs both skillsets and accompanying perspectives. DevOps is useful because it unifies these opposing viewpoints in the team. But what if DevOps in the traditional sense is not practical? Should the product team tracking user settings hire a DBA to resolve a once-a-year database-induced outage? If the incentives are correctly aligned, there’s no reason a product team can’t work with an operations team. The trick is to have well-defined interfaces between the teams.
Working with platform teams
I had a wonderful experience recently. A friend and former teammate who is working in one of the infrastructure teams said he wanted to talk. It turned out that he was assessing what backend pains product teams had. His team is responsible for software deployment and the staging environment. He had several project suggestions which could make our lives easier, and wanted to know if there was anything in these systems which was causing us trouble. His approach was nothing like that of a traditional operations engineer. Considering his team is a platform team, not an operations team, this is hardly surprising.
Operations teams assume complete responsibility for running an application or service once the handover from development is over. They have to try to make sense of whatever is thrown over the fence. The opposed incentives are almost guaranteed to be there. Platform teams on the other hand provide tools and services which ease the burden of operating software for their customers, the product teams. The responsibility for the user experience remains solely with the latter. For instance, as an engineer in a product team, I need to deploy my code into production somehow. I don’t have the luxury of outsourcing this problem to an operations team: it’s my responsibility, but I do have a number of options. I could build my own custom deployment solution, but this would take valuable time away from experimentation and product development. I could use an open source or commercially available product, but chances are this would still require a significant time investment on my part due to the learning curve and possibly the need for customization. My friend’s platform team offers a third solution: as long as I produce a tarball with a specific format, they provide a service which deploys the code on our nodes. Needless to say, I am still responsible for what I deploy: if I end up using their service to deploy buggy code, that’s my fault. All they guarantee is that the deployment mechanism will work as expected. Still, the service they provide is a huge time-saver for me.
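The tarball is a good example of a well-defined interface between two teams: the product team packs it, the platform team checks it and deploys it. The Python sketch below illustrates such a contract. The layout (a manifest.json next to an app/ directory) and both function names are invented for this example; the actual format used by Prezi's deployment service is not described here.

```python
import io
import json
import tarfile


def build_release(files, service, version):
    """Product-team side: pack files into the (hypothetical) agreed layout."""
    manifest = json.dumps({"service": service, "version": version}).encode()
    members = [("manifest.json", manifest)]
    members += [(f"app/{name}", data) for name, data in sorted(files.items())]
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in members:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def validate_release(blob):
    """Platform-team side: check the layout only, never the code itself."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:gz") as tar:
        names = tar.getnames()
        if "manifest.json" not in names:
            raise ValueError("missing manifest.json")
        if not any(name.startswith("app/") for name in names):
            raise ValueError("no files under app/")
        return json.load(tar.extractfile("manifest.json"))


blob = build_release({"main.py": b"print('hello')"}, "user-settings", "1.0")
print(validate_release(blob))
```

The validator accepts buggy application code as long as the package is well formed, which mirrors the split in responsibility: the platform team guarantees the deployment mechanism, the product team answers for what it ships.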
Many larger companies have complex internal billing systems, where the APIs of services operated by platform teams “bill” the product teams based on their usage (sometimes in terms of dollars). While this may not be an option for smaller organizations, the main message is that platform teams provide a service to product teams (who are still responsible for operating the product they build). A platform team’s value can be measured by the value of the services they provide to product teams.
To solve the user settings service problem, Prezi could establish a platform team which provides a database as a service. They could provide some guarantees to their clients (for example, they will take care of obscure leap-second induced bugs). In exchange, when the product team’s feature turns out to be a hit with users, they may humbly say that they couldn’t have done it without the database team. And they would be right.
Third party platform teams
I love Prezi’s in-house deployment system, but if I were the CTO of a startup I wouldn’t think about building my own. I would go straight to Heroku.
Heroku is the perfect example of a platform team which is not part of the company. They do an excellent job of documenting their services and their product is easy to use. If I’m not satisfied with their offering, I could go to a dozen competitors who offer various solutions to the problem of deploying and running my code.
Even larger companies are seeing the benefit of “renting” some platform teams. Amazon Web Services’ Redshift database is a great example. Redshift is a database service which is operated entirely by AWS. Its use requires no traditional DBA skills. It’s not suitable for every kind of workload, but if it proves to be a good match for your use case, it may be much easier for product teams to use than a data warehouse of their own.
Using third party services makes it easier to align the incentives of the service provider and the consuming product team. The services offered are competing with other similar services on an open market, therefore it’s in their interest to keep their offering stable, up-to-date and easy to use. At the same time, services can choose their customers. They have the right not to address some customer needs. For example, AWS Redshift is designed to be used as a data warehouse, not as a low-latency application database. The AWS ELB load balancer does not support URL rewriting: it’s tailored towards relatively simple use cases and excels precisely because it is reliable and easily configured due to its simplicity. In theory, the market forces both providers and consumers to compete, which is difficult to achieve with in-house platform teams.
Coming from an infrastructure team dedicated to writing backend services - the traditional DevOps stronghold - I was surprised to learn how different life can be in a product team. The need for a host of skills outside of engineering (Product Management, UX, design) requires specialization within the team. Ironically, product teams also lack certain engineering skills necessary to get the product out the door and keep it running in operation.
Companies like Prezi.com are experimenting with a solution to these problems in the form of platform teams. Despite the seemingly similar mission to the more traditional operations team, platform teams do not take over responsibility from product teams. Instead, they offer services to make the product team’s work more effective. This setup remedies the ancient problem of misaligned incentives between teams which build software and teams responsible for system infrastructure. And if that’s not the spirit of DevOps, I don’t know what is!
About the Author
Peter Neumark is a DevOps guy at Prezi. He lives in Budapest, Hungary with his wife Anna and two small children. When not debugging Python code or changing diapers, Peter likes to ride his bicycle.