DevOps @ Nokia Entertainment
This article is part of the “DevOps War Stories” series. Each month we hear what DevOps brings to a different organisation, we learn what worked and what didn’t, and chart the challenges faced during adoption.
I’d like to tell a story about DevOps. I’ll be drawing on some of the experiences and lessons learnt in a small corner of Nokia, the Entertainment Organisation based in Bristol, England. We are creators of fine music products, and during the last couple of years our outfit has learnt to deliver server-side software, fast. In order to learn we’ve had to make mistakes, so I’ll aim to share both our successes and our failures. It’s not a recipe for DevOps, but you might find some useful ingredients.
There are numerous definitions of DevOps and nearly every variation I read holds something new and interesting. For the purposes of this article, I’d suggest DevOps is a way of working where:
“Developers and Operators work together to ensure products get built, systems are created, scale and stay stable. They understand each other’s responsibilities and routinely draw on each other’s expertise.”
This can be contrasted with the situation where the two teams are separated, organisationally, or culturally, and don’t work effectively together.
DevOps is both a desirable culture and a solution to a specific problem. It’s not needed everywhere. I strolled around the Future Of Web Apps conference and talked to a few of the start-up people about DevOps; most of them didn’t see the need. When there are just a few people in a company, responsibility is shared and there isn’t room for the kind of specialism and protectionism that can prove so divisive as organisations scale up.
This article also mentions Continuous Delivery - quite a lot. That’s because chasing the Continuous Delivery vision of faster, more frequent changes led us to recognise the need for DevOps.
2. Why things had to change…
A few years ago creating software releases really hurt. They hurt in the kind of creative ways that would make a James Bond villain proud. Although teams were developing using agile methods, changes were queued up and applied to production systems in one drop, or rather painstakingly built into a fragile house of cards and carefully coaxed to production. This led to a number of common problems, with all too familiar symptoms.
Quality – The rush for release deadlines could easily compromise quality of both decisions and software. The knowledge that not including a feature in this release would mean waiting another couple of months made for some tough decisions, and late nights, to get the last pieces in place.
Risk – Releases weren’t routine, no two were the same, each was unique and involved complex dependencies. This meant they risked downtime, performance issues, finger trouble and similar problems. Some of this was mitigated with testing, rehearsals and rollback plans, but the sheer number of changes and moving parts meant there was always scope for problems.
Motivation – Despite generous amounts of late night pizza, there was a people side effect. Pride and satisfaction with the release contents were beaten down by bureaucracy and the difficulty of the release process. Pressure was high during releases. One mistake and the house of cards comes crashing down, and with it the chances of going home in daylight. People committed to release activities often weren’t doing the kind of roles they had studied for, and the work was an unwelcome distraction from the areas that really added value to the business.
Cost – There were both opportunity and financial costs. While effort is going into building a giant release-shaped house of cards, it’s not invested in innovation or adding value to existing products. Infrequent releases also made for long lead times, and a feeling of unresponsiveness. In an environment where opportunities for change are scarce, competition for precious release slots can easily lead to escalation and the high cost of senior management involvement.
Despite a seemingly bleak picture we were still delivering features, but knew we could do better. People were acutely aware of the symptoms, but perhaps not their origins. Attempts were made to improve matters, both in terms of refining the existing process and steps towards alternative approaches, including Continuous Delivery and DevOps. However, the real killer was introducing change, or even creating a plan, whilst meeting our other commitments.
3. How we got there…
Instead of diving into the next release activity we stopped. Even this was hard, and took discipline. The temptation is always to keep pressing forward, like the infamous man who won’t stop to sharpen his saw, expending energy feels good, even if it’s ultimately wasteful. We stopped just long enough to consolidate our vision and plan our next steps. After all the cajoling and evangelism, the business started to listen. Particularly notable were discussions at the leadership level between Operators, Engineering, Product and Architects. Out of this came agreement to try something new, both at an organisational and technical level.
The something new was Continuous Delivery and with it the promise of reduced lead times and costs, and an end to all that release hurt. Implementation of Continuous Delivery would require agile engineering teams to consider release into the production environment as the point their work was done, not handover to Quality Assurance, Integrators or Ops. In addition, frequent deployments to live would be required, something that neither tooling nor people were set up for. It was clear that greater collaboration would be required to achieve this, and that there was a lot to learn from the growing DevOps movement. I think it’s fair to say that at the time we didn’t appreciate all the many benefits that way of working brings.
A team was created to help build tools and encourage progress towards Continuous Delivery and a DevOps style. This built on the work of individuals and enthusiasts who had been scattered across teams, enabling them to concentrate their initiatives, both technical and otherwise. It is perhaps an obvious move, but not an easy one to sell: it requires recognition that teams need to improve capability as well as build product, and an acceptance that delivery may slow while new ways of working are established. In addition to the time investment, a dedicated team, or project, sends a strong signal: this is the approach for the future, and we’re committed to it. Both help people get behind the ideas, and can save a lot of wasteful should we/shouldn’t we discussions.
Inspect, Adapt, Learn
Being an agile shop the natural place to start was retrospection. This is the process of looking at what is working, and what isn’t, and generating ideas to enhance the positive factors and suppress the negative. In engineering teams retrospectives were routine, but Continuous Delivery required input and perspectives from many other areas including architects, product, quality assurance, and testers. Cross team retrospectives generated some of the most useful insights; they encouraged conversation, a sense of shared purpose, and community. Reflecting, I think these should have been run more regularly. When radical change is taking place the pace of change can be startling, but it’s not always positive change, and therefore not always welcomed, especially if the long term benefit is masked by short term inconvenience.
Focus on Value
Another useful concept was thinking, and occasionally obsessing, about the value that processes, roles and tools bring to the release process. This led us to challenge the interfaces between dev and ops, and also the way we assessed and managed risk. It’s easy for habits and practices to form, and over time the bad and indifferent may become accepted and unchallenged. One straightforward technique is simply challenging process steps and asking “What benefit does that actually add to us?”. We also employed a more rigorous technique named Value Stream Mapping. This is a way to visualise and understand the entire product development process. Given the opportunity to try this again, rather than spending time and discussion trying to get a single ‘true’ mapping I’d encourage separate groups to produce their own value stream mappings, and then compare the differences. This should surface differences in interpretation, and quickly reveal those ‘Oh, we only did that because we thought your team needed that’ issues.
4. What we did – what worked and what didn’t
Just enough tools
One early initiative aimed to improve the release process by reducing errors and speeding up handling of configuration information. Handover was needed because different environments on the path to production were owned by different teams. Tools quickly became complex, trying to serve both providers and consumers of data. A kind of arms race developed, with requirements being added to combat specific failures. As they grew unwieldy, some of the tools felt like more of a hindrance than a help. These tools tended to manage the interface between teams, but this just reinforced differences and, even worse, discouraged conversation and collaboration.
To assist with releases new tools, process and automation were introduced. We wanted to put responsibility for application deployment, previously the preserve of operators, into the hands of anyone competent, especially engineers. Incorporating the requirements of ops and devs, we developed a deployment tool which enforced the release pipeline, managed configuration, audited and orchestrated deployments to different environments. It did just enough, and we relied on healthy collaboration and diligence to do the rest.
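To make the idea of a tool that "enforces the release pipeline" and audits deployments concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the tool described above: the class name, the `PIPELINE` list and the "staging" environment are hypothetical, and a real tool would also push configuration and binaries to servers.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical environment names, in the order a release must pass through them.
PIPELINE = ["integration", "staging", "live"]


class PipelineError(Exception):
    """Raised when a deployment would skip a stage of the pipeline."""


@dataclass
class DeploymentTool:
    # Maps (service, version) to the environments that version has reached.
    history: dict = field(default_factory=dict)
    # Append-only record of who deployed what, where and when.
    audit_log: list = field(default_factory=list)

    def deploy(self, service, version, environment, user):
        if environment not in PIPELINE:
            raise PipelineError(f"unknown environment: {environment}")
        reached = self.history.setdefault((service, version), [])
        stage = PIPELINE.index(environment)
        # Every earlier stage must already have received this exact version.
        missing = [env for env in PIPELINE[:stage] if env not in reached]
        if missing:
            raise PipelineError(
                f"{service} {version} has not been deployed to {missing[0]}"
            )
        if environment not in reached:
            reached.append(environment)
        self.audit_log.append(
            (datetime.now(timezone.utc), user, service, version, environment)
        )
```

The point of the sketch is how little the tool needs to know: an ordered list of environments and a record of what went where. Anyone competent can call `deploy`, and the ordering rule and audit trail provide the safety net.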
The approach to dependency management was a good example. The service oriented architecture contained complex dependencies, but software seemed an expensive way to handle the problem, so we saved effort with a few simple rules. As usual, engineers took responsibility for automated testing and keeping services compatible with their consumers. Then we added the notion of a ‘deployment transaction’, an exclusive lock across the integration and live environments. This meant only one service could be tested and deployed at a time, reducing the risk of deployment order problems. A useful side effect was that the approach encouraged conversation between teams queuing for the transaction, promoting an understanding of the different moving parts of the system.
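The ‘deployment transaction’ can be pictured as a single exclusive lock that a service must hold while it is tested in integration and deployed to live. The Python sketch below is a hedged illustration only: the class and exception names are hypothetical, and an in-process `threading.Lock` stands in for whatever shared mechanism a real multi-team setup would need (the article does not say how the lock was implemented).

```python
import threading
from contextlib import contextmanager


class TransactionBusy(Exception):
    """Raised when another service already holds the deployment transaction."""


class DeploymentTransaction:
    """One exclusive lock covering both the integration and live environments."""

    def __init__(self):
        self._lock = threading.Lock()
        self.holder = None  # name of the service currently deploying, if any

    @contextmanager
    def begin(self, service):
        # Fail fast rather than block, so the waiting team knows who to talk to.
        if not self._lock.acquire(blocking=False):
            raise TransactionBusy(
                f"deployment transaction is held by {self.holder}"
            )
        self.holder = service
        try:
            yield self
        finally:
            # Always release, even if testing or deployment raised an error.
            self.holder = None
            self._lock.release()
```

A caller would wrap the whole test-and-deploy sequence in `with transaction.begin("search"):`. Reporting the current holder in the error is a deliberate choice in this sketch: it nudges the queuing team towards the conversation described above.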
Organisation and leadership
In the classic set up, Engineering and Operations were separate organisations, and there was an impedance mismatch between the two: each had its own teams, ways of working, leadership and styles. Crucially, incentives were different. To summarise gratuitously, ops were incentivised for stability and devs for delivering change. It was evident that we needed to create a structure which encouraged more collaboration and promoted shared goals.
Although we’d love to think that most problems can be solved at a technical level, in a large organisation the rate of change is often accelerated by evangelists, strong leadership and a critical mass of people setting a good example. A forward thinking manager stepped up to lead both Operations and Development teams, straddling the divide. This helped to change behaviours, and crucially enabled him to see and understand both perspectives, a priceless source of feedback. An org chart is just a diagram though, and although it can signal a clear intention, it doesn’t necessarily mean anything will change, or stay changed, and that’s why culture is so important.
The culture in Engineering was based on Agile, using a mix of Scrum, Kanban and Lean concepts. This meant we were already some way towards the DevOps culture we wanted to create. There was already some collaboration, and overlapping knowledge, but there were specific areas to enhance, and be wary of. Relaxing some of the rules and letting people work out of their normal areas of expertise could lead to failures, and tempt a demoralising and energy sapping blame culture. Key cultural principles included trust, acceptance that things would break occasionally, shared learning, and responsibility in the right place.
Adoption of the deployment tool was one of the first things to test our culture. Hitherto ops made all the production changes and had intimate knowledge of the systems, so why on earth should they trust engineers to make unsupervised deployments? To learn more we used a safe-fail experiment: a ‘Deployment workstation’ was set up in the heart of the ops space. It was the only place deployments could be made from, and deployments could only be made with both an operator and a developer present. Over time experience was gained, bugs were fixed, the tool was refined and trust grew, until the tool became ubiquitous.
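The workstation rule was enforced socially, by where the machine sat and who was in the room, but the same two-person check is easy to state in code. This is a purely illustrative Python sketch; the `Person` type, the role names and the workstation identifier are all assumptions, not part of the original set up.

```python
from dataclasses import dataclass

# Hypothetical identifier for the single machine allowed to deploy.
DEPLOYMENT_WORKSTATION = "deployment-workstation"


@dataclass(frozen=True)
class Person:
    name: str
    role: str  # assumed to be either "operator" or "developer"


def may_deploy(present, workstation):
    """Allow a deployment only from the designated workstation, and only
    when at least one operator and one developer are present."""
    roles = {person.role for person in present}
    return (
        workstation == DEPLOYMENT_WORKSTATION
        and {"operator", "developer"} <= roles
    )
```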
During the times when developers and ops people weren’t talking frequently, there was a kind of ‘reactive collaboration’, a tendency to talk only when things went wrong. This meant poor forward planning, and encouraged a defensive stance from both teams. We aimed to encourage early conversations, as collaboration is always easier if you share the same vocabulary, so we ran cross team training courses and workshops.
It is interesting to note that the next hire is a vital opportunity, not just to augment the skills of the team, but to steer culture in a desired direction. So, we considered what kind of skills and principles would further a DevOps culture. In an interview it’s informative to ask an ops type how he feels about software developers working directly on production systems, or an engineer how he would feel if his code went live 30 minutes after commit? Blood draining from the face and gripping the arms of the chair are not promising signs.
5. Where we are now
The DevOps mentality was one of the foundations for improving our release capability, but the benefits of the approach are often hidden under the Continuous Delivery banner. Let’s walk through the areas that were so problematic when releases were infrequent:
Quality - In general the quality of services has increased. The role of DevOps in this is most visible when there are live incidents such as bugs, outages or increasing response times. This shows in both how production issues are noticed, and what happens when they are. Production system monitors could be roughly categorised into two areas: infrastructure and application. Infrastructure monitors (Nagios, Keynote) are created with the needs of ops in mind, while application monitors (Graphite) are built more for devs’ requirements. The overlap, or redundancy, between monitors is desirable: two perspectives on the same systems act like a parity check. When an incident does occur, collaboration is via Campfire, directly between the people that are needed and can add value. If you’ve ever been the person sweating over the console on a production server, with a project manager over one shoulder, and your boss on the other, you’ll understand how refreshing this is.
Risk - The risk, that is the likelihood of problems arising as a result of release activities, has decreased. Smaller incremental changes reduce the risk per change, and crucially make it easier to assess risk. In a healthy DevOps relationship people either have enough knowledge to make an assessment themselves, or know when to draw in expertise from another team.
Motivation - I believe people’s motivation has improved. There are still plenty of frustrations, but knowing that deployment times are short, and that ops and engineers will support each other, makes a big difference. With responsibility in the right place, people are doing the roles they signed up for and more time is spent on creating, rather than delivering, product. For ops this means less time fighting fires, and more time investing in our platform, particularly the automation side.
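The ‘parity check’ idea, two independent views of the same system cross-checking each other, can be sketched in a few lines of Python. This is an assumption-laden illustration: it does not call Nagios, Keynote or Graphite, and simply takes each monitor's view as a plain dictionary of service name to a status string such as "ok" or "failing" (a hypothetical format).

```python
def compare_views(infrastructure, application):
    """Return the services on which the two monitoring views disagree,
    or which only one of the two monitors reports on."""
    findings = {}
    for service in sorted(set(infrastructure) | set(application)):
        infra_status = infrastructure.get(service)
        app_status = application.get(service)
        if infra_status is None or app_status is None:
            findings[service] = "only one monitor reports on this service"
        elif infra_status != app_status:
            findings[service] = (
                f"infrastructure says {infra_status}, "
                f"application says {app_status}"
            )
    return findings
```

A disagreement does not say which monitor is right, only that something deserves a closer look, which is what a parity check offers.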
Cost - The time cost of delivering a change is where DevOps ways of working have really had a huge impact. Shared deployment tooling eliminates handover time. Collaboration, planning ahead and early testing all mean that applications are almost certain to work on production infrastructure, rather than hold nasty surprises. Costly escalations and investigations by management are reduced as a result of greater understanding and trust between the two areas.
Who gets the pager?
In case you’re wondering about the classic ‘who gets the pager?’, it’s ops. They have the skills and experience to deal with live issues, not to mention ways of working geared towards call out and fast reactions. There is something different though: involvement with R&D engineers is much quicker, and an accepted way of working.
6. The Challenges we face now…
We now face two classic challenges that follow change – keeping going, and getting better. Keeping going, and sustaining a fledgling DevOps culture, should be realistic: engineers and ops are getting things done by working and learning together, and the business sees clear advantages and is supportive. Considering Mitchell Hashimoto’s Range of DevOps we’re veering somewhat to the right, so while developers are enthusiastic to take on new challenges, we should take care that the operator’s perspective is represented and acted upon.
John Willis makes a sound point that DevOps is about Culture, Automation, Measurement and Sharing. Automation and Measurement could be our next area for improvement, both areas needing to keep pace with change in our technology and be developed alongside features, rather than afterwards. The technical side of DevOps will never be done. We frequently experiment with new languages and need new approaches to provisioning, like Clojure in conjunction with Pallet.
There is also the challenge of applying elsewhere what we’ve learnt while bringing two quite disparate teams together. Huge benefits have been gained by focusing on collaboration and understanding others’ responsibilities, perspectives and priorities. The behavioural patterns that led to our interest in DevOps lurk elsewhere in the org, and similar benefits could arise by focusing the spotlight on other teams.
7. In conclusion…
It’s pretty hard to summarise three years of graft, learning and change, especially when so many brilliant and diverse people were involved. To assess the impact of DevOps, Continuous Delivery, Agile or anything else we did, and draw out useful insights, is equally tricky.
There were some key building blocks. Living agile principles enabled us to inspect, adapt and learn. It feels like we’ve made progress towards becoming an organisation which learns (Jez Humble has a post on why this matters more than almost anything else). These habits and ways of working (our culture, you might say) were crucial; many ideas stemmed from them, as did the impetus to keep improving.
Following the thoughts and principles of people like Etsy, Flickr, Thoughtworks and Jez Humble’s awesome book, enabled us to learn quickly. The emergence of the DevOps community, its friendliness and willingness to share (typified by DevOpsDays) helped us realise the significance and benefits of the practice.
A key first step was to acknowledge failures, stop to plan our next move, and start to change. Things could have moved quicker, but we promoted Agile and DevOps concepts and persevered, and at some point momentum gathered and we began to evolve rapidly. At the same time our products improved and enthusiasm increased. Getting time and commitment were some of the greatest challenges. Recognising that some things can’t change, yet, and waiting patiently for the right opportunity, was just as important.
In summary, I think these experiences have shown that DevOps behaviours can be introduced, and sustained, in a large organisation, but it needs the five P’s: promotion, planning, perseverance, patience, and of course, pizza.
About the Author
John Clapham is a Software Development Manager in Nokia's Entertainment division based in Bristol. Previously, as Product Owner for the Continuous Delivery Team, he helped transform the Entertainment platform release process from an expensive, once every three months exercise to a once every 30 minutes routine. John is passionate about agile, coaching, coffee and finding new ways to build great products. John can be found on Twitter as @johnC_bristol, and on LinkedIn.
Dmytro Svarytsevych Oct 30, 2014