DevOps at Seamless: The Why, How, and What

Nov 29, 2015 24 min read

The key thing about DevOps is understanding under which circumstances it should be introduced to your organization.

Starting with “why” is crucial as there is probably no greater (and more expensive) failure than choosing the wrong tool for a problem on an organizational level. Nevertheless, let us assume that you know the “why”. The next question to ask is how to address the challenge. Let us assume that DevOps may be the answer.

What remains is determining what to do to get there. Microservices architecture, continuous integration, continuous deployment, test automation, monitoring automation, infrastructure automation, and so on are frequently associated with DevOps, but treating DevOps as only a set of tools risks having those practices withdrawn, replaced, or diminished whenever your company faces a crisis. To increase the chance that the change is permanent, DevOps needs to become part of your organization’s culture, and everyone needs to understand why you got there.

This article focuses on why DevOps is needed, what concepts and values should support it, and how we implemented it at Seamless: what results we obtained and the challenges we faced. The ideas in the “Why?” and “How?” sections are fairly universal and can be copied. The implementations presented in “What?” are highly contextual for Seamless and should be treated as examples or inspiration.

Why?

Risk and instability are natural features of the business environment. New companies can pop up with innovations that disrupt the market and players in the market consequently can quickly earn or lose market share. Analysis confirms that there is a higher degree of uncertainty nowadays ¹ ²: leading in market share is even less correlated with leading in profitability (7% of cases in 2007 versus 34% in 1950), the volatility of operating margins has doubled since 1950, and the probability that a market leader will fall out of the top three in its industry has increased from 2% to 14% since 1960. Sustaining a competitive advantage is harder than ever.

Take, for example, the mobile-payments market in which Seamless operates with SEQR. At the moment, this industry has no worldwide leader; no company has attained that position, regardless of how much cash it has. Players in the market vary from hundreds of FinTech startups to big fish like Apple, Google, banks, and telecoms.

The same degree of fragmentation applies to how these companies conceptualize what mobile payments really are. Should a smartphone emulate a debit or credit card? Should a mobile payment system completely bypass card rails by directly integrating with a consumer’s bank account? Should it, on the other hand, be prepaid with no cards or bank account involved? These questions remain unanswered. Moreover, mobile payments are not explicitly regulated in the EU or the US, and the unknown legislative future is a significant threat for any company offering mobile payments.

On top of all of this, consumers are not necessarily convinced that using a mobile phone to pay is more convenient or secure than a card. Nevertheless, any company entering this market is attracted by a promise that the industry will grow 30% annually through 2019 ³.

The global mobile-payments market is a system with no defined best or even good practices. The variety of approaches to mobile payments executed by multiple companies means that you are able to understand why things happen only in retrospect. The cause and effect are frequently only obvious in hindsight.

How?

The mobile-payments market is best described by the complex context of the Cynefin model ⁴. This context is characterized by the fact that there is little or no causality and that agents with few constraints can modify the system, making it unpredictable and in permanent flux. Unlike simple or complicated systems with best or good practices emerging from sense/categorize/respond or sense/analyze/respond actions, respectively, the decision-making process in a complex domain should be based on a probe/sense/respond approach. The patterns of action in such a context emerge from experiments performed in a safe-to-fail environment. The process imposes special responsibility on leaders, who should turn on an experimental mode of management and accept the frequent failures from which patterns of actions emerge. Those patterns are interpreted to decide if a given experiment should be amplified or withdrawn.

A company’s main strategy should be to become as adaptive as possible — in other words, to keep the Cynefin model’s probe/sense/respond cycle as short as possible. Classical approaches to strategy definition, like the Porter five-forces analysis, no longer fit the new reality. As mentioned, in this reality, a direct relation between cause and effect may be impossible to predict and even measuring the intensity of certain factors can be extremely hard to do. John P. Kotter observed that “The hierarchical structures and organizational processes we have used for decades to run and improve our enterprises are no longer up to the task of winning in this faster-moving world.” ⁵

This relates to product development, software development, and operations. Low performance in software delivery makes a company incapable of short probe/sense/respond cycles. It takes too much time for the company to sense what has changed in the market and to react accordingly, which may lead consumers to reject its product and, eventually, drive the company into bankruptcy.

This is why concepts like Lean Startup became popular in product development. Lean Startup established a process for developing a product that helps it to fail cheaply and quickly. The lessons of such failures help to improve existing products or to build a better, new product (pivot). The key thing, however, is to minimize the time the build/measure/learn loop takes. The lean-startup cycle relates to the probe/sense/respond cycle of the Cynefin model.

At Seamless, we try to use Lean Startup whenever we undertake to deliver new value to our consumers. What we have learned is that we should assume by default that our hypotheses about our customers (what they need, what problem they have which is still not solved, etc.) are wrong and that we need to prove that they are right. Being able to learn this as fast as possible is crucial to increase the business accuracy of our work and to keep us adaptable to the ever-changing business environment. We also work in Scrum. If every sprint ends with a shippable increment, it means that every sprint, ideally, is an occasion to perform a build/measure/learn loop (or a probe/sense/respond cycle). If teams run two-week-long sprints, there are approximately 25 occasions per team per year to make a pivot with a product and learn something new about customers.

However, there is a catch. Just because an organization adopted Scrum and Lean Startup does not mean that the frequency of pivots will automatically increase. Scrum will not solve problems of low performance in an organization. It will make them visible by forcing a high cadence of sprints that should end up with potentially shippable increments. It is the responsibility of all people involved to find a way to achieve this. If you look at the software-delivery lifecycle as a digital supply chain ⁶, it consists of the following steps: plan, code, integrate, test, release, deploy, and operate. A high-performing organization maximizes the code step, as it is the moment when actual business value is created. All the following steps should take as little time as possible. In other words, if a team spends six days of a two-week sprint on developing new features and the remaining four days on integrating, testing, releasing, and deploying, then productivity could ideally increase by automating all the steps apart from coding. On the other hand, a team that has already achieved high performance could start working in one-week sprints to shorten the build/measure/learn loop and collect feedback from the market faster.
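The sprint arithmetic above can be made concrete with a small sketch. The function names are hypothetical, the figure of 50 working weeks per year is an assumption (52 weeks minus roughly two weeks of holidays), and the "one day of non-coding work after automation" case is purely illustrative rather than a Seamless measurement.

```python
# Illustrative only: helper names and the 50-working-week default are
# assumptions, not figures taken from Seamless.

def loops_per_year(sprint_weeks: int, working_weeks: int = 50) -> int:
    """Number of complete sprints, i.e. build/measure/learn occasions, per year."""
    if sprint_weeks <= 0:
        raise ValueError("sprint_weeks must be positive")
    return working_weeks // sprint_weeks


def coding_share(coding_days: float, other_days: float) -> float:
    """Fraction of a sprint spent on the code step."""
    total = coding_days + other_days
    if total <= 0:
        raise ValueError("a sprint must have a positive length")
    return coding_days / total


if __name__ == "__main__":
    # Two-week versus one-week sprints.
    print("loops/year, 2-week sprints:", loops_per_year(2))
    print("loops/year, 1-week sprints:", loops_per_year(1))
    # Six days of coding and four days of integrate/test/release/deploy,
    # as in the example in the text.
    print("coding share today:", coding_share(6, 4))
    # Hypothetical: automation shrinks the non-coding steps to one day.
    print("coding share after automation:", coding_share(9, 1))
```

Under these assumptions, halving the sprint length doubles the number of feedback occasions, while automation raises the share of each sprint that goes into the step where business value is created.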

One way to become such a high-performing organization is to adopt DevOps practices. Conclusions from the 2015 State of DevOps Report ⁷ are clear:

High-performing IT organizations have a strong and positive impact on the overall performance of the organizations they serve.
High-performing IT organizations deploy 30 times more frequently with 200 times shorter lead times, have 60 times fewer failures, and recover from failure 168 times faster. This pace is sustainable.
High performance is achievable regardless of whether applications are greenfield, brownfield, or legacy.

IT plays a profound role in reducing time from concept to cash, as Mary and Tom Poppendieck would call it ⁸.

What?

DevOps needs to stem from organizational culture to increase its chances of surviving the crises that every company inevitably undergoes. At Seamless, we are trying to create an adaptive strategy to help us deal with an unpredictable business environment, which the Cynefin model defines as complex. To get there, we use Scrum and Lean Startup. However, these concepts do not explain how to obtain the high performance required for short build/measure/learn cycles. Chances are that DevOps is the answer.

There are several problems that we encountered in building a high-performance organization or that I witnessed as a coach. I agree with W. Edwards Deming (creator of Total Quality Management) that 95% of the performance and quality problems in a system are caused by the system itself. The remaining 5% should be attributed to the people working in that system. Hence, even though the problems I will describe relate mostly to people, solutions should be applied to the system.

Problem #1: Us versus them

“It is always someone else who does something wrong, not me.” This blame mindset is absent in high-performing organizations. Its existence is easy to determine: observe if any blame games take place in an organization. For example, a bug pops up in the production environment after a rollout. The development team accuses the operations team of incorrectly rolling out the product while the operations team accuses the development team of poor or incorrect documentation.

Problem #2: Contrasting responsibilities

Frequently in organizations, development teams are responsible for throughput, producing as many new business features as possible, and so are the ones responsible for introducing change. The operations team, on the other hand, maintains stability of the production environment, which means that its role is to protect against change. These opposing responsibilities often mean that these two teams have different bosses, account for different things, and have misaligned goals.

Problem #3: Handovers

A handover happens when one person or team passes work to another. During a handover, there is a risk that information and context will be partially lost.

A handover takes place, for example, when a development team gives an application to the QA team for testing. The QA team completes its tests and in turn passes the application to the configuration-management team, which takes care of deployment to production. Eventually, a maintenance team takes over the application and fixes any bugs. A number of problems may arise from this typical scenario, among them: the QA team may not have been able to identify and verify all test cases; the configuration-management team could make a configuration error when moving the application from the staging environment to a production environment; and the maintenance team may introduce a fix that solves a bug but is not consistent with the software architecture, which in the long run increases maintenance costs.

Problem #4: Little or no feedback loop

If a system is constructed as described in Problem #3, with separate development, QA, and operations teams, people often have no occasion to learn from their mistakes. For example, how is a developer supposed to learn about an error they made in the code if another team discovered the bug in production and fixed it? Such a system does not foster improvement in people or teams, as little knowledge and experience is exchanged among them.

The aforementioned problems are visible among people but the root cause is the system in which they work. The software they create is often not easily testable and deployable, which increases the concept-to-cash time. Consequently, the build/measure/learn (or probe/sense/respond) cycle becomes longer, which results in an organization being less adaptive.

At Seamless, we reached several solutions for counteracting these problems through changes to the system.

Solution #1: Think about Conway’s law

In 1967, software engineer Melvin Conway observed that “organizations which design systems… are constrained to produce designs which are copies of the communication structures of these organizations.”

Hence, it should not be surprising that an organization that decided to create competence silos based on goals (development, testing, deployment, maintenance) created a system in which the flow of value mirrors the organizational structure.

At Seamless we tried another concept. We started to create teams consisting of developers, testers, and system administrators. Every team had one boss and one goal: to deliver business value to our consumers as fast as possible with no compromise on quality. The result was immediately visible. Teams started to not only think about how to implement a given business goal but also to think in terms of how easy it will be to test it and deploy it on production.

In our case, a team usually consists of three developers, one QA engineer, and one system administrator. This setup proved to be successful for us on one condition: everyone understands that the whole team is responsible for testing and deployment and must fight as one against bottlenecks. The QA engineer and system administrator are responsible for teaching the developers what they need to know about testing, deployment, and infrastructure so that any member of the team can take care of everyday operational tasks. By doing so, the QA engineer and system administrator gain time to take care of strategic activities, like taking the first steps towards automation of testing, infrastructure, and deployment, which the software engineers can later develop further.

As a side effect, team members started to develop skills beyond their area of specialty. Developers improved their QA and sysadmin skills, QA engineers improved development and sysadmin skills, etc. This does not make individuals cross-functional; they are still experts in their main domains, but they are no longer lost in the others. People are becoming, and hopefully see themselves as, product developers rather than only developers, QA engineers, or admins.

Software engineers, QA engineers and system administrators become product developers by being in one cross-functional team

Solution #2: Delegate for distributed control of a complex system

Product development takes place in a complex system, centralized control of which may be impossible to achieve or prone to erratic decisions because of the inability of a single person or even a group to gather all information and context. Jurgen Appelo ⁹ suggested that control over a complex system should be distributed via delegation, in our case to teams. Delegation means autonomy to make decisions but also responsibility for their consequences. In practice, the teams deal with the system on a daily basis and understand it much better than any manager or architect, so it is the teams that make decisions regarding the architecture of the system and track down dependencies in both directions on other parts of the system (other teams). An architect is present in this process, but as an advisor, not a decision maker.

Teams that are responsible for delivering business value end to end (from concept to cash), and feel that they have autonomy and responsibility for the application, create applications that more accurately track business goals and have shorter lead times (testability and deployability). Furthermore, there is a higher degree of ownership of an application as no one can be blamed for the status quo apart from the teams themselves. This, in turn, results in a team’s willingness to improve, as any bad decisions made by the team come back to hit it.

On the other hand, such a system is a challenge for a manager. First of all, a manager needs to accept that teams will make mistakes and suboptimal decisions. What a manager should focus on at the beginning is patience, empathy, constant monitoring of decisions for accuracy, and creating a supportive environment for the team. Secondly, a manager should advise the team on the consequences of their potential decisions by connecting any dots that may be invisible. For example, a decision to implement a new application in an exotic technological stack can influence the team itself as the applications under its care become divergent (it may decrease velocity of a team) and HR, which will have to include this exotic requirement in the recruitment process (and which, perhaps, will be unable to find a specialist in the given area).

We realized at Seamless that alignment of teams is necessary for their autonomy. It means that teams should understand what the common goal is but have the freedom to decide how to reach it. We recognized two levels of alignment: first, team members understanding and supporting a team’s vision and values and, second, operational alignment, which is consistency in areas like the technology stack used, testing, and deployment strategy.

For vision and values alignment, we decided to use V2MOM ¹⁰, a concept developed by Salesforce.com. A manager of several teams defines for them a V2MOM: a document consisting of a vision (V), the values supporting it (V), methods of reaching it (M), obstacles that may prevent teams from fulfilling the vision (O), and measures of progress towards the vision (M). A V2MOM on the team level is a manager’s belief about the best way to achieve a given business goal; it is a manifesto of the manager’s management routine. The manager also asks all team members to create their own personal V2MOM in which they describe their own visions, values, methods, obstacles, and measures.

A team member is not required to agree with the team’s vision and values but is required to understand them. Each team member and the manager compare the personal V2MOM and team V2MOM documents in quarterly discussions, leading to a better understanding by both parties of how a member’s personal vision and values are aligned, misaligned, or neutral with respect to the team’s. Moreover, any conclusions growing out of these discussions have no effect on salary whatsoever. A manager starts to better understand team members and can search for patterns that indicate why people like or dislike certain aspects of their work by observing how a member agrees or disagrees with the team’s vision and values. Eventually, a manager learns what sort of workplace to create to amplify actions that help to achieve business goals. A real-life example of a team V2MOM is presented below. An equally important aspect of V2MOM discussions (apart from alignment) is that they encourage employees to communicate regularly and to work towards at least their personal vision and values, which they accept.

Team V2MOM
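For readers who prefer to see the document structure spelled out, the five V2MOM sections can be sketched as a simple data structure. This is only an illustration: the class, the `shared_values` helper, and the sample entries are hypothetical and do not reproduce the Seamless team V2MOM or any Salesforce tooling.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class V2MOM:
    """The five sections named in the text: vision, values, methods,
    obstacles, and measures."""
    vision: str
    values: List[str] = field(default_factory=list)
    methods: List[str] = field(default_factory=list)
    obstacles: List[str] = field(default_factory=list)
    measures: List[str] = field(default_factory=list)


def shared_values(team: V2MOM, personal: V2MOM) -> List[str]:
    """Values present in both documents (case-insensitive), as one possible
    starting point for the quarterly discussion."""
    personal_values = {value.lower() for value in personal.values}
    return [value for value in team.values if value.lower() in personal_values]


if __name__ == "__main__":
    # Hypothetical content, invented for this sketch.
    team = V2MOM(
        vision="Deliver business value as fast as possible without compromising quality",
        values=["Quality", "Transparency", "Ownership"],
        methods=["Cross-functional teams", "Automated deployment pipeline"],
        obstacles=["Short-term goals crowding out slack time"],
        measures=["Lead time from concept to production"],
    )
    member = V2MOM(
        vision="Grow into a product developer",
        values=["ownership", "Learning"],
    )
    print(shared_values(team, member))
```

A comparison like this only surfaces overlaps in wording; the value of the exercise described above lies in the conversation between the manager and the team member, not in an automated match.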

Solution #3: Make it safe to fail

Surviving in an unpredictable market requires experimenting to find actions that should be amplified. It does not only mean experiments with the business model on an organizational level. Experiments need to cascade down the organization, which results in DevOps teams experimenting with technologies and methodologies to react to changing requirements.

At Seamless, as far as technological experiments are concerned, teams are entitled to ask a Product Owner for a spike beyond the ordinary cadence of sprints. A spike is a special type of story whose main goal is to reduce risk. Risk may be connected with the verification of a certain technology, proof of concept for a design, or preparing a draft of a deployment pipeline. With a spike, a team is able to more accurately commit to upcoming sprints. The key thing about a spike in our case is that it is timeboxed (it lasts a couple of days) and the expected outcome is strictly defined (e.g., the technological stack for solving a certain business goal must be defined). Both parties must trust each other for a Product Owner to accept a spike request from a team. To achieve this, a team must be transparent with the Product Owner on a daily basis, regardless of whether a sprint ends in success or failure.

A Product Owner may express a more advanced level of trust by providing slack time for a team. It means that a team, knowing its velocity, does not commit 100% of its capacity — and the Product Owner does not question this. The team plans to devote slack time to improvements or experiments of all kinds. We have observed several times at Seamless that people with spare time voluntarily spend it on fixing things that cause the most pain. What needs to be guaranteed, however, is that all the problems an organization has are visible, named, prioritized, and understood by everyone. Sadly, for us, having a sustainable amount of slack time every sprint is hard to achieve because short-term business goals often take priority over long-term goals. Perhaps the ability to provide slack time is a measure of agility and maturity of an organization.

People will not be eager to experiment if they are punished for failure. So instead of searching for a guilty person who has made a bad decision, call for a postmortem to analyze why it happened. A pre-condition for a safe-to-fail environment is having a manager who believes that people are willing to do the best job they can, that they can be proud of it, and that they make the best decisions they can based on the information they have. It is the responsibility of a manager to set up an organization in which all the information needed to make such decisions is accessible.

A reference point for valuable postmortems can be found at Etsy. ¹¹

Solution #4: If it hurts, do it more often

“If it hurts, do it more often, and bring the pain forward” ¹² means releasing as frequently as possible, however counterintuitive that may seem. With frequent releases, the change introduced by any single release to the production environment is kept small, which makes it manageable by a single person. In other words, by simplifying releases, you move from the Cynefin model’s complex domain towards its complicated domain, in which good practices apply and there is no need for experiments. The tool to make it happen is the acceptance criteria set up by the Product Owner in Scrum. If the Product Owner announces that each sprint ends with a shippable increment, the team, having autonomy and responsibility, is forced to commit to a feature that it will be able to deliver to production by the end of the sprint. Initially, the team may appear to have low velocity, but the velocity was already like that and Scrum is only making it visible.

Let us focus on the other part of the quote, “and bring the pain forward”. The responsibility of a manager is to make all problems in a delivery lifecycle transparent and to make sure that teams are aware of them and accept them. If teams start to perceive those problems as their problems, they will use their autonomy and responsibility to eventually find a solution. At one point at Seamless, a manager and a ScrumMaster noticed that the production of more and more new applications had started to increase the lead time (time elapsed from concept definition to production rollout). The first action was to help teams realize that the problem exists. Subsequently, the teams were given a goal of decreasing the lead time. Neither a manager nor an architect was able to solve the problem alone, but the teams, feeling ownership of their applications, managed to brainstorm a solution and convinced the Product Owner of the necessity of investing in a deployment pipeline.
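Making lead time visible starts with measuring it. The sketch below computes lead time as defined in the text (time elapsed from concept definition to production rollout) and a median over several features. The function names and the sample dates are hypothetical; how Seamless actually tracked lead time is not described in this article.

```python
from datetime import date
from statistics import median
from typing import Iterable, Tuple


def lead_time_days(concept_defined: date, rolled_out: date) -> int:
    """Days elapsed from concept definition to production rollout."""
    if rolled_out < concept_defined:
        raise ValueError("rollout cannot precede concept definition")
    return (rolled_out - concept_defined).days


def median_lead_time(features: Iterable[Tuple[date, date]]) -> float:
    """Median lead time in days over (concept_defined, rolled_out) pairs."""
    return median(lead_time_days(start, end) for start, end in features)


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    history = [
        (date(2015, 1, 5), date(2015, 1, 26)),
        (date(2015, 2, 2), date(2015, 3, 9)),
        (date(2015, 3, 2), date(2015, 4, 20)),
    ]
    for start, end in history:
        print(start, "->", end, ":", lead_time_days(start, end), "days")
    print("median lead time:", median_lead_time(history), "days")
```

A rising trend in such a number is the kind of signal that can help a team recognize the problem as its own before deciding how to address it.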

Solution #5: Influence generative culture

Generative culture (defined by Ron Westrum ¹³ and repeated in the 2015 State of DevOps Report ⁷) is typical for performance-oriented organizations. The presence of this culture can be confirmed by observing the following indicators: high levels of cooperation, shared responsibility, broken silos, time for experimentation, and blameless postmortems. Usually, introducing DevOps in an organization means a culture change. I agree with Pawel Brodzinski ¹⁴ that the only thing a manager can do with a culture is to influence it. A manager cannot force it upon a group as it is the sum of the behavior of all people in the organization.

However, if culture is behavior, then behavior may be influenced by changing the constraints that an organization imposes on its people. This is how the DevOps transformation took root at Seamless. Initially, one new team was formed and the members were told that they would need to do everything (development, test, deployment, maintenance) on their own with no support from other teams due to a lack of people. This change in constraints noticeably forced the new team into cooperation and shared responsibility. Otherwise, the team members would fail to deliver the business goals expected of them.

Consequently, a culture pocket (also known as a culture bubble: a group of people with a distinctive culture) emerged in the company. The thing about culture bubbles is that they are fragile and can be easily and unintentionally destroyed in various ways. For example, when a different part of the system required more work, heavily loaded teams had a natural tendency to question the existence of other teams and their goals. If the members of the new team were made responsible for the original areas, they would get support from the operations team and their requirement to take care of anything beyond development and tests would vanish. The team’s constraints would change again, and so would its culture. In my experience, the early phase of culture change is a delicate process and a manager should take extra care to defend the part of an organization undergoing that change. Otherwise, the experiment may fail, which may lead to wrong conclusions.

Once generative culture is established in an organization, the DevOps transformation may be considered a success. A reliable test of the permanence of a change is to wait for the next considerable crisis in a company, which filters out temporary changes from solid ones. The latter will survive. Paying attention to indicators of the generative culture is crucial in understanding the progress of the transformation.

An example of a crisis we experienced was moving our software engineering department from Sweden to Poland. Processes that appeared to be solving synchronization and communication issues between the two offices ceased to exist after the change. For instance, before the move, teams had appointed team leads whose responsibility was to know what was happening in other teams (in Sweden and in Poland) and what technological decisions were being made. With all teams located in one office, team leads were no longer needed and were replaced by weekly meetings at which the whole department discusses operational and strategic issues.

Summary

Organizations that adopt DevOps go through a change that affects both processes and culture. Successful culture change increases the chance that a DevOps transformation will be permanent. One of the biggest challenges, however, is the readiness of an organization to empower teams by delegation of control. It may cause resistance among managers who need to switch from command-and-control mode to focusing on the creation of a supportive environment and influencing a generative (performance-oriented) culture in which it is safe for people to fail. This safety is a necessary condition for experiments to flourish. Those experiments are a foundation for operating in the complex domain (as defined in the Cynefin model) in probe/sense/respond fashion. It is worth emphasizing that the overwhelming majority of the effort in DevOps transformations falls on teams, who need to start dealing with more problems and take responsibility for the actions taken.

The DevOps transformation is worth the risk of temporary instability in a company connected with the change. The potential reward is becoming a high-performance organization in an unpredictable market, which may result in becoming a leader or even mere survival, depending on the competition.

About the Author

Tomek Pająk (Lodz, Poland) has held several roles in IT: Software Engineer, IT Architect, and Engineering Team Lead. At the moment, he is software engineering manager at Seamless Payments and is responsible for services built on top of the core product SEQR, a mobile-payments solution available in 12 countries (including the US, UK, Sweden, and most of the Eurozone). To obtain competitive advantage, he uses Lean Startup and Scrum (for product development) and DevOps (for strong IT performance). He is also a coach at Sages, helping companies to improve their businesses through the adoption of agile/lean concepts and certain technologies. Tomek received his MBA from Akademia Leona Kozminskiego and an MSc in Telecommunications and Computer Science from Technical University of Lodz. He speaks at international conferences such as Agile Lean Europe, Agile Eastern Europe, Atmosphere, and AgileByExample. You can reach him on LinkedIn and Twitter (@tomekatwork).

Bibliography

¹ M. Reeves, M. Deimler, “Adaptability: The New Competitive Advantage”, Harvard Business Review, 7-8.2011.

² M. Reeves, C. Love, P. Tillmanns, “Your Strategy Needs a Strategy”, Harvard Business Review, 9.2012.

³ TechNavio, Mobile Wallet Market in Europe 2015-2019.

⁴ D. Snowden, M. Boone, “A Leader’s Framework for Decision Making”, Harvard Business Review, 11.2007.

⁵ J. Kotter, “Accelerate!”, Harvard Business Review, 11.2012.

⁶ S. Thair, “DevOps and the Digital Supply Chain”, DevOpsGuys blog.

⁷ Puppet Labs, 2015 State of DevOps Report.

⁸ M. Poppendieck, T. Poppendieck, Implementing Lean Software Development: From Concept to Cash, Addison-Wesley Professional, 2006.

⁹ J. Appelo, “Delegation and distributed control”, Management Issues, 24.06.2015.

¹⁰ M. Benioff, “How to Create Alignment Within Your Company in Order to Succeed”, Salesforce blog, 9.04.2013.

¹¹ D. Schauenberg, “Practical Postmortems at Etsy”, InfoQ, 22.08.2015.

¹² J. Humble, D. Farley, Continuous Delivery: Reliable Software Releases Through Build, Test, and Deployment Automation, Addison-Wesley Professional, 8.2010.

¹³ R. Westrum, “A typology of organizational cultures”, Qual Saf Health Care, 2004.

¹⁴ P. Brodzinski, “Culture Pockets”, Pawel Brodzinski on Software Project Management blog, 30.04.2015.