Why DevOps Governance is Crucial to Enable Developer Velocity

Key Takeaways

  • The complexity of application environments and the need to maintain cost controls, planning and compliance can be a significant obstacle for developers to innovate with velocity—if mismanaged.
  • DevOps teams are especially burdened by change management in large organizations due to the quantity of environments and rapidly evolving application architecture, creating friction between DevOps and developers.
  • A proper tool should account for realistic scenarios in which an application is formed from heterogeneous infrastructure, and it should provide elements of cost analysis.
  • The application environment (a superset of IaC such as Terraform and Helm, configuration scripts, parameters, keys, sequencing, and operations) should be managed centrally by the DevOps team. This lets them track modifications more easily and make changes that are swift and transparent to developer teams.
  • Change management, cost control, and compliance highlight the fact that application environments are extraordinarily complex though they are critical to fuel innovation and velocity in any product organization.
  • It’s becoming clear that gaining access to infrastructure and leveraging IaC platforms (Helm, Terraform, etc.) is insufficient to offer the necessary collaboration, governance, and cost controls to run a modern software organization. An infrastructure control plane that supports these needs throughout the lifecycle of the application environment, from dev to production, is crucial.

It’s well established by now that product innovation is a key enabler of business success in any organization’s market vertical. The user expectation for a highly differentiated, functional, high-performance, and delightful experience is greater now than ever. COVID has only intensified the need for velocity and increased the number of businesses shifting the lion’s share of their focus to delivering services over the internet through multiple channels.

In some organizations, however, velocity and innovation are diametrically opposed to cost controls, planning, and compliance. Today, every potential obstacle to operating with velocity is a candidate for standardization and automation. One such example is the transformation of how testing is done and who does it. There is a massive shift from a centralized team that is responsible for testing all applications to a model that places a Software Development/Design Engineer in Test (SDET) on the development team. This shift comes with updates in tooling and process. Similarly, application infrastructure (as it pertains to the availability of environments in the dev/test stages) is a perfect candidate for a long-overdue overhaul.

The central DevOps team typically defines application environments, because developers rarely consider environment definitions part of their responsibilities and have neither the skills nor the inclination to take them on. Those responsibilities consist of infrastructure allocation, orchestration, configuration, and management. Common technologies used in this context include Terraform, Helm Charts, and Ansible scripts. As an example scenario, an application environment may be used when a developer commits code to a feature branch and wants to run a quick smoke test to make sure they didn’t introduce a defect. The build server then builds the code and deploys the application into a newly created environment. That environment definition is created from these assets by the DevOps team and configured into the build server by either the DevOps or development teams.

Where the wheels go off the rails 

As in many other cases, the collaboration between multiple development teams and the centralized DevOps team can introduce friction. For example, environments can change periodically due to the changing requirements of the application or developer needs. If an organization has multiple applications and many teams, this can burden a DevOps team that is already struggling to unblock developers efficiently. Developers are pragmatic; they will roll up their sleeves, and you end up with a sea of configuration files, tweaked and reconfigured, sometimes using different configurations and components. This leads to challenges in multiple domains, including cultural, financial, and compliance.

Maintaining the chaos: Small (environment) changes can result in big delays 

In many organizations, the application environments used in CI were designed early on by a lead architect or DevOps engineer. As products and teams have scaled, many variants of those assets have been created locally within each team. This approach can work well when no changes are necessary. However, the day a change is needed, especially one that impacts several teams and infrastructure configuration assets, it can become a blocker for developers.

For example, take this simple Terraform definition:

provider "aws" {
  profile = "default"
  region  = "us-west-2"
}

resource "aws_instance" "app_server" {
  ami           = "ami-830c94e3"
  instance_type = "t2.micro"

  tags = {
    Name = "ExampleAppServerInstance"
  }
}

Let’s say the DevOps team is asked to move to a more powerful instance type, or to change the tagging. In the sea of Terraform files each team is using, finding out who knows each repo and how its files were written and tweaked is extremely burdensome. Doing this for ten teams can be a monumental undertaking that could easily take two weeks or more.

In the meantime? Developers are hacking the Terraform to enable them to continue coding, even though they know that this can be problematic down the road.

One key takeaway from all this: consolidation of application descriptors enables efficiencies via modularization and reuse of tested and proven elements. This way the DevOps team can respond quickly to the dev team needs in a way that is scalable and repeatable.
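One way to picture this consolidation is a shared Terraform module that the DevOps team owns and each development team calls. The following is only a sketch under assumptions: the directory layout, the variable names (`instance_type`, `extra_tags`), and the `checkout` team are hypothetical and do not come from any particular organization.

```hcl
# modules/app_server/main.tf
# Hypothetical shared module owned by the DevOps team.

variable "instance_type" {
  description = "EC2 instance size; the default applies to every team that does not override it"
  type        = string
  default     = "t2.micro"
}

variable "extra_tags" {
  description = "Team-specific tags merged on top of the common ones"
  type        = map(string)
  default     = {}
}

resource "aws_instance" "app_server" {
  ami           = "ami-830c94e3"
  instance_type = var.instance_type

  # Common tags are defined once here; teams can only add to them.
  tags = merge(
    { Name = "ExampleAppServerInstance" },
    var.extra_tags
  )
}

# teams/checkout/main.tf
# Hypothetical team configuration that reuses the shared module
# instead of keeping its own copy of the resource definition.

module "app_server" {
  source = "../../modules/app_server"

  extra_tags = { Team = "checkout" }
}
```

With a layout along these lines, moving every team to a larger machine would be a change to one default value in the module, and a tagging change would touch a single `merge` call rather than ten separately tweaked files.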

Some potential anti-patterns include:

Developers are throwing their application environment change needs over the fence to the DevOps team via the ticketing system, causing the relationship to worsen. Leaders should implement safeguards to detect this scenario in advance and then consider the appropriate response. An infrastructure control plane, in many cases, can provide the capabilities to discover and subsume the underlying IaC files and detect any code drift between the environments. Automating this process can alleviate much of the friction between developers and DevOps teams.

Developers are taking things into their own hands, resulting in an increased number of changes in local IaC files and an associated loss of control. Mistakes happen, things stop working, and finger pointing ensues. As above, the DevOps team needs the right tooling so that developers can focus on application development and step away from managing environments.

Keeping costs in check 

It’s common these days for a company to have some form of artificial intelligence that relies on massive amounts of data as part of their offering. This means that when a developer wants to create or modify the code and test it, they need a fairly robust environment to work with, which could be quite costly. Then multiply that by the many developers working in parallel continuously, and it could get extremely expensive. Cloud cost management can be further complicated by using heterogeneous environments. 

An effective tool should be able to manage a variety of complex, realistic scenarios. For example, imagine an application hosted on hybrid infrastructure consisting of an on-premises component combined with cloud, which is described through Terraform and Helm. The cost tooling should be able to provide elements of cost analysis covering both design and operations while handling multiple users, teams, and environments.

Key takeaway: Tagging and cost need to be embedded into the process and environment from the beginning. Building on that, proper reporting needs to be provided, both for launching and evaluating environments and as a tool to examine expenditure efficiency more broadly. With a single cloud vendor, there are typically tools that provide this. In a heterogeneous infrastructure setup, it can be a bit harder.
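As one illustration of embedding tags from the beginning, the AWS provider for Terraform supports a `default_tags` block that applies a set of tags to every taggable resource the provider creates. The sketch below assumes an AWS-only slice of the environment; the tag keys and the `team` and `environment` variables are hypothetical choices, not a prescribed scheme.

```hcl
variable "team" {
  description = "Owning team, used for cost attribution (hypothetical variable)"
  type        = string
}

variable "environment" {
  description = "Environment name such as dev or staging (hypothetical variable)"
  type        = string
}

provider "aws" {
  region = "us-west-2"

  # Applied to every taggable resource created through this provider,
  # so individual teams do not have to remember to add cost tags.
  default_tags {
    tags = {
      Team        = var.team
      Environment = var.environment
      ManagedBy   = "terraform"
    }
  }
}
```

This covers only the resources managed through that provider. The on-premises and Helm-described parts of a heterogeneous environment would need an equivalent labeling convention of their own for the reports to line up.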

Some potential anti-patterns to watch out for:

The team commits to being more disciplined around environment design and utilization (for example, taking down environments at the end of the day). Like many behavior changes, this one is unnatural and difficult to maintain, and it is unlikely to last. Additionally, if the spin-up and tear-down of these environments have to be done manually, the result could be configuration drift in environments over time.

Assigning a DevOps individual to “own cost” can be an unrealistic approach. In the absence of a proper tagging tool, this individual will have to go from team to team, asking each one to apply tags at a point in time and to keep maintaining them afterwards. They will also have to build reporting infrastructure and provide reports. This is a tall task, especially with a growing number of (busy) developer teams and rapidly evolving applications that are based on heterogeneous infrastructure.

Does compliance have to be at odds with developer velocity? 

The topic of compliance is typically unfamiliar to the average developer, and when the topic arises, it usually is not greeted in a warm or welcoming fashion. Yes, compliance does not improve velocity or make the end user’s life easier, but in some markets (for example, financial services and healthcare), compliance is either mandatory or a differentiator. Some organizations may be concerned that the acceleration of software creation and innovation results in compliance not being maintained. 

For the DevOps team lead who is usually tasked with reporting and leading the compliance agenda item across all teams (at least when it comes to infrastructure and access), dealing with many teams can be challenging.

Take, for example, how secrets are managed inside of the Terraform file. There are clearly different levels of secrets protection with different levels of investment (financial and effort). 

When the DevOps team takes on this decision-making process, they first need to evaluate the set of Infrastructure as Code files in use by the different teams and then go through these files and make the necessary changes. This is an extensive process. Instead, if the Terraform files were managed centrally in a modular manner by the DevOps team, they would know exactly which repo and files need to be modified and the change would be swift and transparent to the developer teams. 
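To make the different levels of secrets protection more concrete, here is a hedged sketch of two common options in Terraform on AWS. The variable name and the secret path `example/app/db-password` are hypothetical, and the two options are shown side by side for comparison rather than as a recommended combination.

```hcl
# Option 1 (lower investment): pass the secret in as an input variable
# marked sensitive, so Terraform redacts it from plan and apply output.
variable "db_password" {
  description = "Database password supplied at run time (hypothetical variable)"
  type        = string
  sensitive   = true
}

# Option 2 (higher investment): keep the secret in AWS Secrets Manager
# and read it at plan time, so it never lives in the repository.
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "example/app/db-password" # hypothetical secret name
}

locals {
  # Value that resources in this configuration would reference.
  db_password = data.aws_secretsmanager_secret_version.db_password.secret_string
}
```

In both cases the value can still end up in the Terraform state, so access to the state backend has to be part of the same decision. If a definition like this lives in one centrally managed module, moving from the first option to the second is a change in one place.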

One very typical aspect (and major drawback) of environments configured from disparate and decentralized IaC files is that they often do not reflect the production environment. The outcome of such a process is that sometimes developers may make assumptions about the production environment, which could be changed by another team. If the environments aren’t managed in a centralized manner, changes in one environment may not be reflected in other environments throughout the Software Development Lifecycle. As a result, a devastating defect could be introduced to production, possibly causing an outage. With the multitude of infrastructure definitions, now the DevOps team would need to debug in the production environment, a process that typically extends the outage.

Key takeaway: Treat lower environments just like production in the sense of environment topology (how well it mimics production), secrets and keys, etc. Manage the cost of lower environments by providing a means to rapidly set them up and tear them down on demand.

Watch out for lower environments (staging, dev, etc.) that inch towards mimicking production but make only a marginal attempt. Such an effort will not be maintained over time, leaving the risk of failed production deployments or, worse, outages and security exposure.

Special builds and environments are built to mimic production, but only end up covering a few cases. The natural response to a failure in production due to deployment of a version that was built on an environment that does not mimic production, or similarly, a security breach, is to fortify that team and/or microservice. But this approach will only move the risk to the next team. The approach to analyzing and aligning which teams and microservices require more realistic environments needs to be holistic and complete. The lower environments need to be described in a modular fashion such that they have common basic components and some teams have additional layers, essentially avoiding a situation where each team is completely different, which would be very difficult to manage.
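The layered description of lower environments could be sketched in Terraform as a common base module plus optional modules for the teams that need more. Everything named here is an assumption made for illustration: the module paths, the `message_queue` layer, and the `subnet_ids` output of the base module are all hypothetical.

```hcl
variable "environment" {
  description = "Lower environment name, for example dev or staging"
  type        = string
}

# Common layer that every team gets, kept close to the production topology.
module "base" {
  source      = "../modules/base_environment" # hypothetical shared module
  environment = var.environment
}

# Additional layer only for teams whose services need it.
module "message_queue" {
  source     = "../modules/message_queue" # hypothetical shared module
  subnet_ids = module.base.subnet_ids     # assumes the base module exports this output
}
```

Because each team composes the same building blocks, the environments differ only in which layers they include, which is far easier to reason about than a fully custom definition per team.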

In summary, change management, cost control, and compliance are only three aspects highlighting the fact that application environments are an extraordinarily complex topic, one that is critical to fuel innovation and velocity in any product organization. The vast majority of organizations do not have it perfect and are in fact far from it. They are in transition between infrastructure solutions (typically on-premises to cloud) or from a monolithic to a service-oriented architecture. Then there are the business growth needs that require careful application architecture planning. The business’s need to maintain cost controls and governance is often at odds with the developer’s goal of innovating with velocity. But it doesn’t have to be this way. There is a better way that provides the right level of control to each team, facilitates collaboration at scale, and gives the business the growth it needs.
