Developer Effectiveness: Optimizing Feedback Loops

Summary

Tim Cochran presents research gathered from ThoughtWorks' varied clients and projects, and shows some of the metrics their teams have identified as guides to creating the platform and the culture for high performing teams.

Bio

Tim Cochran is a Technical Director for ThoughtWorks and he leads the East Coast Market. He provides guidance and leadership on technical platforms and the engineering culture to support high performing teams. He is passionate about taking data-driven approaches to improve developer effectiveness. He works with companies in varied domains such as retail, finance, government, and insurance.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Cochran: I'm a Technical Director for ThoughtWorks. We are a software development consultancy. We work in many different domains, such as retail and finance, with lots of different companies. If you don't know ThoughtWorks, Martin Fowler is our chief scientist. That's why a lot of people know us. We've created a lot of open-source tools over our history. We actually have a co-creator right here, Paul. We've also had a lot of thought leadership around continuous delivery, microservices, testing, BDD, things like that.

I'm going to describe a typical scenario that happens to us a lot. An enterprise or a medium-sized company is in the midst of a digital transformation. They want to modernize. They're looking at doing a cloud migration. They're looking at doing a legacy modernization. Maybe they need to modernize some of their business processes along with some of their technology. Typically, what they're going to do is they're going to reach for some modern practices. Often, it's something like this. They're going to reach for microservices, serverless, event-driven architecture. Whatever has been talked about at a conference like this, probably, conference-driven architecture.

There's a perceived silver bullet: if we implement microservices, that will solve all our problems. Of course, what happens is, if you're lucky, they build lots of microservices. Sometimes they just have lots of diagrams of microservices. There are lots of delays, they're over budget, there's little in production, and there are lots of defects. The CTO or CIO is very frustrated. What do they do? They call a consultancy. They might call ThoughtWorks. This is sometimes when we come in. What are we going to do?

DORA Metrics

This has been mentioned already; this is the fourth or fifth talk mentioning it today. If you haven't got the message yet, please read the State of DevOps report, or Accelerate. There are actually five key metrics now: availability just got added. Of course, what we're looking at is that we have this complex architecture, but we have not mastered the basics, these five key metrics. Maybe it looks something like this. Our mean time to recovery is measured in days. Change failure rate: pretty much every other build is failing. Lead time is three to four hours. Maybe deployment frequency is actually not too bad, but perhaps we shouldn't be deploying that much in this state. Then availability is pretty low. Often this would actually be measured against a pre-prod environment; they probably haven't made it to production.

Path to Production

What do we do? If you read the DevOps Handbook, the advice is to do a Value Stream Map. We call this the path to production. You can very quickly find things that are going wrong. There's some miscommunication going on with the platform team, and it's blocking the product teams. There's lots of rework happening between the front-end and back-end teams. Our QA team is perhaps shared amongst different applications, and we have to wait to get hold of them. Then when we actually do the QA process, it takes about 2 days. We can optimize these things. What we actually find is that we're still not able to hit the metrics we want to hit. We're still not going fast enough. The reason for this is that we haven't built an effective environment for developers. We haven't created the basics. If you try to release often and you haven't mastered the basics, you're going to create a lot of extra work. Basically, all your engineers are going to be doing is preparing for releases the whole time. We have to make sure that we have a really effective environment.

We work for lots of different clients. We work for some that are highly effective, and some that have low effectiveness; obviously, more on the low-effectiveness side, because they don't call us as much if they're already highly effective. Let's take a look at a day in the life of an engineer working in a highly effective environment. Typically, it goes like this: I come in and I can start work. My tools have been updated. I'm pretty clear on what I have to do. I have an uninterrupted block of time. I'm able to perform technical analysis pretty quickly. Then I commit my changes, and a number of automated checks run. I can deploy to prod. Then I can release that to production. I can monitor business and operational metrics. The point is that the developer goes home happy. This is not a myth. I have actually worked at companies like this. It wasn't Netflix. It does exist at other companies. I'm curious, let's start with the highly effective one. Has anyone experienced this really effective environment?

What happens in a low-effectiveness environment? You arrive at work. You have a number of alerts. You don't have any access to the production logs. I do a chase around. I work with my operations team and find out that a lot of those alerts are actually false positives. I'm checking on the features that I completed last week. I'm checking with the various different governance groups. It's still blocked. My day is broken up with many different meetings, most of them just status updates. My nightly build is red. What I find out is that, actually, it's red because a test wasn't updated. Now I try to actually do something, but I don't have enough information. I attempt to go and ask the team about that. I'm told by the project manager that I can't talk to that team; I must create a Jira ticket. I go home frustrated. Who does this sound familiar to? More hands.

Death by a Thousand Paper Cuts

It is interesting, because it's not one thing. It's a lot of things. We call it death by a thousand paper cuts. Everything you do takes that little bit longer than it should, and that has a compounding effect. The environment becomes really ineffective and slow. The engineers that work at what we describe as these modern digital businesses feel this bias for action, this momentum where everything seems oriented around delivery, which you don't have in these low-effectiveness environments, where it's more bureaucratic, more about people's particular jobs and roles, and things like that. This is not rocket science, but our research says that if you have an effective environment, where a developer feels they're able to achieve things, they are motivated, and therefore they are more productive. I would hope I wouldn't have to show too much data to prove that point.

How do we fix this? What we like to do at ThoughtWorks is take a data-driven approach. We're going to start by looking at what engineers do during the day. The low-effectiveness example might be the client that called us. It's very obvious that the engineers are spending way too much time on things other than writing code, other than providing value. This is not actually that useful, though; it's not that actionable. I don't really know what to do about it. When we talk about the DevOps report, a lot of it is about feedback loops. It's about this lead time: I write some code, I deploy it to production, I get feedback. What we've noticed is that the best way to look at effectiveness is to look at different micro feedback loops. What is a micro feedback loop that you might know? The simplest one is red-green-refactor. Another one is build, measure, learn: I write some software. I put it into production. I instrument it. I run an A/B test. I learn from that data. Then I pivot. I iterate. Those are some basic feedback loops.

Once we analyze what an engineer does during the day, what we realize is that it's just a series of micro feedback loops. Let's take a look at what some of them might be. This is not exhaustive; this is just a few of them. The way that we go about this is we apply product management techniques and user experience techniques. We're trying to create an effective environment for developers. In this situation, the product is the environment, and my user is the developer. What I'm looking at is: what is the task? What is the value my engineer is trying to deliver? Rather than focusing on a technique or a tool, let's focus on the value that engineer is trying to deliver. We've broken this down into these various tasks or jobs, feedback loops that you're going to go through over and over during the day. Obviously, the simplest one is: I want to see a component change in a dev environment. I might want to find the root cause for a defect. We're going to talk through some of those and explore how we can optimize these different feedback loops.

Metrics

First, I wanted a little aside about metrics. We often have these high-level metrics, but how do you break them down? There's actually a good analogy. If you think about when you're trying to lose weight, you get on a scale, and you use your weight to define your goal: I want to lose 10 pounds. That is not actionable; I can't do anything with that. That's just my goal. What I actually want to do is come up with a strategy for how I lose that weight. What do you do these days? You buy an Apple Watch. Then you set a goal for yourself. You come up with these low-level metrics that are more actionable, such that if I keep hitting these low-level metrics, I will hit my goal. In a product space, these are sometimes called lagging and leading indicators. The 10 pounds is a lagging indicator, because that'll change after the leading indicators change. We can apply this technique to these feedback loops, where what we're actually looking for are some of those leading indicators for the things that I need to optimize. Then I'm going to improve my feedback loop, and eventually I'm going to improve my ultimate high-level feedback loops, which are the DevOps report metrics.

Let's start with the simplest one. At ThoughtWorks, we always talk about outcomes. It's a really good way of thinking less about the technique. We normally talk about business outcomes; in this sense, it's more of an outcome for an engineer. I'm going to use this format where we have the trigger on the left and what I'm trying to do on the right. What we did with the path to production, we can do for a micro feedback loop. We're going to think about how we can optimize these loops to make them faster.

We're going to start with a really simple one. You've probably seen this. I'm making a change in a dev environment. By dev environment, what I mean is a personal dev environment, which might be a cloud environment; often, it's just your laptop. Maybe in a low-effectiveness environment, I have three manual steps: I have to run the build tool, I have to cycle the app server, and I have to refresh the browser. That could be 10 or 15 seconds. On the surface of it, that doesn't feel too bad; it's just three commands or something like that. What you'll find is that engineers learn that. Over time, they get this habit where I have to run these three commands to do something. It actually can be quite difficult, and it's a barrier to engineers entering your team.

There is, of course, a solution to this. These days, there are lots of different tools for hot reload and things like that. Webpack has it built in, and Spring Boot and JRebel, I think, have this stuff built in. Ideally, what we actually want is to get to 2 seconds. Maybe I'm working in a dynamic language, where that's a little easier to do. I want this idea that there are no manual steps, and I can get to 2 seconds. That's interesting. Probably a lot of you are already doing this, because engineers do this naturally. The point is, did anybody care? The team probably did it, but was that lauded by your management? If that time increased, would you actually have any checks and balances to make sure you invest in keeping it down? What we find is that, yes, if the team is empowered, engineers do this, because we're all impatient. We're all going to naturally optimize.
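
To make the "no manual steps" idea concrete, here is a minimal sketch of a file watcher that reruns a build-and-reload step on every source change. The `src` directory and the `make reload` command are hypothetical placeholders; in practice you would lean on webpack dev server, Spring Boot devtools, or similar rather than rolling your own:

```python
# A minimal sketch of the "no manual steps" loop: watch the source tree and
# rerun a build-and-reload command on every change. WATCH_DIR and REBUILD
# are placeholders for your project's layout and tooling.
import os
import subprocess
import time

WATCH_DIR = "src"             # assumption: sources live in ./src
REBUILD = ["make", "reload"]  # placeholder for build + server cycle + refresh

def snapshot(path: str) -> dict[str, float]:
    """Map every file under `path` to its last-modified time."""
    mtimes = {}
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            mtimes[full] = os.stat(full).st_mtime
    return mtimes

if __name__ == "__main__":
    last = snapshot(WATCH_DIR)
    while True:
        time.sleep(0.5)
        current = snapshot(WATCH_DIR)
        if current != last:
            last = current
            subprocess.run(REBUILD)  # the three manual steps, automated
```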

Sometimes, we do actually have to justify this to our management. How do we justify it? This is some pseudo-math, back-of-a-paper-napkin stuff. If you think about it, an engineer might build 50 or 100 times a day. If your build takes 2 minutes, that's going to be 100 or 200 minutes. That's a lot of time during a day. Of course, that doesn't necessarily mean that you're not going to do something else during that time, but there's a loss of focus from context switching. When we talk about this death by a thousand paper cuts, it's these small improvements that are going to make a lot of difference to the engineering organization. Because if you get to that 2-second response time, what happens is I forget that there's something building my software. It doesn't matter anymore, because what I'm doing is writing code and getting feedback, almost instantaneously. That's how you get into this feedback loop, this focus.
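
A quick sketch of that napkin math, assuming 75 builds a day (between the 50 and 100 mentioned above) and comparing a 2-minute build against a 10-second and a 2-second one:

```python
# Back-of-the-napkin math: minutes per day spent waiting on builds.
builds_per_day = 75  # assumption: between the 50 and 100 quoted above

for build_seconds in (120, 10, 2):
    waiting_minutes = builds_per_day * build_seconds / 60
    print(f"{build_seconds:>4}s build -> {waiting_minutes:6.1f} min/day waiting")
```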

Distributed Architecture - Application/Team Scope

What we also see is that we're probably creating a distributed architecture. Another anti-pattern we see is that we might have created that fast loop for only one component. If I've got a lot of little microservices, then in order to complete an actual user-facing story with user value, I might have to rebuild more than one component. What we might have seen, especially if you split your team up by skills, or into front-end and back-end teams, is that you've got your application context wrong. What I've probably done is created these dev environments for each service separately. The smell is my engineers spending a lot of time restarting processes manually. That really means that you've got your application scope wrong.

It's pretty simple. What I want to do is create a context where I have all the services that I need to complete the majority of my stories. Then I probably want to make it easy; I create some CLI for that. What this does is really reduce the cognitive overhead. When we've been to these really highly productive environments, engineers are just sitting down and writing code. It doesn't really matter if there are 10 microservices under it. That's the productivity. Because I think what we're seeing is that there are lots of good reasons to create a distributed architecture, but a lot of the time we've penalized the developer because of it. We've optimized for runtime efficiency, and in some ways we've made it a less effective environment because of the complexity we've put in. There are lots of tools (this is not a talk about tools, but I'll throw in a few occasionally): Docker Compose and Skaffold will help you do that.

Let's quickly look at some of those low-level metrics. I think this is pretty obvious. Some of the things you could track are your build time, transpiling time, the amount of time you spend in the morning setting up your dev environment (because you're not using Docker or something like that), and the amount of time you spend deploying. If you do some analysis, you can get a cross-section of engineers in a room, put some Post-it Notes up, and try to figure this out. Some of this, obviously, you can automate the measuring of.
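
Here is a sketch of automating one of those measurements: a small wrapper that times whatever build command you give it and appends the duration to a CSV you can chart over time. The default command and file name are assumptions:

```python
# Time a build command and log the duration, so build time can be tracked
# as a low-level metric rather than guessed at in a retrospective.
import csv
import subprocess
import sys
import time
from datetime import datetime, timezone

def timed_build(cmd: list[str], log_file: str = "build_times.csv") -> int:
    start = time.perf_counter()
    result = subprocess.run(cmd)
    duration = time.perf_counter() - start
    with open(log_file, "a", newline="") as f:
        csv.writer(f).writerow([
            datetime.now(timezone.utc).isoformat(),
            " ".join(cmd),
            f"{duration:.2f}",
            result.returncode,
        ])
    return result.returncode

if __name__ == "__main__":
    # e.g. python timed_build.py make build
    sys.exit(timed_build(sys.argv[1:] or ["make", "build"]))
```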

Find Root Cause for Defect

Find the root cause for a defect. I'm pretty sure that happens a lot. In a bad situation, it happens every day; hopefully, it's not every day. In a low-effectiveness environment, the worst case scenario is that the defect has been reported by a user, and I don't have the logs available. Probably, what I'm doing is just guessing, and I'm going to keep guessing until I get the solution right. It's going to take a while. In the highly effective environments, my problem is detected by monitoring. I have aggregated logs available to me. I can quickly identify the solution. Probably, what I've actually done is optimized my environment so that I can reduce the risk of deployments: I can release the fix in a dark fashion, verify that it works in production, and remediate. In this sense, it's probably going to be 1 to 2 days, maybe with some manual steps there.
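
One enabler of that fast loop is logs that can actually be aggregated and searched. As a minimal sketch, here is structured (JSON) logging with only the standard library; the service name and field set are illustrative, not a prescription:

```python
# Emit one JSON object per log line so a log aggregator can index and
# search them; field names here are illustrative.
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "service": "checkout",  # hypothetical service name
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("payment declined", extra={"trace_id": "abc123"})
```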

Here are some of the things you might use as your low-level, leading indicators. I wouldn't treat my list as exhaustive. Once you've found those micro feedback loops, it's a good exercise to think about what the right things to measure are in your situation. Here we're talking about maybe the number of different places I have to go to look at logs, or perhaps the number of false positives in my logs, because I want to reduce those. That's defects.

Validate Component Works with Other Components

Let's talk about validating that my component works with other components. Actually, what I mean here is that it integrates with other components. This is interesting because it gets a bit more complicated now: it's going to involve multiple teams, and usually a QA team or something like that. That's sometimes the value of my API testing suite, or my regression suite. As an engineer, what I really want to know is: does my component work with other components? I've changed my module, my microservice. I want to know as fast as possible whether it will break if I put it into the next environment.

A lot of times, we see that this feedback is very slow. Often, you don't really get that feedback until right at the end. I have to raise my pull request. Then it has to go through code review. It's deployed. Then eventually, once my end-to-end regression tests run overnight, I might actually get that feedback. There are lots of anti-patterns here. This is where it varies widely. In a low-effectiveness environment, it could be 3 days, or 2 weeks. To be honest, quite often we see that we never actually reach confidence, because we've never put that safety guard in place. Probably, you're seeing a lot of bugs that the QA team picks up, because I haven't put that testing in place. There are some smells that make this feedback loop really bad; I'm sure you all know about these. My pre-prod environment is broken all the time. My pre-prod environment differs in configuration, or test data. My services depend on each other, so I have to release them at the same time. My end-to-end tests are slow and flaky. Perhaps my tests are maintained by a separate team, and I have no visibility into them. My tests are testing the same path over and over again; that's another smell we see. This is the level of confidence in the low-effectiveness environment.

In a highly effective environment, there are a lot of different ways that you can do this. I wouldn't take what I'm suggesting here as a recommendation, because there are a lot of different innovations at the moment. You should be thinking about testing in production, and thinking about how you handle observability. Really look at what value you're trying to achieve and how you get that confidence, rather than dogmatically writing end-to-end tests or something like that. It's really good to get a cross-sectional group together to think about that value, and about the ways you can keep your environment very simple and still achieve that confidence. This particular example is one of my teams working on luxury vehicle e-commerce. One of the most important things is that first step: I have a contract test running locally, using some service virtualization. It's based on contracts that have been published by the services that I depend on, and I can generate stubs from them. I can very quickly find out if I'm going to break a service. Beyond that, what we're doing here is the testing pyramid: fewer end-to-end tests, and more observability using a canary approach, getting confidence through that. The point here is that I've got this massive level of confidence right at the beginning, and then it tails off. This feedback loop has been optimized.
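
As a rough illustration of that first step, here is a self-contained consumer-side contract test using only the standard library. The vehicle endpoint and its fields are hypothetical stand-ins for a contract published by a provider team; a real setup would generate the stub from the published contract (with a tool like Pact) rather than hand-rolling it:

```python
# Sketch of a consumer-side contract test: run the consumer code against a
# local stub that serves the provider's published contract.
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# The fields the provider's published contract says we may rely on.
CONTRACT = {"id": "v-123", "model": "roadster", "price_cents": 9500000}

class StubHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = json.dumps(CONTRACT).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep test output quiet

def fetch_vehicle(base_url: str) -> dict:
    """The consumer code under test: reads only fields named in the contract."""
    with urllib.request.urlopen(f"{base_url}/vehicles/v-123") as resp:
        data = json.loads(resp.read())
    return {"id": data["id"], "model": data["model"]}

def test_consumer_against_stub():
    server = HTTPServer(("127.0.0.1", 0), StubHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        vehicle = fetch_vehicle(f"http://127.0.0.1:{server.server_port}")
        assert vehicle == {"id": "v-123", "model": "roadster"}
    finally:
        server.shutdown()

if __name__ == "__main__":
    test_consumer_against_stub()
    print("consumer honors the published contract")
```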

There are lots of different metrics you could look at here. We generally recommend trunk-based development; using that approach is going to improve this feedback cycle. If you are working with feature branches instead, you're looking at the time that a feature branch has been open, and the time it takes to get code review feedback. Those are some of the things you want to measure and optimize for. You can imagine lots of other metrics here, like the length of time for your end-to-end test suite, the amount of duplication in your tests, or the coverage of critical paths, things like that.
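
As one example, here is a rough sketch of measuring how long feature branches have been open, approximated from git metadata as the age of each branch's first commit off the trunk. The trunk name `main` is an assumption:

```python
# Approximate how long each feature branch has been open: the age of its
# first commit that is not on the trunk.
import subprocess
import time

def branch_open_days(repo: str = ".", trunk: str = "main") -> dict[str, float]:
    heads = subprocess.run(
        ["git", "-C", repo, "for-each-ref",
         "--format=%(refname:short)", "refs/heads/"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    now, ages = time.time(), {}
    for branch in heads:
        if branch == trunk:
            continue
        stamps = subprocess.run(
            ["git", "-C", repo, "log", f"{trunk}..{branch}",
             "--format=%at", "--reverse"],
            capture_output=True, text=True, check=True,
        ).stdout.split()
        if stamps:  # branch has commits the trunk does not
            ages[branch] = (now - int(stamps[0])) / 86400
    return ages

if __name__ == "__main__":
    for branch, days in sorted(branch_open_days().items(), key=lambda kv: -kv[1]):
        print(f"{branch:30} open ~{days:.1f} days")
```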

When we talk to companies about this, especially QA managers, there's this idea that we're trading off completeness versus speed. We feel that if you've got your architecture right, if you really have a well-defined interface and really good component tests, then you're not actually trading off anything. We feel you can achieve pretty good confidence without having to resort to really long end-to-end tests. Essentially, you can have your cake and eat it with some of these techniques. There are also other techniques, things like using parallel builds, and sharding a test suite to make it faster. Those are some of the more obvious things, where it's more about automation.

Productive On New Team

Now I'm going to look at some of the practices that are going to drive out perhaps more cultural changes. These are some of the more interesting ones, which weren't so obvious when we first thought about highly effective environments. This is an interesting one: if I'm an engineer joining a new team, how long does it take me to get up to productivity? This is interesting because what we actually see, especially in the modern digital businesses, is that teams change all the time. That's because we want our teams to map to the product strategy, and a lot of the time we want to be ok with that. We want to optimize for the ability to change teams. In order to do that, I have to optimize for people joining a new team and becoming productive.

The trigger is joining a new team; the outcome is I want to be productive. In a low-effectiveness environment, it looks something like this. I have to set up my laptop and my dev environment. I have to figure out how I get permissions to all the different tools and code. I probably get some overview from an architect or something like that. I'm probably asked to work on some trivial things, some bugs. Then I might get some feedback from code reviews. Probably, what we see a lot of the time is engineers make a ton of mistakes until finally they've been told off enough that they understand what they should be doing. It could take about 2 months.

If we actually look at an environment where you really optimize for the ability to change teams, what we see is quite often: on my first day I'm given a laptop, and I have a pre-production environment that is working from day one. I'm probably going to be able to deploy to production on my first day. I'm going to have access to all the repos. Quite often in these modern digital businesses, they're not going to lock down any knowledge unless they have to; they're very transparent. It shouldn't be that your repos are locked down. Of course, there are sometimes reasons for it, but in a lot of situations it's completely open. Then I'm going to collaborate with multiple members of the team to work on meaningful stories. I'm going to learn from the knowledge that the team has, and very quickly get up to speed.

Collaborating can take many different forms. Obviously, I work for ThoughtWorks: I love pair programming. It could be other ways. It could be in-person code reviews. We were actually just talking about it; it could be mob programming. These are all ways of extracting the knowledge out of the people on the team. It can be quite hard for an engineering manager to do this, but slow down some of your senior engineers and have them really focus on bringing the rest of the team up, rather than being the person that barrels ahead and does all the really valuable work. Then perhaps I have a great knowledge system.

That is not correct; that is not the right metric for this one. It is not 30 minutes. That would be really great if we could do that. I think it's more like 2 weeks, or something like that. Low effectiveness is more like 2 months. I am going to stay away from trying to measure productivity; that is a very difficult conversation. A lot of my recommendation is about creating effective development environments, and having that conversation with developers is actually going to cause them to be more productive. Trying to measure productivity is a difficult one. I have seen some techniques, like time to tenth commit. That is clearly very easy to game, but it might be of use.
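
For what it's worth, here is a rough sketch of computing time to tenth commit from git history; treat it as a conversation starter, not a target, for exactly the gaming reason above. The repo path and author email are hypothetical:

```python
# Days between a new joiner's first and tenth commit, from git history.
import subprocess

def time_to_tenth_commit(repo: str, author_email: str):
    """Return days between the author's 1st and 10th commit, or None."""
    timestamps = subprocess.run(
        ["git", "-C", repo, "log", "--reverse",
         f"--author={author_email}", "--format=%at"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    if len(timestamps) < 10:
        return None
    first, tenth = int(timestamps[0]), int(timestamps[9])
    return (tenth - first) / 86400  # seconds per day

if __name__ == "__main__":
    print(time_to_tenth_commit(".", "new.joiner@example.com"))  # hypothetical
```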

The reason why it might be complicated for engineers to join different teams is that you have too many tools and things like that. One of the things you might do is an audit of your current technical landscape. You might decide to focus on limiting the number of different tools. ThoughtWorks' Tech Radar, and the Build Your Own Radar exercise, is a good technique for that. That might be one of the reasons, once you've done the analysis, why it takes so long for engineers to become productive. Actually, in this situation, a really good way of doing this is research with your developers. I talk to my developers and try to understand. Have a cadence of surveys, do interviews with new starters, find out what's happening, and repeat that. That way you can make sure we're making progress, that we're optimizing for those new starters or those people that change teams, and that the changes we're making are actually making a difference.

Answer a Technical Query

This is deceptively simple, but it surprisingly makes a lot of difference. No matter how good my documentation is, or how well built my API is, there's always going to be a situation where an engineer needs a clarification. This is the feedback loop. It starts with: I'm integrating with an internal API or a library. I want to be able to integrate with it, but there's something I don't understand. In a low-effectiveness environment, I read the documentation, if there is some. I still need a clarification. I have to spend probably a little bit of time asking around to figure out who actually owns this API. Does anyone own it at this point? Quite often, especially if you work in a legacy environment, somebody owned it 10 years ago and they've since left, and at this point no one owns it. I probably have to talk to a project manager. Perhaps I'm not allowed to talk to the engineers. I'm going to have to open a Jira ticket. I'm going to get a response in 2 days. I can guarantee that response is not going to answer the question that I actually asked. I'm going to have to go back and forth, because English is not the clearest way of describing something. Eventually, I might get an answer. Eventually, I might be able to integrate. That takes about 1 to 2 weeks.

When we actually look at the highly effective environment, what they have is this bias for action. It starts with reading documentation. One of the things that we talked about earlier is that I can read the code; I can go and look at the code. If I still need a clarification, the ownership is clear. I know which team to talk to. Probably, I can jump into that team's chat room, ask them questions, and they are responsive and happy to answer. That is a difficult culture to breed, but that is what we see at these highly effective environments, where people are trying to help each other. Quite often, if a company is doing continuous delivery and trunk-based development, they're better at this. The reason is they know that if that person is blocked, they can't do anything, because if you're following best practices, you're not working on two things at the same time. They know that unblocking this person may actually be a little bit more important than what they're doing. Of course, you have to be careful here, because this goes against having uninterrupted time and focus. It's this idea that engineers are trying to help each other, rather than trying to protect their own backs and things like that.

Some of the metrics you could look at: ideally, this is 30 minutes. Low effectiveness is going to be more like 1 to 2 weeks. What you might look at is all the code and all the systems you own: do we know who owns them? Maybe there's some stuff that nobody owns; perhaps we should assign that to a team. It can't be that everybody is just working on the new shiny project. There's a cost to some of this legacy software, and perhaps everybody has to shoulder that burden and take on some ownership of it. Of course, you could also measure the number of tickets. There are probably lots of other metrics we could use here.

Validate Component Adheres to Non-functional Requirements

This is another interesting one, and it also varies incredibly. I don't know if you know what a non-functional requirement is. What I mean by that is: does it adhere to performance SLAs? Does it adhere to my security requirements? Does it adhere to my code quality requirements, all those things? As an engineer, this is a lot of what I want to make sure I'm doing, and as an engineering leader, I want to make sure all my engineers are doing it. In the low-effectiveness environments, I have a code review that has been sent back a number of times by different people. I don't know if you've ever noticed this, where you raise a code review and you have six people piling on, probably all with conflicting opinions. What we actually see in the highly effective environments is that it's more like one or two people, and quite often they let a lot of stuff go, because they know that the value is actually in getting it to production. They might give comments to that person, and there's a trust that they'll address them later if they're not actually going to cause a problem in production.

The application is reviewed by multiple supporting security groups. There's a load test, probably run in a pre-prod environment, probably one month before going live. It's deployed to production. Then probably there's some 2:00 a.m. test run where everybody's on a call. It's going to take a long time, probably 3 months or so, until I know. A lot of this governance is probably done not by practitioners, but by an ivory tower architecture group, and probably against architectural diagrams, as opposed to code or proofs of concept.

Let's take a look at what it might look like in a highly effective environment. My standards and my guardrails are actually documented. I have useful checks in my CI pipelines that are going to find a lot of those things, so I'm not relying so much on the code review to do that. As much as possible, we're automating those NFRs; perhaps there's some security scanning. Instead of a governance group that is rubber-stamping, where it's almost like their metric is the number of things they say no to, what we have is a group that's there to support and to share knowledge, more like a center of excellence. Then, what we've actually done is create an environment where we can deploy something into production, but the risk is low. I can use some of the observability to make sure I'm hitting my NFRs. I don't have to design it perfectly; I can actually try something I've created and see if it works, and I can very easily roll it back.
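
As one example of an automated NFR guardrail, here is a sketch of a CI step that fails the build when the p95 latency of a smoke-tested endpoint exceeds a budget. The endpoint, sample size, and budget are all assumptions; a real pipeline would pair this with security scanning, code quality gates, and so on:

```python
# CI guardrail sketch: fail the build if p95 latency exceeds the budget.
import sys
import time
import urllib.request

ENDPOINT = "http://localhost:8080/health"  # assumption: app runs in the pipeline
P95_BUDGET_MS = 200                        # assumption: the performance SLA

def sample_latency_ms(n: int = 50) -> list[float]:
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        urllib.request.urlopen(ENDPOINT).read()
        samples.append((time.perf_counter() - start) * 1000)
    return samples

if __name__ == "__main__":
    samples = sorted(sample_latency_ms())
    p95 = samples[int(len(samples) * 0.95) - 1]
    print(f"p95 latency: {p95:.1f} ms (budget: {P95_BUDGET_MS} ms)")
    sys.exit(0 if p95 <= P95_BUDGET_MS else 1)
```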

Obviously, it really depends on the type of environment you're in. If you're in a regulated environment, if you're doing payments versus a brochure site or something, you're going to have different ways of doing this. It really varies widely, and it makes a massive difference. To fix it, you're going to have to change your engineering organization. You're going to have to change the way you think about governance: adopt lightweight, trust-based approaches, have less enterprise architecture, and have more feedback on technical decisions made by practitioners.

Some fairly obvious metrics we could look at there are the number of different governance groups, and the amount of time the process takes. I should be careful here: I'm not saying that you shouldn't do governance. Of course, if there's a very significant change, it makes sense to get consensus from your engineering leaders and your architects. Let's just do it in a way that has a bias for action. Maybe think about suggesting a POC, suggesting a way forward, rather than working too much on theoretical diagrams and things like that.

Launch New Service

This is the time to launch a new service. I've identified that I want to create a new service; my outcome is I want to get a version of that service into production. Yes, it's a Hello World, but it's a Hello World with all the things it needs to actually run in production. What I mean is it has all the monitoring, it has the CI/CD set up, and it has all the secret management, all the routing, and things like that all set up. In an inefficient world, this is what we see. When we talked about that example where a client has called us, they've spent a lot of time talking because what they're actually doing is spending a lot of time deciding on tool stacks, and different platform services, and things like that.

If you're not using the cloud, I probably have to right-size my environment a long time before it goes live. We have to justify that, create budgets, and talk to my infrastructure department about it. I have to spend a lot of time discussing all the different tools and techniques, and things like that. There are going to be multiple rounds of governance, and it can be very hard to get a new tool approved. Then I'm going to spend a lot of time setting that up, and then I have to bake it in production. This maybe isn't so bad if you do it once, if you're creating a platform and this is the platform that we're going to use for more applications going forward. But if I keep repeating this, which is what we actually see sometimes, especially if we've gone a little too far down the route of autonomous teams, then teams are reinventing the wheel time and time again.

In the highly effective environment, the team picks a common tech stack that's been approved by the company. This is where Netflix actually has some good stuff around the idea of the paved road, and Spotify has the idea of a golden pathway. The idea is that I've created a very easy route; I've made it so easy for engineers to follow the best practices that it's almost hard to do the wrong thing. Quite often, that can be some tool or a generator, maybe a website or a CLI that generates my application. It creates a skeleton that has batteries included. Of course, it's not like everything's abstracted. One of the other anti-patterns we see is that sometimes we create these abstraction layers on top of standard services when we shouldn't. Where it does make sense, perhaps because a team has an unusual requirement, we should allow them to replace the conventions. Then deploy to production. This is a 5-day thing in a highly effective environment. In a low-effectiveness one, it can be 2 to 4 months. Some of these clients that come to talk to us are still doing that; they've probably been working on it for 6 months or something like that.
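
A toy sketch of what such a generator might look like: a CLI that stamps out a service skeleton with the company conventions baked in. Every path and file body here is a placeholder; real paved-road generators wire in monitoring, secret management, routing, and the CI/CD pipeline too:

```python
# Toy "paved road" generator: stamp out a service skeleton with conventions
# included. File names and contents are placeholders.
import pathlib
import sys

SKELETON = {
    "README.md": "# {name}\n\nGenerated with the paved-road CLI.\n",
    "Dockerfile": "FROM python:3.12-slim\nCOPY . /app\nCMD python /app/main.py\n",
    ".ci/pipeline.yaml": "stages: [build, test, security-scan, deploy]\n",
    "main.py": "print('hello from {name}')\n",
}

def generate(name: str) -> None:
    root = pathlib.Path(name)
    for rel_path, content in SKELETON.items():
        target = root / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content.format(name=name))
    print(f"created service skeleton in ./{name}")

if __name__ == "__main__":
    generate(sys.argv[1] if len(sys.argv) > 1 else "hello-world-service")
```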

There are a lot of different things you could look at when you analyze this process. What you might find is teams making a very passionate case that they should definitely use CircleCI, or Concourse CI, or Jenkins, because it has this one particular feature that they need. The reality is that having multiple tools is not going to be optimal for the company. Maybe there's a way of getting around the fact that one tool doesn't have that one little feature that one team particularly needs. That's where you do need some governance, so think about the different tools that you have. I just used CI/CD in the example; of course, that applies to all the different platform services. Then, look at the number of different handoffs, and anti-patterns, and things like that.

Developer Experience and Platform Teams

I don't have time to talk about developer experience teams and platform teams in depth. If you're going to do it, please do it in a federated, self-service way. Sometimes what we see is an organization will create those teams, but they haven't really fixed the underlying problems with the engineering culture. Just putting a DevOps team in place is not going to solve your problem.

This is my summary of it, of where we want to get to in an ideal world. There aren't that many clients that are on this right-hand side. There are some, for sure.

Warnings

I have one quick warning about this; it's going to echo the previous talk a little bit. I've talked a lot about metrics, and they're very easy to game. Sometimes we see that some of our clients are really into DevOps metrics, but they want to use them to measure teams and hold people accountable, and essentially to beat people up. Please don't do that. Try not to compare teams against each other. Metrics are there as a tool for teams to improve themselves. If a team is not hitting the targets we want, there's probably a reason for that: maybe they're missing capabilities on the team, or maybe they're dealing with more tech debt than other teams, things like that. You really have to look into the reasons for it.

Think about the outcome you're trying to achieve. A lot of my metrics are all about speed, but you probably want to put some things in there that make sure you're actually achieving the desired outcome. Yes, we could have a really fast CI/CD build, but if the number of bugs goes up, then there's no point to that. Reevaluate and change all the time. Then lastly, back to that product management idea: combine the quantitative research with the qualitative. Talk to your teams, measure their happiness, measure how they're feeling about the effectiveness of their environment, and get ideas from them.

 

Recorded at: Jun 05, 2020
