Agile Rehab: Engineering for Improved Delivery

Summary

Bryan Finster covers replacing the Agile process with engineering, how they moved from quarterly to daily delivery, and enabled teams to get the feedback they needed to deliver with predictability.

Bio

Bryan Finster is a passionate advocate for continuous delivery with over two decades of experience building and operating mission-critical systems for very large enterprises. He understands from experience that solving the problems required to continuously deliver value improves outcomes for the end-users, the organization, and the people doing the work.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Finster: I've been a software engineer for nearly 30 years now. I don't do agile coaching. I just do, how do we deliver better and make the work suck less? That's really what I'm after. The biggest problem I see in the software industry is that we're always applying solutions to problems we don't understand. We're going to do the Kubernetes, we're going to do the microservices, we're going to plug in ML because we saw it at a conference once. What the Agile Industrial Complex has given us for the last 20 years is: we're going to improve completion ratios, make our burn-down charts better, get better at story pointing. How many times have you sat in those meetings? We're going to do a lot of story points, get our velocity up, and we're going to get really good at PI planning. The problem is that these are solutions to problems we don't really understand. It's amazing how many people I've talked to who are actually doing waterfall with 2-week sprints and saying, we're agile. But why are we trying to do agile? The reason is, everything's wrong all the time. Requirements are wrong. We're going to misunderstand the requirements. They're going to change before we can get them out the door. What we need to do is engineer a process to get faster feedback.

2014 - 2017 (Problems and Solutions)

I'm Bryan Finster. I work for Defense Unicorns. I'm going to tell a story of a journey I went on at a very large retailer in Northwest Arkansas, several years ago, and show you what we did to solve a business problem. Here's where we were in 2015. I worked in logistics on a giant legacy system that had been bashed together from four other legacy systems over the last 20 years. We had project teams that were shipping, it was just straight-up feature factories. We had functional silos. They had separated out development, support, testing, and project management. If we'd started with better automation when I started there in 2001, we would have been this leading edge of DevOps, and they tore it all down. We had release trains to try to manage all of the teams, the coordination between teams to make sure we didn't break anything. We were releasing two to four times a year. We measured our lead time on this, to get a business request into production took 12 months. Every time we did an install, you're talking multiple hours, four to six hours of downtime for every install at a distribution center. Imagine the cost involved with shutting a DC down for Walmart for four to six hours. When we were going to do an install, the schedule came out, who's going to be working the 24-hour shifts in the war rooms for the next week? It would take weeks to months to roll that change out to the chain and get ready for the next time that happened. You really can't operate a business that way effectively, especially at the time we were in a growth period, acquiring a lot of companies and trying to integrate those companies with a year's lead time, doesn't work.

Our VP came to us with a challenge: I want to do it every two weeks, with zero downtime, and we need better quality. That's a little terrifying. Again, we're talking about a system that had no test automation whatsoever, and we want to release it more frequently. Instead of hiring consultants to come in and talk to us about how we could do SAFe better, because at the time we were a SAFe shop, he gave it to the senior engineers in the area and said, figure this out, so we did. We read Continuous Delivery. Jez says, "If it hurts, do it more." We looked at this challenge of every two weeks: that's going to hurt, but it's not going to hurt enough. If we're going to really do this, we're going to aim for daily. What was the outcome of that? 2017, two years in the future. We hadn't fixed everything, but we had a bunch of components broken out, loosely coupled teams with a loosely coupled architecture. Each one of those teams could deploy independently on-demand with no downtime. If a deploy failed, it would automatically roll back. We had multiple releases per day. On the team that I was on, 5 to 10 times a day was not unusual. I want to walk through how we got there because it took engineering and discipline to do this.

Descaling for Delivery

A thing that I talk about a lot is we don't scale agile. We descaled the problem so that we could deliver more frequently. This is how we did it. We started with domain-driven design. We had this giant warehouse management system with a bunch of entangled capabilities, no domain boundaries whatsoever in the code. We went to the whiteboard and we said, ok, let's draw what the business capabilities are that we need to implement in the system. This isn't all of them. It's a representation. We also set some rules, because we're doing this on the fly, we're learning. None of us are experts in these things we're doing. It's the first time we applied strategic domain-driven design, and so we came up with a rule. For each of these boxes, we're going to write a sentence to describe what it does, and if we have to have a conjunction in the sentence, it's probably too big, and we need to split the domain so it gets down to something a team can consume. For example, receiving. We had a lot of discussion and negotiation: what does receiving actually do in a distribution center? You'd be surprised how many differing opinions we had. It took negotiation among the engineering team. We said this is what receiving does. We defined the inputs and outputs. We did this for all of those boxes. We drew a high-level domain diagram of what those interactions should be, so that we could tell where we might have teams that have too many dependencies, and start laying out contracts between those teams. Then we used the reverse Conway maneuver. We decompose the system by decomposing the teams to represent the system we want in the future. If you're not aware of Conway's Law, it is true, and I've lived it. I recommend you take it seriously, because it's true whether or not you take it seriously. It's just that if you don't, you'll get hurt by it.

Strangling the Monster, and Practice Definition

This is where we were, we had our legacy nightmare, with all of these feature teams just making random changes. When I tell you that this is like a tightly coupled system, I can make a change into a receiving application that would make an unexpected behavior change in invoicing, which is at the other end of the business flow. It was a complete nightmare to support. I did this for 15 years, so I don't recommend it. What did we do? We started pulling out the product teams. Say, ok, we're going to build a team and we're going to start assigning them capabilities. Then we built an API layer on top of that legacy code to start exposing those behaviors. We built a team who's responsible for the API layer. Then we built a small platform team. We started building a team that was templating Jenkins pipelines, so that the different product teams could pull in that baseline template of how we wanted to deliver in our environment, because we're delivering to metal at the edge. Then start making additions to those pipelines as they needed to for their individual products. Then we defined some practices that worked as forcing functions to drive the outcomes we wanted. Number one, you had to be independently deployable. If you had to go and coordinate an install with another team, you're engineering your solution incorrectly. Engineer to handle those dependencies in the code instead of with process. You're only going to coordinate through APIs. Define your contracts.

Get those aligned first, then go and build. We wanted you to have 90% code coverage, because we hadn't ever tested before. To make sure that you're testing: 90% code coverage. Don't ever do that. The outcome was that when we went and inspected what those tests looked like, we had a bunch of tests that tested nothing. I would rather have no tests than tests that just assert true and make me think the code was tested. I can't stress this enough. I've seen this pattern repeated by management again and again, and it's always a bad idea.
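To make the difference concrete, here is a minimal Python sketch, not taken from the talk, contrasting a test that only raises coverage with one that checks behavior. The function `receive_quantity` and its business rule are hypothetical and exist only for illustration.

```python
def receive_quantity(on_hand: int, received: int) -> int:
    """Hypothetical rule: receiving adds stock; negative receipts are rejected."""
    if received < 0:
        raise ValueError("received quantity cannot be negative")
    return on_hand + received


def test_coverage_only():
    # Executes the code, so line coverage goes up, but it verifies nothing:
    # this test still passes if receive_quantity returns the wrong number.
    receive_quantity(10, 5)
    assert True


def test_behavior():
    # Fails when the rule is broken, which is the only reason to keep a test.
    assert receive_quantity(10, 5) == 15
    try:
        receive_quantity(10, -1)
    except ValueError:
        pass
    else:
        raise AssertionError("negative receipt should be rejected")
```

Both tests pass today and both count toward a coverage number, but only the second one would catch a regression.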

Continuous Delivery (CD)

Now we've decomposed our teams, or at least started down that path, we started building a platform. We still have the problem of how do we do continuous delivery because that was our goal. CD doesn't just happen because we have a pipeline to go to production. You build a deploy automation, but continuous delivery is a workflow. What I did was, I took the book, I went through the CD maturity model that's in the book. I started peeling out all these different capabilities in the maturity model and built a dependency tree. How do we do these things? What has to happen to enable something else? This is color coded here because some of these things are automation, just straight-up tool. Most of it is either behavior or automation that only works if we're behaving the right way. I use this to start driving the roadmap of, on those pilot teams we need to learn how to behave in a way that will allow us to deliver daily.

Then we establish a North Star metric, because, again, one of the things I run into all the time is organizations say, we're going to do CD, but we're a special Snowflake. Again, in a perfect world, it says in the book to do this, but not in our environment. It's 100% false, because I've worked in every context you can imagine. It's always true, you can do CD. We established this metric. If it prevents us from delivering today's work today, it's a defect. We have to fix the defects. Here's where we started. We had a 3-day change control process, very hard to deliver daily that way. We had an external testing team, which is always a problem. Long-lived feature branches, because we were used to having branches open for, wasn't a sprint, we were having branches open for a quarter. At one point, I was responsible for merging those branches every quarter. There was like 400 branches merging to the trunk every quarter. Then, of course, conflicts, and bad tests. We had really vague requirements of, developers, you can figure that out during development. Nobody's ever heard that but me, I'm sure. A lot of cross-team dependencies we had to solve.

Minimum Viable Process

What did we do? We started with minimum viable process. We're going to throw away all of the ceremonies that SAFe imposed upon us at the team level, and figure out how to just do the things that actually add value to delivery. This is what we found added value. We didn't have a product owner give us requirements. We had a product owner on the team. We worked as a team to say, ok, here's the next feature we need to do, let's work together to decompose this feature into something manageable that we understand. We did a lot of pairing to transfer knowledge across the team. On one of the original pilot teams, the one I was leading, we had 13 engineers, a BA, and a product owner. Some of those engineers had deep domain knowledge. Me and I think one other person had domain knowledge. We had a lot of new hires on the team: some who were really good coders, some not very collaborative, really good coders, some who couldn't code very well at all. We had an average team. We didn't create some unicorn team to get this done.

We had synchronous code reviews. We found that pairing or doing synchronous code reviews took a lot of time out of the flow and made it easier to do things like continuous integration. We met on-demand. We didn't have very many scheduled meetings. When the backlog needed refining, we said, let's go meet and refine the backlog. We threw away 2-week sprints. Those became a hindrance. Eventually, we were delivering at such a pace that we were just pulling and shipping. Sometimes we were pulling an idea from Slack, from a distribution center, and shipping that the next day, because that was just the highest priority, and so we shipped it. We stopped doing bi-weekly retrospectives. This is a pain point I see all the time: teams go and have a retrospective at the end of the sprint, they put a bunch of posters up of things people are concerned about, and three of them get talked about. Everybody else is just disenfranchised, we never talk about those things again. I wrote up the rules for continuous integration in the area, and every day was like, what do we have to fix to get closer to this? We just did daily inspect and adapt on our process and just tried experiments all the time: how do we get this better?

Continuous Integration (CI)

That was really the first problem, how do we CI? You cannot do continuous delivery if you are not doing continuous integration, and 80% of the problem is that. We established some rules. The trunk is always deployable. That doesn't mean that the trunk we protect it and only merge after we've done a whole bunch of stuff on other long-lived branches. We are branch and pull because of compliance. We're going to integrate to the trunk at least daily. We only commit tested code. You're not allowed to test it after that fact. All of the tests required to validate that code are part of your commit. Again, everything merges to the trunk daily. Broken builds are our highest priority. If the pipeline is broken, if it's red, we can't validate the next change, so there's no point in making the next change. Stop what we're doing, fix the pipeline as a team. That's also important. I didn't break the build, we as a team failed to harden our processes enough, and it allowed me to break the build. We as a team fixed it, even if we have to stay all night to get it done.

We had to solve these problems. Those are problems to solve. We don't go from nothing to that immediately. Why can't we integrate daily? Large coding tasks, with stories that are 5, 10 days long. Long-lived feature branches, again. This feature-complete mindset: it doesn't have any value until the entire feature is done, so I don't want to integrate it. It's false. The value comes from knowing we didn't break anything. Our testing sucked, even with 90% code coverage. Again, don't mandate code coverage. You have to fall passionately in love with testing to do CI, and we had to learn how to test. What was wrong with our tests? Again, we had this external testing team, and they were on contract. The quality department had outsourced testing to India. They would get requirements, they'd write tests. They were really bad tests. When we started taking over tests, we tried to salvage some of the effort they'd done, read through it, and it was like, this is garbage. We now test way better than this, we're throwing this all away. Our stuff is much better. Because of the way they were doing the testing, tests lagged development. We can't have that, because we have to integrate tested code. The tests were flaky because they were end-to-end tests, instead of deterministic tests that are stateless. We had a bunch of pointless unit tests, a bunch of tests of getters and setters in Java.

I was having a discussion with a friend of mine about this, we were discussing this problem. It's like, "We've tested that Java works," but, the code coverage. I told the team to delete those tests, they said, "But our code coverage will drop." I'm like, "I don't care, I will tell management that that's fine. I will take the heat, but those tests are pointless, and all they do is slow down the pipeline and clutter everything up." We needed to fix our testing. Again, we were all learning how to test because we'd all grown up on untested code. We had these vague requirements, which make it very difficult to write meaningful tests. I literally did a survey of some other developers in the area. It's like, why can't you CI? They said, we're doing exploratory coding, and getting feedback, and then we'll try to bolt the tests on later. They were doing exploratory coding because we couldn't get good requirements. We had to fix that problem, because that's why we couldn't test.

Behavior-Driven Development (BDD)

We started with behavior-driven development. This is the most powerful tool that you have if you have these problems with testing and requirements. Nobody told me about this tool. I went and found it because I was trying to solve this problem. This was so useful that, as a side project hobby, I went around the organization giving presentations on, "You really need to look at BDD. This solves so many problems." What does BDD do? It defines our requirements as acceptance tests. We have these scenarios that are testable outcomes that will behave exactly this way. It's a contract with the end users, or the business in our case, that we have come up with these acceptance tests together. We agree it should behave this way. If we install it in production, and that's not what you wanted, we'll just change the scenario. It's our fault, so don't point fingers. This is our definition of done. It also helped us spread the thought process of testing behaviors and not implementation, of pulling it away from the code and into, it should act this way. It just taught people the mindset of how to break things, because we're always in meetings refining these tests, and it took no tools. It's whiteboard. We write down scenarios on whiteboards. We cross out scenarios that shouldn't work that way, then we add it to a Jira task and say, ok, here's a coding task. It also gave you thin vertical slices. People are told, ok, with stories, slice the stories, but no one ever tells people how. Behavior-driven development is how. You get these vertical slices of behavior that you can break down into tiny things. That was another thing we did. We timeboxed this. We put a constraint on it, a forcing function for clarity and quality. Can we complete this in less than two days? We threw away story points, it's pointless. We've met together as a team. We refined the work. If we all agree that one of us could pick this up, and it would take two days or less, then it was ready to start. We'd throw it on the ready to start list.
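As an illustration of a whiteboard scenario turned into an acceptance test, here is a minimal Python sketch. The talk does not show any scenario text, so the domain objects (`PurchaseOrder`, `receive`) and the over-receipt rule are assumptions invented for the example. The Given/When/Then steps live in comments so the test reads like the scenario it came from.

```python
from dataclasses import dataclass, field


@dataclass
class PurchaseOrder:
    # Hypothetical domain object: cases expected per item, and cases received so far.
    expected: dict
    received: dict = field(default_factory=dict)


def receive(po: PurchaseOrder, item: str, cases: int) -> str:
    """Record a receipt and report the outcome the scenario cares about."""
    if item not in po.expected:
        return "rejected: item not on purchase order"
    total = po.received.get(item, 0) + cases
    if total > po.expected[item]:
        return "rejected: over-receipt"
    po.received[item] = total
    return "accepted"


def test_over_receipt_is_rejected():
    # Given a purchase order expecting 10 cases of item A
    po = PurchaseOrder(expected={"A": 10})
    # And 8 cases have already been received
    assert receive(po, "A", 8) == "accepted"
    # When 3 more cases are received
    outcome = receive(po, "A", 3)
    # Then the receipt is rejected and the received count is unchanged
    assert outcome == "rejected: over-receipt"
    assert po.received["A"] == 8
```

The test asserts only on observable behavior (the outcome and the received count), so the implementation of `receive` can change freely as long as the agreed scenario still holds. One scenario like this is also a natural thin slice of work.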

Now we have thin slices, starting to learn the testing mindset, but we still have to have an effective pipeline. I tell people that continuous delivery pipeline's job is not to deliver code, it's to prevent bad changes from going into production. Your job on a product team is to harden your pipeline. How many people have DevOps engineers who write pipelines for teams instead of teams running their pipelines? Massive anti-pattern for quality, because now I have to open a ticket to add a quality gate. I don't want to do that, because it slows me down, instead of me just pushing a change to my pipeline, to add a quality gate or improve a quality gate.

Designed the Pipeline for Operations

We had to focus on our pipelines. Another mistake people make about CD is they think it's about feature delivery. I've carried a pager my entire career. For me CD is about responding to incidents and being able to safely repair production. Say there is an incident in production, and we have this process where we're going to make a change and send it to QA and they're going to do testing for three or four days. We'll open a change control, go through a change control process for another three or four days. At 3:00 in the morning, do we do that? No, we just bypass all of our quality process and slam it in and cross our fingers. Or we use continuous delivery and make sure the pipeline does all that, and pull all the process out and continuously harden our pipeline so we can safely deliver. On the team level, we needed to make sure that each one of our services could deploy independently of every other service, so if one of those services broke, we didn't have to couple two services together. If two services need to deploy at the same time, that's one service, merge them. We needed deterministic pipelines. For CD to function, the pipeline defines releasable. If it says it's good, that should be enough. If it's not enough, we fix the pipeline. If you don't trust the pipeline, if it keeps failing, and you rerun it, and then it passes again, pull out the flaky tests, remediate them, trust your pipeline. We timeboxed it: the pipeline needs to run in less than 60 minutes. Literally, I had a problem with Sonar for a while where the database was corrupt, and Sonar was taking 50 minutes to run. I just stopped using Sonar until they fixed it. We've got other linting, we'll pick up the pieces later, that's not going to be a big impact. You have a single process for all changes. I can't stress this enough. If you're doing CD correctly, you always use your emergency process. Then we started on test design. I was talking about the problem we have with testing language, where people say, "Yes, we do all these unit tests." They don't actually know what a unit test is, and they wind up over-testing and testing implementation. We needed tests that were stateless, deterministic, and run in the pipeline.

We had contract tests with OpenAPI, validate that the schema was correct. We spent most of our time on business flows. Martin Fowler calls these sociable unit tests. These are again, stateless tests, but they're not testing a function or a method. They're testing a broader behavior. BDD lends itself very well to that. Because we know what this broader behavior is, we could write a test for this scenario, and you automatically get this. Then we had unit tests that were focused on specific functions but risk based. This function is relatively complex business logic, we need to test this in isolation and verify that it's good. Sure, we had some end-to-end, but we didn't run in the pipeline, it was not a delivery decision gate. It was something we'd run on a schedule outside the pipeline, we used that to verify things like, are our contract mocks correct? Doing that meant that we had a deterministic pipeline. Remove drama from delivery, I don't have to worry about it. Because again, this is the secret, it's always an emergency. All we do with features is verify our emergency process.
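A contract test of the kind described here can be very small. The sketch below is an assumption-laden illustration, not the team's code: `RECEIPT_SCHEMA` is a hypothetical fragment written in the style of an OpenAPI object schema, and `contract_violations` is a hand-rolled checker that covers only required properties and primitive types. A real project would more likely use an existing schema validation library.

```python
# Hypothetical schema fragment, shaped like an OpenAPI object schema.
RECEIPT_SCHEMA = {
    "type": "object",
    "required": ["receiptId", "item", "cases"],
    "properties": {
        "receiptId": {"type": "string"},
        "item": {"type": "string"},
        "cases": {"type": "integer"},
    },
}

# Only the primitive types this sketch needs.
PY_TYPES = {"string": str, "integer": int, "boolean": bool, "object": dict}


def contract_violations(payload: dict, schema: dict) -> list:
    """Return readable violations; an empty list means the payload conforms."""
    problems = []
    for name in schema.get("required", []):
        if name not in payload:
            problems.append(f"missing required property: {name}")
    for name, spec in schema.get("properties", {}).items():
        expected = PY_TYPES[spec["type"]]
        # Exact type match, so True is not accepted where an integer is expected.
        if name in payload and type(payload[name]) is not expected:
            problems.append(f"wrong type for {name}: expected {spec['type']}")
    return problems
```

Because the check is stateless and needs no running service, it is deterministic and cheap enough to run on every pipeline execution.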

Process vs. Engineering

Now we've got a team, this team can move quickly, as long as we don't have to be tightly coupled to other teams. Now we had to break up the release trains. I hate release trains. Trains are the least agile form of transportation. What do we do? Replace all the process with engineering. Instead of managing our dependencies with process, we handled it with code. We did it with contract-driven development. We'll talk about that. We manage releases with configuration. We built a small configuration management system, so that we could turn on and off features that the business wanted to hold off until training was done. We still delivered it, and made sure it didn't break anything. We made it so every team could deliver its own pace. We didn't have teams tied together to the slowest team. "We can't deliver anything because this other team's not ready yet, so we'll just leave that in staging, and then hope no one accidentally deploys it." Instead of PI planning, we focus on product roadmaps. PI planning is a huge waste of time. If it's not a huge waste of time, then you should be doing waterfall anyway, because you have a high certainty of everything's going to be exactly that in the first place.
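Managing releases with configuration can be sketched as a feature toggle. The talk does not describe how the team's configuration management system worked, so everything below is hypothetical: the `FeatureFlags` class, the flag name `new-label-format`, and the site identifiers are stand-ins chosen for the example.

```python
class FeatureFlags:
    """Tiny in-memory stand-in for a configuration service (hypothetical)."""

    def __init__(self, enabled_by_site: dict):
        # Maps a site identifier to the set of features enabled there.
        self._enabled = enabled_by_site

    def is_on(self, feature: str, site: str) -> bool:
        # Default to off: deployed code stays dark until someone enables it.
        return feature in self._enabled.get(site, set())


def label_for_pallet(flags: FeatureFlags, site: str, pallet_id: str) -> str:
    """Both code paths are deployed; configuration decides which one runs."""
    if flags.is_on("new-label-format", site):
        return f"PALLET:{pallet_id}:v2"
    return f"PALLET:{pallet_id}"
```

For example, with `FeatureFlags({"dc-6001": {"new-label-format"}})`, site `dc-6001` gets the new label format while every other site keeps the old one, even though all sites run the same deployed build. Deployment and release become separate decisions.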

Replacing Release Trains

How did we engineer around the release trains? We did contract-driven development. We set some standards around this. You don't make breaking contract changes. You don't delete properties, rename properties, move properties, everything's evolutionary. If you did the contract wrong the first time and need to make changes, yes, you're going to accrue tech debt. Allow for the tech debt but focus on your consumer. Your consumer is more important than your tech debt. We used OpenAPI contract mocks, not email or Confluence. This is the machine-readable definition of the contract change that you can use for your testing. Then the providers tested their contract. They defined the mock based off of what the contract negotiation was, and then write contract tests against the mock to make sure that they're not breaking their contract accidentally. Then they published the mocks so that the consumers can consume the mocks, and then write consumer contract tests based off of we're both testing against the same thing. Later, we'd reinvented virtual services, because we didn't know that was a thing. There was a team that I was working with, that to fully test our integration, we'd have to do about four hours of setup. We'd have to create a dummy purchase order in the system, flow that entirely through the system, wait for all those batch jobs to run, so that they could pull it down so that my team can try to receive against it. That actual receiving test took no time at all, but all that setup meant that that test was incredibly expensive to run.
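The "no breaking contract changes" rule lends itself to an automated check. The following Python sketch is one possible way to do it, under the assumption that a contract can be reduced to a nested object schema. The helper names `property_paths` and `breaking_changes` are invented for illustration. It treats deleted, renamed, and moved properties identically, because to a consumer each one means a path it used to read has disappeared.

```python
def property_paths(schema: dict, prefix: str = "") -> set:
    """Flatten an object schema into dotted property paths, e.g. 'shipment.carrier'."""
    paths = set()
    for name, spec in schema.get("properties", {}).items():
        path = f"{prefix}{name}"
        paths.add(path)
        if spec.get("type") == "object":
            # Recurse so that moving a nested property is detected too.
            paths |= property_paths(spec, prefix=f"{path}.")
    return paths


def breaking_changes(old: dict, new: dict) -> list:
    """List property paths present in the old contract but missing from the new one."""
    return sorted(property_paths(old) - property_paths(new))
```

Adding a property leaves the result empty, so evolutionary changes pass. A check like this could run in the provider's pipeline against the previously published contract, failing the build before a consumer ever sees the change.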

We submitted a pull request to them, to give them contract tests. Also, to make it so if we sent them a request with a test header in the HTTP header, that we would get the mock instead of actually hitting the business logic, so we could just test the integration. I took that test from hours to actually milliseconds. We can just test it all the time, and we did on every single run. We were testing that pre-commit. In fact, there was a time where they accidentally broke their contract and we caught it in minutes, because we ran our tests and it blew up. This was while we were trying to go live in a distribution center. If we'd rolled that out there, everything blows up. There's mud on all of our faces. Try to figure out what broke. No, we had tests for it. Doing that meant we didn't have to collaborate at all other than focus on that contract. It allowed us to do this, make a small change ship, and that's what we should be doing all the time. What's the smallest change we can make? The one piece of advice I have for anybody on the business, is drive down the cost of change, just focus on how much does it cost to make a change, and keep shrinking it. You have to fix everything to get that done. Now we had teams who could, in theory, do continuous delivery, though we weren't yet. We figured out a way to decouple the teams so we could run independently. We had some other things to take care of.
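The test-header idea can be sketched in a few lines. This is a simplified illustration rather than the actual change the team submitted: the header name `X-Contract-Test`, the canned payload, and the handler signature are all assumptions, and the real lookup is replaced by a stub that simply raises.

```python
MOCK_HEADER = "X-Contract-Test"  # hypothetical header name; the talk does not give one

# Canned response matching the agreed contract (illustrative values only).
CANNED_PURCHASE_ORDER = {"poNumber": "TEST-0001", "item": "A", "cases": 10}


def lookup_purchase_order(po_number: str) -> dict:
    """Stand-in for the slow path that needs real data and upstream batch jobs."""
    raise RuntimeError("real lookup requires hours of upstream setup")


def handle_get_purchase_order(headers: dict, po_number: str) -> dict:
    """Provider-side handler: serve the published mock when the test header is set."""
    if headers.get(MOCK_HEADER, "").lower() == "true":
        # Return a copy so callers cannot mutate the shared canned response.
        return dict(CANNED_PURCHASE_ORDER)
    return lookup_purchase_order(po_number)
```

With the header set, a consumer exercises the real network path and the real response shape without any of the data setup, which is what turns an hours-long test into one that can run on every build.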

Clearing External Constraints

Business didn't trust us. We had years of history of blowing up production. We had these brand-new things we're trying to deploy, and they didn't want us to deploy the brand-new things, even though they weren't going to be used right away, because you might blow up production. We had the QA area demanding that we test the way they want to test. We had external support team, who's like, you need to be stable for 30 days, and then we'll take over support of your feature. All of these are huge problems. I tell people that continuous delivery is not a technical problem. It's solved in every environment. The tools are out there. Everything else is behavior. It's people. It's all a people problem. It's a relationship problem. CI on a team is a relationship problem. When you're trying to work with other stakeholders, building those relationships and building the trust, so what do we do? After about 3 months of pleading, I got them to let us install. We demonstrated to them that we had a high-quality process. We can recover quickly from failure. That we didn't fail very often, and we weren't going to break anything. Or if we did, it would be broken very shortly. Negotiated with quality engineering so that my team could take over our quality process, and had to go over some heads to get this done. Again, at the beginning of the story, had a VP who wanted this outcome. Having that air cover meant that they would talk to us. Very important. I had a manager from QE say, "We're going to come audit you, we're going to make sure you're testing correctly." "Please, we're trying to learn how to do it better. We're doing it better than your contractors are. If you can come show us how we're screwing up, you're not going to hurt our feelings, we're just going to get better. Come audit us." Strangely, they never did.

How do you get rid of the external support team? We went back to where we were when I started at Walmart. I own the outcomes of my decisions. Quality is an outcome of the fear of poor quality. I test well not because you're going to give me a coverage target, I test well because I'm going to get woken up if I don't. Again, this is not something I tell people they should do, and I don't own. I own my code. Ownership is not holding people accountable to testing. It's saying that I own the problem that you gave me, I own how it's solved, and I own the consequences of my decisions. It's incredibly empowering and it gives high quality outcomes.

Metrics that Mattered, and Effective Leadership

We also needed to measure stuff. We built metrics around the things that were forcing functions for improving how we were working. Code integration frequency. We had a 60-inch monitor in our area with dashboards running that we actually cared about, because we were trying to do CD. The whole team was aligned on this. How frequently are you integrating code? We need to be on average at least once a day, preferably more often, so we tracked it. We looked at our development cycle time, the time it would take to go from started, to deliverable, to ultimately delivered. The goal is, how long does it take to finish work? When I left the team, the team was still improving things. I walked by the area one time, and on their whiteboard, they had this week's development cycle time. It was 0.89. It's less than a day on average. They were rocking it. We were tracking the build cycle time: how long is the pipeline running? Again, we want that less than 60 minutes. We want CI even faster than that. We were measuring delivery frequency. You have to balance that. You have to have guardrails with quality, so we were measuring how many defects we were creating over time to make sure we weren't going off the rails. Those metrics allowed us to have the tactical view we needed to keep pushing for going faster.
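Two of these metrics are simple enough to compute by hand. The sketch below shows one plausible definition of each. The record shape (`started` and `delivered` timestamps) and the function names are assumptions, since the talk does not say how the dashboards were built.

```python
from datetime import datetime
from statistics import mean

SECONDS_PER_DAY = 86400


def development_cycle_time_days(work_items: list) -> float:
    """Average elapsed time from 'started' to 'delivered', in days."""
    durations = [
        (item["delivered"] - item["started"]).total_seconds() / SECONDS_PER_DAY
        for item in work_items
    ]
    return round(mean(durations), 2)


def integrations_per_day(merge_times: list, working_days: int) -> float:
    """Average number of merges to trunk per working day in the period."""
    return round(len(merge_times) / working_days, 2)
```

For instance, one item finished in 12 hours and another in 24 hours average out to a cycle time of 0.75 days, and six trunk merges over a five-day week come to 1.2 integrations per day.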

We also again had effective leadership. We had a VP who was a cheerleader for what we were trying to accomplish in the area. He created a meeting every two weeks, so people could share what they were doing and its good practices. He was doing everything he could to help out. We had a leadership change, a director who put us on a death march, and he was burning people out. We went to that VP and he micromanaged that director into submission so that we could actually deliver things. He gave us air cover for other VP areas we need to talk to them and try to collaborate. If you don't have effective leadership, you're going to be very frustrated. Find a good leader. Convince them this is the right thing to do. Show them all the data. Get them on your side, and then run. Run as fast as you can before they change.

Outcomes

The outcomes we saw from this. We had improved teamwork. The people who originally worked on this back in 2015, we're still friends, because we were just working so well together. Much faster feedback from the users. We would deliver stuff, send a Slack message to the beta users in the distribution center. They would give us feedback. Sometimes it was like, it doesn't work the way I expect it to. Or sometimes they'd come back with ideas. We'd look at that idea, see if it fit in with the product vision. Then if it did, we could spin around and change it in 24 hours, and say, what do you think? Like, yes, that's cool. How much power is that? It's such an awesome way to work. We had far fewer defects. I didn't get woken up. Our stuff just worked. We didn't have any unused features because we weren't planning for the next quarter's worth of features and just going in and implementing them in 2-week sprints blindly. We were saying, "Here's the next feature, what do you think?" If they didn't want it, we'd just delete it because it cost nothing. We had a consistent flow of work. We were dependable because all of our work was small. We just flowed work through, and we just fell into Kanban. Because there's no point in planning for two weeks. That's too big of a piece of time anyway. If you're doing sprints, one week, not two weeks. Two weeks is way too long. We were just doing Kanban because, again, business priorities were changing faster than one week for us, so we just flowed work. This graph here is literally the Grafana graph of us tracking deploy frequency over time. The different colors are just different services deploying, so it's just stacked. We had that up there on the board and the team was really proud of it. This is the best outcome. In 2017, a teammate and I at Walmart gave a talk at DevOps Enterprise Summit, "Continuous Delivery: Solving the Talent Problem." This is the reason that I spend so much time talking about continuous delivery. This was an unexpected outcome that we saw. It's the reason that I'm on LinkedIn ranting about this stuff, because the work should suck less.

Lessons Learned

The lesson we learned from this is that agile process doesn't get us anywhere. There are some agile processes that are useful. Use the ones that are useful to your context, to your problem. If it can be done with engineering, it should. CD was the best tool for uncovering those problems. The simple question of why can't I deliver today's work today, just hammering away at that problem every single day, you find problems inside the team with communication, teamwork, people saying, this is my work for the next sprint. That will never get you there. It should be the team focusing on the highest priority together. You'll find constraints outside the team that you'll need help clearing, where it's like, "It's stopping us, so it's a defect. Let's go fix it." You're just hammering away at that one problem. You cannot keep a bad organization in place and do continuous delivery. Everything has to improve. Smaller changes. Shrink batch size. I cannot stress that enough. It's not too small, make it smaller. People who have learned to work this way will never go back.

During that talk in 2017, I had a slide, and I said, how many managers are in the room? It's DevOps Enterprise Summit, it's not a developer conference. The whole room raised their hands. I said, don't send a team down the path of trying to do continuous delivery, and then get cold feet and take it away. They'll keep pushing for it, but maybe not for you. The punchline to that was I had left the area and moved to platform. A lot of that had to do with a leadership change and lack of focus that destroyed CD. My co-presenter had left the company in disgust, and he was in his two weeks during that presentation. This is why leadership is far more predictive of results than team knowledge. We didn't have any CD coaching when we started this. We didn't have any coaching on any of the things we talked about. We just learned it. The talent is there. They just have to have the right problem to solve. The problem to solve isn't how to do PI planning better. The problem to solve is, why can't we deliver today? We learn the skills, but only with good leadership. Because if we don't have the leadership air cover, things fall apart. We lost the leadership air cover in that area, things fell apart, so we left for other areas that wanted to do it better.

In 2018, Accelerate came out. What Accelerate reported is that CD improves everything. If you have heard of Accelerate, you've probably heard of the DORA metrics. I used to push those really hard. I now have a paper out called, "How to misuse and abuse DORA metrics." I'd recommend reading it. What I do recommend is reading Appendix A of Accelerate. This tells you the problems you need to solve to get to be a high-performing organization. With that, solve the real problems. Don't focus on agile process. If we have to add process, we're probably making a mistake. Take CD seriously. If this is true, anything that prevents us from delivering daily is a defect. Start with CI, and I've got some help for that later. Optimize for ops and quality feedback. Don't optimize for, let's get this feature out the door. Optimize for operations and you will deliver features better. Use engineering. Solve the real problems in front of you, don't just apply silver bullets and dogma. This is a problem-solving industry we're in, not a copy-paste industry, unless you use ChatGPT.

Resources

At Defense Unicorns, what we're trying to do is we're trying to solve the hardest delivery problems in the world, where if you want to learn how to deliver to a water-gapped environment or like a submarine, you should come work with us. I'm very vocal about this topic because I really want developers' lives to suck less. I'm also a maintainer for minimumcd.org that we put together to try to help people with the problems they really need to solve to get to minimum viable continuous delivery.

Questions and Answers

Participant 1: Going back all the way to the BDD section, how much of that benefited from doing BDD, and having a ubiquitous language, and things like that. A lot of teams that tend to do BDD, the developers would see there's a whole lot of overhead to have to do the development in one place than multiple?

Finster: As far as the overhead, I think you're asking about the overhead of going through the BDD process. Because that was a really large team, and it's really hard to refine work from a low level to a high level with a big team, we actually came up with a process of breaking it down. With BDD, you'll hear about the three amigos. We literally pulled out a small group of people: it was me, someone who was actually doing quality engineering and testing on the team, and the product owner. We'd go through and refine with acceptance tests to about 70%. Then, we'd come to the team for editing. It's like, "This is what we've done so far, what have we missed," and walk them through it. We mitigated a lot of that overhead that way. The work is not typing, the work is understanding what to type. Handing people stuff and having them type it means, if we go back to my slide on what the real problem is, that we'll misunderstand what it is we're trying to do. We were able to deliver much faster because we did that upfront work.

Participant 2: How long did this effort take? How did you distribute this approach across all the different teams? How do you define this social unit test?

Finster: The timeline we're talking about here was about two years. It took about a year for my team to get to where we could deploy daily, from not deploying at all. The approach we took was, we were taking the lessons from a couple of pilot teams, and then sharing those lessons as we stood up other teams. We were rationing pipelines. We didn't just hand pipelines out to everybody. The pipelines came with lessons learned. Once that leadership changed, the new senior director said, we've got this goal where everyone is going to be doing CD by the beginning of Q1. He forced the platform team to just distribute pipelines, which then resulted in children with Ferraris, who didn't understand all of the lessons learned, because they weren't brought up to speed. The quality fell off, which then added process overhead, which meant that they went back to the old ways of working. Some of us left. You have to bring the tooling and the process together. Later I moved to platform at Walmart, where you've centralized your delivery platform tooling. My job was leading the DevOps Dojo, and what we did was pair with teams to help them move to continuous delivery with the tooling. The knowledge has to come along.


Recorded at:

Feb 20, 2024

Bryan Finster
