
How to Debug Your Team


Summary

Lisa van Gelder tells stories about how she debugged teams at three companies - Stride, Bauer Xcel & Meetup, and the surprising and unintentional consequences of not giving teams what they need to be successful: Mastery, Autonomy, Purpose and Safety. She shares practical examples of how to diagnose and change teams.

Bio

Lisa van Gelder is currently Senior VP Engineering at Spring Health. She has been in software for over 20 years, working in a wide range of companies from early stage startups to large media companies like the BBC, the Guardian newspaper & Meetup. She used to debug code, now she debugs teams for a living.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Van Gelder: I'm Lisa van Gelder. I'm VP Engineering at Spring Health, which is a little startup in New York that's all about removing the barriers to accessing mental health care. I've been in tech for about 20 years now. First, as an engineer. Then as a tech lead. Then as a manager. Then as a VP. Here are a few of the companies I have worked at along the way. There's something that they all have in common. That is, pretty soon after I join a new company, someone from the leadership team pulls me to one side and says, "How do we instill a sense of urgency into the team?" Really, what they're asking is, why is our pace of delivery so slow? Why is it taking so long to get things done? Is there something wrong with our architecture? Is there something wrong with our tech stack? Is there something wrong with our engineers? Do they not know what they're doing? Do they not care what they're doing? It's usually not the engineers. There's normally something else going on, on your team.

Debugging Teams

What do you do to a piece of software when it's slow? You debug it. What do you do when a team is slow? I would give you the same answer. With this talk, I'm going to tell stories about times when I debugged teams at three companies. That's Bauer Xcel Media, Stride, and Meetup. Probably, the first question you're asking is, what the hell is Bauer Xcel Media? It's basically the biggest magazine publishing company you have never heard of. You've never heard of them. They're like Condé Nast. They have 600 magazines, and 100 TV and radio stations worldwide. In New York, they had brands like Life & Style, In Touch, Women's Weekly. I was brought in to see what was wrong with the New York tech team. The New York tech team was the most expensive but lowest performing of all the Bauer Xcel teams globally. Velocity was going down, Sprint by Sprint. Engineers were coming in late. They were leaving early. There were literally times during the Sprint when no one knew where the whole team was, or what they were working on. They weren't answering emails, or Slack.

I joined. I'd actually joined to answer the question, is there something wrong with the engineers? The leadership team was asking me, basically, should we fire all the engineers, because they're not motivated? I started debugging the team. I went to all the ceremonies. I went to Sprint planning, to grooming, went to retros. I paired with the engineers. I realized pretty quickly there was something interesting happening at Sprint planning. The Scrum master told me he wanted to hold the engineers accountable for getting their work done. At the end of every Sprint, there was always a big pile of stories that weren't finished. It was really irritating to him. His idea was to say that every engineer at Sprint planning would sign up to which stories they personally would complete. If they didn't complete them by the end of the Sprint, they had to justify why in front of the whole team and their manager. The engineers hated this. They felt blamed. Bear in mind that sometimes the reason for not finishing a story was actually beyond their control. It could be they were waiting on another team, another person, a third party. It didn't matter. They were told to be scrappy, to push through and get it done anyway. What did they do? They started padding their estimates. The engineers took on less and less work every Sprint, because they wanted to make damn sure that they could get their stories finished and didn't have to explain why. Sprints went from Monday to Friday. By Thursday, most engineers had actually finished their work for the Sprint, but didn't want to say that to either their product manager or the Scrum master, because there was a danger they could be given a story they wouldn't be able to complete by Friday. What did they do? They hid around the building. They didn't answer Slack. In other words, something that was put in place to increase team performance, to really encourage engineers to get more work done every Sprint, had completely the opposite effect. It tanked velocity.

The Drive Framework

I like talking about debugging teams, and this framework is a really useful one for thinking about how to approach debugging your team. It comes from the book "Drive" by Daniel Pink. It's actually about individual human motivation, but I think it maps really nicely to team motivation. Usually, when your team is doing something surprising or unexpected, it's down to a lack of one or more of these things. They are mastery. That is, does a team have the skills needed to do their job well? Is the path to promotion clear? Autonomy. How much control do teams have over how they solve problems? Purpose. Is it clear why teams are working on things? Does everything ladder up to a real common goal for the company? Lastly, safety. This actually isn't in the book. This is something that I added. It comes from Google's research into high performing teams. Google did a ton of research into what makes some teams perform better than others. They looked at everything from personality to skill sets. In the end, what it came down to, according to them, was psychological safety. That makes sense. If you don't feel safe, you're not going to go for stretch goals. If you don't feel safe, you're going to cover yourself. Doesn't that seem familiar from Bauer Xcel?

Measuring Pace of Delivery

I'm going to talk about how I introduce change to a team. First, a quick note about measuring, because pace of delivery and speed are really subjective. If you can't measure them and you introduce change, it's really hard to know whether the changes you have introduced actually work. My favorite way of measuring pace of delivery is cycle time. That is the time between an engineer starting to work on a story and it being live in production. I love it because there is nowhere to hide. If it takes an engineer two days to get a story done, but then it takes three days for that story to get deployed to production, then that's a good bottleneck you can investigate as a team. A lot of people like using velocity as a measurement of pace of delivery. The trouble is that velocity can be gamed. Many years ago, I had a stakeholder who was really unimpressed with how many points my team was doing every Sprint. We were getting 10 points of work done. I said, no problem. I told my team, we're going to multiply every estimate by 10. A 3 point estimate became a 30 point estimate. Overnight, my team went from doing 10 points a Sprint to 100 points a Sprint. The stakeholder was thrilled, it was much more impressive. It was the same amount of work. The good news is that if you use Jira or Pivotal Tracker, cycle time is already calculated for you. You can just use it.
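As an aside, if your tracker doesn't surface cycle time directly, it's a very small calculation to do yourself: the elapsed time between "work started" and "live in production" for each story, averaged over recent stories. Here's a minimal sketch in TypeScript; the Story shape, field names, and sample data are assumptions for illustration, not any particular tracker's API:

```typescript
// Minimal cycle-time sketch: days from "work started" to "live in production".
// The Story shape and the sample stories below are made up for illustration.
interface Story {
  id: string;
  startedAt: Date;   // when an engineer picked up the story
  deployedAt: Date;  // when the story went live in production
}

const MS_PER_DAY = 1000 * 60 * 60 * 24;

function cycleTimeDays(story: Story): number {
  return (story.deployedAt.getTime() - story.startedAt.getTime()) / MS_PER_DAY;
}

function averageCycleTimeDays(stories: Story[]): number {
  const total = stories.reduce((sum, s) => sum + cycleTimeDays(s), 0);
  return stories.length === 0 ? 0 : total / stories.length;
}

// Example usage with made-up stories: one took 15 days, one took 17.
const stories: Story[] = [
  { id: "BAU-1", startedAt: new Date("2020-01-06"), deployedAt: new Date("2020-01-21") },
  { id: "BAU-2", startedAt: new Date("2020-01-07"), deployedAt: new Date("2020-01-24") },
];
console.log(`Average cycle time: ${averageCycleTimeDays(stories).toFixed(1)} days`);
```

The value of keeping the measurement this simple is exactly the "nowhere to hide" point above: any day a finished story spends waiting to be deployed shows up directly in the average.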

Safety

Now we can measure pace of delivery, so how do we apply the Drive framework to Bauer Xcel? First of all, Bauer had a safety problem. Engineers felt blamed if they didn't get a story done. Usually, the first thing you have to do if there's a safety problem is remove the safety problem. Because if engineers feel unsafe, it's really hard to motivate them to do anything else. In this case, it was simple. We ended the concept of tracking individual velocity. We said it doesn't matter if you don't complete your stories by the end of a Sprint. What's most important is that we track team velocity: get as many stories done as a team as you possibly can over that time.

Purpose

Then I said, "Now let's look at cycle time and see how we can reduce it." My team said, "Why?" Honestly, what does it matter if we goof off a little bit on Friday afternoons? Who cares? The answer was it actually mattered, hugely. Bauer was in the online advertising business. Bauer got its revenue from online advertising. They used Facebook for what was effectively paid acquisition. They paid for eyeballs on their stories. Facebook had just changed their algorithm, which meant it was suddenly much more expensive for us to get eyeballs on our stories. If we didn't figure out another way of attracting users to our stories, or other ways of getting online advertising revenue, the whole office in New York was actually at risk. No one had told that to my team. They didn't want to worry them. I sat down with my team and had a frank and difficult discussion about what was actually going on with the revenue. I literally linked all the stories we were working on to the company goals, and to how much revenue we had to make to keep the office open. It was scary. Now the engineers actually understood why we had to get work done, and what the real sense of urgency was from the company. We started to look at cycle time, and started to check our bottlenecks.

Mastery Part 1: Don't Assign Stories

We actually found one pretty quickly. We weren't assigning individual owners to stories anymore. What we were still doing was assigning people to own certain parts of the codebase. The idea was that every engineer would have a part of the codebase that they were the expert on. Only they would make changes in it. This makes sense, because if you know that part of the codebase really well, then you can make changes much more quickly and with good architecture. The trouble is, sometimes you want to make a lot of changes to one part of the code, but only one person can do them, so, bottleneck. Plus, what happens if that specific human is on vacation? If they get sick? What happens if they leave? We might have a real problem. We said that instead of assigning people their own parts of the codebase, every engineer should pick up the next story in the backlog. All stories would just sit in priority order. We said, if you pick up something in an area that you're not really so sure about, pair. Pair with someone else who knows the area a lot better. Short term, again, this did not do good things for velocity. We were pairing a lot. People were working on different areas of the code. Long term, it dramatically increased it, and also decreased the number of silos on our team. It also had another really great side effect. Back in the world of individual velocity, when you were holding people accountable for how much work they personally were doing, there were no incentives for seniors to pair with juniors, because if a senior took time away from their story to help someone else, it might hurt them later if they didn't get it done. Now, in this world of pairing, when we said it's much more important for the whole team to get work done, suddenly seniors were pairing with juniors. Knowledge was flowing amongst the team.

Mastery Part 2: Skills Matrix

I talked about juniors and seniors, but that was actually a sore point for Bauer. When I got there, there was no skills matrix. There was no definition of what it meant to be a junior, mid, or senior engineer. As a result, there were some people who had just finished a boot camp who had the title of senior engineer, and they were making some pretty big architectural decisions that they really didn't have the experience for. Things that should have been really simple ended up being wildly over-complicated and taking much too long.

Leveling

I knew that when I came in. I'd been warned by people on the leadership team that they didn't think some people were really as senior as their titles said. I put off making changes as long as I could, because taking titles away really messes with safety. I wanted to get some quick wins on the team first, to win some trust with cycle time. Eventually, we had yet another project that was wildly over-complicated and really late. I knew the time had come to level the team.

We were going to use a skills matrix. That is a definition of what it means to be a junior, mid, or senior engineer, all the way up to director. It's not just tech skills; other things are in there as well. For example, collaboration and mentoring. It's really important to me that seniors don't just stay in a corner, but actually help others grow as well. Using the skills matrix, I leveled the team. Titles changed. Salaries changed. Some people left. The ones who stayed were grateful. The seniors were grateful because their seniority was actually recognized. It's actually really demotivating, if you're a senior with 10 years' experience, to watch someone come in with a fraction of your years and expertise and get the same title and salary as you. The juniors were grateful, because finally there was a path to career progression. Before then, there was nothing really motivating them to learn. They didn't know what they had to do to get promoted, so why bother? Whereas now, there was actually a clear path, both to track where they were right now, and also what they had to learn to get to the next level.

Change Toolkit

I talked a bit about some change I introduced. I'm going to step back for a minute and talk about the change toolkit that I used, or how I introduced change to a team. First of all, note that every team is different. You really have to approach every team with fresh eyes and understand from the team, what problems there are to solve. Here are some of the ways I do that.

Open Questions

First of all, whenever I join a new team, I come in and ask a set of open questions of everyone on the team. That would be stakeholders, product managers, engineering leaders, tech leads, architects, every engineer on the team, designers. I ask the same questions of everybody, including my favorite, which is, "If you had a magic wand, and you could change anything at your company, what would you change?" The interesting thing about this is, you see what people's main pain points are. You can also start to track trends, like, is everyone worried about the same things? Something I often see is that the leaders of the team, say the tech lead, are worried about pace of delivery, but the engineers are worried about bugs or tech debt. It gives you good places to dig in.

Pair Programming

Pair programming is a fantastic way to actually see, what's going on amongst the engineers on the team themselves. I will try to pair with engineers as much as I can. Some of the things that I look out for would be, how clear is it to the engineer what they have to do for the story? Do they have to get permission from somebody to make changes? How long does it take to run through the CI/CD pipeline? How hard is it to deploy? How much are they interrupted as part of the team?

Retrospectives

Retros are also a fantastic way to see what's going on, on the team. You literally ask your team, what are your pain points? The team will tell you. It's brilliant. Because it's their pain points, they are super motivated to come up with ideas to help you solve those problems. It's pretty easy to run a bad retro. I'm sure all of us have been in those too. A bad retro is a bit like therapy. You go. You have a bit of a vent, but nothing really changes. A retro is only useful if it has actual action items that come out of it, and if the team can see it as a useful tool for change. It's really important to have action items, and to make sure you have follow-through and things actually change as a result. It's also only useful if you're really talking about the pain points of the whole team, not just some key influencers like the tech lead. That's why I love doing voting, to make sure we really are talking about the top concerns of the whole team. It's also important to get the voice of the whole team. Again, if someone influential like the tech lead is the only person speaking, you aren't really getting the whole team working with you on fixing the problems.

Implementing Change - Nemawashi

Now I have some idea of what's going on with the team. Here are some ways I've actually implemented the changes. First of all, Nemawashi, which is the informal process of quietly laying the foundation for some proposed change or project by talking to the people concerned, gathering support and feedback. Or, in other words, I never go into a big change cold. Before introducing something, I'll always talk to the main stakeholders and influencers on a team to make sure that they are behind the change I'm trying to introduce. An example at Bauer was when I introduced the skills matrix. I also introduced a performance management program, which had never happened before. I introduced 360 feedback for everybody, and SMART goals. I introduced SMART goals first to the team cynic. That is the person who told me that SMART goals were useless and he would never do them. He was one of my key bottlenecks on the team, one of the most super senior engineers. He spent all of his time solving problems for other people. He had a real hard time getting his own work done. I set goals with him all about knowledge sharing, about teaching other people how to solve problems, so he wasn't always interrupted. He loved that. Then when it came time to introduce SMART goals to the rest of the team, the team cynic spoke up, and said, "Actually, I thought these things were useless, but Lisa did them with me and they were fun." That really helped the rest of the team get buy-in for those SMART goals.

Give Engineers Power to Control Change

Change is really scary when it is done to you, as opposed to something that you feel that you can control. As much as possible, I try to give engineers the power to control the change. An example with the skills matrix again: when I introduced it, every engineer at Bauer was able to give feedback on the skills matrix and to help define their own level before it was used to level them. Similarly, they controlled the timeline. They told me they wanted a grace period of three months. If someone was leveled below their current title, say they had the title of senior but we leveled them as a mid, they would have three months to work with their manager to put goals together to see if they could keep their current title.

Kaizens

Kaizens are another great way of letting teams control the change. Kaizens are small experiments that lead to continual improvement. When I want to change something, I'll set the direction I want the team to go in, but not what they have to do to change. At Bauer, I said I wanted to reduce our cycle time by 50%. Cycle time when I got there was about 15 days, on average, to get one story to production. I said, "Let's reduce that by 50%." That's a pretty big gap. A Kaizen is one small experiment to reduce it. Every Sprint, I challenged the team to think of one small thing that could reduce cycle time by 10%. Incrementally, those small changes add up to the big improvement that you want.
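To make the arithmetic behind that concrete: if each Kaizen really does shave 10% off the current cycle time, the reductions compound, so it takes roughly seven successful experiments to get from the 15-day starting point down to the 7.5-day target. A rough sketch of that compounding (the 15 days comes from the talk; everything else is illustrative):

```typescript
// Rough sketch: how many successive 10% reductions does it take to halve a 15-day cycle time?
let cycleTime = 15; // days, the starting point mentioned in the talk
let kaizens = 0;
while (cycleTime > 7.5) {
  cycleTime *= 0.9; // each experiment shaves 10% off the *current* cycle time
  kaizens++;
}
console.log(`${kaizens} Kaizens -> ${cycleTime.toFixed(1)} days`); // prints: 7 Kaizens -> 7.2 days
```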

Reward Hard Feedback

Change works best when people are honest with me about what works and what doesn't. Whenever I can, I will reward hard feedback: publicly thank the person for what they said, if they're open to it, and say what I'll do differently as a result of the feedback. The idea of the grace period was one of those. Someone came to me and said that the whole team was really freaked out about the leveling process, and that having a grace period would help them know that they wouldn't suddenly be leveled overnight. I thanked that person for it at the team meeting.

Admit When I'm Wrong

I'm going to make mistakes. I try a lot of things. It doesn't always work. As a leader, people look to you to see how you cope when something goes wrong. As much as possible, I'll get up there in front of the whole team and say, "I messed up this thing. Here's what I learned from it." An example from Bauer: I removed one of the key quality engineering processes from the release, because I was so focused on reducing cycle time. The result of that was that a whole bunch of new bugs went into production. I got in front of the whole team: "We're not going to do that again. That was a mistake." In hindsight, obviously, I shouldn't have done that.

End Result

After all the changes at Bauer: we introduced the skills matrix. We leveled the team. We ended the individual assignment of stories, and individual velocity. We introduced pairing. We ended the idea of people having ownership over a specific part of the code. Cycle time went from 15 days to less than 1. Bauer actually met their financial goals. The office didn't close. I was happy.

Learnings

One of the big things that I learned at Bauer is that you can't have autonomy without mastery. If you have a bunch of folks on the team, who really don't have the skills and expertise to do their job, and you just let them run, it's not going to end well. It doesn't take very many people to derail a team, especially if those people are key influencers or senior people like tech leads or architects. I've learned not to take those titles for granted. When I come into a new job, I will actually evaluate folks to see, do they actually have the skills needed to do their job? I'll take quick action if I need to.

Stride Consulting

I was happy at Bauer. I had done a lot. Then a friend of mine came to me and said that there was this interesting new consultancy that was starting, called Stride, which is all about debugging teams. You may realize that debugging teams is one of the things I love to do the most. This was a chance for me to go run teams of consultants who got to do this. They're like Pivotal, except that instead of the client going to Pivotal, Striders went to the company. They embed on the team, pair with the company's engineers, and make them better. Or they're like ThoughtWorks, but without the travel. I joined Stride as VP Engineering. Mostly, I was running teams of consultants, as opposed to personally going to an engagement. This next story I'm going to tell is actually a disaster story, a story where I had to go to an engagement because it was not going well.

Acme Corp

We had a client, I'm going to call them Acme Corp, who were threatening to fire Stride. The project was late. We had said that we were going to release something in three months. It was now six months later, and it was nowhere near being released. The client didn't just say that we were late. They also said that the engineers we had sent didn't know how to write code, that we had oversold the skill set of our team, which is the last thing you want to hear if you're a consultancy. I knew the engineers. I knew this wasn't true. Because this was an important client, I went in to debug the team, and to see what was happening.

Autonomy

Similar to Bauer, I went in and I attended all the ceremonies. I paired with the engineers. The first thing that I found was there was a real big problem with autonomy. The project was late. The client engineers told us that they had already decided the implementation of all of the stories. They had just mapped it all out for us. My team just needed to do what we had been told. There wasn't time to rethink anything. My folks were not at all happy about that.

Mastery

This is the code. It's a big pile of spaghetti. No one understood it. There were no tests. The client engineers said there was no time to write tests. Honestly, with that big pile of spaghetti, it would have been really hard to write any tests for it. The servers were pets, not cattle. That is, every server had slightly different versions of things running on it. Deployment was manual. We had to manually copy over the files we wanted to deploy. The result of all that was that deploying was really hairy. Every time we deployed to production, there was a slew of new, interesting bugs that came out. Our stakeholder, the one who had brought us into the project, was furious. He said that all the bugs meant that my engineers did not know how to code.

Purpose - Build It and They Will Come

We had a stakeholder. He had a set of requirements for us. Every time we met him, the requirements changed a little bit. He was also really busy. We had demo meetings, but he didn't come. We did the best that we could. We built the thing that we thought he wanted. You can see where this is going. When he finally came to a demo and looked at our product, he was furious, and said we were idiots. We had no idea what we were doing. We had built the wrong thing. At this point, my team is pretty much flipping tables. Not only is the client threatening to fire us, but my Striders are also threatening to quit. They feel like they're being blamed here. They're being blamed for the spaghetti code with the complete lack of tests. You might be wondering why they're blaming my engineers rather than the client engineers. Before we got there, there had never been a release, so the client couldn't compare the code before us with the code after us, because they'd never actually seen the thing work. Nevertheless, we got all the blame for it.

Find the User - Find the Purpose

I knew that we had to do something to turn this around. The engagement manager and I went to our stakeholder's boss. We told him that unless we got access to the users we were actually building this product for, they might as well fire us, because there was no way we were actually going to build anything that was useful for them or for the company. Luckily, our stakeholder's boss agreed. He put us in touch with the users. We were building an internal tool for users inside the company. Those internal users were thrilled. No one had ever come to talk to them before. Now here we were, turning up. We shadowed them. We actually saw what they were doing day by day, and understood their pain points. They came to our demos. They gave us feedback. The requirements stopped changing, because we literally had users in our demos now, making sure what we were building was the right thing for them. When our stakeholder finally came to a demo, I was a bit worried he was going to get angry with us. Luckily, he was actually really thrilled to see the users there, because he was so super busy that he realized he didn't have to spend time on us. The users were there giving us the feedback that we needed.

Mastery: End-to-End Tests

We had to do something about all the bugs happening in production. It was genuinely too hard to add unit tests, given the spaghetti code in the system. We introduced some really basic Selenium tests that could do full end-to-end checks, which gave us confidence the main user journeys through the site were working. We ran them manually; there was no way to deploy automatically at this point. The client engineers initially thought we were wasting our time, like, how could tests possibly help? After a few releases that were a lot less hairy than they had been, where at least the main things worked, they started to trust us. They asked us if they could pair with us and see how to write those end-to-end tests. We were really thrilled. Then they asked us, "We're thinking about doing this thing for the next set of stories. Maybe we could start to talk through with you how we might solve those problems." My team finally started to get some autonomy back.
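The talk doesn't show the tests themselves, so here is a minimal sketch of the kind of end-to-end smoke test Selenium makes possible, using the selenium-webdriver package for Node; the URL, selectors, credentials, and expected page title are all placeholders, not the client's real application:

```typescript
// Minimal end-to-end "main user journey" check with selenium-webdriver.
// All names here (URL, form fields, page title) are placeholders for illustration.
import { Builder, By, until, WebDriver } from "selenium-webdriver";

async function checkLoginJourney(): Promise<void> {
  const driver: WebDriver = await new Builder().forBrowser("chrome").build();
  try {
    await driver.get("https://app.example.com/login");
    await driver.findElement(By.name("username")).sendKeys("test-user");
    await driver.findElement(By.name("password")).sendKeys("test-password");
    await driver.findElement(By.css("button[type=submit]")).click();
    // The journey "passes" if the dashboard appears within 10 seconds.
    await driver.wait(until.titleContains("Dashboard"), 10_000);
    console.log("Login journey OK");
  } finally {
    await driver.quit();
  }
}

checkLoginJourney().catch((err) => {
  console.error("Login journey FAILED", err);
  process.exitCode = 1;
});
```

Run by hand before a release, a handful of checks like this over the main user journeys is enough to catch the "did we just break login?" class of bug that was making every deploy so hairy.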

End Result

The project was late. Nothing we could do about that. It was already late by the time that we got there. The next version wasn't. Not only did we not get fired, the client actually asked us to build the next version of the product for a different set of internal users. We said we wouldn't do that unless we could write tests first, from the beginning. They said, "Pair with us. Show us how to write tests." My Striders got burned out. At least half the team asked to leave that client at that point. They were just done. I got a fresh set of Striders in. They were super enthused. They paired with the engineers. The next set of work done for that client was all done test-first.

Learnings

You can't have autonomy without purpose. My Striders were actually so focused on not having any autonomy on how to solve the problems that they missed that the client engineers didn't know what problem they were solving. They really didn't know which users they were solving that problem for. I've learned not to take that for granted either. One of the first things I'll do when I go to a new team, is ask everyone on the team, what problem are you solving? Which users are you solving it for? Just to make sure that everyone on the team has the same understanding of what is it they're working on, and why.

Meetup

I loved my time at Stride. The thing that I loved the most was debugging teams. Mostly, at Stride, other than Acme Corp, which was fun, I was running teams of consultants who were fixing problems. I didn't get to be hands-on myself very much at the clients, and I missed it. When I met up with a friend of mine, she told me about her job at Meetup: "We're having big problems with pace of delivery." I thought, "I know this one. I can go in and help." I shifted company again. I went to join Meetup. Meetup brought me in to solve two things. One of them was to introduce a sense of urgency to the team. They told me that they felt like they really had to push engineers to release, and they just didn't seem to want to ship new software. There were also huge problems with quality. Every time Meetup did a release, a slew of new bugs went out to production. The teams didn't seem to care about quality. Bugs were live for months, sometimes years. Meetup celebrated bug birthdays. The users were not happy. The leadership team was not happy. They wondered why the teams didn't seem to care to fix those things.

Meetup Top 5

I joined Meetup, and started debugging the teams. The first question I asked the engineers there was, how do you decide when something is ready to release, when something is of good enough quality to release? They said, "Let me tell you about the top five." Product engineering was made up of nine feature teams. Every Monday, every team had to put forward at least one feature they could release by Friday. The leadership team chose five of those; they became the top five features. If those five teams did not release their feature by Friday, they were publicly shamed in front of the whole company. Meetup was effectively running a fire drill. Teams never got a chance to take a step back to look at tech debt, to look at bugs. Because if they didn't release that feature on Friday, irrespective of whether it worked or not, they were going to get shamed in front of the whole company.

They had a safety problem. The first thing you have to do when you have a safety problem is to remove the safety problem. We ended the top five. We shifted instead to objectives and key results, which the teams controlled. Teams decided what they were going to launch and when. Because I'm a big believer that you get what you measure, one of those OKRs was all about quality, that is, the number of bugs in production. When I got there, we had over 100 critical, major, and important bugs. Below that level, there were a lot more.

Quality

It turns out that teams cared hugely about quality and the number of bugs. It wasn't that they weren't sad about how many bugs were in production; they'd just never had a chance to fix them before. Now we set Kaizens, small experiments. I challenged every team to think of different ways to get that bug count down, and they came up with a whole bunch of smart ideas. One of them was introducing Flow types into our React code, so we had some type safety. That solved a whole bunch of really interesting type bugs.
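The codebase in the talk used Flow, but the idea is easy to sketch; Flow's annotations for a simple props contract look very close to TypeScript, so here it is in TypeScript with made-up component and prop names:

```tsx
// Sketch of the kind of type safety the talk describes adding to React.
// The component and props are invented for illustration; the talk used Flow,
// whose annotations for a case like this look very similar.
import * as React from "react";

interface MemberBadgeProps {
  memberCount: number; // a number, not a string lifted straight off an API response
  groupName: string;
}

function MemberBadge({ memberCount, groupName }: MemberBadgeProps) {
  return (
    <span>
      {groupName}: {memberCount.toLocaleString()} members
    </span>
  );
}

// The type checker now rejects the classic bug this catches at build time:
// <MemberBadge memberCount={"42"} groupName="NYC Tech" />  // error: string is not assignable to number

export default MemberBadge;
```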

Autonomy Part 1: Operational Ownership

We still had a lot of operational incidents and fires that were actually pretty awful for our users. I went around Meetup and I asked, "How does on-call work at Meetup? Who's on call when things go wrong?" They told me that there was a really small set of engineers who were on call for the code written by everybody else. Think five people on call for the code written by 70. That was done to protect the feature teams. In the world of the top five, if you have to release something every Friday and something goes wrong in production, you don't have time to stop and fix that thing. There was a separate team that was going to fix the problems in production. Makes sense. The trouble is that it breaks the virtuous alerting cycle. This is my terrible diagram of the virtuous alerting cycle. Something goes wrong in production, a team is alerted. The team fixes the problem. Hopefully, the team then takes a step back, looks at the quality and stability of the system, as well as the alerts themselves, and makes some improvements. The team gets alerted less. Everyone wins.

We broke that at Meetup, because the team that was woken up at 2:00 in the morning was a different team from the one writing the code and shipping those new features. Team two didn't really have any incentive to improve the quality or stability of the system, because it was team one that was going to get woken up at 2:00 in the morning when something went wrong. You might wonder, why doesn't team one just improve the systems? Team one was also on call for the code of eight other teams. They didn't really have the time to improve any of the systems, because they were just continually fighting fires and trying to stay above water.

PagerDuty

The fix for this was to introduce PagerDuty. I put every engineer at Meetup on call, including me. I was on a rotation as well. Every team was on call for the code and services that they owned and operated. I went to an engineering all-hands and said, "You're all on call now." They were not too thrilled. They were worried. It wasn't that they didn't want to be on call. They were actually afraid that something would go wrong at 2:00 in the morning, and they wouldn't know how to fix it.

On-Call Training

The first thing that we did was introduce on-call training. We did incident response training, with the idea that the first responder does not have to solve the issue. They have to triage. They have to see, is this really an actual thing? Then they alert the right person to fix it. I also got those 70 engineers who had not been on call before to pair with the 5 who had, on real incidents, a few times. They got to experience what actually happened before they were on call themselves for the first time.

On-Call Results

Finally, quality was getting more under control at Meetup. The number of critical, major, and important bugs in production went down from 100 to 10. The number of incidents in production went down from multiple a day; sometimes a whole month could go by with nothing going wrong. It was great.

Cycle Time

Now we could shift to looking at cycle time. Similar to quality, I set Kaizens with the teams. I said, we want to reduce cycle time. Cycle time when I got to Meetup was really high. It was about 30 days. This was in an environment of continuous delivery and continuous deployment. Whenever an engineer checked their code in to master, it went to production automatically after going through a set of automated tests. It was still taking about 30 days on average to ship one story. We set Kaizens. The teams came up with a whole bunch of interesting ideas for how to improve it, including changing the way that PRs worked. Initially, when I got there, engineers were told not to interrupt their own stories to look at someone else's PR; they should wait until the end. PRs were waiting. Some went two, three days before being looked at. We said instead, people should interrupt their work. It's much more important to get that one story all the way through to production. That definitely helped.

Autonomy Part 2: No Subtasks

The cycle time was still really high. I paired with engineers, and I found subtasks. Stories at Meetup were really big. What they called a story, I would call an epic. To ensure that engineers wrote code that adhered to good architectural standards, the way that it worked was that senior engineers would take one of these huge stories, decide on the implementation, break the story into subtasks, and assign the subtasks out to different engineers. There were a couple of problems with that. Hands up if you love being given an implementation rather than a problem? No one likes that. It's very demotivating. Plus, everyone was given a small piece of the puzzle. Because they weren't given the whole problem, they couldn't take a step back and think, "Maybe there's a better way to solve this problem than the implementation the senior came up with." They were not tasked with solving a problem, just with implementing one small portion of it. The fix for this was to end subtasks. We said that the engineer who picked up a story should decide on the implementation. Sure, they should run that implementation past a senior person, a senior engineer or architect, but it was up to them to decide how to solve it. We ended subtasks completely.

End Result

After two years, the number of critical, major, and important bugs in production went down from over 100 to less than 10. Cycle time went down from 30 days to 5. The work to improve Meetup continues. I'm not there anymore, sadly.

Learnings

My main learning from Meetup was sometimes you need outside help. I talked about ending the top five like it was easy. That was a hard fight. Before I got there, people on my team told me they had raised concerns about the top five for months, but nobody was listening to them. I was like, "Don't worry about it. I'm a VP. They'll listen to me." No, not in any way. Sometimes, if you find that you're really not being listened to, it can help to bring someone really expensive in to say the same thing as you. I did one better. I brought in two really expensive people to say the same thing as me. I got together with my boss, and we brought in two fantastic consultants, Laura and Deepa. One of the first things they said to us was, have you thought about ending the top five? I was like, "We have thought about that." Finally, with that outside pushing, we finally managed to get it done.

In hindsight, I really wished I had brought them in more quickly. I was hesitant to bring in consultants because I thought I could see so clearly what had to change. I didn't need them to tell me what to change. I really needed some outside help to get the change done. It actually doesn't matter if you can see what needs to be changed really clearly. If you can't get the change done, it really does not help your team.

Conclusion

The next time someone comes to you and says, how do we instill a sense of urgency into the team? Here are a few things to bear in mind. I'm a big believer that you get what you measure. Look hard at what your incentives and your penalties are. I like to ask myself, what is the worst way someone could interpret this? Think of the top five. If you're pushing your teams to release quickly, but with no balancing metric, like the number of bugs in production, that may not go the way that you want. A lack of safety is hugely motivational, but usually not in the way that you want. What happens when something goes wrong? Do you blame teams? Do you blame engineers? Do people have the skills they need to be successful? Is the path to promotion clear? Do you have a skills matrix? Do you know what it means to be a senior engineer? How much control do teams have over what they do? Do teams have autonomy in solving their own problems? Do teams have ownership over their code in production? Do teams know why they're working on things? Can you draw a straight line from what they're working on to the problems the company is trying to solve?

Questions and Answers

Participant 1: You mentioned the process of Nemawashi, quietly laying the foundation of support for a change. How do you balance the time spent on that versus time spent on executing things that are more directly visible, versus this hidden thing that only shows its benefit later? Especially as an individual contributor.

Van Gelder: For me, I would use Nemawashi if there is something really big that I want to do to a team, that I think is really important, and that I can see there's probably going to be some resistance to. At Bauer, it was crucial to have some performance review process and SMART goals so we could actually track performance going forward. They'd never had that before, so it was quite scary for that team. You wouldn't do it for everything. For me, it would be the things that are most important to get right that you suspect will be hard. If there's something that you're doing where it doesn't matter so much if it works or not, don't worry about that one.

Participant 2: My question is around when you have business leaders instilling a sense of urgency. How do you manage the expectation that if you are making the change in the practices and such, they will see the result? Because, often, what I find is that business leaders want to see those improvements fast, and sometimes some of the practices don't come into play until a bit later, like you say that sometimes it gets worse before it gets better. How have you sold that previously, if you've got any experience in that?

Van Gelder: It's true. Often, if you're doing things like shifting to pairing when the team hasn't paired before, in the short term velocity can go down rather than up. It can be quite scary to senior folks. One thing that I always try to do when I join a new company is have a quick win. That is, figure out something that the team wants and that management wants that you can introduce pretty quickly, to get everyone onside, before you do the long, hard things of changing a team. For Bauer, that was the safety thing, of just saying people didn't have to finish stories by the end of every Sprint. That was a really quick change. Management could see that engineers stopped hiding around the building. That was pretty much instant in getting trust with management. Then you do the harder things: the next thing is going to take a while, but bear with me, I promise something's going to happen. Velocity will get better later on. I think it's the same thing for both management and for engineers: find something that everyone wants so you can demonstrate success early on. Then you have the credibility to do the longer, harder things later on.

Participant 3: Just a quick question about skills matrix, because I understand you do plot your engineers on the skills matrix with the engineers. Do you keep that information public?

Van Gelder: The skills matrix, that one is not public. I think the Meetup one will be. There are a bunch of them that are public, like Rent the Runway is the classic example. If you Google Rent the Runway, that one's completely public. Some of the Meetup things are public, some are not. That one that I showed you isn't. I tend to take them with me when I leave the company. If I have a [inaudible 00:43:44], I would love to make them public because I think they're a great resource to share with everybody.

There's a resource. Someone has literally collated all of them. Someone's got pretty much every open skills matrix on the internet, and put it in one place. I think someone over here knows what it's called.

Participant 4: It's called progression.fyi.

Van Gelder: Progression.fyi. Go to that one. It has got a ton of helpful skills matrices. It's a good place to start for your team.

Participant 3: Do engineers know how other engineers in the team are plotted in that matrix with detail?

Van Gelder: Everyone knows everyone's title. They don't know anyone else's performance beyond that. If someone is not performing, that's not visible. That's a conversation between them and their manager. Everyone knows their title. I also like to do open salary bands, so everyone knows everyone's title and the title maps to a salary band. If you know someone is a senior engineer, you know they may be paid between this and this. That much is public to everybody.

 


Recorded at:

Sep 18, 2020

