Leading Technical Projects - and How to Get Them Done

Summary

Sarah Wells shares stories on how the Operations and Reliability team at the Financial Times built tools that are used by lots of their development teams: the challenges they faced, the things they tried and what worked for them.

Bio

Sarah Wells has been a developer for 15 years, leading delivery teams across consultancy, financial services and media. Over the last few years she has developed a deep interest in operability, observability and devops, and at the beginning of 2018 this led to her taking over responsibility for Operations and Reliability at the Financial Times.

About the conference

InfoQ Live is a virtual event designed for you, the modern software practitioner. Take part in facilitated sessions with world-class practitioners. Connect, see, and speak with like-minded people. Join us to accelerate your learning, be better informed, and drive innovation.

Transcript

Wells: I've been working at the Financial Times for nearly a decade now. There have been some really substantial changes in the way that we work: architecture, the technologies we use, the processes, the teams. There are really three key things. The first is the move to the cloud, meaning we could provision resources when we needed them. We didn't have to wait for a team to buy a server and configure it for us. The second was about adopting microservices, meaning that we could do lots of small, independent changes without waiting for that one big monthly release. The final thing was around moving to DevOps and empowered teams, where the team that writes the code is the team that decides to release it. They also get to make the architectural choices. They don't have to wait for some group to decide whether they can use a particular software.

Lessons Learned

The lesson from all of these is that you need to remove dependencies if you want to move faster. You need to remove the dependencies on other people. Both these excellent books are relevant here. "Accelerate" looks at what high performing technology organizations have in common. Generally, it comes down to being loosely coupled, i.e., having minimal dependencies on other teams. "Team Topologies" looks at what types of team exist in software development organizations, and how they can effectively collaborate with each other. Again, this is about finding ways that reduce dependencies and time spent waiting. You can't move fast if you have to wait on other teams or coordinate with them. Anything other than self-service just doesn't work.

This is great if you're the team building ft.com, for example. Reduce dependencies by moving to microservices, and with other teams providing you with APIs, and you've reduced your coupling. For the last decade, I've been working on the types of teams that provide those APIs, like the content API. It's not easy to do projects for those teams that don't have dependencies. Because if you are building and maintaining APIs, a project to build a new version is as much about the migration as the build. You aren't complete until you've turned off the old thing. Otherwise, you are paying a constant tax of time and money to run two things. Moving a whole bunch of clients to a new version of an API is a really hard thing to do.

Financial Times' Content Delivery Platform

When I worked on FT's content delivery platform, we wanted to decommission our old stack. This is just part of the diagram we created to try and track the things we needed to do, and the order that they needed to happen. Anything in orange is a dependency on another team. I don't know whether anyone else has had this experience. I've often been on projects where we thought we were 80% done, then we've spent just exactly the same amount of time to get the project finished. That's because that final piece of work is almost always where you need other people to do things.

A Migration Story

I've got some ideas about how to approach migration projects. I want to start off by talking about a migration project from a few years ago, that was being run by another team at the FT, and I was a stakeholder. This team were responsible for much of the software used every day by FT employees, and that included our CRM system. Unlike a lot of companies, the FT has ended up treating our CRM as a platform, building lots of functionality on top of it, even where the fit isn't great. This team had decided to migrate some chunks of functionality away from our CRM. It affected me because it included the systems that we use to track operational incidents, and changes made in production. Those are the things I'm responsible for at the Financial Times. It also had lots of ticketing systems used by other teams, for example, the service desk, our networks team. As a stakeholder, I was pretty enthusiastic about this. We would save money for the FT. I was hoping we'd get a much better experience for managing our operational incidents and changes.

Seven months in, we cancelled this project. We wrote off the investment of time and money, and we renewed our contract with the existing supplier. Why did we cancel the project? Mostly, it was a problem with timing. To make it work, we needed to get the migration done before the point where we had to renew our contract and we just didn't feel sure we could do that. While the service desk were keen to move over to a specialized IT services management tool, this was all happening at the same time as we moved office in London. That involved moving more than 1200 people to change to agile working, give everyone laptops, cell phones, Follow Me printing. It was an extremely complex project with a hard deadline. The team just didn't have time to spend thinking about something else. My own team, we had lots of ideas about how we wanted to change the way we looked at incidents, and change tracking. We hadn't invested any time in it. We were busy with other things. We weren't sure whether a lift and shift was going to be good for us, because we do hundreds of releases every day. Any system that expects most of these to be entered manually is not going to work. We weren't sure that the ticket management system that would work very well for the service desk would be the right thing for us.

The timing wasn't right. It turns out, we weren't going to achieve the cost savings we were looking for either. Because when we dug into it, many people at the FT were using incident queues in the system to manage work. Once you gave licenses to all those people, we weren't going to save that much money. The team that were doing the migration were very focused on the cost savings. They started saying, we could just have one license per team. You start getting to the point where you're not benefiting from the change. It's a problem. If you're doing a migration for cost savings, and it doesn't look like you're going to save costs. You're not improving the experience. You need to think about what you could be doing that would give more value.

Successful Project Management

Why am I telling you about this? It's because I think it's an example of successful project management. We stopped. We stopped the project, rather than pulling lots of people with higher priorities into trying to get it over the line. It was quite an agonizing process, because it's very tough on a team to abandon the work they've been doing. I think it was the right thing. My boss at the time, John Kundert, who was CTO of the FT, said a couple of really interesting things. He said, "First, I'm not sad or disappointed. I don't see this as failure at all. It's actually a remarkable thing to have the confidence to stop something." Killing stuff is massively difficult. His second point was about value, and how that wasn't being written off. We learnt a lot, and that will result in positive changes. We knew a lot more about our processes. We'd evaluated some potential replacement systems.

You might be a bit cynical, and thinking, "It sounds good, but it was a failure. You still didn't achieve it." I want to tell you what happened the following year. Because in the short term, we signed up for a one year contract with the original supplier. The team who had tried to do the migration moved on to other things. They didn't have any interest in coming back to this problem in the short term. I don't blame them. Although abandoning the project was the right thing, it's not good for morale, for something like that to happen.

Over the next year, the various stakeholder teams started to do work. They started to take advantage of that value that we'd got through the process. The service desk bought the new service desk management tool that had been evaluated, but only once they'd completed the office move and had time to devote to making it work for them. The SaaS solution that made the service desk really happy was not the right thing for incident management and change tracking for my team. We went for different options. Firstly, we introduced the Slack bot for managing incidents. We actually just forked the Monzo response bot, which is open source. We've enhanced and improved it since, but we got significant value just from an MVP. It turned out that tracking of incidents was much less interesting to us than how we managed them when they were happening.

The second thing we did was we rewrote our change API. We had one already, but we rewrote it to be a lot more reliable. All the changes that came in were based on a queue. A consumer on that queue wrote to the existing change database, but it was very easy to start writing the same changes somewhere else. Then, to build things off that new store and stop writing to the old one.
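The queue-based cutover described above can be sketched roughly as follows. This is a minimal illustration rather than the FT's actual change API: the `ChangeStore` class, the `drain` function, the store names, and the in-memory queue are all hypothetical stand-ins. It shows one consumer writing each change to both stores during the overlap, then to the new store only.

```python
from dataclasses import dataclass, field
from queue import Empty, Queue


@dataclass
class ChangeStore:
    """Hypothetical in-memory stand-in for a change database."""
    name: str
    records: list = field(default_factory=list)

    def write(self, change: dict) -> None:
        self.records.append(change)


def drain(queue: Queue, stores: list) -> int:
    """Consume every queued change and write it to each active store."""
    handled = 0
    while True:
        try:
            change = queue.get_nowait()
        except Empty:
            return handled
        for store in stores:
            store.write(change)
        handled += 1


old_store = ChangeStore("old-change-db")
new_store = ChangeStore("new-change-store")
changes = Queue()

# Phase 1: dual-write, so the new store fills up while the old one still works.
changes.put({"system": "content-api", "summary": "release 1"})
drain(changes, [old_store, new_store])

# Phase 2: once everything reads from the new store, stop writing to the old one.
changes.put({"system": "content-api", "summary": "release 2"})
drain(changes, [new_store])

print(len(old_store.records), len(new_store.records))
```

Because producers only ever talk to the queue, switching the list of stores is the only change needed to move between phases, and no calling team has to do anything.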

Goals

The key thing was that Rhys, one of the principal engineers on my team, decided that his own personal goal was to move all of our stuff away from the current system, with a few months to go before renewal. Because he understood the timelines, the cost savings, the benefits, and had been part of that original process, he realized that we could move everyone away, and we would be able to not renew for a second year. We did. We moved everyone off the original system with weeks to spare before the renewal deadline. I think it's interesting, because the end result is spread over multiple solutions. It was done by the people who were going to benefit from these changes, which meant they could much more easily decide what the appropriate functionality to focus on was. What could they cut? What was essential? It was a year late but we got a much more fitting set of tools and processes. We saved more money than we would have done. It was a very successful process in the end.

Succeeding With a Migration

What makes it more likely that you will succeed with a migration? Firstly, you can make sure you're completely clear on what you're doing and why, and the impact and cost of any delays on it. Then you can communicate in every way possible until you're sick of the sound of your voice. Finally, you can put yourself in your customers' shoes. Here, you can learn from behavioral economics, also known as nudge theory, to increase your chances of persuading people to do what you need them to do.

Clarity

I want to start with clarity. The first thing to understand is, why are you doing this? Why are you doing this work? Why are you doing it now? For the migration project that I spoke about, they were doing the work to save us money, and they were doing it now because the contract was coming up for renewal. You need to understand what the finish line is. When will the project be complete? How will you know? Will you stop because you completed a set of tasks, got a certain amount of scope done, because you get to a certain date, or because you spent all the money allocated to the project, and the people working on it have to move to something else? It's the classic constraints on a project: schedule, cost, scope. You should understand upfront which of these is the most important for this project. In particular, if there's one that really cannot change.

An example from one of my own teams was we started using containers very early at the FT, and we built our own cluster orchestration using Fleet and systemd, among other things. Once Kubernetes was production ready, we wanted to move to it. We started work on that migration. At that point, it was a cost decision. It was about reducing the amount of time we spent running our cluster. Just after we started work, Fleet announced that they were going into end-of-life. Now the key constraint for us was the schedule, we had a deadline. We knew that everything else was second to that deadline. We made a lot of decisions on scope that cut things out. We were prepared to do things in a manual way for a while and come back to it after we'd done the migration.

You also need to understand who needs to do work outside of your own team, what that work is and how complicated it is. For those teams that you need to do work, what context are they working in? Will they prioritize this work? Can they, even if they want to? If they're doing something really critical or time sensitive, something that would win out over your work in priority, you should know early, because it will save you from pain. Then you need to understand, what are the consequences of failing to hit the finishing line? If you have to cut scope, or run past the deadline, or take more time from people so they can't work on the next thing they were meant to do, what is the cost to you and what is the risk? That migration to Kubernetes, the problem we had was once something's past end-of-life, you have no guarantees on security. You don't know what's going to happen if there's a vulnerability. We felt this was a deadline where there was a huge amount of risk if we didn't hit it. That clarity really helps you. Sometimes that might mean that you realize it's not the right time to do the work, or there's some other way to approach it. Better to know that early.

Communication

The next important aspect is communication. Mostly, teams don't do enough communication because they assume that everyone engages with whatever communication they send. The following tweet is 100% true. "One thing that happens in groups of 50-plus is messages never really get to everyone unless you make an extraordinary effort. If you say you can request a free Popsicle at any time, someone still doesn't know." That is absolutely true. You need to make that extraordinary effort. You have to talk about this thing until you really can't believe there's anyone who doesn't know about it, and you're still wrong. In March, just before the UK lockdown for Coronavirus, we had a suspected case on my floor at work. We needed to ask everyone to work from home the next day. It happened just at the end of the day. We spent the evening getting in touch with people to ask them to work from home the next day. We emailed people. We sent them Slack messages. Where we could, we called people. We still had a couple of people turn up at the office the next day because they hadn't seen any of that messaging. That's a pretty simple message. As soon as there is any complexity, so you can request a free Popsicle on Mondays and Wednesdays, you will find that people totally have different views on what that means. You need to be very clear about what exactly people need to do, when they need to do it by, and what will happen if they don't do it by then. You need to understand, is it, we'll just ask you to do it the week afterwards? Or is it, we will be really exposed to a risk.

For many years, we have been working to move out of our data centers so that we can decommission them, and the tagline is Cloud only 2020. It's almost certainly the thing from our tech strategy that most developers would be able to quote. It's simple. You know what you need to do and you know when. You have to communicate in every way possible. Use every channel you can: email, Slack, posters. One thing that works very well for us is monthly newsletters. We know this reaches people very effectively because people reply to them. They let us know that they liked what they saw. That newsletter is based around, here are the things you need to know about what we've been doing.

The best thing is to specifically talk to people. You want to be addressing people directly. We saw this with the source control migration. Someone new joined the team and sent emails direct to every delivery lead that had not yet done the migration from the old source control, listing every repository to migrate and being very specific about timelines. Suddenly, we started making much more progress. In terms of other channels, we have a procedure called the tech governance group, where we discuss things that have a wide impact for engineering at the FT. It allows people to share ideas, receive feedback on them, and says, how do we make the decision? It's a great way to make sure lots of people know about some work you have planned. I think most proposals that get discussed in the tech governance group, all the work is up front. People have been consulted. They've been told about it. Since we moved to working from home, we often find there are more than 50 people dialing in for these. There's usually a small group of people who are actively discussing the issue. All the rest are listening and they take that back to their groups. It's quite an effective way to share information.

Empathy

The third thing you have to have is empathy with your customers. You have to think about what would make them do whatever it is you need them to do. Here, you can borrow some ideas from behavioral economics, which was popularized by the book on the left, "Nudge," but I came across it when I read the book on the right, "Inside the Nudge Unit," which is about the UK government's Behavioral Insights Team, which is also known as The Nudge Unit. Nudging behavior is very popular with governments because it's about how to influence people without spending money, or without legislation. The nudge unit has a framework called EAST. Essentially, you need to make things easy, attractive, social, and timely. These are useful prompts, I find, for thinking about how to engage with teams that you need to do work for you.

In terms of making things easy, you want to remove dependencies on you. Make sure that people can work asynchronously without having to wait for you to do anything. What that means is you have to have detailed, accurate, and friendly documentation. You need to opt for self-service for things like setting up keys. You need to be very clear on where people come to if they don't get things working. Respond quickly to those types of requests. Then, update your documentation so the next person doesn't hit the same problem.

You also want to show people how they're doing. If they can visualize their progress, it's much easier for them to know what's going on. We are developing a tool called Single System View. One thing it surfaces is progress of migrations. Here, these are the teams that are part of my group. I can see that two of my teams have completed this migration and the third team is making great progress. By far, the best thing you can do to make things easy is do the work for people. If it matters more to you than it does to them, this is a really good way to approach the problem. My colleague, Nikita, worked closely with teams in getting them to adopt the change API, writing the CircleCI Orbs, and integrating it into teams' deployment pipelines for them. That's easy.
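A per-team progress view of this kind can be sketched in a few lines. This is only an illustration and not how Single System View works: the `migration_progress` and `render` functions, the team names, and the `(team, system, migrated)` row format are all assumptions made for the example.

```python
from collections import defaultdict


def migration_progress(systems):
    """Return {team: (migrated, total)} from (team, system, migrated) rows."""
    totals = defaultdict(lambda: [0, 0])
    for team, _system, migrated in systems:
        totals[team][1] += 1
        if migrated:
            totals[team][0] += 1
    return {team: tuple(counts) for team, counts in totals.items()}


def render(progress, width=10):
    """Build one text progress bar per team, sorted by team name."""
    lines = []
    for team in sorted(progress):
        done, total = progress[team]
        filled = width * done // total
        bar = "#" * filled + "." * (width - filled)
        lines.append(f"{team:<12} [{bar}] {done}/{total}")
    return lines


# Hypothetical inventory: which systems each team has moved so far.
inventory = [
    ("alpha", "search-api", True),
    ("alpha", "lists-api", True),
    ("beta", "ingest", True),
    ("beta", "publish", True),
    ("beta", "notify", True),
    ("beta", "archive", False),
]

progress = migration_progress(inventory)
for line in render(progress):
    print(line)
```

Deriving the view from an inventory of systems, rather than asking teams to report status, keeps the picture current without adding work for the teams doing the migration.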

The next thing is about making things attractive. You want people to find this something that they want to do. You want them to know about it. You want it to attract their attention. One thing you can do is explain why you're asking this work to be done. We're all grown-ups. It's not necessarily going to be, "It's amazing. You'll get a much better approach." Sometimes it is something like this, where we basically said, we all have to migrate our DNS because the vendor is turning something off. This is why we're doing it. Wherever possible, you should make the new thing obviously better for them. With the DNS, the new solution is all based around infrastructure as code. You can basically create a PR in a repository to make changes to DNS and get someone to approve it, and their UI is brilliant. The change API. Our new change API was much more resilient, so it wouldn't cause problems in people's pipelines. It had much better integration with our CI pipelines. Sounds attractive.

The third thing is about being social. We're a very social species. We like to see what other people are doing and match them. If you can show people how they compare to others, that can be very effective. Again, in our Single System View, we can see the progress for each technology group at the FT. Generally, people will respond to what others are doing. This works well if some groups are making great progress or if you can nudge people into getting competitive. If no one is doing any work, this can backfire because everyone sees that no one is focusing on this work. If you can encourage a public commitment, that makes people a lot more likely to do something. We use OKRs at the FT as championed by Andy Grove at Intel, and John Doerr at Google. An OKR comprises an objective, which is a goal, and three to five key results. They are measures used to track the achievement of that goal. You want them to be measurable. They're a commitment because they're shared publicly, and at the end of the quarter, you decide how you did. If you can get another team to have an OKR key result that relates to something that you want to achieve, you are much more likely to have it happen.

The fourth thing is being timely. You want to pick the right time. Don't ask people to schedule work until you've got everything ready for them to go. Don't ask them when they are right in the middle of some massive deadline. There is a gap between intentions and action. It's been shown that if you make a plan, you're more likely to bridge that gap and actually take the action. Help people make that plan, and OKRs help with that too. People aren't very good at assessing costs and benefits over longer periods. Pick out the short term costs and benefits and tell them about those. This is EAST. It's a helpful framework: easy, attractive, social, timely.

Conclusion

If you strive for all three of these things, clarity, communication, and empathy, you're more likely to be successful in landing your migration project, or in realizing earlier that this is not a project you should be trying to do. If you have clarity on why you're doing the work, where it sits in the organization's priorities, and empathy with the teams that need to do the work for you, you should find it easier and less traumatic to make the decision to not do this project if it comes to it. If you're a leader, you should be very clear that a decision to stop work on something is a sign of a mature organization, and celebrate that. Obviously, you also want to celebrate the projects that finish. Projects that involve dependencies and other teams can be very hard. Making a success of them is generally much more about communication and management than about an enjoyable technical problem. Lots of the projects we do sit in this area and if you get good at it, you will really benefit.

 

Recorded at:

Jan 29, 2021
