
Breaking up Lyft’s Development Monolith


Summary

Jake Kaufman discusses how Lyft uses request context propagation to enable developers to safely share their staging environment.

Bio

Jake Kaufman is a Staff Engineer and Technical Lead for the Developer Experience organization at Lyft. Most recently he helped transform Lyft's developer environment from a single-machine monolith to a system that allows developers to quickly test their microservice against a shared environment with simulated traffic. Previously he worked at Foursquare.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Kaufman: I'm Jake Kaufman from the developer experience organization at Lyft. I'm going to talk about how we moved away from an isolated monolithic development environment over the last few years, and dive into the core technology that allowed us to make that change safely. We'll also chat a bit about how we measure the success of the change. The environments were very different from each other and this made it difficult to compare apples to apples, but we still needed to understand our customer success.

Workflow

Let's rewind the clock a little bit and imagine that it's 2020. You're working on a feature at Lyft, and you'd like to test your change before you merge to master. What does that look like? First, you'd spin up what we called a onebox. This could take upwards of an hour or more if you're competing with other developers to grab one. You go work on a spec or something else, maybe get a snack, and finally you get a Slack message that it's done. Maybe you use it for a few hours, but then you find that your test request is breaking, and you waste more time trying to fix it. Is it the environment? Is it your change? Some other random service? You re-provision the environment a few times. Maybe things work, maybe things don't. Eventually you give up. You get a plus one for your PR from your coworker, you just go test things in staging, and you feel frustrated that your developer environment has let you down.

Example Onebox

What is this onebox? Each onebox at the time was an r3.4xl EC2 instance, a very large machine. Each instance ran Docker, and all of the services lived in Docker images. These oneboxes would run an entire copy of Lyft and shared almost no code with the production stack. Dev infra maintained a very large fleet of these oneboxes for developer use. Originally, the system was designed to handle our monolithic PHP application and 10 to 20 microservices. At the time, it was a huge step forward. Developers previously would have to manually set up services on their laptop or even provision a bare EC2 instance. This was very time-consuming, and having a system for spinning this up automatically was a huge time saver. By 2022, we had grown beyond 125 services. What was once a quick way to run 10 or 15 services was now a horrible ball of pain, where one service out of the huge mass of them would fail and you couldn't understand or debug why. Things had reached their breaking point. Developers were frustrated with having to maintain these environments separately from each other, and dev infra was struggling to support them.

New Dev Environment

What did the new environment look like? The main thing above all else was the ability to focus on and test a single service. No more needing to worry about every service working and wondering whether the environment is broken or not. Developers would start by running and testing a single service locally on their laptop with familiar and comfortable tools. As an example, Go developers would be automatically set up to just use the standard Go tooling and any IDE they'd like. If they needed to, they'd also be able to deploy a change from their branch into staging and test it in an integrated environment, driving requests from either their local laptop or their deployed change in staging. They'd be able to test requests against upstreams and then confirm that their business logic was working as they expected. This sounds great, but when I say staging, I think most people are probably thinking: a mess. How does Lyft maintain a staging environment that works for this use case?

Shared Staging

The most important thing is that, at a baseline, teams were expected to treat their services running in staging similarly to production. By default, services were given SLOs, and teams were required to set alarms and be responsible for responding to and fixing them in staging. This helped ensure that staging was healthy for everyone and could scale as Lyft grows. Alongside those expectations for maintaining services, Lyft infra also maintains a platform for sending simulated traffic to our staging and production environments. We run rides or send requests to the API in order to simulate somewhat realistic scenarios. These scenarios are developed and maintained by developers across Lyft to help them validate their business logic. This helps us smoke out bad changes and alert developers early. With these two things together, staging is relatively stable and helps catch issues before they affect real customers. This gives us a baseline working environment, but you can imagine some other potential sources of conflict.

Let's imagine Green. Green is testing locally on a branch called green-fix, and they're working on a service that their team owns called API. API makes calls to rides and users in this example. Green is happily iterating locally with an IDE on their laptop. The API service is automatically restarted to pick up changes as Green works. Green is happy because they no longer need to think about an entire onebox full of services; they're just focused on their service, and they can make calls to an integrated environment that everyone is working together to make stable. However, Magenta would also like to test a change. Magenta's team owns rides, and she is adding a new feature to rides and wants to deploy it from a branch called magenta-test, but there's a problem. Magenta-test has introduced a small bug that causes rides to sometimes send 500s to the API service. This means that Green has to waste a bunch of time figuring out that rides is broken. Their test request was working 5 seconds ago, and now it doesn't work. What is going on? They're frustrated. Green wants Magenta to roll back her changes, but Magenta needs to debug and figure out what's going wrong with rides. How can they both work at the same time?

You can see how staging is going to quickly look like the darkest timeline if we don't do something to stop developers from thrashing each other when they try to test. How? We need some way to allow developers to isolate requests to their new code, but share the rest of the test environment. Each developer should be able to opt in their requests that they're sending to their new code. Then other developers should be able to just ignore those and not be broken while they're trying to debug something else.

Staging Overrides Workflow

Neither Magenta nor Green is wrong here, we just need to build a way for them to work together at the same time. How do we do that? Something called staging overrides. Let's take a deep dive into how that works. At a high level, here's the workflow that developers run. A developer deploys their service from a branch, as Magenta was doing from her branch, magenta-test. Then we add special metadata to opt those test requests in to routing to the code deployed from the branch. Then we use that data to override routing decisions in the microservice mesh. How does that all work? Let's walk through some diagrams to explain how developers make use of this.

Let's start with the simplest example, just testing with staging as it is. Here we see Red. Red is a mobile developer and just needs to make sure that a new screen works well with regular staging data. They're not working with a backend engineer, and they don't need any changes to backend services. The gray circles that Red is sending requests to represent the normal fleet of a given service that runs in our staging environment; they're deployed from a branch called main by default. Red is happily testing and able to figure out that their screen is working, with no problems, because they're connected to a bunch of services that are closely guarded by tight SLOs and good alarms.

Now we have a developer, Blue. They work on the team that owns users, and they've deployed a copy of users from their branch called blue-test. Red is able to continue to test with normal staging without hitting any bugs that might lurk in Blue's new test code. Blue is able to iterate on their change to users without worrying about brokenness from any other service or about affecting other developers, because they can opt their requests in to be sent to their specific copy of users. This is great. Now Blue and Red can work together without affecting each other and without needing to worry.

We can also do even more interesting things. Let's see, Green here is a developer from the team that owns the API. In reviewing the spec for the change that Blue is working on, Green realized that he needs to coordinate testing with Blue. Green overrides both services at the same time, and is able to ensure that a change in business logic is safe between the two services. This allows developers at Lyft to test complex business logic changes that span multiple services without ever worrying about potentially thrashing each other.

Key Tech

That all seems really great, but how does this work? To make this happen, we're going to need a few key pieces. First, as we were discussing, we needed some way to attach override data to requests to allow those requests to be opted in to specific deploys done from branches. We chose to add that data to a header on HTTP requests because we knew that we could propagate these headers throughout our entire microservice architecture. Second, we needed to modify our service mesh to look for and then apply that override data when making routing decisions. Third, we needed some mechanism in Envoy to route those requests to an arbitrary host.

Override Data

Let's go through the various pieces of this and see how it works. First, we need to figure out how to pack that override data. As we just discussed, we knew that the best way to attach our override data was to add it to a header. However, that header would need to be correctly propagated by all of our services, by Envoy, and at any other service boundary. We had a small list of headers that Lyft had already configured to be propagated, but the only potentially useful one was our tracing header. The rest were simple things like the request ID, which wouldn't really be useful to pack data into. Thankfully, when we reviewed the OpenTracing spec, we found that the spec provides for a blob of key-value pairs that it calls baggage to be propagated alongside the SpanContext. By encoding a small set of overrides, and then adding them as a key in the baggage, we could get to testing very quickly.

Let's take a look at what that baggage looks like. Here is an example of some of the data that we pack into the baggage. First, as you can see in this JSON, we store a list of Envoy overrides. Each one is a combination of an upstream cluster name to override and the IP address to point it to. We also store config overrides in this overrides list. This functionality allows developers to flip configuration flags on and off for a request without needing to turn them on globally. You can imagine how this makes it possible for developers to test new features and other things without breaking other developers or messing up staging. Last, of course, we store some metadata. That's how we store data on the path of a request, but now we need to get into the details of how we use that data to make routing decisions.
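To make that concrete, here is a rough illustration of what such a baggage entry could look like. The field names are illustrative rather than Lyft's actual schema, but the shape follows the description above: a list of overrides (Envoy routing overrides and config overrides) plus some metadata.

```json
{
  "overrides": [
    {
      "type": "envoy",
      "cluster": "users",
      "host": "10.0.0.42:8080"
    },
    {
      "type": "config",
      "name": "new_checkout_flow",
      "value": "true"
    }
  ],
  "metadata": {
    "user": "blue",
    "branch": "blue-test"
  }
}
```

A blob like this would be serialized and carried as a single baggage key alongside the SpanContext, so every hop that propagates the trace also propagates the overrides.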

Envoy

First, let's talk a little bit about Envoy. What is Envoy? Envoy is a configurable proxy that powers the service mesh at Lyft and many other companies. Services at Lyft generally talk to each other via Envoy. Let's take a look at how that might work. Here's an example of how service-to-service communication works at Lyft. All services at Lyft talk to each other via a local Envoy. This allows us to enforce all kinds of useful things like authentication, SSL, or RBAC rules. This also gave us the touchpoint required to modify routing decisions between services. Here, for example, you can see how service 1 doesn't talk to service 2 directly. Service 1 sends its requests to Envoy, something we call the sidecar Envoy. That Envoy then talks to service 2's Envoy, which finally forwards the request to service 2. As you can see, the services only talk to each other through Envoy.

Envoy Terminology

We need to roughly define some terms that are going to come up a lot so that you can understand what they mean. A cluster represents somewhere to send a request. An example might be a list of IP addresses, or a host name, things of that nature. A route is configuration that helps Envoy map requests to those clusters. This could be simple host rules that map foo.com to a cluster that represents a list of IP addresses, or even more complex things like a path rewrite. For example, our API might map GET foo.com/v1/users to the users service over gRPC. These routes can perform fairly complex pieces of logic. Finally, a filter is just some code that's run as part of the routing decision that can modify the request. An example would be a compressor filter, which would gzip the request body and set the required Content-Encoding header, so the upstream knows that the content has been compressed. This allows you to transparently zip large objects without needing either service to know that this is happening.
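To ground those terms, here is a small, hypothetical Envoy configuration fragment, not Lyft's actual config, showing a cluster and a route that maps a path prefix onto it with a rewrite. In a full bootstrap the route_config lives inside the HTTP connection manager filter; it is shown standalone here for brevity.

```yaml
# A cluster: somewhere to send requests, here a single static endpoint.
clusters:
- name: users
  type: STATIC
  connect_timeout: 1s
  load_assignment:
    cluster_name: users
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address: { address: 10.0.0.10, port_value: 8080 }

# A route: maps matching requests onto that cluster, optionally rewriting the path.
route_config:
  virtual_hosts:
  - name: api
    domains: ["*"]
    routes:
    - match: { prefix: "/v1/users" }
      route: { cluster: users, prefix_rewrite: "/users" }
```

A filter, by contrast, would be listed in the connection manager's http_filters chain and runs against each request before the router forwards it.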

Envoy at Lyft

Getting back to some of this routing stuff. How does this actually work with Blue's requests from earlier? To correctly route Blue's requests, we introduced a filter that looks at the SpanContext header on the trace, parses out any overrides, and then modifies the chosen cluster if any overrides match. In this example, Blue's test request carries an override for the users service: requests to users should instead be sent to Blue's test branch. When the rides Envoy makes that request, the filter sees that the override is set and, instead of sending it to the normal users service deployed from the main branch, sends it off to the blue-test deploy. This allows Blue to opt in to blue-test, but normal requests will still be sent to the normal users service.

Filter Logic

At a high level, the filter needs to get the override information from the request, modify the route when there is a match so that it points to a cluster that we control, and then get that cluster to send the request to our branch deploy. How does the actual filter work? Here we go. Roughly speaking, here's the actual code from our Envoy filter. It's a lot. We're going to break it down and go through the pieces individually; I just wanted to show you that it fits on one slide, it's pretty straightforward stuff. I've eliminated some of the verbose error handling to make it a little bit more digestible. Starting from the top, this is the decodeHeaders function for a filter, which allows us to act on headers and modify the route before the routing decision is finalized, which is exactly what we need to do. Let's go through the pieces of this code and talk through how it works.

First, we need to extract the override data so that we can decide whether or not we actually need to change the routing decision. We look at the active span and see if there are any overrides in the baggage. If there are, we parse them out into a request context object, and that gives us the data that we need to make the routing decision. Then we iterate over any potential overrides, looking to see if the current upstream cluster, decoder_callbacks_->clusterInfo()->name(), matches one of the Envoy overrides that's set. In this case, imagine we're about to send a request to the users service, and we need to override it to send it to Blue's branch instead.

If the current upstream cluster does match, then we need to get Envoy to choose a route that points at our special cluster instead of the default users cluster. We originally did this by modifying some of the request data to make Envoy choose a static route we had created. However, this doesn't work quite right. For example, our API does a lot of path rewriting in its route objects. The static route we created didn't have those rewrites, and thus a request wouldn't be sent to the right path on the upstream service, and we would get a 500. What we really needed was a way to modify the actual route object chosen by Envoy, but only change the cluster that the route points to. At the time, the Envoy filter API did not provide a way to do this, so we needed to add the setRoute method. This allows us to create a route object that overrides only the cluster directly, rather than creating a bunch of duplicate Envoy configuration that would have been needed to make this work without it.
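Since the slide itself isn't reproduced here, the following is a simplified C++ sketch of the shape of that decodeHeaders logic. It is an approximation: RequestContext, parseOverrides, the member names, and the baggage key are placeholders, error handling is elided, and the exact Envoy signatures may differ from what is shown.

```cpp
// Approximate sketch of the staging-overrides filter's decodeHeaders().
// RequestContext/parseOverrides are placeholders; null checks and error
// handling are elided for readability.
Http::FilterHeadersStatus StagingOverridesFilter::decodeHeaders(
    Http::RequestHeaderMap& headers, bool /*end_stream*/) {
  // 1. Pull the override blob out of the active span's baggage, if present.
  const std::string baggage =
      decoder_callbacks_->activeSpan().getBaggage("request-context");
  if (baggage.empty()) {
    return Http::FilterHeadersStatus::Continue;
  }
  const RequestContext context = parseOverrides(baggage);

  // 2. Check whether the cluster this request is about to hit is overridden.
  const std::string upstream = decoder_callbacks_->clusterInfo()->name();
  for (const auto& ov : context.envoy_overrides) {
    if (ov.cluster_name != upstream) {
      continue;
    }
    // 3. Wrap the route Envoy already chose so that only the cluster changes,
    //    pointing it at the statically configured original destination cluster.
    decoder_callbacks_->setRoute(std::make_shared<OverrideDelegatingRoute>(
        decoder_callbacks_->route(), "original_dst_cluster"));

    // 4. Tell the original destination cluster where to actually send it.
    headers.setCopy(Http::LowerCaseString("x-envoy-original-dst-host"),
                    ov.host_and_port);
    break;
  }
  return Http::FilterHeadersStatus::Continue;
}
```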

What is this route override that we're setting? Something about Envoy routes is that they are not really safe to mutate, and they're prohibitively expensive to copy in the path of a request. We needed a way to wrap the chosen route and override only the things that we wanted. The way we do this is with something called the delegating route class. An instance will delegate all calls to a route that is passed in as part of the constructor. Here, you can see that the first argument passed to the constructor is the decoder callbacks route; that is the route object that Envoy has chosen, and this class is going to wrap it. The second argument is the original destination cluster name, which is what we're going to override the cluster name to. In our filter, we needed to override the cluster name with that, so we subclassed the delegating route and created a series of classes that each override just the one thing and otherwise forward everything.

Let's talk about the override delegating route class. It is a small subclass that overrides only the route entry method. In this class, we construct a similar delegating route entry that wraps the original route entry and returns only the cluster name that we've already set. This is very similar code: we override the cluster name and always return our custom cluster name. Together, this results in a route object that always returns the value we want, the overridden cluster name, but otherwise is configured exactly the same as the originally chosen route.
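As a rough sketch, the two wrapper classes described above might look like this, built on Envoy's DelegatingRoute and DelegatingRouteEntry helpers. The class names, constructor shapes, and omitted includes are approximations, not the exact Lyft code, which lives inside Envoy.

```cpp
// Sketch: a route entry that delegates everything to the original entry
// except the cluster name, which always reports our override target.
class OverrideRouteEntry : public Router::DelegatingRouteEntry {
public:
  OverrideRouteEntry(Router::RouteConstSharedPtr base, std::string cluster_name)
      : DelegatingRouteEntry(std::move(base)), cluster_name_(std::move(cluster_name)) {}

  // The only overridden accessor: always return the substitute cluster.
  const std::string& clusterName() const override { return cluster_name_; }

private:
  const std::string cluster_name_;
};

// Sketch: a route that delegates everything to the route Envoy chose except
// routeEntry(), which hands back the wrapper above. Path rewrites, timeouts,
// retry policies, and so on all still come from the original route.
class OverrideDelegatingRoute : public Router::DelegatingRoute {
public:
  OverrideDelegatingRoute(Router::RouteConstSharedPtr base, std::string cluster_name)
      : DelegatingRoute(base),
        route_entry_(std::make_unique<OverrideRouteEntry>(base, std::move(cluster_name))) {}

  const Router::RouteEntry* routeEntry() const override { return route_entry_.get(); }

private:
  std::unique_ptr<OverrideRouteEntry> route_entry_;
};
```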

Original Destination Cluster

What is this original destination cluster? I keep mentioning it, and it seems magical. Let's get into it. The original destination cluster is, in theory, intended to send data back to the same address as the downstream connection, hence the name original destination. Think of it as a mirror that sends data back to where it came from. That by itself is not very useful to us. However, when the use_http_header config flag is set to true, instead of just sending data back to the downstream connection, Envoy will read an IP and port combo from a special header called x-envoy-original-dst-host and forward the traffic there instead. We add this cluster statically to any Envoy running in our test environment. Now let's return to our filter code. As you can see, we override the chosen route to always return the wrapped route pointing at our original destination cluster. Then we also set that special header to the IP address from our override data. With all of that, Envoy will now be able to route directly to our branch deploy when overrides are provided in the baggage of the trace.
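For reference, an original destination cluster with that header behavior enabled looks roughly like this in Envoy configuration. The cluster name is whatever the filter's wrapped route points at (illustrative here); the rest is standard Envoy config.

```yaml
# Added statically to every Envoy in the test environment (name is illustrative).
clusters:
- name: original_dst_cluster
  type: ORIGINAL_DST
  lb_policy: CLUSTER_PROVIDED
  connect_timeout: 5s
  original_dst_lb_config:
    # Instead of mirroring the downstream address, read the target from the
    # x-envoy-original-dst-host header that the filter sets.
    use_http_header: true
```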

Summary

Let's run down the logic again without the code so that we understand what's actually happening here. First, we extract overrides from the OpenTracing SpanContext header, the x-ot-span-context header. Then we wrap the route that Envoy would have normally chosen in our override delegating route class, which always returns our original destination cluster. Then we set the route to the wrapped route instead of the original route. Then we add the IP address data to the original destination host header. Finally, Envoy looks at the route and forwards the request to the IP address and port combo specified in the baggage.

Context Propagation

Now, we have one last thing to talk about here, and that's context propagation. Without ensuring context propagation, none of this works. We've put in a lot of work to ensure that spans are propagated not only at the network request level, but also through our queue and event systems. This allows developers to test workers or code that consumes events without needing to do a full deploy to staging. However, we didn't want to play Whack-a-Mole looking for services that didn't forward the SpanContext. To find them, we emit metrics in Envoy and at other propagation boundaries, like the aforementioned queue systems, when we detect that a request does not have the SpanContext header. Knowing the downstream and upstream caller at that point usually gives us enough information to narrow the problem down to a specific service owner. Then we go work with them to get propagation working, or understand why they're not propagating requests. Our eventual goal is to have this enforced at the network level: we would like to be able, in Envoy or other places, to just return a 500 or otherwise reject any request we see without a SpanContext. The process to get there has been slow because we need to be pretty conservative with these changes.
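As an illustration of that detection, the check itself can be as simple as the following hypothetical filter fragment; the class, stat, and member names are made up here, not Lyft's actual code.

```cpp
// Hypothetical sketch: count requests that arrive without the OpenTracing
// SpanContext header, so missing propagation can be traced back to a specific
// downstream/upstream pair via the surrounding stats context.
Http::FilterHeadersStatus PropagationCheckFilter::decodeHeaders(
    Http::RequestHeaderMap& headers, bool /*end_stream*/) {
  const Http::LowerCaseString span_context_header{"x-ot-span-context"};
  if (headers.get(span_context_header).empty()) {
    missing_span_context_counter_.inc(); // a Stats::Counter& held by the filter
  }
  return Http::FilterHeadersStatus::Continue;
}
```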

Measuring Failure

Let's move on to measuring the success of the pivot. The change we made to how developers work was massive, and directly comparing the onebox environment to our new shared environment was more or less impossible. We knew that the bar for the previous environment was pretty low, but we did need to make sure that our customers were finding value in the new tools that we were building. Let's first talk a little bit about some failures and some things that didn't really work. I first want to call out surveys. They continue to be a valuable tool for us, but I call them out here because they're very expensive. They require a lot of work for what we find to be generally low value. They're a lot of work to write; you need to balance asking for more information against exhausting customers. Customers will often complain of fatigue, and it's hard to draw long-term conclusions from what can sometimes be relatively small sample data. In past surveys, we've also tried to ask questions that would help us calculate an NPS score. We found that it just gives us very little information. Fundamentally, users have to use the tools that we build. People aren't recommending the tools we offer because they like them; they recommend them to their coworkers because they're the tools they have to use to get work done. It's not really even clear to me what NPS actually means in that case.

Similarly, we also tried asking customers to give us a CSAT score, and we found that it gave us similarly little signal. The number didn't really track with our more quantitative metrics. We would see adoption go up or down, but we would not see CSAT change in any useful way. Even during big launch windows, when we were getting good one-to-one feedback from customers, it didn't necessarily drive up our customer satisfaction scores. This meant it wasn't very useful, and it was a lot of work to keep track of. Plus, all of these had pretty small sample sizes: even with a fairly large org, we would get maybe tens of people responding to these surveys. With less than 10% of the whole organization responding, it's pretty difficult to derive any really conclusive data.

One last word of warning that I'll give you is: watch out for adoption metrics. This was initially our focus as we were rolling out tools; we figured, if people are starting to use the new tools, that's a good signal. We found that not every tool is one that users need to use constantly. For example, we've added tooling to make debugging simpler. Users of those tools use them sporadically, so they're never going to show up as daily active users. Trying to track that can trick you into thinking that things are either much better than they are or much worse than they are. Similarly, we have a portal internally that allows users to take actions on databases or against their services running in our production environment. We noticed that its stats were going down and people were not using it as much. What we realized was that we had actually just automated a lot of the workflows that customers had used it for before, which is strictly better, because now they don't need to use it at all. In general, adoption metrics can be useful, but they're hazardous, and you need to think hard about what you're actually trying to get customers to do. If you incentivize customers to adopt things that make them have a worse time, that's not improving your customers' lives.

Context, I think, matters a lot when measuring things. We launched all of our tooling to general availability late last summer, early last autumn, but we noticed by December that a lot of our user stats were starting to drop off, or our growth was starting to slow. We were super worried that folks had found a lot of value early on but had maybe gotten tired of our new tooling, or that they'd started to run into more issues and were getting frustrated. As it turns out, what we were really measuring was the productivity dip around performance evaluations. People were busy shipping last-minute things as fast as they could to finish out projects for the year, or they were caught up in planning, spec writing, and other non-code work that we don't track as part of the development experience. We see similar dips during planning every half. It's just something to watch out for when you're looking at these kinds of metrics, because non-obvious events can trick you into thinking that things are happening when they aren't. Those were the things that didn't work well for us.

Measuring Success

What did we find that helped us measure the success of our new environment? Early on in our move away from the old environment, we identified build times as a major problem. Developers needed to maintain an entirely separate build just for development. Tracking the decrease in overall build time gave us a way to measure developers committing to the new environment. When a developer was ready to move to the new environment, we told them that they could just remove the old builds, because that would make it difficult for them to go back. This meant that we knew that if developer build times were going down, more developers were moving away from the old environment and depending solely on the new one. It's not an exact measure of adoption or anything like that, but it did mean that developers at Lyft were feeling confident enough in the new environment to say, it's no longer worth having that old environment around, I'm done with it. Overall, we were able to reduce mean build times by 50%. This was a huge benefit to our customers and really helped speed up a lot of our work.

The other big bet that we made here was to move the inner loop to the laptop, with the hope that developers would lean on CI less to iterate. We saw that developers adopting the new environment ended up having fewer errors in a PR, and so their usage of CI went down. This gave us confidence that developers were able to get more done locally without needing to fight with CI. One of the things that we tracked was, on a given PR in a given repo, are there fewer test or lint errors making it to CI over time? After all, a lint error found locally in your IDE, or maybe even in a pre-commit hook, is going to be much cheaper to fix than needing to cycle between your local environment and CI. Closing that loop really helped us save a lot of time for our customers.

Past that, what have we been doing since then? Here I present some lessons that we learned in trying to better measure and understand our customers. The number one thing, and this is obvious, is to capture metrics at every touchpoint you can. We found that it is indeed worth investing significant time in capturing data from places that are difficult. We added ways to emit metrics and structured events from laptops. This was a huge collaboration between our teams and security to figure out ways to do this safely. We also had to get unconventional sometimes. For example, we had to write a tool to parse pip's output and generate data on the time it took to download code and build wheels, because we needed to understand better which wheels customers were using and when. In particular, we needed to understand when customers were pulling pip packages that didn't have wheels. It was not always easy to understand that from the data we could get out of our PyPI installs.

Additionally, one of the most important things we found is that enriching your metrics with things like a SHA, a branch, an architecture, or even team information can really help you understand the flow of changes through the development lifecycle. The key here is that tools across your infrastructure emit the same set of metadata. One of the mistakes that we made early on is that we asked teams around infra to send as many stats as they could, but we never actually agreed on what those stats should look like and what the metadata attached to them should be. Some teams added things like the SHA and information about the language or architecture, but some teams went a different direction and added team information, and were more focused on timings and things like that. This meant that it was not actually any easier for us to understand how changes were propagating throughout the development lifecycle. The work that it took to harmonize all of that data was quite a struggle, and it meant that we actually don't have as deep a history as we'd like, because a lot of the data from before we made these changes is not particularly useful.
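As a sketch of what "the same set of metadata" can mean in practice, every metric or structured event might carry a common envelope like the one below. The tag names and values are illustrative, not Lyft's actual schema.

```json
{
  "event": "build.finished",
  "duration_ms": 93250,
  "metadata": {
    "sha": "4f2a9c1",
    "branch": "blue-test",
    "service": "users",
    "team": "payments",
    "language": "python",
    "architecture": "arm64"
  }
}
```

With every tool emitting the same envelope, a single SHA or branch can be followed from the laptop build through CI and out to deploy.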

However, once we did do that, being able to look at a vertical or horizontal slice of the data has been particularly valuable for understanding outliers. For example, we can use this data to find problems in combinations of Python version, architecture, and particular wheels. We realized that for certain Python versions and architectures, gRPC wheels weren't always available. This didn't really affect the baseline or the mean number, but it was still a huge problem for those services' owners. Their builds were 10 to 15 times slower, because they needed to build the gRPC wheel every single time they did a build. This led to a lot of frustration for them, and we would never have been able to figure it out without the ability to pinpoint: this build is from Python, on this version, on this architecture, with this wheel. Being able to slice vertically has also been very useful for tracking changes making it through the pipeline, and how long they take from conception on the laptop all the way to deploy. Being able to find services that generally take a lot longer helps us figure out areas where we can invest to improve our auto deploys, improve our testing, things like that. In general, it's helped us really figure out what is a priority and what is not.

Mixing the Quantitative with Qualitative

One last topic that is a little difficult is: how do you explain the value of a change when it's difficult or even impossible to measure the time spent? One of our earliest versions of this, when we were making the change from our previous environment, was: how long does it take for a developer to sit down and onboard their service into the new development environment? When we started, it involved adding a new configuration file, maybe making a few changes, and testing to make sure that it worked. This would usually happen over a few PRs over the course of a few days. Realistically, we knew that developers weren't spending 14 or 15 hours on this; they were probably spending an hour each day. The key thing that we found for this kind of problem is to focus on the order of magnitude. Does this step take a minute, 30 minutes, an hour, a day, a week? It's less important to know the specifics when you can't measure them, and more important to find an order of magnitude that will help you convey what the actual value of making the change is going to be. Then, quantify the places that you can. Find simple things like the number of actions taken, or how often something is run, things of that nature. That allows you to take something that's not quantifiable, find an order of magnitude for it, and then put it against something that you can quantify and show direct changes for.

For example, when setting up a new service, we found that observing users and maybe white-gloving some of our services took maybe an hour in total. We also knew that there were about 15 steps a user would have to take in [inaudible 00:35:54] to correctly set up a repository. Although each of the steps didn't take the same amount of time, by reducing the number of steps, we could get a rough estimate of the amount of time that our developers were saving. It helped us show, to perhaps someone in leadership or someone who has never needed to set one of these services up, the impact that we were making, and helped us communicate that we're saving about this much time for every developer who has to perform this action. We can track how many times this has to happen and how often it has to happen. Then estimating some of the orders of magnitude allows us to turn all of those things together into something that other people can understand. Really, the whole goal of this is to show that we're making progress and that ultimately our users' lives are better.

Resources

If you'd like to know more about other aspects of the work that I didn't have a chance to talk about, please give a read to the blog posts we published late last year and earlier this year. I've linked them here; there are four of them. The first one goes into way more detail about onebox, the history, and how we got here. The second one spends a lot more time talking about the tooling that we built to make working on the laptop super simple, and goes a bit into how we actually generate requests and send them to our staging environment. The third one is on a similar topic to this talk. The fourth one is a deep dive into how we redid our integration tests, basically got rid of them, and replaced them with some of the smoke tests that I hinted at here.

Branch Development, Operating On the Same Data

Originally, that was actually a huge concern for us as we started to roll this out. Obviously, the services and the code are not shared, you're only targeting the code that you've modified, but we don't have any isolation at the database level, so everything is still writing to the same fields, the same records. One of the things I hinted at, but that we deeply invested in, was a suite of tooling that allows developers to acquire or generate drivers and passengers, or rideables, because we also have bikes in the system, and lease them, use them, refresh them, and things like that. That is how we do the isolation between users for data, so that users aren't trying to move the same passenger around the map, or what have you. We found that's generally been fine. We haven't had any issues, knock on wood, with data corruption, because we already have pretty good practices for migrations and that kind of thing. It's definitely a tough problem.

Questions and Answers

Synodinos: How would the tester get all the baggage details of their deployed branch? Is there service discovery, or is it done manually?

Kaufman: We actually have two versions of this. One, we built a local tool that's a transparent proxy that adds the necessary data to the header. We don't expect people to hand-write the header; we don't have people manually adding it to their curls, or whatever. You can point the mobile phone or our local CLI tooling at that proxy, and then it will forward things and set things up correctly. One of the things we're actually rolling out now is effectively a subdomain that maps to whatever configuration you would like. For example, I might wire up my phone to jake.staging..., and that will attach whatever services or configuration flags I would like to send on the wire.

Synodinos: It was really interesting listening to the journey that Lyft has gone through in evolving the way developers develop. If this is the state of the art today, what do you see coming in the next two or three years?

Kaufman: The big thing that we're working toward now is data isolation. One of the things that keeps me awake at night is that a lot of our most important data is actually not really stored in databases or whatever. It's the ephemeral, event-driven analytics data that every little service along the path emits. If you're just testing one flow or whatever, you're going to generate events that don't correlate with anything. That's not that big a deal, generally. As we're leveraging this more, one of the things we're worried about is, are we polluting our analytics events? How do we make sure that we're not breaking the market because there's some signal going in that a person's requesting a ride over and over again, or something like that? That's definitely one of the big areas: how do we push the concept of context all the way through into our data systems and ensure that we're not breaking things, and that we can test those systems? One of the things that's just generally difficult right now is testing models. It's a big scary thing for most companies, particularly for us, because those models power a three-sided marketplace. If we get that wrong, people can lose money.

Synodinos: Were there any challenges in moving from onebox-style development and staging to a model where any developer can deploy as many test branches as they like? Any challenges with managing cost, quota, space?

Kaufman: The branches were so much cheaper that you would have to deploy a huge number of them to equal the cost, but that's not a particularly satisfying answer. The biggest focus was actually enabling people to work directly on their laptops without needing to deploy at all. Even on a good day, you have to push to GitHub, wait a couple of minutes at the earliest for the build, and then you have to pull it and deploy, and that's another few minutes. If you're wasting eight minutes to do the tightest loop of just trying to test your feature, that's such misery. The focus was really on getting people to just test directly from their laptop against staging. One of the things that we're working on right now is closing that loop so that you don't even have to deploy: the laptop is the overridden deploy, and that should allow developers to iterate unbelievably quickly.

Synodinos: Lyft is a big organization; it can afford a dedicated developer experience team. We had a presentation from Monzo, and from what I understood, their initiatives were more grassroots: different teams would make decisions that would affect everyone. Is there a certain size of organization that really requires a dedicated developer experience team?

Kaufman: I don't know where that starts. I joined Lyft to help found the DevEx team. At that point Lyft had maybe 400 developers, and I think that was maybe a little late to begin investing; Lyft had under-invested historically. My general advice is that you always want to be investing in the ability for developers to ship. One of the mistakes is thinking about it only in terms of technology. Especially earlier in an organization, you're going to have less code and less process, so you don't have to worry as much about whether the builds are fast enough, because you're just not building that much code. There, you want to think more about whether your processes are enabling developers to ship at the rate they need to in order to get the product going. One of the keys is investing early in at least measuring some of that stuff, even if it's not generating anything actionable yet. You really don't want to be in the situation where you can only deploy once a week, or it takes two or three hours just to get a PR built and ready to merge. Things like that, I think, are when things get really bad.

 


 

Recorded at:

Oct 07, 2022
