
Stress Free Change Validation at Netflix



Javier Fernandez-Ivern discusses why a high confidence change process for code bases is needed, how zero-noise diffs help close the confidence gap, and recommended practices for building a diff system.


Javier Fernandez-Ivern is a member of the Playback Experience team at Netflix, where he is responsible for ensuring that customers always enjoy their favorite shows with the best video, audio, text, and other features available. After trying out management at Capital One, he returned to his software engineering roots and joined Netflix.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.


Fernandez-Ivern: Welcome to stress-free change validation at Netflix. Changing critical systems is scary, and the bigger the change, the scarier it gets. Changing systems is usually unavoidable. There are features to add, bugs to fix, and the occasional big migration or refactoring required to keep the system current and technical debt low. My name is Javier Fernandez-Ivern. I am a software engineer on the playback experience team at Netflix. I love tinkering with systems to make them better. I'm going to talk about a system I built at Netflix, which has made refactorings low risk, fast, and stress free.


Let's start with the motivation: why we felt we needed to build this system. We're going to outline exactly what problem this system was trying to solve. We're going to build a simple solution. Then I'm going to show you how I iterated on that simple solution to make it better, until it was really good for what we needed. Finally, I'm going to walk you through some big migrations we've done with this tool, and also how we use it on an everyday basis for smaller jobs.


Before I tell you what we built, let's talk a little bit about why we built it. I mentioned that I'm on the playback experience team. What exactly is a playback experience? It's the set of features that describe how a Netflix customer will experience a movie or show that they want to watch. If you want to watch Sandman with subtitles on, that's part of the experience. If you want audio in a different language, or if you're watching in 4K HDR video, that's also part of the experience. In order to calculate the optimal experience for a customer, we need lots of different inputs that describe what's actually available for the particular movie, which country it's being played in, the customer's preferences, the device's capabilities, A/B test allocations, and many other things. There are so many possible combinations of these inputs that it's very difficult to achieve full test coverage. We have a great test suite, but we know there's always a possibility that it will miss the next bug.

Calculating the playback experience is an essential part of starting playback. It's mission critical for Netflix. Our system needs to be highly available, because if it doesn't work, then our customers can't play movies. We don't just want to be highly available, though. If suddenly every playback started with standard definition video, or with the wrong audio language for a customer, we wouldn't be giving our customers the best possible experience for them. We also want to be highly accurate in calculating the optimal experience, and we need to do this at Netflix scale for over 200 million customers.

This system has been around for a long time. Like most such systems, it's evolved organically as a collection of features, designs, and layers, cobbled together by many different people over the years. We've invested significant time in simplifying the architecture, adding great tests, and cleaning up technical debt. We would like to avoid having tech debt pile up like that in the future, and instead refactor and clean up the system opportunistically as we work on it. We want to be able to enthusiastically maintain a critical software system without breaking it. There must be a way to make big changes without setting everything on fire.

You might think, tests are good, so let's rely on tests to avoid breaking the system. We can also add some canaries. Canaries send a small percentage of production traffic to a candidate version of the service, then compare errors, latency, and other metrics against a known good version of the service before proceeding to a full deployment. Eventually you find out that it's not enough. There's always some unusual combination of inputs that isn't covered in your test suite and that slips by your canaries. The result is a production incident.

The Problem

We talked a little bit about the motivation. Let's identify the problem we're trying to solve. This is the problem: bugs. Bugs are the problem. We want to be able to change a system without adding new bugs. In other words, we want to make sure the change does what we intended, and also that it doesn't do anything else.

Let's talk about new kinds of bugs. Tests are great, that's true. What's hard is coming up with test cases, because we have a lot of possible input combinations. We can't possibly think of all the different input combinations that might cause a problem. Usually, we think about it for a while, come up with likely candidates, then write tests for those likely candidates. The higher the input dimensionality, the more likely it is that there's some other case you didn't think about. What ends up happening is that we keep finding bugs that we didn't think about.

This is made even harder by the fact that some bugs are rare. By that I mean they only happen for some specific input combinations that don't occur very often, so they're harder to find. If a bug happens all the time, it'll show up in everyday usage. It'll show up in canaries. You probably already thought of this bug, and you probably already wrote a test for it. The other thing that happens is that when you see rare bugs in production, they just look like noise. They pop up here and there, mixed up with all kinds of things that happen operationally. There are timeouts here and random errors there. The database has a weird thing over there. Here's a thing that's weird, but it's actually a logic error, and you can't tell it apart from the other stuff. Again, the more input combinations are possible, the more of these latent bugs are waiting for you to discover them.

Building a Solution

Let's build a solution. Here's the theory for how a solution to both of these problems would look. We can catch new kinds of bugs if we compare the result of a known good service with the new change that we're trying to make to the service. We're looking for unexpected differences. That is not a canary, where we send random samples of requests to one or the other service and compare statistical results, like how many errors are coming out. Here, we're actually replaying the exact same request and finding out: is there any difference in the response? Is it unexpected, or is it related specifically to the change I was intending to make? We can also catch rare bugs if we compare enough of these requests, a large enough sample size. There's a little bit of a caveat here: as long as you don't replay or compare every possible input, there's probably one input combination out there that you're not going to catch. Extremely rare events might slip by you with this technique. In practice, we find that most bugs we've caught so far popped up within 30 minutes or so, or a couple hours, of replaying a reasonable sample of requests. Again, sample sizes matter. If you think there's a bug and you're trying to catch it, you probably need more samples, since you already have reason to suspect there is a bug. If you don't think there's a bug, then you might not need to replay that many requests to get the amount of confidence you're looking for.

Let's look a little bit at how this works. We have a client. It makes requests to a service. The service returns a response. What we do is take a sample of those requests and write them to a request stream. Then we have a job running that consumes that request stream, and emits a pair of identical requests: one to version A of the service, which is a known good version, for example what's currently in production, and one to version B of the service, which is the candidate version we're trying to test. When the responses come back from these services, we compare them. We don't just compare them when they're successful. If they fail, we also compare the failures. The reason that's useful is that sometimes they'll fail, but with different errors. If you didn't intend for an error to change, that might point to a different bug. For example, I once accidentally broke one of our clients, so that all of our error codes became a single error code 1002. That was not intended. It was a good thing we found it out, because it would have confused the clients of our service quite a bit. They have alerts and dashboards that look at specific error types, and those would have broken with this accidental change.
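The replay loop described above can be sketched roughly as follows. This is a minimal illustration, not Netflix's actual implementation; the callables `call_baseline`, `call_candidate`, and `compare` are placeholders for the two service versions and the diffing step:

```python
import copy

def replay_and_compare(request, call_baseline, call_candidate, compare):
    """Send identical copies of one sampled request to the known good
    version (A) and the candidate version (B), then compare outcomes.
    Failures are compared too, since a changed error is itself a diff."""
    def invoke(call):
        try:
            return ("ok", call(copy.deepcopy(request)))
        except Exception as exc:  # capture the failure instead of raising
            return ("error", type(exc).__name__)

    return compare(invoke(call_baseline), invoke(call_candidate))
```

Deep-copying the request before each call keeps the two invocations truly identical even if a service handler mutates its input.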

Finally, we take whatever differences we found and put them in Elasticsearch. This allows us not just to display how many differences we're seeing at any point in time, but also to filter and search for them. For example, if we already know that we added an extra audio language in Canada, we can filter out requests that match an additional audio language in Canada, and see if there are any diffs left after that. That way we can, little by little, rule out differences that have a known explanation, to really find the ones that don't and might be bugs.
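Ruling out diffs with a known explanation could look like this in simplified form. The record fields and rule shape are invented for illustration; in practice this filtering happens via Elasticsearch queries rather than in application code:

```python
def unexplained_diffs(diff_records, known_explanations):
    """Filter out diff records matching a known, expected change
    (e.g. an extra audio language in Canada), leaving only the
    unexplained differences that deserve investigation."""
    def explained(record):
        # A record is explained if every key/value in some rule matches it.
        return any(all(record.get(k) == v for k, v in rule.items())
                   for rule in known_explanations)

    return [r for r in diff_records if not explained(r)]
```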

Noise, Noise, and More Noise

We wrote that. We actually ran it, and we had a lot of differences. Our service has no side effects and is deterministic, so we needed to find out why we were seeing so many differences. Why was there so much noise in the signal we were looking at? The very first thing that popped up was that we were adding a unique ID to every response. That was pretty obvious. It's not just that it's changing, it's expected to change. We're never going to get two responses with the same ID. The good news is, we didn't really care what the ID was. This is not an unusual thing. You might have something in your response that's nondeterministic, and you're lucky enough that you don't care what the value is. In a case like this, you can filter it out. Sometimes you have timestamps or IDs. If you don't care what their value is, you can omit them from the comparison. There are some neat tools that do this automatically for you. We're aware of an open source tool called Diffy, which tries to detect these fields automatically. What it does is make two requests to the baseline, and only one to the candidate. It uses the two baseline requests to figure out which fields vary with the same set of inputs, and considers those fields to be nondeterministic. Then you can use that to interpret the response you're getting back. The reason we didn't go in this direction is that, to solve other sources of noise, we needed to know a little bit more about the service. If we needed to do that anyway, we might as well know that manifest IDs can be omitted and just filter them out.

What happens when we can't ignore a noisy field? If the field is going to vary on every request and you cannot ignore it, you should try to find a way to test whether the two outputs are equivalent. An example we have in mind here is DRM licenses. DRM licenses are cryptographically secure and have all kinds of timestamps inside them. What you're actually getting back is data that will change on every request, but it's functionally identical if the inputs to the calculation are the same. Instead of comparing the license outputs directly, you can compare the parameters to the license, make sure those didn't change, and then consider the two licenses to be equivalent. If the output was not intended to vary on every request, you should try to find out why it's changing, and see if you can control the reason for the change.
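As a sketch of that equivalence test: rather than diffing the opaque license bytes, compare only the inputs used to generate them. The `params` structure here is an assumption standing in for whatever parameters the license service actually takes:

```python
def licenses_equivalent(license_a, license_b):
    """DRM license payloads are cryptographically unique on every
    request, so comparing raw bytes always produces a diff. Instead,
    treat two licenses as equivalent when the parameters that
    generated them are identical."""
    return license_a["params"] == license_b["params"]
```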

Here's another example. Sometimes outputs are equivalent because there's something nondeterministic going on in your server that you don't care about. We have an example here where the baseline and the candidate both return this thing called bitratesInKbps. It's related to audio encoding; you don't need to know what it means. What you do need to know is that we get an array of numbers back, and they're unordered. We don't care what the order is, and it's nondeterministically ordered. In this case, because we know our service and we know these arrays are unordered, we can ignore their ordering when comparing the responses, and consider the two to be equivalent rather than an actual difference. What these things have in common is what we call response normalization. Basically: remove variability if it doesn't matter. Whenever possible, and that's important, remove any fields that are noisy. Sort any collections that are unordered. Remove duplicate entries in a collection, as long as they don't matter. We had some of that.
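A minimal normalization function along those lines might look like this. The field names are taken from the examples in the talk, not a real schema, and the defaults are illustrative:

```python
def normalize(response,
              noisy_fields=("manifestId",),
              unordered_fields=("bitratesInKbps",)):
    """Remove variability that doesn't matter before diffing: drop
    fields expected to differ on every request, and sort (and
    deduplicate) collections whose ordering is nondeterministic."""
    normalized = {}
    for key, value in response.items():
        if key in noisy_fields:
            continue  # omit noisy fields from the comparison entirely
        if key in unordered_fields and isinstance(value, list):
            value = sorted(set(value))  # order and duplicates don't matter
        normalized[key] = value
    return normalized
```

Two responses are then compared after normalization, so `normalize(a) == normalize(b)` means "equivalent" rather than "byte-identical".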

After we did that, we still saw a good number of diffs, and we had to dive deeper. One thing we saw was spikes in the diff record right around the time we would launch a new movie. What happened is, movie launches are controlled by a launch date. If there was a slight time difference between when the baseline request and the candidate request were processed, then sometimes the two requests would have a different view of whether a movie had launched. What it comes down to is that the wall clock was causing noise, and it's a very common source of noise we've found. The problem is that if we just look at a request, with a server processing the request and returning a response, we don't have a complete mental model of what's going on. The request is a set of explicit inputs, but there are other inputs to the calculation besides the request. We sometimes call them implicit inputs or hidden inputs. One really great way of finding these inputs is to do a diff run, a comparison run, where both the baseline and the candidate run exactly the same version of the service. Then, for any fields that are changing, there's something else going on, because you know the request is the same, so now you know what to look for. In the case of the clock, it's as simple as expanding our mental model: the clock is also an input to the calculation. That's what we did. We modeled it, and we allowed passing in an explicit request time that our service uses, so that we can guarantee the baseline and candidate requests are processed with exactly the same time. That got rid of that noise.
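Making the clock an explicit input can be as small as this sketch. The field names and the launch check are illustrative, not the real service logic; the point is that when the replay job pins the same `requestTime` on both copies of a request, baseline and candidate see the same time and launch-date noise disappears:

```python
import time

def handle(request, clock=time.time):
    """Treat the wall clock as an explicit input: use a pinned
    'requestTime' when the request carries one, falling back to the
    real clock for normal traffic."""
    now = request.get("requestTime", clock())
    launched = now >= request["titleLaunchTime"]
    return {"playable": launched}
```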

However, most of these implicit or hidden inputs aren't quite so easy to fix. We're going to look at what to do about those, because after pinning the request time, we still had plenty of diffs to explain. Let's look at a few examples and see what we can learn from them. One thing we saw was periodic spikes that aligned with every time the movie catalog metadata would update. The movie catalog metadata is a service that tells us everything we need to know about a particular movie or show: what video streams are available, what languages the audio is in, all sorts of things we use to find out what we can select from to create the experience. This service updates frequently, several times an hour. Every time it updated, sometimes the baseline and candidate requests in a pair would see different versions of this data, which would result in diffs. We needed to find a way around that. Sometimes you get lucky. This service has a version that you can request. You can say, I want this exact version of the data for both of these requests. In this particular case, we couldn't do that, so we had to find a different way to control for this problem.

Another source of diffs is anytime you're getting data from a database, reading mutable state from somewhere. In our case, for example, we have to find out what languages a viewer prefers. When a customer is trying to watch a movie, let's say the original language is English, we need to know: what does this customer like to do for movies in English? Do they like to watch in the original language? Do they like to watch dubbed in a different language? Maybe they want subtitles, or they don't want subtitles. These preferences change as customers view movies and select one or the other option. We learn what they did, and we try to apply it again in the future. Every time that data changed, we would also see a difference. Finally, we had another class of services that weren't really expected to change. These are services that don't change very often. We still saw noise in the inputs we got from those services; that is, the responses would sometimes be different even when the service wasn't really changing. What was happening is that there would be occasional timeouts or failures calling the service, and then we would get a default value. Our system would compare the default value with the real response the other request got, and notice a difference. We needed to control for that too.

This happens a lot, essentially anytime you have a distributed system: anytime you have requests that can fail, anytime you're comparing a success with an error or a fallback, or anytime you have eventual consistency in your data and your two requests might see different versions of that data. What it comes down to is that distributed dependencies create their own noise. Even if you're calling a service that is idempotent, you have to account for the fact that something might happen along the way to cause you to see different results. How do we control for all of this? We did this thing called creating a lineage for a service response. We need our service to give us a hint, to tell us what inputs it used for its calculation. This required us to tweak the response we get from the service and add a little map we call the lineage. For every dependency, for every input that is not part of the request, we log some data that allows us to compare two responses and see if they got the same inputs. If you're lucky, the dependency is versioned. A lot of our metadata services are versioned; they'll have a big timestamp, and we just use that. We just say our catalog version is this number. For anything that doesn't have a version, we added a checksum or a digest of the response we got back. It's really efficient; it didn't use much CPU. It allowed us to compare two responses from our service and see, for example: did these two executions see the same collection of A/B allocations? Did they see the same user preferences? The upshot is, when we're comparing responses from our service and they're different, we can find out: are they different because one of the inputs was different, or for some other reason that we have yet to explain?
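A lineage entry along these lines could be computed as follows. The choice of SHA-256 over canonical JSON is an assumption; the point is just a cheap, stable fingerprint of an unversioned dependency response, with the real version used whenever one exists:

```python
import hashlib
import json

def lineage_entry(dependency_response, version=None):
    """Record what a dependency contributed to this calculation: use
    its version if it exposes one, otherwise a digest of its response,
    so two executions can be checked for identical hidden inputs."""
    if version is not None:
        return str(version)
    # sort_keys makes the digest stable regardless of dict ordering
    payload = json.dumps(dependency_response, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()
```

The service's lineage is then a map from dependency name to `lineage_entry(...)`, attached alongside the response.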

Now that we have lineages, which I think are the most interesting thing in this talk (if you have one slide to remember, I would recommend remembering lineages), let's see how the comparison actually works. Assume we get both responses back. If both are failures, we compare the actual errors we got back, and see if we got the same error or different errors. If they're not the same, we go ahead and write that diff to our Elasticsearch database for the dashboard. If only one of the requests failed and the other succeeded, we also log it, with a different tag we call a success difference. This is interesting. It happens sometimes; any distributed system call can fail. It's useful to log it because we're looking for unusual patterns. If suddenly our candidate is failing a lot, there might be something else wrong with the system. Finally, if both responses are successful, we compare them. If they're not different, we're done. That's not a diff; we're happy with it and can move on. However, if they are different, the next thing we do is compare the lineages. If the lineages are equal, then we have an actual real diff. We can tell that not only was the request identical, but all the other hidden inputs to the calculation were identical too. This is a real diff. We should probably examine it and make sure it is intended, that it does what we want. Otherwise, if the lineages are not equal, we still log it, but annotated as a lineage difference, which means there's an explanation for the difference that is more likely than a code change. The expectation is, if you actually broke something, it's going to show up when the lineages are equal. You will see it in your real diffs.
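The comparison logic just described can be written as a small decision tree. The response shape here (`ok`, `error`, `body`, and `lineage` keys) is assumed for illustration:

```python
def classify(resp_a, resp_b):
    """Classify a replayed pair of responses. Returns a dashboard tag,
    or None when the pair is not a diff at all."""
    if not resp_a["ok"] and not resp_b["ok"]:
        # Both failed: only differing errors are worth logging.
        return None if resp_a["error"] == resp_b["error"] else "error_diff"
    if resp_a["ok"] != resp_b["ok"]:
        # One side failed: log it and watch for unusual patterns.
        return "success_diff"
    if resp_a["body"] == resp_b["body"]:
        return None  # identical successes: not a diff
    if resp_a["lineage"] == resp_b["lineage"]:
        # Same request, same hidden inputs, different output: investigate.
        return "real_diff"
    # A hidden input differed, which is the more likely explanation.
    return "lineage_diff"
```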

After we accounted for lineage, noise dropped to zero. We stopped seeing diffs in the absence of changes, essentially. This was super useful because it allowed us to diff really frequently without wasting time hunting false positives. We weren't seeing diffs that weren't directly attributable to some change we made, whether in our code or our configuration. One thing to note: if you're not expecting any functional changes at all, for example if you're doing a refactoring or a migration and you expect the behavior of your system to remain unchanged, it's even easier. You don't want to see any diffs. Any diffs at all are probably a problem. Those are the easiest to interpret. If you are seeing some deviation because you did make a functional change, then you at least know what to expect. You might expect, for example, that the video resolution changed but nothing else, because you made a change about video resolution. If you suddenly see diffs on the languages side of your response, that's something you were not expecting, and it's worth investigating.

What about side effects? What if your system has side effects? The biggest problem is when your system writes to a database but also reads those same things back as part of its calculation. For example, your system reads user preferences, but also writes user preferences. Now you're going to have interference between your baseline and your candidate. You have a race condition: let's say the baseline gets in first, writes a thing, and now the candidate sees a different input than the baseline did. We didn't have this problem for playback experience, because it doesn't write to customer preferences, it only reads from them. Unfortunately, if you have this scenario, where you write to mutable state and read it back as part of a single calculation you're trying to compare, I don't currently have a solution. It's an unsolved problem that I'm super interested in, but don't yet know how to work around. The other issue with side effects is when they're non-idempotent: any situation where doing multiple writes is bad, for example incrementing a counter. It's particularly bad if you're running this in production. If that's the problem you have, then you need to work around it. You can, for example, prevent replay requests from altering production state by turning off the write, or potentially by writing to a different environment. If you're an application owner interested in implementing a solution like this, and you have non-idempotent side effects, you're going to have to figure out what your service needs to do in this case.
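One simple form of the write guard mentioned above is a flag that marks replayed traffic. The `is_shadow` flag and the store interface are assumptions for illustration:

```python
def save_preferences(customer_id, prefs, store, is_shadow):
    """Skip non-idempotent writes for replayed (shadow) requests so
    the comparison run never mutates production state twice."""
    if is_shadow:
        return  # replayed requests must not alter production state
    store[customer_id] = prefs
```

Writing to a separate shadow environment instead of skipping the write is the other option the talk mentions, and keeps read-after-write behavior closer to production.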

Diffing in Action

Since we now have zero diffs, let's see how we use this in our everyday work. The first thing we did was a couple of really big migrations. The first use case was moving from a Guice based injection platform to Spring Boot. It's not just the dependency injection container we were migrating. There's a whole platform of libraries, including libraries for data access, for using Kafka, authentication, you name it. Part of this migration involved switching all those dependencies as well. We wanted to make sure we did this migration without breaking our system. We were able to do this really well by breaking the migration into a large number of small mini-migrations. We would migrate one support library at a time: for example, let's migrate our data access for Cassandra first, then this, then that. Once we did the preliminary migrations, we switched the injection frameworks and chased down every single remaining dependency. Every time we did that, we got really quick feedback from our diffing framework that ensured we didn't change anything functionally.

We ran this several times a day, which gave us a lot of quick feedback loops. We were able to fix anything that was broken. By the time we were done, we knew very quickly that functionally we had changed nothing. We still had to run some canaries and make sure that latency and other non-functional considerations were also good. It was pretty amazing, because normally this kind of thing requires a lot of trust in your test suite, trust that it'll catch all the weird little corner cases. In this case, we actually had to migrate the test suite itself to work with the new platform. That's always scary: if you're changing your tests, how are you sure they're actually going to work exactly the way they were working before? That wasn't a problem with diffing.

Then the other thing we migrated was the service we use to get movie catalog metadata. What was interesting in this case is that when we migrated to the new version of this service, it was new; some parts of it had been recently implemented. Having the comparison framework allowed us to verify not only that our service worked correctly, but that that service worked correctly as well. We actually found a pretty rare bug in it. It was a very rare bug, because it was only happening in one country, with one language, and actually only for one or two movies. One particular language code, KM, which stands for Khmer, was coming back from the new framework as KHM, which is also a valid code for Khmer under a different standard that uses three-letter codes. The new service was using the three-letter conversion for language codes, whereas most of the rest of Netflix depends on the two-letter language codes. That was pretty great, because we were able to discover it in the diff, fix the issue with the service, and deploy again quickly with no incidents. That was unexpected. It's not what we built the framework for, but we were able to use it to test a dependency of our service.


Finally, this is one of my favorite quotes. It's by Kent Beck: "Make the change easy (warning: this may be hard), then make the easy change." There are a lot of reasons why making a change easy could be hard. It could be architecturally difficult to make a refactoring. Maybe it's just difficult to wrap your head around and decide what you want to do. This isn't going to help you with that. Once you know what you want to do, one reason making changes easy can be hard is that refactorings introduce risk. They take time. They're hard to validate. This can help you with that. We do this every day; it's most of the work I do daily. Because I'm changing an old system with a lot of code in it, when I want to introduce a new feature, there's usually some cleanup, some refactoring, and some rethinking of the code that I want to do to make the new feature fit better into the code base. It's great. I break my work into two parts. Part one: do the refactoring, then diff it. I run the job and make sure I broke nothing. Part two: make the changes I intended to make, run another comparison job, and make sure the only thing that changed is the thing I was looking to change. That's it, really. That's the secret. We do this pretty much every day. Most days, somebody on my team is running one of these comparison jobs for some work they're doing. This approach is garnering some interest; a few other services are looking to onboard and apply the same approach.

Questions and Answers

Tucker: Do you have any examples of a rare bug that made it through your diffing framework, and what gaps you closed to address that? Have you run into that yet?

Fernandez-Ivern: Not really. I can use this as a stepping stone to talk about some other things that do happen. Basically, where a rare bug will slip through is if you allow noise to come back in and then confuse the rare bug for noise. As long as you can keep noise at zero, it's not going to happen. What we have found is that as we progress with the system, every once in a while somebody forgets. Let's say they add some collection field to a response and forget to either make it deterministic or normalize it away. One of those two things has to happen, otherwise you will get noise back. Another thing that happened is we added another use case, and it was hitting the same clusters we use for this. I started seeing more success diffs. I was ignoring it: who knows, it must be transient. Then I traced it down to throttling, because we were overloading the little mini clusters I had set up for this. If you give this some minor love and maintenance, so that you don't let noise creep back in, then you don't get these rare bugs. That's why we haven't really seen this happen.

Tucker: One of the things you talk about is side effects. I think a pretty common source of side effects could be metrics and logs, and that sort of thing. Did you all have to deal with that in the system so that you don't accidentally alert yourself from the shadow traffic and things like that?

Fernandez-Ivern: Yes. We did have to make sure the alerts don't fire on these, particularly because these are clusters that serve no real customers, so we can send all the traffic we want at them. It makes no sense to hold them to production SLAs, even though they do run in the prod environment. That's one thing: you've got to make sure your alerts don't fire. The other thing is, we have pretty extensive insights, and some of those insights are surfaced by tools used by other teams. For a practical reason, we didn't want to pollute those insights with a whole bunch of essentially not-real requests. That's the other thing we had to do: make it so that shadow traffic doesn't go there. We chose to run this in a production environment, so you've got to make sure you don't impact production.

Tucker: In my experience, we found that one of the best ways to catch rare bugs was through code reviews. Has this been the case for you, too? Do you feed information from rare bugs found in testing back to the developer team as something to look out for in code reviews?

Fernandez-Ivern: In this case, the developer team is us as well. By all means, if we find something interesting and rare, we will not only fix it, but also let the team know about it so that we can watch for it. We definitely leverage code reviews; they can catch a number of bugs. But when reviewing code, there's plenty of stuff that I don't catch. It's a necessary step, but I've found it not sufficient to catch everything.

Tucker: What layers in the testing pyramid does your team implement?

Fernandez-Ivern: We do unit tests and integration tests. There's also a fair amount of smoke testing, but no end-to-end tests, for the practical reason that it's quite difficult to do that from our layer, especially in a non-noisy way. We understand our domain really well: a mid-tier service that does one thing very specifically. End-to-end tests of, let's say, firing up a device and actually getting a movie playback to work require knowledge of a lot of things that our team doesn't have great visibility into, so there's a specific team that does that kind of thing. That definitely exists at Netflix and is executed routinely, but we don't do it as part of our own software development lifecycle.

Tucker: Does your service need to handle authentication or authorization of the request? If so, how does that play into the replay requests, determinism, and normalization?

Fernandez-Ivern: We don't. That all happens on the edge that calls us. We have the benefit of an orchestration layer that, among other things, handles things like authentication, authorization. By the time we get a call to our service, that's all taken care of, and we don't have to worry about it.

Tucker: Do you have any thoughts on how that would play out?

Fernandez-Ivern: What would the challenges be? If the authorization were good for only one call, if there's a ticket-based scheme where you would need a new token for each request, then there would be a change in the input. If that's not the case, if it's just an authentication token that can be reused, then you're doing a replay attack against yourself. That's ok; that should work. The trouble is if you somehow have to mint a new authentication token for both requests. In that case, I would have to understand more about how that process works: can you create two equivalent tokens from the same inputs that you could trust to have the same effect? Maybe you could do that.

Tucker: You have this tool. It's working great for you. If you were going to invest in additional features for shadow diffs, what would they be, and why? What do you still want to build out there?

Fernandez-Ivern: There's two things that I would love. One of them I think we'll probably get to doing, and the other one, we'll see. The first is a better differ. One of the motivations for building the system was that I didn't know how to build a really great differ, a smart one that would know what's important and what isn't, so it was built so that you could do a literal diff and still get no noise. I would still love to have a better differ: something visual that shows me very easily, even in a deeply nested part of a request, what exactly the difference is, so that as a developer, once I've discovered something, I can more easily decide what's important and what isn't, and rule out diffs and say, "No, this is fine." I think we're going to get that done. I've been talking to a couple of people at Netflix who might be interested in solving the same problem.

The other thing that I would like is a better workflow for discriminating between changes that are expected and good, and changes that are unexpected and interesting or bad. There's a few ways this could be done. It could be done in an a priori way: "I changed languages in France," and you somehow describe that as an expected change, and then the system automatically filters everything that matches. The more realistic thing that I think we might end up doing is a system that lets me just check off a thing and have it ignore anything like it. I'm looking at my list of diffs, I see one I've already checked, it's ok, click, and anything like it disappears; show me only the things that are not like this. Then you can whittle it down to either nothing, which means you're good, or a small number of worrisome diffs that you have to explore. I'd love that.
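
The "check off a thing and hide everything like it" workflow could be sketched as fingerprinting the *shape* of each diff, so one acknowledgement suppresses all structurally similar diffs. This is a hypothetical illustration, not the system described in the talk; the diff record format is invented:

```python
import hashlib
import json

acknowledged: set[str] = set()   # fingerprints the developer has checked off

def fingerprint(diff: dict) -> str:
    # Hash the field paths and change kinds, but NOT the values, so one
    # acknowledgement covers every future diff "like this".
    shape = sorted((path, kind) for path, kind, _ in diff["changes"])
    return hashlib.sha256(json.dumps(shape).encode()).hexdigest()

def triage(diffs: list) -> list:
    """Return only the diffs whose shape has not been checked off yet."""
    return [d for d in diffs if fingerprint(d) not in acknowledged]

d1 = {"changes": [("audio.languages", "changed", ("en", "fr"))]}
d2 = {"changes": [("audio.languages", "changed", ("en", "de"))]}
assert triage([d1, d2]) == [d1, d2]   # nothing acknowledged yet
acknowledged.add(fingerprint(d1))     # developer checks off d1
assert triage([d1, d2]) == []         # d2 has the same shape, so it's hidden too
```

Whittling the list down to empty then means "you're good"; anything left is worth exploring.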

Tucker: How long do these kinds of tests take to run on your system? On which stage of the lifecycle do you run them?

Fernandez-Ivern: We run a continuous diff between production and test. That runs all the time. Anytime anything gets merged and pushed to test, you can go to a dashboard and see how different test is from prod. We like that for a number of reasons. One of them is that even if we missed something, we can go back and very easily see which version introduced the change, what exactly the change was, and when. We keep a historical record for a month; it's really useful. The other thing we have is the ability to launch this on demand to compare a branch that we're currently working on with the latest build in test. That one we usually run for 30 minutes to a couple of hours. It starts very quickly and you get results immediately; it's more about how many samples you want before you feel happy about your change. Two hours is a fairly long diff. Every once in a while, when we want a lot of confidence in something really massive, I might run it for a day, but that's very rare.
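
At its core, each sampled request in such a run is sent to both environments, normalized, and recorded only if the results mismatch. A minimal sketch, with the environment calls and the normalizer passed in as plain callables (hypothetical shapes, not Netflix's actual API):

```python
from typing import Callable, Optional

def shadow_diff(
    request: dict,
    call_prod: Callable[[dict], dict],
    call_test: Callable[[dict], dict],
    normalize: Callable[[dict], dict],
) -> Optional[dict]:
    """Send one sampled request to both environments; return a diff record or None."""
    prod = normalize(call_prod(request))
    test = normalize(call_test(request))
    if prod == test:
        return None                       # zero noise: identical means no diff at all
    return {"request": request, "prod": prod, "test": test}

# Stubbed environments standing in for real service calls:
identical = shadow_diff({"title": 42}, lambda r: {"v": 1}, lambda r: {"v": 1}, lambda x: x)
assert identical is None
changed = shadow_diff({"title": 42}, lambda r: {"v": 1}, lambda r: {"v": 2}, lambda x: x)
assert changed is not None
```

Running this continuously against sampled traffic, and keeping the diff records with the deployed version attached, is what makes "which version introduced the change, and when" answerable after the fact.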

Tucker: If you were going to roll this out for a stateful system, what are some of the things that you might start thinking about?

Fernandez-Ivern: One of the things we could do is isolate the effects in some way, by having the canary and the baseline perform their effects in different spaces, say two different database clusters or keyspaces, and then compare what they did. That's one thing to do. It also strikes me that if these effects are undoable, if they're reversible, you should be able to use that in some way: either by reversing the effect of one thing by applying the other, or even along the lines of, if I have two things that are supposed to be equivalent and they're reversible, applying one and reversing with the other should give me the same thing. Isolating the effects would work for sure, but it might not be feasible for some systems. If you can isolate, go for that.
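
The keyspace-isolation idea could look like prefixing every write with the variant that produced it, then comparing the two keyspaces after the run. A toy sketch, with an in-memory dict standing in for a real database:

```python
class KeyspaceRouter:
    # Route baseline and canary writes to separate keyspaces so their
    # side effects can be compared afterwards without interfering.
    def __init__(self, store: dict):
        self.store = store

    def write(self, variant: str, key: str, value) -> None:
        self.store[f"{variant}:{key}"] = value

    def keyspace(self, variant: str) -> dict:
        prefix = f"{variant}:"
        return {k[len(prefix):]: v
                for k, v in self.store.items() if k.startswith(prefix)}

store: dict = {}
router = KeyspaceRouter(store)
router.write("baseline", "user1", {"lang": "en"})
router.write("canary", "user1", {"lang": "en"})

# After the run, diff the two keyspaces exactly like response bodies:
assert router.keyspace("baseline") == router.keyspace("canary")
```

The same zero-noise discipline applies: any nondeterministic parts of the written values would still need to be normalized before comparing.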




Recorded at: Jun 27, 2023