
Dark Side of DevOps



Mykyta Protsenko discusses the trade-offs that companies face during the process of shifting left, how to ease cognitive load for the developers, and how to keep up with the evolving practices.


Mykyta Protsenko is a senior software engineer at Netflix. He is passionate about all things scalable, from coding to deploying to monitoring. He can be found speaking at a variety of conferences - OSCON, DevNexus, Devoxx (Ukraine, Belgium, United Kingdom), GeeCon and others.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.


Protsenko: My name is Mykyta. I work at Netflix. My job is basically making sure that other developers don't have to stay at work late. I call it a win when they can leave at 5 p.m., and still be productive. I work in the platform organization, namely in productivity engineering, where we try to abstract toil away for the rest of the engineers, where we try to make sure that the engineers can focus on solving business problems instead of dealing with the same boring technical issues over and over again.

Let me ask you a few questions. How many of you work in companies that practice the you build it, you run it philosophy? How many of you are happy that you don't have any gatekeepers between you and production, that you can deliver features and fixes faster? How many of you have ever faced a situation when you're dealing with a production incident, and you're at a loss for what to do, and you wished it were somebody else's problem? Let's be honest, because I've been there quite a few times myself. Yet, I don't want to go back to the times of having a big old wall between development, QA, and operations, where you write the code and you throw it over the wall to QA, who find bugs. Then they throw it back to you, and so on, until you both agree it's good enough. Then you throw the code over another wall to operations, and then you expose it to users. Then your users find bugs. This whole process repeats, and it's usually really painful.

Accelerate: State of DevOps 2021

It's not just my opinion, the State of DevOps survey highlights year after year that companies with a higher degree of DevOps adoption perform better. They have a better change failure rate. Basically, their deployments fail less often, and that happens even though they're doing many more deployments in general. However, this achievement, this win, comes with a price, because operating your own code in production means that you have to wear a pager. You need to be ready to respond to incidents. You may need to actually communicate with real customers, who may provide valuable feedback, but they also require time and attention. As a result, you can definitely ship code faster, and your code quality improves, but the burden of operational duties is non-negligible on top of that. As the shifting left movement is gaining traction, we see even more load being added to the developers' plate.

Shifting Left

How does it happen? Let's review it on the example of shifting testing left. The traditional software development lifecycle has testing as one of the very last steps. The majority of the testing happens after the code is written. By using things like test-driven development, by automating our tests, we can shift the majority of testing activities to the earlier phases, and that reduces the number and the impact of the bugs. Because the earlier we can catch a bug, the less expensive it is. A bug that is caught in the development phase means that no users have to suffer from it. A bug that is caught in the planning phase means that we don't even have to waste time and resources writing code that is faulty, defective by its very design.
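
To make the test-first idea concrete, here is a minimal sketch, with a made-up function, of writing the contract as tests before the implementation exists. The function name and behavior are illustrative, not from the talk.

```python
# A minimal sketch of test-first development: the assertions at the bottom
# would be written first, capturing the contract, and the implementation is
# then written to make them pass. The function is hypothetical.

def parse_duration(text: str) -> int:
    """Convert a duration like '2m30s' into seconds."""
    seconds = 0
    number = ""
    for ch in text:
        if ch.isdigit():
            number += ch
        elif ch == "m":
            seconds += int(number) * 60
            number = ""
        elif ch == "s":
            seconds += int(number)
            number = ""
        else:
            raise ValueError(f"unexpected character: {ch!r}")
    return seconds

# These tests existed before the function body did.
assert parse_duration("90s") == 90
assert parse_duration("2m30s") == 150
assert parse_duration("5m") == 300
```

Catching a malformed duration here costs one failing test run; catching it in production costs an incident.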

Of course, it's not free. To achieve that, developers have to learn things like testing frameworks and testing methodologies. JUnit, test-driven development, the Jenkins pipeline DSL, GitHub Actions are just a few of the things that have now become an everyday part of developers' work. Yet shifting left doesn't stop with shifting left the testing, the next step may be shifting left security. Because, yes, on one hand, we need to care about security, especially as we give developers the freedom to design, deploy, and operate code in production. On the other hand, it means that developers and security specialists now have to work together, and developers have to learn and take care of security-related things like static code analysis and vulnerability scanning. Shifting left doesn't even stop there, more processes are being shifted left as we speak, for example, things like data governance. We have a problem. On one hand, DevOps as a concept introduced shifting left to all these different practices. It removes barriers. It lets us deliver code faster with more confidence. On the other hand, we've got ourselves a whole new set of headaches. We need to navigate between two extremes now. We don't want to introduce a process so rigid that it gets in the way of development and shipping code, yet we cannot avoid the price that we pay to ensure this constant flow of changes. We can and we should try to minimize this price, make it more affordable. Let's talk about how we can do it. What can we do to make the complexity of DevOps, the complexity of shifting left, more manageable? What are the problems that we may face along the way?

Looking At History

First, let's take a look at the history of the problem to see if we can identify the patterns, to see if we can come up with solutions. I'm going to pay attention to several things as I review each step of our DevOps journey. I'm going to take a look at communication, basically, how do the problems that we face affect the way we interact? Who's interacting with whom in the first place? I'm going to take a look at tooling. What tools work best for each step of the journey? How do we even set up those tools? I'm going to look at the cognitive load that comes with bootstrapping, with the creation of new projects, and with migration. Basically, I'm going to see how easy it is to set things up and how easy the changes are. I'll also try to illustrate the journey with some examples from my day-to-day work where I can, to make sure that the things I'm talking about are not just hand-wavy stuff.

Ad-Hoc DevOps

How does DevOps adoption start? You may have a small company, or a startup, or maybe a separate team in a larger company where people realize they want to ship things fast, instead of building artificial barriers. They want to identify, automate, and eliminate repetitive work, whatever this work is in their case. Now we may have a cross-functional team that works together on a handful of services. The members of this team exchange information mostly in informal conversations. Even if they have documentation, even if this documentation exists, it's usually pretty straightforward. They may have a Google Doc or a Confluence page that people can just update as needed. Tooling and automation are also pretty much ad hoc.

Here's a quick example. We might start with setting up a few GitHub repositories to store the code. We can create an instance of a Jenkins server to monitor the changes in those repositories and run the build jobs to create Docker images and publish them to Artifactory. We may have another set of Jenkins Jobs that we can set up to test the code and another set to deploy it to our Kubernetes cluster. At this stage, we can totally get away with managing those Jenkins Jobs and Kubernetes clusters manually or semi-manually, because it's not a big effort. It's literally a matter of copy-pasting a few lines of code or settings here and there. At this point, the goal is more about finding the pain points and automating the toil, removing this manual work from the software development process. Those few scripts and jobs that you set up provide an obvious win. You have an automated build and deployment process, which is so much better than deploying things manually. Moreover, putting more effort into managing Jenkins and Kubernetes automatically may require more time compared to those small manual changes. After all, at this point it's just a few projects and a few people working together. When something is deployed, everybody knows what happened. It's not like you have hundreds of projects to watch, and investing a lot of time into automating something experimental, something that you're not even sure is going to work for you in the long run, may not be worth it at all.
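
The ad-hoc glue described above often amounts to little more than recorded shell commands. Here is a hedged sketch of that stage in Python; the service name, registry host, and deployment commands are all made up, and a real script would swap the `print` for `subprocess.run`:

```python
# A sketch of ad-hoc build-and-deploy automation: build a Docker image,
# push it to an Artifactory-style registry, and point the Kubernetes
# deployment at it. All names and hosts here are hypothetical.

def release_commands(service: str, version: str,
                     registry: str = "artifactory.example.com") -> list[list[str]]:
    image = f"{registry}/{service}:{version}"
    return [
        ["docker", "build", "-t", image, f"./{service}"],
        ["docker", "push", image],
        ["kubectl", "set", "image", f"deployment/{service}",
         f"{service}={image}"],
    ]

# Dry run: print what would be executed.
for cmd in release_commands("billing", "1.4.2"):
    print(" ".join(cmd))
```

At this stage that script is a genuine win over manual deployments, even though it handles no edge cases at all.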

What about making changes, about migrations? Sure, let's spice things up. What if we decide that regular Kubernetes deployments are not enough? What if we want advanced canary analysis and decide to use Spinnaker instead of Jenkins? Because Spinnaker provides canary support out of the box, so let's go ahead and use it. It's a great tool. It's still pretty straightforward. We can decide to do an experiment. We can choose one of the projects and we can keep tinkering until we get everything right, then we can just proceed with the rest of the services. We can keep disabling Jenkins Jobs and enabling Spinnaker pipelines, project by project, because the risk of something going wrong, the risk of something being left in a broken or incomplete state, is low. Because you literally can count the number of places where you need to make changes on one hand, and you can do this change in one sitting. If you have five microservices, and if you need to upgrade a pipeline for each of them, it's not too hard to find the owners. It may be just a couple of people in your team or company. As for the tooling, basically anything works if it solves your problem. It doesn't really matter yet how you set it up. Yes, there is some additional work for all involved parties, but it's not overwhelming yet because the size of the effort itself is limited. There is an immediate noticeable win. The developers don't have to worry about manual steps, about repetitive tasks, they can concentrate on coding. That approach reduces cognitive load for developers at this stage when you just have a handful of services.

Here's a real-life example. Even in Netflix, with all the best practices, with all the automation, there is still room for solutions that are managed manually. For example, this particular application has some caches, and those caches have to be warmed up, because the application has to make sure that it has warm caches before it can start accepting production traffic. If we don't do that, its performance drops dramatically. This pipeline was just created manually by the team who owns this application, and its main feature is these custom delays before it starts accepting live traffic. Since this is a single pipeline that serves a single application, maintenance and support are not a huge deal, because this pipeline only needs to be updated or tuned once a quarter or less. Since the team that manages this pipeline is small, it's not hard to find the people who are responsible for maintaining it. Updating it is also pretty straightforward, because the pipeline is pretty much self-contained. It doesn't have any external dependencies that people need to worry about. They can just make whatever changes they need. This manually maintained pipeline provides an immediate benefit. It solves the performance issues that are caused by cold caches, and it removes the cognitive load created by shifting left those operational duties.

What's the Next Step?

However, as the number of applications keeps growing, as they become more complex, we face new challenges, because it's now harder to keep track of what users want. We may have a few applications that are written in Ruby and the rest of the applications written in Java. Some of the Java applications may be on Java 11. Some of those Java applications may be on Java 17 already. On top of that, you have more people managing those applications. Even with bootstrapping, when you have new developers who want to create a new service, it's not easy for them to find the right example, whatever right means in their particular context. Even finding the right person to talk to about this may be challenging.

The maintenance of applications becomes more problematic as well, because if I have an improvement that I need to share across the whole fleet, I need to track down all the services, all their owners. That's not easy anymore. For example, if we take our migration scenario, where we set up Spinnaker pipelines with canary deployments, and we keep disabling old Jenkins Jobs, it looks doable at first glance. However, when we start scaling this approach, we quickly run into our next issue. Once you're trying to migrate dozens or hundreds of applications, you cannot just keep doing this manually, because you have too many things to keep in mind, too many things that can just go wrong. When you have hundreds of applications, somebody will forget to do something. Somebody will mess up a setting. They will start the migration, get distracted, leave it in an incomplete, broken state, and forget about it. Imagine a production incident happens and you need to figure out what state your application infrastructure is in. People will be just overwhelmed with this amount of work. Just tracking all the changes, coordinating all the efforts, is not an easy problem anymore. That's exactly the problem we started with. We have created more cognitive load for the developers by shifting left those operational duties, and that cognitive load doesn't bring any immediate benefits.

The Paved Path

Here we come to the concept of the paved path. It is known under different names. It's called the paved path in one company. It's called the golden path in other companies, but the result is the same. The paved path is an attempt to find a balance between old school gatekeeping and a free-for-all approach. Usually, once it's clear that a free-for-all approach produces too much chaos, creates too much cognitive load, people start looking for alternatives, for guidance. They start looking for a better solution, for more centralized solutions. People start to feel a need for experts who can just solve problems for them. Those experts may be just a few volunteers originally, or they may form a separate platform team. Those experts can provide the rest of the developers with curated solutions, selecting those solutions like curators select art pieces for an exhibition. Here we start to see a separation of responsibilities. Experts choose solutions that work out of the box. They guarantee a certain level of support. Developers are free either to use those solutions, or to experiment with other tools, as long as they understand the consequences of those experiments, as long as they take the responsibility for those experiments.

For example, if Cassandra is provided as a curated solution, and if there is something wrong with your Cassandra clusters, developers may just rely on experts to have a set of metrics, a set of alerts to warn about the problem. Developers may rely on experts to diagnose and fix those Cassandra-specific problems. Basically, developers don't have to worry about becoming Cassandra experts themselves. Yet, developers are still free to use any non-standard solution. If they feel like they need to use Neo4j for their particular application, they can do it, as long as they are willing to take care of it themselves. Basically, we see the change in interaction, this stage introduces a clear separation of responsibilities. It doesn't matter yet if this separation is formal, as long as everybody has the same understanding, as long as clear expectations are set up between those two different roles.

What about the tooling at this stage? Let's start with bootstrapping. When it comes to bootstrapping, creating a new service, developers don't want to worry about rediscovering how to bootstrap an application, how to create new infrastructure, each time they create a new service. Having a single point of entry helps here. If you have a developer portal or a command line tool that you just use every day to find your applications, operate them, and create them, it makes it so much easier to find this paved path. You just use this tool or portal every day. You build your muscle memory. Common sense tells us that developers usually have less experience creating new services than maintaining or operating existing ones. If both creating and operating services use the same entry point, developers don't have to remember how to do this less frequent thing, that is, creating a service and setting up all the related infrastructure. Yes, developers may not remember all the steps. At least they remember that there is a big button called create a new service, and they know where to find it. Or if you have a command line tool, developers know how to run a help command and find the actual command they need to run.
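
A single-entry-point tool of this kind can be sketched with a plain sub-command CLI. The tool name `devtool` and its commands below are hypothetical, not any company's real tool; the point is that the rare `create` command lives next to the everyday `build` and `deploy` commands, one `help` away:

```python
# A sketch of a single-entry-point developer CLI: everyday commands and the
# rarer "create" command share one discoverable tool. Names are made up.
import argparse

def build_cli() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="devtool",
        description="Single entry point for everyday project tasks.")
    sub = parser.add_subparsers(dest="command", required=True)

    create = sub.add_parser("create", help="bootstrap a new service")
    create.add_argument("--name", required=True)
    create.add_argument("--stack", choices=["java", "node"], default="java")

    sub.add_parser("build", help="build the current project")
    sub.add_parser("deploy", help="deploy the current project")
    return parser

# Parsing the bootstrap command a developer would run once per new service.
args = build_cli().parse_args(["create", "--name", "billing", "--stack", "java"])
print(args.command, args.name, args.stack)
```

Running `devtool --help` would list all three sub-commands, which is exactly the muscle-memory discovery path described above.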

Let's talk about the next step. We can use this single point of entry to create whatever infrastructure, whatever resources are needed for the project bootstrapping. How is it done exactly? To do it properly, we need to treat our infrastructure as code. Not just operational infrastructure like Kubernetes clusters, load balancers, databases, we need to take care of all infrastructure. Including infrastructure that covers different aspects of software development, different things that are shifted left, like build and deploy, like testing, like security, and all other aspects of your software development. We can definitely do it now. We have multiple tools. We have things like GitHub Actions, the Jenkins pipeline DSL, other tools that we can just configure as code and use those solutions to address our problems. Why is it important? Why do we want to have that as code? Because this way, we can reuse this same code for different projects. Code can and should take inputs, and we can abstract implementation details away. We can let developers answer high level questions, and the code can use those answers as inputs to create the infrastructure we need, eliminating the need to do these repetitive tasks by hand again. It removes the cognitive load that exists when infrastructure is managed manually.
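
The idea of turning high-level answers into concrete resources can be sketched as a small planning function. The question keys and resource kinds below are invented for illustration; a real tool would emit Jenkins, Spinnaker, or Kubernetes definitions instead of dictionaries:

```python
# A sketch of bootstrap answers driving infrastructure-as-code: developers
# answer high-level questions, and code expands them into resources.
# Resource kinds and options here are hypothetical.

def plan_infrastructure(answers: dict) -> list[dict]:
    resources = [
        {"kind": "GitRepo", "name": answers["name"]},
        {"kind": "BuildJob", "name": f"{answers['name']}-build"},
    ]
    if answers.get("needs_database"):
        resources.append({"kind": "Database", "engine": "cassandra",
                          "name": f"{answers['name']}-db"})
    if answers.get("deploy_target") == "kubernetes":
        resources.append({"kind": "Deployment", "name": answers["name"],
                          "replicas": answers.get("replicas", 3)})
    return resources

plan = plan_infrastructure({"name": "billing", "needs_database": True,
                            "deploy_target": "kubernetes"})
for resource in plan:
    print(resource["kind"], resource["name"])
```

The developer only answered three questions; the repetitive expansion into individual resources is done by code, once, for every project.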

What code works best here? In my experience, imperative code may not be a good fit. Because, yes, you can write a script to create a new deployment pipeline, for example. Yet it also means that in your imperative code, you have to account for each and every edge case, because, basically, you're saying, please change the current state of the world in this particular way. If the current state is not something that you expect, you might face an error. What if the deployment pipeline you're trying to create already exists? What if it exists, but has a different configuration, what do you do then? How do you reconcile those states? You need to account for all of that. I think that a declarative approach works so much better for infrastructure. With a declarative approach you can describe any infrastructure resource, basically, by saying, I want the state of the world to be like that. Then your tools figure out whatever changes are necessary to achieve that particular goal. Declarative tools give you the benefit of idempotency, because no matter how many times you run them, they will leave your world in the same state. You don't have to create unique deployment scripts. You don't need to adjust those scripts each time you change something. If you need to make a change, you change the declaration itself, and the declarative tools will take care of whatever steps are needed to make it happen. If you need to provision a pipeline and it doesn't exist, with a declarative approach, it will be created. If it exists already, nothing will happen. If it exists, but there is a configuration discrepancy, it will be reconciled, and you don't have to do anything special, the process is the same in all cases.
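The create / update / no-op behavior described above is the heart of declarative reconciliation, and it fits in a few lines. This is a toy sketch, not any real tool's algorithm; the pipeline name and settings are invented:

```python
# A minimal sketch of an idempotent reconciliation loop: declare the desired
# state, and let reconcile() figure out whether to create, update, or do
# nothing. Resource names and settings are hypothetical.

def reconcile(desired: dict, actual: dict) -> tuple[dict, list[str]]:
    """Return the new world state plus the actions that were needed."""
    actions = []
    new_state = dict(actual)
    for name, config in desired.items():
        if name not in actual:
            actions.append(f"create {name}")
        elif actual[name] != config:
            actions.append(f"update {name}")
        new_state[name] = config
    return new_state, actions

desired = {"billing-deploy": {"tool": "spinnaker", "canary": True}}

# First run: the pipeline doesn't exist yet, so it gets created.
state, actions = reconcile(desired, {})
print(actions)  # ['create billing-deploy']

# Second run against the resulting state: nothing to do. That is idempotency.
state, actions = reconcile(desired, state)
print(actions)  # []
```

Note that there is no separate "migration script": changing the `desired` declaration and re-running the same loop produces exactly the update actions needed.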

Let's apply this approach to our paved path, where we have a separation of responsibilities, where we have experts who work on solutions that developers can use. Now experts can define the paved path using declarative tools. They can define Jenkins Jobs using the declarative pipeline DSL, for example. They can describe those jobs as code. They can set up automation that commits this code to the repositories that decide to use the paved path, and developers don't have to do anything about it. It just works. As for migrations, we can go back to our example when experts decide to introduce canary deployments, when they decide to migrate to Spinnaker instead of Jenkins. They can set up different declarative code that describes those Spinnaker pipelines, and they can keep using the same approach. Basically, they can keep delivering zero effort migrations to the developers. They just change the declaration and it modifies the infrastructure transparently. Yes, we added a bit of work that a handful of experts have to deal with, but it reduces a lot of serious cognitive load that unstructured manual solutions create, and it scales for everybody, first and foremost for the developers.

Let's try to make this example more concrete. Let's go back to the concept of a single point of entry. In Netflix, we use a command line tool called Newt. We use it for development, because it provides entry points for the most common actions for different kinds of projects. For projects like Java projects or Node.js projects, you can use Newt to build them, or to run and debug them locally. It's easy to find a command to bootstrap, to create a new project. Even if you don't remember it, you can always call Newt help. When a new project is created, Newt asks a bunch of human friendly, human readable questions, and then converts the answers into the infrastructure we talked about, like Jenkins Jobs, for example. Behind the scenes, it integrates with another in-house declarative tool that in turn takes care of talking to Jenkins, Spinnaker, and other solutions. Now experts can codify and templatize those resources that need to be created, like Jenkins Jobs and Spinnaker pipelines. Tooling can help developers configure Jenkins Jobs individually for each particular app. Some settings, like the application name or the Git repository URL, can just be set up automatically in the newly created jobs. On top of that, whole blocks, steps of the process, parts of the jobs can be configured conditionally using the inputs provided by developers. That's just a small part of the templating engine. Then those templates are rendered to provide whatever output is needed by the system they talk to, like Jenkins. They can use XML to describe Jenkins Jobs. They can use JSON to describe Spinnaker pipelines, and so on.

Then, that declarative tool can take care of creating or updating the resources as needed. Here we have a piece of templatized XML that can take inputs and produce a proper Jenkins Job. That solves the problem of bootstrapping. On top of that, it solves the problem of migration too. If experts need to make a change to the infrastructure, they can just make the change to this particular template. It can be something as serious as replacing a Jenkins Job with a Spinnaker pipeline, like in our theoretical example, or it can be something much more targeted, like replacing a Jenkins plugin in specific Jenkins Jobs, like we do here. Or it can address security issues in the AWS metadata service, like in this particular example. When you manage fleets of applications, you know that such changes are regular, they happen often. They happen every week, if not every day. If you have to do them manually, they just turn into death by a thousand cuts. If you can automate this change, if you can just update the declaration and roll it out to the whole fleet, you can save yourself days or weeks of herding the cats, of trying to pull developers away from their regular work, from coding, from solving business problems. You don't have to force them to do infrastructure migrations that are not in their duties anymore. When you have dozens of applications, that's already a significant win, and when you have hundreds or more, this approach is a lifesaver.
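
The templatized-XML idea can be sketched with Python's standard `string.Template`. The XML shape below is deliberately simplified and is not Jenkins' real `config.xml` schema; the app name and repository URL are made up:

```python
# A sketch of per-application templating: one template, rendered with each
# app's inputs, yields each app's job definition. The XML is illustrative,
# not the real Jenkins config schema.
from string import Template

JOB_TEMPLATE = Template("""\
<project>
  <description>Build job for $app</description>
  <scm>
    <url>$repo_url</url>
  </scm>
  <builders>
    <shell>./gradlew build</shell>
  </builders>
</project>""")

rendered = JOB_TEMPLATE.substitute(
    app="billing",
    repo_url="https://git.example.com/billing.git",
)
print(rendered)
```

A fleet-wide change, like swapping the build step, is now a one-line edit to the template, re-rendered and reconciled for every application.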

One Paved Path?

However, as the company grows, one paved path may not be enough. For more homogeneous companies, for more homogeneous teams, where all developers are working on the same set of problems, it totally makes sense to stick to the same set of solutions. Yet, if your company is big enough, it may no longer be the case. For example, in Netflix, developers who work on streaming projects have to deal with high load, high availability systems with clear traffic peaks, because a lot of people come home after work and start watching movies starting at 6 p.m., Eastern time. On the other hand, developers who work on studio automation solve problems that are no less important, but those problems have totally different aspects, different traffic patterns, different availability requirements, different latency requirements, and so on. What works for studio may not work for streaming and the other way around. For example, canary deployments, where you roll out a new version of your service to just a few percent of your user base. It works fine in streaming, but it may not provide a good reliable signal in studio, just because the size of the user base is so different, it may be different by a few orders of magnitude.

Paved Paths at Scale

A single shared path no longer works in this situation. To make the situation even more complicated, the paths of different paved paths may intersect. If both studio and streaming prefer Java for their backends, it makes sense to implement that part of the paved path only once and then compose those parts as needed. What's the best way to do it? How do we approach composable paved paths? What do we need to implement to enable them? First of all, a single set of experts is not going to work anymore. Yes, definitely, some of those experts can and should work on the pieces that can be reused across the board. But multiple composable paved paths actually require specialized knowledge. They require experts who are familiar with a particular domain area, with the problems of a specific organization. We're going to see an additional separation of duties here. Now we have organization level experts who can work on solutions to be used in their particular team, their particular organization. In turn, those experts can rely on common platform solutions that are provided by company level experts, like a platform organization. If we go back to our example of streaming versus studio, we may have streaming experts who solve problems that are specific to streaming, like, what does it take to safely redeploy a massive fleet of servers while handling regular daily traffic spikes? Platform experts can provide building blocks that can be reused by both streaming and studio. For example, they can provide managed database solutions or managed security solutions.

Second, when it comes to actual implementation, we're going to have paths that are identical, like running unit tests, and paths that are different, like using canary deployments, which work much better for high traffic applications. Implementing each path separately means that we are creating a risk of implementing identical things twice, which increases the maintenance burden, because things have to be implemented twice at the very beginning, and every change also has to be duplicated. In the standard paved path approach, we can just codify the best practices using declarative tools. However, when we have multiple paved paths, when they share certain practices, we need to split the code as well, to separate those different practices, to keep them apart. For example, if we have a paved path that has distinct steps like build and deploy, then each step should be codified and versioned separately. If those steps depend on each other, the dependencies should be set explicitly, like deploy depending on build, not the other way around.
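
Separately versioned steps with explicit dependencies can be modeled very simply. The step names, version numbers, and minimum-version constraints below are illustrative, not from any real system:

```python
# A sketch of paved-path steps codified as separately versioned blocks with
# explicit, directional dependencies: deploy depends on build, never the
# reverse. Versions are (major, minor, patch) tuples; all values are made up.

STEPS = {
    "build":  {"version": (2, 1, 0), "depends_on": []},
    "deploy": {"version": (3, 0, 0), "depends_on": [("build", (2, 0, 0))]},
}

def check_dependencies(steps: dict) -> list[str]:
    """Verify that every step's declared minimum versions are satisfied."""
    problems = []
    for name, step in steps.items():
        for dep_name, min_version in step["depends_on"]:
            actual = steps[dep_name]["version"]
            if actual < min_version:
                problems.append(
                    f"{name} needs {dep_name}>={min_version}, found {actual}")
    return problems

print(check_dependencies(STEPS))  # [] -- build 2.1.0 satisfies >=2.0.0
```

Because the dependency is declared in one direction only, a shared build step can serve several paved paths while each path versions its own deploy step independently.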

By doing this, we are going to get a quick win. Because if we know which version of infrastructure is used for which app, we can avoid the mess, like, I don't know what's happening with app x. We can avoid strict gatekeeping, like, it has to be this way because it's written like that, no other way is possible. We can separate different paved paths and provide different ways of doing things. What's most important, we can avoid big bang migrations, like, let's migrate everything first because this is the way things should be done, because we only have one paved path and we cannot do anything about it. Instead, we can try new solutions, we can try them with a few select apps first to see if we have any issues. If there are issues, we can safely go back to the previous good version, since we're using a declarative approach. We're saying, I want the state of the world to be like that. We're not doing imperative migrations, where we say, do action z with my existing resources to change their state from x to y, because this imperative approach might require another imperative migration to go back to the previous state in case something is wrong. That can be especially tricky if the original rollout was just partially successful, and the actual state is something between x and y.

Let's see how this approach plays out with our original migration example, where we migrate deployment pipelines from Jenkins to Spinnaker. We can have two big infrastructure blocks, like build and deploy. We can codify them separately. We can version them separately. We can define version requirements. Since we're making a breaking change, we can even increment a major version of our deploy step and roll it out first to just a small subset of our applications. Then, when we make sure that it doesn't break anything, we can roll it out to the whole fleet. This way, we can minimize the blast radius and maximize our confidence, because versioning each part of our infrastructure code separately gives us better visibility, better control. If we decide to keep splitting our paved path into two or more, we can keep deduplicating the common paths and we can keep separate paths separate. Basically, we apply the standard coding practices, versioning, code reuse, controlled rollouts, with our paved paths being represented as code. This approach is not limited to updating deployment pipelines. Any solution that you can codify declaratively can be managed this way, be it security, be it testing, be it something else. The result is a reduction of cognitive load during migrations for both the developers and the experts. Organization level experts don't have to reinvent the wheel for each separate path. They can reuse common building blocks and they can focus on solving problems, providing curated solutions that are common to a particular team, a particular part of their organization or company.
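
The staged rollout of a breaking major version can be sketched as simple version pinning. Application names and version strings here are invented; the mechanism, not the data, is the point:

```python
# A sketch of a controlled rollout: a breaking new major version of the
# "deploy" block is pinned to a small canary subset of applications, while
# the rest of the fleet stays on the stable version. All names are made up.

FLEET = ["billing", "search", "profiles", "playback", "recs"]
CANARY_APPS = {"billing"}

def pick_version(app: str, stable: str = "2.4.1",
                 next_major: str = "3.0.0") -> str:
    return next_major if app in CANARY_APPS else stable

rollout = {app: pick_version(app) for app in FLEET}
print(rollout)

# Only 'billing' runs 3.0.0. If it misbehaves, shrinking CANARY_APPS back to
# an empty set and re-reconciling declaratively returns the whole fleet to
# 2.4.1, with no bespoke rollback script.
```

Growing `CANARY_APPS` toward the full fleet is the whole migration; the declarative tooling turns each change of the pin set into the necessary pipeline updates.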

Example: gRPC vs. GraphQL

One example I can mention from my Netflix experience is splitting regular gRPC applications, which are mostly used in streaming, and GraphQL applications, which are common in studio, into two different paths. This way, organization level experts are able to focus on solving their specific tasks without worrying about the single paved path becoming too bloated, too complex to maintain and manage. Those paths have significant differences, and I'm only showing the significantly different parts here. For example, gRPC applications publish the protobuf, the gRPC schema, the gRPC contract, as a part of the client publishing and promotion process. It's actually included in the client's jar, be it a candidate jar or a release jar. Since those applications usually have more predictable traffic, they rely more on canaries. On the other hand, the GraphQL application deployment process is simpler. Those applications need to rely on a special pipeline stage to publish their GraphQL schemas, the GraphQL contract, because they are not published as jars, they are published to a standalone GraphQL registry. If we had kept all those solutions as part of one spaghetti-like paved path with countless configuration options, this approach would have been next to impossible. Separating those paths, teasing them apart, allowed experts from different organizations to work independently without stepping on each other's toes.

Things to Watch Out For

You might ask me, it all sounds great, but what are the downsides? What can possibly go wrong here? That's a great question. That's a loaded question, because any model looks great in theory, but in practice there's always something that catches you off guard. One of the most common problems in software engineering is solving an interesting problem instead of solving the problem you actually have. Many people want to do things like Google does, or like Netflix does, and very often it's the right thing to do. A lot of times, though, other, smaller companies don't even have the same kinds of problems that larger companies suffer from. When it comes to DevOps, when it comes to choosing the right tool and the right approach, it's the same. We have been talking about different stages of the DevOps journey. The tools you're going to need at each stage are different too. It definitely makes sense to first see where you are. Maybe you're just starting, and you only have a handful of services to manage. Maybe you already have hundreds of services. It also makes sense to check the dynamics. If you're going through explosive growth, and you think it's very likely that 15 of your services will turn into 200 pretty soon, then it may make sense to start planning for the next step, to make sure that you don't commit to a solution that will hold you back in the near future.

Also, you want to find the right level of abstraction, because too much abstraction may not be a good thing. As you hide things away, you may end up hiding things that don't seem essential for day-to-day operation. You may think that's a great idea, that it's going to reduce cognitive load, but those things might be critical when you're dealing with a production incident. When you're investigating it. When you're looking at a broken service. When you're trying to find out what is wrong. As my colleague Lorin once noted on Twitter, it's impossible to find an abstraction that is going to work for everybody. But you can make it possible for people to navigate through layers of abstraction, to understand what actually happens. How do you navigate abstractions? Two things are important there. First, you need to understand the effects of your change. You need to understand, what is going to happen if I make this change? Which systems are going to be affected? Second, you need to understand the cause of a change that has already happened. I need to be able to answer things like, if I see my system in this particular state, what was the change that caused it? If your abstractions can help you with that, then you have the right level of abstraction. I just want to highlight that ensuring observability in your abstractions may actually reduce cognitive load both in day-to-day operations and during production incidents, during the investigation.

Where Do I Start?

You may ask, where do I start? That's another good question. I'm not going to advocate for any specific tools, but I can give you a few examples that you can start with. You can research them, and you can see if they or their competitors are a good fit for you. You can start with Terraform. It's a well-known player when it comes to infrastructure as code. Its plugin system is extensible, so if you need to add custom solutions, it's easy to do. Its modularization and versioning approach is worth looking into, because it can provide a way to version your infrastructure and manage your paved path. Another declarative tool that I want to highlight is Spinnaker's managed delivery. It focuses on continuous delivery and the related infrastructure. I believe it's certainly worth looking into. Of course, there are also things like GitHub Actions or the Jenkins pipeline DSL. Those tools are more generic, and they may require other tools to solve your particular problem, but they also provide a good degree of flexibility. You can start there. You can research those tools, you can research their competitors, and you can see what is the best fit for your particular situation.

Am I Going the Right Way?

You also may ask: I have started working on my solutions, on my infrastructure, build, deploy, test, security, whatever problems I'm trying to solve. I'm working on improving my developers' experience and reducing cognitive load. How can I tell if I'm going the right way? How do I know if my efforts are actually useful? There are metrics that you can use to see if your efforts are producing the right results. The most obvious set of metrics is the change rate and the change failure rate. Basically, you can check how fast you deploy changes, and how often deployments fail. Those are proxy metrics. They are affected by other factors, but they also measure a really important thing: your ability to deliver quality code. Another set of metrics may be the time to create, to bootstrap, a service, and the time to reconfigure the service. You may measure the time it takes to update the infrastructure, the time to perform a migration. You may measure it across a single service and across the whole fleet of services if you choose the right abstractions. If you actually reduce the cognitive load, you should see a reduction in these times as well, and vice versa. If it takes longer to make a change, then your abstractions might need some tuning.
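As a minimal, hypothetical illustration of the proxy metrics mentioned above (the log format and the numbers are invented for the example), change rate and change failure rate can be computed from a simple deployment log:

```python
# Toy deployment log: one record per deploy, with an observed outcome.
deployments = [
    {"day": 1, "failed": False},
    {"day": 1, "failed": True},
    {"day": 2, "failed": False},
    {"day": 3, "failed": False},
]

days_observed = 3

# Change rate: how fast you deploy changes (deploys per day).
change_rate = len(deployments) / days_observed

# Change failure rate: what fraction of deployments fail.
failures = sum(1 for d in deployments if d["failed"])
change_failure_rate = failures / len(deployments)

print(change_rate, change_failure_rate)  # → 1.3333333333333333 0.25
```

The same shape of computation works per service or across the whole fleet; tracking how these numbers move after a tooling change is what tells you whether the change helped.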

Questions and Answers

Tucker: Is there a path at Netflix from a custom solution that used to be an experiment to becoming a curated solution? Is there a concept of a grassroots curated solution?

Protsenko: There are two sides to this problem. Definitely, I can think of examples where developers at Netflix started working on some solution, realized it worked for their teams, and it was scaled. At Netflix, we actually have a notion of local central teams, of teams and developers that are responsible for helping the developers in their particular organizations, like local organization-level experts that have the expertise. That does relate to the grassroots question, to scaling grassroots solutions. One thing that wasn't in the question that I'd like to elaborate on: yes, if you have the tooling in place, if you can define your infrastructure and scale it, you may be free to scale your homebrewed solution. There is a caveat. It's not just about scaling a particular technical solution or approach, because these solutions have to be maintained. The developer will have to wear an expert's hat, and they will have to be responsible for doing customer support and performing migrations, and that can be a barrier.

Unfortunately, there is no magic way to get rid of that barrier. We can try to lower the threshold, especially when we're dealing with support at scale, when we have dedicated support teams helping those developers/experts to scale the approach, but it's something that you have to be aware of before you go into this territory. Another thing that I have seen examples of being used and incorporated as a paved path solution is having an escape hatch in your paved path, when your paved path defines some custom hooks, some custom plugins, that other people can lean into. When experts see, say, the same Jenkins jobs being defined and called by many customers through such a custom hook, and there is a lot of need for that, they may decide to take that burden off the customers and incorporate and maintain the solution themselves. It's less about the tooling. It's more about the communication and the support burden. That's the stuff I wanted to emphasize here.

Tucker: Can you share an example of a real-world migration that you've simplified or enabled using the building blocks you described? How did that go? What was the impact of that?

Protsenko: One example is the AWS metadata service introducing a new security option that had to be enabled across the whole fleet. Just flipping this flag to true across hundreds and thousands of applications is a task that would probably take months to perform manually. When those applications are codified, it's just a matter of changing this flag and rolling it out. Things like introducing IPv6 for services, or enabling specific types of canaries, where you have to replace a certain stage inside a Spinnaker pipeline, are other good examples. The change by itself is not hard: there is a new canary type being rolled out, canaries are already incorporated, and we're just doing the replacement. But sending an email to the developers and asking them to do it will lead to huge delays; they have other, better stuff to do. When everything is codified, we just make the change, roll it out to a small set of applications to make sure nothing is broken, and then propagate the change to the rest of the applications using the paved path.
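A hedged sketch of what such a codified fleet-wide change could look like (the config shape and the `require_imds_v2` flag name are illustrative guesses for a metadata-service security option, not the actual AWS or Netflix schema):

```python
# Hypothetical codified fleet: one config record per application.
fleet = [{"app": f"svc-{i}", "require_imds_v2": False} for i in range(1000)]

def flip_flag(configs, flag):
    """Set a boolean flag to True across a set of codified app configs."""
    for cfg in configs:
        cfg[flag] = True

# Staged rollout: flip the flag for a small subset first, verify nothing
# breaks, then propagate the same one-line change to the rest of the fleet.
canaries, rest = fleet[:10], fleet[10:]
flip_flag(canaries, "require_imds_v2")
flip_flag(rest, "require_imds_v2")

print(all(cfg["require_imds_v2"] for cfg in fleet))  # → True
```

The point is that the change itself is trivial once applications are codified; the months of manual work come from chasing hundreds of teams, and that is exactly what the paved path removes.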

Tucker: Do you think of your paved path as a product where a team is accountable? I want to elaborate on that a little bit. Could you walk us through for your use cases like the streaming, and the studio use case, who was responsible for curating that paved path? How did that go?

Protsenko: With both examples, we have separate teams responsible for maintaining their respective paved paths. I'd like to focus on the product part, because I already mentioned wearing a product manager hat. I have to admit, as a part of the team that is actually responsible for curating streaming solutions, we don't always have access to dedicated product managers. We did have to wear that hat. We went and interviewed the customers. We actually created a framework for assessing the importance of different problems. We used dimensions like the impact, how important a specific problem is, and the blast radius, the reach, how many customers are affected by a specific problem. Looking through these dimensions, we tried to sort out the main pain points. Because there are more problems than we can solve at any given time, sorting and prioritizing them is important before you start actually incorporating them into the paved path. That is the biggest challenge when you approach paved paths as products.




Recorded at:

Jun 23, 2023