Clare Liguori on Automating Safe and “Hands-Off” Deployments at AWS

Feb 22, 2021

In this podcast Clare Liguori, Principal Software Engineer at Amazon Web Services, sat down with InfoQ podcast host Daniel Bryant and discussed: the implementation of continuous delivery at AWS, the use of automation and deploying to multiple test environments, and the benefits of canary releasing.

Key Takeaways

At Amazon, a typical continuous delivery pipeline has four major phases: source, build, test, and production. For a particular microservice, there might be multiple different pipelines that are deploying different types of changes, e.g., application code, infrastructure code, operating system patches, etc.
Every change that goes out to production at Amazon is code reviewed, and pipelines enforce this. With automated “full CD” there are no human interactions after a change has been reviewed and pushed into the source code repository before it gets deployed to production.
There are multiple pre-production environments at Amazon: alpha, beta, and gamma. Alpha is typically scoped to the changes that are being deployed by a particular pipeline. Beta starts to exercise the full stack. And gamma is intended to be the most stable and production-like as possible.
Amazon uses canary deployments fairly broadly across all applications. Typically, a team will break down production deployments into individual region deployments, and then even further into individual zonal deployments.
When implementing continuous delivery, start small. This might be as simple as consistently deploying to a test environment and running integration tests.

Transcript

00:21 Introductions

00:21 Daniel Bryant: Hello, and welcome to the InfoQ podcast. I'm Daniel Bryant, News Manager here at InfoQ, and Director of Dev-Rel at Ambassador Labs. In this edition of the podcast, I had the pleasure of sitting down with Clare Liguori, principal software engineer at Amazon Web Services. I've followed Clare's work for many years, having first seen her present live, at AWS re:Invent back in 2017. I've recently been enjoying reading the Amazon Builders' Library article series published on their AWS website. Clare's recent article here, titled "Automating Safe Hands-Off Deployments" caught my eye. The advice presented was fantastic and provided insight into how AWS implement continuous delivery and also provided pointers for the rest of us that aren't quite operating at this scale yet. I was keen to dive deeper into the topic and jumped at the chance to chat with Clare.

01:02 Daniel Bryant: Before we start today's podcast, I wanted to share with you details of our upcoming QCon Plus virtual event, taking place this May 17th to 28th. QCon Plus focuses on emerging software trends and practices from the world's most innovative software professionals. All 16 tracks are curated by domain experts, with the goal of helping you focus on the topics that matter right now in software development. Tracks include: leading full-cycle engineering teams, modern data pipelines, and continuous delivery workflows and patterns. You'll learn new ideas and insights from over 80 software practitioners at innovator and early adopter companies. The event runs over two weeks for a few hours per day, and you can experience technical talks, real-time interactive sessions, async learning, and optional workshops to help you learn about the emerging trends and validate your software roadmap. If you are a senior software engineer, architect, or team lead, and want to take your technical learning and personal development to a whole new level this year, join us at QCon Plus this May 17th through 28th. Visit qcon.plus for more information.

01:58 Daniel Bryant: Welcome to the InfoQ podcast, Clare.

02:00 Clare Liguori: Thanks. I'm excited to be here.

02:02 Daniel Bryant: Could you briefly introduce yourself for the listeners please?

02:04 Clare Liguori: Yeah. I'm Clare Liguori. I'm a principal engineer at AWS. I have been at Amazon Web Services for about six years now, and currently I focus on developer tooling for containers.

02:17 Based on your work at AWS, what do you think are the primary goals of continuous delivery?

02:17 Daniel Bryant: Now, I've seen you present many times on the stage, both live and watching via video as well, and I've read your blog post recently. That's what got me super excited about chatting to you when I read the recent AWS blog posts. So we're going to dive into that today. What do you think are the primary goals of continuous delivery?

02:32 Clare Liguori: I think for me, sort of as a continuous delivery user, as a developer, I think there's two things. I think one is just developer productivity. Before I came to Amazon and in Amazon's history as well, we haven't always done continuous delivery, but without continuous delivery, you spend so much of your time as a developer, just managing deployments, managing builds, managing all of these individual little steps that you have to do, that really just take you away from what I want to be doing, which is writing code.

03:04 Clare Liguori: Then I think the other side of that is just human error that's involved in doing all of these steps manually. It can really end up affecting your customers, right? You end up deploying the wrong package accidentally, or not rolling back fast enough manually. And so continuous delivery, it takes a lot of the human error and sort of the fear and the scariness for me, out of sending my code out to production.

03:29 Daniel Bryant: I like that a lot. I've been there running shell scripts, never a good look, right?

03:34 Clare Liguori: Exactly.

03:34 What does a typical continuous delivery pipeline look like at Amazon?

03:34 Daniel Bryant: What does a typical continuous delivery pipeline look like at Amazon?

03:36 Clare Liguori: At a high level, I like to describe it as four basic stages. One is going to be source. That can be anything from your application source code, your infrastructure as code, patches that need to be applied to your operating system. Really anything that you want to deploy out to production is a source and a trigger for the pipeline. The next stage is going to be build. So compiling your source code, running your unit tests, and then pre-production environments where you're running all of your integration tests against real running code, and then finally going out to production.

04:13 Clare Liguori: One of the things that is great about the Amazon pipelines is that it's very flexible for how we want to do those deployments. So one of the things that I talk about in the blog post is how we split up production to really reduce the risk of deploying out to production. Before I came to Amazon, one of my conceptions of pipelines was that deploying to production was really sort of a script, a shell script or a file that has a list of commands that need to run, and deploying to production was kind of an atomic thing. You have one change that starts running this script, and then nothing else can be deployed until that script is done.

04:52 Clare Liguori: At Amazon, we split production into such a large number of deployments, to reduce the risk of those deployments and do very, very small scoped deployments out to customer workloads, that it's super important that we don't think of it as this atomic script. It's actually more of a lot of parallel workflows going on that promote changes one to the other. And so production might look like literally 100 different deployments, but we're not waiting for all 100 deployments to succeed from an individual change. Once that first deployment is done, the next deployment can go in. And so we can have actually 100 changes flowing through that pipeline through those 100 deployments. So that was one of the most unique things that I found in coming to Amazon and learning about how we do our pipelines.
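
One way to picture the difference between an atomic script and changes being promoted from one deployment to the next is a small simulation. The sketch below is only an illustration and not Amazon's tooling; it assumes every deployment takes one equal "tick" and that each deployment handles one change at a time, and the function names are hypothetical.

```python
def atomic_ticks(num_changes, num_deployments):
    """Atomic model: a change must finish every deployment before the next starts."""
    return num_changes * num_deployments


def pipelined_ticks(num_changes, num_deployments):
    """Pipelined model: as soon as a deployment finishes one change, it takes the next."""
    slots = [None] * num_deployments  # slots[i] is the change currently in deployment i
    next_change = 0
    done = 0
    ticks = 0
    while done < num_changes:
        # Admit the next waiting change into the first deployment if it is free.
        if slots[0] is None and next_change < num_changes:
            slots[0] = next_change
            next_change += 1
        ticks += 1  # every occupied deployment runs for one tick
        # The change in the last deployment is finished; all others are promoted.
        if slots[-1] is not None:
            done += 1
        slots = [None] + slots[:-1]
    return ticks


print(pipelined_ticks(100, 100), atomic_ticks(100, 100))  # 199 10000
```

Under these simplified assumptions, 100 changes flowing through 100 deployments finish in 199 ticks rather than 10,000, because up to 100 changes are in flight at once.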

05:41 How many pipelines does a typical service have at Amazon?

05:41 Daniel Bryant: I liked the way you sort of broke down the four stages there. So I'd love to go a bit deeper into each of those stages now. I think many folks, even if they're not operating at Amazon scale, could definitely relate to those stages. So how many pipelines does a typical service have at Amazon? As I was reading the blog post, I know there's a few. Does this help with the separation of concerns as well?

05:59 Clare Liguori: I would say the typical team, it is a lot of pipelines. For a particular microservice, you might have multiple different pipelines that are deploying different types of changes. One type of change is the application code, another type of change would be maybe the infrastructure as code, and any other kind of changes to production, like I mentioned before, even patches that need to be applied to the operating system. Some of those get combined into, we have a workflow system that lets you do multiple changes going out to production at the same time. But the key here is really the rollback behavior for us. For me, pipelines are all about helping out with human error and doing the right thing before a human even gets involved. And so rollbacks are super important to us.

06:50 Clare Liguori: With the pipelines being separated into these different types of changes, it's really easy to roll back just that change that was going out. You made a change to just the CloudFormation template for your service, so you can just roll back that CloudFormation template.

07:04 Clare Liguori: But one of the things we're starting to see and why some of these pipelines are getting combined now is that application code is also going out through infrastructure as code, especially with things like containers, with things like Lambda functions. A lot of that is modeled in infrastructure as code as well. And so the lines are starting to blur between application code and infrastructure as code. And so a little bit, we're starting to get into combining those into the same pipeline, but being able to roll back sort of multiple related changes.

07:37 Clare Liguori: So if you have a DynamoDB table change and an ECS (Elastic Container Service) change going out at the same time, the workflow will deploy them in the right order and then roll them back in the right order. So it becomes a lot easier to sort of reason about how are these changes flowing out to production, but you still get that benefit of being able to roll back automatically the whole thing without a person ever having to get involved and make decisions about what that order should be.
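
The idea of deploying related changes in order and undoing them in the opposite order can be sketched in a few lines of Python. This is a minimal illustration, not the workflow system described in the interview: the step names and the `apply`/`undo` callables are hypothetical, and a failed step is assumed to have made no change of its own.

```python
def deploy_with_rollback(steps, log):
    """Apply steps in order; on failure, undo completed steps in reverse order.

    Each step is a (name, apply, undo) tuple of callables.
    Returns True if every step succeeded, False if a rollback happened.
    """
    completed = []
    for name, apply, undo in steps:
        try:
            apply()
        except Exception:
            log.append(f"failed {name}")
            # Undo the most recently deployed change first.
            for done_name, done_undo in reversed(completed):
                done_undo()
                log.append(f"rolled back {done_name}")
            return False
        log.append(f"deployed {name}")
        completed.append((name, undo))
    return True


def failing_check():
    raise RuntimeError("post-deploy check failed")


log = []
steps = [
    ("dynamodb-table", lambda: None, lambda: None),
    ("ecs-service", lambda: None, lambda: None),
    ("post-deploy-check", failing_check, lambda: None),
]
ok = deploy_with_rollback(steps, log)
print(ok)   # False
print(log)  # ['deployed dynamodb-table', 'deployed ecs-service',
            #  'failed post-deploy-check', 'rolled back ecs-service',
            #  'rolled back dynamodb-table']
```

Keeping the undo action next to each apply action is what lets the rollback order be derived automatically, with no person deciding it at the time of the failure.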

08:06 How are code reviews undertaken at Amazon, and how do you balance the automation, e.g. linting versus human intuition?

08:06 Daniel Bryant: You mentioned that there's the human balance and automation. How are code reviews undertaken at Amazon, and how do you balance the automation, e.g. linting versus human intuition?

08:17 Clare Liguori: Every change that goes out to production at Amazon is code reviewed, and our pipelines actually enforce that, so they won't let a change go out to production if it somehow got pushed into a repository and has not been code reviewed. I think one of the most important things about what we call full CD is that there are no human interactions after that change has been pushed into the source code repository before it gets deployed to production. And so really the last time that you have a person that's looking at a change, evaluating a change is that code review. And so code review starts to take on multiple purposes.

08:59 Clare Liguori: One is you want to review it for just the performance aspects, the correctness aspects, is this code maintainable, but it's also about is this safe to deploy to production? Are we making any backwards-incompatible changes that we need to change to be backward compatible? Or are we making any changes that we think are going to be performance regressions at high scale? Or is it instrumented enough so that we can tell when something's going wrong? Do we have alarms on any new metrics that are being introduced? And so a lot of teams, that's a lot to sort of have in your mind when you're going through some of these code reviews, that's a lot to evaluate. And so one of the things that I evangelize a bit on my teams is the use of checklists in order to think through all of those aspects of evaluating this code, when you're looking at a code review.

09:52 How do you mock external services or dependencies on other Amazon components?

09:52 Daniel Bryant: How do you mock external services or dependencies on other Amazon components?

09:58 Clare Liguori: So this is an area which differs between whether we're doing unit tests or whether we're doing integration tests. Typically, in a unit test, our builds actually run in an environment that has no access to the network. We want to make sure that our builds are fully reproducible, and so a build would not be calling out to a live service because that could change the behavior of the build, and all of a sudden you can end up with very flaky unit tests running in your build. And so in unit tests, typically we would end up mocking, using, my favorite is Mockito for Java. We would end up mocking those other services, mocking out the AWS SDK that we're using, and making some assumptions along the way about what is the behavior going to be of that service? What error code are they going to return to us for a particular input? Or what is the success code going to be when it comes back? What's the response going to be?

10:57 Clare Liguori: The way that I see integration tests is really an opportunity to validate some of those assumptions against the real live service. And so integration tests almost never mock the dependency services. So if we're calling DynamoDB, or S3, the service would actually call those production services in our pre-prod stages during integration tests, and really run through that actual full stack of getting the request, processing that request, storing it in the database, and returning a response. And so we get to validate what is the response from DynamoDB going to be for a request that comes from the integration test in a pre-prod environment.
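
To make the unit-test side of this concrete, here is a small sketch using Python's standard `unittest.mock` in place of Mockito. It is an illustration only: `get_display_name` and the field names are hypothetical, and the `get_item(Key=...)` call shape returning an optional `"Item"` is an assumption modeled loosely on a DynamoDB table client. The mocked return values are exactly the kind of assumptions an integration test would later check against the real service.

```python
from unittest.mock import Mock


def get_display_name(table, user_id):
    """Return the user's name, or "unknown" if the item is missing."""
    response = table.get_item(Key={"user_id": user_id})
    item = response.get("Item")
    return item["name"] if item else "unknown"


def test_known_user():
    table = Mock()  # stands in for the real client, so no network access is needed
    # Assumption about the service: a found item comes back under "Item".
    table.get_item.return_value = {"Item": {"user_id": "u1", "name": "Ada"}}
    assert get_display_name(table, "u1") == "Ada"
    table.get_item.assert_called_once_with(Key={"user_id": "u1"})


def test_missing_user():
    table = Mock()
    # Assumption about the service: a missing item yields a response without "Item".
    table.get_item.return_value = {}
    assert get_display_name(table, "u2") == "unknown"


test_known_user()
test_missing_user()
```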

11:38 Could you explain the functionality and benefits of each of the pre-prod environments, please?

11:38 Daniel Bryant: In your blog post you mentioned the use of alpha, beta, and gamma pre production environments. Could you explain the functionality and benefits of each of the pre-prod environments, please?

11:47 Clare Liguori: I think looking at the number of pre-prod environments that we have, it's really about building confidence in this change before it goes to production. And so what we tend to see with alpha, beta, and gamma is more and more testing as the pipeline promotes that change between environments, but also how stable those environments tend to be.

12:11 Daniel Bryant: Oh, okay. Interesting.

12:13 Clare Liguori: Typically, that first stage in the environment, it's probably not going to be super stable. It's going to be broken a lot by changes that are getting code reviewed, but missing some things and getting deployed out in the pipeline. But then as we move from alpha to beta, beta to gamma, gamma tends to be a bit more stable than alpha, right? So alpha is typically where we would run tests that are really scoped to the changes that are being deployed by that particular pipeline. What I mean by that is running some very simple, maybe smoke tests or synthetic traffic tests against just that microservice that was being deployed by that pipeline.

12:53 Clare Liguori: Then as we get into beta, beta starts to exercise really the full stack. These systems are lots of microservices that work together to provide something like an AWS service. And so typically, beta will be an environment that has all of these microservices in it, within a particular team's service or API space. The test will actually exercise that full stack. They'll go through the front end, call those front end APIs that we're going to be exposing to AWS customers in production. And that ends up calling all the different backend services, and going through async workflows and all of that, to complete those requests. The integration tests really end up showing that yes, with this change, we haven't broken anything upstream or downstream from that particular microservice.

13:42 Clare Liguori: Then gamma is really intended to be the most stable and is really intended to be as production-like as possible. This is where we start to really test out some of the deployment safety mechanisms we have in production to make sure that this change is actually going to be safe to deploy to production. It goes through all the same integration tests, but it also starts running things like monitoring canaries with synthetic traffic, that we also run against production. It runs through all of the same things like canary deployments, deploying that change out to the gamma fleet and making sure that that's going to be successful.

14:21 Clare Liguori: We actually even alarm on it, at the same thresholds as in production. We make sure that that change is not going to trigger alarms in production before we actually get there, which is very nice. Our on-call engineers love that. We really try to make sure that even before that change gets to a single instance or a single customer workload, that we've tested it against something that's as production-like as possible.

14:46 How do engineers design and build integration tests?

14:46 Daniel Bryant: I think you've covered this a little bit already, but I was quite keen to dive into how do engineers design and build integration tests? Now, I've often struggled with this, who owns an integration test that spans more than one service, say? So who is responsible? Is it the engineer making the changes, perhaps the bigger team you've mentioned, or is there a dedicated QA function?

15:05 Clare Liguori: So typically within AWS, there are very few QA groups. So really, we practice full stack ownership, I would call it, with our engineers on our service teams, that they are responsible for not only writing the application code, but thinking about how that change needs to be tested, thinking about how that change needs to be monitored in production, how it needs to be deployed so they build their pipelines, and then how that change needs to be operated. So they're also on call for that change. And so integration tests are largely either written by their peers on their team or by themselves for the changes that they're making.

15:48 Clare Liguori: I think one of the interesting things about integration tests in an environment like this, where there's so many microservices, is that it is very difficult to write a test that covers the entire surface area for a very complex service that might have hundreds of microservices behind it. One of the things that tends to happen is that you write integration tests for the services really that your team owns. And so one of the things that becomes important is understanding really what's the interface, the APIs that this team owns, that we're going to be surfacing to other teams? And making sure that we're really testing that whole interface, that other teams might be hooking into, or customers might be calling if there are front end APIs for AWS customers, and then making sure that we have coverage around all of those, that we're not introducing changes that are going to break those callers.

16:41 Do all services have to be backwards compatible at Amazon to some degree? And if so, how do you test for that backwards compatibility?

16:41 Daniel Bryant: I'm really curious around backwards compatibility because that is super hard again, from my experience, my Java days as well. Do all services have to be backwards compatible at Amazon to some degree? And if so, how do you test for that backwards compatibility?

16:55 Clare Liguori: One of the funny things that I tend to tell teams is, diamonds are forever. We've all heard that. But APIs are also forever.

17:03 Daniel Bryant: Yes. I love it.

17:03 Clare Liguori: When we release a public AWS API, we really stand by that for the long term. We want customers to be able to confidently build applications against these APIs, and know that those APIs are going to continue working over the long term. So all of our AWS APIs have a long history of being backwards compatible. But we do have to do testing to ensure that. And so that comes back to the integration tests are really documenting and testing, what is the behavior that we are exposing to our customers, and running through those same integration tests on every single change that goes out.

17:46 Clare Liguori: One question that I get a lot is, "Do you only test for the APIs that have changed?" No, we always run the same full integration test suite because you never know when a change is going to break or change the behavior of an API. Things like infrastructure as code changes could change the behavior of an API if you're changing the storage layer or the routing layer. Lots of unexpected changes could end up changing the behavior of an API. And so all of these changes that are going out to production, regardless of which pipeline it is, run that same integration test suite.

18:21 Clare Liguori: Then one of the things that I've noticed that helps us with backward compatibility testing is canary deployments, so deploying out to what we call one-box deployments, deploying out to a single virtual machine, a single EC2 instance, or a single container, or a little small percentage of Lambda function invocations. Having that new change running really side-by-side with the old change, actually helps us a lot with backward compatibility because things like if we're changing a NoSQL database schema, for example, that one can very easily bite you, because all of a sudden you're writing this new schema that the old code can't actually read. You definitely don't want to be in that state in production, of course.

19:08 Clare Liguori: Going back to our gamma stages, also doing canary deployments, they are exercising both of those code paths, old and new, during that canary deployment. And so we're able to get a little bit of backward compatibility testing just through canary deployments, which is fun.
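
The schema hazard described above, where old and new code run side by side against the same stored items, can be illustrated with a short sketch. Everything here is hypothetical (the order record, its field names, and the `"USD"` default); it shows one common way to keep a change compatible, which is to only add optional fields and give readers a default.

```python
def read_order_v1(item):
    """Old code path: reads only the fields it knows and ignores the rest."""
    return {"order_id": item["order_id"], "total_cents": item["total_cents"]}


def write_order_v2(order_id, total_cents, currency):
    """New code path: keeps every v1 field and only adds a new optional one."""
    return {"order_id": order_id, "total_cents": total_cents, "currency": currency}


def read_order_v2(item):
    """New code path: tolerates items written by old code, which have no currency."""
    return {
        "order_id": item["order_id"],
        "total_cents": item["total_cents"],
        "currency": item.get("currency", "USD"),  # assumed default for old items
    }


new_item = write_order_v2("o-1", 1250, "EUR")
old_item = {"order_id": "o-2", "total_cents": 900}

# During a canary deployment both directions happen at once:
print(read_order_v1(new_item))  # old code reading an item written by new code
print(read_order_v2(old_item))  # new code reading an item written by old code
```

Had the new writer renamed or removed `total_cents`, `read_order_v1` would raise a `KeyError`, which is the kind of failure a one-box deployment running next to the old fleet would surface.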

19:24 Do you typically perform any load testing on a service before deploying it into prod?

19:24 Daniel Bryant: Something you mentioned there, the one box deploys, I'm kind of curious, do you typically perform any load testing on a service before deploying it into prod?

19:31 Clare Liguori: Load testing tends to be one of those things that depends on the team. We give teams a lot of freedom to do as much testing that they need to for the needs of their service. There are teams that do load testing in the pipeline as an approval step for one of the pre-production stages. If there are load tests that they can run that would find performance regressions at scale and things like that, some teams do that.

19:58 Clare Liguori: Other teams find it hard to build up sufficient load in a pre-production environment compared to the scale of some of our AWS regions and our customers. And so sometimes that's not always possible. Sometimes it looks like, when we're making a large architectural change, doing more of a one-off load test that really builds up the scale to what we call testing to break in a pre-production environment, where we're trying to find that next bottleneck for this architectural change that we need to make. So, it really depends.

20:29 Clare Liguori: A lot of what we do at AWS is really, we give teams the freedom since they own that full stack, to make choices for their operational excellence for that. But generally, the sort of minimum bar would be unit tests, integration tests, and the monitoring canaries that are running against production.

20:47 Do you use canary releasing and feature flagging for all production deployments?

20:47 Daniel Bryant: Do you use canary releasing and feature flagging for all production deployments?

20:51 Clare Liguori: We use canary deployments fairly broadly across all of our applications. Canary deployments are interesting in that there are some changes that are just difficult, to impossible to do as a canary deployment. The example that usually comes to mind is something like a database, or really any kind of infrastructure. So making a change to a load balancer is a little hard to do as a canary deployment in CloudFormation. But generally speaking, wherever we can, we do canary deployments in production. The reason for that is that it limits the potential impact of that change in production to a very, very small percentage of requests or customer workloads. Then during the canary deployment, once we've deployed to that what we call one box, we'll let that bake there for a little bit.

21:43 Clare Liguori: One of the things that we tend to see is changes don't always trigger alarms immediately. They often will trigger alarms maybe 30 minutes after, maybe an hour after. And by that point, if you're doing a canary deployment and you deploy to this one EC2 instance, that might not take very long, and then you kind of deploy it out to the rest of the production fleet pretty quickly, and all of a sudden you've now rolled out this change that is triggering alarms to a much broader percentage of your fleet. So bake time for us has been a really important enhancement on traditional canary deployments in order to ensure that we're finding those changes before they get out to a large percentage of our production capacity.
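
A canary deployment with bake time can be sketched as a small control loop. This is an illustrative outline under stated assumptions, not how any AWS service implements it: the callables (`deploy_one_box`, `deploy_rest`, `rollback`, `alarm_firing`) are hypothetical hooks, and the durations are arbitrary example values.

```python
import time


def canary_deploy(deploy_one_box, deploy_rest, rollback, alarm_firing,
                  bake_seconds, poll_seconds, sleep=time.sleep):
    """Deploy to one box, bake while watching alarms, then continue or roll back."""
    deploy_one_box()
    waited = 0
    while waited < bake_seconds:
        if alarm_firing():
            rollback()
            return "rolled_back"
        sleep(poll_seconds)
        waited += poll_seconds
    # One last check at the end of the bake period before going wider.
    if alarm_firing():
        rollback()
        return "rolled_back"
    deploy_rest()
    return "deployed"


# Example: the alarm only starts firing on the third check, well after the
# one-box deployment itself has finished. A no-op sleep keeps the demo instant.
events = []
checks = iter([False, False, True])
result = canary_deploy(
    deploy_one_box=lambda: events.append("one_box"),
    deploy_rest=lambda: events.append("rest_of_fleet"),
    rollback=lambda: events.append("rollback"),
    alarm_firing=lambda: next(checks),
    bake_seconds=3600,
    poll_seconds=600,
    sleep=lambda seconds: None,
)
print(result, events)  # rolled_back ['one_box', 'rollback']
```

Without the bake loop, the same change would have reached the rest of the fleet before the delayed alarm fired.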

22:27 How do you roll out a change across the entire AWS estate?

22:27 Daniel Bryant: So how do you roll out a change across the entire AWS estate? The blog post mentioned the use of waves, I think, and I'm sure I wasn't super familiar with the term and listeners may not be as well. I've also got a follow-up question, I guess there, is do the alert thresholds change as more boxes and AZs, regions come online with this new change?

22:47 Clare Liguori: Rule number one in deployments in AWS is that we never want to cause a multi-region or multi-AZ impact from a change. Customers really rely on that isolation between regions. And so that really drives how we think about reducing the percentage of production capacity that we are deploying to, at any one time with a new change.

23:14 Clare Liguori: Typically, a team will break down those production deployments into individual region deployments, and then even further into individual zonal deployments, so that we're never introducing a change to a lot of regions at a time, all at the beginning. Now, one of the challenges is that as we've scaled, we have a lot of zones, we have a lot of regions. Doing those one at a time can take a long time. And so we're really having to balance here the speed at which we can deliver these changes to customers. Of course, we want to get new features out to customers as fast as possible, but also balancing the risk of deploying very, very quickly, globally across all of our regions and zones. And so waves help us to balance those two things.

24:05 Clare Liguori: Again, going back to this idea of building confidence in a change, as you start to roll it out, you start with a very, very small percentage of production capacity. Initially we start with the one box through the canary deployment, but we also start with just a single zone, out of all of these zones that we have globally. And then we roll it out to the rest of that single region. And then we'll do one zone in another region, and then roll it out to the rest of that region. And between those two regions that we've done independently, we've built a lot of confidence in the change because that whole time we've been monitoring for any impact, any increased latency, any increased error rates that we're seeing for API requests.

24:48 Clare Liguori: Then, we can start to parallelize a little bit more and more, as we get into what we call waves in the pipeline. We might do three regions at a time, individually picking an AZ from each of those regions and deploying it so that each of these individual deployments is still very small scoped. We do a canary deployment in each individual zone. And so we're continually looking at how to have small, small scope deployments, while being able to parallelize some of it as we're building confidence. By the end, we might be deploying to multiple regions at a time, but we've been through this whole process of really just building that confidence in changes.
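
The wave idea, starting with one zone in one region and widening as confidence grows, can be sketched as a simple planning function. The region and zone names and the wave sizes below are made up for illustration, and the function is a hypothetical sketch rather than the actual pipeline logic; each region is assumed to have at least one zone.

```python
def plan_waves(regions, wave_sizes):
    """Group regions into waves of growing size.

    regions: dict mapping region name to its list of zones, in rollout order.
    wave_sizes: how many regions to deploy in parallel in each wave; the last
    size is reused until every region is covered.
    Within a wave, one zone per region goes first, then the remaining zones.
    """
    if not wave_sizes or any(size < 1 for size in wave_sizes):
        raise ValueError("wave_sizes must contain positive integers")
    names = list(regions)
    waves = []
    start = 0
    index = 0
    while start < len(names):
        size = wave_sizes[min(index, len(wave_sizes) - 1)]
        batch = names[start:start + size]
        waves.append({
            "regions": batch,
            "first_zones": [regions[r][0] for r in batch],
            "remaining_zones": [z for r in batch for z in regions[r][1:]],
        })
        start += size
        index += 1
    return waves


# Hypothetical regions, each with two zones.
regions = {f"region-{c}": [f"region-{c}-1", f"region-{c}-2"] for c in "abcdef"}
for wave in plan_waves(regions, wave_sizes=[1, 1, 3]):
    print(wave["regions"], wave["first_zones"])
```

With these inputs the first two waves each cover a single region, and later waves cover up to three regions at a time while still starting with one zone per region.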

25:29 Do any humans monitor a typical rollout?

25:29 Daniel Bryant: I like it. I'm definitely getting the confidence vibe and that makes a lot of sense. Do any humans monitor a typical rollout?

25:36 Clare Liguori: Not typically. And that is one of my favorite things about continuous deployment at Amazon, is that largely we kind of forget about changes once we have merged them, merged the pull requests at the beginning. And so there's really no one that is watching these pipelines. We're really letting the pipelines do the watching for us. Things like automatic monitoring by the pipelines, auto rollback by the pipelines become super important because there is no one sitting there waiting to click that roll back button. Right?

26:09 Clare Liguori: And so often what we find is that when an alarm goes off and the on-call gets engaged, usually if it's a problem caused by a deployment, the pipeline is already rolling back that change before the on-call engineer has even logged in and started looking at what's going on. And so it really helps us to work on being able to have as little risk as possible in these deployments as well, because we not only scope them to be very small percentages of production capacity, but they also get rolled back really quickly.

26:44 Do you think every organization can aspire to implementing these kinds of hands-off deployments?

26:44 Daniel Bryant: You mentioned that the pull request being the last moment where humans sort of review. Do you think everyone can aspire to implementing these kinds of hands-off deployments?

26:52 Clare Liguori: One of the important things to remember as I talk about how we do things at Amazon, is that we didn't always have all of this built into our pipelines. This has been a pretty long journey for us. One of the things that we've done over time, and we've sort of arrived at this kind of complex pipelines when you look at them, is that it's been an iteration over many years of learning about what changes are going out, what changes have caused impact in production, and how could we have prevented that? And so that's led to sort of our discovery of canary deployments, looking at how do we also do canary deployments in pre-production? How can we find these problems before customers do? And so monitoring canaries help us to exercise all of those code paths, hopefully before customers do.

27:48 Clare Liguori: And so I think these are achievable really by anyone, but I think it's important to start small and start to look at what are the biggest risks to production deployments for you, and how can you reduce that risk over time?

28:02 Have you got any advice for developers looking to build trust in their pipelines?

28:02 Daniel Bryant: Yeah. And that sort of perfectly leads onto my next question, because it can seem when folks look at companies like Amazon, it can seem quite a jump. How would you recommend, or have you got any advice for developers in building trust in their pipelines?

28:15 Clare Liguori: One of the things that we do naturally see at Amazon in new teams is kind of a distrust of the pipeline. But it's interesting, it's also sort of a distrust in their own tests, their own alarms and things like that. When you build a brand new service, you're always worried, is the pager not going off because the service is doing well, or is the pager not going off because I'm missing an alarm somewhere that I should have enabled?

28:42 Clare Liguori: One of the things that's pretty common when we're building a brand new service is to actually add what we call manual approval steps to the pipeline to begin with. Just like you build confidence in changes that are going out to production, you also need to build confidence in your own tests, your own alarms, your own pipeline approval steps. So just putting in that manual approval and having someone watch the pipeline for a while and see, what are those changes that are going out? Are they causing impact to production? Looking weekly at, what do the metrics look like and should we have alarmed on any of these spikes that are happening in the metrics weekly?

29:21 Clare Liguori: So really validating for yourself, what is the pipeline looking at, and would I have made the same choice, really, in deploying to production, can help to build that confidence in a new pipeline or a new service. And then over time, being able to remove that once you feel like you have sufficient test coverage, you have sufficient monitoring coverage, that you're not as a person, adding value by sitting there watching the deployment happening.

29:47 Would you recommend that all folks move towards trying to build pipelines as code?

29:47 Daniel Bryant: Would you recommend that all folks move towards trying to build pipelines as code?

29:52 Clare Liguori: One of the things that I really like about pipelines as code is that similar to any kind of code review, it really drives a conversation about, is this the right change? So as you're building up that pipeline, you can very naturally have conversations in a code review if it's pipelines as code, about how you want to design your team's pipelines, and whether it meets the bar for removing that manual approval step after launching that service, or whether we're running the right set of tests, things like that.

30:24 Clare Liguori: One of the things that I did find when joining Amazon and starting to build out a lot of these pipelines is also that it's very easy to forget some of these steps. There are a lot of steps in the pipeline. There's a lot of pre-production environments, a lot of integration test steps. Across all of our regions and zones, there's a lot of deployments going on. And so as the pipeline just gets more and more complex, you just want to click fewer buttons, in order to set these up and to have consistency across your pipeline. If you've got 10 different microservices that your team owns, you want those to be consistent.

31:00 Clare Liguori: With pipelines as code, especially with tools like the AWS Cloud Development Kit, where you get to use your favorite programming language and you get to use object-oriented design and inheritance, what's really common is to set up a base class that represents, "this is how our team sets up pipelines, these are the steps that we run in the pipeline," because we're running the same integration tests in all of these pipelines across that full stack. And so that really helps to achieve some consistency across our team's pipelines, which has been super helpful and a lot less painful than clicking all the buttons and trying to look at the pipelines to see if they're consistent.
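
Illustration (not from the conversation): the base-class idea described above could be sketched roughly as follows. This is plain Python rather than the AWS Cloud Development Kit API, and every name in it (`Stage`, `TeamPipeline`, `OrdersPipeline`, the action strings) is hypothetical. The alpha, beta, and gamma stage names follow the pre-production environments mentioned earlier in the episode; the manual approval flag reflects the approval step that new services start with and later remove.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Stage:
    """One pipeline stage and the ordered actions it runs (hypothetical model)."""
    name: str
    actions: List[str] = field(default_factory=list)


class TeamPipeline:
    """Hypothetical base class: the stages every pipeline on the team shares."""

    # Assumption: new services keep a manual approval gate before production.
    manual_approval = True

    def __init__(self, service: str):
        self.service = service

    def pre_production_stages(self) -> List[Stage]:
        # The same integration tests run in every pre-production environment.
        return [
            Stage(env, ["deploy", "integration-tests"])
            for env in ("alpha", "beta", "gamma")
        ]

    def stages(self) -> List[Stage]:
        stages = [Stage("source"), Stage("build")]
        stages += self.pre_production_stages()
        production = Stage("production", ["deploy"])
        if self.manual_approval:
            production.actions.insert(0, "manual-approval")
        stages.append(production)
        return stages


class OrdersPipeline(TeamPipeline):
    """A mature service that has earned removal of the manual approval gate."""
    manual_approval = False


if __name__ == "__main__":
    for pipeline in (TeamPipeline("payments"), OrdersPipeline("orders")):
        print(pipeline.service, [(s.name, s.actions) for s in pipeline.stages()])
```

Because the shared stages live in one class, removing the approval gate for a service is a one-line change that shows up in code review, which is where the conversation about whether the service "meets the bar" can happen.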

31:36 Could you share guidance on how to make the iterative jumps towards full continuous delivery?

31:36 Daniel Bryant: Excellent. Yeah. I definitely like clicking buttons. I've been there for sure. Maybe this is a bit of a meta question, a bigger question, as a final question, but do you have any advice for how listeners should approach migrating to continuous delivery, continuous deployment, if they're kind of coming from a place now where they're literally clicking the buttons and that's working, but it's painful? Any guidance on how to make the iterative jumps towards full continuous delivery?

32:01 Clare Liguori: I think again, start small. I think you can break up pipelines into a lot of different steps. We have this sort of full continuous delivery across source, build, test, and production. What I see with even some AWS customers is kind of tackling a few of those steps at a time. It might just be continuous integration to start: consistently building a Docker image when someone pushes something to the main branch. Or it might be consistently deploying out to a test environment and running integration tests, and getting a little confidence in your tests, working on test flakiness and things like that, and working on test coverage.

32:44 Clare Liguori: Then, getting to iterating further to production and starting to think about, how do you make these production deployments smaller scoped, and reliable, and monitored, and auto rollback. But I think it's important to start small and just start with small steps, because it really is an iterative process. It's been an iterative process for Amazon and something that we continue to look at. We continue to look kind of across AWS and look for any kind of customer impact that's been caused by a deployment. What was the root cause and how could we have prevented it? And we think about new solutions that we could add to our pipelines and roll out across AWS. And so we're continuing to iterate on our pipelines. I think they're definitely not done yet, as we learn at higher and higher scale, how to prevent impact and reduce the risk of impact.
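
Illustration (not from the conversation): one way to picture "smaller scoped, monitored, and auto rollback" production deployments is a loop that deploys to one small wave at a time and undoes everything if a health check fails. This is a minimal sketch, not Amazon's tooling; the function `deploy_in_waves`, the wave names, and the `deploy`, `is_healthy`, and `rollback` callbacks are all hypothetical stand-ins for whatever deployment and alarm systems a team already has.

```python
from typing import Callable, List


def deploy_in_waves(
    waves: List[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> bool:
    """Deploy to one wave at a time.

    On the first unhealthy wave, roll back every wave deployed so far
    (most recent first) and stop. Returns True only if all waves succeed.
    """
    deployed: List[str] = []
    for wave in waves:
        deploy(wave)
        deployed.append(wave)
        if not is_healthy(wave):
            for done in reversed(deployed):
                rollback(done)
            return False
    return True


if __name__ == "__main__":
    log: List[str] = []
    ok = deploy_in_waves(
        waves=["zone-a", "zone-b", "zone-c"],  # hypothetical wave names
        deploy=lambda w: log.append(f"deploy {w}"),
        is_healthy=lambda w: w != "zone-b",  # simulate an alarm firing in zone-b
        rollback=lambda w: log.append(f"rollback {w}"),
    )
    print(ok, log)
```

In this sketch the third wave is never touched once the second one fails its health check, which is the point of keeping each deployment small.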

33:31 Daniel Bryant: That's great. Many of us look up to Amazon. It's nice to hear you're still learning too, right? Because that gives us confidence that we can get there, which is great.

33:38 Clare Liguori: Yeah, absolutely.

33:39 Outro: where can listeners chat to you online?

33:39 Daniel Bryant: So if listeners want to reach out to you, Clare, where can they find you online?

33:42 Clare Liguori: I'm totally happy to answer any questions on Twitter. My handle is Clare_Liguori on Twitter. So totally happy to answer any questions about how we do pipelines at Amazon.

33:55 Daniel Bryant: Awesome. Well, thanks for your time today, Clare.

33:56 Clare Liguori: Thank you.

Mentioned

Automating safe, hands-off deployments

More about our podcasts

You can keep up-to-date with the podcasts via our RSS Feed, and they are available via SoundCloud, Apple Podcasts, Spotify, Overcast and YouTube. From this page you also have access to our recorded show notes. They all have clickable links that will take you directly to that part of the audio.
