
Architecting for High Availability in the Cloud with Cellular Architecture


Summary

Chris Price discusses cellular architecture, its merits, design options with cellularization, and how to effectively isolate at the level of an AWS account.

Bio

Chris Price is a Software Engineer at Momento, where he helped design and build Momento’s cellular architecture and the corresponding CI/CD automation. His interest in infrastructure automation began at Puppet Labs, during the genesis of the DevOps movement. After 5 years in a tech lead role at Puppet, Chris moved on to AWS, where he helped build the foundation for the AWS MediaTailor service.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Price: I am going to be talking about cellular architecture in the cloud. I want to take a little digression first and talk about automation, a quick little tour through some history of automation. In 1436, the invention of the printing press allowed us to distribute information so much more quickly than we had before. In 1475, a little less celebrated, but still very important, the invention of the cookie cutter meant we no longer had to cut out every single gingerbread man individually. In 1901, the assembly line. People usually associate this with Ford and the ability to mass produce vehicles at a much larger scale than ever before. Then last but not least, in 2019, a path to escape from YAML and JSON as our infrastructure as code. You may be asking, I thought this was a talk about cellular architecture, why is this guy talking about automation? One of the points I want to convey in this talk is that to achieve a really maintainable and scalable cellular architecture, automation is a really important piece of the puzzle. The advancements that have happened in the infrastructure as code space in the last few years have done a lot to make this a simpler problem to solve. I'm going to tie these two things together a little bit more later in the talk.

Background

My name is Chris Price. For my last few jobs: I worked at Puppet for about five years, which is where I got my introduction to DevOps and infrastructure as code; that was a little more on-prem type stuff. Then I moved to AWS. When I was working at AWS, I was on a team that built a greenfield video streaming service used as part of Amazon's broadcasts of Thursday Night Football. One of the things that was really awesome about that job was getting to learn from all of the other teams inside of AWS that had built these Goliath services over the years, and getting to take lessons from them, like how they managed cellular architecture for their services, and apply that to the new greenfield service we were building. Now I work at Momento, which is a company that provides serverless caching and Pub/Sub products. I realize there's some irony in this talk: I'm going to be talking about how we built our infrastructure, when my company's whole reason for existing is to keep you from needing so many things in your own infrastructure. When I joined Momento, one of my first tasks was to help build out the infrastructure for our architecture, including some of how we do the cellular stuff.

Partial Availability Zone Failure (Recap)

Cellular architecture is not something that we at Momento invented. Peter Vosshall gave a talk at re:Invent in 2018 about how AWS minimizes the blast radius of failures for the various AWS services, and he talked a lot about cellular architecture in it. If you're interested in more of the motivation for why to do it, that's a good talk to watch. I'm not going to talk quite so much about motivation; I'm going to talk a little more about practical things you can do if you want to incorporate some cellular architecture into your own application and automate some of the infrastructure around it. I will do a quick recap of some of the important things from Peter's talk. The things on the right-hand side of his slide were really the focus of his talk: breaking down the units at which an AWS service provides isolation, to allow you to minimize the blast radius when some failure does occur, as they inevitably do. Regions and availability zones are the fundamental building blocks of most AWS services. They give you a mechanism to choose how you want to set up your own architecture to be a little more fault tolerant along those same lines, so that if something fails in one AZ or one region, it doesn't bring your whole service down. Then after that, he talks about cells, which are the next level at which you can isolate things for your service. A cell, as a simple definition, is nothing more than a single isolated deployment of your application. It can run and handle requests completely independently, and it's totally isolated from other cells that you have. You can choose to build your cells as regional things or as zonal things, or you can create multiple cells inside of an AZ or multiple cells inside of a region. It's really up to what's the best fit for your product and your customers. I liked the example that he used for blast radius: in the red squares at the bottom middle, he shows a service that was built with independent cells inside of each availability zone. He talks about when a poison pill request or a black swan event happens, which is some unexpected input or traffic that you're getting from a given customer, usually in a multi-tenant service; that's where this concept is most applicable. You get some escalated traffic or some problematic requests from a certain customer, your service handles it in an unexpected way, and things start to fail. If you have a design like that, the blast radius is limited to that one cell, which in this case is only a part of a single availability zone, and all the traffic that you might be serving in all of those other cells is unaffected. That's the general concept.

Cellular Architecture: Requirements

Now we're going to talk about how we at Momento built out our initial foundational infrastructure to let us easily manage cells. There are five key problems that I think you have to solve if you want to automate your cellular infrastructure. You need to figure out how you're going to achieve isolation. You need to figure out what the process looks like when you need to bring up a new cell. You need to solve deployments: whenever there's a change to some microservice that makes up part of your application, how are you going to roll that change out across all of the cells that you have? You need to figure out permissions. There are going to be inbound permissions into the cell and outbound permissions from the cell to let it access resources like private VPCs, or an ECR image that has the Docker image for a part of your application. You have to figure out how to manage all those permissions. Then, lastly, you have to figure out monitoring. When your operators are on call trying to manage your service, they need an easy way to get a holistic view of what's going on in all the cells without having to look in a bunch of different places. Those are the five key problems that we're going to try to solve, and I'm going to talk through some tools and patterns that we use at Momento to solve each of them.

Standardization

Before I get into those individual tasks, I want to zoom out a little bit and talk about standardization. The key thing when you're thinking about a cellular architecture is that you're going to end up having a whole lot of different deployments of your application code. The only way that's ever going to scale is if you have some way to perform generalized tasks across all of those different cells: some generalization mechanism that allows your infrastructure to manage all the different things going on in each of those cells in a similar, repeatable way. There are some things you can do up front to standardize certain parts of how your application is built that will make that infrastructure a lot easier to build.

Standardization (Not Homogenization)

I do want to call out that when I'm talking about standardization, I'm definitely not talking about homogenization. I think this is super important to call out, because most applications these days are made up of several discrete microservices. It's almost impossible over time for any mature engineering org to keep those all on the exact same tech stack. You're going to have certain services that have different needs from other ones, or you're going to learn new technologies and find that they're better. You're going to end up having multiple different flavors of tech stacks that make up the components of your application. That's totally fine. Here, we can see a few different examples. We've got one on the left that is deployed via AWS CloudFormation. It's got some serverless resources like API Gateway, lambda, maybe some Dynamo tables. You could also have some Kubernetes services that deploy a load balancer, application containers, and database containers. There are a few more AWS examples on the right. These don't all have to be the same as each other, but we do want to recognize some common patterns in how we need to manage these things when we're managing cells, and find ways to generalize our solution across them even though they're using different tech stacks.

Standardization - Deployment Templates

At the end of the day, when we're trying to figure out how to deploy a change to our application code to all of our cells, it's going to be a process that looks something like this. You're going to have a commit: a change goes into your code. You're going to need a build step to produce the binary artifact for that change. You're going to need to release that somewhere; that may be a Docker image that you're pushing up to a Docker repo, maybe a JAR file that you're pushing somewhere, maybe a ZIP file that you're sending up to S3 to be the backing code for a lambda. There are a few different flavors of things that can happen there, but it's a common pattern that there's always going to be this release step. Then you've got some steps where you're actually doing the deployment to your individual cells. In actuality, we probably want something a little more sophisticated than that. One of the reasons we would opt for cellular architecture in the first place is to minimize the blast radius if something breaks, and it just so happens that right after you deploy your code is a very common time for something to break. What we've changed in the bottom part here is we've added a step where we deploy to a staging cell or pre-prod cell before we deploy to the real cells that have customer traffic in them. Then we're going to add this bake step in. The bake step is going to be monitoring canaries, metrics, and other indicators of health, so that it can tell whether that deployment went smoothly or not. We're going to sit in that state for some amount of time before we decide that it's safe to move on and deploy to our first production cell. Once we do reach that point, we can just repeat those two steps across all the different cells.
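As a rough sketch of that generalized flow, here's how the stage list might be modeled in TypeScript. The type names and the `stagesForCells` helper are illustrative stand-ins, not Momento's actual code:

```typescript
// Illustrative model of the generalized deployment flow described above:
// build -> release -> deploy/bake to staging -> deploy/bake to each prod cell.
type StageKind = 'build' | 'release' | 'deploy' | 'bake';

interface PipelineStage {
  kind: StageKind;
  // Which cell the stage targets; build/release stages have no cell target.
  cell?: string;
}

function stagesForCells(prodCells: string[]): PipelineStage[] {
  const stages: PipelineStage[] = [
    { kind: 'build' },
    { kind: 'release' },
    { kind: 'deploy', cell: 'staging' },
    { kind: 'bake', cell: 'staging' },
  ];
  // Repeat the deploy + bake pair for every production cell, in order.
  for (const cell of prodCells) {
    stages.push({ kind: 'deploy', cell }, { kind: 'bake', cell });
  }
  return stages;
}
```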

We can talk about the different flavors of tech stacks that we might have used for microservices, and what individual technologies we might use to achieve this series of steps. I'm going to be focusing quite a bit on AWS tools just because that's what we ended up using at Momento; it's easier to talk about this stuff with concrete examples. You can imagine swapping out almost any of these technologies with something that fits better for your environment. Here we've got an example of the tools we use to do the CloudFormation flavor of deployment. You've got AWS CodePipeline at the beginning, AWS CodeBuild to do the actual build of the change, then maybe we're publishing a Docker image to the Elastic Container Registry. Then we call AWS CloudFormation to do the actual deploy to our staging cell. Then we use Step Functions. Step Functions are a really cool thing to use for this baking step, because you can have a lambda that just quickly checks all your metrics, makes sure that everything looks healthy, and then goes back to sleep for a while; the Step Function can just loop, allowing it to keep checking every so often. If it ever encounters a failure, it can kill everything right then, let your operators know that something's wrong, and prevent that deployment from going any further down the pipeline. If it reaches the allotted time interval that you configure for how long you want your bake to last, then it'll allow the pipeline to proceed. Then again, we just repeat those same steps for the rest of the cells. Another cool thing you can put into one of these bake steps: if you want to limit your deployments so that they only happen during a certain set of business hours, because you want to reduce the risk that some breaking change rolls out in the middle of the night, you can build that into your Step Function as well and say, if it's after 6 p.m. right now, keep this bake going until tomorrow morning. It's pretty easy to code stuff like that up in a Step Function.
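Here's a minimal CDK TypeScript sketch of such a bake loop, assuming a health-check lambda that returns a status field in its payload. All resource names, the status values, and the five-minute check interval are assumptions for illustration, not Momento's actual implementation:

```typescript
import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as sfn from 'aws-cdk-lib/aws-stepfunctions';
import * as tasks from 'aws-cdk-lib/aws-stepfunctions-tasks';
import * as lambda from 'aws-cdk-lib/aws-lambda';

export class BakeStateMachine extends Construct {
  constructor(scope: Construct, id: string, props: { checkFn: lambda.IFunction }) {
    super(scope, id);

    // Sleep between health checks.
    const wait = new sfn.Wait(this, 'WaitBetweenChecks', {
      time: sfn.WaitTime.duration(cdk.Duration.minutes(5)),
    });

    // The lambda inspects metrics/canaries and returns a status field.
    const check = new tasks.LambdaInvoke(this, 'CheckHealth', {
      lambdaFunction: props.checkFn,
      outputPath: '$.Payload',
    });

    const fail = new sfn.Fail(this, 'BakeFailed');
    const succeed = new sfn.Succeed(this, 'BakeComplete');

    // Loop: wait -> check -> decide, until a failure is seen or the
    // configured bake window has elapsed.
    const decide = new sfn.Choice(this, 'Evaluate')
      .when(sfn.Condition.stringEquals('$.status', 'UNHEALTHY'), fail)
      .when(sfn.Condition.stringEquals('$.status', 'BAKE_COMPLETE'), succeed)
      .otherwise(wait);

    wait.next(check);
    check.next(decide);

    new sfn.StateMachine(this, 'StateMachine', {
      definitionBody: sfn.DefinitionBody.fromChainable(wait),
    });
  }
}
```

The business-hours check described above would be one more condition in the health-check lambda or an extra branch in the Choice state.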

Here's the Kubernetes flavor of the same thing. We're using almost all the same tools, except instead of CloudFormation, maybe we just have a lambda that we call in that step. The lambda makes an API call to Kubernetes, letting it know that there's a new Docker image available that it should deploy. Then we can still use basically the same Step Function logic for the bake, and we just repeat that series of Kubernetes calls and bakes for the rest of the cells. So we've got this generalized list of steps that we need in order to do all of our deployments, with enough flexibility that the pipelines are almost the same, even though some of our technology stacks may be very different from one another.

Standardization - Build Targets

Another piece of the puzzle that we need for standardizing this stuff, in order to make it all fit together, is some standardized build targets for our different microservices. I'm just using makefiles here. There are a ton of ways you could solve this problem; makefiles are pretty simple, they've been around forever, and they work for this purpose just fine. On the left, you can see a snippet from a makefile for one of our microservices; on the right, a snippet from a makefile for another one. If you look at the commands in here, the ones on the left are doing some Gradle calls to build one of our Kotlin services, then some npm calls to build up the infrastructure. The commands on the right are very different: they're calling cargo to build a Rust service, and then some shell scripts that build up the infrastructure. The important point here is that we have the exact same list of targets in both of these. For example, we have a pipeline build target that controls what happens in the build step of the deployment process for that particular service. Then we have targets for cell bootstrap and GCP cell bootstrap, because at Momento we can deploy to either AWS cells or GCP cells. Again, the make target names are the same. What this means is that the other pieces of our infrastructure, operating outside of these individual services, now have a common lifecycle that they know they can rely on inside of each of the components they need to interact with when we're doing things like deploying.

Standardization - Cell Registry

Another building block, another place we can standardize, is what I call a cell registry. This is just some mechanism that gives us a list of all of the cells that we've created and the important bits of metadata about them. Here, we did this in TypeScript. We have about 100 lines of TypeScript that just define a few simple interfaces that we can use to represent all of the data about our cells. We've got one interface that just has a little bit of DNS configuration. The interface on the bottom left is probably the most important one; it has the information about a given cell. We have things like the scale of the cell: is this a prod cell or a developer cell, that kind of thing? Region, DNS config. Then cloud provider config is where we control whether it's an AWS cell or a GCP cell. This is just a way of modeling all the metadata that we need to know about a given cell. Then on the right, we have a bigger interface that represents our entire Momento organization. The important piece I want to highlight from that one is an array called cellAccounts that just has the list of all of the cells and the metadata about them. This is a simple data model, nothing fancy about it. This is an example of the actual data that we're using that model to create. When we have a new cell that we want to add, we come in here and adjust this array, and we add the new cell. We have a name for it, in this case alpha. We've got an account ID, region, DNS config, just a simple representation of the metadata that we need to know about that cell. Now that we have all of this data about our cells, all we need to do is publish it somewhere that makes it accessible from the rest of our infrastructure. You could do something really fancy with this. You could put it in a database if you wanted to. There's lots of stuff you could do to make it accessible from other parts of your code. In our case, we didn't need anything that sophisticated: we literally just saved this data to S3. Then we have a TypeScript library that can pull this data down and turn it back into the TypeScript objects. That means any of the rest of our infrastructure can use that library at any time to pull down all of the metadata about all of our cells. That allows us to start building some generalization patterns that let us do a lot of cool things in managing our cells.
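A minimal sketch of what such a registry model might look like; every interface, field, and value here is an illustrative stand-in for the roughly 100 lines of TypeScript described above:

```typescript
// A sketch of the registry model described above; all names are illustrative.
interface DnsConfig {
  hostedZoneName: string;
  cellSubdomain: string;
}

// Is this a prod cell, a pre-prod cell, or a developer cell?
type CellScale = 'prod' | 'staging' | 'developer';

// Controls whether the cell lives in AWS or GCP.
interface CloudProviderConfig {
  provider: 'aws' | 'gcp';
  awsAccountId?: string;
  gcpProjectId?: string;
}

interface CellAccount {
  name: string;
  scale: CellScale;
  region: string;
  dns: DnsConfig;
  cloudProvider: CloudProviderConfig;
}

// The organization-wide interface, with the cellAccounts array highlighted
// in the talk.
interface MomentoOrganization {
  cellAccounts: CellAccount[];
}

// The kind of data the model is used to create, mirroring the "alpha" cell
// example; the account ID and DNS values are placeholders.
const organization: MomentoOrganization = {
  cellAccounts: [
    {
      name: 'alpha',
      scale: 'staging',
      region: 'us-west-2',
      dns: { hostedZoneName: 'cell.example.com', cellSubdomain: 'alpha' },
      cloudProvider: { provider: 'aws', awsAccountId: '123456789012' },
    },
  ],
};
```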

Standardization - Cell Bootstrap Script

The last piece of standardization I wanted to talk about is a thing we call the cell bootstrap script. We've talked about how your application is probably made up of several microservices. There are probably a few other things that make up your application as well, like some core infrastructure: shared networking components like VPCs and NAT gateways, maybe some shared database resources. In our case, we also have a separate little project where we keep the infrastructure code for our DNS records. DNS records are a much riskier type of resource to change than almost anything else, so we like to keep them separated in their own project where we have a little bit more control. At the end of the day, we have this list of Git repos, each of which has some component that is part of our application. If you're a company that uses monorepos, these might all be in the same Git repo together and just be in different directories; the concept is still the same. Now that we know this, with the other building blocks we've talked about, we can create this cell bootstrap script, which is a really simple, naive, generic way to take actions across an entire cell. What I've got here is one line of code that pulls the metadata out of our cell registry for a given cell, so now I know the AWS account ID and the DNS information and so on. Then I've got a line of code that checks out a copy of all of the source code from those Git repos. Then we've got a loop: we loop over the Git repos in the correct order, and for each one, we run the make target called cell-bootstrap. With these simple five lines of code, I now have a super generic way to walk through all of the components of my application and deploy a new cell of it. This is really generic and really extensible as we add new components to the system.
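In spirit, that script might look like the following TypeScript sketch. The `loadOrganization` helper, the repo list, the GitHub org, and the environment variable names are all assumptions for illustration:

```typescript
import { execSync } from 'node:child_process';
// Hypothetical helper that pulls the published registry data back down from S3.
import { loadOrganization } from './cell-registry';

// A sketch of the cell bootstrap flow described above; not Momento's actual code.
async function bootstrapCell(cellName: string): Promise<void> {
  // 1. Pull the cell's metadata out of the cell registry.
  const org = await loadOrganization();
  const cell = org.cellAccounts.find((c) => c.name === cellName);
  if (!cell) throw new Error(`unknown cell: ${cellName}`);

  // 2. Check out every component repo, in deployment order (DNS last).
  const repos = ['core-infrastructure', 'cache-service', 'dns-records'];
  for (const repo of repos) {
    execSync(`git clone git@github.com:example-org/${repo}.git`, { stdio: 'inherit' });

    // 3. Run the standardized make target that every component exposes.
    execSync('make cell-bootstrap', {
      cwd: repo,
      stdio: 'inherit',
      env: { ...process.env, CELL_NAME: cell.name, CELL_REGION: cell.region },
    });
  }
}
```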

Isolation

Now we've got some standardized building blocks that we can use to go back and solve the five problems I originally brought up. We'll start with isolation. We found that the best way to isolate cells is to create a separate AWS account for each cell. That can sound daunting at first if you've never done it before, because you can end up with a really long list of AWS accounts, but the tooling around it is actually pretty mature these days, and the advantages, I think, pretty strongly outweigh the disadvantages. This is also a pretty common pattern inside of AWS. In my mind, the most compelling motivation for this is that you basically get the isolation for free. When you create a new AWS account and deploy a cell in it, it's just isolated by default. You would have to do extra work on your end to break the isolation between that cell and one that exists in another AWS account. It's certainly possible to create two cells in the same AWS account together, but when you do that, you're opting in to figuring out the isolation story on your own, which usually involves a bunch of messing around with IAM policies to try to make sure that the resources in the two cells don't have permission to interact with one another. That can be pretty tricky to get right. Anybody who has spent time creating and managing IAM policies in AWS before? Have you found that task to be fun and easy? That by itself is a really compelling reason to do it this way: you just don't have to mess with the IAM stuff.

Another really compelling advantage is that you get the same benefit on the billing side. If you've never used AWS Organizations before, the tools are pretty mature now. You can create hierarchies for your accounts. We have a folder called cell that we put our cell accounts in, and one called developer that we put accounts in for our individual developers. There are APIs for all of this, so you can manage it programmatically as well. Once you have all these accounts in your AWS organization together, you can go to the billing screen, go to Cost Explorer, and do a group-by on the linked accounts. That gives you, for free, a way to see the relative cost of all of your different cells right there against each other. You can see if one stands out as costing way more than the others. Again, you can implement this yourself if you put two cells in the same AWS account together, but then it's on you: you have to put tags on all of the resources and then come into Cost Explorer and set up a query to group by the tags. If you make any subtle mistakes in how you're tagging the resources, your billing data might not be quite right. Here, you just get it for free. That's why we went with accounts for isolation.

Another problem that deals with isolation is routing: you have to have a way for traffic to get into these cells. A simple way to do it, and one that works well for us at Momento, is that when you create a new cell, you just create a new DNS name for it. You can see in my DNS names up there, they've got the region, but they've also got effectively a cell ID, a 1 and a 2 in the DNS name. That's really simple; there are no routing problems involved. This works for us because all of our customers interact with our services using SDKs that we provide to them. We can bake some information about what cell a certain customer is supposed to be in into the auth token that we provide them. Then they use our SDK to access the service, and we can figure out the right DNS name for the cell that they're in. They never see any of it, but we get them routed to the cell that we want them to be in. This won't work for all use cases though. If you have an app that people are accessing from a browser, you don't want to have to give out 100 different cell URLs to all of your customers and make sure they pick the right one in their browser. You can't do it that way. In that case, you've got to have a single DNS name that sits in front of multiple cells, and you've got to build in an extra routing layer that knows how to take those requests and map them to the right cell. You want this layer to be as thin as humanly possible. It shouldn't do anything other than the simplest possible routing.

There are pros and cons to this approach. The cons: obviously, it's another piece of your infrastructure you have to maintain, another place where something can go wrong. Another con is that it impacts your scalability story. You might have a database in your cell that you know can only scale horizontally to a certain extent; then that's the maximum capacity that cell might ever be able to have. That's not necessarily a business problem for you, because you can just spin up another cell when you're getting close to capacity on that one. You can build out more cells and scale horizontally by adding new cells, and they're independent of one another. As soon as you put this routing layer in, though, it has to be able to scale to handle however many cells you're putting behind it. It's just one more thing you have to think about for scaling. On the pro side, one big advantage of this approach is that if you ever need to migrate a customer from one cell to another, without them knowing about it or being involved in it, this routing layer gives you a place in your own code, which you control, where you can write the code that does that. It's not an easy problem to solve, depending on what kind of data you need to migrate, but it gives you a mechanism to possibly achieve that without your customer knowing about it. Whereas with the cellular-endpoints approach, you have to switch them over on the client side, so you're probably going to have to involve the customer more. It can be a business decision as to which of these two works better for you. That's isolation: we solved it with AWS accounts and AWS Organizations, and then either cellular endpoints or a routing layer.

New Cells

We move on to new cells. This one is going to be really fast, based on the building blocks we talked about earlier. All we have to do is create a new AWS account in our organization, add it to our account registry, and run our cell bootstrap script. That's it, and we have a new cell. Again, the components that make the cell bootstrap script work: we've got the data on the left in the account registry that gives us all the information about the cell. On the right, we've got the standardized makefile system, where all of the components of our service have the same build targets. Then, at the bottom, we've got the actual cell bootstrap code that's able to loop over those components and call the make target. This gives us a super easy and fast way to spin up new cells.

Deployments

Now we can move on to deployments. This is going to be the meatiest part of the talk. This is where I tie back into those early slides about infrastructure as code and automation. I like to call the trend that's been happening in infrastructure as code in the last few years IaC++, because I'm a dork. I think the big breakthrough is that we're starting to have all of these tools at our disposal to define our infrastructure using real programming languages. Before, it was always a big ball of YAML, or a big ball of JSON, or a big ball of HashiCorp Configuration Language, that could get really long and really tedious. Then people started trying to bolt on solutions where you could use template engines or other bespoke tools to do loops and reuse some of your code. It always felt to me like I was coding with one hand tied behind my back, because all these constructs already exist in the programming languages that you're using to build your application; why do we have to invent new ones for infrastructure as code? In 2019, AWS released the Cloud Development Kit, CDK, which was the first time you could use basically your favorite programming language to produce CloudFormation infrastructure. It's really powerful; I was very excited when this happened. A year later, they released cdk8s, which is basically the same thing with similar syntax, but it works for Kubernetes environments instead of CloudFormation. Then in 2021, they did what I think might have been the coolest contribution that AWS has made to open source in quite a while: they pulled a common library called constructs out of those two projects. It has all the underpinnings for managing these graphs of resource dependencies, but it's not tied to a specific deployment technology. It's a building block that people can use to build their own. Pretty quickly after that, HashiCorp released CDKTF, CDK for Terraform, which gives you the same programming constructs for your Terraform infrastructure. With these tools at our disposal, there's a lot more we can do to generalize our infrastructure code across repetitive things like cells.

This is back to my original slide. On the left is a giant ball of JSON that has part of an IAM policy for an S3 bucket. At the bottom, you can see I had to cut about 200 lines out of it so that it would even remotely fit onto this slide. On the right, you see the same thing in CDK TypeScript. In TypeScript, I'm able to create an array called orgPrincipals that has all the organizations I need to manage these permissions for, loop over them with a regular TypeScript loop, and call grantRead to give them access to the bucket. Then, probably the most powerful thing in my mind: you can now create libraries that have these chunks of code in them, and you can use and distribute them in exactly the same way as any other library that you're using in your application. Here, I've got some CDK TypeScript code; the contents aren't super important, but it's a class called OTelGateway, an OpenTelemetry gateway, which extends this construct thing that AWS provided for us. In the class, all I'm doing is building up the group of resources that makes up an OTel gateway. There's a security group, an ingress rule, and later on you would see an autoscaling group and a few other things we need for our OTel gateway. I can write this little bit of code, put it in a library, and release it to the same place that I would release any other TypeScript library we're using for our application. In our case, that's AWS CodeArtifact, but it could just as easily be npmjs, or an Artifactory server, wherever you store your binary artifacts. This isn't limited to TypeScript. You could do this in Java or Python; there's support for most of the popular programming languages. Wherever you store your libraries, you just create a library and put this stuff in there. Then from anywhere else in your infrastructure, you can just call new OTelGateway, and it'll create all those resources in that part of your infrastructure. It's the first time that we can really reuse all of the same programming constructs that we were already using for our applications, but now for our infrastructure.
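Here's a compressed sketch of both ideas, assuming aws-cdk-lib v2. The organization IDs, the helper name, and the OTelGateway internals are placeholders reconstructed from the description, not the code shown on the slide:

```typescript
import { Construct } from 'constructs';
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as iam from 'aws-cdk-lib/aws-iam';
import * as ec2 from 'aws-cdk-lib/aws-ec2';

// The loop from the slide: grant several AWS organizations read access to a
// bucket, instead of hand-writing hundreds of lines of policy JSON.
export function grantOrgRead(bucket: s3.IBucket, orgIds: string[]): void {
  const orgPrincipals = orgIds.map((id) => new iam.OrganizationPrincipal(id));
  orgPrincipals.forEach((principal) => bucket.grantRead(principal));
}

// A reusable construct in the spirit of the OTelGateway class described
// above: it bundles up the resources an OpenTelemetry gateway needs.
export class OTelGateway extends Construct {
  constructor(scope: Construct, id: string, props: { vpc: ec2.Vpc }) {
    super(scope, id);

    const securityGroup = new ec2.SecurityGroup(this, 'SecurityGroup', {
      vpc: props.vpc,
    });
    // Allow OTLP gRPC traffic from inside the VPC.
    securityGroup.addIngressRule(
      ec2.Peer.ipv4(props.vpc.vpcCidrBlock),
      ec2.Port.tcp(4317),
      'OTLP gRPC from within the VPC',
    );
    // Per the talk, the real class goes on to create an autoscaling group
    // and the other resources the gateway needs.
  }
}
```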

Now we're going to revisit this diagram. We're trying to figure out how to achieve these steps for all of the different components of our services, even if they have slightly different tech stacks. With CDK code, what I can do is create an AWS CodePipeline just by constructing this object. I have this array called stages; I can call stages.forEach and add each stage to the pipeline. The interesting part of this is not on the screen: it's how I populated that stages variable. Basically, what that stages variable has in it is the list of steps that you see right there on the bottom. Here's an example of a piece of code. The function here is called cacheServiceStages. This is the function that's responsible for providing that list of stages for one of our microservices. Again, here we've just created a simple TypeScript interface that we can use to represent the differences between our services. I've got this array called releases, which is going to control what happens in this release stage right here. In this one, all I'm doing is one Docker release. I could have done more Docker releases, or an S3 release, or whatever, by just adding them to this array. For this service, I've just got the one Dockerfile. Then the rest of our generic infrastructure code, when it's building stuff, can read this and go, ok, I need to add a Docker step here.

Then I've got an array called stacks. We have one stack for if we're deploying to an AWS cell, which is a CloudFormation stack, and one for if we're deploying to a GCP cell, which is a Terraform stack. Again, this is just capturing the things that are important about this one microservice for its deployment lifecycle. Then all of the generic code that we have in the rest of our repo can loop over this and do all the hard work. We actually have one repo that we call pipeline of pipelines. Its whole job is to build a deployment pipeline for each of the other components in our service. When it runs, it creates a pipeline that deploys our DNS stuff, one that deploys our core infrastructure, and then one for each microservice. Each one of the pipelines it produces looks exactly like this, except for slight differences depending on whether it's a Kubernetes stack or a CloudFormation stack. That means we have one repo that has all of our pipeline infrastructure code in it.
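Put together, the per-service descriptor described above might look like this sketch; the type and field names are illustrative stand-ins for the interfaces on the slide:

```typescript
// Illustrative types for the per-service deployment descriptor.
interface DockerRelease {
  kind: 'docker';
  dockerfile: string;
  ecrRepoName: string;
}

interface S3Release {
  kind: 's3';
  artifactPath: string;
}

type Release = DockerRelease | S3Release;

interface CloudFormationStack {
  kind: 'cloudformation';
  stackName: string;
}

interface TerraformStack {
  kind: 'terraform';
  workspace: string;
}

type DeployableStack = CloudFormationStack | TerraformStack;

interface ServiceDeployment {
  serviceName: string;
  releases: Release[];       // what happens in the release stage
  stacks: DeployableStack[]; // how the service deploys into AWS or GCP cells
}

// Descriptor for one microservice; the generic pipeline-of-pipelines code
// loops over descriptors like this one and does all the hard work.
function cacheServiceDeployment(): ServiceDeployment {
  return {
    serviceName: 'cache-service',
    releases: [{ kind: 'docker', dockerfile: 'Dockerfile', ecrRepoName: 'cache-service' }],
    stacks: [
      { kind: 'cloudformation', stackName: 'cache-service' },
      { kind: 'terraform', workspace: 'cache-service-gcp' },
    ],
  };
}
```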

That means it's really easy to know where to go if we need to change something about how we're deploying. It also means if you're an engineer who cares a lot about what's happening in that repo, you only have one place to keep your eyes on to make sure you know when important changes are happening there. It allows us to reuse deployment steps across all the different projects. I only have to build this code that creates a Docker release step one time and then I can just reuse that in the pipeline for all of the different services that need to do a Docker release. It gives us a single source of truth for the deployment order across all of our different pipelines.

This is another little data structure that exists in that repo. Basically, we've pulled all the cell information out of the account registry, and we have a variable for each of the cells that we want to deploy to. We divide them up into waves. The first wave, at the top, includes our AWS alpha cell and our GCP alpha cell; those are our pre-prod cells where we test things out before they go to prod. The second wave deploys to one prod cell, one that doesn't have a ton of traffic, so that if there's a problem there, we're not impacting that many customers. Then we just keep adding waves as needed that expand to the rest of the higher-traffic production cells. This little data structure is defined in one place in the code, and then it's used to build all of those pipelines out so that they all have the same exact set of stages, and they're deploying to the cells in the same order as one another. That's how we solve the deployment part: we use our account registry, CDK, AWS CodePipeline, and the standardization that we've come up with for the deployment pattern.
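In spirit, the wave structure might look like this; the cell names are placeholders, not Momento's real cells:

```typescript
// Illustrative version of the wave data structure, defined once and shared
// by every generated pipeline. Each inner array is a wave; waves run in order.
const deploymentWaves: string[][] = [
  ['alpha-aws', 'alpha-gcp'],           // wave 1: pre-prod cells
  ['prod-low-traffic'],                 // wave 2: one low-traffic prod cell
  ['prod-us-east-1', 'prod-us-west-2'], // wave 3: higher-traffic prod cells
];

// Flattening preserves the deployment order that every pipeline must follow
// (for example, to feed the stagesForCells sketch shown earlier).
const orderedCells: string[] = deploymentWaves.flat();
```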

Permissions

For managing permissions into and out of the cell, we rely pretty heavily on AWS SSO, which I believe they have renamed to IAM Identity Center. If you haven't seen this before, this is the splash page for SSO. If you're using AWS Organizations and you have a whole bunch of different accounts that you want to get into, you can come to this splash page, and it'll show you all the accounts that you have access to and what roles you have, and you can log into individual accounts. You can see here, we've got our cell accounts and also our developer accounts. We were able to tie this into our Google identity, so our developers actually access this page through their Google SSO identity, which is pretty convenient. This also works on the command line and in the SDKs, too; it isn't limited to the web. This is a really handy tool. It also has APIs, so you can automate the management of these roles. This is an example of what it looks like if you go into the management screen for one particular account. For this cell account, we can see which users have access to it and what roles they have. We have a read-only role, and then we have a cell operator role that has a little bit of extra permissions for developers to access logs and things like that. The account registry has all of the information about all the cells and all the developers, and that's all we really need in order to automate the management of the permissions for all of the cells, both inbound and outbound. For outbound, we can just loop over the cell accounts in our CDK code, get the data from our account registry, and set up all the permissions that we need to give them access to the ECR images, or to a private VPC, or whatever. In the other direction, we can loop over all the developers in the cell registry and give them all access to the logs and such in the cell account. That's all we really did for permissions.
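Here's a sketch of those two loops in CDK TypeScript. Note the inbound example uses a plain IAM role as a simplified stand-in for the IAM Identity Center permission sets the talk describes; the function names and the ReadOnlyAccess policy choice are assumptions:

```typescript
import { Construct } from 'constructs';
import * as iam from 'aws-cdk-lib/aws-iam';
import * as ecr from 'aws-cdk-lib/aws-ecr';

// Outbound: in the shared tooling account, let every cell account pull the
// service's Docker images from ECR.
export function grantCellsPull(repo: ecr.IRepository, cellAccountIds: string[]): void {
  for (const accountId of cellAccountIds) {
    repo.grantPull(new iam.AccountPrincipal(accountId));
  }
}

// Inbound: inside a cell account, a role that developers can assume to read
// logs and metrics.
export function createCellOperatorRole(
  scope: Construct,
  developerAccountIds: string[],
): iam.Role {
  return new iam.Role(scope, 'CellOperatorRole', {
    assumedBy: new iam.CompositePrincipal(
      ...developerAccountIds.map((id) => new iam.AccountPrincipal(id)),
    ),
    managedPolicies: [iam.ManagedPolicy.fromAwsManagedPolicyName('ReadOnlyAccess')],
  });
}
```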

Monitoring

For monitoring, this one's really simple; there's nothing fancy about what we've done here. The key thing is, you need to make sure that your cell name is a dimension on all of your metrics, and you need a way to centralize your metrics across multiple accounts, so you're not having to go look in every account to see what the metrics look like. If you're using CloudWatch metrics, there's a way to set up a central CloudWatch account that pulls metrics in from other accounts. We're not using CloudWatch metrics here; it wasn't the best fit for our use cases. There are tons of third-party solutions for this as well: Datadog, New Relic. We're using Lightstep. As long as you have a way to configure the code in your cells to emit the metrics data to a common location, and that metric sink has a way for you to group by the cell dimension, which they all will, you're covered. This is what one of our Lightstep dashboards looks like. In this screenshot, I've moused over a traffic spike from one of our services. I can see in the tooltip that it was in our us-east-1 cell, because I emitted that as a dimension that I'm able to group by here. In the background, you can see some faded lines that Lightstep grayed out because I was focused on this one; those are traffic data from the other cells. Nothing super fancy here. Those are the five things that we were trying to achieve, and those are the patterns that we use to solve those problems at Momento.
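Since the talk mentions OpenTelemetry gateways, here's what the key requirement, cell name as a dimension on every metric, might look like with the @opentelemetry/api package; the meter name, counter name, and CELL_NAME environment variable are assumptions:

```typescript
import { metrics } from '@opentelemetry/api';

// The key requirement: every metric carries the cell name as a dimension,
// so a central sink (Lightstep, Datadog, etc.) can group by cell.
const meter = metrics.getMeter('cache-service');
const requestCounter = meter.createCounter('requests', {
  description: 'Requests handled, tagged with the cell that served them',
});

export function recordRequest(route: string): void {
  requestCounter.add(1, {
    // Assumed to be injected into every cell's runtime environment.
    'cell.name': process.env.CELL_NAME ?? 'unknown',
    route,
  });
}
```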

Additional Benefits

A few additional side benefits I want to mention. The ability to add new cells as quickly as what I described with the cell bootstrap script is a pretty big thing for business agility. Say a big customer comes in and says, we really want to use you, but we have a production workload that needs to go in this particular AWS region this week. If you don't already have a presence in that region, and we don't have a presence in every region because we're a startup, the ability to say, yes, we can do that for you, by spinning up the cell in a matter of hours, is huge for business agility, and it might win us a deal that we couldn't win otherwise. At some of the previous places I worked that did have cellular architecture, even AWS, the way we deployed new cells was with a spreadsheet that had 200 lines of manual steps that you needed to perform to build up the new cell: go into this account, create a new Dynamo table, whatever. Whenever we had a new cell that we needed to bring up, because a customer was asking for it, some poor soul on the engineering team would get handed this spreadsheet and be told, this is what you're doing for the next two weeks. They would have to walk through all these manual steps one at a time. Every time that happened, one of these steps would fail somewhere, and they'd have to back up five steps and redo some things to get to the finish line in two weeks. This power is really valuable, especially if you're a smaller company where agility can make a big difference. Likewise, it means that if we have a big customer whose load we know is going to be really big, and we want to make sure it's isolated from other customers, we can spin up single-tenant cells really quickly. Or if they need something like a private link to their VPC, we can spin that up and not have to do it in a public cell.

Then this is my favorite benefit. This pattern allows your developers to spin up a whole copy of a cell in their own dev account. Sometimes when you're trying to debug something that has to do with an interaction between two different components of your infrastructure, it can be almost impossible to do that on your laptop. Really the only place you can debug some of those kinds of things is in an actual working environment. A lot of companies try to solve that by having a big shared dev environment that everybody can poke around in together. Inevitably, when you take that approach, two developers start trying to use it at the same time, each breaks something the other was trying to use, and neither of them even knows it's happening. They're just wondering, why is my code not working? Then you've wasted three days of them accidentally stepping on each other's toes unknowingly. With this pattern, we can spin up a small version of our cell in a developer account really quickly, let them test and debug their stuff, and then just tear it down and throw it away when they're done, which is really powerful. There's no one-size-fits-all for this. For all of the different kinds of tooling that I've talked about, the things we chose to build our solution at Momento, there are tons of different options out there. The major cloud providers all have an analogue to everything I described from AWS, and there are tons of cool third-party solutions for all this stuff as well. You can pick what fits best for your environment.

Key Takeaways

Then, the key takeaways I wanted to hit. Cellular architecture can really benefit your customers in terms of availability and making sure that you're hitting your SLAs. It's also really valuable for your business's agility and your engineering velocity. Automating this stuff really only requires solving the few key problems that we went over, plus a little bit of work to standardize some things across your application components. The automation is somewhat simpler today, thanks to the changes that have been happening in the infrastructure as code space, as long as you take those opportunities to standardize a few things about how you define your components. Again, there's no one-size-fits-all. That's true not only in terms of which tools you pick, but also how deeply you want to invest in this. You don't have to invest to exactly the same depth as what we did here; you could do a subset of this, or you could do more. You've got to find the right fit for your company and your business. Hopefully, this was a decent pattern to help you think about some ideas of where you could apply some of these things in a way that makes sense for you.

Questions and Answers

Participant 1: Can you talk to us a little bit about the routing layer and more specifics on what you do there, [inaudible 00:44:34].

Price: The routing layer is typically going to be some code that you have to write, because you have to have a way to identify, when a request comes in, what customer it's associated with, and then you have to make a decision about which cell that customer goes to. If your app is HTTP, then you're going to have a little web server there. Maybe your customer ID gets embedded as a query parameter in the URL or some other part of the request. Your routing layer is going to have to parse just that much of the request to figure out what customer it is. Then maybe you have information in a database that tells you which cell that customer belongs to. You're going to have to make sure that layer is fast, so you're going to want to cache things from that database so the requests can get routed really quickly. That's the general pattern. It's very business specific, so there aren't really any off-the-shelf tools that'll solve that problem for you.
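As a sketch of that general pattern, here's a thin HTTP routing layer in TypeScript with Express. The customer ID location, the database stand-in, and the cell URL shape are all illustrative assumptions:

```typescript
import express from 'express';

const app = express();

// In-memory cache so most requests never touch the routing database.
const cellByCustomer = new Map<string, string>();

// Stand-in for the database lookup that maps customers to cells.
async function queryRoutingDatabase(customerId: string): Promise<string> {
  return 'https://cell-1-us-east-1.example.com';
}

async function lookupCell(customerId: string): Promise<string> {
  const cached = cellByCustomer.get(customerId);
  if (cached) return cached;
  const cellUrl = await queryRoutingDatabase(customerId);
  cellByCustomer.set(customerId, cellUrl);
  return cellUrl;
}

// Parse only enough of the request to route it, then hand it off to the cell.
app.use(async (req, res) => {
  const customerId = req.query.customerId;
  if (typeof customerId !== 'string') {
    res.status(400).send('missing customerId');
    return;
  }
  const cellUrl = await lookupCell(customerId);
  res.redirect(307, `${cellUrl}${req.originalUrl}`);
});

app.listen(8080);
```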

Participant 2: Do you see any significant risks or single points of failure with the pipeline of pipelines?

Price: No, not really. There are a couple of risks associated with it, but they're not really of the single-point-of-failure nature. What happens with the pipeline of pipelines is that it works in an iterative fashion: it walks through each of the components that make up your service and rebuilds the pipeline for each one. I don't think there's a way you could write your CDK code that wouldn't work this way. If it fails at some point in there, it will stop, and that will have left all of your other pipelines in their previously existing state. If they were working before, they'll keep working. You may have temporarily broken one of your pipelines if you had a bad code change there, but, again, that doesn't impact your cell; it just impacts your ability to deploy to your cell. You do have to rush to fix that, so that in case you need to urgently deploy something you've got the means to do it. But it's not going to cause an outage for your customers, because it's only managing the deployment lifecycle, not what's actively going on in the cell.

 


 

Recorded at:

Apr 10, 2024
