
Lessons from Leading the Serverless First Journey at CapitalOne



George Mao discusses their journey into serverless, the best practices they picked up, the lessons learned along the way, and the optimizations for Lambda.


George Mao is a Senior Distinguished Engineer at Capital One. He is the Lead DE for Capital One's Serverless strategy and leads the effort to transform the company into a serverless-first organization. He leads the Serverless Center of Excellence and is responsible for setting enterprise patterns, best practices, and the developer experience.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Mao: I currently work for Capital One, where I lead our Serverless Center of Excellence. I'm the Senior Distinguished Engineer for our serverless program, and I'm responsible for helping Capital One adopt a serverless-first mindset. Before I joined Capital One, I spent about seven years at AWS, where I was the head of solutions architecture for serverless computing. I had the opportunity to help see the launch of many of the services that are under the serverless umbrella: Lambda, API Gateway, and a lot of the other services that you might be using. This is what we're going to be covering. In this talk, we'll start at a high level and talk a little bit about Capital One, who we are, and why we're adopting serverless. I'll talk about the technology and the things that we've learned, and share some lessons that have hurt us a little bit, so you can take them back and hopefully avoid the problems that we went through. Then I'll leave you with a bunch of best practices that you can use when you get home.

Capital One - Who We Are

Who is currently operating production workloads on any cloud provider? Who's operating on AWS? We have a very large AWS presence. Who has built and operated a production workload using serverless on AWS Lambda? Capital One is one of the big banks in the United States, one of the top 10. We're unique, because we're really the only Fortune 100 bank that has gone all in on a cloud provider. In 2020, we completed the migration out of our last data center and moved everything into AWS. That was a really tough journey. What it got us was agility. Now the journey that we're on is modernizing our tech stack: moving the old infrastructure to a more microservice-based architecture and to a lot more AWS cloud-native technologies. This is a little chart of where we are and where we're headed. In 2020, we closed the last data center. In 2021, we decided to be a serverless-first organization. Then, in 2022, we started to become a platform organization as well. Now we're starting to sell software as part of our business model, instead of just being a bank. That's a little bit about who we are. We currently have about 10,000 engineers working on this stuff on a daily basis.

Why Is Capital One Going Serverless?

First section, I want to talk a little bit about why we're going serverless. In order to talk about that, this is the golden standard for an AWS architecture for operating on EC2 in the cloud. It's basically, you pick EC2 instances, whatever type you need, and you deploy them to multiple availability zones. AZs are basically physical data centers that are separate in cases of disaster, and one does not impact the other. Then you autoscale these EC2 instances. You have to set up your autoscaling policy to match your traffic. Then you configure all kinds of stuff, like networking, you have to deploy your images to these instances, security, and handle everything on your own. Amazon handles all of the hardware and physical aspects of these instances. At the end of the day, these types of workloads are generally waiting for work. They're up and running all the time, they're elastic, but they're waiting for work to arrive. In general, in a pre-serverless architecture, you run on the cloud, could be any cloud provider. In this example, it's the AWS Cloud. In a pre-serverless architecture, normally, what happens is, you choose a cloud provider, on Amazon, you have to create your EC2 deployment fleet. When you're doing that, you have to create instances, you have to choose and create your autoscaling policy. You have to set up the networking. You have to configure your ACL rules. Then you have to configure the AMI, which is an Amazon Machine Image that lands on that instance, and then it deploys whatever software that you need.

Basically, along in that same process, you have to choose the operating system that you're running on those machines. Most of the time, it's Linux based. Sometimes we have Windows based deployments as well. Then our engineers typically deploy containers on top of your operating systems. Then app servers go on top of that. Then finally, your business applications. These are the applications that your customers actually use and care about. In the AWS world, they have a terminology called undifferentiated heavy lifting. If you've worked with AWS before, they always say this term, undifferentiated heavy lifting. At Capital One, we call it arbitrary uniqueness. Basically, all of those things at the bottom are things that really don't add any value, the things that you have to do in order to even get to the top of the stack, which is your business application. Developers don't really care about the stuff at the bottom. I spent about 15 years as a Java developer in the J2EE world. When they told me, I needed to configure some big application server running something on a data center, I had no idea what the spec should mean, I just randomly guessed.

When you move into the serverless world, the whole thought process changes from wait for work to something that's event based. A typical serverless deployment or an application is made up of one of three different event models. It could be a synchronous event, which is, for example, a load balancer drives some traffic to a lambda function, or an asynchronous event, which is something that's happened in your environment, and you have to respond to it. For example, every one of you who is taking a picture right now, that picture, if you want to post it on to social media, it may land in an S3 bucket. Then that social media provider, they're not taking that 10-Meg image that you just took, and then sending it to millions of people. They're processing that image and they're optimizing it for mobile, for computers, for web, whatever it is, and then they're distributing it out. It's reacting, asynchronous reactions to things that we're doing. Then the last thing is poll-based. When you go to, and you submit an order, the order doesn't happen instantly, that's because they're storing it in a durable datastore, and they're processing it later. When Amazon scales up for Prime Day, it doesn't matter how many orders are being submitted, they can process those orders later on. It decouples really tight systems, and it creates better architectures. This is the serverless world.
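
To make the three event models concrete, here is a minimal Python sketch that guesses which model produced an incoming Lambda event by looking at its shape. The function name `classify_event` and the returned labels are assumptions made for illustration. The checks rely only on the `eventSource` field that S3 and SQS records carry and on the `requestContext.elb` block that an Application Load Balancer adds; a real function would normally be wired to one event source and would not need to guess.

```python
from typing import Any


def classify_event(event: dict[str, Any]) -> str:
    """Guess which invocation model produced this event from its shape."""
    records = event.get("Records")
    if isinstance(records, list) and records:
        source = records[0].get("eventSource", "")
        if source == "aws:s3":
            # S3 notifies Lambda after an object lands in a bucket.
            return "asynchronous"
        if source == "aws:sqs":
            # Lambda polls the queue and hands the function a batch.
            return "poll-based"
    if "elb" in event.get("requestContext", {}):
        # A load balancer is waiting on the response.
        return "synchronous"
    return "unknown"
```

Anything the sketch does not recognize falls through to `"unknown"` rather than being forced into one of the three models.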

Then when you work in serverless architectures, all that stuff from the previous slide at the bottom is just in the cloud. When I talk about serverless, we're generally talking about serverless compute technologies, which is lambda, and then Fargate. When you're working with lambda, the first thing that your engineers touch is application code. They only touch application code. They don't have to worry about any application servers. There's no networking configuration. They're not even configuring CPU. All they're saying is, I need this amount of memory and I deploy this code to my lambda functions. AWS will scale those lambda functions widely, as much as your application traffic needs. This is basically the same concept across all three major cloud providers. Azure and Google Functions as a Service are exactly the same thing. They're designed to be event driven. They're designed to be small. They're designed to scale horizontally very quickly. The developer just starts with business logic. What this means is that our developers are a lot more productive, because they no longer have to spend 50% of their time worrying about all of those blocks at the bottom. Everything at the bottom becomes AWS managed. If there's a problem with the infrastructure, AWS will self-heal and load balance across their infrastructure. A really good example is, a few years ago, does anybody remember the Intel hardware bug called Spectre and Meltdown? Those two things, when they came out, a lot of us, if you were on paging duty, or whatever, you might have been paged, because these were severe vulnerabilities. Any customer running a lambda function basically just sat and watched Twitter, because AWS handled everything for you. You didn't have to do anything. That's the power of moving to a serverless architecture.

The Benefits of Serverless

Just at a very high level, these are the three main benefits that we really believe in for serverless. At Capital One, we have something we call Run The Engine, RTE. That's all of the heavy lifting that goes on behind the scenes: patching, managing, scaling, vulnerabilities, networking. Everything that we have to handle when we're deploying EC2 instances, we want that to be as low as possible. With lambda functions, it's almost zero. The only thing that we're managing is application library vulnerabilities, which is unavoidable. Developers should focus on building applications, not managing infrastructure. That's our core reasoning for going serverless. Second, you can usually lower your infrastructure cost, because the pricing model with serverless changes to pay for use instead of pay for idle. That's really important, because in my experience at AWS, most customers aren't optimizing EC2 properly. Even if you are, there's going to be idle capacity somewhere. If you're utilizing your EC2 instances' CPU at 30%, you're wasting 70%. Even if you get that up to 70%, you're still wasting 30%. You can't really go any higher, because you don't want your EC2 instances maxed out all the time, because that will degrade performance. Finally, serverless forces us to become microservice oriented. You cannot deploy a gigantic monolith onto a lambda function, so we have to break things apart along both security and functionality lines.

Lessons We Learned

I want to jump into stuff that's a little bit more technical and lessons that we learned that hopefully can help you if you're trying to make a serverless change. Capital One operates AWS at a massive scale. I've worked with a lot of customers in my time at Amazon, and we're operating thousands of AWS accounts. Across all of them, we have tens of thousands of lambda functions. For those who have built a lambda function, it's very easy to build and roll them out. It's very easy to get them into production, you can do it within a minute. You can just go to the console, deploy a function, write some code, and it goes to the cloud. All of those boxes at the bottom are really tough challenges that we have to handle. Being a bank, we have very clear compliance and controls and regulations that we have to follow. Those are mandatory. Those aren't things that are optional for us. People in different industries might have a little bit lower levels of regulation and have fewer controls you have to build. For example, our lambda functions are not publicly invokable. None of them are publicly invokable. That's a requirement and a standard. We have to set standards. You have to determine where your metrics and logs go. How long you keep them. You have to understand how to do maintenance. When you roll something out, it doesn't have a vulnerability and the vulnerability is detected a week later, then what do you do? What is your response model for that? When vulnerabilities roll out, the CVE levels are 10, 9, 8, all the way down to 1. Every level should have a different response. These are all things you have to worry about, across thousands of accounts, millions of lambda functions. This is the challenge that we were faced with as we grew into our serverless world.

At Capital One, we operate multiple distinct lines of businesses. I work for retail bank, and retail bank build all of the software that customers use to interact with the bank. The mobile app, the web app to pay your bills, deposit your checks, the cafes that you might go to, all of that software is built by retail bank. There are other kinds of businesses, you might have a credit card, the Capital One Venture Card. That is the card business. Early on in our journey, we were making decisions siloed in each line of business, and we didn't understand how those decisions would impact other lines of businesses. What we did was we created a Center of Excellence. That Center of Excellence has representatives from every line of business. Our charter is to basically set standards that work for everybody, and then write these standards, set guidance, create examples, and also deploy subject matter experts across the company. When you switch from a traditional development model to a serverless model, you're going to need experts who can help unblock problems and challenges for your engineers. Then, it just reduces our risk for making a bad decision. It also reduces tech debt. If those thousands of lambda functions, everybody's doing them differently, configuring them improperly, at some point, you're going to have to go back and fix them. The Center of Excellence is intended to help improve all of that experience across the board. We also advise our senior leadership on direction and which services that we should adopt.

This is the most important thing for us: there's going to be a learning curve when you move into lambda development. Early on in the lambda world, you can write lambda functions directly in the console. Has anybody done that? You go right into the AWS console, and you can write in any language that's not one of the compiled languages. Lambda supports seven different runtimes, basically every runtime that's out there. The non-compiled ones can be written directly in the console, and the code is saved and deployed to AWS in less than a minute. The question is, if a beginner serverless developer does that, how do they debug and test the lambda function? There's no way to do that. You don't have an IDE, and you can't do remote debugging. What you end up doing is writing console log statements all over the place, running your function, and reading the logs. That's how you're doing debugging. Print statements go everywhere. This isn't really the right way to do it. This is the biggest learning curve that your developers are going to work through, and you're going to need a local environment to help you debug, deploy, and iterate locally before you touch AWS. If you're on GCP, they have the same thing: they have local emulators, as does Azure. You're going to need new tools.

In order to make this happen, the tool that we selected was the AWS Serverless Application Model, or SAM. If you've been to re:Invent before, you might have seen them get on stage with like a gigantic squirrel. That's their mascot. SAM provides two features. The first one is a CLI. That CLI allows you to run your lambda functions locally, and emulate exactly what you get on AWS. Obviously, you're not going to get the IAM permissioning, or the performance emulation on AWS, but you'll get everything else. I'll try to demo that in the next couple slides. Then you have CI/CD. SAM can actually deploy your application to the cloud on your behalf. Behind the scenes, it's just CloudFormation. You can write, build, and deploy, and sync your changes straight to the cloud, without ever looking at the console. That's super important, because when you get to big scale, none of your deployments should actually be manual, and none of them should be through the console. If that's happening, human error is going to happen somewhere. This is an example of a shorthand notation for what a SAM template will look like. All SAM configurations start with a CloudFormation template, where you define your resources, and there's a shorthand syntax for CloudFormation. For those who have worked with CloudFormation, you know how verbose it is. The same template in CloudFormation right here, will probably be maybe 10 pages long. This one is about 20 lines long, and it creates 2 lambda functions, one called foo function, and the other called bar function. You can see the runtime specified for both of these is node16, and then the memory for the first one, 128, and the same thing for the second one. That's it. This will literally create a lambda function. I can run both of them locally, and then test, iterate, and eventually deploy to the cloud.

This is Visual Studio code. This is generally our favorite IDE at Capital One. If I look at the template.yaml file, you can see at the top on line two right here, if you see a directive called Transform, this basically tells CloudFormation that this is a SAM template and it needs to be transformed by AWS before deployment. The transformation is handled by AWS, so you don't have to worry about any of this stuff. If we look down here, under the resources block, line 11, we're creating a single serverless function. That serverless function has code base underneath the source folder right here. It uses the runtime, nodejs16 runtime. We're specifying 256 Megs of memory. Who knows the largest memory size that a lambda function can specify? A lambda function today can deploy up to 10 Gigs of memory with your function. When AWS first launched it, I think it was 512 Megs. You can see how big they've scaled out their lambda infrastructure. Basically, this is it. In 30 lines of code, I have a lambda function defined. It's got an event source, that's an API. It's actually going to be served by an API gateway. That's it.

In order to run this function, you can see the source code is right here, this is where my lambda function begins. This doesn't really do anything. It just does a bunch of logging. If I bring up my terminal at the bottom here, this is the SAM CLI, sam local invoke. I hit Enter. What this is doing is, behind the scenes, there's a Docker image that's running on my machine. That image is provided by AWS and it simulates the lambda environment. As you can see here, it generated a bunch of logs. These are the exact types of logs that this function would generate on AWS. What you get is a really nice output to see exactly what this would look like in the cloud. The most important part of any lambda log is the report line, as you can see here. We can see that this memory size for this function is 256 Megs, and the duration of that invoke was 301 milliseconds. If you're doing this in the cloud, you can see exactly the same results. Here, I can do it all locally. I highly encourage you to look at local emulation technologies for any of the cloud providers that you're using. This will really speed up your application development.
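
Since the report line is the part of the log worth reading, it can help to pull its fields out programmatically. The following is a minimal Python sketch, assuming the usual `REPORT` layout (Duration, Billed Duration, Memory Size, Max Memory Used). The function name `parse_report_line` and the sample line are made up for illustration: the 301 ms duration and 256 MB memory size echo the demo, while the request ID and the other values are placeholders.

```python
import re

# Matches the numeric fields of a Lambda REPORT log line.
REPORT_PATTERN = re.compile(
    r"Duration: (?P<duration_ms>[\d.]+) ms\s+"
    r"Billed Duration: (?P<billed_ms>\d+) ms\s+"
    r"Memory Size: (?P<memory_mb>\d+) MB\s+"
    r"Max Memory Used: (?P<used_mb>\d+) MB"
)


def parse_report_line(line: str) -> dict[str, float]:
    """Return the numeric fields of a REPORT line, keyed by field name."""
    match = REPORT_PATTERN.search(line)
    if match is None:
        raise ValueError("not a Lambda REPORT line")
    return {name: float(value) for name, value in match.groupdict().items()}


# Illustrative sample only; the request ID and memory-used value are placeholders.
SAMPLE = (
    "REPORT RequestId: 00000000-0000-0000-0000-000000000000\t"
    "Duration: 301.00 ms\tBilled Duration: 301 ms\t"
    "Memory Size: 256 MB\tMax Memory Used: 91 MB"
)

report = parse_report_line(SAMPLE)
print(report["duration_ms"], report["memory_mb"], report["used_mb"])
```

The same parser works on lines captured from `sam local invoke` and on lines read back from the cloud, because both emit the report line in the same layout.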

The next thing that we learned is that concurrency is going to be a new concept for everybody. In the old world, the only thing that we cared about for that traditional architecture was TPS, transactions per second, or RPS, requests per second. TPS, RPS drives how wide we need to scale that traditional architecture, how many EC2 instances we need, how big of a load balancer we need. In the lambda world, or any Function as a Service world, the new terminology is called concurrency. What concurrency represents is the number of concurrent instances of your function that are running at any given time. AWS generally doesn't care about how fast you're invoking, or how many TPS you're driving to your function, the major limit they care about is the concurrency limit. By default, you get 1000 concurrent functions per account. It's a soft limit, you can increase it. We have accounts up at like 30,000, 40,000 concurrency. This is just an example. When you invoke your function, functions are cold. They don't exist when they're not running. The first time you invoke your function is called a cold start. AWS has to warm that function up, bring in all of the code base that you wrote, and then run that code base. There's going to be a little bit of a delay, but every invoke after that is warm. It's going to be really fast. You'll see warm starts are generally hundreds of times faster than a cold start. You want to optimize for warm starts. In a production environment, you're generally not going to have that many cold starts because you have stable traffic that's causing these containers to stay warm throughout your business cycle. Most engineers transitioning to a serverless world don't really understand concurrency, and how concurrency impacts downstream applications. Let's say your lambda function, all of a sudden gets a traffic spike and scales up to 1000 concurrent functions, and the backend is a relational database. 
There is no relational database in the world that can handle 1000 concurrent new connections all bursting at one time. It doesn't matter what you choose, there's no database that does that. You just have to be careful about this and understand that concurrency is your main concept and main limit as well.
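
One common way to benefit from warm starts is to do expensive setup once, outside the handler, so that later invocations in the same execution environment reuse it. The Python sketch below illustrates the idea under stated assumptions: `_create_expensive_client` is a hypothetical stand-in for slow work such as opening a database connection, and `INIT_COUNT` exists only to make the reuse visible.

```python
import time

INIT_COUNT = 0  # only here to show how often setup runs


def _create_expensive_client() -> dict:
    """Hypothetical stand-in for slow setup, e.g. opening a connection."""
    global INIT_COUNT
    INIT_COUNT += 1
    return {"created_at": time.time()}


# Module-level code runs once per execution environment, i.e. on a cold start.
CLIENT = _create_expensive_client()


def handler(event, context):
    """Warm invocations reuse CLIENT instead of rebuilding it."""
    return {"client_created_at": CLIENT["created_at"], "inits": INIT_COUNT}
```

Reuse only holds within one execution environment. If a spike scales the function out to 1000 concurrent environments, each one still runs the setup once, which is how a database ends up facing 1000 new connections.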

In order to measure concurrency, AWS has a formula. It's just average requests per second, multiplied by the average duration in seconds. Look at three examples here. In the first one, you get 100 requests per second and each request runs for an average of half a second (duration is always measured in seconds). The short duration drives down your concurrency need: at 100 TPS with a half-second runtime, you only need 50 concurrency. With the same traffic, if your functions start taking longer to complete, let's say a second now, you need 100 concurrency for this function. As you ramp this up, let's say your function takes 2 seconds, you're now going to need 200 concurrency. You can see there are two parts to the formula, and you have to be careful about which piece of the formula is the one that's causing more concurrency.
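
The formula is small enough to check in a few lines of Python. This is only a sketch of the arithmetic from the three examples; the function name `required_concurrency` is an assumption made for illustration.

```python
def required_concurrency(requests_per_second: float, avg_duration_seconds: float) -> float:
    """Concurrency = average requests per second x average duration in seconds."""
    return requests_per_second * avg_duration_seconds


# Same traffic (100 requests per second), three different average durations.
for duration in (0.5, 1.0, 2.0):
    print(duration, required_concurrency(100, duration))
```

With traffic held at 100 requests per second, the three durations give 50, 100, and 200 concurrency, so a slowdown in the function alone is enough to quadruple the concurrency it consumes.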

I talked about development standards. I think every organization should be setting development standards from the very beginning, so that all of your developers have something to work off of. These are just some examples of what we do. The first one is that lambda has a feature called aliases. An alias is just a pointer to your function. You can use this pointer anywhere in AWS. What we do is require all deployments to use a standard alias called live traffic. That alias must point to the live version of that function. All of our functions are version controlled. What that means is, I can jump into any development team across the entire Capital One ecosystem, and I can understand where the function starts, where the entry point begins. If there's any problem, any rollback that's needed, this is where we start. That's super important for us. Next, there's something called tagging. Tagging is basically metadata that you can attach to any resource. It's just key-value pairs. Make sure you have a tagging standard. What tags should capture is: who owns this resource? Who gets paged when something happens to this resource? Potentially billing as well, because you may have multiple applications running in the same AWS account, and you want to be able to bill to the correct finance department. The most important thing is security. In AWS, security is governed by IAM, which is Identity and Access Management. You never want your engineers to be able to directly access IAM, because that allows them to create policies that are non-standard. If you look at AWS documentation, most of the docs just have wildcards everywhere, like give me wildcard permissions to all of S3. That's not something that we allow. We actually use something called Cloud Custodian to govern all of these things. We've created an API on top of IAM that all developers go through in order to create, update, or delete any IAM policy. We really don't allow wildcards in most places.
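
As an illustration of the kind of rule a policy gate might enforce, here is a minimal Python sketch that flags wildcards in the `Action` and `Resource` fields of an IAM policy document. It is not Capital One's API and not Cloud Custodian; the function names `find_wildcards` and `_as_list` are hypothetical, and a real check would need to cover more of the policy grammar (conditions, `NotAction`, and so on).

```python
from typing import Any


def _as_list(value: Any) -> list:
    """IAM allows a single value or a list; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def find_wildcards(policy: dict[str, Any]) -> list[str]:
    """List every Allow statement entry whose Action or Resource contains '*'."""
    findings = []
    for index, statement in enumerate(_as_list(policy.get("Statement"))):
        if statement.get("Effect") != "Allow":
            continue
        for key in ("Action", "Resource"):
            for value in _as_list(statement.get(key)):
                if "*" in value:
                    findings.append(f"Statement[{index}].{key}: {value}")
    return findings


# The documentation-style policy the talk warns about: wildcard access to all of S3.
too_broad = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": "s3:*", "Resource": "*"}],
}
print(find_wildcards(too_broad))
```

A gate like this would sit in front of policy creation and reject or escalate anything it flags, which fits the stance that wildcards are not allowed in most places.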

The next thing is, standardize your account management. We have thousands of accounts, and we use AWS Organizations to manage those accounts. When you do that, you can apply an organization-wide permissioning model across all of these accounts. You can set a safety threshold that says nobody can expose a lambda function to the public. Even if an account underneath that organization tries to, it'll be blocked by the SCP, the Service Control Policy. It's always better to create multiple accounts. Most cloud customers, early on in their journey, create one big account and do everything in that account. Or you might create three accounts, one for each environment, and do everything in there. That's just the nature of learning in the cloud. What we've learned is that limits apply at the account level, and blast radius applies at the account level. If you have thousands of applications sitting in one account, they all share the same limits, and any limit increase is shared too. Remember I talked about the 1000-concurrent lambda limit? If you spread your applications across 3 accounts, you now have 3000, right out of the box. What we do is vend every application team their own accounts, and we give them at least three: one for each environment level, Dev, QA, and prod, and sometimes additional performance test accounts. When you do that, you end up with thousands of accounts, so make sure you're using Organizations. If you're not using Amazon-specific stuff to manage those, there are tons of third-party vendors that can help you with account management as well.

The last thing we learned is that, don't reinvent the wheel. Amazon provides tons of tools and libraries called Powertools. These are proven to be used by tons of engineers around the world: logging, metrics, tracing, standardization for all of these things all come right out of the box with these tools. They're not pre-installed with lambda, so you have to download them. I highly recommend using these Powertools instead of writing your own. It can save you tons of time. Then, lastly, there's actually a training called The Serverless Certification. I highly encourage you to walk through these trainings Amazon provided, and it ends with basically a knowledge test, and you get this little badge. It gives you a foundation and a baseline for engineering knowledge for serverless. When you move into the cloud, one big change that I noticed was that developers now have direct power and control of the cost of their application. This is no longer something that you write and you throw over the wall to an SRE team where they manage the infrastructure. Your developers now manage the cost of that lambda function.

What is the lambda pricing formula? How does AWS price lambda? Anybody know? The lambda pricing formula is basically the amount of memory you configure for your function, times how long the function ran for. The key here is the orange right there. It's the amount of memory configured, not the amount of memory you use. In many other places, it's actual memory consumption. Let's say you configure 256 Megs of memory and you use 1 Meg: you get billed at 256 Megs. Be very careful about that. Every single invocation of lambda generates a log that looks like this. We saw it in the SAM demo earlier. There's a start and an end, but the most important one is going to be the report line at the bottom. That report line has all of this information. It tells you everything about that invoke: how long it ran for, how much memory it used. Then just a quick fact, lambda bills you at 1 millisecond intervals. It's the most granular billing that AWS offers on any service. This is an example of a real live runtime. You can see the memory size is 256 Megs, and we used 91 Megs. There's a little bit of waste here. However, memory controls CPU power as well. In some cases, if you have a CPU-bound workload, it might be better to have more memory and run faster, and bring down the second component of that formula. This is something you're going to have to test. You're going to have to performance test. You have to find that optimization sweet spot. This isn't really science. You just have to test this and observe your performance. There's a tool called Lambda Power Tuner. Lambda Power Tuner helps you with this. It generates a graph. That red line represents performance: the lower the better. That blue line represents cost. As you can see here, the crossover is where this lambda function is probably optimal. You could actually get a little better performance as you move to the right. Moving to the right adds more memory. 
The performance increase is so minimal, and the cost starts skyrocketing. In most cases, it doesn't make any sense. Look up the Lambda Power Tuner, it's super valuable. We use it across the board, and it helps us bring down lambda costs. We've been able to tune many lambda functions and save 70-plus percent cost, for most of our application teams.
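
The pricing formula and the memory-versus-duration trade-off can be sketched in a few lines of Python. The per-GB-second rate below is an assumed illustrative value, not a quoted price, and the sketch covers only the duration part of the bill described in the talk; any separate per-request charge is left out. The function name `invocation_cost` is hypothetical.

```python
# Assumed illustrative rate in USD per GB-second; check current pricing for your region.
PRICE_PER_GB_SECOND = 0.0000166667


def invocation_cost(
    memory_configured_mb: float,
    billed_duration_ms: float,
    price_per_gb_second: float = PRICE_PER_GB_SECOND,
) -> float:
    """Cost = configured memory (GB) x billed duration (s) x rate.

    The memory term is what you configured, not what the function used.
    """
    gb_seconds = (memory_configured_mb / 1024) * (billed_duration_ms / 1000)
    return gb_seconds * price_per_gb_second


# Doubling memory costs the same if it halves the duration...
baseline = invocation_cost(256, 1000)
faster = invocation_cost(512, 500)
# ...and costs more once extra memory stops buying speed.
diminishing = invocation_cost(1024, 450)
print(baseline, faster, diminishing)
```

This is the relationship a power-tuning run measures empirically: at each memory setting it observes the actual duration, and the sweet spot is where added memory stops paying for itself in shorter runtime.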

The last thing that's important is, when you move into a managed service with AWS, you no longer have access to those EC2 instances, which means you can't log into anything. There's no agent that you can use to get metrics and logs and data. You have to rely on AWS-specific tooling to understand how your functions are performing. Make sure you use the correct CloudWatch metrics, and read those metrics properly. This is something that we had to learn as well. There are three most important metrics to look at with lambda. Invocations, which is how often your function is invoked. Errors, which is functions that have actually been invoked and returned an error. These don't overlap. This is a common misconception that our developers have. Then, throttles. You get throttled when you exceed your concurrency quota. Throttles is super important, because you want to understand if you're being throttled, that means there's work that's being thrown at lambda, but it's being throttled by AWS. You want to minimize that. None of these metrics overlap. That's super important. This is a screenshot of one of our production accounts. As you can see, the pattern is very much, follow the sun. This is a production workload in one of our big accounts. You can see it's just peaks and valleys every single day. I think the peak is about 120,000 invocations per minute. This is the invocation metric. Read this metric using the correct statistic too. AWS, when you go on the CloudWatch metrics, you can choose the statistic for every single metric. If you're reading the wrong statistic, you're going to get totally wrong information. Like in this case, invocations. I want to know how many invokes landed on my function, so I use the Sum statistic. If I were to use Max, or p90, or average, that wouldn't make any sense for this metric. Many of the first-time staff engineers that I work with, end up using the wrong statistic, and it drives their investigation completely down the wrong path. 
Be really careful about that. This is an example of concurrency. Lambda concurrency, you want to read this using Max, not Sum. This is also a screenshot of a production account. We generally peak at about 800 to 900 concurrency in a single account.
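
One way to keep engineers from picking the wrong statistic is to encode the pairing once and build metric queries from it. The Python sketch below only builds the parameters for a CloudWatch `GetMetricStatistics` request and does not call AWS. The names `STATISTIC_FOR_METRIC` and `metric_query` are assumptions for illustration. Sum for invocations and Maximum for concurrency come from the talk; using Sum for errors and throttles is an added assumption, on the basis that they are counts as well.

```python
from datetime import datetime, timedelta, timezone

# Which statistic gives a meaningful reading for each Lambda metric.
STATISTIC_FOR_METRIC = {
    "Invocations": "Sum",
    "Errors": "Sum",              # assumption: a count, read like Invocations
    "Throttles": "Sum",           # assumption: a count, read like Invocations
    "ConcurrentExecutions": "Maximum",
}


def metric_query(function_name: str, metric_name: str,
                 hours: int = 3, period_seconds: int = 60) -> dict:
    """Build GetMetricStatistics parameters with the matching statistic."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/Lambda",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": period_seconds,
        "Statistics": [STATISTIC_FOR_METRIC[metric_name]],
    }


# "order-processor" is a hypothetical function name.
print(metric_query("order-processor", "ConcurrentExecutions")["Statistics"])
```

With boto3 installed and credentials configured, the returned dictionary could be passed as keyword arguments to a CloudWatch client's `get_metric_statistics` call.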

Best Practices

I want to leave you with a bunch of best practices that you can take home and hopefully apply in your environment. All of these should apply to non-AWS workloads as well. The concepts are pretty much the same. These are the top critical things that you don't want to do when you're working with serverless. Number one is, if you're setting max configurations for your functions, that's generally a bad practice. Like, 15-minute timeouts, 10 Gigs of memory. If you need these, you probably need to rearchitect some of your design patterns with lambda. Let's say you set a 15-minute timeout. Remember, the pricing model is execution duration times memory configured. If you have a problem with that lambda function and it actually causes a timeout, for example when it needs to read from an RDS database, that generally happens within 30 seconds. If you have a 15-minute timeout, that function could theoretically keep running up to the 15-minute mark, and now you've just paid for an extremely expensive invocation on lambda. Multiply that across millions of invocations, and you're going to end up with a huge bill. I've seen a lot of customers do this accidentally when they have network configuration problems. They end up with a gigantic lambda bill that they have to deal with. Number two is, there's an API called PutMetricData for AWS CloudWatch. This is basically, I want to write some data to the CloudWatch service so that I can create some metrics on it. That call is really expensive. Try to avoid that call. Instead, use something called EMF, Embedded Metric Format. AWS ingests that data and automatically creates those metrics on your behalf, without you having to make this API call. There are different types of concurrency with lambda. Just be careful, they're not the same. Provisioned concurrency, reserved concurrency, on-demand, these are different concepts that you want to be aware of. Number five is batch size. When you're working with any poll-based architecture, so let's say you have SQS storing orders from a cart, lambda will poll that SQS service for you, so you don't have to do any of that heavy polling. If you're setting batch size to 1, that means every single time it polls, it's pulling one record off at a time, and it's invoking your function with one record at a time. That gets really expensive. With the overhead to start a function, run it with one record, and then shut down, you're going to find performance delays. Instead, let AWS poll batches off of the queue, 10 at a time, 100 at a time, and deliver them to your function. Then the function should be able to process all of these records in a loop, in a single invocation, and avoid the cold start of starting a function.
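The batch processing described above can be sketched as a handler that loops over every SQS record delivered in one invocation. This is an illustration, not Capital One's implementation: `process_order` and the `orderId` field are hypothetical, and the `batchItemFailures` return value only has an effect if the event source mapping has the ReportBatchItemFailures response type enabled.

```python
import json


def process_order(order):
    """Hypothetical business logic; raises if the order is malformed."""
    if "orderId" not in order:
        raise ValueError("missing orderId")


def handler(event, context):
    """Process a whole batch of SQS records in a single invocation."""
    failures = []
    for record in event.get("Records", []):
        try:
            process_order(json.loads(record["body"]))
        except Exception:
            # Report only the failed message so the rest of the batch
            # is not redelivered (needs ReportBatchItemFailures enabled).
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

One design note: without the partial failure response, a single bad record would cause the whole batch to be retried, so larger batch sizes usually go hand in hand with per-record error handling like this.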
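For the Embedded Metric Format mentioned above as the alternative to PutMetricData, here is a rough sketch of what such a record looks like. Inside a lambda function, printing a JSON object of this shape to standard output is enough for CloudWatch to extract the metric. The namespace, metric, and dimension names are hypothetical.

```python
import json
import time


def emf_record(namespace, metric_name, value, unit, dimensions):
    """Build one Embedded Metric Format log record as a dict."""
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # milliseconds since epoch
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    # One dimension set containing every dimension name.
                    "Dimensions": [list(dimensions.keys())],
                    "Metrics": [{"Name": metric_name, "Unit": unit}],
                }
            ],
        },
        # The metric value is a top-level member keyed by the metric name.
        metric_name: value,
    }
    # Dimension values are top-level members as well.
    record.update(dimensions)
    return record


def lambda_handler(event, context):
    # Hypothetical namespace, metric, and dimension, for illustration.
    record = emf_record(
        "OrdersService", "OrdersProcessed", 1, "Count", {"Service": "checkout"}
    )
    print(json.dumps(record))  # the log line carries the metric
    return {"status": "ok"}
```

The function writes a log line instead of making a synchronous API call per data point, which is the cost and latency saving the talk is pointing at.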

Then, a couple things at the bottom there. A mistake that many customers I've worked with make: CloudWatch Logs can be an event source for lambda. Every single time a log is generated, it can invoke a lambda function. If you think about that in a production workload that's driving millions of customer interactions, you could be invoking your functions an enormous number of times. You're probably going to brown out your lambda service. Then, the last thing, 9 and 10 are just more deployment artifacts. You can actually deploy lambda using x86 architectures or Arm-based architectures. On paper, Arm-based architectures are 20% cheaper. Amazon prices them 20% cheaper than x86. Be careful about just switching to Arm, because Arm-based architectures are not faster than x86 in all cases, and you may actually run slower. What we do is we test both sides, depending on the workload. Then, based on the workload, some run better on Arm, some run better on x86. The key there is, if you don't have an Arm-compiled library, you're not going to run better than x86. You need to have an Arm-optimized library before you start down that path.
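The Arm versus x86 trade-off comes down to simple arithmetic, sketched below. It assumes only what the talk states, that Arm is priced 20% below x86 for the same duration and memory. The measured durations passed in would come from your own tests of both architectures.

```python
ARM_PRICE_RATIO = 0.80  # Arm priced 20% below x86, per the talk


def arm_is_cheaper(x86_ms, arm_ms, price_ratio=ARM_PRICE_RATIO):
    """True if the Arm run costs less than the x86 run at equal memory."""
    return arm_ms * price_ratio < x86_ms


def breakeven_slowdown(price_ratio=ARM_PRICE_RATIO):
    """How many times slower Arm can be before the discount is used up."""
    return 1.0 / price_ratio
```

With a 20% discount the break-even point is a 1.25x slowdown: a function that takes 100 ms on x86 still costs less on Arm at 120 ms, but not at 130 ms.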

Last slide here is our top best practices that we've found work for us. Number one, optimize your memory, and save up to 70% of your lambda cost. Number two, this is big here, CloudWatch Logs by default are retained forever. If you're doing performance tests, and you're generating gigabytes of logs, Amazon bills you at 53 cents per gigabyte in CloudWatch. We've seen many cases where the logging costs just keep getting bigger. Eventually your logs cost more than your compute. That's the last thing that you want to happen. Just set different expiration policies by environment. Make sure you're expiring logs very aggressively in development and performance testing. Then I'll just jump down to number five. Provisioned concurrency is a feature of lambda where you can say, I want to keep x amount of concurrent lambda functions warm at any given time. You might need that because you have workloads that need zero cold starts, basically. Make sure you're applying autoscaling for your provisioned concurrency allocation when you do that. That way, you're scaling that number up and down as your workload goes up and down.
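One way to picture the per-environment expiration policies is a small lookup that produces the arguments for the CloudWatch Logs PutRetentionPolicy API. The retention periods and environment names below are assumptions for illustration, not Capital One's actual settings, and `orders-fn` is a hypothetical function name.

```python
# Assumed retention periods in days; each value must be one that
# CloudWatch Logs accepts (for example 1, 3, 7, 30, 90).
RETENTION_DAYS_BY_ENV = {
    "dev": 3,
    "perf": 1,
    "prod": 90,
}


def retention_request(function_name, environment):
    """Build keyword arguments for the PutRetentionPolicy API call."""
    return {
        # Lambda writes to a log group named after the function by default.
        "logGroupName": f"/aws/lambda/{function_name}",
        "retentionInDays": RETENTION_DAYS_BY_ENV[environment],
    }
```

With boto3 available, the result could be applied with `boto3.client("logs").put_retention_policy(**retention_request("orders-fn", "perf"))`.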

Number seven, try to use local development as much as possible before you jump into deploying to AWS. It'll save you a lot of round-trip time with AWS. It'll actually save you cost as well. Eight is, use the right AWS SDK. Amazon's been around for 20 years, AWS specifically, something like that. They've gone through multiple major revisions of every single SDK. Generally, the first two or three versions of their SDKs were around before lambda existed. What that means is none of those are optimized for lambda. For example, if you're using the Java AWS SDK version 1, that is a gigantic SDK. It's like 30 Megs, I think. It cannot be modularized. You end up with the whole thing deployed in your lambda function, and lambda functions are capped at 50 Meg deployment sizes, if you're using ZIP packaging. You just use 30 Megs out of 50 Megs for the AWS SDK. They released version 2. Version 2 you can just package exactly what you need. If you're only working with S3, you only package with S3 libraries. Then, you can even choose the right HTTP client for your workload. There's three of them up there, and you can exclude the rest. Same concept for Node and Go as well.

Who's heard of SnapStart? SnapStart is a feature Amazon launched at re:Invent last year, and it's specifically for Java-based lambda functions. What it does is, they pre-snapshot your lambda function before you even use it, and then use a warm image of that snapshot to invoke every single time the function runs. The reason they did that is because Java has historically had the worst cold start performance of all the runtimes. It's because the JDK is not small. The JDK just takes time to boot, and it needs resources to boot. SnapStart's actually free. You can enable SnapStart on your Java-based lambda functions, and then you'll be able to basically decrease your cold start time by as much as 90%. Then last thing, set standards. Make sure you're enforcing these standards before you go big into serverless, because eventually it's going to be hard to pull back when you have tons of random stuff going on in your accounts.

When Not to Use Lambda

Participant: We didn't cover much about like, what are the kinds of functions that maybe you're running, you'd come across and be like, what is the lambda, and you really get whatever container, or [inaudible 00:47:25]. Can you speak to, you need to start reducing your teams down about when to not use lambda?

Mao: When should you not use lambda? I think the number one anti-pattern for lambda is going to be when you have anything that has high wait time. Remember the pricing model is duration times memory. If you spend 70% of your invoke just waiting for work, that's an enormous waste of resources. That's when you really want to think about using like a container instead. A good example is like, if you're polling off of a Kafka cluster and that poll is very expensive, and it takes time. You may want to think about using a Fargate task to do that instead. If you're using Kinesis, or any native AWS data source, they do the polling for you. You don't pay for the polling. They poll, and then they deliver that polled record to your function, and your function just works off that record. If you have anything that's not AWS native, that's sometimes when you want to look at a container.
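The "duration times memory" pricing model can be made concrete with a little arithmetic. The sketch below ignores the per-request fee, and the price per GB-second is an assumed list price that should be checked against current AWS pricing for your region and architecture.

```python
# Assumed x86 list price per GB-second; verify against current pricing.
PRICE_PER_GB_SECOND = 0.0000166667


def invocation_cost(duration_seconds, memory_mb, price=PRICE_PER_GB_SECOND):
    """Compute cost of one invocation: duration x memory x unit price."""
    return duration_seconds * (memory_mb / 1024) * price


def wasted_cost(duration_seconds, memory_mb, wait_fraction):
    """Portion of the invocation cost spent waiting rather than working."""
    return invocation_cost(duration_seconds, memory_mb) * wait_fraction


# A hung call that runs to a 15-minute timeout versus a 30-second one.
timeout_ratio = invocation_cost(900, 1024) / invocation_cost(30, 1024)
```

At the same memory setting, an invocation that hangs until a 15-minute timeout costs 30 times what it would with a 30-second timeout, and a function that waits for 70% of its duration is paying 70% of its compute bill for idle time.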




Recorded at:

Mar 11, 2024