The Right Way of Tracing AWS Lambda Functions

Aug 06, 2020 18 min read

Key Takeaways

AWS Lambda is a key ingredient of many cloud-native applications and use-cases
The nature of AWS Lambda requires special care for observability
Distributed tracing is all but necessary to succeed in running complex, Lambda-based applications
The distributed tracing needs of Lambda call for a comprehensive, drop-in, low-maintenance tracing solution

AWS Lambda is probably one of the defining technologies of the cloud-native shift in software development of the past few years. According to the official site:

“AWS Lambda lets you run code without provisioning or managing servers. You pay only for the compute time you consume.”

Initially starting in 2014 with support for Node.js, AWS Lambda now supports development and deployment of functions in a variety of programming languages including Go, Java, Python, Ruby and C#.

The launch of AWS Lambda introduced us to Serverless Computing (with Function-as-a-Service as the “compute” aspect) as a mainstream cloud-native paradigm. Similar offerings have since been provided by the other major Cloud platforms, such as Google Cloud Functions and Microsoft Azure Functions, as well as by open-source projects like Knative and Apache OpenWhisk.

More than half a decade after its launch, AWS Lambda is arguably still the most known and adopted serverless platform and, despite the dearth of precise figures on its adoption, it seems fair to say it has matured beyond the “early adopters” phase.

Nevertheless, AWS Lambda is still somewhat at odds with one of the other defining topics of the past few years of cloud-native: observability. Despite Lambda’s integration with CloudWatch for metrics and logs, and X-Ray for distributed tracing, it is still a considerable challenge to understand what is wrong with a function in production.

With an emphasis on distributed tracing, this article discusses best practices for gaining and leveraging observability into AWS Lambda functions, based on the use-cases that AWS Lambdas are leveraged for in today’s computing landscape.

Why is Lambda Seeing so much Growth?

With AWS Lambda, users define functions that are executed synchronously, for example to serve an HTTP request, or asynchronously, to react to events generated by other AWS services. The list of events that can trigger AWS Lambda functions is extensive and keeps growing, with some of the most adopted event types being:

CloudWatch events, which can be defined using simple rules that describe changes in AWS resources
S3 events, which are emitted when objects are created or deleted in S3 buckets
SQS events, which pass messages queued in SQS to Lambda functions for processing
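As a rough illustration of how one function can tell these triggers apart, the sketch below inspects the shape of the incoming event. It assumes the commonly documented payload layouts (S3 and SQS deliver batches under `Records` with an `eventSource` field, while CloudWatch events carry top-level `source` and `detail-type` fields); the function names `classify_event` and `handler` are illustrative, not part of any AWS SDK.

```python
from typing import Any, Dict


def classify_event(event: Dict[str, Any]) -> str:
    """Return a coarse label for the kind of trigger that produced the event."""
    records = event.get("Records")
    if isinstance(records, list) and records:
        # S3 and SQS both wrap their payloads in a "Records" batch.
        source = records[0].get("eventSource", "")
        if source == "aws:s3":
            return "s3"
        if source == "aws:sqs":
            return "sqs"
    # CloudWatch events are a single object with a source and a detail type.
    if "detail-type" in event and "source" in event:
        return "cloudwatch-event"
    return "unknown"


def handler(event: Dict[str, Any], context: Any) -> Dict[str, str]:
    """Hypothetical Lambda entry point that only reports what triggered it."""
    return {"trigger": classify_event(event)}
```

In a real function the label would typically select a dedicated processing routine, and it is also a useful attribute to attach to the span that records the invocation.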

But beyond what AWS Lambda can do, its defining characteristics are what its adopters no longer need to deal with:

No more infrastructure management: AWS Lambda automates the management of the infrastructure allocated to run functions, scaling computing resources up and down. As the load your application needs to serve grows on Monday morning, with people going back to their desks, AWS Lambda will automatically, behind the scenes, increase the number of instances of your function. And when the load goes down, at the end of the working day, under-utilized instances are automatically decommissioned. And the promise of AWS Lambda is that none of this is really something that should concern you as a developer.
No more fixed costs: Users pay only for the workload when functions are served up in terms of how much CPU time and memory is allocated, and incur no costs when there is no workload.

Does AWS Lambda live up to the above? Mostly. You can just give code to AWS Lambda to run for you only when needed, and you pay just for the time your functions serve workload (albeit rounded up to the next 100ms). However, this comes at the cost of some unpredictability in terms of performance: when AWS Lambda fires up an instance to serve the load, the first request coming through that instance will suffer a considerably higher latency, as the runtime needs to be initialized. This phenomenon is known as cold start, and it has led to creative solutions to keep functions “warm”. Eventually, Amazon began allowing you to pay to keep a certain number of instances warm, which arguably counts as both a fixed cost and developers minding the infrastructure, but it helps a lot when your Lambda workloads are very sensitive to latency spikes.
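One simple way to see cold starts in your own data is to flag them from inside the function. The sketch below relies on the fact that module-level code runs once per function instance, while the handler runs once per invocation; the variable and field names are illustrative assumptions.

```python
import time
from typing import Any, Dict

# Module scope: executed once, when a new function instance is initialized.
_COLD_START = True
_INITIALIZED_AT = time.monotonic()


def handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """Report whether this invocation was the first one served by the instance."""
    global _COLD_START
    was_cold = _COLD_START
    _COLD_START = False  # every later invocation on this instance is "warm"
    return {
        "cold_start": was_cold,
        "instance_age_s": round(time.monotonic() - _INITIALIZED_AT, 3),
    }
```

Recording the flag as a log field or span attribute makes it possible to separate latency caused by initialization from latency caused by the business logic.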

Top Lambda Use Cases

As with all versatile computing platforms, you can do a lot of different things with AWS Lambda. In practice, the most recurrent use cases are the following:

Prototyping and early-stage development: With no upfront infrastructure costs, AWS Lambda makes for a very attractive prototyping platform for new products and capabilities, especially in start-ups and smaller outfits that do not want to or cannot devote staff and money to maintaining virtual machines or persistent container deployments. Then, as products grow in maturity and their workloads become more predictable, there is a tendency to move away from AWS Lambda to more self-managed computing platforms, like EC2 or Fargate, for the following reasons:

Cost: if Lambda’s “scale to zero” is not needed by your workload, and you are willing to deal with scaling containers or virtual machines up and down on your own, there is a certain markup you pay AWS Lambda for its flexibility and on-demand nature that, at scale, becomes considerable.
Complexity: although nothing really prevents you from deploying large codebases to AWS Lambda (you have 250MB available for the deployment package, and that equates to a lot of code), functions should be relatively small and straightforward, especially because they are hard to observe and debug in production.


There also seems to be a growing interest in using AWS Lambda for machine-learning use-cases, especially in combination with AWS Sagemaker.

An interesting corollary of the fact that Lambdas are an excellent tool for business process and system integration is that (almost) no Lambda function is an island: Lambda functions, much more often than not, call out to other Lambda functions as well as systems that are not running on Lambda. These systems called from Lambda functions are either services managed by AWS or other customer systems deployed on some other AWS computing platform like EC2, ECS or Fargate, or even on-premises. The relevance of this is discussed later with regard to the requirements for distributed tracing of AWS Lambda functions.

The Rough Edges of Lambda

Every compute paradigm comes with trade-offs, and Lambda is no exception:

Hard to debug: Being “serverless” means, among other things, that you have fundamentally no access to the production infrastructure to debug when things inevitably go wrong. (A “server” of course exists, but you have no control over it, so to you it looks “server-less”.) Granted, AWS provides ways of running your Lambda code locally. The Serverless framework has local testing too. There are also interesting proofs-of-concept for attaching remote debuggers to Lambda functions (for .NET and Python, for example). However, the reality is that, when you have issues in production, the out-of-the-box functionality you have to debug is exceedingly limited and it tends to become a game of “Cloud Printf” (that is, add some more logging to CloudWatch, push a new Lambda version out and cross your fingers), which is not a fun game to play if you are scrambling to fix an outage. To make things worse, since the cost of one AWS Lambda invocation depends (among other things) on how long the function runs, bugs that get your AWS Lambda function stuck, like processing unexpectedly large database result sets, are both hard to debug and costly for your cloud budget. Which leads us to...
“Stateless” just means you pull state from somewhere else: Lambda, insofar as the runtime is concerned, needs functions to be stateless: you cannot rely on any one Lambda function to retain state from the processing of a previous request. It is, however, a very rare thing to have business logic that does not need state to process. Thus, most Lambda functions need to load some state information from other services, which may result in unpredictable execution times, and often also the Lambda functions need to store some state modification too. To be fair, input-output issues inside the AWS infrastructure seem rare, but programming oversights like “pull half the RDS database” are far less so.
Distributed complexity: complex scenarios often involve large numbers of Lambda functions loosely coupled to one another through events. Just have a look at what its authors describe as a “typical 100% Serverless Architecture in AWS”: that’s a lot of moving parts to keep track of. If you consider that a lot of Lambda functions operate asynchronously, finding out which functions were involved in serving which request and what went wrong can feel like trying to solve a million-piece jigsaw puzzle.
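The “pull half the RDS database” oversight mentioned above can often be caught early by putting an explicit cap on how much state a single invocation is allowed to load. The following is a minimal sketch of that idea; it uses an in-memory SQLite database purely as a stand-in for a real data store, and the table, the cap `MAX_ROWS` and the helper names are all hypothetical.

```python
import sqlite3
from typing import List, Tuple

MAX_ROWS = 100  # hypothetical cap; tune to what one invocation can afford


class TooMuchState(RuntimeError):
    """Raised when a query would return more rows than the function expects."""


def load_orders(conn: sqlite3.Connection, customer_id: int,
                max_rows: int = MAX_ROWS) -> List[Tuple[int, int]]:
    """Load a customer's orders, failing fast instead of running long."""
    # Ask for one row more than the cap, so overflow is detected without
    # ever materializing the full result set.
    rows = conn.execute(
        "SELECT id, total_cents FROM orders "
        "WHERE customer_id = ? ORDER BY id LIMIT ?",
        (customer_id, max_rows + 1),
    ).fetchall()
    if len(rows) > max_rows:
        raise TooMuchState(f"more than {max_rows} orders for customer {customer_id}")
    return rows


def make_demo_db(order_count: int) -> sqlite3.Connection:
    """Build a throwaway in-memory database with `order_count` orders."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, "
        "customer_id INTEGER, total_cents INTEGER)"
    )
    conn.executemany(
        "INSERT INTO orders (customer_id, total_cents) VALUES (?, ?)",
        [(1, 100 * i) for i in range(order_count)],
    )
    return conn
```

Failing fast with a descriptive error turns a slow, expensive invocation into one that is cheap and easy to spot in logs and traces.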

The distributed complexity inherent in many AWS Lambda architectures and the limited debugging capabilities of AWS Lambdas running “in production” on AWS require you to muster every little bit of observability you can get for your Lambda functions. This means, besides the obvious logs and metrics in CloudWatch, adopting distributed tracing for your Lambda functions.

Distributed Tracing to the Rescue

Distributed tracing has been an integral part of Application Performance Monitoring approaches since the early 2000s, and recently has grown to the forefront of the monitoring and observability discourse in no small part due to the OpenTracing API and implementations thereof by open-source projects like Zipkin, Jaeger, as well as unrelated projects like OpenCensus and HTrace. While the OpenTracing and OpenCensus projects have been discontinued, their successor, OpenTelemetry, is actively worked on.

Distributed Tracing in a Nutshell

The advent of microservices and cloud-native architectures has arguably made our distributed systems more distributed than ever before. Meanwhile, the single software components have grown smaller. The more moving parts you have working (hopefully) in unison to serve your workloads, the more visibility you need into their collective and individual behavior, specifically in relation to how end users are served. Some say that the work of a developer is more and more akin to that of a plumber, focusing on the “pipes” connecting the various microservices.

This increased distribution and interdependency is precisely why distributed tracing has grown to be so important and valuable. Distributed tracing is a monitoring practice in which your services collectively and collaboratively record spans that describe the actions they take in servicing one request. The spans related to the same request are grouped in a trace. In order to keep track of which trace is being recorded, each service must include the trace context in its own requests towards other upstream services. In a nutshell, you can think of distributed tracing as a relay race, the discipline of track and field sports in which the athletes take turns running and passing one another the baton. In the analogy of distributed tracing as a relay race, each service is an athlete and the trace context is the baton: if one of the services drops it, or the handoff between services is not successful because, for example, they implement different distributed tracing protocols, the trace is broken. Another similarity between distributed tracing and relay is that, while any single segment of the race can make you lose it, you need to be fast in every segment to excel.
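The relay-race analogy can be made concrete with a small, self-contained sketch. Two functions stand in for services, a dictionary stands in for request headers, and the header names `x-trace-id` and `x-parent-span-id` are made up for the example rather than taken from any real tracing protocol.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    trace_id: str             # shared by all spans of the same request
    span_id: str              # unique per recorded action
    parent_id: Optional[str]  # span of the caller, if any
    name: str


RECORDED: List[Span] = []  # stand-in for a tracing backend


def start_span(name: str, headers: Dict[str, str]) -> Span:
    """Continue the caller's trace if a context was passed, else start a new one."""
    trace_id = headers.get("x-trace-id") or uuid.uuid4().hex
    span = Span(trace_id, uuid.uuid4().hex[:16], headers.get("x-parent-span-id"), name)
    RECORDED.append(span)
    return span


def outgoing_headers(span: Span) -> Dict[str, str]:
    """The baton: the trace context to attach to every outgoing request."""
    return {"x-trace-id": span.trace_id, "x-parent-span-id": span.span_id}


def inventory_service(headers: Dict[str, str]) -> None:
    start_span("inventory.check", headers)


def checkout_service(headers: Dict[str, str]) -> None:
    span = start_span("checkout.handle", headers)
    inventory_service(outgoing_headers(span))  # successful handoff


def forgetful_checkout_service(headers: Dict[str, str]) -> None:
    start_span("checkout.handle", headers)
    inventory_service({})  # baton dropped: the callee starts an unrelated trace
```

Calling `checkout_service({})` records two spans with the same trace identifier, the second pointing at the first as its parent. Calling `forgetful_checkout_service({})` records two spans in two different traces, which is exactly what a broken trace looks like in a tracing backend.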

Distributed Tracing of AWS Lambda Functions

Before discussing what is available for tracing AWS Lambda, let’s discuss the functional and non-functional requirements a distributed tracing solution should fulfill:

Runtime polyglotism: AWS Lambdas can be written in a variety of languages, the most commonly adopted being Node.js and Python, but there are many more, like Java, Ruby, .NET Core and PowerShell. Similarly to what happens with microservices (and because a Lambda function is effectively a very micro microservice), teams pick the language they deem most fit for the task, because of libraries and SDKs they want to adopt, as well as the team’s familiarity with the language. What I have heard from pretty much every single Lambda adopter I have interacted with is that they use at least two different AWS Lambda runtimes.
Platform polyglotism: as stated before, no AWS Lambda function is an island. There are cases of architectures implemented exclusively as AWS Lambda functions, but in our experience they are a (rather exceptional) exception, not the rule. The value of the insights to be achieved with distributed tracing grows with the number of systems interconnected, which means that whatever distributed tracing framework you want to use with your AWS Lambda functions had better be usable in the rest of your infrastructure as well. This is especially true for the infrastructure consumed by Lambda, but the network effect applies indisputably to distributed tracing. Incidentally, this is the very reason for the existence of the W3C Trace Context specification, which aims at providing a measure of interoperability between distributed tracing implementations. As a side note, the languages you use in your Lambda functions may not be the same ones you use, say, in your “legacy” applications in your datacenter, which compounds the need for runtime polyglotism in Lambda tracing.
Low overhead: the overhead introduced by the distributed tracing implementation costs you in terms of latency for your end-users, but also directly in terms of the AWS bill: Lambda functions are charged by both CPU time and maximum memory allocation. No self-respecting distributed tracing implementation would add tens or hundreds of megabytes to the memory footprint of your Lambda function (although I have seen things on other platforms). CPU time, however, can be affected, especially because, due to the statelessness of Lambda, the tracing data must be sent to the APM solution before the Lambda function completes, often blocking the completion of the function until the tracing data is uploaded, or else risk losing that data.
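To give an idea of what interoperability through W3C Trace Context looks like in practice, the sketch below parses and produces the `traceparent` header defined by that specification: a version, a 32-hex-digit trace ID, a 16-hex-digit parent span ID and 2 hex digits of flags, separated by dashes. It handles only version `00` and ignores the companion `tracestate` header; the helper names are illustrative.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version 00: "00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>"
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(value: str) -> Optional[TraceContext]:
    """Parse a version-00 traceparent header; return None if it is invalid."""
    match = _TRACEPARENT.match(value.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero identifiers are explicitly invalid in the specification.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def child_traceparent(ctx: TraceContext) -> str:
    """Build the header for an outgoing call: same trace, freshly generated span ID."""
    span_id = secrets.token_hex(8)  # 8 random bytes = 16 hex digits
    return f"00-{ctx.trace_id}-{span_id}-{'01' if ctx.sampled else '00'}"
```

A function that receives an invalid or missing header would typically start a new trace rather than fail the request.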

I have thought long and hard about adding to the list above an entry called “Integration with other AWS services”. After all, Lambda functions are invoked asynchronously from events generated by other services, and synchronously from AWS’ API Gateway and Application Load Balancer. And having spans coming from API Gateway, for example, inside the same trace would help answer the question “Where is this latency coming from?”, which is truly as old as distributed systems. However, I have decided against it because the source of latency is seldom the AWS network between the load balancer in use and the Lambda function, but rather the AWS Lambda function itself, either in terms of its dependencies (blocking on or waiting for some long-running synchronous call) or its internal processing.

Types of Instrumentation

The collection of tracing data is performed by specialized instrumentation. Generally, instrumentation can be classified into two large groups:

Programmatic instrumentation is instrumentation that offers APIs to be coded against, such as OpenTracing or AWS’s X-Ray SDK.
Automatic instrumentation works by modifying your code and the frameworks you use to extract tracing data without need of additional code. For example, it is usually done via monkey patching in Node.js and in Java via bytecode manipulation.
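To illustrate the monkey-patching flavor of automatic instrumentation, here is a minimal sketch in Python. The `PaymentClient` class is a hypothetical stand-in for a third-party library, and `SPANS` is a stand-in for a tracing backend; real instrumentation deals with many more concerns (context propagation, asynchronous code, sampling), so this only shows the basic mechanism.

```python
import functools
import time
from typing import Any, Dict, List

SPANS: List[Dict[str, Any]] = []  # stand-in for a tracing backend


class PaymentClient:
    """Hypothetical library class; application code calls it unchanged."""

    def charge(self, amount_cents: int) -> str:
        return f"charged {amount_cents}"


def instrument(cls: type, method_name: str) -> None:
    """Replace cls.method_name with a wrapper that records a span per call."""
    original = getattr(cls, method_name)
    if getattr(original, "_traced", False):
        return  # already patched; avoid wrapping twice

    @functools.wraps(original)
    def traced(self: Any, *args: Any, **kwargs: Any) -> Any:
        started = time.perf_counter()
        error = None
        try:
            return original(self, *args, **kwargs)
        except Exception as exc:
            error = repr(exc)
            raise
        finally:
            SPANS.append({
                "name": f"{cls.__name__}.{method_name}",
                "duration_s": time.perf_counter() - started,
                "error": error,
            })

    traced._traced = True
    setattr(cls, method_name, traced)


# Done once at startup by the instrumentation, not by the application code.
instrument(PaymentClient, "charge")
```

After the patch is applied, `PaymentClient().charge(500)` behaves exactly as before from the caller's point of view, while each call leaves a span behind.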

Notice that, in cases of programmatic instrumentation built into frameworks and libraries you consume, or services managed by a provider, like AWS RDS, the way they implement instrumentation feels to you like automatic, because it is code you do not need to maintain. And that is precisely the crux of the matter to me: good instrumentation works and it is code you do not need to own and maintain.

Delivery of Instrumentation

But how are you going to deliver your instrumentation to your production system so that it collects the data you need? It is again a gradient, from manual to automated:

Built-in instrumentation that is shipped with your code or in the runtime that runs it: whenever you deploy a new version of your function, the instrumentation code embedded in it follows. Programmatic instrumentation is mostly built-in, although some approaches have a split between API and implementation that requires additional dependencies to be available. Of course, there may be configuration required for built-in instrumentation.
Drop-in instrumentation consists in adding to the runtime some dependencies and configuration that activate the instrumentation. For example, instrumentation shipped by a Lambda layer and activated via configuration options like a custom wrapper script (also delivered via the Lambda layer) that delegates execution to your actual Lambda function handler. (This is, by the way, how Instana does it.)
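The wrapper idea behind drop-in instrumentation can be sketched as follows. The function is configured to invoke `wrapper_handler`, which looks up the real handler from an environment variable, delegates to it, and flushes tracing data before returning. This is an illustration of the general pattern, not any vendor's actual implementation: the variable name `WRAPPED_HANDLER`, the span format and the `flush_spans` stand-in are all assumptions.

```python
import importlib
import os
from typing import Any, Callable, Dict, List

PENDING_SPANS: List[Dict[str, Any]] = []  # spans collected during an invocation
FLUSHED: List[Dict[str, Any]] = []        # stand-in for the tracing backend


def flush_spans() -> None:
    """Stand-in for uploading spans; a real one would make a network call."""
    FLUSHED.extend(PENDING_SPANS)
    PENDING_SPANS.clear()


def load_target(spec: str) -> Callable[[Any, Any], Any]:
    """Resolve 'package.module.function' to the function it names."""
    module_name, _, function_name = spec.rpartition(".")
    return getattr(importlib.import_module(module_name), function_name)


def wrapper_handler(event: Any, context: Any) -> Any:
    """Entry point configured for the function instead of the real handler."""
    target = load_target(os.environ["WRAPPED_HANDLER"])  # hypothetical variable
    PENDING_SPANS.append({"name": "lambda.invoke"})
    try:
        return target(event, context)
    finally:
        # Flush before returning: once the invocation completes, there is no
        # guarantee that the instance gets another chance to send the data.
        flush_spans()
```

The `finally` block is where the overhead trade-off discussed earlier shows up: the upload happens on the invocation's clock, so it needs to be fast.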

In the same way that I think automatic instrumentation is superior to the programmatic kind, the less work you need to do to deliver the instrumentation, the better. In this light, one might think that programmatic instrumentation has its merits, as long as it is also built-in. In my experience, that is not quite so: work to maintain programmatic instrumentation is far more expensive than work to deliver drop-in instrumentation. In most cases, the delivery of drop-in instrumentation can be set up in the CI/CD pipeline, once and for all or with minimal maintenance needed. But programmatic instrumentation needs to keep changing with your code, and that is a cost you will pay as long as your code keeps changing. Lehman's laws of software evolution state, in a nutshell, that software needs to keep changing to remain useful. With programmatic instrumentation, the changes you need to keep your software useful may require you to adjust your instrumentation. In Lambda, the likelihood of changes requiring adjustments to the instrumentation is anecdotally higher than usual: many of the changes you apply to Lambdas that integrate software systems are about how you interact with those systems, and that often requires adjusting the tracing data you collect. This may keep the needed investment in programmatic instrumentation substantial over the lifetime of the Lambda codebase.

The Best Type of Instrumentation

To sum it up, I have argued before that the best type of instrumentation is the automatic one, and it is very important to deliver it without much overhead. So, how is that achievable? In the state of the art, there are two approaches that can achieve these requirements:

Custom Lambda runtimes: AWS Lambda allows you to provide custom runtimes for your Lambda functions, and some vendors of distributed tracing solutions do provide custom Lambda runtimes as a turn-key solution. However, that means that security and maintenance updates, bug fixes and new features provided by AWS Lambda runtimes have the distributed tracing provider as a gatekeeper. Your mileage may vary, but I would personally not be really comfortable with that: it reminds me too much of the issues of the Android mobile ecosystem, with mobile device manufacturers generally having a really poor track record of keeping up with Android releases and maintaining older devices. Of course, how good your experience is depends on the actual quality of the work of the distributed tracing vendor, and I do not aim to say that this is an option no one should use.
Instrumentation over Lambda layers: you can configure your functions to use AWS Lambda layers, which are fundamentally additional files that are made available on the filesystem of your Lambda instance. This can be used to deliver, with very little operational overhead, the instrumentation you need to trace your functions. And many distributed tracing vendors do exactly that: you will find a lot of distributed tracing vendors, for example, in the λ AWSome Lambda Layers list of monitoring solutions. There are differences in the state of the art in terms of how easy it is to activate the instrumentation. With some vendors like Instana, you just need to set a few environment variables; with others, you need small code changes. I am of the opinion that configuration is easier to deal with than code changes, but again, your mileage may vary.

To sum it up, in my eyes the best instrumentation is automatic and delivered in the easiest possible way.

Conclusion

AWS Lambda continues to be the standard when it comes to serverless computing. But even as the use of serverless functions in application development is at an all-time high, companies are still exploring just how effective serverless computing can be in production environments.

As more organizations use serverless functions as a vital part of their application development process and platform, they are looking at how they can - and if they should - use more serverless functions in production applications.

One concern is observability, which remains a major challenge for teams that implement serverless production code, especially since legacy APM tools continue to struggle to attain the levels of visibility they need.

AWS Lambda ushered in significant differences – for both developers and operators – that underline the need for purpose-built cloud-native monitoring tools that understand the serverless monitoring challenges and use different methods for obtaining application observability.

Anyone considering serverless as part of their production environment should include end-to-end observability as a necessary requirement and focus on solutions that can monitor and trace across distributed systems built with cloud-native technologies, serverless platforms like Lambda and, since Serverless is seldom an island, potentially older technologies integrated via Lambda.

About the Author

Michele Mancioppi serves as Senior Technical Product Manager for Instana, where he leads all product development related to agent, distributed tracing, Cloud Foundry and VMware Tanzu. Prior to Instana, he was a Development Expert and the Tech Lead of the SAP Cloud Platform Performance team at SAP. Michele holds Bachelor's and Master's degrees in Computer Science from Università di Trento and a PhD in Information Systems from Tilburg University, the Netherlands.
