
AWS Lambda under the Hood



Key Takeaways

  • Lambda allows users to execute code on demand without the overhead of server management and operations, enabling efficient execution across various integrated languages.
  • Lambda offers synchronous and asynchronous invocation models; synchronous invokes ensure a rapid response, whereas asynchronous invokes queue requests for deferred execution.
  • Lambda adheres to core design principles - availability, efficiency, scale, security, and performance - informing technical decisions to create a reliable, secure execution environment, minimize overhead, and efficiently scale resources.
  • The Invoke Request Routing layer connects all microservices, providing attributes such as availability and scale.
  • Lambda's snapshot distribution service incorporates chunking and on-demand loading, streamlining the invocation process, notably reducing download times and enhancing system efficiency.

AWS Lambda is a serverless compute service that runs code as a highly available, scalable, secure, fault tolerant service. Lambda abstracts the underlying compute environment and allows development teams to focus primarily on application development, speeding time to market and lowering total cost of ownership.

Mike Danilov, a senior principal engineer at AWS, presented on AWS Lambda and what is under the hood at QCon San Francisco 2023. This article is based on that talk. It starts with an introduction to Lambda itself, outlining the key concepts of the service and its fundamentals, which then serve as the basis for a deep dive into the system.

Subsequently, we will delve into the invoke routing layer, recognized as a crucial component connecting all microservices and ensuring the seamless operation of the entire system. Following that, we will shift towards the compute infrastructure - the space where code execution occurs. This represents a serverless environment within the broader serverless framework. Concurrently, we will weave in a story about cold starts, a common topic when running code in cloud infrastructure.

AWS Lambda Overview

Lambda enables users to execute code on demand as a serverless computing system without needing server ownership, provisioning, or management. Built with various integrated languages, Lambda streamlines the process by allowing users to focus solely on their code, which is executed efficiently.

With a rapid response to varying demand, Lambda, which has been in operation for several years, serves millions of monthly users and handles trillions of invokes each month. Configuration is simple: users specify their preferred memory size, and other resources, including CPU, are allocated proportionally.
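
To make the proportional allocation concrete, here is a minimal sketch. The 1,769 MB ≈ 1 vCPU ratio comes from AWS's public documentation; treat the exact constant, and the helper name, as illustrative rather than part of the talk.

```python
# Sketch: Lambda allocates CPU power in proportion to configured memory.
# MB_PER_VCPU is the documented ~1,769 MB-per-vCPU ratio (an assumption
# of this illustration, not a value from the talk).
MB_PER_VCPU = 1769

def approx_vcpus(memory_mb: int) -> float:
    """Approximate vCPU share for a given Lambda memory setting."""
    return memory_mb / MB_PER_VCPU

for memory in (128, 512, 1769, 3538, 10240):
    print(f"{memory:>6} MB -> ~{approx_vcpus(memory):.2f} vCPU")
```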

Lambda supports two invocation models, starting with the synchronous invoke. In this scenario, a request is sent and routed to the execution environment, the code is executed, and a response is returned on the same timeline. An asynchronous invoke, by contrast, queues the request, which poller systems then execute on a different timeline. Since execution itself is equivalent in both models, the discussion here focuses primarily on synchronous invokes.
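
The difference between the two models can be sketched with a toy in-process model: a synchronous invoke waits for the handler's result, while an asynchronous invoke is queued and executed later by a poller. All names here are illustrative; in the real service, the AWS SDK's `Invoke` API selects the model via its `InvocationType` parameter (`RequestResponse` vs. `Event`).

```python
import queue
import threading

def handler(event):
    """Stand-in for the customer's function code."""
    return {"ok": True, "echo": event}

def invoke_sync(event):
    # Synchronous invoke: the caller blocks until the handler returns.
    return handler(event)

event_queue: "queue.Queue" = queue.Queue()

def invoke_async(event):
    # Asynchronous invoke: queue the request and acknowledge immediately.
    event_queue.put(event)
    return {"status": 202}  # accepted; the result arrives later

def poller(results):
    # Poller system: drains the queue and executes on its own timeline.
    while True:
        event = event_queue.get()
        if event is None:
            break
        results.append(handler(event))

results = []
t = threading.Thread(target=poller, args=(results,))
t.start()
invoke_async({"n": 1})
event_queue.put(None)  # stop signal for the toy poller
t.join()
```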

Lambda's design principles are crucial to understanding its approach; they guide the team in making technical decisions and trade-offs. The first tenet, availability, ensures a reliable response to every user request. Efficiency is vital in on-demand systems, requiring quick resource provisioning and release to prevent wastage. Scaling rapidly in response to demand and efficiently scaling down to minimize wastage represents the scale tenet. Security is the top priority at AWS, assuring users a safe and secure execution environment to run and trust their code. Lastly, Lambda emphasizes performance, aiming to provide minimal overhead on top of application business logic, resulting in an invisible and efficient compute system.

Invoke Request Routing

Invoke Request Routing is a vital part of Lambda - a critical layer interconnecting various microservices, offering essential attributes such as availability, scale, and access to execution environments. Let’s take a practical approach by illustrating the building process to understand this layer better.

The scenario involves Alice seeking assistance to deploy her code in the cloud. The initial step is to incorporate a configuration service to store her code and related configurations. Subsequently, a frontend is introduced, which is responsible for handling invoke requests, performing validation and authorization, and storing configuration details in a database. The next component required is a worker, serving as the execution environment or sandbox for Alice's code. Despite the apparent simplicity, challenges arise in an on-demand computing system, where the availability of workers or execution environments may be uncertain during an invoke. To address this, a new system called "placement" is introduced to create execution environments or sandboxes on-demand.

The frontend needs to communicate with placement to request a sandbox before forwarding the invoke request to the created execution environment, completing the functional setup. However, a challenge persists due to the inclusion of on-demand initialization before each invoke request. This initialization involves multiple steps, including creating the execution environment, downloading customer code, starting a runtime, and initializing the function. This process can take several seconds, potentially negatively impacting the overall customer experience.
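
The initialization steps above can be summarized in a small sketch. The step names follow the article; the millisecond figures are invented placeholders purely to show how the overhead accumulates into a multi-second cold start.

```python
# Illustrative cold-start path. Step names are from the article;
# durations are made-up placeholders, not measured values.
COLD_START_STEPS = [
    ("create execution environment", 50),
    ("download customer code",      200),
    ("start runtime",               250),
    ("initialize function",         400),
]

def cold_start_overhead_ms() -> int:
    """Total initialization overhead paid before the invoke can run."""
    return sum(ms for _, ms in COLD_START_STEPS)

for step, ms in COLD_START_STEPS:
    print(f"{step:<30} {ms:>4} ms")
print(f"{'total overhead':<30} {cold_start_overhead_ms():>4} ms")
```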

The latency distribution graph illustrates the frequency of invoke durations over time. The green latency represents successful code execution, reflecting the efficiency of the business logic and its CPU utilization. To minimize overhead and enhance customer experience, a new system, the worker manager, is introduced. Operating in two modes, it either provides a pre-existing sandbox upon frontend request, leading to smooth "warm invokes," or, in the absence of a sandbox, initiates a slower path involving placement to create a new one. While warm invokes exhibit minimal overhead and speed, efforts are underway to eliminate cold starts, acknowledging the need for further improvements.
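
The worker manager's two modes can be modeled as a simple fast-path/slow-path lookup. This is a toy model under the article's description; the class and method names are illustrative, not Lambda's internal API.

```python
from collections import defaultdict

class Placement:
    """Slow path: creates a new execution environment on demand."""
    def create(self, function):
        return f"sandbox-for-{function}"

class WorkerManager:
    """Toy model of the two modes: reuse a warm sandbox if one exists,
    otherwise fall back to placement (the cold path)."""
    def __init__(self, placement):
        self.placement = placement
        self.idle = defaultdict(list)  # function -> idle warm sandboxes

    def get_sandbox(self, function):
        if self.idle[function]:
            return self.idle[function].pop(), "warm"   # fast path
        return self.placement.create(function), "cold" # slow path

    def release(self, function, sandbox):
        # After an invoke completes, keep the sandbox for reuse.
        self.idle[function].append(sandbox)
```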

In production, Lambda's synchronous invoke routing adds enhancements for resilience: additional availability zones and a front load balancer were integrated. The worker manager, responsible for tracking sandboxes, posed challenges due to its reliance on in-memory storage, which risked data loss on host failures. To address this, a replacement named the assignment service was introduced a year ago. Functionally similar, it features reliable distributed storage known as the journal log, ensuring regional consistency.

The assignment service utilizes partitions, each with a leader and two followers, leveraging a leader-follower architecture to facilitate failovers. This transformation significantly bolstered system availability, rendering it fault-tolerant to single host failures and availability zone events. The move from in-memory to distributed storage, coupled with implementing a leader-follower model, improved efficiency and reduced latency. This marks the conclusion of the first chapter on invoke routing, exploring aspects of cold and warm invokes and the pivotal role of a consistent state in enhancing availability.
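
A minimal sketch of the leader-follower idea: the leader replicates each assignment record to its followers via the journal log before acknowledging, so a follower can be promoted on failover without losing state. Class names and the replication details are illustrative assumptions, not the assignment service's actual protocol.

```python
class Node:
    """A partition member holding a copy of the journal log."""
    def __init__(self):
        self.log = []  # ordered assignment records

    def apply(self, record):
        self.log.append(record)

class Partition:
    """Toy leader-follower partition: one leader, two followers."""
    def __init__(self):
        self.leader = Node()
        self.followers = [Node(), Node()]

    def record_assignment(self, record):
        # Replicate to all followers before acknowledging, so any
        # follower is up to date and eligible for promotion.
        self.leader.apply(record)
        for follower in self.followers:
            follower.apply(record)
        return "ack"

    def fail_over(self):
        # On leader failure, promote an up-to-date follower.
        self.leader = self.followers.pop(0)
```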

Compute Fabric

Compute fabric, specifically the worker fleet within Lambda's infrastructure, is responsible for executing code. This fleet comprises EC2 instances, known as workers, where execution environments are created. A capacity manager ensures optimal fleet size adjustments based on demand and monitors worker health, promptly replacing unhealthy instances. Collaboration with data science teams helps the placement and capacity managers make informed decisions by leveraging real-time signals and predictive models.

A key consideration is data isolation: multiple users' code runs on the same worker. Adopting Firecracker, a fast virtualization technology, overcomes this hurdle. By encapsulating each execution environment in a microVM, Firecracker ensures strong data isolation, allowing diverse accounts to coexist securely on the same worker. This transition from EC2 to Firecracker significantly enhances resource utilization, preventing overloads and providing consistent performance.

The benefits of Firecracker include robust isolation, minimal system overhead, and improved control over the worker fleet's heat (load distribution). The adoption of Firecracker results in a notable reduction in the cost of creating new execution environments, as demonstrated in the latency distribution diagram. This leads to the idea of using VM snapshots to expedite the initialization process, further reducing the overhead of creating new execution environments.

Building a system for VM snapshots requires distributing snapshots between workers, ensuring fast VM resumption, and maintaining strong security. An indirection layer known as "copy-on-read" addresses potential security threats associated with shared memory. Restoring uniqueness to identical VMs remains a challenge, and ongoing collaboration with communities such as Java and Linux continues as part of the effort to enhance the security and efficiency of the system.

Snapshot Distribution

Snapshot distribution across the worker fleet is critical, especially considering the significant size of snapshots, which can reach up to 30 gigabytes. Traditional download methods could be time-consuming, taking at least 10 seconds to complete. To optimize this process, an analogy is drawn to video streaming, where content is progressively loaded in the background as the initial portion plays. Similarly, snapshots are split into smaller chunks, typically 512 kilobytes, allowing the minimal set of chunks required for VM resumption to be downloaded first. This approach has the dual advantage of amortizing download time and retrieving only the necessary working set during an invoke.
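
The chunking step itself is straightforward and can be sketched as follows; the 512 KB size is from the article, while the "minimal working set is the first few chunks" simplification is an assumption of this toy example (in reality the working set depends on which pages the resume path touches).

```python
CHUNK_SIZE = 512 * 1024  # 512 KB, as described above

def split_into_chunks(snapshot: bytes) -> list:
    """Split a snapshot into fixed-size chunks (last chunk may be short)."""
    return [snapshot[i:i + CHUNK_SIZE]
            for i in range(0, len(snapshot), CHUNK_SIZE)]

# Fake snapshot: three full chunks plus a 100-byte tail.
snapshot = bytes(3 * CHUNK_SIZE + 100)
chunks = split_into_chunks(snapshot)

# Download only what the resume needs first; stream the rest lazily.
minimal_working_set = chunks[:2]
```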

The mechanism for on-demand chunk loading involves mapping VM memory to the snapshot file. When a process accesses memory, it either retrieves data from the memory pages or, if not available, falls back to the snapshot file. The efficiency of this process relies heavily on cache hit ratios, encompassing local, distributed, and origin caches. A strategy is proposed to maximize the cache hit ratio by identifying and sharing familiar chunks. For instance, operating system and runtime chunks can be deduplicated and shared across multiple functions, enhancing efficiency and minimizing calls to the region.
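The cache hierarchy described above can be modeled as a lookup chain that falls through from the local cache to the distributed cache to the origin, populating the nearer tiers on the way back. This is a toy model with illustrative names, not Lambda's actual caching code.

```python
class TieredChunkCache:
    """Toy local -> distributed -> origin chunk lookup chain."""
    def __init__(self, origin):
        self.local = {}        # per-worker cache
        self.distributed = {}  # shared cache tier
        self.origin = origin   # authoritative chunk store
        self.hits = {"local": 0, "distributed": 0, "origin": 0}

    def get(self, chunk_id):
        if chunk_id in self.local:
            self.hits["local"] += 1
            return self.local[chunk_id]
        if chunk_id in self.distributed:
            self.hits["distributed"] += 1
        else:
            # Miss everywhere: fetch from origin, fill the shared tier.
            self.hits["origin"] += 1
            self.distributed[chunk_id] = self.origin[chunk_id]
        # Fill the local tier so repeat accesses stay on the worker.
        self.local[chunk_id] = self.distributed[chunk_id]
        return self.local[chunk_id]
```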

Layered incremental snapshots are suggested to optimize chunk management further. These snapshots are encrypted with different keys, categorized as operating system, runtime, and function chunks. Operating system and runtime chunks can be shared, leveraging convergent encryption to deduplicate common bits even when the origin is unknown. Convergent encryption ensures that identical plaintext content results in equal encrypted chunks. This comprehensive approach enhances cache locality, increases hits to locally distributed caches, and reduces latency overhead.
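The core property of convergent encryption is easy to demonstrate: derive the key from the plaintext itself, so identical chunks always encrypt to identical bytes and can be deduplicated. The sketch below uses a SHA-256 counter-mode keystream purely as a stand-in for a real cipher; it illustrates the property, not production cryptography.

```python
import hashlib

def convergent_encrypt(chunk: bytes):
    """Sketch of convergent encryption: key = hash(plaintext), so
    identical plaintext chunks yield identical ciphertext chunks.
    The SHA-256 keystream stands in for a real cipher."""
    key = hashlib.sha256(chunk).digest()
    stream = b""
    counter = 0
    while len(stream) < len(chunk):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    ciphertext = bytes(a ^ b for a, b in zip(chunk, stream))
    return key, ciphertext

k1, c1 = convergent_encrypt(b"shared runtime chunk")
k2, c2 = convergent_encrypt(b"shared runtime chunk")
assert c1 == c2  # identical plaintext -> identical ciphertext -> dedupable
```

Because equal plaintext produces equal ciphertext, two functions that share the same OS or runtime chunks store and cache them only once, even though neither party knows where the other's chunks came from.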

In the production system, the indirection layer is replaced with a sparse file system, streamlining the request process and providing chunks on-demand at the file system level. This sophisticated approach contributes to a more efficient and responsive system.

Have We Solved Cold Starts?

The system should ideally function well after successfully implementing snapshot distribution and VM resumption. However, some delay persists in certain cold invokes despite their proximity to the target location. To comprehend this issue, revisiting page caches and memory mapping is essential. An operating system optimization called read-ahead anticipates sequential reads in regular files, so accessing a single page causes multiple pages to be read. Mapped memory, however, is accessed randomly rather than sequentially, so this method proves inefficient, effectively downloading the entire snapshot file to serve seemingly random page requests.
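
A small sketch of the underlying mechanism: memory-mapping a file and hinting the kernel that access will be random, which disables sequential read-ahead. `MADV_RANDOM` is Linux-specific and only available in Python 3.8+, so the hint is guarded; the file and sizes here are throwaway test data.

```python
import mmap
import os
import tempfile

# Create a throwaway "snapshot" file of four pages.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\x00" * mmap.PAGESIZE * 4)
    path = f.name

fd = os.open(path, os.O_RDONLY)
mapped = mmap.mmap(fd, 0, access=mmap.ACCESS_READ)

# Hint that accesses will be random, so sequential read-ahead
# (tuned for regular-file reads) is not applied. Linux-only, hence guarded.
if hasattr(mapped, "madvise") and hasattr(mmap, "MADV_RANDOM"):
    mapped.madvise(mmap.MADV_RANDOM)

page_byte = mapped[2 * mmap.PAGESIZE]  # a "random" page access

mapped.close()
os.close(fd)
os.unlink(path)
```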

This inefficiency is depicted in the latency distribution graph, highlighting the need for a resolution. To address this, an analysis of memory access among 100 VMs reveals a consistent access pattern. This pattern, recorded in a page access log, is attached to every snapshot. Consequently, during snapshot resumption, the system possesses prior knowledge of the required pages and their sequence, significantly enhancing efficiency. This innovative solution successfully mitigates issues associated with cold starts. Notably, users can experiment with this improvement firsthand by enabling Lambda SnapStart on their Java functions and experiencing the optimized performance of VM snapshots.
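
The page access log idea reduces to two steps: record the order in which pages are first touched during a resume, then replay that order as a prefetch on subsequent resumes. A minimal sketch, with illustrative names:

```python
def record_access_log(page_accesses):
    """Record the first-touch order of pages during a VM resume.
    Duplicates are dropped: only the first access of each page matters."""
    seen, log = set(), []
    for page in page_accesses:
        if page not in seen:
            seen.add(page)
            log.append(page)
    return log

def prefetch(snapshot_pages, access_log):
    """On the next resume, fetch pages in the recorded order instead of
    relying on sequential read-ahead; the rest can stream lazily."""
    return [snapshot_pages[p] for p in access_log]

# Observed page touches during one resume (page numbers, with repeats).
log = record_access_log([7, 1, 7, 3, 1])
```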


In this article, we delved into the invoke routing layer within Lambda, resulting in enhanced system availability and scalability. A comprehensive exploration of the compute infrastructure showcased the introduction of Firecracker, a technology that significantly heightened efficiency while upholding robust security measures. The system's performance was markedly improved, and the challenge of cold starts was successfully solved. These efforts culminated in a fundamental concept: executing code in the cloud without servers, which captures Lambda's essence as a compression algorithm for user experience.

