Four Techniques Serverless Platforms Use to Balance Performance and Cost

Key Takeaways

  • The cost and performance models are two of the key drivers of the popularity of serverless and Function-as-a-Service (FaaS).
  • Cold starts have gone down a lot, from multiple seconds to hundreds of milliseconds, but there is still much room for improvement.
  • There are various techniques that are being used to improve the performance of serverless functions, most of which focus on reducing or avoiding cold starts.
  • These optimizations are not free; it is a trade-off between performance and cost, which depends on the requirements of your application.
  • Currently, closed-source serverless services offered by public clouds give users few options to influence these trade-offs. Open-source FaaS frameworks that can run anywhere (such as Fission) offer full flexibility to tweak these performance/cost trade-offs.
  • Serverless computing is not just about paying for the resources that you use; it is about only paying for the performance you actually need.

Serverless computing is for many the logical next step in cloud computing, moving applications to a set of higher-level abstractions and offloading more of the low-level operational work to the cloud provider (regardless of whether that is a public one or an internal infrastructure team). It promises reliable performance on-demand while directly linking the pricing to the resources used.

This blog post is a synthesis of a talk that I gave at a couple of conferences in late 2018 (a recording of the version I gave at KubeCon China 2018 is available). As both a researcher and a software engineer in the serverless computing domain, I aim to give you an idea of what is going on under the covers of the current state-of-the-art serverless platforms, especially with regard to performance and how you can influence it.

Performance and Cost Models

Two aspects have been key to the rapid adoption of serverless computing: the performance model and the cost model.

The Performance Model

Serverless functions are designed to have almost no performance tuning knobs; the performance model is supposed to give the impression of an infinitely scalable, infinitely reliable computer.

However, in reality there are practical limits. For example, all serverless computing systems have the “cold start” problem: the latency of starting a function (more on this later). Even so, a large number of real-world applications find these constraints acceptable.

We can think of the performance model of serverless as abstracting over three important characteristics:

  1. Throughput: the most prominent feature of serverless performance is its fully managed autoscaling. As a user, you do not have to worry about provisioning resources, nor do you have to scale these resources up or down yourself. These concerns are managed by the cloud provider. It comes with the added benefit that you can rely on the (near-)infinite infrastructure resources of the cloud provider.
  2. Availability: similar to autoscaling, you have clear expectations for the availability of your serverless applications. Although this is still a relatively unexamined aspect of serverless computing, you can generally expect your serverless application to have an uptime similar to other cloud services offered by the vendor.
  3. Latency: latency overhead is a hot topic in serverless computing; it is certainly not yet low enough for most latency-sensitive use cases. Yet, it is still one of the fastest ways to serve workloads without having a permanent deployment. Additionally, recent advances in serverless performance (such as pre-warmed containers or infrastructure resources, cached functions, and more, which we’ll explore below) are leading to serverless being applied more and more to latency-sensitive use cases.

The Cost Model

Arguably even more important to the popularity of serverless than performance is its cost model. Any serverless offering ought to have the following three characteristics in its cost model: 

  1. No costs when idle: the most widely agreed-upon characteristic of serverless solutions is the usage-based payment model. You only pay for the resources that you actually use for your applications, instead of paying for all resources, whether used or merely reserved, as in other cloud models.
  2. No upfront or recurring operational costs: not only do you only pay for the resources that you actually use, you should not have to pay any upfront or recurring fees for operational costs. In other words, if you do not use your serverless application in a month, your cloud costs should be zero.
  3. Granular billing: when your serverless application is in use, you pay on an extremely granular level. You pay for the resources that you actually consumed, metered by the millisecond rather than by the hour or even longer as in traditional cloud models.
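
To make the three characteristics concrete, here is a minimal sketch that compares a usage-based bill with an hourly reservation for the same workload. The function names, the workload, and both prices are hypothetical values chosen for illustration; they are not the rates of any actual provider.

```python
def granular_cost(invocations: int, avg_duration_ms: float, memory_gb: float,
                  price_per_gb_second: float) -> float:
    """Bill only for the memory-time actually consumed (nothing when idle)."""
    gb_seconds = invocations * (avg_duration_ms / 1000.0) * memory_gb
    return gb_seconds * price_per_gb_second


def reserved_cost(hours_reserved: float, price_per_hour: float) -> float:
    """Bill for every reserved hour, whether or not it was used."""
    return hours_reserved * price_per_hour


# Hypothetical month: 100,000 requests of 200 ms each on 0.5 GB of memory.
HYPOTHETICAL_GB_SECOND_PRICE = 0.00002   # made-up rate
HYPOTHETICAL_HOURLY_PRICE = 0.02         # made-up rate

busy_month = granular_cost(100_000, 200, 0.5, HYPOTHETICAL_GB_SECOND_PRICE)
idle_month = granular_cost(0, 200, 0.5, HYPOTHETICAL_GB_SECOND_PRICE)
always_on = reserved_cost(30 * 24, HYPOTHETICAL_HOURLY_PRICE)
```

With these made-up numbers the usage-based bill is a small fraction of the always-on bill, and the idle month costs exactly zero. The comparison flips for workloads that keep an instance busy around the clock, so treat this as an illustration of the billing model and not as a pricing claim.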

The Central Trade-off in Serverless Computing

As we can see, these two main aspects of serverless computing are in conflict. As we’ll show later in this post, increasing performance requires increasing costs, and reducing costs affects performance. When you think of high-performance systems, you generally don't expect them to be very cost-effective, and vice versa.

And, of course, we do not suddenly get all kinds of supercomputing resources for free just by starting to use this serverless thing. As any critic of serverless computing will point out first: there are still servers in serverless computing, and someone still needs to pay for them.

Instead, what serverless computing is all about, is that with its cost and performance model it allows you to directly tie performance to a price. This explicit link between cost and performance has required serverless providers to find techniques to optimize performance within this strict cost model. And you can, too.

A Serverless Platform 

Before we dive into the optimizations, it is useful to have an understanding of what the most basic Function-as-a-Service (FaaS) platform looks like under the covers, since functions are the building blocks and execution units of serverless computing. Let’s review a reference architecture for a ‘representative’ FaaS platform, which we have been developing in collaboration with a number of companies and universities within the SPEC RG CLOUD group.

Covering the entire reference architecture is worth an article on its own (which we are working on!). For the scope of this article, let’s discuss the FaaS part of serverless, focusing particularly on how FaaS functions are executed; we won’t be covering the development, build, and monitoring workflows of serverless functions.

Data Model

Starting with the data model, a FaaS platform uses two datastores for the functions:

  1. Function Metadata Store: stores the configuration and other metadata associated with a function. The metadata contains function-specific answers to platform concerns: What version of the function should we use? What kind of resources should be used? How should the function be scaled? What permissions should the function have? Where is the function stored?
  2. Function Store: stores the actual function sources. Although in canonical examples functions generally consist of a couple of lines of Python code, in practice functions can quickly comprise tens to hundreds of MBs of dependencies or static assets.

The reason for these two conceptual stores is that they have different access patterns: we need to look up the small metadata frequently and as fast as possible, whereas we only need the actual (potentially large) function sources when we deploy a function. However, in practice some FaaS platforms keep these two conceptual stores in the same database, for example, using the questionable approach of using Docker images as functions.
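
As a minimal sketch of this split, the two stores could look as follows. All class and field names here are hypothetical and chosen for illustration; they do not mirror the data model of Fission or any other platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FunctionMetadata:
    """Small record that is looked up on (nearly) every request."""
    name: str
    version: str
    memory_mb: int
    max_instances: int
    source_ref: str  # key of the function sources in the function store


class MetadataStore:
    """Optimized for frequent, fast lookups of small records."""

    def __init__(self) -> None:
        self._items: dict[str, FunctionMetadata] = {}

    def put(self, meta: FunctionMetadata) -> None:
        self._items[meta.name] = meta

    def get(self, name: str) -> FunctionMetadata:
        return self._items[name]


class FunctionStore:
    """Holds the (potentially large) sources; only read at deploy time."""

    def __init__(self) -> None:
        self._archives: dict[str, bytes] = {}
        self.fetches = 0  # counts how often the large sources were read

    def put(self, ref: str, archive: bytes) -> None:
        self._archives[ref] = archive

    def fetch(self, ref: str) -> bytes:
        self.fetches += 1
        return self._archives[ref]
```

Routing decisions only touch `MetadataStore.get`, so the expensive `FunctionStore.fetch` is paid once per deployment instead of once per request.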

Figure 1 - The anatomy of the runtime of a FaaS platform.

Execution Model

To be able to deploy and execute these functions we need a set of components, which together make up the runtime of a FaaS platform (Figure 1). 

Although event triggering and routing are also a fundamental part of serverless, these are not the concerns of the FaaS platform runtime. From the runtime’s point of view, there is no difference between events, whether they come from a message queue, an HTTP request, or a modification in a database. All these events arrive at the Router, the component responsible for accepting events and deciding which function should be executed. However, since we only deploy functions when they are needed, it often happens that the Router has to request a function instance to be deployed through the Deployer.

The Deployer component has a single task: it takes the demand for a function together with the function metadata to decide how the function should be deployed. However, the actual deployment of the resources is typically handed off to a Resource Manager.

The Resource Manager is typically a conceptual layer below the FaaS platform; managing the deployment of generic cloud resources, such as containers, networks, and storage. Today this has become synonymous with Kubernetes. In the open-source FaaS space, nearly all platforms can be deployed on Kubernetes. Within our FaaS platform model, Kubernetes (or another resource manager) is responsible for receiving the decision of the Deployer and deploying the resources accordingly. In the process it fetches the needed function sources from the function store to deploy the function instances.

The final product is a Function Instance (also frequently referred to as a Worker), which is the actual deployed function that is capable of executing the function requests, which it receives from the Router.
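
The interplay between these components can be sketched in a few lines. This is a toy, in-memory model under our own assumptions (the class and method names are hypothetical, and a real Resource Manager such as Kubernetes does far more), but it shows where the request path forks between an existing instance and a fresh deployment.

```python
class FunctionInstance:
    """A deployed function that can execute requests."""

    def __init__(self, name: str, source: str) -> None:
        self.name = name
        self.source = source

    def handle(self, event: str) -> str:
        return f"{self.name} handled {event}"


class ResourceManager:
    """Deploys resources and fetches sources from the function store."""

    def __init__(self, function_store: dict[str, str]) -> None:
        self.function_store = function_store
        self.deployments = 0

    def deploy(self, name: str, memory_mb: int) -> FunctionInstance:
        self.deployments += 1
        return FunctionInstance(name, self.function_store[name])


class Deployer:
    """Decides how to deploy a function, based on its metadata."""

    def __init__(self, metadata: dict[str, dict], manager: ResourceManager) -> None:
        self.metadata = metadata
        self.manager = manager

    def deploy(self, name: str) -> FunctionInstance:
        meta = self.metadata[name]
        return self.manager.deploy(name, meta["memory_mb"])


class Router:
    """Accepts events and forwards them to a function instance."""

    def __init__(self, deployer: Deployer) -> None:
        self.deployer = deployer
        self.instances: dict[str, FunctionInstance] = {}

    def route(self, name: str, event: str) -> tuple[str, bool]:
        instance = self.instances.get(name)
        needed_deploy = instance is None
        if needed_deploy:
            # No instance available: ask the Deployer for one first.
            instance = self.deployer.deploy(name)
            self.instances[name] = instance
        return instance.handle(event), needed_deploy
```

Calling `route` twice for the same function triggers only one deployment; the second call is forwarded straight to the existing instance.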

Fission: Fast Serverless Computing on Kubernetes

Although the reference architecture is pretty simple, it can be—and in practice is—implemented in a number of different ways. For example, a variety of databases is used to store functions, different resource managers are used, and the communication between the components is implemented anywhere from using HTTP requests, to using message queues, to a central database.

To give you an idea of how this reference architecture is implemented in practice, let’s unpack Fission: a popular, open-source platform for fast serverless computing on Kubernetes.

Figure 2 - The runtime architecture of Fission (without advanced features and optimizations).

Excluding all optimizations and more advanced components, Fission’s architecture roughly implements the FaaS reference architecture. 

Although it has a number of other features, such as canary deployments, Record-Replay, and more, Fission’s router too is primarily concerned with accepting HTTP requests (in Fission all events are converted to HTTP requests) and routing them to the correct function instances.

A component called the Executor implements the Deployer component in the reference architecture. For its function deployments it accesses the function metadata store, which is implemented using Kubernetes CRDs (which are generally stored in an etcd cluster).

Fission is built as a Kubernetes-native FaaS platform. It relies heavily on various features of Kubernetes for the brunt of the resource management, with a simple function store (storagesvc) component deployed in the cluster. Function instances are deployed as Kubernetes deployments, allowing them to easily integrate with existing non-serverless Kubernetes deployments, such as microservices or other container-based applications.

Cold Starts

This reference architecture also allows us to address the elephant in the room: cold starts. A cold start is, in its essence, the worst-case time that a function execution will take. Cold starts can’t take advantage of shortcuts or other optimizations.

Figure 3: The typical lifecycle of a cold start and warm execution.

A cold start typically occurs when a request arrives at the Router without a function instance being available to handle the request. The Router has to signal the Deployer to start the deployment of a new function instance. The Deployer in turn signals the Resource Manager to deploy the desired resources that comprise the function instance. Only after the function instance is fully deployed can the request be forwarded by the Router to the new function instance to be executed. Typically this cold start takes hundreds of milliseconds, or up to multiple seconds in less-optimized platforms.

In contrast, (regular) warm executions are the best-case scenario: a function instance is already completely deployed and ready to handle the request. This allows the router to directly forward the request to the function instance without having to wait for any part of the deployment process. Typically, the latency added by the FaaS platform is a couple of milliseconds.

Why should I care?

Cold starts are not just a part of our reference architecture or of Kubernetes-based platforms; they are currently a fundamental characteristic of serverless computing. Reducing cold starts is a hot topic in academic research as well as a prime concern of production-ready FaaS platforms.

Figure 4 - Cold starts of cloud providers over a 7-day period in 2017 (source: Wang et al., Peeking Behind the Curtains of Serverless Platforms)

In the summer of 2018, researchers presented a comprehensive investigation at the USENIX ATC conference into—among others—the cold start behavior of the FaaS platforms of the major cloud providers (Amazon Web Services, Google Cloud, and Microsoft Azure). 

As Figure 4 shows, even on AWS Lambda, the longest-running serverless platform, cold starts are still present, with a minimum cold start latency of 200ms. Although this latency is going down as platforms mature, latencies of this magnitude are still significant in any user-facing and latency-sensitive application.

Reducing Cold Starts

FaaS platforms—open-source and hosted—are trying to mitigate these cold starts using a variety of techniques. The most straightforward approach is to minimize the overhead of all components involved in the function execution. For example, AWS recently open-sourced Firecracker, a highly optimized virtualization runtime specifically built to reduce the cold start latency of AWS Lambda and AWS Fargate. 

However, reducing the overhead of the components only gets you so far, which is why serverless platforms employ a number of techniques that make a trade-off between performance and (added) costs.

Let’s review four of the most used techniques:

  1. Function resource reusing
  2. Function runtime pooling
  3. Function prefetching
  4. Function prewarming

1. Function Resource Reusing

The first optimization might seem a bit redundant, but that is due to the fact that we take it for granted in today’s serverless ecosystem.

Taking a note from functional programming and general computing theory, one execution of a function should never be able to influence another function execution; a function execution is atomic, self-contained, and isolated from other executions.

However, if we stuck to this notion, then to ensure independence of executions each function execution would require its own independent set of resources: its own function instance. This would require each function execution to go through a cold start.

Figure 5 - FaaS function executions in theory (left) and in practice (right).

Obviously, this is not ideal in practice, nor do we have a need for such strict performance isolation in most cases. So, one of the first trade-offs that is made in serverless computing, is to have function executions share their function instances. A function instance can handle these requests one after the other.

This reusing of function instances leads us to an interesting question: how long should we keep these function instances around before cleaning them up? The answer to this question is the same as with nearly any question in computer science: it depends. Like all of the optimizations in the rest of this article, this optimization involves a trade-off between performance and cost.

To maximize the chances that future function executions can benefit from reusing an existing function instance, the platform could choose to keep the function instance around for a long time. However, the downside of taking this approach to the extreme is that you are not guaranteed that these function instances will be needed. So, you might be keeping these function instances alive unnecessarily, taking up resources and costing you money.

Instead, a cost-focused FaaS platform or user could choose to keep the function instance alive for little or no time at all. This would minimize the operational cost, since the function instances are not kept (idly) around. But, performance will likely be impacted when taking this to the extreme, since few function executions will be able to benefit from existing function instances.

The choices cloud providers make in this trade-off were also investigated in the USENIX ATC publication. The researchers found that all of the big three cloud providers opted to keep function instances alive for longer, from multiple hours to days (Azure has an estimated keep-alive of 6 days on average). Likely these cloud providers keep function instances around for as long as the resources are not needed by other services.
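
The keep-alive trade-off can be explored with a small simulation. The sketch below is our own simplification (a single function, a single instance, zero execution time, and a hypothetical request trace); it counts the cold starts and the idle seconds that a given keep-alive duration would produce.

```python
def simulate_keep_alive(request_times: list[float],
                        keep_alive_s: float) -> tuple[int, float]:
    """Return (cold_starts, idle_seconds) for one function.

    Assumes executions take no time and that an idle instance is
    reclaimed exactly keep_alive_s seconds after its last request.
    """
    cold_starts = 0
    idle_seconds = 0.0
    last_request = None
    for t in sorted(request_times):
        if last_request is None:
            cold_starts += 1                  # very first request
        else:
            gap = t - last_request
            if gap > keep_alive_s:
                cold_starts += 1              # instance was already reclaimed
                idle_seconds += keep_alive_s  # it idled until the timeout
            else:
                idle_seconds += gap           # instance was reused
        last_request = t
    if last_request is not None:
        idle_seconds += keep_alive_s          # idle tail after the last request
    return cold_starts, idle_seconds


# Hypothetical trace (seconds): a burst, a long pause, another burst.
trace = [0, 10, 20, 400, 410]
no_reuse = simulate_keep_alive(trace, keep_alive_s=0)      # -> (5, 0.0)
balanced = simulate_keep_alive(trace, keep_alive_s=60)     # -> (2, 150.0)
generous = simulate_keep_alive(trace, keep_alive_s=600)    # -> (1, 1010.0)
```

Each step towards a longer keep-alive removes cold starts but adds idle time that someone has to pay for, which is exactly the trade-off described above.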

2. Function Runtime Pooling

Next to improving performance by sharing resources during or after function executions, FaaS platforms employ several techniques to optimize performance by sharing resources beforehand. One of these techniques is called function runtime pooling.

Figure 6 - The deployment process of a function instance which consists of a user-defined function and a generic function runtime 

This optimization is based on the insight that a function instance is comprised of two distinct parts:

  1. User-provided function: is the part which the user provides to the FaaS platform. It contains all the business logic of a specific function.
  2. Runtime: contains all the code that takes care of the plumbing of the function. It ensures that your function can handle requests, provides monitoring, and covers all other operational aspects. The function runtime is typically provided by the FaaS platform and can be specific to a programming language.

With this notion of a runtime and the actual function, you can also see the deployment of a function as a two-step process. First, in the function runtime deployment, the platform deploys a runtime without a function, resulting in a generic (or unspecialized) runtime. Then, in the second step of the deployment process we deploy the user-defined function onto the generic runtime—specializing it—which results in a function instance.

Using this multistep deployment process, the FaaS platform can now employ a technique that is common in many fields, namely resource pooling. The idea behind resource pooling is that you create or prepare resources ahead of time to reduce the costly creation at run time. A typical example of this is in multithreaded applications, where you can employ thread pools to reduce the cost of thread management by creating them ahead of time and sharing them among multiple actors.

Figure 7 - Pooling function runtimes to reduce cold starts: keeping a pool with generic runtimes around (left), taking out runtimes to deploy function instances quickly (center), and meanwhile rebalancing the pool to the desired state (right).  

In serverless computing this technique can be employed in much the same way. This was one of the key ideas that led to the creation of Fission, which was the first open-source FaaS platform to employ runtime pooling.

A FaaS platform employs function runtime pooling by maintaining pools of generic runtimes. Whenever a function instance needs to be deployed, we can take a runtime from this pool. We then only need to perform the last step of the deployment process to create a new function instance, dramatically reducing the cold start. Independent of the function deployment, the FaaS platform can then deploy new runtimes to rebalance the runtime pool.

However, this again leads us to a trade-off. We need to find an answer to the question: how large should this pool be?

Performance-focused FaaS platforms could opt for a larger pool of generic runtimes. This would ensure that even during spikes in the workload, the pool has enough generic runtimes to provide for the deployment of function instances. Yet, this also means that the large pool will be occupying resources permanently, resulting in larger operational costs.

Therefore, a cost-conscious FaaS platform could opt for a smaller pool or no runtime pooling at all, ensuring that the operational costs are minimized. However, as you decrease the pool size, the chances increase that a sudden burst of requests will require more runtimes than are present in the pool, depleting it. When the runtime pool cannot keep up with demand, subsequent function executions will have to wait for a runtime to be deployed, which lengthens the cold start.
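
The following sketch illustrates the pooling mechanism and what happens when a burst depletes the pool. It is a toy model with hypothetical names (`RuntimePool`, `specialize`, `rebalance`) and is not how Fission's own pool manager is implemented.

```python
from collections import deque


class RuntimePool:
    """Keeps generic runtimes ready so only specialization remains."""

    def __init__(self, target_size: int) -> None:
        self.target_size = target_size
        self._next_id = 0
        self._generic: deque[str] = deque()
        self.pool_hits = 0    # instances created from a pooled runtime
        self.pool_misses = 0  # instances that needed a full cold start
        self.rebalance()

    def _new_runtime(self) -> str:
        runtime = f"runtime-{self._next_id}"
        self._next_id += 1
        return runtime

    def available(self) -> int:
        return len(self._generic)

    def specialize(self, function_name: str) -> tuple[str, str]:
        """Turn a generic runtime into a function instance."""
        if self._generic:
            runtime = self._generic.popleft()
            self.pool_hits += 1
        else:
            # Pool depleted: the runtime must be deployed on demand.
            runtime = self._new_runtime()
            self.pool_misses += 1
        return runtime, function_name

    def rebalance(self) -> int:
        """Refill the pool to its target size; returns runtimes added."""
        added = 0
        while len(self._generic) < self.target_size:
            self._generic.append(self._new_runtime())
            added += 1
        return added
```

With a target size of two, a burst of three deployments is served by two pooled runtimes and one on-demand runtime; a subsequent `rebalance()` restores the pool. A larger `target_size` would absorb the whole burst at the price of more permanently occupied resources.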

3. Function Prefetching

Next to preparing the runtime ahead of time, we can also prepare the deployment of the function itself in advance. With function prefetching we can get the function to the runtime where it is needed faster by caching the function sources nearby.

This optimization might not make much sense if you have just a single cluster with a couple of nodes and a couple of functions. However, in larger enterprise FaaS environments, functions start to depend on large libraries or contain many static assets. With these they can quickly grow to hundreds of MBs or even GBs in size. At these sizes, even transferring functions between co-located servers can end up adding seconds of delay to your cold start process.

Next to large functions, your serverless application might need to be geo-distributed or deployed at the edge (for example, with Cloudflare Workers or AWS Lambda@Edge). Transferring even small functions halfway across the world to the desired location impacts your cold start process by hundreds of milliseconds.

With function prefetching we can alleviate this cost of transferring function sources by caching the function sources near the runtimes that will need them.

Figure 8 - A hierarchy of levels to cache function sources.

There are many options for where to cache the FaaS functions, which you can view as a hierarchy. At the top of this hierarchy we have a single remote storage, which is the actual data store storing your functions. Below that there are several layers, closer and closer to the runtimes, in which you can cache the functions. You can cache functions conservatively, once at the cluster level, or take caching to its extreme by caching (some) functions directly at the runtimes that might need them in the future.

Ideally, in a world where caching is free, we would cache all functions at all possible locations, ensuring that the function transfer cost is zero. However, in practice, storage is not free, and caching everything everywhere multiplies your storage costs by the number of cache locations.

You can take the other extreme in this trade-off of performance vs. cost by caching only minimally, or not at all. This will minimize your cost, but it will also mean that all functions will need to be fetched from the remote storage. Especially in FaaS platforms that rely on external providers to store their functions, such as Docker Hub, this can quickly become a source of performance degradation.
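
A read-through cache hierarchy of this kind can be sketched as follows. The level names and the latencies are invented for illustration, and the sketch ignores eviction and cache size limits, which is where the cost side of the trade-off would show up in a real system.

```python
from __future__ import annotations


class CacheLevel:
    """One level in the hierarchy; misses fall through to the parent."""

    def __init__(self, name: str, latency_ms: float,
                 parent: CacheLevel | None = None) -> None:
        self.name = name
        self.latency_ms = latency_ms
        self.parent = parent
        self._cached: dict[str, bytes] = {}

    def store(self, key: str, sources: bytes) -> None:
        self._cached[key] = sources

    def fetch(self, key: str) -> tuple[bytes, float]:
        """Return (sources, total latency), caching on the way back."""
        if key in self._cached:
            return self._cached[key], self.latency_ms
        if self.parent is None:
            raise KeyError(key)
        sources, parent_latency = self.parent.fetch(key)
        self._cached[key] = sources
        return sources, self.latency_ms + parent_latency


# Hypothetical hierarchy with made-up latencies per level.
remote = CacheLevel("remote-storage", latency_ms=500.0)
cluster = CacheLevel("cluster-cache", latency_ms=20.0, parent=remote)
node_a = CacheLevel("node-a-cache", latency_ms=1.0, parent=cluster)
node_b = CacheLevel("node-b-cache", latency_ms=1.0, parent=cluster)

remote.store("hello", b"function sources")
_, first_fetch = node_a.fetch("hello")     # misses all the way up
_, repeat_fetch = node_a.fetch("hello")    # served from the node cache
_, sibling_fetch = node_b.fetch("hello")   # served from the cluster cache
```

Only the first fetch pays the remote transfer; later fetches on the same node, or on a sibling node in the same cluster, are served from a nearby level.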

4. Function Prewarming

With most optimizations, the cold start will be there regardless of how much we optimize the process. Can we avoid this cold start in its entirety?

With prewarming we try exactly this: to avoid cold starts entirely by anticipating the demand for a function and deploying functions ahead of time.

This is not a novel idea. Prewarming (or as it is known in academia: predictive scheduling) has been introduced in many domains: in processors we have the branch predictor, in autoscaling research proactive autoscalers are an active field of research, and in cache management there is the notion of predictive caching.

Prewarming (or predictive scheduling) in FaaS platforms is not much different. Instead of waiting for a request to arrive at the platform before deploying the function (the cold start), in the ideal scenario we perfectly predict that a request will arrive at a certain time. This allows us to go through the cold start process ahead of time, completing the deployment of the function just before the request arrives. Instead of going through the entire cold start process, the request can be executed immediately, avoiding the cold start problem in its entirety.

Accurate predictions are difficult

Having a good predictor is key to employing effective prewarming. Yet—like predicting anything—predicting function demand is difficult. In the related, more mature domains, such as in CPU branch prediction and autoscaling, predictive scheduling remains an active field of research.

The approaches to this problem can be subdivided into two categories. With runtime analysis the platform monitors the runtime behavior of the function and the demand, trying to answer a number of questions: How long do function executions typically take? What kind of pattern do we see in the demand over time? Based on these observations, the platform tries to make a model of both the function and demand behavior, which the platform then uses to predict the future executions and tries to prewarm accordingly. The techniques used for runtime analysis vary widely: from simple rule-based predictors, to complex time series analysis, to various types of machine learning.

The other category of approaches is static analysis. Here the platform exploits the (additional) knowledge it has of a function to decide on accurate times to prewarm. For example, you might know ahead of time that function B will be executed right after function A completes. Or, the platform might be aware of a trigger set to execute the function every hour. In general, static analysis provides more reliable predictions, but there is a clear limit on how much you can know about a function ahead of time.

Optimistic vs. conservative prewarming

Not only is predicting function execution difficult, it also involves a trade-off. Since a prediction is always a probability of an event occurring, you have to decide at what threshold the prediction of a function execution needs to result in actual prewarming.

You can be very optimistic about this decision, prewarming functions at the slightest hint that the function will be needed. This is great news for performance, because the chances that functions will be prewarmed get higher the more you lower this prewarming threshold. However, being optimistic is not great for your costs. A low threshold also means that you will prewarm a lot of functions based on mispredictions, for which the expected demand never arrives.

An example of optimistic prewarming, and one of the earliest optimizations in FaaS, is function pinging. Early users of AWS Lambda figured that you could avoid these pesky cold starts by sending artificial requests to their functions every couple of minutes, preventing AWS from cleaning up the function. Despite the downsides and limitations of this approach, it ensured that there would always be a function instance alive—making this an extremely optimistic form of prewarming. 

You can also take a conservative stance on prewarming. By setting the threshold for prewarming higher, you ensure that fewer resources are wasted on misguided prewarming. However, this comes at the cost of performance; the higher the threshold, the more functions cannot benefit from prewarming because of a lack of certainty in the prediction.
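
The effect of the threshold can be illustrated with a small evaluation function. The predictions below are made-up (probability, outcome) pairs, and the sketch assumes that a predictor producing such probabilities already exists.

```python
def evaluate_threshold(predictions: list[tuple[float, bool]],
                       threshold: float) -> dict[str, int]:
    """Count outcomes of prewarming whenever probability >= threshold.

    Each prediction is (predicted_probability, was_actually_invoked).
    """
    avoided = wasted = cold = 0
    for probability, invoked in predictions:
        prewarmed = probability >= threshold
        if prewarmed and invoked:
            avoided += 1   # cold start avoided
        elif prewarmed:
            wasted += 1    # resources spent, no request arrived
        elif invoked:
            cold += 1      # request hit a cold start
    return {"avoided": avoided, "wasted": wasted, "cold_starts": cold}


# Hypothetical predictor output for five potential invocations.
predictions = [(0.9, True), (0.6, True), (0.4, False), (0.3, True), (0.1, False)]

optimistic = evaluate_threshold(predictions, threshold=0.2)
conservative = evaluate_threshold(predictions, threshold=0.8)
```

On this made-up data the optimistic threshold avoids every cold start but wastes one prewarm, while the conservative threshold wastes nothing and lets two requests hit a cold start.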

Fission Workflows: Prewarming with Function Compositions

As an example of this conservative prewarming, we introduce a project that we have been working on for the Fission serverless platform. Fission Workflows is a system for composing your existing FaaS functions into more complex functions, allowing you to reuse functions instead of having to completely rewrite each new function from scratch. It builds on top of the best practices of the well-established workflow field, allowing you to define your workflows without having to worry about discovery, data transfer, and fault tolerance.

Figure 9 - An example of a workflow, showing parallel and sequential executions.

Since these workflows basically form a graph of dependencies between the different functions, we know exactly which functions will be needed when. This allows us to predict relatively accurately if and when to prewarm instances and deploy functions ahead of time, anticipating that they will be triggered based on the workflow sequence.

There are a lot of possibilities with predictive prewarming based on function composition. We started with a simple prototype of this predictive prewarming, called horizon-based prewarming: we prewarm all the functions on the ‘horizon’. The horizon consists of all tasks that will be executed right after the current functions have completed.

Figure 10 - Horizon-based prewarming with function executing (yellow), functions prewarmed (blue), functions not started (red). 

Figure 10 shows an example of this prewarming. Functions B and C will be prewarmed, because they both depend on currently executing functions. Functions D and E will not be prewarmed, because they depend on other functions that have not started executing.

In the ideal case, even with an algorithm as simple as this, you can effectively reduce the number of cold starts in your function compositions to one (the first function).
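
A minimal sketch of how such a horizon could be computed from a dependency graph is shown below. It encodes our reading of the description above: a task is on the horizon when it has not started, all of its dependencies have started, and at least one of them is still running. The example workflow is hypothetical and is not the actual Fission Workflows implementation or the exact graph from the figure.

```python
def horizon(dependencies: dict[str, set[str]],
            running: set[str],
            completed: set[str]) -> set[str]:
    """Return the tasks to prewarm, given the current workflow state.

    dependencies maps each task to the set of tasks it depends on.
    """
    started = running | completed
    to_prewarm = set()
    for task, deps in dependencies.items():
        if task in started:
            continue  # already running or finished
        # All dependencies have started and at least one is still running.
        if deps and deps <= started and deps & running:
            to_prewarm.add(task)
    return to_prewarm


# Hypothetical workflow: A fans out to B and C, which join in D, then E.
workflow = {
    "A": set(),
    "B": {"A"},
    "C": {"A"},
    "D": {"B", "C"},
    "E": {"D"},
}

while_a_runs = horizon(workflow, running={"A"}, completed=set())
while_b_and_c_run = horizon(workflow, running={"B", "C"}, completed={"A"})
```

While A is executing, B and C are on the horizon; D only joins the horizon once both B and C have started, and E stays cold until D is running.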


By allowing a FaaS platform to handle the full lifecycle of your functions or applications, the serverless platform gains a lot of insight into your workload and control of the resources employed. This allows FaaS platforms to use various techniques to improve performance, which would be less effective or even impossible in other, more traditional, cloud models.

The major serverless platforms offered by the public cloud providers continuously work to optimize these trade-offs of cost and performance for the average user. However, these providers give you little opportunity to make your own trade-offs. Aside from changing the memory and CPU requirements of your functions, which can also have an impact on performance, none of the major cloud providers offer you options to, for example, alter pool sizes and cooldown durations in exchange for higher or lower costs.

In that respect open-source serverless platforms, such as Fission, are interesting, since they give you the freedom to tweak all of these trade-offs to fit your specific use case. The cost savings might not be as explicit when you run your own serverless applications on-premises or on your own cloud infrastructure, without using the serverless services offered by the likes of AWS. In the end, though, your trade-offs will result in increased or decreased resource usage, which impacts datacenter/cloud costs and infrastructure utilization.

Being able to make these trade-offs leads us to one of the most promising aspects of serverless computing: serverless computing is not just about paying for the resources that you use; it is about only paying for the performance you actually need.

About the Author

Erwin van Eyk works at the intersection between industry and academia. As a software engineer at Platform9, he contributes to Fission: an open-source, Kubernetes-native, Serverless platform. At the same time, he is a researcher investigating “Function Scheduling and Composition in FaaS Deployments” in the International Research Team @large at the Delft University of Technology. As a part of this, he leads the industry and academia combined serverless research effort at the SPEC Cloud Research Group.
