Durable Execution for Control Planes: Building Temporal Cloud on Temporal


Summary

Sergey Bykov discusses the concept of Durable Execution, with a real world example of how they used it to build the Control Plane for Temporal Cloud.

Bio

Sergey Bykov is responsible for the architecture of Temporal Cloud. Prior to joining Temporal, Sergey was one of the founders of the Orleans project at Microsoft Research and led its development for over a decade. The mediocre state of developer tools for cloud services and distributed systems at the time inspired him to join the Orleans project.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Bykov: Are you either running, building, planning, or thinking of building a hosted, managed cloud service? How many people have done it or are doing it? I can almost guarantee that this number will be much higher, because you will be building it. Why? Because a managed service is a great way to provide a better, hosted experience to your customers for a server technology. If you have a database or a queuing system or some other solution, that's a good way to deliver it and also to monetize it, especially if you're an open source project. I'll be talking about our experience of doing that for Temporal, which is a server technology. It's popular. It's all open source under MIT, so anyone can host it themselves. How do you monetize it? You need to provide a service, so you need to host it. Why would people pay for it? Only if you provide a better experience than them self-hosting it.

Where do we start? How do we build a managed cloud service? One of the first questions we need to answer is the tenancy model. One option is single tenant, where we provision a cluster, a database, a physical instance of your product for each tenant. It's very easy to do. It has great isolation, only a single tenant there, and it's easy to manage it that way. The other option is multi-tenanted, where you have a single physical instance, but you put workloads of different customers on the same physical instance, cluster, and database. You have to give them a virtual instance. Now they don't own all of the hardware, they own a slice of what you provision. How do you think about these models? If you have single tenant provisioning, the yellow line is the customer's load on the resource. Regardless of where the line is, they have to pay for the whole provisioned capacity, whether they use it or not. Even if they're down, they still pay for all the physical resources; you have to charge them, because those resources cost you.

If instead we provide a multi-tenanted service, with the same usage they just pay for the area under the graph. As a provider, we can put multiple customers here, we can share the same physical resource. We can make it bigger. We can optimize, and by the law of large numbers we'll be much more efficient. Not only that, customers care about the headroom above the line: what if I need to spike? It's not only Black Friday or the Super Bowl; we also see cases where there was an outage in some downstream service, work accumulated, and when that service came back up they needed to process the backlog, so usage spiked. They care about this ability to spike. If we have a multi-tenant service, we have bigger hardware instances and all these customers, and statistically it's unlikely that they all spike at the same time. They effectively share headroom that is much higher, so it benefits every single customer. A multi-tenanted system is harder to build, but in our experience, that's pretty much what all our customers want. They want it to be multi-tenanted, even though initially it's counterintuitive: "I want isolation." When they realize all the benefits, they say, "No, I want this. I don't want to pay for infrastructure, I want to pay for my usage. If my business grows 3x, I'm totally fine paying 3x. If my business stays flat and I still pay the 3x I provisioned for, that's bad for business." People also refer to this as serverless. I don't like this term, because there are always servers somewhere.

We need to decide the unit of product we're going to give the customer; in the database case it's a virtual database instance, for a queuing system it's a queue, whatever it is, that's the unit of what you get. That's the unit of isolation. For us it was easy, because the Temporal server already has the notion of a namespace. It's the unit of isolation within the open source server: for example, workflow names are unique within a namespace, and throughput and other performance configuration is set per namespace. For us, it was a no-brainer. From the client perspective, when people write code, the namespace comes as an endpoint. You configure your client to connect to it, using your mTLS certificates. You also get the web experience, where you can go to a URL for that namespace and see what's running, the workflows that run in that namespace. That's what the user sees, without running any infrastructure, if we host it for them.
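To make that concrete, here is a minimal sketch of connecting a client to a hosted namespace with the Temporal Go SDK; the namespace name, endpoint, and certificate paths are placeholders, not real values.

```go
package main

import (
	"crypto/tls"
	"log"

	"go.temporal.io/sdk/client"
)

func main() {
	// Placeholder certificate paths; a hosted namespace authenticates clients with mTLS.
	cert, err := tls.LoadX509KeyPair("client.pem", "client.key")
	if err != nil {
		log.Fatalf("failed to load client certificate: %v", err)
	}

	// Placeholder namespace and endpoint; the namespace is the unit you get,
	// and it comes with its own gRPC endpoint.
	c, err := client.Dial(client.Options{
		HostPort:  "my-namespace.a1b2c.tmprl.cloud:7233",
		Namespace: "my-namespace.a1b2c",
		ConnectionOptions: client.ConnectionOptions{
			TLS: &tls.Config{Certificates: []tls.Certificate{cert}},
		},
	})
	if err != nil {
		log.Fatalf("failed to connect: %v", err)
	}
	defer c.Close()
	// A real program would start workers or workflows with this client.
}
```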

Data Plane vs. Control Plane

On the backend side, there are two related but orthogonal concerns: the data plane and the control plane. Our industry is good at borrowing concepts from other industries, in this case from telecoms. What is the data plane? This is where the data or signal, the actual processing, happens. When I pick up my phone and send a text message or make a call, the signal goes through base stations, routers, switches, all this infrastructure, which is the data plane of the mobile operator. When they need to provision a new base station or reconfigure one, or, for example, when I want to transfer my number from AT&T to T-Mobile, the control planes of the two providers need to communicate, collaborate, and transfer my number from one to the other. The control plane is the brain that manages data plane resources. Because of that, data planes and control planes have very different concerns and expectations. The data plane's main job is to stay up, be available, be running. That's where data, transactions, signals go through. If it's degraded, then the business of the customer is degraded. Its responsibility is to be up, support high throughput, low latency, be highly available, all of that. One responsibility, but very high expectations. The control plane has a lot of responsibilities. We need to provision resources, both physical and virtual. We need to provide billing, metering, reconfiguration, resource management, like moving a namespace to another physical cluster when its usage grows, and things like that. A lot more responsibilities. If the operation or performance of the control plane is degraded, customers are impacted, but not nearly as much as with the data plane, because even if the provisioning of a new namespace is delayed, that doesn't bring their current business down. Common sense. There is no secret knowledge here.

Data Plane

Let's look at the data plane. For the data plane we applied cell architecture. Again, the basic idea is common sense. We want our data plane physical resources to be as isolated from each other as possible, so that anything that happens to one instance, whether it's a physical, virtual, or programmatic issue, has no ripple effect. Only the customers that are unlucky enough to be on that resource are impacted for a while, until it recovers. Also, if a human or a machine deploys a change, we want the change to go to one instance. We know empirically from the history of outages that most outages today happen because of misconfiguration, not because some cosmic ray flipped a bit, even though sometimes people use that in RCAs. The reality is you push a change, and then there is the blast radius of it; I think AWS started that term. What is the blast radius of any change? You want it to be contained to a single instance. That's why we call the instances of our data plane cells.

This is a simplified diagram of a single Temporal cell; there are a bunch of things here. This is an AWS example. Currently we're in AWS, but we're adding support for GCP soon, and eventually we'll be in all reasonable clouds. For each cell, we create an AWS account for isolation. That account is a good isolation boundary. Within that account, we create a VPC, a virtual network, for isolation. There, we create an EKS cluster for isolation. You see the theme: we want to isolate everything as much as possible. We run a bunch of compute pods on this EKS cluster; the Temporal server has its own services, but there are also a bunch of infrastructure things, for controlling ingress, managing certificates, and observability, to ship metrics out both to us and to customers. Also, each cell has two databases, the main one and an Elasticsearch one for enhanced visibility. This is a diagram of the first two generations, v1 and v2, of cells. In v3 and v4 we added another layer, the journal, which is a write-ahead log; it runs a bunch of instances of BookKeeper and ZooKeeper, so it gets bigger. We define a new generation when we need to migrate load, when there is a significant change to topology; we don't want to try to make a change like that in place, it's too risky. We create a cell of a new generation, then migrate load from the old cell to the new cell. If we look at this diagram, at least to me it becomes pretty obvious that a single tenant system would be too expensive to run. If you have a use case that uses 5% of this functionality and have to pay for all of this machinery, that would be ridiculous. And if you don't have this machinery, you cannot be as reliable or as manageable. That's why the multi-tenant model makes a lot of sense. The cells today run in 10 or maybe 11 regions of AWS. Similar to that cell architecture diagram, you can connect through the public internet, or through PrivateLink for private connectivity, all standard stuff.
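Purely as an illustration of the layers just described, a control plane might keep a per-cell record along these lines; this is a hypothetical Go sketch, not Temporal's actual data model.

```go
// Hypothetical per-cell record mirroring the isolation layers described above:
// a dedicated AWS account, a VPC inside it, an EKS cluster inside that, plus
// the two databases. None of these types or fields come from Temporal's code.
type Cell struct {
	ID         string // e.g. "cell-usw2-017" (made-up naming scheme)
	Generation int    // v1..v4; a new generation is created for big topology changes
	Region     string // cloud region the cell runs in

	AWSAccountID string // dedicated account per cell
	VPCID        string // dedicated virtual network within the account
	EKSCluster   string // dedicated Kubernetes cluster running Temporal and infra pods

	MainDatabase       string // primary persistence store
	VisibilityDatabase string // Elasticsearch for enhanced visibility
}
```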

Control Plane

Now, the control plane. I mentioned that the control plane has a bunch of responsibilities. Let's look at just a couple of scenarios more closely. First scenario: the user goes to the Temporal Cloud web page, picks a region, types a name, and clicks the button to create a namespace. They want to create a namespace in a particular region. What happens on the backend? Actually, a lot of things need to happen. The algorithm needs to decide where to place it: we have multiple cells in the region, which one should it pick? That's a relatively easy, local decision. Then we need to create a bunch of things: we need to go to the database and create a record, create roles, and propagate this across infrastructure pods. We need to get certificates. We need to provision a DNS name. Once that is done, we need to configure the ingress route. End to end, it takes minutes. The last step is to check connectivity of the two endpoints, one gRPC and one HTTP. We're checking that we can actually connect to those endpoints before we say it's done, meaning that customers can connect. If I were to write this as spaghetti code, there are calls that may fail and take time, like DNS propagation, or calling Let's Encrypt, which may provision the certificate minutes later, so I'd be calling and checking. Then, what if my process crashes in the middle? It's a lot of difficult problems to solve, which have nothing to do with the business task at hand. If I had to write that code, I would be a really sad panda, like, "No, please, get me out of this."
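As a rough sketch of what the workflow version of this sequence could look like with the Temporal Go SDK; the request type and activity names here are invented for illustration, not Temporal Cloud's actual code.

```go
import (
	"time"

	"go.temporal.io/sdk/workflow"
)

// Hypothetical request type for the example.
type ProvisionRequest struct {
	Name   string
	Region string
}

// Happy-path sketch of namespace provisioning; activities are referenced by
// (invented) name strings and would be registered elsewhere on a worker.
func ProvisionNamespaceWorkflow(ctx workflow.Context, req ProvisionRequest) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 5 * time.Minute,
	})

	// Pick a cell in the requested region: the easy, local decision.
	var cell string
	if err := workflow.ExecuteActivity(ctx, "PickCell", req.Region).Get(ctx, &cell); err != nil {
		return err
	}
	// Each subsequent step is just the happy path; retries, backoff, and
	// recovery after a crash are handled by the platform.
	for _, step := range []string{
		"CreateNamespaceRecord",
		"CreateRoles",
		"RequestCertificate",
		"ProvisionDNS",
		"ConfigureIngressRoute",
	} {
		if err := workflow.ExecuteActivity(ctx, step, req, cell).Get(ctx, nil); err != nil {
			return err
		}
	}
	// Last step: verify the gRPC and HTTP endpoints actually accept connections.
	return workflow.ExecuteActivity(ctx, "ValidateEndpoints", req.Name).Get(ctx, nil)
}
```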

What ends up happening? If we look closer at some of the steps, what we want is something that naturally becomes workflow-ish. We want to, in this case, request a certificate from Let's Encrypt, then wait until it's ready. When it's ready, provision a route, then check connectivity, and be done. What often ends up happening is that people start writing something that initially looks simple: I'll write something workflow-ish, I'll checkpoint, I'll remember where I was, I'll recover. It's pretty much always a mistake. It's the kind of puppy you acquire, and then two years later somebody looks at this code, like, what do we do with this? It's brittle. The person who wrote this code is out of the company by now. It's always difficult to keep that alive and maintain it for years. We applied, obviously, durable execution. This is a slide from my talk at QCon New York. Think of it this way: durable execution is a model that's implemented by a family of technologies. AWS has Simple Workflow Service. Azure has Durable Functions. Uber's Cadence is the precursor to Temporal. All these technologies were created by the same two guys who happen to be co-founders of Temporal. Essentially, Temporal is the third iteration of the same ideas. The goal is to go from a smoothie architecture to a layered cake. A smoothie architecture is when bits and pieces of business logic are mixed with bits and pieces of state management and failure handling logic, all in one big blob, which is difficult to maintain and evolve. What we want is a layered cake architecture, where business logic is separated from your integrations with external systems, where you call APIs, and failure handling and state management are implicitly provided to you.

For the purpose of this talk, think of it this way: there are two main concepts. There are actions where you call external systems, APIs, other services; those are called activities. And there is the general orchestration code, the function that orchestrates them. In this example, the calls to the external services foo and bar are activities, but the code that orchestrates them together is the workflow. I put here a screenshot of real production code. The point is that, on the left-hand side, there is a screenshot of the whole function that provisions a namespace. This function is written in the happy path way; it just goes through the steps that need to happen. It doesn't have any retry logic. It doesn't have any what-ifs or backoffs; all of that is not there. It's just the happy path, the business logic, what needs to happen. On the right-hand side is the built-in UI that's provided for the execution of that logic, where you see the numbered steps, and you can go and inspect what's happening: where it is right now, what the arguments were, and things like that. If we zoom in, I mentioned checking connectivity at the very end. The first line I highlighted has activity options with infinite retry. Essentially, it's a policy that says keep retrying these operations, these activities, until they succeed, or until the timeout I gave for the whole operation is up. For example, for namespace provisioning we give a max time, like 30 minutes, because if it didn't provision even with Let's Encrypt and everything, something is wrong. When the timeout hits, we page an engineer; but within that window, this is the last step. We can keep checking: can we connect? Is it DNS propagation, things like that? This is all we need to write; there is no retry or backoff logic. We may try this 100 times, it doesn't matter. That's what we see in this UI. You see the steps, like 118, 124. This is the actual screenshot from the execution of this workflow. It has the steps, each has a timestamp, and you can click and see the arguments and everything, the state. Developers actually love it, because they get out of writing this boilerplate code that's very difficult to test and is almost never correct. It works when the happy path is invoked, but when something goes wrong, all these edge cases happen, and they're not really edge cases, they're just error handling cases that aren't handled. When developers get to just write optimistic code, they love it.
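A minimal sketch of what "retry forever within the overall deadline" looks like in the Go SDK: a MaximumAttempts of zero means unlimited retries, and the overall cap comes from the timeout given to the whole provisioning workflow when it is started. The activity names are placeholders, not the actual Temporal Cloud code.

```go
import (
	"time"

	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

// Last provisioning step: keep checking connectivity until it works or the
// workflow's overall deadline (e.g. ~30 minutes) expires and an engineer is paged.
func validateConnectivity(ctx workflow.Context, namespace string) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 30 * time.Second,
		RetryPolicy: &temporal.RetryPolicy{
			InitialInterval:    time.Second,
			BackoffCoefficient: 2.0,
			MaximumInterval:    time.Minute,
			MaximumAttempts:    0, // 0 = unlimited attempts; the workflow timeout bounds it
		},
	})
	// Hypothetical activities that simply dial the endpoints and report success.
	if err := workflow.ExecuteActivity(ctx, "ValidateGRPCEndpoint", namespace).Get(ctx, nil); err != nil {
		return err
	}
	return workflow.ExecuteActivity(ctx, "ValidateWebEndpoint", namespace).Get(ctx, nil)
}
```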

With that, we converted the steps we needed. There is a high-level workflow, and the steps within it are either activities, calls to other services, or child workflows that encapsulate a set of steps inside. Kind of like functions and subfunctions, all written in the happy path way. When I deal with this kind of code, I become a happy red panda. Actually, Redpanda is one of the companies that used the same approach to build their control plane, and there are other companies too. These are just the ones I was cleared to show publicly that use Temporal to build similar backends, the control planes for their services. The learning here is obvious: when developers are given these tools, they become happy developers, and happy developers are productive developers. People don't like writing boilerplate code.
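Here is a small sketch of that parent/child structure in the Go SDK, again with invented workflow names; think of each child as a subfunction that groups a few provisioning steps.

```go
import "go.temporal.io/sdk/workflow"

// Parent workflow delegating groups of steps to child workflows, so the code
// reads like a function calling subfunctions. Names are illustrative only.
func CreateNamespaceWorkflow(ctx workflow.Context, name string) error {
	// Child workflow encapsulating the certificate steps (request, wait, install).
	certCtx := workflow.WithChildOptions(ctx, workflow.ChildWorkflowOptions{
		WorkflowID: "provision-certs-" + name, // deterministic, per-namespace ID
	})
	if err := workflow.ExecuteChildWorkflow(certCtx, "ProvisionCertificatesWorkflow", name).Get(ctx, nil); err != nil {
		return err
	}
	// Child workflow encapsulating the networking steps (DNS, ingress, connectivity).
	netCtx := workflow.WithChildOptions(ctx, workflow.ChildWorkflowOptions{
		WorkflowID: "configure-networking-" + name,
	})
	return workflow.ExecuteChildWorkflow(netCtx, "ConfigureNetworkingWorkflow", name).Get(ctx, nil)
}
```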

Rollout of a New Version

The second scenario is that we need to roll out a new version of software. The Temporal server code base evolves; we release minor releases, patch releases, all of that. How do we roll it out to the fleet of cells? It's a rolling upgrade. That's the only way to do it, really, in a safe way. Cells in our cloud are organized into deployment rings, numbered from zero to infinity, from least important to most important. For example, ring 0 we call pre-production. These are cells that have constant synthetic traffic going into them, but no customer traffic, so that if something goes wrong, there is no impact on customers. Then ring 1 is cells with namespaces that have either no traffic or very low traffic, and it goes on and on, until on the highest rings we have customers that have a huge amount of traffic and are most critical, and we propagate all changes in that order. The propagation algorithm is, again, very simple. We have a loop over rings, and we sleep between them. Then we loop over batches of cells, because within a ring we don't want to blast the change to all of them; we want the blast radius contained there too. Why do we need to sleep? Because some bugs don't manifest right away. If you have a memory leak or some probabilistic race condition, you want to give it time to manifest and catch it early. For ring 0, we generally wait a week. We give it a week there. If nothing manifests, we start going to ring 1. Then there are wait times, like a day or 24 hours, between rings, but as we go to higher rings, we expand them. It's all configurable. The point is the same: you want to progress gradually. We also have a notion of business hours: when these upgrades happen, we prefer them to happen when people are within reach of Slack, so that if there's any alert, it's better not to wake people up but to have it land in business hours. The scheduler takes care of that. When it gets to an individual cell and upgrades Temporal there, there are four services that it goes through one by one and upgrades: the frontend service, matching service, history service, and worker service. It sleeps between those. It also checks: we have a special canary namespace in every cell that has traffic and keeps executing functionality. We check for 15 minutes: is it fine? Were there any failures in the canary? Because if there were, that may indicate a problem; that's why it's a canary in the coal mine. If there's any failure, then we stop, page the operator, and see what's going on.
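A rough sketch of that propagation loop as a Temporal workflow in Go; the ring definitions, wait times, and activity names are all made up for illustration, not the actual rollout code.

```go
import (
	"time"

	"go.temporal.io/sdk/workflow"
)

// Illustrative ring definition: batches of cell IDs plus a soak time after the ring.
type Ring struct {
	Number   int
	Batches  [][]string
	WaitTime time.Duration // e.g. a week for ring 0, shorter or longer for others
}

func RolloutWorkflow(ctx workflow.Context, version string, rings []Ring) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 2 * time.Hour,
	})
	for _, ring := range rings {
		for _, batch := range ring.Batches {
			for _, cell := range batch {
				// Upgrade one cell; under the covers this talks to the cell's
				// entity workflow and the infra API.
				if err := workflow.ExecuteActivity(ctx, "UpgradeCell", cell, version).Get(ctx, nil); err != nil {
					return err
				}
				// Watch the cell's canary namespace for a while; stop and page on failure.
				if err := workflow.ExecuteActivity(ctx, "CheckCanary", cell, 15*time.Minute).Get(ctx, nil); err != nil {
					return workflow.ExecuteActivity(ctx, "PageOperator", cell, version).Get(ctx, nil)
				}
			}
		}
		// Durable sleep: the workflow can sit here for a week and survive
		// crashes and redeployments of the worker process.
		if err := workflow.Sleep(ctx, ring.WaitTime); err != nil {
			return err
		}
	}
	return nil
}
```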

That's the algorithm; how is it implemented? This is, at a high level, how it's implemented. A rollout is initiated by an operator calling a control plane API, either from the CLI or from an internal web application, and that starts a request workflow. This one is just to encapsulate that there was a request. You can query it later, it has history. You can check two days later: was there a request, what was its status? It's a convenient pattern to keep track of what is running or completed. That request calls the actual rollout workflow that implements all these nested loops. Then, as it goes from cell to cell, it makes RPC calls to so-called entity workflows. It calls the entity workflow of a particular cell, saying, upgrade to this version, then waits for it to complete, and then calls the next one. When a cell entity needs to upgrade its cell, it calls another API, an infra API, which has its own namespace where there are also workflows running that go down to the Kubernetes level and execute Kubernetes commands. It's kind of like two namespaces integrated through this API. They're both implemented under the covers as sets of workflows. It's turtles all the way down in our control plane.
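One way such an "RPC to an entity workflow" could be expressed from inside the rollout workflow is signal-and-wait-for-reply. This is a hedged sketch with made-up workflow IDs and signal names; it assumes the cell entity knows the caller's workflow ID so it can signal back.

```go
import (
	"fmt"

	"go.temporal.io/sdk/workflow"
)

// Signal the long-running cell entity workflow, then block on a reply signal.
func upgradeCell(ctx workflow.Context, cellID, version string) error {
	// The cell entity workflow has a well-known, unique workflow ID.
	entityID := "cell-entity-" + cellID

	// Ask the entity to upgrade itself.
	if err := workflow.SignalExternalWorkflow(ctx, entityID, "", "upgrade", version).Get(ctx, nil); err != nil {
		return err
	}

	// Wait for the entity to signal back; this may arrive an hour later,
	// and the wait is durable across worker restarts.
	var result string
	workflow.GetSignalChannel(ctx, "upgrade-complete-"+cellID).Receive(ctx, &result)
	if result != "ok" {
		return fmt.Errorf("cell %s upgrade failed: %s", cellID, result)
	}
	return nil
}
```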

This is another snippet of production code. It shows that the actual implementation is more optimized than the conceptual diagram. Instead of having three nested loops, the engineer who wrote this code decided to flatten the rings into a list of batches with wait times assigned to them, and have a single loop over batches. It makes sense. Then, before applying a change to a batch, a popular pattern is to post a Slack message in a channel that people watch, so they can see it, literally something like a sleep time of 17 hours, because, again, it takes business hours into consideration. It records what needs to be upgraded. This loop that I showed can generally run for 10 days, 2 weeks, because we want to wait a week after ring 0 and then progress forward. If we were to write this as a single-process application, generally you cannot keep a process running for two weeks; you want to upgrade it, it may crash, you want to restart it and redeploy. Here it's transparent: the workflow just keeps running, keeps going. It's very easy with durable execution, because the illusion you get is that your whole execution is persisted, not just your state.

Entity Workflows

I mentioned entity workflows. That's not a built-in concept, it's just a pattern. Because Temporal gives a very strong uniqueness guarantee on workflow IDs, by assigning a workflow ID based on a physical or virtual resource it's very easy to create these singletons. For example, we have entity workflows for cells, and we have them for namespaces and for users. It provides us a funnel for concurrency control. When we need to update a cell, it is literally an RPC call to that cell entity, saying, I want you to upgrade. Then you can wait for a response, which may come an hour later if it takes an hour to upgrade, with all those wait times between services. Entity workflows are great for modeling digital twins of physical or virtual resources. At Uber, where Cadence, the predecessor of Temporal, is widely used, the recruiter told me that they have more than 1000 use cases now running on Cadence. What surprised me is that, of course, they modeled a lot of entities as entity workflows, and one of them is loyalty points. When you take a ride with Uber, they increase your loyalty count. Surprisingly, that data, at least at the time I was told, is not stored in any database; it's just stored as part of the entity workflow state within the system. It's as if it's in memory, but it's persisted. Because of the mechanism of signals, you can make these RPC calls and send the command, and say, I want you to do this. There are two ways to implement it. You can do it as I described for the Uber loyalty case, where the whole state of the entity is within the workflow itself, or you can externalize the state. We ended up moving from the first pattern to the second, where, for example, we call a cell entity and say, "You need to upgrade." This workflow activates, it knows its identity, it goes to the database, loads its state, performs operations, and updates the database as necessary. Then it completes when there is nothing left to do. For me, it was very close to home, because I spent a decade working on actors, in particular on virtual actors, where the idea is exactly the same. You call this thing, in this case an entity workflow, and it's always there; it knows how to load its state transparently, and it executes. Whether it has completed, you don't need to know; you can always call it, it's always available. Even though it's just a pattern, it's a very powerful pattern to model a bunch of physical, real-life things. It scales horizontally really well, because you can have millions of them running. It's embarrassingly parallel; there is nothing for them to share. They're isolated, just like in the actor model. The old name for entity workflows is actor workflows.
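To make the pattern concrete, here is a hedged Go sketch of what a cell entity workflow with externalized state might look like; the signal name, activity names, and the continue-as-new threshold are all assumptions for illustration.

```go
import (
	"time"

	"go.temporal.io/sdk/workflow"
)

// Sketch of a cell entity workflow: a long-running singleton (its workflow ID
// is derived from the cell ID) that processes upgrade commands delivered as
// signals and keeps its real state externalized in a database.
func CellEntityWorkflow(ctx workflow.Context, cellID string) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: time.Hour,
	})
	commands := workflow.GetSignalChannel(ctx, "upgrade")

	for i := 0; i < 500; i++ {
		var version string
		commands.Receive(ctx, &version) // blocks durably until a command arrives

		// Externalized state: load the cell's record, act on it, write it back.
		if err := workflow.ExecuteActivity(ctx, "LoadCellState", cellID).Get(ctx, nil); err != nil {
			return err
		}
		if err := workflow.ExecuteActivity(ctx, "UpgradeCellServices", cellID, version).Get(ctx, nil); err != nil {
			return err
		}
		if err := workflow.ExecuteActivity(ctx, "SaveCellState", cellID, version).Get(ctx, nil); err != nil {
			return err
		}
	}
	// After many commands, continue-as-new to keep the event history bounded.
	return workflow.NewContinueAsNewError(ctx, CellEntityWorkflow, cellID)
}
```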

Summary

As I claimed in the beginning, if you haven't worked on building a managed cloud service, chances are you will at some point, because that's the monetization mechanism and the best experience for end users. Economics will drive the decision, not just the architecture, because people want multi-tenanted services. That's more work for us, but it's better for the customers. If you build such a service, you will be building a control plane. There is no way around it: you have your data plane resources, and the control plane is the brain of it. If you're tempted to build your own workflow-ish system, the "I will record my intent, I will checkpoint, I will know how to recover" kind, my claim is: just don't. It is almost always a mistake. Look at the solutions that are available. We picked ours, and we picked Temporal, not for dogmatic, ideological reasons. Our CEO didn't come and say, "Thou shalt use Temporal." We just looked at what we could do; if we had liked another technology better, we would have done whatever was better for the business to make our control plane better. It just happened that this technology fits so well. It's not just us; other people are doing the same thing. Durable execution is a very powerful model, like I mentioned. There are other resources to learn about it. My favorite use is digital twins; entity workflows are good for digital twins. It's very easy, very natural to express the object-oriented relations between physical entities in the real world and the virtual entities that call each other and have their state isolated. And we're not the smartest people in the room: other people have realized the value of durable execution and how well it fits this problem space, and they've been using it. Some use it self-hosted, some of them have already come to our cloud. It doesn't matter; it's the pattern that matters. Durable execution is a powerful mechanism for building these kinds of applications. If you're going to be building your control plane, just consider it; I'm not aware of any better tool to use for that.

Questions and Answers

Participant: I'm just curious to get a peek into some of the implementation details of what happens to one of those blocks of code that you showed if the machine or process that's running it goes down and the workflow starts back up. How do you get back to the spot in the code where it left off? How does it skip over those other parts above it? Can you explain that?

Bykov: There's workflow.ExecuteActivity, and the activity is actually that function, validate gRPC or validate web endpoint. It's not a direct call into the function, it's a wrapper; there is a framework that wraps the call. What happens behind the scenes when that line hits is that the SDK that implements this function calls the server and says, there is an intent to execute this function with these arguments, and that gets recorded. That happens for all the steps. It's recorded with encryption on the client side; the server doesn't see the arguments or anything. It's very secure. If you happen to crash on the second line, the previous one already executed. When the process that runs the workflow restarts, it starts going from the beginning. It executes every single line. When it gets to a line with an already completed activity, the framework just injects the response back. It doesn't make the call to validate; it knows that it already succeeded. You have this illusion that you re-executed it, but actually you didn't execute any of the activities that already succeeded, they just got short-circuited. That's how you get back to the line, with all your in-memory state reconstructed in the process. We call it replay; that's why we call our conference Replay. This is key.

 


 

Recorded at:

Apr 19, 2024
