Stateful Programming Models in Serverless Functions

Summary

Chris Gillum explores two stateful programming models: workflows and actors. He discusses how they can simplify development and how they enable stateful and long-running application patterns within ephemeral, Serverless compute environments. He explains why he is making a big bet on these programming models in the Azure Functions service.

Bio

Chris Gillum is Principal Software Engineer at Microsoft working on Azure App Service and Azure Functions (especially #DurableFunctions).

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Gillum: My name is Chris [Gillum]. I'm a software engineering manager for the Azure Functions team at Microsoft. I want to talk to you about stateful programming models in serverless. Here's just a quick overview of what I want to talk about today. We're going to talk just a little bit about serverless - I assume that folks in the audience know what serverless is. I want to talk about two programming models that are inherently stateful and how they can be composed with serverless. Those two models are workflows and actors.

What motivates me? I've been at Microsoft for about 13 years now, full-time and working in a lot of enterprisey and cloud-related areas. I myself am an engineer. I've had to work on a lot of hard, complicated problems, and one thing that I've learned is that I don't like complexity. I assume that most of you folks don't either. Developers tend to be my customers, I want to make hard problems easy. I want to make complex things simple so that myself and other developers can be more productive. You'll see that as a theme throughout this talk.

Serverless

When I talk about serverless I'm mostly talking about Functions-as-a-Service or FaaS, talking about elastically scalable event-driven programming models and generally low-cost, pay for only what you use.

If you've done some research into serverless or looked into it, you may have heard some of these best practices which some of the thought leaders in the space will often promote. Functions must be stateless, functions must not call other functions, functions should do only one thing. I tend to agree with these in principle, but I do push back on them a little bit. I think that there are actually ways that you can do stateful serverless in a way that's responsible and still adheres to the spirit behind these principles. We're going to talk a little bit about that throughout my talk.

Workflows

The first stateful programming model that I want to talk about is workflows. Here's an example of a very common workflow pattern. In this diagram, F1, F2, and F3 are functions. You could think of those even as micro-services. I'm using the Azure Functions logo because it's convenient. In this particular flow, you have some function F1, it produces some outputs, that output is stored into a queue represented by the cylinder shape here. That becomes the input of some other function F2, which then does some computation, creates an output that gets fed to F3 - just a chain of functions.

This is a very common pattern that you see maybe in order processing. You receive an order, you need to go and check your inventory system, you need to go and remove from your inventory, get the things shipped, there's a lot of sequential steps that need to be taken. You can implement these with classical serverless today, but there are a bunch of problems that you'll probably run into as you start to do this, one of which is that the relationship between your functions or your micro-services and these queues is unclear.

If you're actually using a cloud vendor, and you're looking at your list of all the functions that you have, and the list of all the queues that you have provisioned you're not going to get a nice diagram like this typically that shows you what that relationship looks like. Typically, you just have a flat list of all your different resources that you're managing. You don't have as much clarity on, what is it that my application is doing? What are those dependencies looking like?

Operation context - you're calling these functions in the context of some particular operation. Typically, you have to flow that context through the whole thing. Oftentimes, you may need to rely on an external database which has a record of the order that you're actually processing, and then flow an identifier for that record through these different functions.

Middle queues are actually an implementation detail. You're just trying to pass information from one function or one service to another. The fact that there is a queue that you have to manage, that's a bit of conceptual overhead. Then, if you think about error handling, obviously, queues will give you some degree of error handling. If there's a machine failure or something, that message will go back, stay on the queue, or when your compute comes back up it can pull it off of the queue. You get some sort of resiliency there.

What if you need to do things like compensation? Let's suppose that there was some application-level failure within F3, and you need to do some compensating action to undo what you did in F2 and F1. Now suddenly, you have to add a bunch more queues, and then the lines and boxes get a lot more complicated.

There are some solutions that make some of this easier such as declarative workflow, like designers where you drag and drop "Here's my business flow," or perhaps even using some sort of XML or JSON markup to describe "This is what my user flow is." It solves some of the problems that I mentioned before. These are commonly used today, but from my experience and what I've heard from talking with many others, you run into problems of scalability with a lot of these things. Typically, business workflows that you're designing, they're maybe not simple four-step flows. They tend to have a lot of conditions and maybe even some loops and different things. Especially if you're using a visual designer, that tends to not scale very well. You have to zoom out and there's just a lot of complexity. Even in the case of markup that JSON file could get really nested, really deep, really quickly.

There tends to be a bit of an impedance mismatch. Typically, the functions or the services themselves that you're orchestrating, those are in code and they're dealing with data, they're generating data. Then you have to somehow marshal that data through this workflow system and declare how you're going to pass it from one action to the next. Because you're using some declarative language, there can be a bit of an impedance mismatch. I would say they're not exactly developer-friendly in a lot of cases. You have to learn this new language, so to speak. Developers want to do things like unit test their workflows, not just the individual functions, but even the thing that ties them all together. You can't do that easily with some of these declarative solutions.

One of the projects that I work on is called Durable Functions, which Colin [Breck] mentioned. I want to show you how with what we've come up with, you can actually do this function chain using code. This function that we have here on the screen is what I call an orchestrator function. It basically represents that diagram that you see to the right. We know that this is an orchestrator function because it has this context object, which is passed in as a parameter and which will give you some information such as what is the ID of this orchestration and what was the instance that kicked it off?

Then all the different actions that you run in the middle are what we call activity functions, F1, F2, and F3 in this case. Here, this is in C-Sharp, obviously, and we're just using some simple APIs to actually call F1, F2, and F3, get some return values back and then pass them on to the next step which feels very natural to a developer. If you need to do compensating error handling, you can use try catch and those sorts of things like we have here. It turns out we're able to achieve the exact same behavior in terms of reliability between F1, F2, and F3 that you get if you manually implemented this using stateless functions and queues.

One of the things that I should mention is, our use of await here, if you're familiar with async-await, we have a trick where whenever you do an await statement, we're actually able to checkpoint your progress within this function so that if it gets unloaded from memory, if it crashes or something like that, let's say you've finished F2 but haven't gone on to F3, as soon as we bring this code back up onto a healthy VM, we're actually able to start where we left off. We don't need to re-execute F1 and F2, we can start directly from F3 just like if you're using normal queues. This is what I think chaining could look like using code. This is, in fact, something that we do.
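The slide code is not part of the transcript, so the following is only a minimal sketch of what a code-first function chain with compensation might look like, written with JavaScript generators. The activity names `F1`, `F2`, and `F3` come from the talk; `UndoF2`, `callActivity`, and the `run` driver are hypothetical stand-ins, and this toy driver keeps everything in memory with no checkpointing at all.

```javascript
// Hypothetical activity functions; a real app would do I/O here.
const activities = {
  F1: () => 42,
  F2: (n) => n + 1,
  F3: (n) => n + 2,
  UndoF2: (n) => `undid F2(${n})`, // illustrative compensating action
};

// Orchestrator: ordinary sequential code. Each `yield` marks a point
// where a durable runtime could checkpoint progress.
function* chain(callActivity) {
  const x = yield callActivity("F1");
  const y = yield callActivity("F2", x);
  try {
    return yield callActivity("F3", y);
  } catch (err) {
    // Application-level failure in F3: undo the earlier step.
    yield callActivity("UndoF2", y);
    return `compensated: ${err.message}`;
  }
}

// Toy in-memory driver (assumption: no queues, no durability).
function run(orchestrator, acts) {
  const callActivity = (name, input) => ({ name, input });
  const gen = orchestrator(callActivity);
  let step = gen.next();
  while (!step.done) {
    const { name, input } = step.value;
    let result;
    let failed = false;
    try {
      result = acts[name](input);
    } catch (err) {
      failed = true;
      result = err;
    }
    // Surface activity failures inside the orchestrator as exceptions.
    step = failed ? gen.throw(result) : gen.next(result);
  }
  return step.value;
}

const happyPath = run(chain, activities);
```

The point of the sketch is only the shape of the orchestrator: return values flow from one step to the next as local variables, and compensation is a plain `try`/`catch`.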

How Do We Preserve Local Variable State?

I thought maybe I'd ask you guys if you have any thoughts on how we do this? The answer is C, event sourcing. We're actually not doing any special memory snapshotting or compiler hooks. We're just using event sourcing to create a statefulness in these orchestrator functions that also gives us a degree of reliability. We're going to talk about how that works behind the scenes. I warn you, this slide is a little busy with a bunch of animations. I probably spent way more time on this than I should have, but I wanted to explain to you how we're using event sourcing behind the scenes to actually implement those top three lines of code and make it durable and reliable.

In the beginning assume we have some trigger-function which actually starts up this workflow, and it's putting a message into a queue behind the scenes. This queue is managed by us. This is not something that when you write this type of code that you have to manage. We have a framework that actually puts that into a queue for you that we provision and manage.

Then on the right-hand side here, we have an event history which is basically our event sourcing, the log that we keep of what happened. We trigger this, we write execution started. At that point, this orchestrator function sees that, "I have a message to start executing," it's going to go ahead and read that. Then I'm highlighting the line of code that we're currently at, like if you're doing interactive debugging. The first line says, "Ok, I need to call this activity function called F1." Behind the scenes, what's actually going to happen is, this function is going to drop a message into another queue, and we're going to write another entry into the event history saying, "Ok, we scheduled F1, great."

At this point, we can actually unload that function from memory either because we want to preserve memory, because this is going to be a long-running operation, this F1, or it could even be like a machine crash, for whatever reason we're able to unload this. The activity function then can pick up that message and it can say, "I need to execute that." On the bottom here, I'm saying F1 just returns a number, say, 42 for the sake of simple illustration.

It executes that, it gets a return value, it puts that return value back into another control queue which is going to trigger our orchestrator function on the top again. You'll notice that we wrote in the event history that the task completed F1 returned a value of 42. Now the orchestrator picks that up, and because we unloaded it from memory, it has to start its execution from the very beginning. Once again, we're at that first line and saying, "I want to call activity async F1." This time, we have that context and we're able to look at our event history through that context and see that "Look, we already ran F1, it returned a value of 42." Instead of running that again, I'm just going to take that value that I got back, and then just return it immediately so that the variable x now contains that value, and we're able to move on to the next step.

Now, we're calling F2; we have not called F2 yet, so we go through the same process. We write a message into a queue, we write down that we schedule this, and then we actually execute it. It's going to take n plus 1 in this case, so now we're up to 43. It sends that response back, again updating our history, and then the orchestrator, once again, starting from the very beginning, is able to walk through the history and say, "Ok, I did F1, I got 42, move on to the next line. Ok, I called F2 and now my response is 43. I can save that into y."

Then once again, next step, haven't called F3 yet, let's go ahead and go through that process again of scheduling that final F3 message. It's going to execute, we're adding two more, we're sending the response back, and then the orchestrator can pick that up. Again, walking through the history one final time, we can see that we've already called all of these functions before and we have values for them. At this point we're completed, and we have some final return value of 45.

As you can see, we've now created a statefulness here which is implicit: we're able to rebuild local variable state, walk all the way through this orchestration, and have a lot of the reliability guarantees that you would expect. If you're running this in a serverless compute environment, like Azure Functions or a service like Lambda where you're charged based on how long your functions are executing, this is actually really nice because of how aggressively we can unload these orchestrator functions so that you're not being double-billed. The orchestrator function is not sitting in memory waiting for the activity function to complete. Double-billing is one of the reasons why there is this principle that functions shouldn't call other functions.
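The replay walkthrough above can be approximated in a few lines. This is a simplified sketch, not the Durable Functions implementation: the event names, the `replay` and `runToCompletion` helpers, and the in-memory `history` array are all assumptions made for illustration, and the activities run inline rather than through queues. Each call to `replay` builds a fresh generator, which stands in for the orchestrator having been unloaded from memory.

```javascript
// Activities from the walkthrough: F1 returns 42, F2 adds 1, F3 adds 2.
const activities = { F1: () => 42, F2: (n) => n + 1, F3: (n) => n + 2 };

function* orchestrator(ctx) {
  const x = yield ctx.callActivity("F1");
  const y = yield ctx.callActivity("F2", x);
  return yield ctx.callActivity("F3", y);
}

// Re-run the orchestrator from the top, feeding it recorded results.
// Stops at the first activity call that has no recorded result yet.
function replay(orchestratorFn, history) {
  const ctx = { callActivity: (name, input) => ({ name, input }) };
  const completed = history.filter((e) => e.type === "TaskCompleted");
  const gen = orchestratorFn(ctx);
  let step = gen.next();
  let i = 0;
  while (!step.done) {
    if (i >= completed.length) {
      return { done: false, pending: step.value };
    }
    // Already ran: return the recorded value instead of re-executing.
    step = gen.next(completed[i++].result);
  }
  return { done: true, output: step.value };
}

function runToCompletion(orchestratorFn) {
  const history = [{ type: "ExecutionStarted" }];
  let episodes = 0;
  for (;;) {
    episodes++;
    const outcome = replay(orchestratorFn, history);
    if (outcome.done) {
      history.push({ type: "ExecutionCompleted", output: outcome.output });
      return { output: outcome.output, history, episodes };
    }
    const { name, input } = outcome.pending;
    history.push({ type: "TaskScheduled", name });
    // Each activity executes exactly once; its result is appended.
    history.push({ type: "TaskCompleted", name, result: activities[name](input) });
  }
}

const result = runToCompletion(orchestrator);
```

In this sketch the orchestrator body runs four times in total (once per scheduled activity plus a final pass), while each activity runs only once, which mirrors the walkthrough: local variables `x` and `y` are rebuilt from the history on every pass.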

Because we're using event sourcing and we're not doing anything with snapshotting memory or doing compiler tricks, we can actually do this in multiple languages pretty easily. Here's the exact same orchestrator function which does that function chain written in JavaScript. In the case of JavaScript, we're actually using generators instead of async-await. The reason we use generators is because we have a little bit more control compared to the way that promises work in JavaScript. It's the same basic idea - just replace awaits with yield and we could do the same sort of things. Again, just using event sourcing to power all of this.

In order for this to work, we're replaying your function code multiple times to rebuild that state and continue to make progress, so your orchestrator code must be deterministic. In order to make it deterministic, we have a few simple rules that need to be followed when you're authoring this code, one of which is you can't have any random numbers or any random data where if you call some API multiple times it's going to return different values. That's things like creating new GUIDs or UUIDs, getting the current date/time, or just generating random numbers. You can't do that because that's going to mess up that replay logic.

You also can't do I/O or custom thread scheduling directly in these orchestrator functions because those, similarly, are non-deterministic. That file that you're reading the first time might not exist the next time that you try to read it and that would mess up the replay. Don't write infinite loops. You saw that we were creating this history of all the steps that we took. If you write an infinite loop, then that history is going to grow unbounded and something is going to fall over.

Luckily, we have simple workarounds for all three of the above rules. Rule number four is please use those workarounds. For example, if you need the current date/time, we have a deterministic API that can be used for that. Similarly, if you need a random global unique identifier, there's an API for that. If you need to do I/O, you can do it inside those activity functions which I described previously, those can do whatever they want. Because we will cache the results of those inside that execution history, we don't actually have to call those again as part of the replay. We just need to remember what the result of it was the first time that we called it.
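To see why a replay-safe API helps, here is a small sketch of one possible technique: record a non-deterministic value the first time it is produced and hand back the recorded value on every replay. This is an illustration under stated assumptions, not a description of how Durable Functions implements its deterministic date/time or GUID APIs; `makeContext`, `currentTime`, and `newId` are hypothetical names.

```javascript
// Builds a context whose "random" values are recorded in `log`
// on first execution and read back from `log` on replay.
function makeContext(log) {
  let cursor = 0;
  const recorded = (produce) => {
    if (cursor < log.length) {
      return log[cursor++]; // replay: reuse the recorded value
    }
    const value = produce(); // first execution: produce and record
    log.push(value);
    cursor++;
    return value;
  };
  return {
    currentTime: () => recorded(() => Date.now()),
    newId: () => recorded(() => Math.random().toString(36).slice(2)),
  };
}

// Orchestrator-style code that needs a timestamp and an identifier.
function orchestratorBody(ctx) {
  return { startedAt: ctx.currentTime(), orderId: ctx.newId() };
}

const log = [];
const first = orchestratorBody(makeContext(log)); // original execution
const replayed = orchestratorBody(makeContext(log)); // later replay
```

Calling `Date.now()` or `Math.random()` directly inside `orchestratorBody` would give different values on the second pass, which is exactly the kind of divergence the rules above are meant to prevent.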

It's relatively straightforward to actually do static analysis of code to make sure that people are following these rules, and there's a few runtime checks as well. From my experience, this hasn't been a major issue for people who have tried to adopt this mechanism of writing deterministic code. The rules are fairly simple and easy to follow. You'll find that you can be quite successful using techniques like this.

The previous example that I showed was pretty basic - just a simple function chain. You could do that pretty easily in a lot of different ways. Where things get really interesting is when you want to do things like what I call fan-out and fan-in. Let's say you have some function F1, which goes and fetches a set of work items that need to be executed in parallel and that's represented by multiple instances of F2. Here we have three, you could potentially have 1,000 executions of F2 that have to run concurrently. Then maybe once they're all done, you have to do some aggregation and call F3 to do some final processing of all of the work that you just did.

There are problems when you try to implement this using traditional means as well. Fanning out is easy; any serverless compute can do that where you just drop a bunch of messages into a queue, you have some function that triggers off of those messages, and you get parallelization, and that just works great. The problem that you run into is how do you do the fan-in. All of those are running in parallel, you don't know exactly when they're going to complete, and there needs to be some sort of coordination that happens so that you know as soon as the last one is finished to immediately move on to F3.

Typically, that requires you to have some sort of stateful agent that's running in the background and monitoring all the work that's done here. You can do that, but it's a lot of work. Obviously, you have the same problems that I described in function chaining, now we've just made it even more complicated because we've introduced a lot of parallelization into the flow as well.

It turns out that if you're using event sourcing and creating a programming model on top of that, it's actually relatively trivial to solve. In this case I have the C-Sharp function, it calls F1 like I discussed and returns an array of items which can be any arbitrary size, then we're going to do the fan-out part, which is, I'm looping through all of those, I'm calling some F2 function, I'm not awaiting them because I want to run these in parallel, so I'm just scheduling them. If you've ever used promises or the task parallel library in .NET, then this will look very natural to you. We're not doing anything special here in terms of the programming model.

We schedule all of these and we keep track of all these background tasks that we've created, and put them into a list. Then the fan-in is simply we're just doing Task.WhenAll - please await at this point until all of the parallel tasks have completed. Once that is completed, we can immediately go to the next step where we do some sort of aggregation on the results. In that diagram that I showed previously the orchestrator function is getting signaled whenever something is finished. We know immediately when all of those parallel tasks are finished and we can move on to the next step without needing to implement any other monitoring system to keep track of that. The problem becomes trivially simple even in distributed environments where you just need to write a function.
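A generator-based sketch of the same fan-out/fan-in shape might look like the following. The `ctx.all` helper plays the role that `Task.WhenAll` plays in the C-Sharp version; it and the `run` driver are hypothetical, and this toy driver simply executes the scheduled tasks one after another in memory, whereas a real runtime would dispatch them in parallel.

```javascript
const activities = {
  F1: () => [1, 2, 3], // fetch the work items
  F2: (item) => item * 10, // process one item
  F3: (results) => results.reduce((a, b) => a + b, 0), // aggregate
};

function* fanOutFanIn(ctx) {
  const items = yield ctx.callActivity("F1");
  // Fan-out: schedule every F2 without yielding on each one.
  const tasks = items.map((item) => ctx.callActivity("F2", item));
  // Fan-in: a single yield that resumes once all tasks are done.
  const results = yield ctx.all(tasks);
  return yield ctx.callActivity("F3", results);
}

// Toy driver (assumption: sequential, in-memory, no durability).
function run(orchestratorFn) {
  const ctx = {
    callActivity: (name, input) => ({ kind: "task", name, input }),
    all: (tasks) => ({ kind: "all", tasks }),
  };
  const exec = (task) => activities[task.name](task.input);
  const gen = orchestratorFn(ctx);
  let step = gen.next();
  while (!step.done) {
    const v = step.value;
    step = gen.next(v.kind === "all" ? v.tasks.map(exec) : exec(v));
  }
  return step.value;
}

const total = run(fanOutFanIn);
```

The orchestrator never polls or counts completions itself; the single fan-in `yield` is the only coordination code the author writes.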

I’ll provide a little bit of context on how some people are using this. I had a chance to go to Japan and work with Fujifilm a little bit on this new system that they were creating for Japan's professional baseball league, NPB. They have a system where there are a bunch of photographers who go to these games and take pictures of the different players and they submit them to this image work system. There's a workflow that needs to run for every one of those for the different batches of those pictures where they need to do processing, they need to do classification analysis, they need to look at the image to see, "Ok, who is being depicted here?" and so on. There's a bunch of work that they need to do to implement this flow of processing pictures.

One of the interesting outcomes of this was, they had an older system where they could do maybe 3,000 photos, I think, in about 4 hours. They actually reimplemented it on top of Durable Functions using the stateful event-sourcing-based programming model, and they were actually able to get this entire workflow down to about 30 minutes, which is a pretty big boost. I don't credit Azure necessarily for enabling that, but really, I think it had a lot to do with the productivity gains that you get when you're able to introduce someone to simple coding constructs that people can understand that take care of a lot of the complexities that your engineering team otherwise has to deal with. I just want to call that out as a success story that we're seeing using models like this.

Human Interaction

One other workflow pattern I wanted to highlight is human interaction. The basic idea is, let's say you have some sort of an order processing system internally with your company. I need to make a purchase order, let's say if it's over $1,000, it needs a manager approval. That manager has to go and click a button and say, "Yes, I approve this," or "No, I do not approve this." It turns out that humans are not as reliable as cloud systems. Sometimes they go on vacation, sometimes they get distracted, and so you need to implement controls for that within your workflow. For example, a timeout - let's say if a manager does not approve this purchase order within three days we have to go down some escalation path. Maybe we need to go and send an email to the manager's manager or something like that. This is a common workflow pattern as well.

Multi-factor authentication is another example of this type of workflow where somebody wants to log into a system and you need to send them a code, and then they need to enter in that code to prove that they are who they say they are and there's a timeout associated with that.

The problems with implementing this are, there's race conditions between timeouts and the actual approval that you need to be able to handle. There's how do you do the cancellation of the timeout if you did get the approval but then the background timer expired in the meantime? Same problems that you have as before. We have a code version of solving that as well.

This one's a little bit longer but it's still conceptually quite simple. There's an API where you can basically create a timer, where this orchestration can send a message to itself at some specified time. In this example we're saying in 72 hours, representing 3 days. There's also an API where you can wait for some external event. These are very stateful types of things where we're giving the illusion that we're actually suspending the execution of this workflow waiting for somebody to click a button which is going to send a message to this function and cause it to resume where it left off.

In both of these cases, we're not actually awaiting the task yet, we're going to this next line and we're saying Task.WhenAny - or very similar to I believe it's Promise.any if you're using JavaScript or similar languages - where we're actually checking to see which one of these came first, the approval event or the timeout event? Then depending on which one you got first, you can either go into the process approval branch, or you can go into the escalate branch. It's a very simple way to implement this and still get all the durability guarantees that you get with queues. If something completely fails while you're waiting for these things to happen, your VM can be restarted and you'll still resume where you left off effectively.
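The approval-versus-timeout race can be sketched with a simulated clock. Everything here is illustrative: `waitForExternalEvent`, `createTimer`, and `any` are hypothetical helpers on a toy context, and `arrivals` is a made-up way of telling the driver at which simulated hour an external event shows up. For plain JavaScript promises, the closest built-in to "whichever finishes first" is `Promise.race`, which settles with the first promise to settle.

```javascript
function* approvalWorkflow(ctx) {
  // Create both tasks without yielding on either one yet.
  const approval = ctx.waitForExternalEvent("Approval");
  const timeout = ctx.createTimer(72); // hours, i.e. 3 days
  const winner = yield ctx.any([approval, timeout]);
  if (winner === approval) {
    return `processed: ${approval.result}`;
  }
  return "escalated";
}

// Toy driver: resolves `any` with whichever task has the earliest
// simulated arrival time. Events that never arrive are at Infinity.
function run(orchestratorFn, arrivals) {
  const ctx = {
    waitForExternalEvent: (name) => {
      const hit = arrivals[name];
      return {
        name,
        at: hit ? hit.at : Infinity,
        result: hit ? hit.payload : undefined,
      };
    },
    createTimer: (hours) => ({ name: "timer", at: hours }),
    any: (tasks) => ({ tasks }),
  };
  const gen = orchestratorFn(ctx);
  let step = gen.next();
  while (!step.done) {
    const first = step.value.tasks.reduce((a, b) => (b.at < a.at ? b : a));
    step = gen.next(first);
  }
  return step.value;
}

const approvedInTime = run(approvalWorkflow, { Approval: { at: 5, payload: "yes" } });
const neverApproved = run(approvalWorkflow, {});
```

The sketch leaves out timer cancellation, which the talk names as one of the things a real implementation has to handle once the approval wins the race.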

Aggregator

This is what we started with in Durable Functions; I think we GA'd the original version about a year and a half ago, I would say. It worked great for a lot of the scenarios that we described. There were a few scenarios that we saw customers trying to implement these stateful serverless patterns which were awkward to do with these workflow paradigms.

One of them is what I call the aggregator pattern. The idea is that I have something which needs to do some sort of counting. For example, one customer that we had needed to start a workflow but only after receiving 10 notifications from some external system. We didn't have a good way of modeling that, so we were thinking about how could we do that? Obviously, if you wanted to do this yourself, you have to think about, "If I need to process 10 documents, wait for those to arrive, and then move on to the next queue, where do I store the state for that? How do I correlate these events?" Maybe you're receiving notifications from a variety of different systems and they have some information that correlates some subset of them together, how do you manage all that? How do you synchronize access to state in general?

Actors

What we decided was that we need another primitive. That primitive in our case was actors. What we ended up doing is we ended up partnering with Microsoft Research on this because we didn't have a lot of experience with actors ourselves. If you're familiar with the Orleans framework some of the researchers that we worked with were contributors to Microsoft Orleans which is an actor system. We came together and said "How can we expand what we've been doing with Durable Functions to accomplish some of these other stateful patterns and still make them work in a serverless environment?"

Actors are really interesting. If you ask people about actors, some people love them, some people hate them, most people don't know anything about them - at least this is my experience. Of the people that do know them, they have a wide variety of opinions about them. My colleague, David Fowler, is an architect for .NET within Microsoft, and he has a lot of followers on Twitter. He was just curious on his own, he just asked a simple question, "Why aren't actor frameworks more popular?" We're learning that using David [Fowler] is a great way to do market research because he has a lot of followers and we got a lot of responses to this question.

We got information back like, "These are so hard to debug: Why did my state change? Why did something happen or not happen?" A lot of different opinions like that. This reply from Roger [Johansson], "Cloud native tools won. There are other ways to accomplish all of these stateful problems." The second thing that he mentioned here is, there's zero exit strategy out of actor frameworks, which I thought was really interesting, especially coming from him because he's actually the author of Akka.NET, as I understand, as well as another actor framework. I guess he would know.

Another common response is that people just want to use CRUD. They don't want to use these actor programming models, they're just taught "I create a web app, it talks to a database; we'll deal with concurrency and all those things at the database level." Then one of my favorite responses was by Jason [Barnes]. I actually found out that Jason [Barnes] is here at the conference somewhere - there you are, I hope you don't mind me posting this. He said, "We don't have a way to right-click and publish this. We need a managed version of this," which to me is, "Yes, that's what serverless is all about." I love that. That's exactly what we would want to do because with a lot of actor frameworks today, you have to deploy a cluster and manage that cluster and manage the health of that thing. It's a lot of upfront investment that you would have to do, it'd be really nice if you could have a serverless version of that. Then some nice person happens to know about the work that we were doing on Durable Functions and we actually created something called durable entities, which is effectively an implementation of the actor model in the context of Durable Functions.

Just revisiting where we came from with Durable Functions - we started out with a few different function types. There are regular stateless functions, you have your orchestrator functions which are stateful. Those compose a bunch of activity functions, which are the different steps within your workflows. Then what we decided we want to do is, we don't want to invent yet another actor framework, we want to actually take the capabilities of actors and these stateful patterns and just fold them into the family here, not make them into their own separate silo.

We actually have a couple of different syntaxes that we came up with for doing this. Again, this is all on top of the Azure Functions programming model. If you've ever used Azure Functions or even Lambda, my understanding is it's pretty similar the way that you just can declare a function. You just write a function, this function gets invoked whenever a message gets sent to your actor or your entity, as we call it, and it has an operation name, you can write a switch statement and say, "If this is add, I want to get the inputs and then update my state with the current value plus some amount" or again, reset, set the current state to zero, or maybe I want to return a value back to somebody else if they do a GET.
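As a rough illustration of the function-style entity described here, the following sketch implements the counter with a switch on the operation name. The `ctx` object and the `signal` host are hypothetical stand-ins written for this example, not the Durable Functions API, and the `store` map stands in for whatever durable storage a real runtime would use. State is kept per entity ID, so different instances of the same entity have independent state.

```javascript
// Entity function: invoked once per message sent to the entity.
function counter(ctx) {
  const current = ctx.getState(() => 0); // default state for a new entity
  switch (ctx.operationName) {
    case "add":
      ctx.setState(current + ctx.getInput());
      break;
    case "reset":
      ctx.setState(0);
      break;
    case "get":
      ctx.return(current);
      break;
  }
}

// Toy host (assumption): in-memory state keyed by entity ID.
const store = new Map();

function signal(entityFn, entityId, operationName, input) {
  let result;
  const ctx = {
    operationName,
    getInput: () => input,
    getState: (init) => (store.has(entityId) ? store.get(entityId) : init()),
    setState: (state) => store.set(entityId, state),
    return: (value) => {
      result = value;
    },
  };
  entityFn(ctx); // synchronous, so one message at a time per call
  return result;
}

signal(counter, "orders", "add", 5);
signal(counter, "orders", "add", 3);
signal(counter, "visits", "add", 1);
const orders = signal(counter, "orders", "get");
const visits = signal(counter, "visits", "get");
```

The switch statement is easy to follow with three operations, which also makes it easy to see how it could become unwieldy as the number of operations grows.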

This is the function syntax that we've done. We tried this in our initial alpha, and we got a lot of feedback like "This is really cool. This is great because I can do this in a serverless environment, that's wonderful. It becomes a little bit weird if I have a lot of different operations that I want to be able to implement. It'd be great if we had a simpler way of not having these giant switch statements."

Another thing that we did for C-Sharp in a subsequent beta release was we made it so that you could have this class-based syntax. This is still running on top of the Azure Functions programming model, so we have this boilerplate code that runs in the bottom here. The idea is that you just write a method that corresponds to those different operations, so add, reset, and get in this very simple example here, and you just have some field that lives on your class and you can decorate it with some serialization attributes that says "This is my state that I care about." Your code just needs to update that state and we take care of the serialization behind the scenes for you. This follows all the rules that you would expect of actors around, like only processing one message at a time so that you don't have to worry about concurrency issues, and so on.
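The class-based syntax in the talk is C-Sharp only, so the following JavaScript is just a sketch of the idea: one method per operation, with a small dispatcher that loads serialized state, invokes the method named by the operation, and saves the state back. The `dispatch` function and the `storage` map are hypothetical and stand in for the boilerplate and serialization that the talk says the framework handles.

```javascript
// One method per operation; `value` is the state we care about.
class Counter {
  constructor() {
    this.value = 0;
  }
  add(amount) {
    this.value += amount;
  }
  reset() {
    this.value = 0;
  }
  get() {
    return this.value;
  }
}

// Toy storage (assumption): entity ID -> JSON-serialized state.
const storage = new Map();

function dispatch(EntityClass, entityId, operation, input) {
  // Rehydrate: start from defaults, then overlay any saved state.
  const saved = JSON.parse(storage.get(entityId) || "{}");
  const entity = Object.assign(new EntityClass(), saved);
  // Route the message to the method with the matching name.
  const result = entity[operation](input);
  // Persist the updated state before handling the next message.
  storage.set(entityId, JSON.stringify(entity));
  return result;
}

dispatch(Counter, "c1", "add", 2);
dispatch(Counter, "c1", "add", 3);
const c1Value = dispatch(Counter, "c1", "get");
```

Because `dispatch` runs each message to completion before returning, the entity code never sees two operations interleaved, which is the property that removes the need for locking inside the class.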

Just to give you a diagram of what this looks like, even these entities, these actors, they're really represented just as a function, and that's what makes it easy for us to put this model on top of a stateless serverless compute platform, because you could still author everything as a function. Behind the scenes we'll take care of figuring out where the actual state lives. The invocation will contain some ID that says "What instance of my entity is this that I'm talking about? What is the operation that we're doing?" Then we can just feed that information to the function. Behind the scenes I have this entity class - in the previous example I showed you a counter - and you can have multiple instances of those just like you would in a normal actor programming model where there's different keys for the different instances, and they can all have their own state.

Demo

Why don't we jump into a quick demo? I thought that might help make this a little bit more real if I could show you exactly what's going on here. I'm going to jump into Visual Studio Code really quick. The first thing I want to show you is a real example of this function chaining going back into workflows a little bit. I am basically doing the same thing that I showed before where I just create this list. I'm calling a bunch of activity functions - in this case, SayHello to different cities, doing it in a sequence, and then adding the results and then returning them at the very end.

Because this is durable, every time you do an await, we're actually going to checkpoint our progress behind the scenes, in our case into Azure Table storage. Then the SayHello function is a function that could theoretically do anything: it can make HTTP calls, it could do whatever it wants. It's going to process whatever input it got, in this case the name, and it's going to just return a string that says hello to whatever that is. This is just a function, and in this case we're using Azure Functions, so I can do a func start to actually start executing this thing and it's going to run the Azure Functions host locally. This is the same host that we use in Azure or if you're running in Kubernetes, or wherever, we use the same host.

That thing is started up, so we can immediately start interacting with this workflow. I'm going to switch over to this. One of the things that I have is a trigger, I have another function which is just an HTTP trigger function which would go ahead and start a new instance of that orchestration every time we call it. I have a convention that I'm using here, where I say the name of the workflow that I want to run, I'm using HTTPie to do this. If you've ever used that tool, it's awesome for dealing with HTTP APIs.

I can send a POST message; it's going to run a little bit slow because I'm using the local Azure storage emulator right now. You can see it did a whole bunch of work and I got a response back, a 202 Accepted response. One of the cool things is, because we know that this is a workflow, we can actually give you some sort of management experience on top of that. One part of that is this Location header, which is something that you can visit to get more information about the current status of the workflow instance that you just started.

If I do an HTTP GET on that, I'm able to see the created time of this particular instance and the actual function name in this successful chain. Because it completed so quickly, I can see that it's in a completed state and has an output of hello, Tokyo, Seattle, and London. Because we're using event sourcing, it's even possible to do things like add showHistory=true. Behind the scenes, we actually have all of the history of all the different functions that you ran, and so we can list all of those things too, which gives you a nice management experience if you want to go back and see, "What step actually failed? How far did I actually get?"

As I mentioned, behind the scenes, what we're doing here is we're actually storing everything in Azure storage in this particular case, so I'll open up the tables here. We have two tables that are interesting. One of them is what we call an instances table. If you think in terms of CQRS patterns, this is the read projection of all your stateful functions. I have my hello chain that I just ran here, I have the output, I see it's in a completed status. That's the summary view.

Then if I go down, we have a history table here as well where you can actually go and see all the different rows, like in the animated diagram that I was showing everyone earlier. You can see all the steps that we took, what all the different function names were, what the outputs were, and so on: Hello London, Hello Seattle. That's all stored behind the scenes within this table.
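The relationship between the two tables can be sketched in Python as a fold over history rows. Everything here is an assumption made for illustration: the row shape, the event names, the instance ID `abc`, and the orchestration name `HelloChain` are hypothetical and do not reflect the actual table schema.

```python
# Hypothetical history rows, one per recorded event (append-only).
history = [
    {"instance": "abc", "event": "ExecutionStarted", "name": "HelloChain"},
    {"instance": "abc", "event": "TaskCompleted", "name": "SayHello", "result": "Hello Tokyo"},
    {"instance": "abc", "event": "TaskCompleted", "name": "SayHello", "result": "Hello Seattle"},
    {"instance": "abc", "event": "TaskCompleted", "name": "SayHello", "result": "Hello London"},
    {"instance": "abc", "event": "ExecutionCompleted",
     "result": ["Hello Tokyo", "Hello Seattle", "Hello London"]},
]


def project(rows):
    """Fold the event history into one summary row per instance (the read side)."""
    instances = {}
    for row in rows:
        summary = instances.setdefault(
            row["instance"], {"name": None, "status": "Pending", "output": None}
        )
        if row["event"] == "ExecutionStarted":
            summary["name"] = row["name"]
            summary["status"] = "Running"
        elif row["event"] == "ExecutionCompleted":
            summary["status"] = "Completed"
            summary["output"] = row["result"]
    return instances


instances = project(history)
print(instances["abc"]["status"], instances["abc"]["output"])
```

The history is the source of truth, and the summary can always be rebuilt from it, which is why the summary table is cheap to query for status checks.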

The nice thing is the programming model that's being exposed to you doesn't know anything about this. That's just an implementation detail behind the scenes, you're just writing code that uses async-await or yield in the case of JavaScript. We take care of all the state behind the scenes using event sourcing to make it durable and potentially even long-running.
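To make the replay idea concrete, here is a minimal Python sketch that uses a generator in place of async-await or yield. It is a toy model under loud assumptions: `run`, `ACTIVITIES`, and `hello_chain` are hypothetical names, the history is a plain list rather than a table, and each call to `run` pretends the process was restarted from scratch.

```python
def say_hello(name):
    """Activity function: in a real app this could do arbitrary work."""
    return f"Hello {name}"


ACTIVITIES = {"SayHello": say_hello}


def hello_chain():
    """Orchestrator: each yield is a point where progress gets checkpointed."""
    outputs = []
    for city in ("Tokyo", "Seattle", "London"):
        result = yield ("SayHello", city)
        outputs.append(result)
    return outputs


def run(orchestrator, history):
    """Replay recorded results, then execute at most one new activity.

    Returns (done, output). The generator is rebuilt on every call, so the
    only state carried between calls is the history list.
    """
    gen = orchestrator()
    try:
        request = next(gen)
        for recorded in history:
            request = gen.send(recorded)  # replay: no activity is re-executed
        name, arg = request
        history.append(ACTIVITIES[name](arg))  # checkpoint the new result
        return False, None
    except StopIteration as stop:
        return True, stop.value


history = []
episodes = 0
done, output = False, None
while not done:
    done, output = run(hello_chain, history)
    episodes += 1
print(output)
```

The orchestrator code reads as an ordinary sequential loop, and all the durability lives in the runner. This only works if the orchestrator is deterministic, so that replaying the same history always leads to the same next request.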

A similar demo that I'd like to show you is the counter. Here's the same counter example which I showed in my slides. I'm going to go back to my HTTPie tool and I'm going to do a POST. One thing I should do before that is do a GET. Then I have to come up with some name for a counter; maybe we could do QConSF2019. I do a GET on that and that doesn't exist. We've never created anything with that name before. What I can do instead is I can send a POST to that and I can say, the operation I want to call is that add operation, and I want to send some data to it. With this tool, I can just use the echo command to do that and then pipe that to HTTPie. I'm saying I want to add 10 to this durable entity called QConSF2019. We get back a 202 Accepted, and then if I go back and try to query that again, what I should see now is that instead of getting a 404 back, I get a 200 back and it actually shows me, "I have a value of 10." Similarly, I can do the same thing a second time. We'll add 10 more to it. Then if I do a GET on that again, I should be able to see that, in fact, now the value is 20.

You get this actor-like programming model which can run in a serverless environment, and behind the scenes we're storing everything in table storage. If I go back to this instances table, I can see now I also have a new row for that entity that I created, which has the name counter and QConSF2019. As I create more and more of these entity instances, we'll just see more and more rows behind the scenes.

If you're familiar with actors, there are certain similarities with actors that we have with durable entities. One is that these are addressable by some entity ID. The operations execute serially, one at a time, so the same benefits that most people use actor systems for, we continue to honor those. They are created implicitly as I showed in my demo. When they're not executing operations, they can be silently unloaded from memory, which gives you a nice high-density.

There are several differences from other virtual actor frameworks too which are important. They're some of the reasons why we didn't call these actors - part of which is just political; people get really fired up when you call something actors and it doesn't quite fit their model of it. For one, it's totally serverless, which most actor frameworks are not. The other is that we prioritize durability over latency. A lot of people use actors for super-low-latency, high-performance types of things. We think actually that durability is a slightly more important property, especially when you want to introduce something like this to a broader audience to remove a set of problems that people normally run into with actors.

It's reliable - we do in-order messaging, with no message timeouts and no deadlocks. This is implemented in such a way that deadlocks are not possible. If you're using something like Orleans or Service Fabric actors, those are some of the problems that people run into, and they're pain points that we're trying very much to avoid, to make people as productive as possible without all the frustrations.

We also support integration with orchestrations, which is really interesting. This is not just actors in a silo, it's integrated with some of the other parts of Azure Functions and even Durable Functions where you can do some really interesting things. For example, let's say I have a system that matches players for an online game. We have the ability within a workflow to say, "I want to lock these entities, this player one and this player two," and while they're locked, those won't process any other operations. What I can do then in my workflow is check the status of those entities: are they available? If they are, I can assign them to a new game instance. You can do some really interesting distributed locking types of techniques when you combine these actors with first-class workflows. These are some areas where we think you can get some very interesting innovation working completely on serverless platforms.
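The matchmaking example can be approximated with a small Python sketch. This is an in-process analogy only: `threading.Lock` objects stand in for durable entity locks, and `lock_entities`, `match`, and the player records are hypothetical names. Acquiring the locks in sorted order is one common way to rule out deadlock between two workflows that want the same pair of entities; the sketch assumes that approach rather than documenting how the real runtime does it.

```python
import threading
from contextlib import contextmanager

# Hypothetical player entities and one lock per entity.
players = {
    "player1": {"available": True, "game": None},
    "player2": {"available": True, "game": None},
    "player3": {"available": True, "game": None},
}
locks = {player_id: threading.Lock() for player_id in players}


@contextmanager
def lock_entities(*entity_ids):
    """Hold all requested entity locks, always acquired in the same global order."""
    ordered = sorted(set(entity_ids))
    for entity_id in ordered:
        locks[entity_id].acquire()
    try:
        yield
    finally:
        for entity_id in reversed(ordered):
            locks[entity_id].release()


def match(p1, p2, game_id):
    """Workflow step: check both players and assign them atomically."""
    with lock_entities(p1, p2):
        if players[p1]["available"] and players[p2]["available"]:
            for player_id in (p1, p2):
                players[player_id].update(available=False, game=game_id)
            return True
        return False


first = match("player1", "player2", "game-1")
second = match("player2", "player3", "game-2")
print(first, second)
```

The second match is refused because player2 is already in a game, and player3 is left untouched, so the check and the assignment behave as one step across both entities.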

Other Applications

There's a bunch of other applications which might be interesting. We're going to share out the slides, so you can check out some of these. Patterns like distributed circuit breakers are one; Polly is a company that builds a lot of distributed system software, and they announced that they want to use this for a distributed circuit breaker. There's also IoT, API caching, some sort of ride-sharing, a lot of interesting examples which might be worth checking out.

Pay attention to this space. Serverless doesn't have to be stateless. I think a lot of companies are starting to realize this, that there are ways that you can actually incorporate state into your functions or into your workflows. There are a lot of companies that are currently doing some work related to this. I encourage you all to check your assumptions about serverless and take a look into what some of these companies are doing.

If you want more information, we have some documentation on Durable Functions specifically where you can learn more about what we're doing behind the scenes. Everything is open-source, so the full source code is available if you're curious how we implemented the event sourcing patterns and things like that. You can find me on Twitter, so feel free to reach out as well.

Recorded at:

Dec 03, 2019

Chris Gillum