Today on the InfoQ Podcast, Yan Cui (a long time AWS Lambda user and consultant) and Wes Reisz discuss serverless architectures. The conversation starts by focusing on architectural patterns around choreography and orchestration. From there, the two move into updates on the current state of serverless cold start times, distributed tracing, and state. Today’s podcast, while not specific to AWS, does lean heavily on Yan’s expertise with AWS and AWS Lambda.
Key Takeaways
- When we talk about choreography in the context of serverless, we’re talking about what we traditionally think about with an event-driven system. For orchestration, we’re more reliant on state. Both have use cases in developing serverless architectures.
- When designing business workflows, a good rule of thumb is to think about the service’s bounded context. Within the bounded context of a microservice, prefer orchestration. Between the bounded contexts, the event-driven nature of choreography is often a good choice.
- Distributed tracing remains one of the difficulties when developing event-driven systems with AWS Lambda. Tools that can help in the space include X-Ray, Lumigo, Epsagon, and Thundra.
- Cold start time (the time it takes to spin up the function to handle requests) in languages like Java and .NET Core is one of the areas to consider when developing Lambda applications. For context, cold start times with Node or Python are typically 300-800 ms, where a Java one today is typically 500-1000 ms.
- Files of over 500 MB have traditionally been a challenge with Lambda functions. Larger file sizes are now supported by mounting an EFS volume. However, there are still some challenges around latency when loading these larger files.
Transcript
00:00 Introductions
00:00 Wesley Reisz: Before jumping into the podcast, I wanted to point out a place near the end of the show where our guest, Yan, talks a bit about a panel that we'll be doing together in the following week. This podcast was recorded before the event, but edited and published after it. However, you can still hear that panel by heading over to live.infoq.com to find out more information on it, and also on upcoming live.infoq.com events. These events are full-day events around a topic in software, and they feature short talks from people like Adrian Cockcroft, but they also feature a peer-sharing component that connects attendees in a more personal way. So head over to live.infoq.com to find out more about the panel that Yan talked about and about future events that are coming up. Thanks for listening today.
00:50 Wesley Reisz: Hello and welcome to the InfoQ podcast. My name is Wes Reisz, I'm the co-host of the podcast and chair of QCon San Francisco. Today on the podcast, I'm speaking with Yan Cui. Yan will be talking about Serverless. Specifically, we'll be talking about things like the difference between orchestration and choreography when it comes to Serverless. We'll talk a bit about distributed tracing in the context of Serverless applications. And then we'll talk about this new approach that Amazon offers with mounting larger files from EFS to be able to deal with kind of large file sets.
01:22 Wesley Reisz: Yan Cui is an AWS Serverless hero. He has worked with AWS technology since 2009. Today, he's an independent consultant. He has been in roles that are around architecture, lead developer, and he's an advocate in the industries ranging from banking to mobile games. He blogs at theburningmonk.com and you can find his talks on Serverless all over the web.
01:43 Wesley Reisz: Yan, welcome to the show.
01:44 Yan Cui: Hey, thanks for having me.
01:45 Can you tell me the difference between Choreography and Orchestration?
01:45 Wesley Reisz: So this podcast goes back to a blog post of yours that I read, that's up right now at least, that's around choreography versus orchestration. So I wanted to just dive in and talk about this as it relates to Serverless. So I guess let's just start at the very beginning. When we talk about Serverless and we talk about choreography, what does that mean?
02:08 Yan Cui: Yeah. So I guess when I'm building applications using Serverless technologies, personally, I'm thinking in terms of building microservices, in terms of thinking about bounded contexts, in terms of thinking about how different microservices communicate with each other, and choreography and orchestration are the two ways that microservices typically communicate with each other. So when you're talking about choreography, you are probably talking about event-driven architectures whereby services don't talk to each other directly.
02:42 Yan Cui: Rather than the traditional request-response model, I'm receiving an event. I do my thing, and then I publish an event when I make some state change. And then that's it. I'm just doing my part. I know what my events are. I know what I should do when those events happen. And I know what events I should emit when I do something interesting, and everybody just does that. And then you get the system behaving a certain way based on how all the different components react to events and what they do and what events they emit. So this is a really nice way for you to build really decoupled, loosely coupled systems, using different components that each just look after their own bit. And then you have orchestration, whereby you have essentially some kind of a controller component that says, okay, when something happens, when a request comes in, you do something first and then you do something else.
03:33 Yan Cui: And then, depending on the event you have, if the state of the system is in some state, you should do something. Otherwise, maybe we skip you and we get someone else to do something else. So you have that centralized controller component. And that is what we call an orchestration model. So in the serverless world, when you're talking about Lambda and lots of other services, they are inherently event-driven. So I think this choreography model has been very, very popular when it comes to building serverless applications. You see a lot of people building very complicated systems using Lambda and the queues that you have in AWS, like SNS, SQS, EventBridge, Kinesis, and so on and so forth. But I think there are still a lot of places where you should be thinking about using orchestration instead, using some service like Step Functions, where you can model the entire workflow, and that can be source controlled, and use that to trigger different bits of compute or Lambda functions you have.
04:32 Wesley Reisz: Let's break it down a little bit more. So with choreography, you might have something like a Kinesis stream and you have events flowing through it, and then you would basically register your Serverless function to react to different events that come through. Is that right?
04:46 Yan Cui: That's right. And then, say, you've written an order into the database. You will also publish an event to say the order has been confirmed, the order has been placed. So there may be some other system that can listen to that particular event, and maybe you want to implement some kind of a promo code, for example. You don't want that to be tied to the function that's saving the order, but you can have them loosely coupled, and only depend on the shared contracts you have through this event.
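As an illustration of the choreography example above, here is a minimal Python sketch. It assumes an in-memory `EventBus` class standing in for a service such as SNS or EventBridge; `EventBus`, `place_order`, `send_promo_code`, and the `order_placed` event name are hypothetical names chosen for this sketch, not anything from the conversation or from AWS.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class EventBus:
    """Hypothetical in-memory stand-in for an event bus such as SNS or EventBridge."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        # Deliver the event to every handler registered for this event type.
        for handler in list(self._subscribers[event_type]):
            handler(payload)


bus = EventBus()
orders_db: Dict[str, dict] = {}
promo_codes_sent: List[str] = []


def place_order(order_id: str, customer: str) -> None:
    # The order function only saves the order and announces the state change.
    orders_db[order_id] = {"customer": customer, "status": "placed"}
    bus.publish("order_placed", {"order_id": order_id, "customer": customer})


def send_promo_code(event: dict) -> None:
    # The promo code function depends only on the shared event contract.
    promo_codes_sent.append(event["customer"])


bus.subscribe("order_placed", send_promo_code)
place_order("o-1", "alice")
print(promo_codes_sent)
```

The point of the design is that `place_order` never references `send_promo_code`; a new subscriber can be added or removed without touching the function that saves the order.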
05:12 What are some of the smells that lead you to an orchestration approach?
05:12 Wesley Reisz: What are some of the, I guess, smells with that approach that kind of lead you to the need for more of an orchestration approach?
05:17 Yan Cui: Yeah. So the great thing about choreography is that everything is loosely coupled, so components can be scaled independently. They can be modified, deployed and updated independently. You also don't have a single point of failure from the point of view of, say, an organization. If one person messes up the promo code system, it's not going to affect order processing and so on and so forth. The downside of this approach is that now you don't really have a way to easily monitor and report on the end-to-end progress or status of, say, an order. Say, for example, that for order processing you have multiple steps. Every single step is implemented as a separate Lambda function that just listens to some events, does its thing, and then emits some other event that triggers the next bit. So when you have, say, the order workflow implemented using this choreography approach, it gets hard to do even something simple like, "Oh, we should have a timeout on the order so that if it's not completely processed within three hours, we should just give up."
06:20 Yan Cui: I mean, the customer places an order for dinner. If we're not delivering the food within three hours, now what's the point, right? So that kind of thing becomes quite difficult to implement, even though it's quite a common and quite a simple requirement, really. And also, I guess when it comes to bigger companies and you have more complex workflows, it also becomes quite difficult to pin down who is the person or team that's got the entire knowledge about how the system works end to end. You've probably been in situations whereby, if something's gone wrong, you go to a war room with all these different teams, trying to find out, "Okay guys, what went wrong, what happened?" Nobody knows because everyone just knows their tiny bit, but no one has a complete picture of how everything works. And of course, how the system behaves is not captured in code.
07:06 Yan Cui: Not really. How the system behaves end-to-end is just an emergent property of how the components of your event-driven system are each doing their own bit. How the workflow actually works is not source controlled. So I think that's where you have business logic
07:22 Yan Cui: that's spread across many different places and probably across many different teams and different brains. So you open yourself up to all kinds of problems with that. And then some people maybe leave the team and you lose that knowledge about how the system actually works. I've seen lots of companies rebuild entire systems because the people who knew the system left the company and the documentation is out of date. It doesn't reflect how the world actually is right now.
07:46 Yan Cui: Whereas with the orchestration approach, where you're using some kind of a workflow engine like Step Functions, your actual workflow itself is source controlled. It's not just documentation; it is the actual executable. So I think for something like processing orders or payments or things like that, the orchestration approach makes a lot more sense, especially when you're talking about something that's business critical, and you get all of this end-to-end reporting, as well as monitoring, out of the box with Step Functions as well.
08:15 Wesley Reisz: With orchestration, it's kind of like a state engine, right? You can model the behavior of the system and then maintain that state with your orchestration engine. Is that accurate?
08:25 Yan Cui: Yes. Yes. So with Step Functions, I don't know if you've seen it before, you've got this modeling language that's based on JSON or YAML, but then you also have the execution engine, whereby at runtime it will trigger Lambda functions and keep the state of your application. And there is a visualization tool as well, where you can see, for a particular execution, where did it end up? Where is it currently? And you can see the state at each step of the workflow as well. And then there's also audit history and all the different things.
08:55 Yan Cui: And you can also implement things like a timeout for a particular step or for the entire workflow itself. Those things become really trivial for you to implement. And one of the great things I love about using workflow engines like this, especially when dealing with something that's more, I guess, complicated, is that I can show that workflow visualization to a business analyst or someone who's nontechnical and they can figure out, you know, roughly what this thing does. And even your support team, who otherwise would have to trawl through logs and ask people, "Okay, what does it mean when I see this log message?", with something like Step Functions can just go into the console, look at the visualization for the customer's request,
09:34 Yan Cui: and figure out, "Okay. Right. So your payment failed because when we tried to verify your card number with MasterCard, it came back with so and so error."
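The orchestration idea described above (a central controller that runs steps in order, keeps state, records an audit history, and enforces a workflow-wide timeout) can be sketched in a few lines of Python. This is a toy under stated assumptions, not how Step Functions is implemented: `run_workflow`, `save_order`, `verify_card`, and the status strings are hypothetical names for illustration, and the timeout is only checked between steps.

```python
import time
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[dict], dict]]


def run_workflow(steps: List[Step], state: dict, timeout_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> Dict[str, object]:
    """Run steps in order, recording history and enforcing a workflow-wide timeout."""
    started = clock()
    history: List[dict] = []
    for name, step in steps:
        # Check the overall deadline before starting each step.
        if clock() - started > timeout_seconds:
            history.append({"step": name, "status": "TIMED_OUT"})
            return {"status": "TIMED_OUT", "state": state, "history": history}
        try:
            state = step(state)
            history.append({"step": name, "status": "SUCCEEDED"})
        except Exception as error:
            # Record which step failed and why, then stop the workflow.
            history.append({"step": name, "status": "FAILED", "error": str(error)})
            return {"status": "FAILED", "state": state, "history": history}
    return {"status": "SUCCEEDED", "state": state, "history": history}


def save_order(state: dict) -> dict:
    return {**state, "order_status": "saved"}


def verify_card(state: dict) -> dict:
    # Hypothetical check standing in for a call to a card network.
    if not state["card_number"].isdigit():
        raise ValueError("card number rejected")
    return {**state, "card_verified": True}


result = run_workflow(
    [("SaveOrder", save_order), ("VerifyCard", verify_card)],
    {"order_id": "o-1", "card_number": "not-a-number"},
    timeout_seconds=3 * 60 * 60,  # the three-hour order timeout from the example
)
print(result["status"])
for entry in result["history"]:
    print(entry)
```

Because the history is produced by the controller rather than scattered across logs, a support person can read off which step failed and with what error, which is the experience described for the Step Functions console.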
09:43 Are step functions a silver bullet?
09:43 Wesley Reisz: So are Step Functions a silver bullet? Should we just use Step Functions for everything?
09:48 Yan Cui: No. So it's great for modeling and implementing business workflows, especially business-critical ones, but there are still a lot of places where you should be using choreography instead. For example, take the order processing example I've been using. If someone wants to build the promo code system that I mentioned earlier, what we don't want is for everything that has to do with an order to have to be implemented inside this same workflow. We should still be using events, so that inside our workflow, if we, say, save the order in a database, or we change the status of the order, or we maybe communicate with MasterCard to confirm the payment card details, when some state change happens, we can still emit events from inside the workflow.
10:33 Yan Cui: And then we can build on top of those events other microservices that look after different aspects of the business, maybe different business workflows. So you should still be using choreography between the bounded contexts of microservices, or whatever you'd like to call them. But within the microservice, where you have, I guess, more control around one particular aspect of the business, at least when it comes to workflows, you should be using something like Step Functions. You should have some kind of workflow engine that lets you have that orchestration.
11:04 What about with other cloud providers. What are the comparable technologies?
11:04 Wesley Reisz: Keeping the logic within the bounded context of a microservice kind of makes sense. So a lot of the vocabulary that you're using, Step Functions, is very AWS-centric. What about other cloud providers like Azure and Google Cloud Functions? Are the constructs the same there?
11:21 Yan Cui: I think most of them have something similar. So I know you have Logic Apps, which is, I guess, the like-for-like for Step Functions in Azure. I'm not sure what the equivalent is in GCP. I think there's a workflow thing. I forgot the name of that service, but I do remember seeing it before.
11:37 Wesley Reisz: What about third-party or non-cloud-provider tools?
11:40 Yan Cui: There are lots of other open source tools as well as other third-party services. Camunda is quite a nice one as well. I think you can use it as a service or you can run it yourself, and I think they launched a cloud offering a couple of months ago, if I remember correctly. So there are quite a few other alternatives you can use. So if you're running your own infrastructure on top of containers or whatnot, then you maybe want to use some of these open source tools where you can have a workflow engine running on top of Kubernetes or on your own containerized environments.
12:12 What’s the big takeaway when it comes to orchestration versus choreography?
12:12 Wesley Reisz: What I'm hearing you say then, kind of the big takeaway, is that you should prefer orchestration when you're within the bounded context of a microservice, but choreography between them to allow the units to be able to interact, I guess, loosely.
12:27 Yan Cui: Yes, I guess that's just a general rule. That's kind of what I tend to do. At least when it comes to business workflows, there are lots of other small places where I will still use events, but not use them to implement some kind of a workflow. So for example, within my microservice, I may still use things like S3 and use that to trigger Lambda functions, or put stuff into Kinesis so that I don't process something right away, or use things like EventBridge to trigger one-off things, but not complex or business-critical workflows. So for those, I would definitely try to prefer the orchestration approach instead.
13:01 What is the latest with distributed tracing and Serverless?
13:01 Wesley Reisz: So I want to switch gears just a little bit and talk about another kind of challenge, I guess, when it comes to serverless, and that's distributed tracing. It's been a while since I've dove into serverless in this space. So what is today's state of the art? What's the AWS story? What does it look like for distributed tracing when it comes to serverless architectures?
13:20 Yan Cui: It's gotten a lot better. It's interesting that you mention tracing, because that is one of the difficulties when it comes to a really event-driven architecture on AWS with Lambda. A lot of tools have to support a lot of things that you just never had to do before. Before, you had HTTP requests everywhere, and it's kind of easy to trace those. We know how to do that. But now we're talking about event-driven architectures with Lambda, and you have to trace invocations through lots of different things like SNS, SQS, Kinesis, DynamoDB streams, you call [inaudible 00:13:55]. There are lots of different ways you can trigger one function and then another. So in terms of the tools out of the box, you have X-Ray, which supports those synchronous HTTP invocations really well. So you've got one Lambda function calling another, say through API Gateway or direct invocation.
14:11 Yan Cui: You can see those, no problem at all. There still is some work you've got to do yourself in terms of requiring the X-Ray SDK and then using that to wrap your AWS SDK or your HTTP modules, but in general, it works quite well for these kinds of simpler architectures. But you find that a lot of people are now building really complicated, interesting architectures with Lambda because there are just so many different things you could do. And when there are so many things you could do, people do lots and lots of different things. So in this space, you've got a lot of, I guess, newcomers, I guess, more serverless-focused vendors.
14:45 Yan Cui: People like Lumigo (full disclosure, I spend two days a week working with them as a developer advocate). I have a really strong belief in their platform and what they're doing. And I think what they offer is probably the best user experience for a third-party tracing system for serverless applications. But then you also have others like Epsagon and Thundra as well, whereby these third-party solutions can do a much better job of tracking invocations through asynchronous sources, such as SNS, SQS, Kinesis and so on.
15:17 Wesley Reisz: When we were talking kind of prepping before this, you were talking about one of the most difficult things with serverless and distributed tracing is understanding the state of a system at any given point. Can you elaborate a bit more on that and kind of discuss that?
15:30 Yan Cui: One of the most difficult things I find about dealing with these really expansive and really interesting and sophisticated architectures is trying to understand the state of a system at every moment in time. You're talking about, okay, well, one function calls another function, which publishes an event to EventBridge, which triggers another function somewhere else. Well, what was the state of that Lambda function? It makes a call to DynamoDB and that returns an error. And then you trace it back, because, okay, the second function or third function called DynamoDB with the wrong parameter. And that parameter came from the event you got, which came from the first function that published some wrong data to EventBridge, which triggered a second function, which passed the buck on to the next guy, right? So trying to chase those kinds of problems is quite interesting and quite difficult sometimes without the right tools.
16:17 Yan Cui: And that's where something like Lumigo, I think, is really useful, because they show you both your architecture in terms of, okay, here's a Lambda function that published something into an SNS topic that triggers another Lambda function and so on, and also all the logs for those three Lambda functions that are part of the transaction, side by side. And when you click on one of those Lambda functions, you can also see the input it got as the invocation event, and where that function talks to other services like DynamoDB, you can also tap into what requests were sent to DynamoDB. So using that, I'm actually able to work out how that piece of data, how that request I'm sending to DynamoDB, came to be in the third function in my transaction.
16:57 Wesley Reisz: Is the premise still the same where, you know, you come in, you get a correlation ID and that correlation ID is passed all the way through each of the functions so that you can recreate the trace for that particular request? Is it still the same premise, or does it operate kind of a little bit differently?
17:11 Yan Cui: It's still the same premise. In the implementation under the hood, they're using a correlation ID still, but in addition to just correlation IDs, they also record things about your invocation, things like the invocation event, and they also auto-trace every single request you make to other AWS services or to HTTP endpoints so that you can see what are the things your function has spoken to. So you get that auto-instrumentation out of the box without you having to manually write logs, put the correlation IDs in and say, okay, I'm talking to so and so HTTP API, or I'm writing something to DynamoDB, and then afterwards realize, "Oh, I'm missing something. I need to go and write another log message."
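The manual version of the correlation ID premise discussed above can be sketched as follows. This is a simplified illustration, not how Lumigo or X-Ray work internally: the `x-correlation-id` attribute name, the `metadata`/`body` event shape, and all function names are assumptions made for this sketch.

```python
import json
import uuid
from typing import List

CORRELATION_KEY = "x-correlation-id"  # hypothetical attribute name


def get_correlation_id(event: dict) -> str:
    # Reuse the caller's ID when present; otherwise this invocation starts the trace.
    return event.get("metadata", {}).get(CORRELATION_KEY) or str(uuid.uuid4())


def log(correlation_id: str, message: str, sink: List[str]) -> None:
    # Every log line carries the correlation ID so a trace can be rebuilt later.
    sink.append(json.dumps({"correlation_id": correlation_id, "message": message}))


def make_outgoing_event(correlation_id: str, body: dict) -> dict:
    # Attach the ID to the outgoing message so the next function can pick it up.
    return {"metadata": {CORRELATION_KEY: correlation_id}, "body": body}


def first_function(event: dict, logs: List[str]) -> dict:
    correlation_id = get_correlation_id(event)
    log(correlation_id, "order received", logs)
    return make_outgoing_event(correlation_id, {"order_id": event["body"]["order_id"]})


def second_function(event: dict, logs: List[str]) -> str:
    correlation_id = get_correlation_id(event)
    log(correlation_id, "writing order to table", logs)
    return correlation_id


logs: List[str] = []
published = first_function({"body": {"order_id": "o-1"}}, logs)
trace_id = second_function(published, logs)
trace = [json.loads(line)["message"] for line in logs
         if json.loads(line)["correlation_id"] == trace_id]
print(trace)
```

Filtering the log lines by one ID recovers the whole transaction across both functions, which is the part that auto-instrumenting tools do without the hand-written `log` calls.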
17:51 Wesley Reisz: How does it work in a longer transaction that has multiple serverless functions participating? How do you deal with just knowing the time it takes the request to respond (kind of a holistic view of the request)?
18:05 Yan Cui: So that's where a lot of the tracing systems help, but if you just care about that one number, the end-to-end number, then you can look at the duration for the edge-most function. So that will tell you the end-to-end time, provided that you're talking about synchronous calls, one function calling another API, which is another function calling another one and so on. But for most end-to-end event flows, that's where you do need something like X-Ray or Lumigo or something else that can track those asynchronous transactions. With X-Ray, the async support is not great. It only supports SNS. But then you don't see the next Lambda function invocation that picks up the message from SQS in the same transaction. So that's where some of the third-party commercial tools can do a much better job in supporting those different event sources for Lambda and being able to connect everything together.
18:59 How are cold starts on AWS Lambda now?
18:59 Wesley Reisz: So my background is Java, and it's been a while since I've done any serverless functions, but there used to be some blind spot with cold start times and things like that. What's that state today? What does it look like particularly when it comes to like tracing and cold start, but what does that kind of look like today?
19:14 Yan Cui: Yeah, it's also gotten a lot better. Certainly for Java and for .NET Core, it's still a little bit rougher compared to some of the other language runtimes. So for Node.js and for Python and Go, I guess in the real world, a cold start time somewhere around, I guess, 300 to 800 milliseconds is quite normal without much optimization. With Node.js functions, with a little bit of optimization, we can bring that down to probably 300 milliseconds or so, but for Java, you're probably still looking at around, you know, 500 milliseconds to a second, probably closer to one second, even with some optimization work you're doing there. But in general, cold starts are getting better. AWS has done a lot of work behind the scenes to improve different aspects of cold start. And also, for Java functions, now your kind of get-out-of-jail card is to use what is called Provisioned Concurrency, which essentially is a way for you to say, well, I know my traffic is going to need 10 containers running most of the time.
20:14 Yan Cui: So rather than having to suffer cold starts because the system always scales to zero, I'm just going to pay you some extra to have 10 containers running all the time, but when I need more, let the normal on-demand scaling behavior kick in. So Provisioned Concurrency is kind of your get-out-of-jail-free card when it comes to Lambda and cold starts, provided that you can't optimize your code anymore, or you have a lot of synchronous API calls between your Lambda functions. So even if you do minimize the cold start for one function to, say, you know, 500 milliseconds, and it's okay for your 99th percentile, when you've got one function calling another calling another, they start to stack up, and that's still not good. So that's where you want to use something like Provisioned Concurrency.
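The "stacking up" point can be made concrete with some back-of-the-envelope arithmetic. The sketch below assumes a synchronous chain where each function independently has a small chance of a cold start; the 500 ms figure comes from the conversation, while the 1% per-call probability and the independence assumption are illustrative simplifications, not measurements.

```python
from typing import Sequence


def worst_case_added_latency_ms(cold_start_ms: Sequence[float]) -> float:
    # If every function in a synchronous chain is cold, the penalties add up.
    return sum(cold_start_ms)


def chance_of_any_cold_start(per_function_probability: float, chain_length: int) -> float:
    # Probability that at least one call in the chain hits a cold start,
    # assuming (for illustration only) that cold starts are independent.
    return 1 - (1 - per_function_probability) ** chain_length


chain = [500.0, 500.0, 500.0]  # three functions, 500 ms cold start each
print(worst_case_added_latency_ms(chain))
print(round(chance_of_any_cold_start(0.01, len(chain)), 4))
```

Under these assumptions a three-deep chain can add 1.5 seconds in the worst case, and roughly 3% of requests see at least one cold start rather than 1%, which is why a per-function number that looks fine in isolation may not be fine for the chain.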
21:01 How does AWS handle larger on disk file sizes?
21:01 Wesley Reisz: So a few years back, I remember I was working on an application that was serverless and I kept running into file size limitations. AWS now has the ability where a Lambda function can mount a drive, mount another volume to it. Can you talk a little bit about what that looks like?
21:15 Yan Cui: Yeah. So now it's done through EFS, the Elastic File System, and essentially it's a file drive that you can mount to your Lambda function, which does require your function to have VPC access to where you have configured the file system in EFS. But it does help a lot of people that used to have problems with loading large files. So if you're doing machine learning and you're using TensorFlow, the executable and the model, as you know, are often 2GB, maybe more, really, really big, and you can't load them into Lambda. Even though you can have 3 gig of memory, you can only have 500MB of disk space. So EFS kind of works around that problem, whereby for those large files, you can put them onto EFS, and when your function starts up, you can then load them from EFS.
22:01 Wesley Reisz: What about issues with latency?
22:03 Yan Cui: The latency on EFS, from the testing I've done, is a bit better than loading from S3. At least it's more predictable, more stable, with less variance, but it's still quite high. So say you've got a really large file, like a 2GB TensorFlow model, and you're loading it from EFS into a Lambda function. It will probably work, but it's probably going to take what, tens of seconds for that to complete, maybe 20 seconds or 40 seconds. I don't know. Which means that you do have to have a fairly long Lambda function duration to be able to work with them. So you can't do anything that's serving real-time recommendations from Lambda, but if you're just doing some training or testing your model, that's probably okay.
22:45 Wesley Reisz: Or just inferencing with a model, maybe just running the actual decisions on it. Okay.
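One common way to soften the load-time cost described above is to read the large file once per container and keep it in module-level state, so that only the first (cold) invocation pays for the read. The sketch below illustrates that pattern under stated assumptions: a temporary directory stands in for the EFS mount, a 1 KB file stands in for the model, and `get_model`, `handler`, and the `model_path` event field are hypothetical names.

```python
import os
import tempfile
from typing import Dict

# Module-level cache: in Lambda, module state survives across warm invocations
# of the same container, so the file is read once per container, not per request.
_cache: Dict[str, bytes] = {}
load_calls = 0


def get_model(path: str) -> bytes:
    global load_calls
    if path not in _cache:
        load_calls += 1  # counts how often the slow read actually happens
        with open(path, "rb") as model_file:
            _cache[path] = model_file.read()
    return _cache[path]


def handler(event: dict, context: object = None) -> dict:
    model = get_model(event["model_path"])
    return {"model_size_bytes": len(model)}


# A temporary directory stands in for the mounted file system in this sketch.
with tempfile.TemporaryDirectory() as fake_mount:
    model_path = os.path.join(fake_mount, "model.bin")
    with open(model_path, "wb") as model_file:
        model_file.write(b"\x00" * 1024)
    first = handler({"model_path": model_path})
    second = handler({"model_path": model_path})

print(first, second, load_calls)
```

This does not remove the tens of seconds on the first invocation; it only avoids repeating that cost on every request handled by an already warm container.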
22:49 Yan Cui: Yes. Yes.
22:51 What’s Next for you?
22:51 Wesley Reisz: So what's next for you? What are you working on?
22:54 Yan Cui: So I've been independent for the last 12 months now, 13 months maybe, since I left full-time employment. And now I'm working for myself as an independent consultant, working with different clients. And I have to say, this has been a really refreshing change. I worked full time for 15 years, and I have to say this newfound freedom, and also just being able to work less, has been really great. I'm really loving it. So at the moment, I'm really enjoying working with lots of different companies, helping them build their serverless applications better and faster, and yeah, learning quite a lot as well. It's amazing how many different ways or different constraints people have to deal with, and the different systems that people are building with Serverless. So that's been really fun the last year and a bit.
23:36 Wesley Reisz: Well Yan as always thank you for joining us. And I look forward to catching up with you.
23:41 Yan Cui: Yeah. Hopefully I guess I'll see you at the InfoQ Live?
23:44 Wesley Reisz: That's right, InfoQ Live.
23:45 Wesley Reisz: It will be on August 25th. So if you like what you heard and you want to hear more Yan will be part of a panel with me hosting. We'll be talking about microservices and answering the question, "Are they still worth it?" And I want to go on that limb and say yes. So Yan, I look forward to seeing you then, and thanks for joining us on the podcast.
24:05 Yan Cui: See you next week, man. Bye. Bye.