Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Write for InfoQ


Choose your language

InfoQ Homepage Presentations Beyond Microservices: Streams, State and Scalability

Beyond Microservices: Streams, State and Scalability



Gwen Shapira talks about how microservices evolved in the last few years, based on experience gained while working with companies using Apache Kafka to update their application architecture. She discusses the rise of API gateways, service mesh, state management and serverless architectures, and shows examples of how applications become more resilient and scalable when new patterns are introduced.


Gwen Shapira is a principal data architect at Confluent, with 15 years of experience working with code and customers to build scalable data architectures, integrating microservices, relational and big data technologies. Shapira is an author of “Kafka - the Definitive Guide”, "Hadoop Application Architectures".

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Shapira: It's late in the day. After listening to all the other talks, my brain is in overflow mode. Luckily, this is not a very deep talk; it's going to be an overview. Starting five years back, we started doing microservices. We tried some things, we had some issues, we tried other things. We'll go through the problems and different solutions at how microservices architectures evolved and where this is all taking us, but I'm not going to go super deep into any of those topics. Some of those topics went to in depth in this track already. If you didn't listen to the other talks, you can catch up on video. Some of them, it will just be a pointer. If you want to know more about, say, service measures, there will be a list of references at the end of my slide deck. Once you download it from the QCon site, you can go and explore more. These are highlights, so don't write in your speaker notes at the end, "Gwen [Shapira] did not go into depth into anything." I'm telling you upfront, I'm not going into depth into anything.

In the Beginning

Let's go all the way back to the beginning. I think my recollection, the hype cycle for microservices really started maybe five, six years ago. I know some people have been doing it for longer. I've seen some references going back to 2011, but let's go all the way back when we first decided to use microservices. We had those big monolithic applications and every time we had to make a change, it was forces negotiation between five different teams and so reorganization, which back then may not have existed under the name as a reorganization as they think of it, and application was the hostage. You had to have everyone agree in order to make a step forward.

This was slow, we were frustrated and we said, "What if we make these things up into those isolated contexts where a team can own an entire context of execution, start to finish, write it, test it, take it to the production, deploy it in their own timelines, make changes in their own timelines? Surely, we can move a lot faster that way." To a large extent, it worked quite well, and it has definitely become the architecture of choice in most fast-moving companies today, but there was a catch. We took what used to be one application and we broke it down into parts. It shouldn't be that surprising that they still have to work together, they still have to communicate with each other.

What we started out doing is say, "Every language in the world, everything that I do has HTTP. I know how to do REST. I know how to send JSON. Everything has JSON support and if it doesn't, I can write it down in a weekend. I mean, three months. I mean, maybe two years in the library. It will surely be done." It sounds fairly straightforward and easy and that's what pretty much universally we all did. As we discovered, this type of pattern, in a lot of contexts, basically caused us to fall back into tight coupling. This ended up getting the fun nickname of a distributed monolith, where you have all of the problems of microservices, yet somehow none of the benefits, so debugging is pure hell, but you still have to negotiate every change.

I'm going to walk us through some of the ways that this happened to us, because then I'll talk about the patterns that evolved to solve the problem. The first thing that happened, if you weren't careful, is that clients started knowing way too much about the internal organizations that you had. Suddenly, it wasn't just talking to one application and sending everything to this one application plus or minus a few DNS names. They had to know that in order to do returns, I'm talking over here, in order to do inventory, I'm talking over there. This is a serious problem because clients are not updated on your schedule. They're updated on someone else's schedule, which means that if you leak your internal architecture to a client, you're basically stuck with it for five-plus years. You definitely don't want to end up there. That was one problem.

The other problem is that during those lines of communication, you had weird shifts of responsibilities. Imagine that one of the services is an Order Service - I'm going to talk about Order Services a lot more - and one of them is a validation. The Order Service has to know that once an order is created, it has to call validation and validation has to know what to call next. Suddenly, a lot of the business logic is really encapsulated into those microservices. One may wonder whether your Order Service has to know the validation rules and who to talk to about validation. One may wonder if you create a new insurance policy has to know about the call to check where the person lives in order to figure out the correct insurance rate service. Basically, your business logic is now leaked all over the place if you're not careful.

To make things worse, a lot of the concerns are actually common to a lot of those services. If you have a validation service, maybe you have a bunch of services that depend on it, which means that all of them have to know what to do with the validation services down - do I queue events? Do I retry? Do I fail some calls? There is a bunch of logic that is spread out all over the place, leading to redundancy and shift in responsibility. Maybe your Order Service shouldn't have to worry at all about whether validation service is up or not. Maybe it's none of their problem.

The other thing is, as this architecture becomes more complex, making changes become riskier and you have to test more and more of those services together – again, if you're not careful. There are ways to go to around that. That's exactly what the rest of my talk will be about. If you're very naive, you can get into a point where, "Oh, no, I cannot make this change without talking to five other services that may depend on me. Then, I'll break service number six that they didn't even know that talks to me because we don't always know who is talking to each other." We need to solve that.

The other thing, and I think that's one of the biggest problems that really slows down innovation in a lot of companies, is that in order to add anything to this graph of microservices, I need to convince someone else to call me. Imagine that they have this amazing innovative service that really handles shipping with errors way better than anything else that we had before. I cannot really add any value to the company before I convince someone else to call my service instead of maybe the old service. These other people are busy people. They have their own priorities, they have their own deliverables, they have stuff to do. If I'm not proving to them that I'm adding value to their life, why would they even call my service? I'm stuck in this horrible chicken and egg cycle where I cannot prove that my service is valuable until you call me, but you have no reason to call me until I've proven that my service is valuable. The motivation to innovate just goes way down because you know it's not going to be very easy.

Then, the last thing. It's minor, but it's also annoying. JSON has to be serialized and deserialized on every hop. HTTP has to open and close connections on pretty much anything. This adds latency. You cannot really reduce latency by adding more microservices. It somehow happens that in microservices architectures, the only way we know how to deal with problems is add more microservices on top. I don't know exactly how it happened. I've also seen it happen in my company. We had a bunch of microservices, making a change required actually changing three, four, or five different ones. It was really painful, we complained, we had a bunch of architects, we had a discussion. Somehow, at the end, we moved from five to something like 11. That's how it goes, so you have to be mindful of latency. It's not exactly a big concern, but it's also not entirely dismissible one. Clearly, with all those problems, if you do things naively, it looks like we can do things a bit better.

We Can Do Better

I want to talk about how we'll do things better. The reason I'm here talking to you about that is that I've been solving these kinds of problems for a long time now, sometimes as a consultant, sometimes as an engineer. Recently, I became an engineering manager, which means that when I talk about developer productivity now and how to make engineering teams move faster, I now have a lot more skin in the game. It used to be very abstract, "Let's talk about architectures that are theoretically faster." Now, I'm with a stopwatch, "Is the feature done? Why are you wasting time?" Yes, I have vast interest in all of us building things that make developers more efficient. Also, I tweet a lot, which clearly makes me an expert.

API Gateway

The first pattern I'd like to share is that often API Gateway should be fairly easy. How many of you have API Gateway in your architecture? Yes, I expected every hand to go up. That's well-known and I don't think anyone lives without it. I'm actually at the point where, when I interview engineers, we have those fields design questions that we always ask. Usually, they draw the usual boxes, here's the upserver, here's the caches, here's the database, and then they throw an API gateway on top. I ask, "What is it for?" They say, "We're going to need it. Trust me." Yes. That's how it works.

API Gateway was originally introduced to solve the problem of the client knowing too much with the ideas that we can put an API Gateway to hide all those complexities from the clients, but they turned out to be even more useful than that. They can also handle the fact that, if you think about it, once you expose a lot of microservices to the clients, all of them have to authenticate. Whether every service should be responsible for authentication, whether the services were unlikely key enough to be closer to clients should be responsible for authentication, whether this is something that you really want to implement multiple times, is very questionable. API Gateways also solve that, and they ended up solving a lot of different things.

The main pattern is that we'll throw an API Gateway to front all the requests. Clients will always talk to the API gateway and say, "I want to talk to the return endpoint of an API Gateway." The API gateway will route it correctly. Because now all the routing happens through the API Gateway, it can also do things like, "If it's V1 returns, we'll send you over here, but if it's V2 returns, we actually have a new V2 service that we are A/B testing, or that works faster if you have a newer client or whatever it is that happens." Actually, routing can get quite smart.

The main benefit we have for API gateways is that they take responsibility for things that a lot of services need. They do authentication, they do the routing. They also do rate limiting, which is a huge deal. If you're open to the entire internet, you don't want every single request to hit you straight in the database, you want to have layers of control around that. I've heard that there are even services like AWS Lambda or Azure Functions architectures that can basically scale into the thousands at the drop of a hat. Someone may try to use that to speed order iPads before they run out, so rate limiting can be incredibly useful. Then, because it's the first place where clients hit your back end, it's also a really good place if you do things like tracing and spanning and trying to do observability. It's a very good place to log the first access point and start timing your back end from there. A lot of logging and analytics goes there, too.

As you've noticed, the API gateway is super useful. Sometimes, I want to talk to another service and know that I will be routed to the correct one, and be rate limited and get all this logging. It just sounds like a really good thing. What sometimes happens is that internal services start talking to each other via the API Gateway because it makes life a lot easier. It's like this one point that you know you'll get routed and everything will happen correctly. It definitely started happening to us when we have an API gateway. The guys who own it set rules and they're willing to protect them with their lives, like, "No. If you exist in one of our smaller regions, you are not allowed to talk back to the gateway." That's the worst anti-pattern."

Service Mesh

We learned, but we still want this goodness. Still, good things come with API gateways, which brings us to service mesh, which is in one word, like API gateways, but internally. The way we normally talk about API Gateways is we talk about North-South Traffic. This is from the outside world into the depths of our data center. The data inside our data center is normally called East-West Traffic. If you have very good eyes, you can see that I colored things in slightly different colors. You can see that East-West Traffic is the traffic between the microservices, not the one coming from the client.

The way we implemented the API Gateway for the East-West services - one of the things to notice is that we have a lot more connections going to East-West than we do North-South. Pretty much by definition, the internal web is much denser, which means that having one API gateway is this monolith that everything goes through as a single point of failure, as a single bottleneck is not going to work and scale at all. The pattern we implemented is to take all these things that we want an API Gateway to do, and scale it out. The pattern that was used to scale it out is what is known as a sidecar. The idea of the sidecar is that you take short functionality that every one of your applications is going to need and create its own very small application that can be deployed next to in the same pod in the same container on the same host as your actual applications.

Why an application and not a library? Because these days, a lot of times companies have multiple programming languages. Even us, we're a small company, we have maybe 100-ish engineers, and we ended up with Python, Go, Java, Scala. Some people are trying to introduce Rust – we are not going to let them. Having a sidecar means, because it's independent functionality and it talks via network calls, mostly HTTP network calls, it means that you don't have to implement the library in multiple languages on trying to do Interop from Java to C to Go. That would be really fast if we try to do it clearly, so sidecars it is.

Basically, that's the way it looks. The colored circles are my microservices and I put a small proxy next to each one of them. Why a proxy? Because my application thinks that it's talking to basically some kind of a port on localhost, but our sidecar knows what it's actually trying to talk to and routes the data correctly. On the way, it does a lot of the goodness that API gateways would also do, and sometimes actually a lot more. One of the things it can do is that especially if you run on Kubernetes, one of the things that happens a lot is that IPS change a lot and you have a bunch of load balancers and things that require to know about it. One of the nice things that the proxy can do is, you keep talking to the same port on local host, and the proxy is the one that is aware that things moved around and routes data correctly. It's the same thing if you want to upgrade from one version of the pink application to the other, you basically route it to another version. If you want to do A/B testing, pink application versus green application, you can do the routing, and the purple application does not have to know anything about that. It still thinks it's talking to the exact same thing. Routing happens magically in the background.

The other really nice thing that can happen internally is rate limiting. It basically means that a lot of intelligence about error recovery - I know there was an entire talk about error recovery - a lot of intelligence doesn't have to be built into every single application, which is very powerful. As engineers, actually, a lot of us don't want to think about error handling at all, if we could, and a lot of times we don't. Then, we write bad applications that do bad things when errors happen. One of the really bad patterns that sometimes happens is that you send something to the server, the server sends you back some an error, or it has some a delay. Instead of handling the error like a grown up, you throw a tantrum and you just keep retrying until the server gives up and dies, mostly. It happens a lot on databases.

One of the first errors I ever troubleshooted in my career was in MySQL that got viciously attacked by a set of PHP applications. Funny enough, this keeps on happening almost 20 years later. Some things never go away. The idea is that the client can be as bug as you want and keep retrying. The proxy knows about rate limiting and it will drop the majority of the retries. The server will actually be protected, it can recover in peace from whatever error just happened, but maybe more importantly, because we're dropping just the retries, the important traffic that will actually maybe get stuck in a very long queue behind all those retries, now has a chance of actually getting processed by the server. This architecture is significantly more resilient than a lot of alternatives.

Event Driven

We had sever mesh, which solved a lot of our internal communication problems, but maybe not quite all of them. I'm going to switch to talking to what is probably my favorite design pattern and the one that I've really been talking about for five years or so. The thing that we really want to solve is the fact that because of those point-to-point request response communications, changes are riskier and any two applications are more aware of each other than they have to be, and adding new things is much harder than we would like it to be. The problem, or at least something that is overused and could be used only in special cases and a lot less than it currently does is what we call request-driven pattern. In the request response pattern, I, as a service, talk to other services, I initiate the communication and I initiate it with maybe two ideas in mind. The first one is that sometimes, I want to tell you what to do, "I'm the Order Service, I need you to validate my order. Please validate my order. Thank you." The other scenario is that sometimes I have a question to ask. Sometimes I am, let's say, the Shipping Service and I need to ship something. I don't actually know where this customer ID lives. I talk to another service and say, "Do you know where customer 526 lives?" You get the address and now you can continue processing. This is fine, but this creates those couple patterns that I didn't really like.

Event driven switches this pattern on its head and makes every service more autonomous, which is exactly what we're after in the first place. When we first came up with microservices - let's take a step five years back - we really wanted to make services independent, so teams have more autonomy. In this case, every service is responsible for broadcasting events. Events is anything that changes the state of the world: an order was created, a shipment was sent, a shipment has arrived, something was validated, the person moved to a new address. You keep broadcasting those changes to the state of the world.

Then, you have other services. Every service is also responsible to listening in to events on how the world change. When you get an event that tells you the world changed, you can do a bunch of stuff. You can discard it - this is not, "Yes, something happened, but I don't care that it happened." It can be, "An order was created. I have to handle it. I know what to do. It's my responsibility to pick up orders and handle them. Nobody's really telling me what to do, I noticed a change in the world." It can also create a local cache out of those events, and store basically a small copy of the state of the world for its own later use. Because this is a lot, I'm going to work through an example.

Events Are Both Facts and Triggers

The thing I want you to keep in mind to clarify the whole thing is that you have those events about, "The world changed." The fact that the world changed is a fact, you can store it in a database, and it's a trigger. It can cause you to do something.

Let's talk about buying iPads. I do maybe slightly too much of that. We have an Order Service, and in order to buy an iPad, the Order Service has to call the Shipping Service and tell it, "There's a new order. Please ship something." The Shipping Service is, "Ok, a new order. I have to ship something to customer 526, but I don't know where it lives." It's calling the customer service, getting the customer address and shipping it. This is reasonable, we've all done that. Of course, things can go wrong. Shipping Service can go down and now Orders have to figure out what to do with all those extra orders that we cannot temporarily ship. Customer Service can go down, we can suddenly basically stop shipping stuff accidentally. That will be a problem. We want to improve this.

The first order of improvement is to start using events for notification. When an order is created, Order Service doesn't call the Shipping Service. It updates this stream of events and say, "An order was created and here's another one." Note that those orders get logged to a persistent stream no matter whether the Shipping Service is up listening or not. This is fantastic because when the Shipping Service has time, has energy, is available, it can go start picking orders and shipping them. Then, what if it doesn't? No worry. We get a log of all the orders, maybe a log of orders and cancellations. Really, even if the Shipping Service is up, but overloaded, or it's up, but the customer service database is down, all those things don't matter, orders will keep getting created, customers will get acknowledged. We'll ship it to them maybe five hours later when we deal with our outage, but nobody ever noticed a five-hour delay in shipping, unless it's like those two-hour stuff that Whole Foods does. That will be a problem. Other than that, nobody really cares. This is much more resilient architecture than we had before.

To improve it a bit more, we can start using the events as the facts, as data that is shared between multiple services. Whenever the Customer Service sees that a customer changed the address, changed the phone number, changed the gender, this gets written to the database, but an event is published because the state of the world has changed, and everyone needs to know that the world has changed. Shipping Service maybe doesn't care whether or not I've changed my gender, but it definitely cares that I changed my address. The Shipping Service listens to those events and creates a small copy of the database - really small, just customer ID, shipping address. That's usually enough.

This small copy of database allows it that whenever it gets a request to ship something - which happens by the way, far more than people change their home address - it can basically answer the question from its own local database, which is very fast and very powerful, and it makes it completely independent from whether the customer service is up or not. Now we've created a much more fully decoupled architecture where people really can deploy things independently and don't have to say, "Customer Service needs to be doing maintenance. Is it ok if we stopped shipping stuff a few hours?" This is much better.

The one objection that always comes up is whether it's really safe to have a DB for each microservice. Just walking the hallway, I heard a bunch of people say, "These days we're just duplicating data everywhere and I'm not very comfortable duplicating data everywhere." I get that, but I want to point out a few things. The first thing I want to point out is that it's actually safer than you think, because all these databases are created from a shared stream of events about the state of the world and the stream of events is persistent. If you know about event sourcing, you know exactly what I'm talking about. It means that all those different databases will have a common source of truth. They're not going to diverge into La La Land.

The other thing to keep in mind is that if I suspect that the database went wrong and something doesn't look right, I can completely delete the database and recreate it from the history of events. It's going to take some time, this is not actually something you can do in seconds. You can try to paralyze and speed it up, you have the ability to do that. It's much safer than you imagine.

The other thing is that you get those custom projections. Every service will have the data it needs - just the data it needs - in the exact format that it needs it because it creates its own small database without ever bothering a DBA. No more going into the DBA and, "I want to add this field, but this may take too much space and I have to approve it with the other department as well." It's your own database, you store data the way you want. If any of you attended the data mesh presentation - very similar idea. You own your destiny and your data in your context of execution. Then, the obvious benefit is that you get the reduced dependencies and you get low latency. Everyone loves lower latency.

Event Driven Microservices Are Stateful

The thing to note here is that when I talk about event-driven microservices, by and large, they are stateful because they have their own databases they maintain. This thing is important because you can see how much faster, more independent, and more powerful stateful microservices are. This is important because in about 10 slides, we're going to lose the state again to new advancement, and we're going to miss it when it's gone.

The last thing I want to point out about event-driven microservices is that they enable the innovation that I really want to see. I had the Order Services and I had the Inventory Service, suppose that I want to add a Pricing Service to the mix. We only had something that it's pricing, but it was very fixed, the Inventory Service just had a fixed price for everything, but I think it can do better.

I watched airlines very carefully and I know that you can subtly shift prices in response to supply and demand. Maybe I used to work at Uber and I actually know how to do it, but I don't know how to convince other people in the company that it's also a good idea to do it for my iPad Shipping Service and not just for Uber. I can go and create my own Pricing Service based on everything that I learned at my previous job, and plug it into the shared universal source of events for my company. Now I'm going to know about every change to the inventory, every order that happened, every order that maybe cannot be fulfilled because we are out of stock, and I can fine-tune my pricing algorithm.

Then, I can start publishing pricing to the stream of events. Anyone who wants to check how would our revenue look if you used my pricing versus the existing pricing, can basically look at all those events and compare them. I can actually go to different departments, like the iPad Services, and say, "Do you want to use my pricing versus the old pricing? Because look how much better it's going to be." I don't have to ask them to trust me. I actually have proof or at least some semblance of proof - shadow proof - of how much better life would be. It's a much healthier process. You can make more data-driven decisions that way, which is something that I'm a big fan of.


We still have all those services talking to each other, whether it's request response, whether some of them are now talking in an event-driven way. They're still talking to each other and this still has some issues involved. Yes, it can be either way, it doesn't matter. Either you send those commands, or you write events, it doesn't matter how you do it. Until now, we've only talked about the medium, how you send events. We haven't talked about the events themselves and what's in them. The medium is the message of this presentation, but it's not the messages that are sent in your architecture.

This is what your messages will probably look like. It’s a JSON, it has a bunch of fields. It has Social ID, Property ID, a bunch of stuff in there, lots and lots of metadata about everything that's going on in the world. This is pretty good, it has some problems. One of the big ones is that, as we said in the beginning, HTTP and JSON can be incredibly slow. I can't say it's easily solvable, but a very popular solution these days is to use HTTP/2 and gRPC. We've started using gRPC. If you use GO, it's fairly easy, it's fairly built in. It doesn't solve all of the problems, but it does make things significantly faster, so definitely worth exploring.

The other thing is, it doesn't matter if you use gRPC, or JSON. Some types of changes are still quite risky. I want to talk a tiny bit about those. I'm going to use event-driven as an example because I lived in the ward for a long time, but you can have the exact same problems in a slightly lighter weight fashion if you are using Request Response. The problem is that when you communicate with messages, the messages have schema. It has fields and the fields have data types. If you really did any validation or testing, you have a bunch of things that depend on the exact data types and the exact field names and you maybe not even know about them. You make what you think is a fairly innocent change and it's very likely that it will immediately break everything. Now, you'll think that it will be an incredibly silly change. How would anyone not notice that you changed something from a long to a string? Clearly, it was not a compatible change. It will break everything. It turns out that basically, it's been done over and over.

There's this one customer, I'm, "These guys are new. Things happened. They didn't know very well." Then, you go to a talk by Uber and I discovered that it also happened to Uber. I was, "They are maybe chaotic companies. They move fast and break things. Surely, it will never happen to me." Guess what? March, earlier in the year. This is incredibly embarrassing, I'll talk about it a bit more later. The key is that if you don't have a good way to test that your schema changes are compatible, you are likely to cause huge amount of damage. It's worse if you have this event stream because remember, we want to store it forever. If you wrote things that are incompatible, at any point in time, a service of any version can read a data point in any version. A new service can go all the way back to the past, or an old service can just stick around and keep on reading new messages. You have basically no control over who is reading what, which is incredibly scary.

The way to really look at it is, whether you're using gRPC, or REST, or you're writing events, you have those contracts of what the communication looks like, what is in the message. Those contracts are APIs and they have to be tested and they have to be validated. The way we do it in the event-driven world, if you use Kafka as your big message queue - we normally use schema registries. Confluent has one, a lot of other companies have one. The idea is that when you produce events, you register the schema of the event in a registry and it automatically gets validated with all the existing schemas in the registry. If it turns out to break a compatibility rule, the producer will get incompatible data synchronization error. You physically cannot produce an incompatible event, which is fantastic because you just prevented a massive breakdown for everything downstream.

Obviously, waiting until something hits production in order to reject your messages is a crappy developer experience. What we really want is to catch it in the development phase. If you use Schema Registry, there's a Maven plugin. You run Maven plugin, validate schema, you give it your schema definition, it goes up to a Schema Registry of your choice and just checks compatibility for you. Then, you can do it before you merge, you can do it on your test system, etc.

I don't exactly know how to do it with gRPC, but I know how not to do it, because we ended up with a system that I'm not a huge fan of. In gRPC, you create all those tracts. Obviously, those tracts are used to communicate between microservices, so they have to be shared. We created this repository with all those structs that everyone has dependency on so they can share it. Imagine that you go to that repository - which you remember does not really have code, it only has structs, which means that it doesn't really have tests - and you make a change. You changed it and you need it in your service, you go to your service, you bump up the version. Now, you depend on the new version of that request. Everything works fantastic. Now, I want to make a change. I go in, I make a change in my own struct, but when I go in and bump the version, my service also gets your changes, so now I have a new version of two different structs. My tests could still fail with the new version, even though I haven't touched it. It works for you, but we both have dependency on that struct and only your tests ran after you made the change, not my tests.

It looks like if we have this repository of structs, and someone makes a change, we actually have to bump up the version across the entire universe and run tests across the entire universe, which makes me think that I'm all the way back in my distributed monolith again, which is not where I wanted to be. We are still trying to make sense of it. I am a bigger fan of event-driven and those easy to validate. You have a Schema Repository and can validate changes within the Schema Repository itself. I haven't found any thing that allows me to automate compatibility validations of gRPC changes, which is what I really want in life.


We solved some problems, we created some problems, but then we discovered that the really big problem is that running services is not very easy. Deploying services, monitoring them, making sure that you have enough capacity, making sure you scale, all this takes a lot of time and a lot of effort, and we started thinking that maybe we can do better with that as well, so we ended up with serverless. If you lived under a rock for the last year or two, serverless is incredibly popular function as a service AWS Lambda. There was a talk earlier about similar things for Microsoft.

The idea is that you just write your function. The function has one main method, which is handle event, so the only thing is that you get an event and you spit out the response - everything else is up to you - and you give it to your cloud provider. If someone sends an event to an endpoint, the cloud provider is responsible to spin up a VM, start a function and run it for you. That's already quite useful. It gets better. If nobody sends an event, this is not running and note this, you are not paying for it versus in the old way, you had the microservice and whether or not it handled events, I still had to pay for the fact that it's running on a machine. For this one, if there is no events, you don't pay, which is a really big improvement and everyone really likes that.

The other big thing is that sometimes I have a lot of events and I don't have to know about it in advance and I don't have to plan, and I don't have to do capacity planning, which is really cool. I just have to open my wallet. That's the only condition. A cloud provider makes money when you handle events. He does not make money if you are rate limited and cannot handle events. They have every interest in making sure you can scale immediately far and wide to handle every single event that comes your way, and they mostly do a fantastic job of that. I'll give them that. This is a very nice and easy way to scale simple microservices. It also matches some of the very simple event-driven patterns where you get an event, you do something to it, and you speed back a response.

The validation service that I mentioned earlier is that you get orders and validate them - it's like that. You get an order, you do some checks, and you produce an output. This is really cool.

Up Next: Stateful Serverless

There is only one thing that's missing in my picture, which is that we lost the state on the way. I mentioned earlier that having state in my microservice is incredibly powerful and we'll miss it when it's gone. It's gone and I miss it. Why do I miss state so much? I miss having states because sometimes my rules are dynamic. I cannot hard code them, I have to look them up somewhere. Sometimes, my events contain some of the data that I need, an ID, but not the rest of the data that they need, so I have to look it up somewhere. Sometimes, I have to join multiple events. Netflix had a good talk about it earlier in the day. Sometimes, I just want to aggregate number of events per second, number of orders per second, dollars per hour. All these things are important, so I need state.

The way I currently do state is something like that. Once my function is running, I call the database, I do a select, I call the database, I do an insert. This database is likely S3, DynamoDB, all of them are quite popular. Note that where I'm paying, I'm paying for running my function. I'm paying for every minute that my function is up, for the memory it's using, for the CPU it's using, and also, I'm paying for memory CPU and IOPS on the database side. This is significantly expensive when you try doing it at scale. Can we do better? A bit, we can do a bit better.

We know our logic and we can say, "It's very likely that if someone checks the status of the order, there will be other people who want to check the status of the order." I know that when the function is running, I have about five minute periods where if I get more events, this function instance will be reused, so I can actually create a small cache of all the hot orders, do one database call, get all of them, cache them, and maybe save on some data calls in the future. This is what some people will call an ugly hack and what some finance departments will think is genius, depending on exactly where you work, but you definitely don't want to do too much of that. Ideally, this will all be hidden from you completely.

The things that I really want back is really these sorts of events that I could always create my own personal data store from. I really don't have it and I miss it. If I had it, I could do things like create order just writes events to the stream of events, validate orders, pick them up, populate the local database. Maybe it also reads rules in inventory from another database into local database, validate stuff locally, posts the status. The Status Service also has its own small database that it can handle locally. You can actually do really cool things if you could have local state in the cloud that is continuously updated by new events, and maybe even continuously synced with a longer-term database. This would be amazing.

Rumors are that some cloud services are doing it. Maybe some of you attended a talk about durable function earlier in the day is that does some of what we're really looking for. It doesn't have great integration with databases, though, which brings us to the things that I would really like to see in a serverless world. One of them is the whole idea of durable functions. Microsoft is doing it apparently in Azure. People told me that AWS has something, but to be honest, I searched, and I haven't managed to find anything. As long as it's not the exact same thing in every cloud, it will be really hard, I think, to get a lot of traction. People seem to be afraid to implement something that is so deeply coupled with architectures that are only runnable in Azure, even if you don't have any immediate plan to run multi cloud, which by the way, after two years of experience, if you don't have to, don't. You still don't want to be that tied into something that relatively few people are doing, and you don't know if it caught on or not.

I really want to see this idea of functions with state catching on. I want to see them much better integrated with databases, because as we said, data is a type of event and event is a type of data. You can see flows going back and forth. I think that Amazon actually has something that gets events from DynamoDB as a function events. You could trigger functions based on things that happened in Dynamo, which is pretty cool. Another thing that we really want is a unified view of current state. What is happening in the local data store of every one of my functions and every one of my microservices? Being able to pull together unified reports where if you have a shared source of events, it becomes much more tractable.

I hope I gave you a good overview and some ideas of maybe what one of you could implement in the future.

Questions and Answers

Participant 1: I have a question about the life cycle of schemas. Do you ever deprecate schemas once it's registered?

Shapira: I can only deprecate schemas if I know for sure that, A, nobody is using them, and B, they are not stored anywhere. If all of my streams have very short retention policy, say, 7 days or 30 days, and if I know which one of my microservices is using which schema, which I could know because they're all validating it against the registry, I can try to track who is talking to the registry and which questions is it asking, then I could deprecate schemas. Currently, I'm not doing it. All schemas are cheap in my mind. Again, consumers that want to stop supporting them can stop supporting them. Producers that want to stop producing all the schemas can stop. If you maintain compatibility, all schemas are quite cheap. I'm not too worried, but if I really have to, there are ways that I can do it.

Participant 2: You mentioned that we should separate our databases for microservices. In my workplace, the problem was that if you separate out the databases you will have to pay more, and they were a little bit hesitant to do that. We sort of compromised with having separate schemas instead of separate databases. What do you think about that?

Shapira: This works to an extent, I think. Even if you separate them to separate schemas, your database is still a monolith. Everyone is talking to the same database. Everyone is limited by how much your database can scale. If you chose a relational database, then everyone has to be relational. If you chose Cassandra, everyone has to do Cassandra. I think to an extent it's possible, but I think we're going towards a world where engineering teams have good reasons for choosing the databases and the data stores that they choose and the cost of running additional databases is usually lower than the cost of forcing every team to be tied to one big database. Also, scaling one big database is not as free as one may want to think.

Participant 3: In the event-driven architecture how do you solve the duplicate message? Because of every retry, you will get the same message created, for a single order two shipping event message is created. How do we solve that?

Shapira: There are two ways to solve it. You either have someone detect duplicates and cleans them up for you, which is expensive, or you make sure you design your entire event architecture, so events are idempotent. Idempotent events means that if you get the same event twice, nothing bad happens. This means that the event cannot be, "Ship an iPad." It can be, "Ship a single iPad on this date and no more to this person." It cannot be, "Add $50 to an account." It can be, "The account balance in this account is now $350." Then, if you get two notifications like that, it's fine. They basically act exactly like they are one, so either you figure out a good way to do it, and a lot of places figure out good ways to do it because it's so much more efficient than having a service that tries to figure out what to duplicate and cleans it up.


See more presentations with transcripts


Recorded at:

Dec 23, 2019