
Evolution of a Backend for a Streaming Application


Summary

Daniele Frasca explains the architectural evolution of Joyn, a German streaming giant. He discusses moving from fragile single-node setups to resilient serverless architectures using AWS. He shares insights on the Hub and Spoke pattern for data consistency, cell-based isolation to reduce blast radius, and cost-optimization strategies for achieving affordable multi-region active-active setups.

Bio

Daniele Frasca is an AWS Serverless Community Builder who focuses on building and architecting serverless applications at scale. In his current role, he architects media services for millions of users, leveraging a multi-region architecture built on AWS serverless services, and he pushes the limits with Rust to shave the extra milliseconds for end customers.

About the conference

The InfoQ Dev Summit Munich software development conference focuses on the critical software challenges senior dev teams face today. Gain valuable real-world technical insights from 20+ senior software developers, connect with speakers and peers, and enjoy social events.

Transcript

Daniele Frasca: Today, I want to share a story with you. Everything started one and a half years ago, when my company asked me to scale our streaming application for our international users. The problem was, the request was painful. Today I'm going to share how a team of two developers with no experience in AWS, led by me, completely changed our core services: we removed single points of failure, and we increased the availability and scalability of our applications. The truth is, there is no blueprint to follow. It's just iteration, learning to make things suck less. This is the reality.

I would like to start with the original architecture. At a high level, there is a worker that subscribes to topics on Kafka, does some transformation, and stores the result in the DB. We have the other component, an API behind GraphQL, that gets bombarded by our frontend. There is nothing wrong with this architecture, if done right. What I learned one and a half years ago is that when we talk about technical debt, we are usually talking about code. This architecture didn't grow with the business. Servers were on fire. The database was crying. Every time there was a spike, everything crashed. The database was a single node: no cache, nothing. It was painful. Data was completely inconsistent across services. This team of two, to our joy, had six services with no standards. This is where we said, we need to change the way of working. We moved to serverless. Not because serverless is cool, only because it lets us focus on what really matters: the code. During this one and a half years, we ended up with active-active services and different shades of multi-region active-passive, depending on the severity of the service. This architecture pretty much solved all our problems, because serverless just scales. We leverage all the managed services from AWS. All the problems like availability and resiliency are taken care of by AWS. We also solved one of the major issues, which was deployment. Originally, a deployment took an hour and a half. Now we are down to minutes. Once we moved to this, we started to evolve, and we changed the services to make multi-region more affordable.

Background

I'm Daniele Frasca. I'm Italian. I work in Munich for ProSiebenSat.1 Media, which is a German TV broadcaster operating in multiple countries. I'm part of a group that is building a streaming application called Joyn. This application is available in the DACH region, which means Germany, Austria, and Switzerland. We receive millions of requests at any time. What you see on this slide is what our managers write, easy-peasy: everything available, everything scalable, and it needs to be cheap. The reality is, you cannot have it both ways. The quality of the infrastructure reflects the user experience, and in this business it's very simple: if something goes down, you know it immediately. You cannot have highly available and scalable services at a cheap cost. This needs to be very clear for the management. In fact, our users' expectation is very simple. They open the app, and they want to play video. They don't care if there is the Bundesliga here in Germany, or another event somewhere in another country. They want everything in the app, with almost real-time synchronization across platforms. These are user expectations that are very difficult to meet when nothing really works.

I split the talk into two sections, which were our two major issues. We start with data quality, because we had issues where you are on a page, you see a video, you check the details, you move to a different page, and now this video is not available. We had major issues with data consistency. The second part is how we improved the scalability and resiliency of the system.

Data Consistency and Quality

Back to our beautiful original architecture. Now imagine that one team owns multiple services with no standards. Each of those services subscribes to the same topic, and they do, for some reason, completely different validations and transformations; some save to the database, some don't, some throw errors, some don't. We had issues where you couldn't see the same video on different pages. This made our life quite miserable, because something that should take a few minutes took hours to investigate. Another problem with this architecture, especially when you have a company bus, a central bus, was that internal services that should communicate with each other were exposing internal state to the company bus, which is actually an antipattern. This happened because when you're using Kafka, you would maybe need another Kafka bus, you would need to do other things, and people are usually too lazy to do the job right in the first place. I am lazy as well. The solution for this was clear boundaries, and we solved it with a pattern.

During the talk, I will show you infrastructure patterns that you can apply to solve issues, and they are so generic that I think they apply to different types of applications, not only media streaming. I'm not going deep into each of the services, so we will see at a high level how we solved the issues. This is called the Hub and Spoke pattern, also called, for me maybe, a bus mesh. Practically, we have three actors here. Kafka, which is our company bus, acts as an event store, where all the events live. We have EventBridge for fanning out all the messages. Between Kafka and EventBridge, there is another service from AWS called EventBridge Pipes, a point-to-point service that acts as a middleman: it intercepts the messages, and from there you can do whatever you want. You can do validation, transformation, everything. The beauty of this pattern is that each service interfaces only with its local bus, which is EventBridge. If you want to communicate internally inside your microservice, with another microservice, or with the company bus, you have only one interface, EventBridge, and you can route the messages through rules. It's like Pub/Sub, but you don't care if the subscriber is SQS, SNS, whatever. Everything is hidden behind this abstraction layer that is EventBridge. With this pattern, we also solved the problem of the single source of truth, because now everything passes through EventBridge, and from there we fan out to the internal teams.

Because we're talking about event-driven applications, we always need to think about the tradeoffs. This is a very simple example. We have two types of messages: sparse or full state. Sparse contains only the basic information, which means the subscriber needs to fetch the data. Once subscribers need to fetch the data, you need to build some API, and you need to handle massive numbers of requests. With full state, on the other hand, everything is there. You are going to couple publisher and subscriber, because maybe you want to change properties, so you start to have breaking changes, that kind of stuff. On the plus side, everything is there to make your life easier, but you always need to think about networking, because maybe you are moving around megabytes of events instead of a few kilobytes. For us, the choice was very simple. This is the reason I'm showing you this: with Kafka, you can go up to 30 or 40 megabytes per message, and in media streaming that's normal, apparently, while EventBridge can only handle 256 kilobytes, so that is a completely different world.
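
To make the routing idea concrete, here is a minimal sketch (not code from the talk) of a spoke service's local bus with a single EventBridge rule, using AWS CDK in TypeScript. The bus name, event source, and detail type are hypothetical placeholders.

```typescript
import * as cdk from 'aws-cdk-lib';
import * as events from 'aws-cdk-lib/aws-events';
import * as targets from 'aws-cdk-lib/aws-events-targets';
import * as sqs from 'aws-cdk-lib/aws-sqs';
import { Construct } from 'constructs';

// Each spoke service owns rules on its local bus; the bus hides
// whether the subscriber is SQS, SNS, Lambda, or another bus.
export class SpokeServiceStack extends cdk.Stack {
  constructor(scope: Construct, id: string) {
    super(scope, id);

    // The service's local event bus (the "spoke").
    const localBus = new events.EventBus(this, 'LocalBus', {
      eventBusName: 'catalog-service-bus', // hypothetical name
    });

    // Work queue consumed by the service itself.
    const workQueue = new sqs.Queue(this, 'WorkQueue');

    // Route only the event types this service cares about.
    new events.Rule(this, 'VideoUpdatedRule', {
      eventBus: localBus,
      eventPattern: {
        source: ['company.media'],            // hypothetical source
        detailType: ['VideoMetadataUpdated'], // hypothetical type
      },
      targets: [new targets.SqsQueue(workQueue)],
    });
  }
}
```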

We actually solved this problem with another pattern, called the claim check pattern. It is very simple. Leveraging Amazon EventBridge Pipes, there is a feature called enrichment, where you can intercept the message and do something. In our case, we do transformation and validation. At the end, we store the event in S3. At that point, we take the S3 key and pass it to EventBridge. EventBridge fans out to everybody, and all the consumers, theoretically, access S3 and fetch the data from there. In this way, we got an API that scales out of the box, without building and maintaining our own API. With these two patterns, we solved the major issue, data consistency. This was the number one problem we had, because for scalability you would put 500 tasks to handle three users, and that's how you survive when you're not really scaling vertically or horizontally. (There is a sketch of the enrichment step after this section.)

Before we move on to scalability and resiliency, I want to show you an alternative. The alternative is replicating data. In the last 5 years, we have always talked about event-driven, but there is another way to duplicate data: data replication, in this case with Postgres pglogical replication. Imagine that you take the data from Kafka, you normalize the messages into a database, and all your other services require only partial data. You can replicate individual tables: maybe you have 20 tables total, and you only need 2. Both approaches are valid; it really depends on the company requirements. There are tradeoffs you need to consider, and I try to lay them out in this table. We know that event-driven gives you decoupling: practically, you send the event, the others subscribe, take the event, and rebuild the data as they wish. You can have Aurora, DynamoDB, a text file, Excel, whatever you want. With data replication, instead, you practically have one database to rule them all. You start with Postgres, so everybody must have Postgres. It's not bad. From Postgres you can build your own pipe and move the data to a different database, but I don't see the point. The real issue is that you are now going to couple the teams, because the source becomes the bottleneck. All the subscribers must have a database that is the same size or bigger than the source, and if the source changes the schema, it breaks everything. You go back to coordinating, asking the other team to change: you deploy first, I deploy after. Again, it can work, and there are benefits. I want to show you a real comparison. What I do not like about this option is the operational complexity. You need to take care of subnets, subnet groups, CIDRs, because all the databases now need to live in a completely separate network. It's very complicated. Why? It's all there to manage. I was doing this 10 years ago.
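
Returning to the claim check step described above, here is a minimal sketch (not the team's actual code) of an EventBridge Pipes enrichment function in TypeScript. It assumes records arrive with a JSON string body and that the bucket name comes from a hypothetical CLAIM_CHECK_BUCKET environment variable; Pipes forwards whatever array the function returns on to the target bus.

```typescript
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';
import { randomUUID } from 'node:crypto';

const s3 = new S3Client({});
const BUCKET = process.env.CLAIM_CHECK_BUCKET!; // hypothetical env var

// EventBridge Pipes invokes the enrichment function with a batch of
// source records and sends the returned array to the target.
export const handler = async (records: { body: string }[]) => {
  const envelopes = [];
  for (const record of records) {
    const payload = JSON.parse(record.body);
    // ...validation / transformation would happen here...

    // Park the (potentially multi-megabyte) payload in S3.
    const key = `events/${randomUUID()}.json`;
    await s3.send(new PutObjectCommand({
      Bucket: BUCKET,
      Key: key,
      Body: JSON.stringify(payload),
      ContentType: 'application/json',
    }));

    // Forward only the claim check, well under EventBridge's 256 KB limit.
    envelopes.push({ eventType: payload.eventType, s3Bucket: BUCKET, s3Key: key });
  }
  return envelopes;
};
```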

Scalability and Resilience

We've finished the first part, which is how we solved data consistency. Finally, with this team of two, we could concentrate on resiliency and availability. Imagine everybody shows up at the same time. Only two things can happen: your architecture gracefully scales, or you fail spectacularly. There is no other way. This is where we started thinking, because the real problem is not Lambda or a cluster. The problem was the autoscaling rules we had. All the caching was missing. The best practices that are there to achieve a certain scale were all missing. This is where we moved to serverless and managed services. This is the combination of services that we use. Just to go through it quickly: we have Route 53, the classic DNS. If you're using only this, you're subject to DNS issues, because you're on the public internet. I always suggest using either CloudFront or Global Accelerator anyway. Without going into details, this means that when you make a request, you're routed to an edge location, and from there you reach your region through the AWS private network. The main difference is that CloudFront is a CDN, so you can cache at the edge. The front door is an Application Load Balancer or API Gateway. I always start with API Gateway, because it works at a higher networking level, so you get CORS and compression out of the box. They just route your request to your compute service. In this example I bring only Lambda and Fargate. They are both serverless. Lambda is magic: it scales from zero to 1,000 in milliseconds. It has its own limitations, but you don't need to do anything. You just need to make sure your code is right and you select the right memory for the runtime you're using, and you are good to go. With Fargate, you have more control, but you also have more ways to fail. Behind Fargate, of course, we can have a cache and a database. There are millions of databases in AWS. I'll just give the example of Aurora, DynamoDB for NoSQL, and RDS for whoever is still using it.

Because we are building for availability, we need to talk about the SLA of each of these services. To my surprise, API Gateway and Lambda were not the most available. They are fully serverless, and they allow you to concentrate on your business rules, on coding and everything, but the best combination here is actually Application Load Balancer and Lambda. Those numbers are very theoretical, and they give you the foundation, but if you are not applying best practices like circuit breakers, retries, and timeouts, you can still bring down a service like Lambda that is 99.99% available. Best practices always apply. The architecture gives you the foundation, but the code allows you to reach these numbers. The same applies to the databases. The fully serverless databases in AWS right now are only DynamoDB and Aurora DSQL. This allows you to use them fire-and-forget. They're an API. You don't need to think about subnets, VPCs, that kind of stuff. Easy-peasy: you use it, and everything just works. AWS does all the job for you. If you are more traditional, you have Aurora or RDS. Here you already need to think about VPCs, subnets, all that kind of stuff. There is always something you need to take care of. The choice of the database is usually the hard problem. You can solve everything with a relational database. I was doing this 20 years ago. Everything works with a relational database. It is a choice. This table shows that you're trading operational complexity for cost, and at the same time you're trading reliability for simplicity. Each of those services has its own benefits. I include RDS single node plus replicas here because you can go to production with this setup. Nobody tells you not to. It works until it doesn't. You need to think about all the possible failures. This is actually the real question: how long are you willing to be down when a problem occurs?

There are many other little details we can think about that never show up during a proof of concept. I can do a proof of concept with a text file as a database. It works. Then there are the other things: in this case, replication lag, failover, the split-brain scenario. Split brain means: did the write node really go down? You automatically promote a read node to a writer, and now you have two writers. The old writer comes back; it was a false alarm. Now you have two writers accepting completely different data, and you cannot reconcile. All of these little things are the tradeoffs we think about when we are building an API. Everything works until it doesn't. I believe we as builders, as developers, just need to bring those problems to the decision makers, the managers, and force them to decide. Remember, you cannot get both; you need to weigh it. Once you decide which services you want, you can figure out your availability. You just multiply those numbers, and you get 99.78%. Just quickly, on the difference between single region and multi-region: if you're considering two regions, this is the worst case for how long you can be down. The difference is actually between a happy customer and an angry customer who is on social media writing that your service doesn't work. The problem is your manager who calls you at any time and tells you nothing works. Again, these are only theoretical numbers, but it's very important to think about them, especially if you're building an international application. It really depends on what you want to achieve.

From there, this is our service template. It's the same set of services we talked about before. There is only one thing I didn't mention before: Momento Cache. Instead of using Redis or Valkey, and dealing with the sizing and everything, we moved this responsibility to a third party. Everything else is managed services. The idea of this infrastructure is: never go down, and recover gracefully. This virtually solves all our issues because everything is resilient. We're using global tables, a feature of DynamoDB, and the equivalent for Aurora, so we have the same data replicated in another region, for disaster recovery and so on. We have regional services, which can be Application Load Balancer, Lambda, Fargate, that are all managed. We don't need to do anything. But if we zoom into a single region, we have a single point of failure. You cannot escape this.
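
As a back-of-the-envelope sketch of that multiplication, and of why a second region changes the worst case, here is the arithmetic in TypeScript. The SLA figures are placeholders, not the actual numbers from the slides.

```typescript
// Serial composition: a request that touches every component in the
// chain is only as available as the product of their availabilities.
const components = {
  cloudfront: 0.999, // placeholder SLA figures, not quotes from AWS
  apiGateway: 0.9995,
  lambda: 0.9995,
  dynamodb: 0.99999,
};

const singleRegion = Object.values(components).reduce((a, b) => a * b, 1);
console.log(singleRegion.toFixed(4)); // ~0.9980, i.e. roughly 99.8%

// Two independent regions in active-active: you are only down when both
// are down at once, so availability becomes 1 - (1 - a)^2.
const multiRegion = 1 - (1 - singleRegion) ** 2;
console.log(multiRegion.toFixed(6)); // ~0.999996, far fewer minutes down
```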

Before moving to multi-region, we did a series of iterations making our application more available and more scalable. One of those patterns is the cell-based architecture. To make it simple, we have three countries: Germany, Austria, Switzerland. I can build one Fargate service, one Lambda, to serve all the requests. If for some reason this service or this Lambda goes down or hits a limit, you bring down the entire application. So we started to split the traffic: based on the country, and based on the type of user, because we have paid users and free users. Already, theoretically, we move from one Lambda to six Lambdas. This means that where one Lambda can scale from zero to 1,000 in milliseconds, we now have 6,000. We can split even more, by platform. We have five platforms. That makes 30 Lambdas, which means we can have 30,000 concurrent requests in milliseconds. Of course, you need to raise the quotas with AWS, but you write the code once and deploy through CI. It doesn't really matter if you have one Lambda or 30. The same applies to Fargate services. Instead of having one monolithic service, you have many of them, and they can scale up and down based on the traffic pattern, memory, CPU, and so on. The key for us was to reduce the blast radius in case of a problem. This also changes how we deploy. We can deploy maybe only in Germany, iOS, free users. We can test there, monitor, get metrics, and roll out later. It gives us that kind of flexibility.
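
A minimal sketch of how such cells can be addressed deterministically. The types, domain, and naming scheme here are hypothetical illustrations, not Joyn's actual setup.

```typescript
// A cell is one full copy of the stack, keyed by country x tier x platform.
// With 3 countries, 2 tiers, and 5 platforms this yields 30 cells, each
// with its own Lambda/Fargate deployment and its own scaling headroom.
type Country = 'DE' | 'AT' | 'CH';
type Tier = 'free' | 'paid';
type Platform = 'ios' | 'android' | 'web' | 'tv' | 'console';

interface RequestContext {
  country: Country;
  tier: Tier;
  platform: Platform;
}

// Deterministic cell id; a router (e.g. ALB rules or an edge function)
// would forward the request to the matching cell's endpoint.
function cellId(ctx: RequestContext): string {
  return `${ctx.country}-${ctx.tier}-${ctx.platform}`.toLowerCase();
}

// Hypothetical endpoint lookup: deployments are identical, only the
// cell name varies, so one codebase serves all 30 cells.
function cellEndpoint(ctx: RequestContext): string {
  return `https://${cellId(ctx)}.api.example.com`;
}

console.log(cellEndpoint({ country: 'DE', tier: 'free', platform: 'ios' }));
// -> https://de-free-ios.api.example.com
```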

The cell-based architecture can also be applied to the database. This is where we can discuss. With fully serverless databases like DynamoDB and DSQL, you can do it, because it doesn't cost you anything: having 10 of them or one is the same. When you're using RDS, Aurora, or OpenSearch, that is where AWS makes money, and at our scale it didn't make sense to split the database. We have, for example, one database and one OpenSearch in each region that serve the traffic for everything. We managed to overcome this limitation by applying something the previous architecture didn't have: caching. It's a streaming application. Everybody loads exactly the same things. Give me the profile of Brad Pitt: it's always the same, it doesn't change. There are little differences based on the platform and so on, but it's always the same. There is no point in using the database as an expensive cache.

We apply three layers of cache. We have CloudFront for the repetitive requests. We have an in-memory cache inside the Lambda or the Fargate task itself, with different settings, where we store only the hot keys. If there is a miss, we have a cache in front of the database, which in our case is Momento. Again, it's a service that does all the caching, real-time notifications, and everything. Finally, if we don't find anything in the caches, we go to the database. We actually got database hits down to less than 10%, even 5%, during prime time. This allows us to have a very small cluster instead of a giant cluster with huge memory just to handle the number of requests. This allows us to move to serverless and scale based on the requests, or apply a strategy like active-active. For example, if you're using Aurora global databases, the writes happen in only one region. Once you're thinking about multi-region, I don't like this setup. I would prefer active-active, like DynamoDB, so you can write in multiple regions. When you have a small cluster, it actually becomes cheaper than, or the same as, having global tables. There are real benefits to a good caching strategy. Of course, you always need to think about invalidation, but that's the tradeoff; it's not impossible, and we managed to do it. There are cases where we cache in the CDN for only a few seconds, and those are huge benefits for services receiving 100 million requests simultaneously. It's fantastic.
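
A minimal sketch of the read-through chain behind CloudFront, with a generic RemoteCache interface standing in for Momento. This is an assumption-laden illustration, not the actual Momento SDK or the team's code.

```typescript
// Minimal read-through chain: in-memory -> remote cache -> database.
interface RemoteCache {
  get(key: string): Promise<string | undefined>;
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
}

interface Database {
  fetch(key: string): Promise<string>;
}

// Layer 1: process-local hot keys, reused across warm Lambda invocations
// or within a Fargate task. Deliberately tiny TTL to limit staleness.
const hotKeys = new Map<string, { value: string; expires: number }>();
const LOCAL_TTL_MS = 5_000;
const REMOTE_TTL_S = 60;

export async function getWithCache(
  key: string, remote: RemoteCache, db: Database,
): Promise<string> {
  const local = hotKeys.get(key);
  if (local && local.expires > Date.now()) return local.value; // layer 1 hit

  const cached = await remote.get(key); // layer 2 (Momento in the talk)
  if (cached !== undefined) {
    hotKeys.set(key, { value: cached, expires: Date.now() + LOCAL_TTL_MS });
    return cached;
  }

  const value = await db.fetch(key); // layer 3: the database, <10% of reads
  await remote.set(key, value, REMOTE_TTL_S);
  hotKeys.set(key, { value, expires: Date.now() + LOCAL_TTL_MS });
  return value;
}
```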

We talked about cell isolation and cache. The third thing we invested in is the data plane. Because we now have a lot of Lambdas and a lot of services, we started doing automated operations. With the Application Load Balancer, we monitor our compute, and we built automatic failover: if something goes wrong in the region, Route 53 will switch to a different region. We also have other types of alarms that monitor resources like CPU and memory, and those emit events. With these events, we can shift the traffic between Fargate and Lambda. We are not serving the traffic with only one service, but with multiple services, and based on the traffic pattern at that moment, we decide, based on some rules, which one is best. This brings us to the famous multi-region. Why multi-region? I won't say that everybody needs to do multi-region; it really depends. The practices we just saw make our services more scalable and available, and reduce the blast radius, but they are not real resiliency. For resiliency, you need to go multi-region. You don't need to do multi-region for every single service. Internally, we have many services. The ones that require multi-region active-active are the ones that, if they go down, bring the entire application down. If bookmarks go down, nobody cares; we can go there and spin it up, whatever. It really depends. We prepared all our services to be deployable multi-region. There are multiple strategies: backup and restore, pilot light, warm standby, and active-active. Depending on the severity of the service, we decide which one we want to use.
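
As an illustration of the Route 53 side of that failover, here is a health-checked failover record pair in AWS CDK. The domain names, zone id, and thresholds are hypothetical; the talk does not show its actual configuration.

```typescript
import * as cdk from 'aws-cdk-lib';
import * as route53 from 'aws-cdk-lib/aws-route53';
import { Construct } from 'constructs';

// Route 53 serves the primary region's endpoint while its health check
// passes, then automatically fails over to the secondary region.
export class FailoverDnsStack extends cdk.Stack {
  constructor(scope: Construct, id: string) {
    super(scope, id);

    const zone = route53.HostedZone.fromHostedZoneAttributes(this, 'Zone', {
      hostedZoneId: 'Z123EXAMPLE', // hypothetical zone
      zoneName: 'example.com',
    });

    const healthCheck = new route53.CfnHealthCheck(this, 'PrimaryHc', {
      healthCheckConfig: {
        type: 'HTTPS',
        fullyQualifiedDomainName: 'eu-central-1.api.example.com',
        resourcePath: '/health',
        failureThreshold: 3,
        requestInterval: 10,
      },
    });

    new route53.CfnRecordSet(this, 'Primary', {
      hostedZoneId: zone.hostedZoneId,
      name: 'api.example.com',
      type: 'CNAME',
      ttl: '30',
      setIdentifier: 'primary',
      failover: 'PRIMARY',
      healthCheckId: healthCheck.attrHealthCheckId,
      resourceRecords: ['eu-central-1.api.example.com'],
    });

    new route53.CfnRecordSet(this, 'Secondary', {
      hostedZoneId: zone.hostedZoneId,
      name: 'api.example.com',
      type: 'CNAME',
      ttl: '30',
      setIdentifier: 'secondary',
      failover: 'SECONDARY',
      resourceRecords: ['eu-west-1.api.example.com'],
    });
  }
}
```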

The real problem with multi-region, I think, is the mentality. In the last 10 years, I always did the same thing: why do I now need to complicate my infrastructure? Or, if it has to be done, there is an SRE team, they will do it. You go to the SRE team, and the SRE team says, no, we are busy, we don't want to do it. There is always a pushback, and that actually creates a culture of: if the problem happens, let's brace, let's take the heat from the manager, and wait for it to pass. We always go ahead in this way. Nowadays, I think multi-region is easy. The cloud providers are improving, making it easy to deploy the same stacks in multiple regions. Again, we are doing streaming applications. Right now, we are in three regions, and there is a plan to move to the rest of Europe, because we got acquired by another company. This is where I think my job, the job of us builders, is to make it very transparent. I do not decide what the business wants. I go to my manager and I try to make it transparent. If my manager takes the responsibility for downtime during prime time, fine by me, I can put Excel as a database.

When I bring this problem to the manager, I try to make sure they understand the problem: what if this happens? Because most of the time we say, if multi-region costs x and we have one issue per year, maybe this x is greater than the loss of revenue. It makes sense. But we are not talking about Frankfurt going down. What happened back in 2021, I think, was 8 hours of downtime? Something happened, and it lasted many hours. I'm talking about the issues that are constantly there. In the last few months, I can recall DNS issues where CloudFront couldn't reach the Application Load Balancer origin. I recall issues with Lambda where, at some point, everything gets recycled. That is fine, but then you have a lot of cold starts, and that has a domino effect. Sometimes we see Fargate tasks disappearing: you have 10, and suddenly you are at zero. It happens. Everything auto-recovers. The problem is, when this is happening, there are multiple people on a call. We need to investigate. We need to bring in the VP, the CTO. How much does all of this cost, just for it to turn out to be nothing? This is why the question is not whether multi-region is the right approach or not; it's how to make multi-region more affordable. When the managers understand the problem and take the responsibility, they can decide if the cost is worth the benefit of covering the risk. Because I'm pretty sure that if we have a problem during prime time, the infrastructure costs way less than all the revenue we would lose.

These are the other things we need to think about: how do we make multi-region affordable? Multi-region is more expensive than a single region; you don't need to be a genius to see that. We are using services like databases that need data replication, and all of this costs money. These things are expensive even in a single region: AWS charges you for the data transfer, the networking, and everything. It will always cost money. When I say affordable, I mean the evolution of our services. We started serverless first, with API Gateway, but API Gateway is famous for being way more expensive at scale than an Application Load Balancer. How much? It depends. We got a 90% saving just by switching from API Gateway to Application Load Balancer. We had to implement CORS inside the code, but that is just a bunch of headers, plus compression. Not a big deal.

There are other techniques to make everything cheaper. One of them, for example, where we saw a 60% decrease in cost, is switching between Fargate and Lambda. If you hang around AWS Heroes, or people from AWS, they always tell you Fargate is cheaper than Lambda. It is, if you look at the price list, but it really depends on the scale. From our calculations, under 30 to 50 million requests per day, Fargate is actually more expensive than Lambda. What we are doing is switching traffic. We calculate the current traffic level and the users, we do some calculation about how many requests a task can handle, and we shift the traffic. We always have Lambda ready to take the overflow, because maybe there is a spike, and this spike can push the memory and CPU of the Fargate tasks to a certain level. Instead of scaling up Fargate (which we do, but it can take five or six minutes), in that moment we shift the traffic to Lambda. The Lambda will have some cold starts, maybe, maybe not, but this is how we handle the traffic and reduce the cost. During the night, when nobody is watching TV, we switch off Fargate, Fargate scales to zero, and we move the traffic to Lambda.

There are all these little things that make multi-region affordable. The other one where I really see value is automation. Practically, we track everything, and we try to automate the actions. This means we build an infrastructure that does not need humans. By the time there is an incident, I see the email, I log in, I try to figure out what's happening, and it's gone. This is the key to making it affordable. The bottom line, in the end, is that you are not eliminating the cost, but you are making it reasonable for the protection you gain. This is how I see multi-region nowadays. I hope AWS will give us easier deployments, because my CI pipelines are huge. Imagine that I need to deploy 30 Lambdas: it's one click, but I can see every single route. I wish I could have, for example, a list of regions to apply, and that's it: I click once and it's all done for me automatically.
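
A sketch of what that traffic shifting could look like with the AWS SDK, using an ALB weighted forward action between a Fargate target group and a Lambda target group. The function and its scheduling are illustrative assumptions, not the team's actual automation.

```typescript
import {
  ElasticLoadBalancingV2Client,
  ModifyListenerCommand,
} from '@aws-sdk/client-elastic-load-balancing-v2';

const elbv2 = new ElasticLoadBalancingV2Client({});

// Shift the ALB's weighted forward between the Fargate and Lambda
// target groups; weights are relative, so 90/10 sends ~10% to Lambda.
async function shiftTraffic(
  listenerArn: string,
  fargateTgArn: string,
  lambdaTgArn: string,
  fargateWeight: number, // 0..100; 0 at night lets Fargate scale to zero
) {
  await elbv2.send(new ModifyListenerCommand({
    ListenerArn: listenerArn,
    DefaultActions: [{
      Type: 'forward',
      ForwardConfig: {
        TargetGroups: [
          { TargetGroupArn: fargateTgArn, Weight: fargateWeight },
          { TargetGroupArn: lambdaTgArn, Weight: 100 - fargateWeight },
        ],
      },
    }],
  }));
}

// e.g. an EventBridge-scheduled Lambda could call
// shiftTraffic(listener, fargateTg, lambdaTg, 0) overnight.
```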

Summary

In one and a half years, we practically managed to rewrite all of these services. We did all of these little changes and everything. We are not in code. My ex-team, because I have now moved somewhere else, never had any issue coming from availability or reliability. The bugs we had were only because we humans didn't test correctly: we deployed, everything was working, and I figured out later that there was a bug. We reduced the cost. We made these little changes, and everything just works fine. Again, many people, also in my company, see this as very complicated because they count the components: "You're using EventBridge. You're using S3. You're using this. Three components; I can do everything with a cluster." Serverless just gives us visibility into what you're actually building inside your giant box. They see complexity; I see delegation to AWS. I let others take care of the problems, and allow the team to focus on what really counts, instead of introducing bugs through code.

Questions and Answers

Participant 1: If I understand correctly, when a region fails, should that happen, you redirect the traffic to another region. You also mentioned there is one DB per region. How are you keeping these DBs in sync? Isn't there a risk that they get out of sync?

Daniele Frasca: They do not go out of sync for one simple reason: we have, for example, Kafka, and we have two pipes, so we run the write pipeline at the same time in each region. They are actually storing the same things almost in real time. There is the possibility of failure, but again, if we see some message in the DLQ, there are automatic retries and everything.

Participant 1: All regions?

Daniele Frasca: Yes, they're all exactly the same. We write in each region when we do active-active, or, if we think active-active is not required for that specific service, we use global tables: you write only once, but you read from both.

Renato Losio: I was pretty curious to know how you test regional failure.

Daniele Frasca: Regional failure? AWS gives you the Fault Injection Service. I call it Daniele's chaos monkey engineering. Practically, I go there, stop things, and see what happens. Many things I can catch by myself, with the team, at every stage when we are doing simulations. We do a major simulation every time we ship a big change. During this one and a half years, there were many things we were not covering; when something happened, we always tried to automate, so usually it's all handled immediately. When we are building, I build for failure. This is the thing: I'm not thinking about the green path, I'm thinking about how this can fail, and I try to recreate the issues. Sometimes it's very difficult; maybe we put a delay or a throw inside the code, just for testing purposes. We force this. This is how I usually do it.

Renato Losio: You mentioned that you scale during the night, that switch between ECS, I think, and Lambda, according to the traffic. It's reactive to the traffic. I was wondering if you are sometimes also in the scenario where you predict the peak, where you know there's some big event or something you expect and you need to pre-warm, or do you always just react to the traffic?

Daniele Frasca: For sure, in media, especially when you have live TV, prime time is between 7:00 and 11:00 at night. I do not like the approach many take of spinning up a pile of tasks beforehand and scaling everything up. I think it should be more reactive. This is why we are Lambda first: the configuration of the Application Load Balancer starts at Lambda weight 100%. As time passes and traffic grows, I start shifting. At high traffic, I reach a maximum of 90% Fargate and 10% Lambda; the Lambda is there only for the overflow. If there is a spike, maybe we decrease Fargate, we put it at 70%, and in the meantime Fargate slowly but steadily spins up more tasks while more traffic goes to Lambda. This is how we do it. It's Lambda first, and then we immediately start to shift automatically.

Renato Losio: You don't need to think in advance.

Daniele Frasca: Yes. I do not care if I have one user, or 1,000 users, or 100,000, or 200,000. It's all taken care of automatically. We did some calculations. Practically, our Fargate tasks have something like 4 gigs of memory and two vCPUs. Through load testing, we know we can reach 3,000 to 4,000 concurrent requests per task before CPU and memory start to climb. We don't want to reach 60% or 70%, because by then it would be too late in case of a traffic spike. Like everybody, we leave a little headroom for the overflow. This is how we theoretically catch all of this traffic, no matter how much we have.
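
A sketch of keeping that kind of headroom with ECS target tracking in AWS CDK; the thresholds and capacities below are illustrative assumptions, not the team's actual values.

```typescript
import * as cdk from 'aws-cdk-lib';
import * as ecs from 'aws-cdk-lib/aws-ecs';

// Target tracking keeps average CPU well below the danger zone so the
// slow (minutes-long) Fargate scale-out starts before tasks saturate;
// `service` is assumed to be an existing ecs.FargateService.
function addHeadroomScaling(service: ecs.FargateService) {
  const scaling = service.autoScaleTaskCount({
    minCapacity: 2,
    maxCapacity: 50,
  });

  // Scale out around 50% CPU rather than 60-70%, leaving the Lambda
  // overflow path to absorb spikes while new tasks come up.
  scaling.scaleOnCpuUtilization('CpuHeadroom', {
    targetUtilizationPercent: 50,
    scaleOutCooldown: cdk.Duration.seconds(60),
    scaleInCooldown: cdk.Duration.minutes(5),
  });
}
```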

 


 

Recorded at:

May 11, 2026
