
Taming Large State: Lessons from Building Stream Processing


Summary

Sonali Sharma and Shriya Arora describe how Netflix solved a complex join of two high-volume event streams using Flink. They also talk about managing out-of-order events and processing late-arriving data, exploring keyed state for maintaining large state, fault tolerance of a stateful application, strategies for failure recovery, data validation in batch vs. streaming, and more.

Bio

Sonali Sharma is a Data Engineer on the data personalization team at Netflix. Shriya Arora is a Senior Software Engineer at Netflix.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Sharma: This is a data scientist at Netflix, who's really pepped up about this great idea that he has. He really wants to try it out. I don't know if everybody knows, at Netflix, we are really big on A/B testing. Somebody has a great idea, we like to A/B test it immediately. Everything is set, and an A/B test is launched. This data scientist is very eagerly waiting to look at the data in order to get the first read of the test. A few hours go by after the launch of the test, but he's still waiting for the data to arrive. It's evening now, but still no signs of data. It has been 10 hours since the launch of the test already. It's morning, and now the data scientist is really restless. Nobody wants to be this data scientist. In today's world, we want access to data faster, learn from it, make quick decisions, and move on.

You all must be wondering, who are we and why do we care? I'm Sonali Sharma. Here is my colleague, Shriya Arora. We both work as data engineers in the Netflix data science and engineering org. Our team's charter is to build and maintain data products that drive personalization at Netflix. In the last few weeks, we've been focusing on building a lot of low-latency data pipelines. We care deeply about this problem, because we don't want there to be any data scientists who have to wait for their data this long.

What you'll hear from us in the next 40 minutes is a use case for which we built a stateful streaming app. In doing so, we converted a fairly large batch data processing pipeline into a streaming pipeline. This is not just any streaming pipeline: we introduced a very big state into it, because the idea was to combine the two streams together. There are certain concepts and building blocks which are very specific to streaming applications. We will go through these concepts. As I describe these concepts, I'll try to draw parallels with how each concept applies in the case of batch processing. Next, we'll talk in detail about how we actually implemented this join using a large state. We'll also go through some of the challenges that we encountered in building a low-latency pipeline. There are always challenges, and you always pay a pioneer tax when you process data at this scale.

Use Case for Streaming Pipeline

How many of you here watch Netflix or are Netflix subscribers? That's a lot of people. How many of you have never watched Netflix or don't know what Netflix is? I do see a few hands. Both for the sake of people who are Netflix users, and for people who've never seen Netflix, let's watch this quick video to remind ourselves about what the Netflix UI looks like. For those who have never seen Netflix this will be a good video to know what it is.

What you would have seen are the different ways in which people play their favorite shows on Netflix. As you play these shows you also go through different sections of Netflix just to find out what you would like to play. In doing so, anything that you see on the screen is an impression. A lot of people who are from the ads world would know what impressions are. The concept is the same. You see an entity on screen, and there's a corresponding impression record for it. Every time a user interacts with Netflix, the corresponding interaction events are generated and logged. These interaction events flow through our logging framework and land into data streams.

There was a study done by Sandvine. They reported that Netflix accounts for 15% of the entire internet traffic worldwide. That's a huge number. It's not surprising to us because on a daily basis, we see about 1 trillion events in our data streams. Those streams translate to about 100 PB of data that we store in a cloud-based data warehouse on a daily basis. We are really dealing with a very large scale of data. The majority of these events are playback events or impression events.

Where is all of this data used? There are no surprises here. It gets used heavily for recommendation and personalization. Our algorithms try to select shows that we think the users will like. The algorithms also put these shows into different rows. For example, a "Popular on Netflix" row versus a row of Spanish-language shows. Not just the rows: the order in which these rows appear on the product is also personalized. Apart from that, the order in which these movies and shows appear within each row is also something that the algorithm tries to predict for every user. Another important area where this data is used is for the purpose of artwork personalization. The image that we select to represent a show on the service is personalized for every single user. This personalization is based on the user's viewing history.

Also, an important metric that this personalization model makes use of is something called a take fraction. Let's try to understand what take fraction is. In its simplest terms, a take fraction is the ratio of plays to impressions. In this example, we have three Netflix users, and when they log on to Netflix, say they see these four shows on the screen. User A goes ahead and plays this show, "Luke Cage," whereas users B and C just see it on the screen but decide not to play the show. In this case, our take fraction is one out of three. This is an important metric that we can get by combining the playback data with impression data. We can get access to this metric faster if we do this join in real time instead of doing it in batch. This is the use case that we are trying to solve for in a real-time context.

Making a Case for Streaming ETL

There are a variety of reasons for using stream processing. There is a certain category of problems that are very good candidates for a streaming application. Equally, I would say, there are other sets of problems which should never be solved through a streaming solution. A typical use case for a streaming application is when you want to do real-time reporting. For example, a show just launched on Netflix, it just came out. You want to know within a matter of minutes or hours how that show is performing on the service. For that, you would need to process data in a more real-time fashion. Similarly, you could be parsing or processing some raw logs, and you want to report any errors encountered while parsing those logs in real time, so that feedback can be provided to the teams who are responsible for generating these logs and preventive actions can be taken to mitigate the problem. For that, you could use a streaming application.

The third, rather obvious use case is when you want the models to be trained on fresher data and you want the training to happen more frequently. For that, using the output of a streaming application makes sense because you get the data faster. The fourth category is an interesting one. You might be dealing with a problem where the objective is not to do real-time reporting. You don't want to do any real-time alerting. The objective is not even to train the models faster; it's ok for the models to be trained on data whenever it arrives. But the data that you're dealing with could be of a type where you get tremendous computational gains by processing it through a streaming application rather than processing it in batch. Let me give you a very simple example to explain what I'm trying to say. Say you have a series of numbers flowing in, and all that you want to do is find the min and the max. You can come to this decision by looking at each number in the stream, on the fly. You don't really have to hold all of these numbers in memory just to compute a min or a max. If your data is such that you can process it as it comes, on the fly, streaming is a good fit. That was true in our case, where the playback and the impression happen in close proximity to each other. As we get an impression event, we can also check if we have a corresponding playback event, summarize the data, and then output it. We decided that this is data we would want to have a streaming application for, instead of processing it in batch.
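
To make that min/max example concrete, here is a tiny sketch in plain Java (the numbers are invented) of computing a running min and max one element at a time, without ever buffering the stream:

```java
// Sketch: a running min/max needs only two variables, not the whole stream.
public class RunningMinMax {
    public static void main(String[] args) {
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        long[] incoming = {42, 7, 19, 3, 88, 56}; // stands in for an unbounded stream
        for (long value : incoming) {
            // Update the aggregates on the fly; the raw value is never retained.
            min = Math.min(min, value);
            max = Math.max(max, value);
        }
        System.out.println("min=" + min + ", max=" + max);
    }
}
```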

Our objective was not just to train the models faster; our data was also of a form where we saw computational gains by creating a streaming application for it rather than processing it in batch.

We created a stateful streaming application where we joined impression events with playback events in order to compute a metric like take fraction in real-time. This allowed us to train the models faster. The models didn't have to wait for a day or two for the data to arrive and then process it all together. In doing so, we converted a very large batch data processing pipeline into a real-time stateful streaming application.

There is a fundamental shift in thinking about the data when you process it in batch versus when you process it in real time. There are certain concepts and building blocks which are specific to streaming applications. As we go through these streaming concepts, I'll also try to draw parallels with the corresponding batch concept.

There are a lot of stream processing frameworks out there. We ended up using Apache Flink, because it was more suited for our use case. You'll see that some of the concepts that I describe might be Flink-specific. However, it's important to note that a lot of these concepts are common and shared across all of these stream processing frameworks. One should choose whatever framework works best for their own use case. I also have a link to some previous QCon talks that cover stream processing specifically, and they talk about some of these frameworks as well as the general concepts.

Bounded vs. Unbounded Data

The first thing to understand is the difference between bounded and unbounded data. When you're dealing with data in batch, the data is at rest, and it already has hard boundaries defined. For example, you might want to read one day's worth of data, or one hour's worth of data. You can do that by specifying the boundary as a filter easily. As opposed to that, when you're dealing with data streams, data streams have continuously flowing events. These events are not bounded. If you want to apply any computation over these events, or join them together, or do whatever aggregation logic you need, you first need to define what those boundaries are.

The way you define those boundaries is by using windows. Windows let you put those bounds on the data. Once you have the bounds, you're basically bucketing your data into finite-sized chunks. People who do a lot of batch processing would know that when you join multiple datasets together, a common practice is to first group the individual datasets, and then combine them using a common key. The way this translates into a streaming application is that you have a stream, and then you specify a keyBy clause. In the keyBy clause, you would specify the set of columns or attributes that can be used to group the data. Once you've done that, you would specify the window, which is the bounds over which you want to collect these events. Once you've collected these events together, you can apply your own custom Reduce aggregation logic on top of it. This is first collecting the events and then applying the custom Reduce logic.
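
As a minimal sketch of that shape, not the speakers' actual code, here is what keyBy plus a window plus a custom reduce can look like in Flink's Java DataStream API. The tuple layout (key, event-time millis, count), the key choice, and the 30-minute window are all invented for illustration:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

import java.time.Duration;

public class KeyedWindowSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // (key, eventTimeMillis, count): a stand-in for parsed impression events.
        env.fromElements(
                Tuple3.of("profile1|title42", 1_000L, 1),
                Tuple3.of("profile1|title42", 2_000L, 1),
                Tuple3.of("profile2|title42", 3_000L, 1))
            .assignTimestampsAndWatermarks(
                WatermarkStrategy.<Tuple3<String, Long, Integer>>forBoundedOutOfOrderness(Duration.ofSeconds(10))
                    .withTimestampAssigner((e, ts) -> e.f1))
            .keyBy(e -> e.f0)                                          // the keyBy clause
            .window(TumblingEventTimeWindows.of(Time.minutes(30)))     // the bounds
            .reduce((a, b) -> Tuple3.of(a.f0, Math.max(a.f1, b.f1), a.f2 + b.f2)) // custom reduce
            .print();

        env.execute("keyed-window-sketch");
    }
}
```

The reduce folds each incoming event into a single per-key summary for the window instead of keeping every raw event around.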

A typical join in a streaming application would look something like this. Once you have the output of the grouped data, you can join the two streams together. In the Where clause, you would specify the set of common columns that can act as a key to combine the two data streams. Again, you would specify the window, because remember that streaming data is inherently unbounded: at every stage you're defining the bounds over which you want to do the computation. Then in the apply logic, you can specify whatever further processing you want to do. The first part is similar to the Group By operation that you would do in batch. The second part is similar to the join that you would typically do in batch. In our case, we could not use the window method to define these bounds. Shriya will talk about what we ended up replacing this window operation with. We ended up using some lower-level APIs because they were better suited for our use case.
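
Here is a hedged sketch of that generic windowed join (again, not the pipeline described in the talk, which replaced windows with lower-level APIs). The event layout, keys, and window size are invented:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

import java.time.Duration;

public class WindowJoinSketch {

    // The common key shared by both streams.
    public static class ByKey implements KeySelector<Tuple3<String, Long, Integer>, String> {
        @Override
        public String getKey(Tuple3<String, Long, Integer> e) {
            return e.f0;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        WatermarkStrategy<Tuple3<String, Long, Integer>> wm =
            WatermarkStrategy.<Tuple3<String, Long, Integer>>forBoundedOutOfOrderness(Duration.ofSeconds(10))
                .withTimestampAssigner((e, ts) -> e.f1);

        // (key, eventTimeMillis, count): stand-ins for impression and play events.
        DataStream<Tuple3<String, Long, Integer>> impressions = env
            .fromElements(Tuple3.of("profile1|title42", 1_000L, 1))
            .assignTimestampsAndWatermarks(wm);
        DataStream<Tuple3<String, Long, Integer>> plays = env
            .fromElements(Tuple3.of("profile1|title42", 5_000L, 1))
            .assignTimestampsAndWatermarks(wm);

        impressions.join(plays)
            .where(new ByKey())                                        // key from the first stream
            .equalTo(new ByKey())                                      // matching key from the second
            .window(TumblingEventTimeWindows.of(Time.minutes(30)))     // bounds for the join
            .apply(new JoinFunction<Tuple3<String, Long, Integer>,
                                    Tuple3<String, Long, Integer>, String>() {
                @Override
                public String join(Tuple3<String, Long, Integer> imp,
                                   Tuple3<String, Long, Integer> play) {
                    // Further processing: e.g. emit a per-key plays/impressions summary.
                    return imp.f0 + ": plays=" + play.f2 + ", impressions=" + imp.f2;
                }
            })
            .print();

        env.execute("window-join-sketch");
    }
}
```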

Event Time vs. Processing Time

The next important thing to understand is the difference between event time and processing time. Event time is the time when the event took place in the timeline of the user interaction, the actual time when the event happened. These events flow through different systems before they are made available in the data stream for processing. There could be a lag between when the event actually occurred versus when the event was made available in the data stream. In this case, the event that took place at timestamp 4 came through a few minutes later, and therefore you see a shift in that timeline. A lot of times in a streaming application we process the data by event time, and that is called event time processing. The data itself comes with both timestamps, an event time and a processing timestamp, or we can assign our own.

Another important thing to consider is out-of-order and late-arriving events. In the example that I've been using so far, when somebody plays something on Netflix, before you play, you would have seen that show presented to you on the screen, which is an impression record. In an ideal scenario, you should see an impression first, and then a playback event. In reality, what happens is that the playback events could come before the impression events, or they could come completely out of order. The other thing to consider is that these events pass through different systems, and those systems could introduce lags. It's important to take into account that there are out-of-order and late-arriving events.

The way you handle this in a streaming application is through the notion of a watermark. A watermark is used to keep track of the progress of your application in event time. Watermarks flow as part of the data stream and are represented in the form of a timestamp. What a watermark does is tell the streaming app that we don't expect to see any older events. If a watermark is set at a timestamp T, it's telling the streaming app that we don't expect to receive any more events with a timestamp of less than T. In this case, the watermark that is set to 11 would imply that at this stage, the streaming application does not expect to receive any events with a timestamp of less than 11. There's also a setting called allowed lateness that you can configure, where you can specify that if events are late by up to X amount of time, they should still be let through.
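
A small sketch of how those two knobs are typically expressed in Flink: a watermark that trails the maximum event time seen so far by a bounded out-of-orderness, and an allowed lateness on the window. The 10-minute and 5-minute figures are illustrative, not the talk's settings:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

import java.time.Duration;

public class WatermarkSketch {

    // Events are (key, eventTimeMillis, count), as in the earlier sketches.
    public static DataStream<Tuple3<String, Long, Integer>> withBounds(
            DataStream<Tuple3<String, Long, Integer>> events) {

        return events
            // The watermark trails the max event time seen so far by 10 minutes:
            // "don't expect any more events older than (max event time - 10 min)".
            .assignTimestampsAndWatermarks(
                WatermarkStrategy.<Tuple3<String, Long, Integer>>forBoundedOutOfOrderness(Duration.ofMinutes(10))
                    .withTimestampAssigner((e, ts) -> e.f1))
            .keyBy(e -> e.f0)
            .window(TumblingEventTimeWindows.of(Time.minutes(30)))
            // Events that arrive after the watermark, but within 5 minutes, are still let through.
            .allowedLateness(Time.minutes(5))
            .reduce((a, b) -> Tuple3.of(a.f0, Math.max(a.f1, b.f1), a.f2 + b.f2));
    }
}
```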

Another important thing to consider is how you enrich the data further by using external data sources. In a typical batch pipeline, you often join the data with other dimensional data. You might want to do something similar in the case of a streaming application: as these events are flowing in a stream, you might want to enrich them with some other data. There are a couple of ways of doing that. You can make calls to external services, but one thing to keep in mind is that if you have a large-volume stream, you might end up overwhelming the service that you're calling, which means you should consider either batching these calls or caching the data so that the other services don't get overwhelmed. Alternatively, you could load the data from external static paths. You could load the data from S3 into memory, and then use that to enrich the streams.
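
As one hedged illustration of the "load it into memory" option, here is a sketch of a Flink RichMapFunction that loads a small reference dataset once when the operator starts and then enriches events without any per-event network call. The file path, format, and field meanings are hypothetical:

```java
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.configuration.Configuration;

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class EnrichWithStaticData
        extends RichMapFunction<Tuple3<String, Long, Integer>, Tuple3<String, Long, Integer>> {

    private transient Map<String, String> titleMetadata;

    @Override
    public void open(Configuration parameters) throws Exception {
        // Load reference data once per task when the operator starts
        // (in practice this might come from S3 and be refreshed periodically).
        titleMetadata = new HashMap<>();
        for (String line : Files.readAllLines(Paths.get("/tmp/title_metadata.csv"))) {
            String[] parts = line.split(",", 2);
            if (parts.length == 2) {
                titleMetadata.put(parts[0], parts[1]);
            }
        }
    }

    @Override
    public Tuple3<String, Long, Integer> map(Tuple3<String, Long, Integer> event) {
        // Look up enrichment in memory; no per-event network call.
        String metadata = titleMetadata.getOrDefault(event.f0, "unknown");
        return Tuple3.of(event.f0 + "|" + metadata, event.f1, event.f2);
    }
}
```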

Fault Tolerance

The last thing to consider here is that a batch pipeline runs on a fixed schedule. It would wake up, do some processing, then sleep. Unlike that, streaming applications are long-running applications, which are meant to run forever, which means that they have to be highly fault tolerant. What if the application goes down? You cannot afford to bring it up with a data loss. A checkpoint is what lets you keep a snapshot of the state of the app as of a particular point in time. What it allows you to do is recover the app gracefully from the last checkpoint you took. One important thing to consider with respect to checkpoints is that, based on the size of your app, checkpoints might take some time to finish. You want to allow enough time between two checkpoints so that checkpointing doesn't overwhelm the app. You want to make sure that there are enough pauses, which are represented by the green bar here. The checkpoint interval itself is a combination of the time it takes to checkpoint and the pause between two checkpoints.
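
In Flink, those knobs are part of the checkpoint configuration. A minimal sketch, with purely illustrative intervals rather than recommendations:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.enableCheckpointing(10 * 60 * 1000L);                                // checkpoint every 10 minutes
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(5 * 60 * 1000L); // the pause (the "green bar")
        env.getCheckpointConfig().setCheckpointTimeout(15 * 60 * 1000L);         // give large state time to finish
    }
}
```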

Recap: Concepts and Building Blocks

We saw how streaming data is unbounded, so you need to specify some bounds on it using windows. We talked about what event time processing is. Watermarks can be used to handle out-of-order and late-arriving events. Out-of-orderness is not necessarily a problem in batch because you read everything together. You can always enrich the data by making external calls to services or loading data from other places in memory. Finally, it's very important to make sure that the streaming app is fault tolerant.

Now it's over to Shriya, who will talk about how we actually did the join to combine these two large streaming pipelines.

Arora: Let's go through our specific use case of making this real-time join work in Flink. You'll see some constructs that are specific to Flink. If they are not available in other stream processing frameworks, I'm sure there are parallels or other alternatives.

Data Flow Architecture

This is what our current data flow architecture looks like. Sonali couldn't show you the video before, but we have a stream of impressions coming to us from the Netflix service. We also have a stream of playbacks that the users are playing on the service. These playback and impression streams get from the microservices to our back-end systems via Kafka. From Kafka, we read these events into Flink, and then transform them. I know there's a lot of information in this slide, but I'm going to double-click into all of it. We transform the events and assign them timestamps. After we've assigned timestamps and watermarks to both streams, we key them, and then we summarize them. Finally, we output them to our sinks, which are both a Kafka topic and a Hive table. Let's dig into this diagram and see what everything is doing.

Transformation is all the steps you need to do in your stream processing to make more sense of the data you're getting. In this use case, we are parsing the raw stream, converting it into strongly typed objects that our application will have an easier time dealing with. We are then filtering them, throwing out stuff we don't need. Most importantly, we are assigning them timestamps. What does this mean? This is an event-time stream processing application. For event time, it needs to know the event time, which is a field that comes in on the impressions or plays. It's important to know that this is the time at which the event happened in the user's timeline, not the time that we received the event in our systems.
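
A rough sketch of that read-parse-filter-timestamp chain using Flink's Kafka connector. The broker address, topic name, record format, and the 10-minute out-of-orderness bound are all invented; the real pipeline uses Netflix's own tooling:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.time.Duration;

public class IngestSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<String> source = KafkaSource.<String>builder()
            .setBootstrapServers("broker:9092")
            .setTopics("impressions")
            .setGroupId("impressions-join")
            .setStartingOffsets(OffsetsInitializer.latest())
            .setValueOnlyDeserializer(new SimpleStringSchema())
            .build();

        DataStream<Tuple3<String, Long, Integer>> impressions = env
            .fromSource(source, WatermarkStrategy.noWatermarks(), "impressions")
            // Parse raw lines ("key,eventTimeMillis") into strongly typed records.
            .map(line -> {
                String[] parts = line.split(",");
                return Tuple3.of(parts[0], Long.parseLong(parts[1]), 1);
            })
            .returns(Types.TUPLE(Types.STRING, Types.LONG, Types.INT))
            // Filter out records we don't need.
            .filter(e -> !e.f0.isEmpty())
            // Assign event time: the time the event happened on the device, not arrival time.
            .assignTimestampsAndWatermarks(
                WatermarkStrategy.<Tuple3<String, Long, Integer>>forBoundedOutOfOrderness(Duration.ofMinutes(10))
                    .withTimestampAssigner((e, ts) -> e.f1));

        impressions.print();
        env.execute("ingest-sketch");
    }
}
```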

Let's talk about the keyBy operator. The keyBy operator in Flink will convert a regular incoming data stream into a keyed stream. A keyed stream is nothing but a partitioned stream that contains all the events for a given key. In our application, one keyed stream will contain both impressions and plays as long as they have the same key. The way we generate our keys is by finding fields that occur in both events and that deterministically allow us to join these two events to each other.
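
A small sketch of that idea: a single KeySelector, built from fields assumed to exist on both event types (the field names here are hypothetical), applied to both the impression stream and the play stream so matching events land on the same keyed stream:

```java
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple4;

// Events are (profileId, titleId, requestId, eventTimeMillis); the same selector is used
// on both streams so impressions and plays with matching fields share a key.
public class JoinKey implements KeySelector<Tuple4<String, String, String, Long>, String> {
    @Override
    public String getKey(Tuple4<String, String, String, Long> event) {
        return event.f0 + "|" + event.f1 + "|" + event.f2;
    }
}
```

It would be used as something like impressions.keyBy(new JoinKey()) and plays.keyBy(new JoinKey()).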

Stream Joins in Flink: Maintaining State

Which brings us to my favorite topic, which is maintaining state in a stream processing application. This really is at the heart of the problem. Events that we'll be holding and aggregating need to be maintained in state. What do we mean when we say this is a stateful application or we need to maintain state? This is essentially all the data that we'll be holding in memory while we wait either for time to go by, or for more impressions, more events, or a play to come in. Usually, in our application, the longest we hold these events is while waiting for plays to come. The behavior of users on our service is such that plays are significantly delayed from the impressions that users see, so we need to wait for them for some time. As long as we are accumulating state, we also need to take care of cleaning up the state, because we cannot be holding data in memory forever. This whole data lifecycle of maintaining your state and cleaning up your state is what we mean when we say maintaining the state of an application.

Aggregating Streams: Windows

Why will we need state? We will need state because we will be holding events so that we can aggregate them all together. The out-of-the-box way of doing this in most frameworks, including Flink, is to use windows. There are various kinds of windows available. The one that made more sense for our use case was a sliding time window. What windows do is split the stream into buckets of finite size on which you can then apply aggregations. Note that these windows are applied on top of keyed streams. You already have a group of events which all have the same key and occurred in the time frame you wanted them to. Now you can aggregate them. However, there's a caveat here. The nature of our data is that we will get repeating events for the same key. Repeating is different from duplicate. These are not duplicate events. They are events that have the exact same key and carry a lot of repeating information. They're not the same, but a lot of the information in them is common and can be abstracted out.

This got us thinking: why hold these events in the window when we don't actually need all the information they're carrying? We only need to know how many of them there were, what the common information is, and what the one piece of information is that differs per event. It made us realize that if we summarize these events on the fly, we would save a lot of the state that we'd have to maintain. It actually cuts our state by about one-fourth, because we get two or three repeating impressions for every key and two or three play events for every key.
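
The talk's pipeline ultimately does this summarization in a lower-level process function (described next), but as a general illustration of summarizing on the fly, Flink's AggregateFunction keeps only a small accumulator per key and window instead of buffering every repeating event. Types and fields here are invented:

```java
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;

public class CountPerKey
        implements AggregateFunction<Tuple3<String, Long, Integer>,  // input: (key, eventTime, count)
                                     Tuple2<String, Integer>,        // accumulator: (key, runningCount)
                                     Tuple2<String, Integer>> {      // output

    @Override
    public Tuple2<String, Integer> createAccumulator() {
        return Tuple2.of("", 0);
    }

    @Override
    public Tuple2<String, Integer> add(Tuple3<String, Long, Integer> event,
                                       Tuple2<String, Integer> acc) {
        // Fold the event into the accumulator; the raw event is not retained.
        return Tuple2.of(event.f0, acc.f1 + event.f2);
    }

    @Override
    public Tuple2<String, Integer> getResult(Tuple2<String, Integer> acc) {
        return acc;
    }

    @Override
    public Tuple2<String, Integer> merge(Tuple2<String, Integer> a, Tuple2<String, Integer> b) {
        return Tuple2.of(a.f0.isEmpty() ? b.f0 : a.f0, a.f1 + b.f1);
    }
}
```

It would be plugged in with something like keyedStream.window(...).aggregate(new CountPerKey()).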

Updating State: CoProcess Function

Let me tell you how we actually went about maintaining the state. We couldn't do this through windows, so we used a lower-level API that Flink exposes, which is the process function. However, the downside of using a process function is that you have to manage the whole data lifecycle yourself, unlike windows, where you can just specify parameters and the state is cleaned up by Flink. What do we do? We have a state that we are maintaining. As we get the first impression for a key, we are going to insert it into state. We are not going to insert it in its raw form. We have a composite type that can hold information about both impressions and plays. This helps us maintain only one value for a key at any given point in time. We get this impression. We put it in state. We get another impression for the same key. We're going to append some information. We're going to increment some counters and throw the raw event away. We get a play event for the exact same key. We'll do the same thing. We'll take the information we need from plays, throw away information that was common or repeating, then throw the raw event away. We keep doing this. What this enables us to do is hold only this one composite type, which is a bigger event than the individual raw events, but we are not holding multiple events for the same key. This really helped us cut down our state from about 5 TB to 2 TB. So we keep going, inserting data into our state. At some point, we'll have to clean up the state too.

Let's talk about how we clean up our state. Flink has this concept of a TimerService. This is really the service that's managing your event time or your processing time, depending on which paradigm you're choosing. It's managing the timing of your app. What you have the ability to do is attach timers to every event that you process. With every event that we see in our system, we attach a timer, saying, "For this event, please expire it in 4 hours." When the 4 hours go by, Flink can invoke an onTimer function, which is really nothing but a callback function, and say, "Your timer for this event is up, do what you want to do with this event." That really is at the crux of what we're doing. We get these impressions, and if we find the same key in the state that we had before, we append them to the state and attach a timer to them. While we were updating the state, what's not shown here is that we were also attaching timers so that Flink can call us back when the events have expired. When the events expire, Flink will take that event and send it back to us through the onTimer function. In the onTimer function, we don't actually need to do any aggregation, because we've done all the heavy lifting of summarizing and aggregation when we were updating our state. When we get the event in our onTimer function, we're just like, "This event's time has expired. We're not going to wait any longer for this key to get any more events. We're done waiting and we're going to evict this event." We evict this event and that's that.
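
Here is a compressed sketch of that pattern, not the production code: connect the two keyed streams, keep one composite summary per key in keyed state, register an event-time timer, and emit and evict when the timer fires. The Summary type, its fields, the tuple layout, and the 4-hour expiry are illustrative:

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.util.Collector;

public class ImpressionPlayJoin
        extends KeyedCoProcessFunction<String,                        // key
                                       Tuple3<String, Long, Integer>, // impression: (key, eventTime, count)
                                       Tuple3<String, Long, Integer>, // play:       (key, eventTime, count)
                                       String> {                      // output summary record

    private static final long EXPIRY_MS = 4 * 60 * 60 * 1000L;        // wait up to 4 hours for late plays

    // One composite value per key instead of every raw event.
    public static class Summary {
        public String key;
        public int impressions;
        public int plays;
    }

    private transient ValueState<Summary> summaryState;

    @Override
    public void open(Configuration parameters) {
        summaryState = getRuntimeContext().getState(
            new ValueStateDescriptor<>("impression-play-summary", Summary.class));
    }

    @Override
    public void processElement1(Tuple3<String, Long, Integer> impression,
                                Context ctx, Collector<String> out) throws Exception {
        Summary s = getOrCreate(impression.f0);
        s.impressions += impression.f2;            // fold the raw event into the summary, then drop it
        summaryState.update(s);
        ctx.timerService().registerEventTimeTimer(impression.f1 + EXPIRY_MS);
    }

    @Override
    public void processElement2(Tuple3<String, Long, Integer> play,
                                Context ctx, Collector<String> out) throws Exception {
        Summary s = getOrCreate(play.f0);
        s.plays += play.f2;
        summaryState.update(s);
        ctx.timerService().registerEventTimeTimer(play.f1 + EXPIRY_MS);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
        // No more waiting for this key: emit the summary and evict the state.
        Summary s = summaryState.value();
        if (s != null) {
            out.collect(s.key + " impressions=" + s.impressions + " plays=" + s.plays);
            summaryState.clear();
        }
    }

    private Summary getOrCreate(String key) throws Exception {
        Summary s = summaryState.value();
        if (s == null) {
            s = new Summary();
            s.key = key;
        }
        return s;
    }
}
```

It would be wired in with something like impressions.keyBy(e -> e.f0).connect(plays.keyBy(e -> e.f0)).process(new ImpressionPlayJoin()). For brevity the sketch registers a timer per event; in practice you would track a single timer per key in another piece of state, but that bookkeeping is omitted here.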

Recap

This is the exact same diagram we saw before with all its elements. Once we are done summarizing these elements and they're evicted from onTimer, we write them back into our output Kafka topic. They also get persisted in the final table that's used by data scientists.

Challenges

Here's the fun part. We're done coding. We've written up this app. We're trying to productionize it. It's all great if you can get data fast, but what if we went back to the original data scientist and said, "Now we have your data, but it's not complete"? That's not going to cut it. For event processing or operational insights, the trade-offs around completeness are different. When the data is being used for personalization, completeness really cannot be compromised.

One of the biggest challenges of going from a batch system, which collects data for 24 hours and then processes it, to a streaming application whose window waits for the data for only 4 or 5 hours, is data correctness. How do you do the trade-off between the latency and the completeness of the data? We've chosen to err on the side of completeness. We experimented with different state expiry times, from 4 hours to 5 hours, and landed on a time that gets us to the 99th percentile of what the data would be if we were doing it in batch.

The other really hard problem is duplicates. As amazing as checkpoints are, because they make the system fault tolerant, when the app restarts from a checkpoint, it will reprocess the data it was processing before it could take the next checkpoint. Every time your app restarts, you're going to produce duplicates. That's just the reality. Handling duplicates is on a spectrum. There are various ways of dealing with it. You can dedupe them in your tables as a post-processing step. You can dedupe them in the Flink application by maintaining a state of all the keys you've seen so far and throwing away what's a duplicate. For now we are choosing to dedupe them after we audit them and see if the number of duplicates is higher than the threshold that's acceptable to us. We are thinking about how to do it more intelligently.
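
For the "dedupe inside the Flink application" option mentioned above, a hedged sketch of the usual pattern is a keyed operator that remembers which keys it has already emitted, with a state TTL so that memory stays bounded. The record type, key choice, and 24-hour TTL are illustrative:

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class DedupeByKey extends RichFlatMapFunction<String, String> {

    private transient ValueState<Boolean> seen;

    @Override
    public void open(Configuration parameters) {
        ValueStateDescriptor<Boolean> descriptor =
            new ValueStateDescriptor<>("seen", Boolean.class);
        // Expire "seen" markers after a day so this state stays bounded.
        descriptor.enableTimeToLive(StateTtlConfig.newBuilder(Time.hours(24)).build());
        seen = getRuntimeContext().getState(descriptor);
    }

    @Override
    public void flatMap(String record, Collector<String> out) throws Exception {
        if (seen.value() == null) {        // first time we see this key: emit it
            seen.update(true);
            out.collect(record);
        }                                   // otherwise it's a duplicate: drop it
    }
}
```

It would be applied after a keyBy on whatever field identifies a duplicate, e.g. output.keyBy(r -> r).flatMap(new DedupeByKey()).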

The third challenge we had to face was data validation. Most of our batch data pipelines have audits at the end of the batch processing. These audits can help protect you from writing bad data into your downstream tables. That's because a batch process has a start and an end. A Flink application is running all day long: you could get a small amount of bad data and things could resolve themselves, or something could have gone wrong upstream and you now have a floodgate of bad data coming in. What do you do? How do you react? Do you bring your entire app down? Do you protect your downstream consumers? We have a mixed bag of solutions depending on what data error or audit result we get. It's something to think about. It's something to solve for.

Going from a batch system or a suite of batch pipelines to maintaining a streaming pipeline, the biggest change to your life is the amount of operational overhead you have. One of the things that adds complexity to these operations is not having visibility into event time progression. So much of what this app does, and the correctness of the data it produces, depends on how its internal event clock is moving forward, and you have no visibility into that. The most important thing for you to do is log metrics that will give you insight into what's happening inside your app.

These are two of the dozens of graphs that we have that track the event time progression of our app. The first graph shows the watermark progression. It's very important that your watermark is always moving forward, because if your watermark is either stagnant or dropping, then there's definitely something wrong in your app. Also, remember we talked about how we do upserts and append events to the same key. We are seeing here how impressions are being inserted and evicted. They're completely offset by a fixed time window. This graph can show me that all the events are being evicted at the exact same offset from each other. We don't have wavering expiry for different events.

On the same operational theme, just as it's very important for you to have visibility into the event time progression of your app, it's also really important to have visibility into the state of your app. Having state grow unboundedly can crash your app. Having checkpoints that are too large and take too long to complete can cause instability in your app. Having checkpoints that take up all the disk space in your app is going to cause instability. It's really important for you to be mindful of the state that your app is maintaining for it to be a healthy app. As for logging metrics, we log a lot of metrics around checkpoint size, duration, the number of failed checkpoints, and the amount of disk they're using. That's really important, because remediation is so much easier than recovery in stateful stream processing, so you always want to do what you can to try and detect your errors early.

Which gets me to the hard part, which is data recovery. Data recovery is really the biggest challenge when going from a batch system to a streaming solution. In a batch system, because your data is at rest, your batch process can be rerun and that data isn't going anywhere; you can always fix the code in your batch app and reprocess the data that your app processed. That's not true for data in motion. The data that was there 5 hours ago or 2 days ago simply isn't there in Kafka when you want to reprocess it. Unless, of course, you have a really long retention in Kafka, in which case you could reprocess it. That's not true for most of us, especially not at Netflix volume. We have Kafka topics that have a million events per second. We cannot hold them for an infinitely long time.

Replaying from Kafka can be done trivially if you do have your checkpoints. But assume this is that one day of the month where something has gone wrong in the app. Something is severely wrong: either you've got data corruption, or the app simply cannot recover gracefully from a checkpoint, so you've lost your checkpoint. With the checkpoint gone, you've also lost the offset information in the checkpoint that lets the app know where it was when it was reading the data from Kafka. Not only have you lost the data you were maintaining in your state, you've also lost your position in the queue. However, there is a way to replay from Kafka if the data is still there in Kafka. Replaying from Hive is even harder, but it would be really nice if we had it. Replaying from Hive would make it easier for us to reprocess much older data, and also to simply restate our data if we want to add new features to it. We do, at Netflix, have a way of rereading from Hive for stateless applications. We don't yet have it for stateful applications.

What's our hack/workaround for replaying from Kafka? The Kafka producer we have at Netflix stamps all the events with the wall clock time at which it wrote them into Kafka. Using this time, we can ballpark which events we want to reread, depending on our application outage. If my app went down at T2 and came back up at T7, and there are millions of events in the Kafka topic from the earliest offset, I don't want to be reading all of them and reprocessing all of them. Based on this timestamp that was stamped on the event by the producer, I can read the entire queue and throw away the events that I don't need. This is what we've implemented. We've lost our checkpoint. We have no offset information. We tell Kafka, "We'll read everything from you from earliest and throw away what we don't need."
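
A sketch of what that workaround can look like with Flink's Kafka connector: start from the earliest offset (since the checkpointed offsets are gone) and filter on a producer-stamped wall-clock field. The topic, broker, record format, field position, and outage timestamps are all hypothetical:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ReplaySketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        long outageStartMs = 1_590_000_000_000L;   // T2: when the app went down (illustrative)
        long outageEndMs   = 1_590_018_000_000L;   // T7: when it came back up (illustrative)

        KafkaSource<String> source = KafkaSource.<String>builder()
            .setBootstrapServers("broker:9092")
            .setTopics("impressions")
            .setGroupId("impressions-replay")
            .setStartingOffsets(OffsetsInitializer.earliest())   // no saved offsets, so start from earliest
            .setValueOnlyDeserializer(new SimpleStringSchema())
            .build();

        DataStream<String> replay = env
            .fromSource(source, WatermarkStrategy.noWatermarks(), "replay")
            // Assume each record is "producerWallClockMs,payload"; keep only the outage window.
            .filter(line -> {
                long producedAt = Long.parseLong(line.substring(0, line.indexOf(',')));
                return producedAt >= outageStartMs && producedAt <= outageEndMs;
            });

        replay.print();
        env.execute("replay-sketch");
    }
}
```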

The last challenge I want to talk about is for those of us who are in the cloud and have to face region failovers. The reason region failovers are pesky for event processing apps is that if the app is moving forward based on event time, it's going to stop moving forward when it stops getting events, because your event time only moves forward as it gets the next timestamp from the events you process. If you don't see events, the event clock stops. What do you do? You're going to hold the state forever, because the state won't expire if the watermarks don't move forward. We've built our own custom watermark operator, which will detect if there has been inactivity and we haven't seen any events for a few hours, and then force-advance the watermark to the system time.
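
The operator Netflix built is custom, but here is one hedged way to approximate the behavior with Flink's WatermarkGenerator interface: trail the maximum event time as usual, and if no events have arrived for a configurable idle period, force the watermark forward to system time so timers can still fire and state can be released. The thresholds are invented:

```java
import org.apache.flink.api.common.eventtime.Watermark;
import org.apache.flink.api.common.eventtime.WatermarkGenerator;
import org.apache.flink.api.common.eventtime.WatermarkOutput;

public class IdleAwareWatermarks<T> implements WatermarkGenerator<T> {

    private static final long OUT_OF_ORDERNESS_MS = 10 * 60 * 1000L;
    private static final long IDLE_THRESHOLD_MS = 2 * 60 * 60 * 1000L;

    private long maxEventTime = Long.MIN_VALUE + OUT_OF_ORDERNESS_MS;
    private long lastEventWallClock = System.currentTimeMillis();

    @Override
    public void onEvent(T event, long eventTimestamp, WatermarkOutput output) {
        maxEventTime = Math.max(maxEventTime, eventTimestamp);
        lastEventWallClock = System.currentTimeMillis();
    }

    @Override
    public void onPeriodicEmit(WatermarkOutput output) {
        long now = System.currentTimeMillis();
        if (now - lastEventWallClock > IDLE_THRESHOLD_MS) {
            // No events for a while (e.g. during a region failover):
            // force the event clock forward to system time so state can expire.
            output.emitWatermark(new Watermark(now));
        } else {
            output.emitWatermark(new Watermark(maxEventTime - OUT_OF_ORDERNESS_MS));
        }
    }

    // Hooked in via something like:
    //   WatermarkStrategy.forGenerator(ctx -> new IdleAwareWatermarks<MyEvent>())
    //       .withTimestampAssigner((e, ts) -> e.getEventTime());
}
```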

Challenges We Are Working On

There is still a bunch of stuff we are solving. Replaying and restating data from persistent storage is a challenge that we are working on. We are trying to enable ourselves to recover from a checkpoint even when we have changed the schema of the state; we're working on schema evolution. Not relying on post-processing steps to do deduplication is a challenge we're working on. Auto-scaling is also something we're working on.

What Sparked Our Joy

Finally, for all its trouble and for all the operational overhead, we have seen huge gains from taking this system and migrating it to a streaming pipeline. We are getting data much faster. Data scientists are able to innovate more, which is leading to improved customer experience for Netflix. We are enabling some of our stakeholders to make early decisions. If we are sending them early signals about how a new launch, a new movie, or a TV show is doing on the service, they can make earlier decisions. We are definitely saving on computational costs, because we are not holding all these raw events in memory anymore. We are also actually helping our upstream engineering teams detect their logging errors much faster.

 


 

Recorded at:

Jun 04, 2020
