
Real-Time Machine Learning: Architecture and Challenges



Chip Huyen discusses the value of fresh data as well as different types of architecture and challenges of online prediction.


Chip Huyen is a co-founder of Claypot AI, a platform for real-time machine learning. Previously, she was with Snorkel AI and NVIDIA. She teaches CS 329S: Machine Learning Systems Design at Stanford. She’s the author of the book Designing Machine Learning Systems (O’Reilly, 2022).

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.


Chip Huyen: Recently, my credit card was stolen. I only found out when the bank texted me: it looks like you're trying to spend $300 at a Safeway in San Jose, is that you? It wasn't me. I responded no, and the bank locked my credit card, preventing further transactions with it. If the transaction had gone through, I would have disputed it, and the bank would have refunded me the $300, causing the bank a net loss.

However, the bank was able to prevent this loss because it was doing real-time machine learning: it predicts whether a transaction is fraudulent before the transaction goes through. Credit card fraud prevention is a classic use case of real-time machine learning. Previously, we wouldn't have thought there were many use cases for real-time machine learning. However, earlier this year, the Databricks CEO gave a very interesting interview in which he said that there has been an explosion of machine learning use cases that don't make sense if they aren't in real-time. More people are doing machine learning in production, and most cases have to be streaming. This is a pretty incredible statement from the company that created Spark.

Here are just some of the use cases that I think could benefit a lot from real-time machine learning. Of course, there is fraud detection and prevention. There is also account takeover prevention. When a hacker takes over an account, whether it's a bank account or a social media account, they usually follow some predetermined steps. They might change locations. They might change the email, change the password. If it's a bank account, they might start draining the money, sending it to different places. If it's a social media account, they might use it to send a lot of messages to scam other people. If we can detect those account takeovers early, we can prevent a lot of damage down the line.

There's also the use case of personalization. A website may have a lot of new users, or users who don't log in enough for the site to have enough historical data to make recommendations for them. It's a lot more powerful to recommend items to those users while they're still on the site, rather than the next day after they've already left, because they might never come back to see the recommendations. There are a lot of use cases like that for real-time machine learning, and we're going to go over one of them.

Background & Outline

My name is Chip. I'm a co-founder of Claypot AI, a startup that helps companies do real-time machine learning. I co-founded this company with Xu, who used to be on the Real-Time Data Platform team at Netflix. We have written a lot about the topic. We're going to go over four main topics. The first is latency and staleness. The second, we'll talk about real-time machine learning architecture and different types of features used for real-time machine learning. Then, we're going to talk about a huge challenge of real-time machine learning, which is the train-predict inconsistency. Last, we're going to talk about how to truly unify stream and batch computation.

Latency vs. Staleness

The first topic is latency and staleness. We have talked a lot about how latency matters. Users just don't want to wait, no matter how good the machine learning models are. If a model takes even a few extra milliseconds to make a prediction, users might not wait and might go do something else. However, I don't think we have talked enough about staleness. Let's go over an example to see what staleness means.

Let's go back to the credit card fraud prevention example, and say we're using a very simple feature to detect whether a transaction is fraudulent: given a transaction with credit card x, count the number of transactions this credit card has made in the last hour. This is a pretty realistic feature, because when a thief gets hold of a credit card, they might want to use it for a lot of transactions to maximize profit. If a credit card has not been used much over the last two years, and suddenly over the last hour it's been used for a lot of transactions, that might be a sign of fraud. Latency is usually defined as the time from when the prediction request is initiated until the prediction is received.

Say we have transactions with this credit card at different timestamps. When transaction 4 is initiated, we want to predict whether it is fraudulent. The time from that request to the prediction may be 100 milliseconds. Let's say that for this prediction, we use a feature that was computed at 10:45. At 10:45, the number of transactions this credit card has seen is only two. By the time the feature is used, the value is stale, and the staleness is caused by the delay from the time the feature was computed until the time it was used. Staleness is caused by delay, and the higher the delay, the more likely the feature will be stale.
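A small sketch can make the distinction concrete. The timestamps below are made up for illustration; the helper simply counts a card's transactions in the trailing hour, once at feature-computation time and once at prediction time.

```python
from datetime import datetime, timedelta

# Made-up transaction timestamps for credit card x.
txns = [datetime(2023, 1, 1, 9, 30),
        datetime(2023, 1, 1, 10, 10),
        datetime(2023, 1, 1, 10, 40),
        datetime(2023, 1, 1, 10, 52)]

def txn_count_last_hour(transactions, as_of):
    """Count the card's transactions in the hour ending at `as_of`."""
    return sum(1 for t in transactions if as_of - timedelta(hours=1) <= t <= as_of)

computed_at = datetime(2023, 1, 1, 10, 45)  # when the feature was materialized
used_at     = datetime(2023, 1, 1, 10, 55)  # when the prediction request arrives

# Only the first three transactions existed when the feature was computed.
stale_value = txn_count_last_hour(txns[:3], computed_at)  # -> 2
fresh_value = txn_count_last_hour(txns, used_at)          # -> 3
staleness   = used_at - computed_at                       # 10 minutes of delay
```

The prediction for the 10:55 request is made with a count of 2 when the true trailing-hour count is 3; that 10-minute delay is exactly the staleness.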

Staleness has quite an effect on model performance. There's a very interesting post from LinkedIn showing that for their job recommender model, if a feature is stale by an hour instead of by a minute, model performance degrades by 3.5%, which is a non-trivial amount. In other use cases too, we can see very clearly that feature staleness can cause a huge drop in performance.

However, staleness tolerance is very much feature dependent and also model dependent. Say, if you have features like the number of transactions over the last two years, then the feature staleness by a day is probably not a big deal. However, if you want to compute the number of transactions over the last hour, then staleness by 10 minutes is quite significant.

Real-Time ML Architecture

We have talked about latency and staleness. Now let's go over how latency and staleness look for the different types of features used in real-time machine learning. First, the classic type of feature you might use for an online prediction model is batch features. These features are generated in batch processes. You might have something like an Airflow job scheduled once a day, or once an hour, to compute these features from a data warehouse.

Then these features are made available in a key-value store like Redis or DynamoDB, for low latency access by the prediction service. This kind of computation helps with latency, because at prediction time we don't have to wait for the feature to be computed anymore; we just need to fetch the previously pre-computed features. However, these features may have very high levels of staleness, because we first have to wait for the data to arrive at the data warehouse and then wait for the job to be kicked off. The staleness can be hours, or even days. To overcome staleness, a lot of companies are moving towards real-time features. Real-time features mean that you might have a database like Postgres or Bigtable, and at prediction time you write a SQL query to extract features from this database.
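A minimal sketch of the batch pattern, using plain Python stand-ins (a list for the warehouse table, a dict for the key-value store; the rows and key format are made up):

```python
from collections import Counter

# Stand-in for a data warehouse table of transactions; in production this
# would be the result of a scheduled query against the warehouse.
warehouse_rows = [
    {"card_id": "x", "amount": 300},
    {"card_id": "x", "amount": 20},
    {"card_id": "y", "amount": 5},
]

feature_store = {}  # stand-in for Redis/DynamoDB

def batch_feature_job():
    """Run on a schedule (e.g. hourly via Airflow): recompute and overwrite."""
    counts = Counter(row["card_id"] for row in warehouse_rows)
    for card_id, n in counts.items():
        feature_store[f"txn_count:{card_id}"] = n

def predict(card_id):
    # Prediction time is just a key lookup: low latency, but the value can be
    # as stale as the batch schedule plus the warehouse ingestion delay.
    return feature_store.get(f"txn_count:{card_id}", 0)

batch_feature_job()
predict("x")  # -> 2
```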

This type of feature has very low staleness, because the value is computed as soon as the prediction request arrives. However, it has high latency, because the feature computation time now contributes directly to the prediction latency.
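To make the on-demand pattern concrete, here's a sketch using sqlite3 as a stand-in for an operational database like Postgres (the schema and timestamps are made up):

```python
import sqlite3

# In-memory sqlite3 standing in for a database like Postgres or Bigtable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (card_id TEXT, ts TEXT)")
conn.executemany("INSERT INTO transactions VALUES (?, ?)",
                 [("x", "2023-01-01 10:10"), ("x", "2023-01-01 10:40")])

def txn_count_last_hour(card_id, now):
    # Computed on demand at prediction time: fresh, but this query's runtime
    # is added directly to the prediction latency.
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM transactions "
        "WHERE card_id = ? AND ts >= datetime(?, '-1 hour')",
        (card_id, now)).fetchone()
    return count

txn_count_last_hour("x", "2023-01-01 10:45")  # -> 2
```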

There are also other challenges with real-time features. First, because the computation contributes directly to the prediction latency, it's hard to use complex SQL queries: the computation time would be too long, and users just don't want to wait. Second, because this computation is done on demand, it can be very hard to scale the service when you have an unpredictable amount of traffic. It is also very much dependent on the database service's availability.

Another point that is a little more subtle is that it's very hard to architect these real-time feature computations to be stateful. Broadly, what does stateful computation mean? Say we want to compute the number of transactions a credit card has made up to a certain point in time. If we compute this feature at 10:30 a.m., we go through all transactions from the beginning of time to 10:30 a.m. Now say we want to compute the same value at 10:33 a.m. Stateless computation means we again have to go through all transactions from the beginning of time up until 10:33, which means the computation from the beginning of time to 10:30 a.m. is repeated and redundant. Stateful computation means we can just continue counting from the last time the feature was computed, which was 10:30 a.m.

In theory, we can probably add some state management to make the computation stateful. For example, we might keep a count of the transactions for each credit card; whenever we see a new transaction by a credit card, we just increment that count by one. However, this makes the assumption that all the transactions arrive in order: that every transaction seen later actually happened later. I know it's a little bit convoluted.

Say a transaction originally happened at 10:30 a.m., but because of some delay in the system it is only processed at 10:45 a.m., which means the transaction arrives late. When we receive late arrivals, we have to go back and update the previous counts, which can be quite tricky. Streaming computation engines like Flink actually deal with that really well. If you use a stream computation engine, also called a near real-time computation engine, you can avoid handling this out-of-order challenge yourself.
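As an illustration of the bookkeeping an engine like Flink automates, here's a toy counter over 15-minute tumbling windows, keyed by event time (timestamps in seconds, all values made up). A late event is assigned to the window it happened in, so an earlier window's count changes after the fact:

```python
from collections import defaultdict

WINDOW = 15 * 60  # 15-minute tumbling windows, in seconds
window_counts = defaultdict(int)  # (card_id, window_start) -> count

def on_event(card_id, event_ts):
    # Key by *event time*, not arrival time, so a late event updates the
    # window in which it actually occurred.
    window_start = event_ts - (event_ts % WINDOW)
    window_counts[(card_id, window_start)] += 1

on_event("x", 37_800)  # happened 10:30, arrives on time
on_event("x", 38_700)  # happened 10:45, arrives on time
on_event("x", 37_900)  # happened 10:31, arrives late, after 10:45

# The 10:30 window now holds 2 events. Any feature value emitted for that
# window before the late arrival was wrong and must be corrected downstream.
```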

We talked about how batch features have low latency but high staleness, and real-time features have low staleness but high latency. Near real-time features can hit a pretty sweet spot: they can be both low latency and low staleness. What that means is that instead of using a database to record transactions, you might use a real-time transport like Kafka or Kinesis. And instead of computing these features at prediction time, we pre-compute them.

The difference compared to batch features is that instead of computing these features off a data warehouse, you compute them while the data is still in the real-time transport. And instead of computing them once a day or once an hour, you might compute them once every minute, or once every 10 minutes, or you can even trigger the computation on an event: when you see a new transaction by credit card x, you update the count for credit card x. Similar to batch features, you store the pre-computed near real-time features in a key-value store so they can be accessed by the prediction service at prediction time; you just fetch the latest value of the feature from the key-value store.
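A minimal sketch of this event-triggered pattern, with a plain list standing in for the Kafka/Kinesis log and a dict for the key-value store (the events and key names are made up):

```python
# Stand-ins: a list for the event log (Kafka/Kinesis) and a dict for the
# key-value store (Redis/DynamoDB).
event_log = [("x", 300.0), ("y", 5.0), ("x", 20.0)]  # (card_id, amount)
feature_store = {}

def consume(events):
    # Event-triggered update: each transaction bumps the count immediately,
    # so staleness is bounded by processing delay, not by a batch schedule.
    for card_id, _amount in events:
        key = f"txn_count:{card_id}"
        feature_store[key] = feature_store.get(key, 0) + 1

def predict(card_id):
    # Low latency (one lookup) *and* low staleness (updated per event).
    return feature_store.get(f"txn_count:{card_id}", 0)

consume(event_log)
predict("x")  # -> 2
```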

You might say, ok, so near real-time features have low latency and low staleness; what's the catch? The catch I've heard is that people find it very difficult to set up and maintain a near real-time pipeline. However, I do believe that as the technology progresses, it's going to get easier. I hope that building a near real-time pipeline will become just as easy as building a batch pipeline.

Train-Predict Inconsistency

The next topic is a very big challenge that people talk about a lot in real-time machine learning: the train-predict inconsistency. During training, data scientists might create features in Python, using batch processing engines like Pandas, Dask, or Spark, to compute the features off a data warehouse. At prediction time, however, the model may need these features computed through stream processing on a near real-time transport. There is now a discrepancy: one model, but two different pipelines. This discrepancy can cause a lot of headaches.

We've seen a lot of models that do really well during training and evaluation but just don't do well in production. Currently, a lot of companies follow what we call the development-first machine learning workflow. In this workflow, data scientists create a model with batch features, and they train and evaluate this model offline. For production, data engineers or machine learning engineers then translate these features to the online pipeline. The bottleneck here is the translation process.

We have talked to quite a few companies for which this process can take weeks, even months. One company in particular told me it takes them a quarter for data scientists to add a new streaming feature. This process is so painful that data scientists just resort to using batch features instead. They might have found that a streaming feature would be really useful for a model, but they just don't want to wait. Why waste an entire quarter working on one streaming feature when they can work on 10 different batch features, even though those batch features may not be optimal and may perform worse? We hope that as the process of adding streaming features becomes easier, data scientists will be more inclined to use them.

Another approach that companies use is log-and-wait. In this approach, data scientists create new online features. Then, data engineers deploy these features on incoming data and log the computed values. You then wait until there are enough computed values to train the model with. This is a pretty good workaround because it's easy to implement: you just need to wait, without having to do anything. The challenge, of course, is that you have to wait. If you want three weeks of training data, you have to wait three weeks. And if you collect the data and then realize it's not exactly the feature you want and you want to change it a little, you have to repeat the process again, which can be painful.

An approach that we have found to be very promising is backfilling. In this process, data scientists also create new online features. But instead of deploying these features and logging-and-waiting for the values, you backfill these features on historical data to generate training data. I say historical data here to encompass all the data that has been generated, whether it's already in the data warehouse or still in a real-time transport like Kafka or Kinesis. We'd like backfilling to look like this: data scientists write feature definitions in Python or SQL.

I know Python is the de facto language for data scientists, but we've seen more people OK with SQL. Then, in production, to make predictions, you use a stream processing engine like Flink or Spark Streaming to compute the features. To generate training data and train the model with these features, you use a backfilling process. Note that this is the other way around from before: you create a feature and train a model with it first, before deploying that feature to production.
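A sketch of the idea, under the assumption that the same feature definition can be evaluated over stored history; the function name, data, and timestamps below are illustrative, not a real API:

```python
from datetime import datetime, timedelta

# Made-up historical transactions: card_id -> sorted event timestamps.
historical_txns = {
    "x": [datetime(2023, 1, 1, 9, 30), datetime(2023, 1, 1, 10, 10)],
}

def txn_count_last_hour(card_id, as_of):
    """One feature definition, shared by the backfill and the online pipeline."""
    lo = as_of - timedelta(hours=1)
    return sum(1 for t in historical_txns.get(card_id, []) if lo <= t <= as_of)

# Backfill: evaluate the feature at each labeled prediction time in the past,
# instead of logging-and-waiting for weeks of fresh values.
training_rows = [(ts, txn_count_last_hour("x", ts))
                 for ts in (datetime(2023, 1, 1, 10, 0),
                            datetime(2023, 1, 1, 10, 30))]
# -> counts of 1 at 10:00 and 2 at 10:30
```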

Backfilling can be quite tricky. Backfilling is not a new process for batch data; it typically just means retrospectively processing historical data with new logic. The new logic in this case is a new feature. However, backfilling can be quite challenging when you have data in both batch and stream sources, and when you care about time semantics. Time semantics means a feature is computed over a time window, like the last hour or the last day.

Time semantics force the backfilling process to care about point-in-time correctness. If a transaction happened at 10:30 a.m. yesterday, you want to count the number of transactions that happened up until exactly 10:30 a.m., because counting past that point would leak data from the future. To backfill, we need to be able to reconstruct the data at different points in the past. For some companies without a properly set-up data platform, that can be impossible. Say you do product recommendations, and one feature you use to make predictions is the price of the product, which can change over time.

If the platform overwrites the old price with the new price in place (you have a product table with a column for the price, and whenever the price changes you overwrite it directly), it may be impossible to figure out what the price of a product was in the past. We have talked to several companies that say they want to do this wonderful new real-time machine learning stuff, but then realize the first thing they have to do is update their data platform so that it can handle time semantics.
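One way to make the price example concrete: keep an append-only history of price changes instead of overwriting in place, so the backfill can ask for the price as of any past timestamp (the dates and prices below are made up):

```python
from datetime import datetime

# Append-only price history: (effective_from, price), sorted by date.
# Overwriting a single price column in place would destroy this information.
price_history = [
    (datetime(2023, 1, 1), 5.00),
    (datetime(2023, 3, 1), 7.50),  # price raised later
]

def price_as_of(ts):
    """Latest price effective at `ts`; None if the product didn't exist yet."""
    price = None
    for effective_from, p in price_history:
        if effective_from <= ts:
            price = p
    return price

# Point-in-time correctness: a training example for a prediction made on
# Feb 1 must use 5.00; using 7.50 would leak future data into training.
price_as_of(datetime(2023, 2, 1))  # -> 5.0
```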

Another challenge is features that combine data from multiple sources. Say you want to count the number of transactions over the last three days. You might use Kinesis, which has a limit on data retention: Kinesis only lets you retain data for seven days. The same data is made available in the data warehouse after some delay, say after a day.

If the current time is T, then you have data from T minus 7 days to now in the stream, and all the data up to T minus 1 day in the data warehouse. How can you construct features that combine data from both sources? There's a backfilling challenge here, because we have new features, and we want to compute them over historical data.

Historical data comprises the data in both the stream and the data warehouse. One approach is stateless backfilling, in which the batch pipeline and the stream pipeline share no state at all. You might construct two features: one counting the transactions over the last day from the stream (a near real-time feature), and another counting the earlier part of the window from the data warehouse (a batch feature). It can work. However, it's up to the users to maintain consistency between these two features and make sure the counts match.
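A toy version of this stateless stitching, with plain lists standing in for the two sources (all timestamps made up). The cutover point is where the warehouse's data ends; note that one event deliberately appears in both sources, as real overlapping retention windows would produce:

```python
from datetime import datetime, timedelta

now = datetime(2023, 1, 8)
cutover = now - timedelta(days=1)  # warehouse is complete only up to here

# Stand-ins for the two sources. The stream keeps roughly the last 7 days;
# the warehouse has everything up to ~1 day ago, so the ranges overlap.
stream_events = [now - timedelta(hours=h) for h in (2, 10, 30)]
warehouse_events = [now - timedelta(days=d) for d in (0.5, 1.25, 2, 5)]

def txn_count_last_3_days():
    lo = now - timedelta(days=3)
    # Warehouse rows strictly before the cutover, stream rows at or after it,
    # so the event present in both sources (30h ago = 1.25 days ago) is not
    # double-counted. Keeping these two halves consistent is the user's
    # burden in the stateless approach.
    from_wh = sum(1 for t in warehouse_events if lo <= t < cutover)
    from_stream = sum(1 for t in stream_events if t >= cutover)
    return from_wh + from_stream

txn_count_last_3_days()  # -> 4
```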

Another approach, which is a lot harder to do, is stateful backfilling, where the batch pipeline and the stream pipeline share state. In this approach, the user just says: give me the count of transactions over the last 3 days. The infrastructure figures out that for the last 3 days it has to go into the data warehouse first; it counts the transactions in the data warehouse, and when it runs out of transactions there, it cuts over and continues counting from the stream. It's pretty cool; however, it's pretty challenging to do.

There are certain technologies to help with that, like Apache Iceberg, created at Netflix. Flink is also working on a hybrid source that will let you consume data from both stream and batch sources. However, it's still a work in progress. I'm very excited to see more progress in this area.

Unify Stream + Batch Computation

The last topic is unifying stream and batch computation. This is a very interesting topic; it's more of a thought experiment than an actual solution that I see right now. A lot of companies have talked about unifying stream and batch. However, what I think they actually mean is making the stream API look like the batch API. Take Spark and Spark Streaming: the stream API is made to look very similar to the traditional Spark API. As a data scientist, you still have to care about whether a transformation is stream or batch, whether it's applied to data in a stream or to data in a data warehouse. I've been thinking about this: as a data scientist, I actually don't care where the data is. Say we want to count the number of transactions; all I care about is the transactions.

I don't care whether those transactions are in a stream or currently in Snowflake. I just want to write the transformation, and I want my infrastructure and tooling to figure it out for me: you want transactions over the last six months, so I'll compute that off Snowflake for you; you want transactions over the last hour, so I'll compute that off the stream for you. That is my dream: what I think of as a more data scientist friendly API.

You would just define the transformation, here the transaction count, and it would be computed on the transaction data itself. That transaction data can be spread across multiple sources. You could define an arbitrary window size, without having to worry about whether the window stretches across different sources.
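A purely hypothetical sketch of what such an API could look like; nothing here is a real library, and the planner is a toy that routes each window by comparing it to the stream's retention:

```python
from dataclasses import dataclass

@dataclass
class Feature:
    source: str  # a *logical* dataset, e.g. "transactions", not a physical store
    agg: str
    window: str

    def plan(self, stream_retention_days: float = 7) -> str:
        """Toy planner: short windows come from the stream, long ones from the
        warehouse. A real engine would also stitch windows spanning both."""
        days = {"1h": 1 / 24, "1d": 1, "6mo": 180}[self.window]
        return "stream" if days <= stream_retention_days else "warehouse"

# The data scientist declares *what* to compute; the engine decides *where*.
txn_count_hour = Feature("transactions", "count", "1h")
txn_count_6mo = Feature("transactions", "count", "6mo")
txn_count_hour.plan()  # -> "stream"
txn_count_6mo.plan()   # -> "warehouse"
```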

Key Takeaways

The first takeaway is that latency matters, and so does feature freshness. We talk a lot about latency but not enough about feature freshness, or staleness. We also talked about the different types of features used in real-time machine learning. Not all of those features require streaming; however, streaming can help a lot with balancing latency against staleness, and can also help a lot at scale. We also talked about the train-predict inconsistency, and how backfilling can be very helpful for ensuring train-predict consistency. The last point is more futuristic: I really hope we can have an abstraction so that data scientists can work with data directly, without having to worry about whether the data is in a stream or batch source. That's what I think of as a truly unified stream and batch API.

Impact on Model Performance, When Offline and Online Data is Combined

You're asking about how, if we combine offline and online data, the time to access offline data might be unpredictable. That applies to backfilling, when we have new logic that has to look back in time and recompute over data that already exists. In production, we are not going to go back; we just keep computing in real-time on incoming data.


Hien Luu: With this kind of thing, would this drastically impact the model performance?

Huyen: I don't think it will. As mentioned, crossing data boundaries only happens when you do backfilling. That's for when you want to create data for training, and yes, that might take a little while, but that's true for a lot of data processing. In production, when you serve, you keep computing on incoming data only, so it shouldn't affect performance.

Emerging and New Use Cases

Luu: Are there any emerging new use cases you've become aware of since you started on this journey that would be interesting to share?

Huyen: There have been a lot of interesting use cases. There are the classical use cases we talk about, like fraud detection, or recommender systems and search. What we've seen a lot recently is account takeover, and especially during COVID what we call claim takeover: when the government started giving out a lot of unemployment benefits, someone who takes over an identity or a bank account can use it to claim the benefits and then disappear. We actually see quite a bit of that.

Another use case we've seen more of is dynamic pricing. I think it's relevant for a lot of use cases, especially where we see prices change very frequently [inaudible 00:29:01]. For example, people selling things on Amazon. If Amazon is selling something for 5 bucks, you might have a lot of merchants selling the same product, but Amazon only shows certain merchants. As a merchant of that product, you want your offering shown there instead of other merchants', so it's very important to be able to set your price dynamically so that you can win the slot in the Buy Box [inaudible 00:29:34] for that product.

Of course, a lot of that also applies to ride sharing like Uber and Lyft, and to any product where you want to determine the price at the time of query, at the time of sale, instead of predetermining the price. And when thinking about pricing, it's not just the intrinsic price but also the discounts or promotions you might want to give each visitor to encourage them to buy.

Many sellers would still prefer to make a sale at a discount than not make a sale at all, so being able to dynamically determine that price is very important. We also see a lot of use cases in crypto. With dynamically changing prices, another use case is predicting the best time to buy something. One example is from Ethereum: there's a gas fee, and it can change quite fast. When the gas fee is high, transactions cost a lot of money, so you want to predict the optimal time to transact. We've seen a lot more use cases like these.

Successful Use Cases, with Real-Time ML

Luu: There are a lot of interesting use cases that leverage this real-time information to make predictions. It takes a bit of work to set this up successfully and run it end-to-end. Have you seen any companies out there that are successful at doing this?

Huyen: I think DoorDash has been doing that pretty well. I also read a post recently by Instacart on their journey from batch machine learning to real-time machine learning. They divide it into two phases. The first phase is moving the prediction service: where they used to do batch predictions, they built an online prediction service that can scale and serve requests in real-time. That's what you see a lot of companies do initially.

After that, they moved from batch feature computation to online feature computation. Online feature computation includes both on-demand, real-time feature computation and near real-time feature computation. That second phase, moving feature computation from batch to online, is the one we see companies have a lot of trouble with. The reason is that the prediction service can be built without touching the user experience much: you can still train a model and still deploy it, and it's mostly infrastructure scaling. For feature engineering, you need to create an interface for data scientists to be able to build and serve the features themselves.

A lot of the frameworks for computing things in near real-time in online environments are JVM-based, in Java or Scala. It's pretty hard for a data scientist to use a Scala or Java interface directly. Data sources are an issue as well: if there's no good way for data scientists to explore and discover what data sources are available and what features have already been computed, it's very hard for them to quickly create features. We see that as quite challenging. Of course, there's also the question of cost. We talk to a lot of companies doing real-time machine learning, and their infra cost is quite high, with a lot of it in feature computation.

How about you, what have you seen? You have done quite a lot of work with both LinkedIn and DoorDash.

Luu: Certainly, what you have discussed reflects the experience I've seen at LinkedIn and DoorDash. It does require massive infrastructure to make all this very simple for data scientists. Once you have it in place, you can unlock a lot of use cases that they will love to start experimenting with. It's kind of a flywheel. Initially, there's a lot of friction. Once the infrastructure is available, it will unlock people's imagination in terms of how they want to use these real-time features for all kinds of use cases they may not have thought of before. This area is challenging because of that. My sense is that the long-term benefits will be tremendous, if and when this infrastructure is available.

Cost and Benefit Tradeoff, from Migrating to Real-Time

Huyen: We talk to some companies that see a lot of other companies getting a huge return from investing in infrastructure to move to real-time. How do they know that their own company will also benefit from it, given that the upfront investment is not something everyone can afford?

Luu: I think it's all about taking baby steps. It also depends on where the company is in building its product. If they have a lot of low-hanging fruit, they can start with batch, get that done, and make sure it's solid and working. Once you get to the point where you start thinking, from the customer experience perspective, about what additional capability or experience the company could unlock if such real-time features were available, then it becomes an easy conversation to invest.

Huyen: Some companies we talk to have successfully made the transition. They actually mention that the operational cost is lower after the transition, because now they only have one unified system to maintain instead of separate batch and streaming pipelines.

Luu: Different companies are on a slightly different journey in what phase they are at in terms of incorporating ML into their company, and the expertise that is needed as well. Hopefully, this will become easier as time goes on with technologies from your company, or other similar companies to make it easier.

Huyen: We see a lot of work in this space, and not just from our company. A few days ago, we were looking at Facebook; they have built a lot of things in-house to unify stream and batch and compute things in real-time with a Pandas-like interface. A lot of other companies are doing the same. I expect that in the next couple of years, the landscape is going to be very different.




Recorded at:

Aug 15, 2023