
Chip Huyen on Streaming-First Infrastructure for Real-Time ML


At the recent QCon Plus online conference, Chip Huyen gave a talk on continual machine learning titled "Streaming-First Infrastructure for Real-Time ML." Some key takeaways included the advantages of a streaming-first infrastructure for real-time and continual machine learning, the benefits of real-time ML, and the challenges of implementing real-time ML.

Huyen, a machine-learning startup founder and lecturer at Stanford, began by defining two aspects of real-time ML: online prediction, using a trained model to quickly produce an output given some input data; and continual learning, updating a trained model with recently observed data. She then described an event-driven microservice architecture and showed how the centralized streaming component can be used as the foundation of a "streaming-first" real-time ML system; in particular, she noted that traditional batch processing in ML is a special case of streaming, and that the length of the iteration cycle of model updates is simply "a knob to turn" in the system. Finally, she covered several advantages of real-time ML, and challenges that organizations may face when implementing it. According to Huyen,

It's important to make a bet on the future. You can make a lot of incremental updates, but maybe it's cheaper to make a big jump to streaming. Many companies are moving to streaming-first because their metric increase has plateaued and they know that for a big metric win they need to try new technology.

Huyen began her discussion of online prediction by noting that it is usually "straightforward" to get an initial deployment, since many platforms, for example AWS, have standard services to deploy a trained model and create a web service endpoint for invoking prediction or inference. The problem then is managing latency: website visitors can notice even a few milliseconds of extra latency. While there are methods for improving the inference latency of large models, such as quantization or distillation, Huyen focused her talk on improving the pipeline for feeding data into the model and returning results.
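The shape of such a prediction endpoint can be sketched as below. This is a toy illustration, not code from the talk: fetch_features and model_predict are hypothetical stand-ins for a real feature-store lookup and model inference call, and the "store" is a plain dict rather than a low-latency system such as Redis.

```python
import time

# Toy in-memory feature store (hypothetical data for illustration).
FEATURE_CACHE = {"user_42": [0.1, 0.9, 0.3]}

def fetch_features(user_id):
    # In production this would be a low-latency store lookup;
    # here it is a dict access.
    return FEATURE_CACHE.get(user_id, [0.0, 0.0, 0.0])

def model_predict(features):
    # Stand-in for real model inference: a fixed weighted sum.
    weights = [0.5, 0.3, 0.2]
    return sum(w * x for w, x in zip(weights, features))

def predict_endpoint(user_id):
    # End-to-end request path: fetch features, run inference,
    # and report the latency the caller would observe.
    start = time.perf_counter()
    score = model_predict(fetch_features(user_id))
    latency_ms = (time.perf_counter() - start) * 1000
    return {"score": score, "latency_ms": latency_ms}
```

The point of the sketch is that end-to-end latency includes the feature fetch, which is why Huyen focuses on the data pipeline rather than only on model optimizations.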

Huyen noted that input features for prediction often include data about recent events. These events, she suggested, are best kept in fast in-memory storage. She drew a distinction between these dynamic streaming events and static data. Static data has already been generated and does not change, or changes very slowly, such as a user's email address. Static data is bounded, is often stored in a file format such as CSV or Parquet, and can be processed by batch jobs which will eventually complete. Streaming data, by contrast, represents values that change frequently, such as a mobile device's physical location. In particular, streaming data is unbounded, and requires specialized stream processing instead of batch processing.
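The bounded/unbounded distinction can be made concrete with a small sketch (my illustration, not from the talk): a batch job consumes a finite list and terminates, while a stream processor consumes an open-ended iterator and must emit incremental results, for example over a sliding window.

```python
from collections import deque

def batch_mean(records):
    # Bounded data: the job reads everything and terminates.
    return sum(records) / len(records)

def stream_mean(events, window=3):
    # Unbounded data: the job never "finishes", so it yields a
    # rolling mean over the last `window` events as they arrive.
    buf = deque(maxlen=window)
    for e in events:
        buf.append(e)
        yield sum(buf) / len(buf)
```

Here `events` could be any iterator, including one that never ends; the stream version produces an answer after every event instead of one answer at the end.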

One Model, Two Pipelines

Moving prediction to streaming processing while leaving model training as a batch process, however, means that there are now two data pipelines which must be maintained, and this is a common source of errors. Huyen pointed out that because static data is bounded, it can be considered a subset of streaming data, and so can also be handled by streaming processing; thus the two pipelines can be unified. To support this, she recommended an event-driven microservice architecture, where instead of using REST API calls to communicate, microservices use a centralized event bus or stream to send and receive messages.
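A minimal sketch of that idea (an assumption-laden toy, not the talk's implementation) is an in-process event bus where the prediction pipeline and the training pipeline subscribe to the same topic, so both consume one source of truth instead of maintaining two separate data paths. In practice the bus would be a durable log such as Kafka.

```python
from collections import defaultdict

class EventBus:
    # Toy stand-in for a centralized event stream: services subscribe
    # to topics, and every published event reaches all subscribers.
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, event):
        for handler in self.subscribers[topic]:
            handler(event)

bus = EventBus()
prediction_inputs = []  # consumed by the inference pipeline
training_examples = []  # consumed by the training pipeline

# Both pipelines read the same events from the same bus,
# removing the train/serve data divergence Huyen describes.
bus.subscribe("clicks", prediction_inputs.append)
bus.subscribe("clicks", training_examples.append)
bus.publish("clicks", {"user": "u1", "clicked": True})
```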

Once model training is also converted to a streaming process, the stage is set for continual learning. Huyen pointed out several advantages of frequent model updates. First, it is well known that model accuracy in production tends to degrade over time: as real-world conditions change, data distributions drift. This can be due to seasonal factors, such as holidays, or sudden world-wide events such as the COVID-19 pandemic. There are many available solutions for monitoring accuracy in production, but Huyen claimed these are often "shallow"; they may point out a drop in accuracy but they do not provide a remedy. The solution is to continually update and deploy new models.
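The "shallow" monitoring Huyen describes can be sketched as a rolling accuracy check (my illustrative example, with an assumed window size and threshold): it can raise an alert when accuracy drifts below a threshold, but by itself it offers no remedy; the remedy is the retraining loop.

```python
from collections import deque

def make_accuracy_monitor(window=100, threshold=0.8):
    # Tracks accuracy over the last `window` labeled predictions and
    # returns True (an alert) when it drops below `threshold`.
    results = deque(maxlen=window)

    def record(prediction, label):
        results.append(prediction == label)
        accuracy = sum(results) / len(results)
        return accuracy < threshold

    return record
```

An alert from such a monitor says only that the model has degraded; deciding what to do next, in Huyen's framing, is where continual updating comes in.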

Another use case for continual learning is the "cold start" problem: with a new or infrequent user of a website, for example, the model has very little information about that user, so the goal is to update the model during the user's session. Continual learning is also beneficial for use cases with "natural" labels, such as a click-through prediction model, where the label is simply whether a user clicked on a link or not. Other use cases typically involve situations with short feedback loops, such as reading an article or watching a short video.
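With natural labels, each interaction yields a (features, label) pair as soon as the feedback arrives, which makes simple online updates possible. As a rough sketch (assuming a linear model and squared-error loss, not anything specified in the talk):

```python
def sgd_update(weights, features, label, lr=0.1):
    # One online gradient step for a linear model on squared error.
    # `label` is a natural label, e.g. clicked (1.0) or not (0.0),
    # so the model can be nudged immediately after each interaction.
    prediction = sum(w * x for w, x in zip(weights, features))
    error = prediction - label
    return [w - lr * error * x for w, x in zip(weights, features)]
```

Each click or non-click event triggers one such step, which is how a model could adapt within a single user session rather than waiting for the next batch retraining run.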

After concluding her talk, Huyen answered several questions from the audience. One wondered, since models are constantly being updated, if reproducibility would be a problem. Huyen replied that many reproducibility problems are caused by the separate pipelines for training and inference; however, she recommended tracking the lineage of models using a model store system. Another asked about the risk of models performing worse after an update. Huyen agreed that could happen, and recommended several techniques such as A/B testing and canary analysis to ensure that model accuracy does not degrade. Finally, a user asked about the cost of continual learning in the cloud. Huyen noted that anecdotal evidence suggests that cloud costs may actually decrease, since training on smaller amounts of data requires less compute power. However, she said it was not true in every case and "would love to see more research on this."
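One common building block for the canary technique Huyen mentions is deterministic traffic splitting, sketched below (my illustration; the function name and percentages are assumptions). A stable hash of the request id sends a small, fixed fraction of traffic to the candidate model, so each request consistently hits the same variant while the two models' accuracy is compared.

```python
import hashlib

def route(request_id, canary_percent=5):
    # Stable hash-based bucketing: the same request id always maps to
    # the same variant, so a canary comparison is not confounded by
    # requests bouncing between models.
    bucket = int(hashlib.md5(request_id.encode()).hexdigest(), 16) % 100
    return "canary" if bucket < canary_percent else "production"
```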
