At QCon SF 2024, Cody Yu presented how Anyscale’s Ray can more effectively handle scaling out batch inference. Among the problems Ray can help address are scaling to large datasets (hundreds of GBs or more), ensuring reliability across spot and on-demand instances, orchestrating multi-stage heterogeneous compute, and managing tradeoffs between cost and latency.
Ray Data offers scalable data processing solutions that maximize GPU utilization and minimize data movement costs through optimized task scheduling and streaming execution. The integration of Ray Data with vLLM, an open-source framework for LLM inference, has enabled scalable batch inference, significantly reducing processing times.
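The pattern below is a minimal sketch of that integration: a stateful vLLM worker class applied to a dataset with Ray Data's map_batches. The model name, file paths, batch size, and concurrency are illustrative assumptions, and the concurrency argument varies slightly across Ray versions.

```python
import ray
from vllm import LLM, SamplingParams


class VLLMPredictor:
    """Stateful worker that loads a vLLM engine once per actor/GPU."""

    def __init__(self):
        # Model and sampling settings are illustrative assumptions.
        self.llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
        self.sampling_params = SamplingParams(temperature=0.0, max_tokens=256)

    def __call__(self, batch: dict) -> dict:
        # vLLM continuously batches these prompts internally.
        outputs = self.llm.generate(list(batch["prompt"]), self.sampling_params)
        batch["generated_text"] = [o.outputs[0].text for o in outputs]
        return batch


# Read prompts from Parquet files; the bucket path is a placeholder.
ds = ray.data.read_parquet("s3://my-bucket/prompts/")

# Streaming execution: batches flow through the GPU actors without
# materializing the whole dataset in memory at once.
results = ds.map_batches(
    VLLMPredictor,
    batch_size=64,   # tune for GPU memory
    num_gpus=1,      # one GPU per actor
    concurrency=4,   # number of vLLM replicas
)
results.write_parquet("s3://my-bucket/generations/")
```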
"The demand for batch inference is getting higher, because we now have multi-modality data sources such as cameras, mic sensors, and PDF files. By processing these structued and unstructued data, you will be able to retrieve lots of information and knowledge to improve your products and services." - Cody Yu
Features such as streaming execution in Ray Data were discussed, which improve system throughput and efficiency. Yu highlighted a case study on generating embeddings from PDF files efficiently and cost-effectively with Ray Data, where processing the dataset on roughly 20 GPUs cost less than $1.
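A pipeline along those lines might look like the following sketch. It assumes pypdf for text extraction and a sentence-transformers model for embeddings; the bucket paths, model name, batch size, and GPU count are placeholders rather than the figures from the talk.

```python
import io

import ray
from pypdf import PdfReader
from sentence_transformers import SentenceTransformer


def extract_text(record: dict) -> dict:
    # CPU stage: parse a PDF's raw bytes into plain text.
    reader = PdfReader(io.BytesIO(record["bytes"]))
    text = " ".join(page.extract_text() or "" for page in reader.pages)
    return {"path": record["path"], "text": text}


class Embedder:
    """GPU stage: encode extracted text into embedding vectors."""

    def __init__(self):
        self.model = SentenceTransformer("BAAI/bge-small-en-v1.5", device="cuda")

    def __call__(self, batch: dict) -> dict:
        batch["embedding"] = self.model.encode(list(batch["text"]))
        return batch


# Heterogeneous pipeline: CPU tasks for parsing, GPU actors for embedding,
# executed in a streaming fashion so the stages overlap.
ds = (
    ray.data.read_binary_files("s3://my-bucket/pdfs/", include_paths=True)
    .map(extract_text)
    .map_batches(Embedder, batch_size=128, num_gpus=1, concurrency=20)
)
ds.write_parquet("s3://my-bucket/embeddings/")
```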
In addition, pipeline parallelism is an important feature for models that need to be served on multiple GPUs. By optimizing batch sizes and employing chunked prefill, the system has been fine-tuned for maximum efficiency. This approach not only improves throughput, but also strategically manages computational resources across heterogeneous systems. Ray Tune might also potentially be used to optimize batch processing workflows through hyperparameter tuning.
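As a sketch of how these knobs surface in vLLM's engine configuration (exact argument names and pipeline-parallel support vary by vLLM version, and the model and values here are illustrative):

```python
from vllm import LLM

# Split a large model across 4 GPUs with pipeline parallelism, and enable
# chunked prefill so long prompts are processed in chunks that can be
# scheduled alongside decode steps.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder model
    pipeline_parallel_size=4,                   # pipeline stages across GPUs
    enable_chunked_prefill=True,
    max_num_batched_tokens=8192,                # per-step token budget to tune
)
```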
The session also briefly discussed Ray Serve Batch. Dynamic request batching in Ray Serve enhances service throughput by efficiently processing multiple requests simultaneously, leveraging ML models' vectorized computation capabilities. This feature is particularly useful for expensive models, ensuring optimal hardware utilization. Batching is enabled using the ray.serve.batch
decorator, which requires the method to be asynchronous.
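For illustration, a minimal deployment using dynamic request batching might look like the following; the model call is a stub, and the batch size and wait timeout are values to tune.

```python
from ray import serve


@serve.deployment
class BatchedModel:
    @serve.batch(max_batch_size=32, batch_wait_timeout_s=0.05)
    async def handle_batch(self, inputs: list) -> list:
        # Requests arriving within the wait window are processed together,
        # e.g. as one vectorized forward pass; replace with a real model call.
        return [f"processed: {text}" for text in inputs]

    async def __call__(self, request):
        text = (await request.json())["text"]
        # Each individual request awaits its slot in a batch.
        return await self.handle_batch(text)


app = BatchedModel.bind()
# serve.run(app)  # start the deployment locally
```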
Continuing the presentation, the speaker highlighted advancements in large language model (LLM) inference, focusing on the vLLM framework, speculative decoding, and inference engine optimization. vLLM is an open-source LLM inference engine known for its high throughput and memory efficiency. It features efficient key-value cache memory management with PagedAttention, continuous batching of incoming requests, and optimized CUDA kernels, including integration with FlashAttention and FlashInfer.
The discussion also covered important engine features such as chunked prefill and prefix caching, as well as speculative decoding, a technique that accelerates text generation by using a smaller draft model to propose multiple tokens, which a larger target model then verifies in parallel. This method reduces inter-token latency in memory-bound LLM inference, improving efficiency without compromising accuracy.
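In vLLM these features are exposed as engine options. The sketch below shows one way to enable them; note that the speculative decoding arguments have changed names across vLLM releases, and the draft/target model pair here is only an example.

```python
from vllm import LLM, SamplingParams

# The target model verifies tokens proposed by a smaller draft model;
# prefix caching reuses KV-cache entries for shared prompt prefixes.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",    # target (verifier) model
    speculative_model="meta-llama/Llama-3.2-1B",  # draft model (argument name varies by version)
    num_speculative_tokens=5,                     # tokens proposed per step
    enable_prefix_caching=True,
)

outputs = llm.generate(["Summarize the report:"], SamplingParams(max_tokens=128))
print(outputs[0].outputs[0].text)
```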
Readers interested in learning more about batch inference with Ray can visit InfoQ.com in the coming months for access to the full presentation.