
Microsoft Releases DeepSpeed-FastGen for High-Throughput Text Generation

Microsoft has announced the alpha release of DeepSpeed-FastGen, a system designed to improve the deployment and serving of large language models (LLMs). A synergistic composition of DeepSpeed-MII and DeepSpeed-Inference, DeepSpeed-FastGen is built on a new technique called Dynamic SplitFuse and currently supports several model architectures.



The Dynamic SplitFuse technique is a new token composition strategy for prompt processing and token generation. By splitting long prompts into partial chunks and composing them with ongoing generation, it allows DeepSpeed-FastGen to run every forward pass at a consistent size. The result is improved responsiveness, efficiency, and lower variance: up to 2.3 times higher effective throughput than systems such as vLLM, with lower latency and higher-throughput streaming generation for all clients.
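The scheduling idea can be illustrated with a toy scheduler. This is a simplified sketch, not DeepSpeed's implementation; the token budget, request format, and function name are invented for illustration:

```python
from collections import deque

TOKEN_BUDGET = 8  # tokens per forward pass; real deployments use far larger budgets


def schedule(requests):
    """Toy Dynamic SplitFuse-style scheduler.

    Each request is (prompt_len, new_tokens). Long prompts are split into
    chunks across several passes, and partial prompt chunks are fused with
    one-token decode slots, so each forward pass stays at a consistent size
    of TOKEN_BUDGET tokens until the remaining work runs out.
    Returns the number of tokens processed in each forward pass.
    """
    prefill = deque([prompt, decode] for prompt, decode in requests)
    decoding = []          # requests that finished prefill and are now generating
    pass_sizes = []
    while prefill or decoding:
        # 1. Every in-flight generating request contributes one decode token.
        for req in decoding:
            req[1] -= 1
        used = len(decoding)
        decoding = [req for req in decoding if req[1] > 0]
        # 2. Fill the rest of the budget with (possibly partial) prompt chunks.
        while prefill and used < TOKEN_BUDGET:
            req = prefill[0]
            chunk = min(req[0], TOKEN_BUDGET - used)  # split: take only what fits
            req[0] -= chunk
            used += chunk
            if req[0] == 0:            # prompt fully processed: start decoding
                decoding.append(prefill.popleft())
        pass_sizes.append(used)
    return pass_sizes
```

For example, `schedule([(10, 3), (3, 2)])` processes its 18 total tokens in five passes of sizes `[8, 5, 2, 2, 1]`: the first pass takes a partial chunk of the long prompt, and later passes fuse the remaining prompt tokens with decode slots, never exceeding the budget.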

According to a post on LinkedIn:

FastGen effectively synthesizes novel batch scheduling techniques with efficient KV cache management, communication-optimized tensor-parallelism, and ultra-fast CUDA kernels. Furthermore, it integrates a low-overhead load-balancer that offers perfect linear scaling on dozens of replicas.

In terms of performance, DeepSpeed-FastGen outperforms vLLM in both throughput and latency, providing either greater throughput at equivalent latency or lower latency at the same throughput. For example, on Llama-2 70B with four A100 80GB GPUs, DeepSpeed-FastGen demonstrates up to 2x higher throughput (1.36 rps vs. 0.67 rps) at identical latency (9 seconds), or up to 50% latency reduction (7 seconds vs. 14 seconds) at the same throughput (1.2 rps). DeepSpeed-FastGen also offers replica-level load balancing that evenly distributes requests across multiple servers, allowing for easy scalability: throughput with 16 replicas reaches 23.7 queries/sec, a linear 16x increase over a single replica.
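The even-distribution idea behind replica-level load balancing can be sketched as a simple round-robin dispatcher. This is an illustrative toy, not DeepSpeed-MII's actual load balancer, which runs as part of the serving stack:

```python
from itertools import cycle


def round_robin_dispatch(requests, n_replicas):
    """Assign each incoming request to the next replica in turn, so every
    replica sees an equal share of the load and aggregate throughput scales
    roughly linearly with the replica count."""
    assignments = [[] for _ in range(n_replicas)]
    for request, replica in zip(requests, cycle(range(n_replicas))):
        assignments[replica].append(request)
    return assignments


# 32 requests over 16 replicas: each replica serves exactly 2 requests.
shares = round_robin_dispatch(list(range(32)), 16)
```

With perfectly even shares, 16 replicas each sustaining roughly 1.48 queries/sec add up to the reported aggregate of 23.7 queries/sec.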



DeepSpeed-FastGen currently supports several model architectures, including LLaMA, LLaMA2, Mistral, and OPT. All current models leverage Hugging Face APIs in the backend to provide both the model weights and each model's corresponding tokenizer. Microsoft says it plans to add support for additional models in the future.

DeepSpeed-FastGen offers two deployment options: an interactive non-persistent pipeline or a persistent serving deployment. The non-persistent pipeline deployment is suitable for temporary interactive sessions, while the persistent deployment is designed for use with long-running and production applications.
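In DeepSpeed-MII, these two options map to `mii.pipeline` and `mii.serve` respectively. The sketch below shows both styles; the model name and generation parameters are illustrative, and the calls follow the DeepSpeed-MII API as documented at the time of the release:

```python
def run_nonpersistent(prompts):
    # Non-persistent pipeline: created in-process and torn down with the
    # Python session; suited to temporary interactive sessions.
    from mii import pipeline
    pipe = pipeline("mistralai/Mistral-7B-v0.1")  # illustrative model name
    return pipe(prompts, max_new_tokens=128)


def run_persistent(prompts):
    # Persistent deployment: starts a long-lived serving process that can
    # accept requests from multiple clients, for production applications.
    import mii
    client = mii.serve("mistralai/Mistral-7B-v0.1")
    responses = client.generate(prompts, max_new_tokens=128)
    client.terminate_server()  # shut down the server when finished
    return responses
```

Running either function requires a GPU machine with DeepSpeed-MII installed; the persistent variant keeps serving until the server is explicitly terminated.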

DeepSpeed is an open-source deep learning optimization library for PyTorch from Microsoft that makes distributed training and inference easy, efficient, and effective. It is designed to reduce computing power and memory use and to train large distributed models with better parallelism on existing computer hardware. Some of the key features of DeepSpeed include the Zero Redundancy Optimizer (ZeRO), mixed precision training, and single-GPU, multi-GPU, and multi-node training support.
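Features such as ZeRO and mixed precision training are typically enabled through DeepSpeed's JSON configuration file. A minimal sketch, with illustrative values, combining ZeRO stage 2 with fp16 mixed precision might look like:

```json
{
  "train_batch_size": 16,
  "fp16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 2
  }
}
```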

Developers interested in learning more about DeepSpeed-FastGen can follow the Getting Started guide for more details and visit the DeepSpeed-MII GitHub repository. Those wishing to contribute may also read the contributing guide.
