
Hugging Face's Guide to Optimizing LLMs in Production


When it comes to deploying Large Language Models (LLMs) in production, the two major challenges stem from the huge number of parameters they comprise and the necessity of handling very long input sequences to represent contextual information. Hugging Face has documented a list of techniques to tackle those hurdles based on its experience serving such models.

The three techniques Hugging Face researcher Patrick von Platen describes in his article are operating at reduced numerical precision, using a variation of the attention algorithm known as Flash Attention, and using specialized architectures for inference.

LLMs require huge quantities of VRAM just to be loaded, ranging from dozens of gigabytes (bigcode/starcoder) to hundreds (Llama, Bloom, GPT3). A first optimization is possible by switching from float32 to bfloat16 precision:

Almost all models are trained in bfloat16 nowadays, there is no reason to run the model in full float32 precision if your GPU supports bfloat16. Float32 won't give better inference results than the precision that was used to train the model.
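
As an illustration of what that looks like in practice, the sketch below loads a checkpoint in bfloat16 with the Hugging Face transformers library. The checkpoint name is only illustrative, and the snippet assumes the accelerate package is installed so that device_map="auto" can place the weights.

```python
# Sketch: loading a causal LM in bfloat16 instead of the default float32.
# Assumes transformers and accelerate are installed; the checkpoint is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigcode/starcoder"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # roughly halves memory vs. float32
    device_map="auto",           # place weights on the available GPU(s)
)
```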

This halves the overall memory consumption but, unfortunately, the required memory may still be too large in many cases. A more aggressive approach then consists in quantizing model weights to 8-bit or 4-bit precision, which has been shown not to cause a significant loss in model performance.

Quantization works especially well for text generation since all we care about is choosing the set of most likely next tokens and don't really care about the exact values of the next token logit distribution.

Quantization further reduces the required memory, making it possible to run the smaller models on off-the-shelf GPUs with just 16GB of VRAM, albeit at the cost of slightly longer inference times.
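
A sketch of how 8-bit and 4-bit weight loading is typically requested through transformers' bitsandbytes integration follows; it assumes the bitsandbytes and accelerate packages are installed, and the checkpoint name is again only illustrative.

```python
# Sketch: quantizing weights to 8-bit or 4-bit at load time via bitsandbytes.
# Assumes the bitsandbytes and accelerate packages are installed.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "bigcode/starcoder"  # illustrative checkpoint

# 8-bit weights: roughly half the footprint of bfloat16
model_8bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

# 4-bit weights: smaller still, at the cost of slightly slower inference
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
```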

Using Flash Attention, a novel algorithm for the self-attention layers LLMs apply to understand the contextual relationships between input tokens, is another key optimization, says von Platen, since it breaks the quadratic growth of those layers' memory requirements with the number of input tokens.

The algorithm is too complex to summarize here, but suffice it to say that by relying on softmax normalization statistics and some smart mathematics, it provides identical outputs while requiring memory that only grows linearly with the number of input tokens. Inference performance also benefits from the algorithm using the GPU's faster on-chip SRAM instead of the slower VRAM.

In practice, there is currently absolutely no reason to not use Flash Attention if available. The algorithm gives mathematically the same outputs, and is both faster and more memory-efficient.
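
In the transformers library, opting into Flash Attention is typically a one-line change at load time, as in the sketch below. It assumes a recent transformers release, a GPU supported by Flash Attention 2, and the flash-attn package installed; the checkpoint is illustrative.

```python
# Sketch: requesting the Flash Attention 2 kernels when loading a model.
# Assumes a recent transformers release, a supported GPU, and flash-attn installed.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder",                      # illustrative checkpoint
    torch_dtype=torch.bfloat16,               # Flash Attention requires fp16/bf16 weights
    attn_implementation="flash_attention_2",  # outputs are mathematically identical
    device_map="auto",
)
```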

The third area where LLMs can be optimized in production is choosing the right architecture for them to handle long text inputs effectively and efficiently. Here, recent research can help make the right choice regarding two components that quickly become bottlenecks, says von Platen: positional embeddings and the key-value cache.

Positional embeddings provide a cue for an LLM to understand sequence order by encoding the position of each token into a numerical representation. For LLMs intended to solve tasks requiring handling large text inputs, relative positional embeddings such as RoPE and ALiBi should be used for training.

Both RoPE and ALiBi position encodings can extrapolate to input lengths not seen during training whereas it has been shown that extrapolation works much better out-of-the-box for ALiBi as compared to RoPE.

Both algorithms are already available in a number of current LLMs.
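
To give an idea of how a relative scheme such as ALiBi works, the following sketch computes the linear attention bias described in the ALiBi paper: no position vectors are added to the token embeddings; instead a per-head penalty proportional to the query-key distance is added to the attention scores. The slope schedule follows the paper for power-of-two head counts, but the code is a simplified illustration rather than any particular library's implementation.

```python
# Sketch of the ALiBi idea: a per-head linear penalty on query-key distance is added
# to the raw attention scores; the causal mask and softmax are applied afterwards.
import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    # geometric slope schedule from the paper: 2^(-8/n), 2^(-16/n), ...
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    # relative distance j - i between key position j and query position i
    positions = torch.arange(seq_len)
    distances = positions[None, :] - positions[:, None]
    # shape (num_heads, seq_len, seq_len); future positions are removed by the causal mask
    return slopes[:, None, None] * distances[None, :, :].float()

# usage: scores = q @ k.transpose(-1, -2) / head_dim ** 0.5
#        scores = scores + alibi_bias(num_heads, seq_len)  # then causal mask + softmax
```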

The key-value cache can be used as a means to encode the context of a conversation. It grows by one element with each new interaction, which is much more efficient than the alternative approach of re-encoding and decoding the full context with each request. von Platen goes into the details of two classes of key-value caches, namely Multi-Query-Attention (MQA) and Grouped-Query-Attention (GQA), to show the advantages they bring.
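
A sketch of how the cache is used with transformers' text generation API follows: with use_cache=True (the default), the keys and values of already-processed tokens are stored, so only the newest token has to be projected at each decoding step. The checkpoint name and prompt are illustrative.

```python
# Sketch: generation with the key-value cache enabled (the default in transformers).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigcode/starcoder"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to(model.device)
# use_cache=True keeps past keys/values, so each step only projects the newest token
outputs = model.generate(**inputs, max_new_tokens=64, use_cache=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```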

von Platen's article covers much more ground than can be summarized here and provides hands-on examples to demonstrate his points, so do not miss it for the full picture.
