Navigating LLM Deployment: Tips, Tricks and Techniques by Meryem Arik at QCon London

At QCon London, Meryem Arik discussed deploying Large Language Models (LLMs). She highlighted that while initial proofs of concept benefit from hosted solutions, scaling demands self-hosting to cut costs, improve performance with tailored models, and meet privacy and security requirements. Arik emphasized understanding deployment limits, using quantization for efficiency, and optimizing inference to make full use of GPU resources.


Arik, co-founder and CEO of TitanML, explained that businesses typically start with external APIs for their LLM applications. Although the OpenAI API is a great starting point for many applications, developers should consider self-hosting once they want to scale their model deployment. The decision between hosted APIs and self-hosting comes down to three considerations: scale of deployment, performance, and privacy and security.

While leveraging hosted services like OpenAI's API is cost-effective for proof-of-concept stages, scaling to many queries makes self-hosting more economical. On performance, Arik explained that task-specific LLMs can offer superior results, either through higher-quality domain-specific text or by achieving comparable quality with a much smaller model. Self-hosting also makes it easier to meet compliance requirements that apply to your domain, such as GDPR and HIPAA. For large enterprises, she noted, control over the deployment is critical, and that control cannot be guaranteed when relying on an external API.

Unfortunately, self-hosting is a complex task with non-obvious pitfalls. The three main challenges are the size of LLMs (the L stands for Large...), the need for robust GPU infrastructure, and the fast pace of technological change. As Arik noted, "Half of the techniques used today didn't exist a year ago".

Arik provided seven tips for deploying LLMs during her QCon talk. The first tip is to understand your deployment boundaries and work backwards from them. Boundaries include latency requirements, the expected load on your API, and the hardware resources available to you. With this foresight, you can select the most suitable models and infrastructure.
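
As an illustration of working backwards from deployment boundaries, the sketch below checks whether candidate models fit within a GPU memory budget; the model list, sizes, and headroom factor are hypothetical placeholders rather than figures from the talk, and latency and load boundaries would narrow the choice further.

```python
# A minimal sketch of "working backwards" from a hardware boundary.
# All models and numbers below are hypothetical placeholders.
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    params_billions: float
    bits_per_weight: int  # e.g. 16 for FP16, 4 for INT4


def weight_memory_gib(c: Candidate) -> float:
    """Rough memory needed for the weights alone (ignores KV cache and activations)."""
    return c.params_billions * 1e9 * c.bits_per_weight / 8 / 2**30


def fits(c: Candidate, gpu_memory_gib: float, headroom: float = 0.8) -> bool:
    """Leave headroom for the KV cache, activations, and batching."""
    return weight_memory_gib(c) <= gpu_memory_gib * headroom


gpu_memory_gib = 80  # placeholder: a single 80 GiB GPU
for c in [
    Candidate("70B @ FP16", 70, 16),
    Candidate("70B @ INT4", 70, 4),
    Candidate("7B @ FP16", 7, 16),
]:
    print(f"{c.name}: ~{weight_memory_gib(c):.0f} GiB weights, fits={fits(c, gpu_memory_gib)}")
```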

The second tip is to quantize your models. She showed a graph from the paper "The case for 4-bit precision: k-bit Inference Scaling Laws", authored by Tim Dettmers. The graph plots model performance on the Y-axis against model size in bits on the X-axis. It shows that for a fixed budget of bits (as one has on a GPU with fixed memory), a larger model quantized to INT4 generally outperforms a smaller model at higher precision. In this case, you can start from the available infrastructure and work backwards to see which model fits.
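
A common way to apply 4-bit quantization at load time is the bitsandbytes integration in Hugging Face transformers; the sketch below assumes that stack and a placeholder model name, and is not necessarily the tooling Arik demonstrated.

```python
# A minimal sketch of loading a model quantized to 4 bits with Hugging Face
# transformers and bitsandbytes; the model id is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder: pick what fits your GPU budget

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4 bits at load time
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # run matmuls in bf16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # place the quantized weights on the available GPU(s)
)
```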


The third tip is to optimize the inference of your model. Even well-known techniques such as dynamic batching may not be enough to fully utilize GPU resources. She advised using a tensor-parallel strategy, which splits the weights of each layer across multiple GPUs, rather than placing whole layers on separate GPUs.
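
One way to combine batching with tensor parallelism is an inference server such as vLLM; the sketch below is a hedged example assuming vLLM, with a placeholder model name and GPU count, and is not necessarily the tooling discussed in the talk.

```python
# A minimal sketch of tensor-parallel serving with vLLM; the model name and
# GPU count are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",  # placeholder model
    tensor_parallel_size=4,                  # split each layer's weights across 4 GPUs
)

# Continuous batching groups concurrent requests automatically to keep the GPUs busy.
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Summarise the main risks in our deployment plan."], params)
print(outputs[0].outputs[0].text)
```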

The fourth tip is consolidating computational resources into a central infrastructure. This enables more efficient resource management and provides a uniform platform for multiple development teams. This approach not only simplifies operations but also mirrors the seamless experience offered by services like OpenAI, with the added benefit of privacy and control.
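
In practice, a consolidated platform often exposes an OpenAI-compatible endpoint so application teams keep a familiar developer experience; the sketch below assumes such a gateway, with a hypothetical internal URL and model name.

```python
# A minimal sketch of calling a self-hosted, OpenAI-compatible gateway; the
# base URL, token, and model name are hypothetical.
from openai import OpenAI

client = OpenAI(
    base_url="https://llm.internal.example.com/v1",  # hypothetical internal gateway
    api_key="internal-token",                        # issued by the platform team, not OpenAI
)

response = client.chat.completions.create(
    model="company-llama-70b-int4",  # hypothetical model exposed by the platform
    messages=[{"role": "user", "content": "Draft a short status update for the infra team."}],
)
print(response.choices[0].message.content)
```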

The fifth tip is to prepare for changing your model. Newer models are frequently released, often outperforming their predecessors. By designing systems with flexibility in mind, you can allow for easy model updates or replacements. This adaptive approach ensures that businesses can always leverage the best available technology.
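
One simple way to build in that flexibility is to keep the model identifier out of application code, for example behind a thin wrapper driven by configuration; the environment variables and names below are hypothetical.

```python
# A minimal sketch of isolating the model choice behind configuration so it can
# be swapped without code changes; all names and env vars are hypothetical.
import os

from openai import OpenAI

client = OpenAI(
    base_url=os.environ["LLM_BASE_URL"],  # hypothetical: points at the internal gateway
    api_key=os.environ["LLM_API_KEY"],
)
MODEL_NAME = os.environ.get("LLM_MODEL", "company-llama-70b-int4")  # hypothetical default


def complete(prompt: str) -> str:
    """Application code depends on this wrapper, not on a specific model."""
    response = client.chat.completions.create(
        model=MODEL_NAME,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```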

The last two tips concerned cost. GPUs might look expensive, but for LLM inference they offer better value for money than CPUs. You can also keep costs down by using smaller models. In her words: "GPT-4 is king, but don't get the king to do the dishes". State-of-the-art large language models are trained on a wide range of tasks and can perform many of them, but that also makes them expensive to run at inference time. By choosing a smaller domain-specific model, you can both improve performance and save costs.
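
The value-for-money argument is easiest to see as a cost per token rather than a cost per hour; the back-of-the-envelope sketch below uses entirely hypothetical prices and throughputs to illustrate the arithmetic, not figures from the talk.

```python
# A rough cost-per-token comparison; every number here is a hypothetical
# placeholder, not a figure from the talk.

def cost_per_million_tokens(hourly_price_usd: float, tokens_per_second: float) -> float:
    return hourly_price_usd / (tokens_per_second * 3600) * 1_000_000

# Hypothetical: a GPU instance with a higher hourly price but far higher
# throughput can still be cheaper per generated token than a CPU instance.
print(f"GPU: ${cost_per_million_tokens(4.00, 2000):.2f} per million tokens")  # ~$0.56
print(f"CPU: ${cost_per_million_tokens(0.50, 20):.2f} per million tokens")    # ~$6.94
```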

