Facilitating the Spread of Knowledge and Innovation in Professional Software Development


Running Large Language Models Natively on Mobile and Laptops


MLC LLM is a new open-source project that aims to enable the deployment of large language models on a variety of hardware platforms and applications. It additionally includes a framework to optimize model performance for each specific use case.

Our mission is to enable everyone to develop, optimize and deploy AI models natively on everyone's devices. Everything runs locally with no server support and accelerated with local GPUs on your phone and laptops.

At the foundation of MLC LLM lies an approach called machine learning compilation (MLC), which combines ML programming abstractions, learning-driven search, compilation, and an optimized library runtime for ease of deployment.
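One concrete concern an ML compilation stack must address is shrinking model weights to fit in device memory, typically via low-bit quantization. The following is an illustrative sketch of 4-bit group quantization in plain Python; it is not MLC LLM's actual implementation, and the function names are hypothetical.

```python
# Illustrative sketch (not MLC LLM's code): 4-bit group quantization,
# a common technique for shrinking LLM weights to fit device memory.

def quantize_group(weights, bits=4):
    """Map a group of float weights to low-bit integers with a shared scale."""
    qmax = (1 << bits) - 1                 # 15 for 4-bit codes
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax or 1.0        # guard against a flat group
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo

def dequantize_group(codes, scale, lo):
    """Recover approximate float weights from the integer codes."""
    return [c * scale + lo for c in codes]

weights = [0.12, -0.53, 0.98, 0.04, -0.27, 0.66, -0.91, 0.33]
codes, scale, zero = quantize_group(weights)
restored = dequantize_group(codes, scale, zero)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(codes)     # integer codes in [0, 15]
print(max_err)   # rounding error, bounded by scale / 2
```

Each 32-bit float collapses to a 4-bit code plus a small per-group overhead, roughly an 8x memory reduction, at the cost of bounded rounding error.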

In comparison with the more controlled case of deployment on server-class systems, the main complexity the project faces is the heterogeneity of the supported hardware. This includes supporting different models of CPUs, GPUs, and potentially other co-processors and accelerators; addressing memory constraints; and dealing with OS environment variation, where dependencies such as Python or specific packages cannot always be taken for granted.
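In practice, handling this heterogeneity means targeting a different GPU API per platform. The sketch below is a hypothetical dispatch routine, not MLC LLM's actual logic; the backend names reflect the GPU APIs commonly used on these platforms (Metal on Apple hardware, CUDA on NVIDIA, Vulkan elsewhere).

```python
# Hypothetical sketch of per-platform GPU backend selection for a
# heterogeneous deployment; illustrative only.
import sys

def pick_backend(platform=None, has_cuda=False):
    """Choose a GPU API based on the host platform and available hardware."""
    platform = platform or sys.platform
    if platform == "darwin":
        return "metal"    # Apple Silicon Macs, iPhones, iPads
    if has_cuda:
        return "cuda"     # NVIDIA GPUs
    return "vulkan"       # AMD / Intel GPUs and other targets

print(pick_backend("darwin"))              # metal
print(pick_backend("linux", has_cuda=True))  # cuda
```

A compiler stack like TVM makes this tractable by generating kernels for each such target from a single model definition, rather than hand-writing per-backend code.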

To achieve these goals, MLC LLM is based on Apache TVM Unity, a compiler stack for deep learning systems, and leverages tokenizers from Hugging Face and Google, as well as open-source LLMs such as Llama, Vicuna, Dolly, and others.

The project includes both a C++ CLI tool and an iOS chat app showcasing how to integrate the compiled artifacts and the required pre/post-processing.

MLC LLM can be deployed on recent Apple Silicon, including the iPhone 14 Pro, iPad Pro with the M1 or A12Z chip, and M1-based MacBook Pro and later models; AMD GPUs including the Radeon Pro 5300M, the AMD GPU on the Steam Deck, the Radeon RX 6800 with 16GB of VRAM, and others; NVIDIA GPUs including the GTX 1060 (6GB), RTX 3080, RTX 2080 Ti, and others; and the Intel UHD Graphics 630 GPU. Support for Android devices is in the works.

Performance varies significantly across supported hardware, with several NVIDIA GPUs, the AMD Radeon RX 6800, and the 2021 MacBook Pro M1 Max scoring above 20 tokens/second. For comparison, the M1 iPad Pro reaches 10.6 tokens/second and the iPhone 14 Pro 7.2 tokens/second.
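To put those decode rates in perspective, a quick back-of-the-envelope calculation shows how long a fixed-length reply takes on each device, using the figures quoted above (the reply length is an arbitrary assumption):

```python
# Time to generate a 250-token reply at the decode rates quoted above.
rates_tok_per_s = {
    "MacBook Pro M1 Max (2021)": 20.0,  # "above 20 tokens/second"
    "M1 iPad Pro": 10.6,
    "iPhone 14 Pro": 7.2,
}

reply_tokens = 250
for device, rate in rates_tok_per_s.items():
    print(f"{device}: {reply_tokens / rate:.1f} s")
```

Roughly 12 seconds on the laptop versus about 35 seconds on the phone, so all of these devices are already in a usable range for interactive chat.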

According to the project maintainers, MLC LLM makes it possible to run quick experiments, try out compiler optimizations, and eventually deploy to the desired targets with ease.

If you want to find out more about MLC, you can check out the official documentation, which will guide you through the key abstractions used to represent machine learning programs, automatic optimization techniques, and how to optimize for dependencies, memory, and performance.

On a related note, MLC LLM has a companion project, WebLLM, focused on running LLMs in web browsers.
