Meta MultiRay Improves Efficiency of Large-Scale AI Models

Meta developed MultiRay, a platform for running state-of-the-art machine learning models cost-effectively. MultiRay lets multiple models run on the same input, so the bulk of the processing cost is shared and each additional model adds only a small incremental cost.

Current AI models are very large because they handle text, images, and other modalities. The best results are obtained by training a huge model on an immense amount of data and then specializing it for a specific task (e.g., detecting harmful speech). This process is very expensive. A more economical approach is to compute the expensive part once, the embedding of the input, and then reuse it, with a smaller dedicated training step, for each specific task.

This is the goal of MultiRay, the platform developed by Meta. MultiRay's universal models are trained to perform well across a wide set of tasks and domains. ML teams at Meta can therefore iterate quickly on many models across different applications, which is far more efficient than each team developing all of its models on its own.

MultiRay uses large foundation models that return a point in a high-dimensional vector space, called an embedding, which represents an ML-friendly version of the original input. Task-specific models can then consume this embedding instead of the raw input, which is much simpler to handle. The foundation models deployed in MultiRay are optimized to work well across different tasks.
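To illustrate the pattern, here is a minimal sketch in Python of a shared encoder serving several lightweight task heads. The class names, dimensions, and the random placeholder encoder are hypothetical and not Meta's actual implementation; the point is only that the expensive embedding is computed once and reused.

```python
import numpy as np

class UniversalEncoder:
    """Hypothetical shared foundation model: the expensive part, run once per input."""
    EMBEDDING_DIM = 1024  # assumed dimensionality

    def embed(self, text: str) -> np.ndarray:
        # Placeholder for an expensive forward pass through a large model.
        rng = np.random.default_rng(abs(hash(text)) % (2**32))
        return rng.standard_normal(self.EMBEDDING_DIM)

class TaskHead:
    """Lightweight task-specific model: cheap to train and to run on an embedding."""
    def __init__(self, num_classes: int):
        rng = np.random.default_rng(0)
        self.weights = rng.standard_normal(
            (UniversalEncoder.EMBEDDING_DIM, num_classes))

    def predict(self, embedding: np.ndarray) -> int:
        return int(np.argmax(embedding @ self.weights))

encoder = UniversalEncoder()
harmful_speech_head = TaskHead(num_classes=2)
topic_head = TaskHead(num_classes=10)

# The costly embedding is computed once and shared by both task heads.
embedding = encoder.embed("some user-generated text")
print(harmful_speech_head.predict(embedding))
print(topic_head.predict(embedding))
```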

Another goal of MultiRay is to centralize GPU execution and caching in order to save time and money, democratizing access to large foundation models. Centralization brings several benefits:

- Amortization across many teams: specialized and costly hardware such as GPUs is shared by all teams, splitting the cost among them.
- Simpler development and operations: instead of each team managing its own models, infrastructure, and model upkeep, centralization allows optimization techniques to be applied once across different groups.
- Faster research to production: MultiRay serves over 125 clients at Meta, supports up to 20 million queries per second (QPS), and serves 800 billion queries per day, which allows systems specialists to contribute key optimizations in one place.

Two further advantages of MultiRay are efficiency on accelerators and cache management. Accelerator hardware works best when it processes an aggregated group of requests in parallel (a batch). The optimal batch size depends on the hardware, so MultiRay groups incoming requests into batches tuned for the accelerators available. Cache management is equally important: MultiRay uses a cache to avoid the cost of recomputing embeddings. Because the models are large and the cache is finite, results cannot be kept for long. MultiRay therefore analyzes request patterns across clients and tunes the cache size, time to live, and update policy to find the strategy that saves the most recomputation cost, as sketched below.
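The following is a minimal sketch of the caching idea, assuming a simple in-process TTL cache in Python; MultiRay's actual cache is a distributed system and its tuning algorithm is not public.

```python
import time

class EmbeddingCache:
    """Toy TTL cache: recompute an embedding only when it is missing or stale."""

    def __init__(self, compute_fn, ttl_seconds: float):
        self.compute_fn = compute_fn      # expensive embedding function
        self.ttl_seconds = ttl_seconds    # time-to-live for cached entries
        self._store = {}                  # key -> (timestamp, embedding)

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is not None:
            created_at, embedding = entry
            if time.monotonic() - created_at < self.ttl_seconds:
                return embedding          # cache hit: no recomputation
        embedding = self.compute_fn(key)  # miss or expired: recompute and store
        self._store[key] = (time.monotonic(), embedding)
        return embedding

# Usage: repeated requests for the same input within the TTL hit the cache.
cache = EmbeddingCache(compute_fn=lambda text: [len(text)], ttl_seconds=60.0)
cache.get("hello")  # computed
cache.get("hello")  # served from cache
```

In a real deployment, the TTL and cache size would be the tunable knobs that MultiRay's analysis of cross-client request patterns adjusts to maximize saved recomputation.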

A centralized service has challenges too, such as quotas, client management, and cost attribution. These problems are largely solved for large-scale systems in general, but they had to be adapted to the AI domain. Moreover, because MultiRay is widely used, its models must remain state of the art across many use cases at once, which brings significant work in model versioning and refreshes, as well as investments in new training architectures and training flows to reduce the research-to-production time.
