Meta (formerly Facebook) has reported substantial improvements in the efficiency and reliability of its machine-learning model serving infrastructure by focusing on optimising tail utilisation.
According to an article on its Engineering blog, the company's efforts have resulted in a 35% increase in work output without adding resources, a two-thirds reduction in timeout error rates, and a 50% decrease in tail latency at the 99th percentile. Tail utilisation refers to the utilisation level of the top 5% of servers when ranked by usage. It is a critical factor in system performance, particularly for large-scale operations such as Meta's advertising platform, which relies on sophisticated machine-learning models to deliver ads in real time.
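To make the metric concrete, the short Python sketch below computes the utilisation of the top 5% of servers for a hypothetical fleet. The fleet numbers and the exact definition are illustrative assumptions rather than Meta's published formula.

```python
def tail_utilisation(per_server_util, tail_fraction=0.05):
    """Average utilisation of the most-loaded `tail_fraction` of servers.

    `per_server_util` is a list of utilisation values (0.0-1.0), one per server.
    Illustrative definition only; Meta's exact metric may differ.
    """
    ranked = sorted(per_server_util, reverse=True)
    k = max(1, int(len(ranked) * tail_fraction))
    return sum(ranked[:k]) / k

# Example: a fleet where most servers sit near 55% but a few run hot.
fleet = [0.55] * 95 + [0.92, 0.90, 0.88, 0.87, 0.85]
print(f"mean utilisation: {sum(fleet) / len(fleet):.2f}")
print(f"tail (top 5%) utilisation: {tail_utilisation(fleet):.2f}")
```

The gap between the two numbers is the point: a fleet can look comfortably provisioned on average while its hottest servers are close to saturation.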
The article explains that the challenges of tail utilisation stem from the non-linear relationship between traffic growth and server utilisation. As traffic grows, the servers driving high tail utilisation can become overloaded and fail, putting service level agreements (SLAs) at risk. This often forces teams to overallocate capacity across the entire system simply to maintain headroom on the most constrained servers.
Optimising tail utilisation is an emerging area of focus, and Meta is among the first companies to publish its work in this space. In an article for Middleware summarising areas where utilisation can be improved, Sam Suthar adds context:
"Effective server utilisation is more about maintaining the health and capability of the installed hardware and systems and drawing out better performance without consuming more resources than necessary." - Sam Suthar
Suthar suggests that effective monitoring and alerting for resources, performance and capacity are all critical to understanding this area.
Meta's approach to addressing these issues involved two main strategies: tuning load-balancing mechanisms and implementing system-level changes in model deployment.
For load balancing, Meta leveraged the "power of two choices" algorithm, a randomised load balancing technique that selects the least loaded of two randomly chosen servers for each request. This approach, implemented through Meta's ServiceRouter infrastructure, helped to avoid heavily loaded hosts and improve tail utilisation.
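The technique itself is simple to sketch. The Python simulation below uses hypothetical host names and a bare request counter as the load signal to show why sampling two hosts and routing to the less-loaded one keeps the spread across hosts small; the real ServiceRouter works with richer load signals than this.

```python
import random

def pick_host(hosts, load):
    """'Power of two choices': sample two hosts at random and send the
    request to the less-loaded one. Illustrative only; ServiceRouter's
    actual load signals and selection logic are not public in this detail."""
    a, b = random.sample(hosts, 2)
    return a if load[a] <= load[b] else b

# Toy simulation: 10 hosts, 10,000 requests.
hosts = [f"host{i}" for i in range(10)]
load = {h: 0 for h in hosts}
for _ in range(10_000):
    load[pick_host(hosts, load)] += 1

# The gap between the busiest and quietest host stays small,
# unlike purely random assignment.
print(max(load.values()) - min(load.values()))
```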
The blog post also describes placement load balancing, which involves moving model replicas across hosts to balance the load. Meta achieved this by fine-tuning configurations in Shard Manager, its system for facilitating the development and operation of sharded applications.
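A greatly simplified sketch of what placement load balancing does is shown below: repeatedly move a replica from the hottest host to the coolest one while doing so narrows the gap between them. The replica and host names are hypothetical, and Shard Manager's actual placement logic is considerably more sophisticated.

```python
def rebalance_step(placement, host_load, replica_load):
    """One greedy placement move: shift the most-loaded replica off the
    hottest host onto the coolest one, but only if that narrows the gap."""
    hottest = max(host_load, key=host_load.get)
    coolest = min(host_load, key=host_load.get)
    candidates = [r for r, h in placement.items() if h == hottest]
    if not candidates:
        return False
    replica = max(candidates, key=lambda r: replica_load[r])
    gap_before = host_load[hottest] - host_load[coolest]
    gap_after = abs((host_load[hottest] - replica_load[replica])
                    - (host_load[coolest] + replica_load[replica]))
    if gap_after >= gap_before:
        return False
    placement[replica] = coolest
    host_load[hottest] -= replica_load[replica]
    host_load[coolest] += replica_load[replica]
    return True

# Hypothetical starting state: one host is badly overloaded.
placement = {"model_a#0": "h1", "model_a#1": "h1", "model_b#0": "h2"}
replica_load = {"model_a#0": 40, "model_a#1": 35, "model_b#0": 20}
host_load = {"h1": 75, "h2": 20, "h3": 0}

while rebalance_step(placement, host_load, replica_load):
    pass
print(placement, host_load)
```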
Several system-level optimisations were also used. One critical insight was considering memory bandwidth as a resource during replica placement. The team discovered that CPU spikes observed when new replicas began serving traffic were due to increased memory latency rather than pure CPU utilisation. Another significant change was the implementation of per-model load counters. This approach helped align the expectations of different system components, including the ServiceRouter, Shard Manager, and ReplicaEstimator, leading to more accurate load balancing and resource allocation.
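The memory-bandwidth insight can be illustrated with a simple fit check that treats bandwidth as a first-class resource dimension alongside CPU and RAM during replica placement. The field names and figures below are assumptions for illustration, not Meta's actual placement schema.

```python
from dataclasses import dataclass

@dataclass
class HostHeadroom:
    cpu: float          # fraction of cores still available
    memory_gb: float    # free RAM
    mem_bw_gbps: float  # unused memory bandwidth

@dataclass
class ReplicaDemand:
    cpu: float
    memory_gb: float
    mem_bw_gbps: float

def fits(host: HostHeadroom, replica: ReplicaDemand) -> bool:
    """A host can take another replica only if it has headroom on every
    dimension, memory bandwidth included."""
    return (replica.cpu <= host.cpu
            and replica.memory_gb <= host.memory_gb
            and replica.mem_bw_gbps <= host.mem_bw_gbps)

host = HostHeadroom(cpu=0.4, memory_gb=64, mem_bw_gbps=30)
replica = ReplicaDemand(cpu=0.2, memory_gb=32, mem_bw_gbps=45)
print(fits(host, replica))  # False: enough CPU and RAM, not enough bandwidth
```

Without the bandwidth dimension, the placement above would look safe on CPU and RAM alone, which matches the kind of surprise spike the team traced back to memory latency.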
Meta also addressed challenges related to snapshot transitions, the process of updating models with new versions. By introducing a snapshot transition budget capability, the team minimised disruptions during peak traffic periods.
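The budget idea can be sketched as a cap on how many replicas may be switching snapshots at once, with a tighter cap during peak hours. The class below is a hypothetical illustration; its interface is not drawn from the article.

```python
import datetime

class SnapshotTransitionBudget:
    """Caps concurrent snapshot transitions, tightening the cap during
    peak hours so model updates do not compete with peak traffic."""

    def __init__(self, off_peak_budget=10, peak_budget=2, peak_hours=range(9, 21)):
        self.off_peak_budget = off_peak_budget
        self.peak_budget = peak_budget
        self.peak_hours = peak_hours
        self.in_flight = 0

    def budget_now(self, now=None):
        hour = (now or datetime.datetime.now()).hour
        return self.peak_budget if hour in self.peak_hours else self.off_peak_budget

    def try_start_transition(self, now=None):
        if self.in_flight >= self.budget_now(now):
            return False  # defer this snapshot transition
        self.in_flight += 1
        return True

    def finish_transition(self):
        self.in_flight = max(0, self.in_flight - 1)
```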
Cross-service load balancing was another area of focus. Meta implemented a feedback controller to adjust traffic routing percentages across different hardware types and capacity pools, achieving a better balance between service tiers.
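A minimal version of such a controller is a proportional feedback step that shifts a small share of traffic towards the cooler pool each interval. The gain and the pool utilisation figures below are illustrative assumptions, not Meta's controller parameters.

```python
def adjust_routing(route_pct, util_a, util_b, target_gap=0.0, gain=0.05):
    """Proportional feedback step: if pool A runs hotter than pool B,
    shift a little traffic away from A. `route_pct` is the share of
    traffic currently sent to pool A."""
    error = (util_a - util_b) - target_gap
    new_pct = route_pct - gain * error
    return min(1.0, max(0.0, new_pct))

# Example: pool A at 80% utilisation, pool B at 60%; drift traffic toward B.
# (In a real controller the utilisation readings would move as traffic shifts.)
pct = 0.5
for _ in range(5):
    pct = adjust_routing(pct, util_a=0.80, util_b=0.60)
print(round(pct, 3))  # routing share for pool A drops below 0.5
```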
Lastly, the team developed a predictive replica estimation system that forecasts resource usage up to two hours in advance. This proactive approach helped reduce failure rates during peak periods by ensuring adequate resources were available before they were needed.
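As a rough illustration, the sketch below pairs a naive seasonal forecast (assume load two hours from now resembles the same time yesterday) with a headroom-based replica count. The article does not describe the forecasting model behind ReplicaEstimator, so this is a stand-in under stated assumptions.

```python
import math

def replicas_needed(forecast_qps, qps_per_replica=500, headroom=0.2):
    """Translate a demand forecast into a replica count with headroom."""
    return math.ceil(forecast_qps * (1 + headroom) / qps_per_replica)

def forecast_qps(history, horizon_slots=8, slots_per_day=96):
    """Naive seasonal forecast over 15-minute slots: the value observed
    22 hours ago, i.e. the same wall-clock time as the forecast target,
    one day earlier."""
    return history[-(slots_per_day - horizon_slots) - 1]

# 96 slots = 24 hours of 15-minute samples; the spike 22 hours back stands
# in for yesterday's traffic at the target time.
history = [30_000] * 7 + [42_000] + [30_000] * 88
qps = forecast_qps(history)
print(qps, replicas_needed(qps))  # scale up before the demand arrives
```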
These optimisations have significantly improved Meta's ads model inference service, a critical component of the company's advertising system. The service handles client requests for ad placement, typically resulting in multiple model inferences per request depending on factors such as experiment setup, page type, and ad attributes.
The improvements in tail utilisation have allowed Meta to support a 35% increase in load without adding capacity while also significantly enhancing system reliability and reducing latency. Given the growing complexity and computational intensity of machine learning models used in advertising and other applications, these gains are significant.
The article concludes by discussing Meta's plans to apply these learnings to new system architectures and platforms. These include IPnext, its next-generation unified platform for managing the entire lifecycle of machine learning model deployments. As machine learning continues to play an increasingly important role in various applications, the ability to efficiently serve models at scale will remain a critical area of focus for technology companies and researchers alike.