Meta Announces Next Generation AI Hardware Platform Grand Teton

Meta recently announced Grand Teton, their next-generation hardware platform for AI training. Grand Teton features several improvements over the previous generation, including 2x the network bandwidth and 4x the host-to-GPU bandwidth.

Meta's VP of engineering Alexis Bjorlin made the announcement in a keynote presentation at the recent Open Compute Project (OCP) Global Summit. Grand Teton is the latest iteration of Meta's contribution to open-hardware designs for AI workloads in the data center. Unlike the previous generation, Zion-EX, which consists of three "boxes," Grand Teton has an integrated chassis, making it easier and faster to deploy. Meta also designed a new data center rack and cooling system to support the power needs of a fleet of these servers training large AI models. According to Bjorlin:

At Meta, we’re all-in on AI. But the future of AI won’t come from us alone. It’ll come from collaboration – the sharing of ideas and technologies through organizations like OCP. We’re eager to continue working together to build new tools and technologies to drive the future of AI. And we hope that you’ll all join us in our various efforts. Whether it’s developing new approaches to AI today or radically rethinking hardware design and software for the future, we’re excited to see what the industry has in store next.

Meta trains and deploys many large-scale AI models. These often contain trillions of parameters and require datasets of a similar magnitude. Training these models requires both data and model parallelism, which in turn means a fleet of interconnected servers with many GPUs. Meta began open-sourcing their AI hardware designs in 2016, with the Big Sur platform. Last year, InfoQ covered Meta's announcement of the previous iteration, Zion-EX, which consisted of a cluster of thousands of compute nodes, each with 4-socket CPUs and 8 GPUs.
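To make the distinction between data and model parallelism concrete, here is a minimal toy sketch in plain Python. It is not Meta's training code: the helper names (`shard`, `data_parallel_gradient`, `model_parallel_dot`) and the tiny linear model are assumptions chosen for illustration, and the "workers" are ordinary loop iterations standing in for GPUs.

```python
from typing import List


def shard(items: List[float], num_workers: int) -> List[List[float]]:
    """Split items into num_workers contiguous, near-equal shards."""
    base, extra = divmod(len(items), num_workers)
    shards, start = [], 0
    for w in range(num_workers):
        size = base + (1 if w < extra else 0)
        shards.append(items[start:start + size])
        start += size
    return shards


def data_parallel_gradient(weight: float, xs: List[float], ys: List[float],
                           num_workers: int) -> float:
    """Data parallelism (toy): every worker holds the full model (one scalar
    weight) and sees only its shard of the batch. Local gradient sums are
    added together, a stand-in for an all-reduce, then averaged."""
    total = 0.0
    for xs_w, ys_w in zip(shard(xs, num_workers), shard(ys, num_workers)):
        # Gradient of (weight * x - y)^2 with respect to weight.
        total += sum(2 * (weight * x - y) * x for x, y in zip(xs_w, ys_w))
    return total / len(xs)


def model_parallel_dot(weights: List[float], features: List[float],
                       num_workers: int) -> float:
    """Model parallelism (toy): the weight vector itself is split across
    workers; each computes a partial dot product and the partials are summed."""
    partials = [
        sum(w * f for w, f in zip(ws, fs))
        for ws, fs in zip(shard(weights, num_workers), shard(features, num_workers))
    ]
    return sum(partials)


if __name__ == "__main__":
    xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 9.0]
    print(data_parallel_gradient(2.0, xs, ys, num_workers=2))
    print(model_parallel_dot([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0, 1.0], num_workers=2))
```

In both cases the workers must exchange partial results, which is why interconnect bandwidth between GPUs and between hosts matters so much for platforms of this kind.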

Meta's contributions to open AI hardware

Image source: https://engineering.fb.com/2022/10/18/open-source/ocp-summit-2022-grand-teton/

However, each node of the Zion platform required external cabling to integrate three different components: a CPU "head", a GPU system, and a switching system. The new Grand Teton server integrates all these components into a single chassis which also includes power, compute, and network interfaces "for better overall performance, signal integrity, and thermal performance." According to NVIDIA, Grand Teton includes NVIDIA H100 Tensor Core GPUs based on the Hopper architecture. Meta also updated their underlying storage platform: Grand Canyon, the new version, improves on the efficiency of the previous Bryce Canyon architecture by allowing Meta to "push drives to their limits."

In addition to the Grand Teton design, Meta has released a data center rack design, Open Rack v3 (ORV3). Unlike racks where the power shelf is attached to a busbar, ORV3's power shelf can be installed in any location, allowing for more flexible designs. The improved battery backup can supply power for up to four minutes, compared to the previous 90 seconds. ORV3 also supports multiple power shelves and 48VDC output for deploying racks handling up to 30kW. The higher power capacities led Meta to also design new cooling strategies; ORV3 supports air-assisted liquid cooling, facility water cooling, and "an optional blind mate liquid cooling interface design."
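As a rough back-of-the-envelope illustration of what these figures imply, the sketch below computes the energy a backup unit would need to deliver and the current on a 48VDC supply. It assumes, purely for illustration, that a rack draws its full 30kW for the entire backup window and ignores conversion losses; the function names are hypothetical and nothing here comes from the ORV3 specification itself.

```python
def backup_energy_kwh(load_kw: float, duration_s: float) -> float:
    """Energy in kWh needed to carry a constant load for duration_s seconds."""
    return load_kw * duration_s / 3600.0


def supply_current_amps(load_kw: float, volts_dc: float) -> float:
    """DC current in amperes for a given load and voltage, from I = P / V."""
    return load_kw * 1000.0 / volts_dc


if __name__ == "__main__":
    rack_kw = 30.0  # assumption: rack runs at its full rated load
    print("Energy for 90 s backup (kWh):", backup_energy_kwh(rack_kw, 90))
    print("Energy for 4 min backup (kWh):", backup_energy_kwh(rack_kw, 4 * 60))
    print("Current at 48 VDC (A):", supply_current_amps(rack_kw, 48.0))
```

Under these assumptions, extending the backup window from 90 seconds to four minutes multiplies the required energy by about 2.7, and a 30kW load at 48VDC corresponds to several hundred amperes.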

A video of Bjorlin's keynote presentation is available on YouTube. Interactive 3D models of Meta's hardware designs are available at https://metainfrahardware.com.
