Meta Open-Sources 175 Billion Parameter AI Language Model OPT

Meta AI Research released Open Pre-trained Transformer (OPT-175B), a 175B-parameter AI language model. The model was trained on a dataset containing 180B tokens and exhibits performance comparable to GPT-3, while requiring only 1/7th of GPT-3's training carbon footprint.

The release was announced in a blog post written by Meta researchers Susan Zhang, Mona Diab, and Luke Zettlemoyer. To help promote open and reproducible research in AI, Meta has released not only the code and trained model weights, but also a full operational logbook documenting challenges encountered during the training process. The model is released under a non-commercial license and is intended for use by researchers "affiliated with organizations in government, civil society, and academia" as well as industry researchers. Although access to the full 175B model must be requested via an application process, smaller versions ranging from 125M to 30B parameters can be downloaded via the Hugging Face Transformers library. According to Zhang et al.:

A much broader segment of the AI community needs access to these models in order to conduct reproducible research and collectively drive the field forward. With the release of OPT-175B and smaller-scale baselines, we hope to increase the diversity of voices defining the ethical considerations of such technologies.
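As a concrete illustration of the smaller checkpoints mentioned above, the following sketch loads the 125M-parameter variant through the Hugging Face Transformers API and generates a short continuation. The prompt and generation settings are arbitrary, and the gated 175B model is not accessed this way.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load one of the smaller, publicly downloadable OPT checkpoints from the
# Hugging Face Hub (125M parameters here; larger variants follow the same
# naming scheme, e.g. "facebook/opt-1.3b", "facebook/opt-30b").
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

# Generate a short continuation with greedy decoding.
inputs = tokenizer("Open and reproducible AI research", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```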

The Transformer deep-learning architecture has become the de facto standard for language models, and researchers have achieved impressive results by increasing the size of both the models and the training datasets. Much of this research has focused on auto-regressive decoder-only models, such as GPT-3 and PaLM, which can perform as well as the average human on many natural language processing (NLP) benchmarks. Although some research organizations, such as EleutherAI, have made their trained model weights available, most commercial models are either completely inaccessible to the public or gated behind an API. This lack of access makes it difficult for researchers to gain insight into the causes of known model problems, such as toxicity and bias.
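What "auto-regressive decoder-only" means in practice can be seen in a minimal greedy-decoding loop: the model repeatedly predicts the next token conditioned only on the tokens produced so far. The sketch below uses the small public OPT checkpoint purely as a stand-in; it is not how GPT-3, PaLM, or OPT implement generation internally.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
model.eval()

# Start from a prompt and repeatedly append the most likely next token.
# Each prediction is conditioned on everything generated so far, which is
# what makes the model "auto-regressive".
input_ids = tokenizer("Large language models are", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(10):
        logits = model(input_ids).logits              # (batch, seq_len, vocab)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_id], dim=-1)

print(tokenizer.decode(input_ids[0], skip_special_tokens=True))
```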

The Meta researchers based the OPT design on GPT-3, using the architecture and hyperparameters outlined in OpenAI's research paper. For training data, the team concatenated the dataset used to train RoBERTa with the Pile and the PushShift.io Reddit dataset. After the combined dataset was cleaned and de-duplicated, the final corpus contained around 180B tokens. Using a combination of Meta's Fully Sharded Data Parallel (FSDP) tool and NVIDIA's Megatron-LM framework, the training process achieved both high throughput and energy efficiency.
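The article does not include Meta's training code, but a rough idea of what sharded data parallelism looks like can be conveyed with the FSDP implementation that has since been upstreamed into PyTorch. The model, learning rate, and launch command below are placeholders rather than OPT-175B's actual configuration, and Megatron-LM's tensor parallelism is omitted entirely.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoModelForCausalLM

# Hypothetical single-node launch, e.g. `torchrun --nproc_per_node=8 train.py`.
dist.init_process_group("nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

# Placeholder model standing in for a much larger decoder-only Transformer.
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m").cuda()

# FSDP shards parameters, gradients, and optimizer state across ranks,
# gathering full parameters only for the layers currently being computed.
model = FSDP(model)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```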

Unlike many previous research efforts, the OPT team also released a logbook, which includes notes from experimental training runs, runtime exceptions and on-call engineer responses, and a debugging playbook. The researchers also call out several adjustments made to their process during the two months of training. A "significant" number of hardware failures led to 35 training restarts and over 100 hosts being cycled. The team also made several code changes during training, including switching the optimizer from AdamW to "vanilla SGD" and back, as well as upgrading to a newer version of Megatron.
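As a hedged sketch of what an optimizer swap of that kind can look like in PyTorch, one option is simply to construct a new optimizer over the same parameters mid-run. The placeholder model and learning rates below are illustrative and are not taken from Meta's logbook.

```python
import torch

# Placeholder model, not OPT.
model = torch.nn.Linear(512, 512)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.1)

# ... training proceeds, then loss behavior prompts a change ...

# Swap to "vanilla" SGD (no momentum), carrying over the current learning
# rate; Adam's moment estimates are discarded in this simple approach.
current_lr = optimizer.param_groups[0]["lr"]
optimizer = torch.optim.SGD(model.parameters(), lr=current_lr)
```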

In a discussion of the logbook on Hacker News, one user remarked on how "hacky" the process seemed, while others pointed out that making adjustments on the fly is commonplace. Another user stated:

Even without the huge amounts of hardware/driver issues they seemed to be having with the GPUs in their big training cluster(s), this puts into perspective how hard it is to train enormous models like this. Many of the failures don't have an immediately obvious cause. Plus, there aren't all that many places out there doing training at this scale so I imagine many of these things need to get figured out on their own.

The OPT code and logbook are available on GitHub.
