Google Trains 540 Billion Parameter AI Language Model PaLM

Google Research recently announced the Pathways Language Model (PaLM), a 540-billion-parameter AI natural language processing (NLP) model that surpasses average human performance on the BIG-bench benchmark. PaLM outperforms other state-of-the-art systems on many evaluation tasks, and shows strong results on tasks such as logical inference and joke explanation.

Software engineers Sharan Narang and Aakanksha Chowdhery described PaLM in a post on the Google Research blog. The model uses an autoregressive decoder-only Transformer architecture and was trained on a cluster of 6144 TPU chips, the largest such cluster known to date, using Google's Pathways technology. When evaluated on a set of 29 NLP tasks, PaLM surpassed current records on all but one. Coupled with a new chain-of-thought prompting method for generating responses, PaLM also achieves state-of-the-art performance on several reasoning benchmarks and exhibits capabilities on two novel reasoning tasks: logical inference and explaining a joke. According to Narang and Chowdhery,

PaLM paves the way for even more capable models by combining the scaling capabilities with novel architectural choices and training schemes, and brings us closer to the Pathways vision: "Enable a single AI system to generalize across thousands or millions of tasks, to understand different types of data, and to do so with remarkable efficiency."

Language models predict the next item or token in a sequence of text, given the previous tokens; when such a model is used iteratively, with the predicted output fed back as the input, the model is termed autoregressive. Autoregressive language models based on the Transformer deep-learning architecture have set state-of-the-art performance records on many NLP tasks, and many researchers have developed very large-scale Transformer models. Training these large models can be challenging, since they are often too big to fit into the memory of a single GPU or TPU accelerator, and the large training datasets require many hours, or even days, of processing.
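The feedback loop described above can be illustrated with a toy example. The following is a minimal sketch, not PaLM's method: it swaps the Transformer for a simple bigram frequency table, and the names `train_bigram`, `predict_next`, and `generate` are hypothetical helpers invented for illustration. What it shows is only the autoregressive part: each predicted token is appended to the sequence and becomes input for the next prediction.

```python
from collections import defaultdict


def train_bigram(tokens):
    """Count how often each token follows each other token (toy stand-in for a real model)."""
    counts = defaultdict(lambda: defaultdict(int))
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def predict_next(counts, context):
    """Greedy prediction: the most frequent successor of the last token, or None if unseen."""
    successors = counts.get(context[-1])
    if not successors:
        return None
    return max(successors, key=successors.get)


def generate(counts, prompt, max_new_tokens):
    """Autoregressive loop: feed each predicted token back in as part of the next input."""
    sequence = list(prompt)
    for _ in range(max_new_tokens):
        nxt = predict_next(counts, sequence)
        if nxt is None:  # no known continuation, stop early
            break
        sequence.append(nxt)
    return sequence


corpus = "the cat sat on the mat and the cat slept".split()
counts = train_bigram(corpus)
print(generate(counts, ["the"], 4))
```

A real language model conditions on the whole preceding context rather than only the last token, and usually samples from a probability distribution instead of always taking the most likely token, but the generation loop has the same shape.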

Late last year, Google announced their plans to develop a system called Pathways, a new AI architecture designed to handle many different tasks and data types. As part of this work, they developed an orchestration layer for large-scale use of TPU accelerators. Using Pathways, the PaLM team scaled their training process to use 6144 TPUs "without needing to use any pipeline parallelism." Using Pathways also made training more efficient: based on model FLOPs utilization, PaLM training was 46.2% efficient, compared to 21.3% for GPT-3.
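To make the efficiency metric concrete: model FLOPs utilization (MFU) is the ratio of the FLOPs per second a training run actually achieves to the theoretical peak of the hardware. The sketch below is a rough illustration under stated assumptions. It uses the common approximation of about 6 FLOPs per parameter per training token, which ignores attention cost, and every input number is hypothetical, chosen for easy arithmetic rather than taken from PaLM or GPT-3. The function name `model_flops_utilization` is likewise made up for this example.

```python
def model_flops_utilization(num_params, tokens_per_second, num_chips, peak_flops_per_chip):
    """Estimate MFU as achieved model FLOPs/s divided by the hardware's peak FLOPs/s.

    Assumes roughly 6 FLOPs per parameter per token (forward plus backward pass),
    a common approximation that leaves out attention FLOPs.
    """
    flops_per_token = 6 * num_params
    achieved_flops_per_second = flops_per_token * tokens_per_second
    peak_flops_per_second = num_chips * peak_flops_per_chip
    return achieved_flops_per_second / peak_flops_per_second


# Hypothetical numbers for illustration only (not PaLM's actual figures).
mfu = model_flops_utilization(
    num_params=1e9,            # 1-billion-parameter model
    tokens_per_second=1e5,     # observed training throughput
    num_chips=8,               # accelerators in the cluster
    peak_flops_per_chip=1e14,  # theoretical peak per accelerator
)
print(f"MFU: {mfu:.1%}")
```

Because the numerator counts only the FLOPs the model mathematically requires, extra work such as recomputing activations does not raise the score, which is what makes the metric useful for comparing training setups.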

As with similar models, PaLM was first pre-trained via self-supervised learning on a large text corpus drawn from web pages, Wikipedia, books, and open-source code repositories. In addition to setting new state-of-the-art records on English-only NLP tasks and achieving "competitive" performance on multilingual tasks, PaLM achieved "outstanding" results on several text-to-code and code-to-code tasks, performing as well as or better than OpenAI's Davinci Codex API. The researchers also investigated PaLM's performance on two multi-step reasoning tasks, including explaining jokes:

Input: I tried 10,000 random restarts of my neural network, but I was accused of overfitting. I guess no good seed goes unpunished.

Model Output: This joke is a pun. A neural network is a computer program that can learn from data. A "seed" is a number that is used to initialize a random number generator. A "good seed" is a number that produces a good random number generator. The phrase "no good deed goes unpunished" means that if you do something good, you will be punished for it.

PaLM team member William Fedus shared some of his thoughts on the work in a Twitter thread. Although his reaction overall was positive, Fedus did note two areas for improvement:

While this model runs extremely efficiently on TPUv4, we're compute-inefficient based on model size. 540B parameters is too large for this compute budget...[Also] PaLM is decoder-only, but we still find that encoder-decoder models fine-tune better.

PaLM's rank on several NLP benchmark leaderboards is available on Papers with Code.

About the Author

Anthony Alford
