Google Open-Sources AI Fine-Tuning Method Distilling Step-by-Step


A team from the University of Washington and Google Research recently open-sourced Distilling Step-by-Step, a technique for fine-tuning smaller language models. Distilling Step-by-Step requires less training data than standard fine-tuning and produces smaller models that can outperform few-shot prompted large language models (LLMs) with 700x the parameters.

Although LLMs can often perform well on a wide range of tasks with few-shot prompting, hosting the models is challenging due to their memory and compute requirements. Smaller models can also perform well when fine-tuned, but that requires a manually created task-specific dataset. The key idea of Distilling Step-by-Step is to use an LLM to automatically generate a small fine-tuning dataset containing an input, an output label, and a "rationale" for why that label was chosen. The fine-tuning process trains the small model to both predict the output label and generate the rationale. When evaluated on NLP benchmarks, the small fine-tuned models outperformed the 540B PaLM model while requiring only 80% of the benchmark's fine-tuning data. According to Google:
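The two-task setup can be illustrated with a minimal sketch of the data preparation step. This is not the authors' code: the `[label]`/`[rationale]` task prefixes and the helper name are illustrative assumptions, and in the real pipeline the rationale comes from a PaLM LLM rather than being hand-written.

```python
def make_training_pairs(question, label, rationale):
    """Turn one LLM-annotated example into two training pairs:
    one teaches the small model to predict the label, the other
    teaches it to generate the rationale. The task prefixes here
    are hypothetical, not the paper's exact format."""
    return [
        ("[label] " + question, label),
        ("[rationale] " + question, rationale),
    ]

# Example usage with a hand-written rationale standing in for LLM output:
pairs = make_training_pairs(
    question="Is 17 a prime number?",
    label="yes",
    rationale="17 has no divisors other than 1 and itself, so it is prime.",
)
```

Each original example thus yields two supervised targets from a single LLM annotation pass, which is what lets the method get more signal out of less data.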

We show that distilling step-by-step reduces both the training dataset required to curate task-specific smaller models and the model size required to achieve, and even surpass, a few-shot prompted LLM’s performance. Overall, distilling step-by-step presents a resource-efficient paradigm that tackles the trade-off between model size and training data required.

Research has shown that increasing the number of parameters in an LLM can improve its performance, with current state-of-the-art models such as PaLM having hundreds of billions of parameters. However, these large models are expensive and difficult to use at inference time, as they require multiple parallel GPUs simply to hold the parameters in memory. Recent efforts have produced somewhat smaller models, such as Meta's Llama 2, that can perform nearly as well with an order of magnitude fewer parameters; however, these models are still quite large and compute-intensive.

One way to get a smaller model that performs well on a certain task is to fine-tune a smaller language model with a task-specific dataset. While this dataset might be relatively small (on the order of thousands of examples), it may still be costly and time-consuming to collect. Another option is knowledge distillation, where a large model is used as a teacher for a smaller model. InfoQ recently covered such a technique developed by Google that uses a PaLM LLM to create training datasets, producing fine-tuned models that performed comparably to LLMs 10x their size.

Distilling Step-by-Step does require a fine-tuning dataset, but it reduces the amount of data needed to create a high-performing model. The source dataset is fed to a PaLM LLM via a chain-of-thought prompt that asks the model to give the rationale for its answer. The result is a modified fine-tuning dataset that contains the original input and answer as well as the rationale. The smaller target model is fine-tuned to perform two tasks: answer the original question and generate a rationale.
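Training the small model on both tasks amounts to optimizing a combined objective. The sketch below shows one common way to do this, a weighted sum of the two losses; the weighting parameter name is an illustrative assumption, not the paper's notation.

```python
def combined_loss(label_loss, rationale_loss, rationale_weight=1.0):
    """Combine the label-prediction loss and the rationale-generation
    loss into one training objective. `rationale_weight` is a
    hypothetical hyperparameter controlling how much the rationale
    task contributes relative to the label task."""
    return label_loss + rationale_weight * rationale_loss

# Example: per-batch losses of 2.0 (label) and 1.0 (rationale),
# with the rationale task down-weighted by half.
total = combined_loss(2.0, 1.0, rationale_weight=0.5)
```

At inference time only the label task is used, so the rationale generation adds no serving cost; it acts purely as extra supervision during training.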

Google evaluated the technique on four NLP benchmarks, each of which includes a fine-tuning dataset. They used Distilling Step-by-Step to modify these datasets and fine-tune T5 models with fewer than 1B parameters. They found that their models could outperform baseline fine-tuned models while using only a fraction of the dataset, as little as 12.5% in some cases. They also found that their 770M-parameter model outperformed the 700x larger 540B-parameter PaLM on the ANLI benchmark, while needing only 80% of the fine-tuning dataset.

In a discussion about the work on X (formerly Twitter), AI entrepreneur Otto von Zastrow wrote:

These results are very strong. I would call it synthetic data generation, not distillation, and I am really curious to see what happens if you train the original LLM on this synthetic rationale per sample question.

The Distilling Step-by-Step source code and training dataset are available on GitHub. Google Cloud's Vertex AI platform also offers a private preview of the algorithm.
