Meta's Toolformer Uses APIs to Outperform GPT-3 on Zero-Shot NLP Tasks


Meta AI Research announced Toolformer, a language model that learns to call external APIs to help solve natural language processing (NLP) tasks. Toolformer automatically annotates a training dataset, which is then used to fine-tune the model; the fine-tuned model can outperform the much larger GPT-3 on several zero-shot NLP tasks.

Toolformer is based on a 6.7B-parameter pre-trained GPT-J large language model. The model is given human-written examples of API calls, with input and matching output, as prompts that are prepended to training data samples. When fed into the model, these produce annotated samples showing where API calls should be inserted to generate a result; for example, calling a calculator API to answer an arithmetic question. The model is then fine-tuned on that annotated dataset. Experiments show that by using API calls the fine-tuned model outperforms larger models, such as the 175B-parameter GPT-3, on several zero-shot NLP benchmarks. According to Meta:

Toolformer achieves substantially improved zero-shot performance across a variety of downstream tasks, often competitive with much larger models, without sacrificing its core language modeling abilities.
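
To make the annotation step concrete, the following is a minimal Python sketch of the kind of few-shot prompt used to elicit API-call annotations, loosely following the in-text call format shown in the paper. The exact wording and the build_annotation_request helper are illustrative assumptions, not Meta's actual prompt or code.

# A sketch of the few-shot annotation prompt, loosely following the format
# in the Toolformer paper (https://arxiv.org/abs/2302.04761); the wording
# and examples here are illustrative, not Meta's exact prompt.
ANNOTATION_PROMPT = """\
Your task is to add calls to a Question Answering API to a piece of text.
You can call the API by writing "[QA(question)]". Here is an example:

Input: Joe Biden was born in Scranton, Pennsylvania.
Output: Joe Biden was born in [QA("Where was Joe Biden born?")] Scranton,
[QA("In which state is Scranton?")] Pennsylvania.

Input: {text}
Output:"""

def build_annotation_request(sample: str) -> str:
    # Prepend the few-shot prompt so the model proposes positions and
    # arguments for API calls in the raw training sample.
    return ANNOTATION_PROMPT.format(text=sample)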

Large language models (LLMs) like GPT-3 have good zero-shot performance on a wide range of NLP tasks, and usually the larger the model, the better its performance. However, LLMs often struggle with some tasks, such as arithmetic, regardless of their scale. They will also give outdated answers to questions about events that occurred after they were trained; for example, "Which team does Cristiano Ronaldo play for?" Meta's solution to this problem is to teach the LLM to use external tools, or APIs, such as a web search engine or a calculator, to help with tasks where the model would otherwise perform poorly.
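
As a rough illustration of how such a tool call plays out at inference time, here is a hedged Python sketch. It assumes a hypothetical generate(prompt, stop=...) completion helper and the in-text call syntax from the paper; it is not Meta's implementation. Decoding pauses when the model emits a call, the tool runs, and the result is spliced back in before decoding resumes.

import re

ALLOWED = set("0123456789+-*/(). ")

def calculator(expr: str) -> str:
    # Toy arithmetic tool: restricted eval over digits and operators only.
    if not set(expr) <= ALLOWED:
        raise ValueError(f"unexpected characters in {expr!r}")
    return f"{eval(expr):.2f}"

def answer_with_tools(generate, prompt: str) -> str:
    # Decode until the model signals a tool call with the "→" marker
    # (generate is a hypothetical helper returning text up to the stop string).
    text = generate(prompt, stop="→")
    match = re.search(r"\[Calculator\(([^)]*)\)$", text)
    if match is None:
        return text  # no tool call; the model answered directly
    result = calculator(match.group(1))
    # Splice in the tool result and let the model finish the sentence.
    return generate(text + f"→ {result}]")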

Dataset Annotation Prompt. Image Source: https://arxiv.org/abs/2302.04761

The key idea is to use the language model to generate a training dataset for itself. This dataset is built by taking a subset of the Common Crawl dataset; each example is prepended with a prompt asking the model to add API calls and their results to the text. The researchers also developed a loss-based filter, a kind of "fitness score," for the candidate API calls: a call is kept only if inserting it along with its result makes the model's prediction of the subsequent tokens measurably better; otherwise the edit is discarded.
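
The following is a minimal sketch of that filtering criterion, assuming a Hugging Face-style causal language model. It is simplified relative to the paper, which uses a position-weighted loss, but captures the comparison: keep a call only if conditioning on the call and its result lowers the loss on the continuation by at least a threshold.

import torch
import torch.nn.functional as F

def next_token_loss(model, prefix_ids, target_ids):
    # Cross-entropy of the continuation target_ids given prefix_ids.
    input_ids = torch.cat([prefix_ids, target_ids]).unsqueeze(0)
    with torch.no_grad():
        logits = model(input_ids).logits[0, :-1]
    # Score only the continuation tokens, not the prefix itself.
    cont_logits = logits[prefix_ids.size(0) - 1:]
    return F.cross_entropy(cont_logits, target_ids).item()

def keep_api_call(model, plain, with_call, with_call_and_result,
                  continuation, tau=1.0):
    # Keep the annotation only if the executed call genuinely helps the
    # model predict what follows, by at least the margin tau.
    loss_plain = next_token_loss(model, plain, continuation)
    loss_call_only = next_token_loss(model, with_call, continuation)
    loss_full = next_token_loss(model, with_call_and_result, continuation)
    return min(loss_plain, loss_call_only) - loss_full >= tau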

Dataset Annotation Process. Image Source: https://arxiv.org/abs/2302.04761

Toolformer was trained to use five different tools: a question-answering system, a Wikipedia search engine, a machine translation system, a calculator, and a calendar. Meta conducted several experiments comparing its performance against baseline GPT-J models as well as a 66B-parameter OPT model and the 175B-parameter GPT-3 model. Toolformer outperformed the baselines on almost all tasks; the exceptions were question answering, where GPT-3 performed better, and some non-English languages in multilingual question answering, where the baseline GPT-J did. The researchers attribute the latter to Toolformer's fine-tuning on the annotated English-only dataset.
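
One way to picture those five tools is as a dispatch table mapping the call name parsed from the generated text to a backend. The sketch below is an illustrative stub, with the paper's heavyweight backends (for example, a question-answering model, retrieval over a Wikipedia dump, and a translation model) replaced by placeholders; only the calculator and calendar entries actually run.

from datetime import date

def _stub(name: str):
    # Placeholder for a backend this sketch does not wire up.
    def call(argument: str) -> str:
        raise NotImplementedError(f"{name} backend not available in this sketch")
    return call

TOOLS = {
    "QA": _stub("QA"),                  # question-answering model
    "WikiSearch": _stub("WikiSearch"),  # search over a Wikipedia dump
    "MT": _stub("MT"),                  # translate a phrase into English
    "Calculator": lambda expr: f"{eval(expr):.2f}",  # basic arithmetic only
    "Calendar": lambda _: f"Today is {date.today():%A, %B %d, %Y}.",
}

def execute(name: str, argument: str) -> str:
    # Route a parsed call such as QA("...") to the matching tool.
    return TOOLS[name](argument)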

AI developer Jay Hack reviewed the Toolformer paper in a Twitter thread, pointing out:

The authors haven't even tried doing this training process iteratively yet! You could use a pre-trained toolformer to bootstrap an even more comprehensive dataset, with more complex usages of APIs etc., then repeat. Major potential upside. Someone should do this.

Although Meta has not released its code for Toolformer, independent AI developers Phil Wang and Enrico Shippole have each open-sourced their own implementations based on Meta's paper.
