UC Berkeley Researchers Open-Source API-Calling Language Model Gorilla

Researchers from UC Berkeley and Microsoft Research have open-sourced Gorilla, a large language model (LLM) that can write code to call APIs. In experiments measuring generated code accuracy, Gorilla outperforms several baseline models, including GPT-4.

Described as "an API appstore for LLMs," Gorilla is based on the LLaMA open-source LLM. The LLM is fine-tuned on APIBench, a new dataset of API descriptions of ML models hosted on HuggingFace, TorchHub, and TensorHub. Gorilla can also call out to an external document database of API definitions, which gives it access to new APIs without re-training. Using Gorilla, developers can write a natural language description of a problem, such as "Invoke an image classification model that uses less than 10M parameters, but maintains an ImageNet accuracy of at least 70%." Gorilla would then output the Python code to invoke the appropriate ML model with the proper options. According to the authors:

LLMs are swiftly gaining popularity across diverse domains. In our study, we spotlight techniques designed to enhance the LLM’s ability to accurately identify the appropriate API for a specific task—a significant but often overlooked aspect in the advancement of this technology. Since APIs function as a universal language enabling diverse systems to communicate effectively, their correct usage can boost the ability of LLMs to interact with tools in the wider world.
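
The retrieval idea described above, in which the model consults an external database of API definitions at inference time, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the document entries, the keyword-overlap scoring, and the prompt wording are hypothetical and are not taken from the Gorilla codebase, which may use a different retriever and prompt format.

```python
import json
import re

# Hypothetical API documentation store; names and descriptions are
# illustrative placeholders, not real APIBench entries.
API_DOCS = [
    {"name": "image_classifier_small",
     "description": "image classification model with few parameters for ImageNet"},
    {"name": "speech_to_text",
     "description": "transcribe speech audio to text"},
    {"name": "text_summarizer",
     "description": "summarize long text documents"},
]


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, docs, k=1):
    """Return the k docs whose descriptions share the most tokens with the query."""
    query_tokens = tokenize(query)
    ranked = sorted(
        docs,
        key=lambda doc: len(query_tokens & tokenize(doc["description"])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, docs):
    """Append the best-matching API document to the user's request."""
    best = retrieve(query, docs, k=1)[0]
    return f"{query}\nUse this API documentation for reference: {json.dumps(best)}"


query = "Invoke an image classification model that uses less than 10M parameters"
prompt = build_prompt(query, API_DOCS)
```

Because the document store sits outside the model, adding a new entry to `API_DOCS` makes that API available to the prompt without any re-training, which is the property the article attributes to Gorilla's retrieval mode.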

LLMs like GPT-4 perform well on a wide range of tasks, including code generation. However, their knowledge of APIs is "frozen" at training time, so they cannot generate code to call newer APIs. Further, they often hallucinate: in the case of code generation, they might output a call to an API that does not exist. InfoQ has covered several recent efforts to address these issues, including Meta's Toolformer, which can invoke external service APIs, and ChatGPT's plugin system, which augments the LLM with external resources.

The Berkeley team points out, however, that these approaches are based on prompting the LLM with examples of API calls. By contrast, the Gorilla approach focuses on "systematic evaluation and building a pipeline for future use." The researchers began by assembling the APIBench dataset. The team first collected all the model cards from the HuggingFace model hub, PyTorch hub, and TensorFlow hub. After filtering, this produced a collection of 1,645 API calls. For each of those, the researchers used GPT-4 to generate instruction-API pairs, which together form the dataset used to fine-tune Gorilla.
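
As a rough illustration of what an instruction-API pair might look like when stored for fine-tuning, the sketch below serializes records to JSON Lines. The field names (`instruction`, `api_call`, `provider`) and the sample record are hypothetical and are not the actual APIBench schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class InstructionApiPair:
    """One training example: a natural language request and the API call that answers it."""
    instruction: str  # the natural language task description
    api_call: str     # the target API call, kept as source text
    provider: str     # which model hub the API comes from


def to_jsonl(pairs):
    """Serialize the pairs as JSON Lines, one record per line."""
    return "\n".join(json.dumps(asdict(pair)) for pair in pairs)


# Illustrative record only; not copied from the dataset.
pairs = [
    InstructionApiPair(
        instruction="Classify the objects in my photo collection.",
        api_call="torch.hub.load('pytorch/vision', 'resnet18', pretrained=True)",
        provider="TorchHub",
    ),
]

jsonl_text = to_jsonl(pairs)
```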

A major challenge in evaluating Gorilla's output was identifying hallucinations. The team defined a hallucination as any model output that calls an API not in the model's external database of API definitions. This is contrasted with an error, which is an output that simply calls a "real" API incorrectly. For evaluation, the team used the abstract syntax tree (AST) of the generated code to match it against the APIs in the database and the test set. Using this AST accuracy metric on zero-shot tasks, Gorilla performed 20.43% better than GPT-4.
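
The distinction between a correct call, an error, and a hallucination can be sketched with Python's standard `ast` module. This is a simplified illustration under stated assumptions, not the researchers' evaluation code: the database contents are hypothetical, only the function name and positional constant arguments are compared, keyword arguments are ignored, and unparseable output is counted as a hallucination.

```python
import ast


def dotted_name(node):
    """Rebuild a dotted name such as 'torch.hub.load' from an AST node, or return None."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def call_signature(code):
    """Return (function name, positional constant args) for the first call found, else None."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return None
    for node in ast.walk(tree):  # breadth-first, so outer calls are seen first
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name is not None:
                args = tuple(a.value for a in node.args if isinstance(a, ast.Constant))
                return (name, args)
    return None


def classify(code, reference, database):
    """Label generated code as 'correct', 'error', or 'hallucination'."""
    signature = call_signature(code)
    if signature == reference:
        return "correct"
    if signature in database:
        return "error"  # a real API from the database, but not the expected one
    return "hallucination"  # no database entry matches (assumption: includes unparseable code)


# Hypothetical database of known API calls.
DATABASE = {
    ("torch.hub.load", ("pytorch/vision", "resnet18")),
    ("torch.hub.load", ("pytorch/vision", "densenet121")),
}
REFERENCE = ("torch.hub.load", ("pytorch/vision", "resnet18"))
```

In this sketch, `classify("m = torch.hub.load('pytorch/vision', 'densenet121')", REFERENCE, DATABASE)` is labeled an error because the call exists in the database but is not the expected one, while a call naming a model absent from the database is labeled a hallucination.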

Gorilla's lead author Shishir Patil joined a Hacker News discussion about the work, answering several questions. When asked whether the model's license allowed commercial use, Patil pointed out that there are three versions of Gorilla. The one based on LLaMA is not licensed for commercial use, but the ones based on MPT-7B and Falcon-7B are. Another user asked how Gorilla compared to LangChain; Patil replied:

Langchain is a terrific project that tries to teach agents how to use tools using prompting. Our take on this is that prompting is not scalable if you want to pick between 1000s of APIs. So Gorilla is a LLM that can pick and write the semantically and syntactically correct API for you to call! A drop in replacement into Langchain!

The Gorilla code and model files are available on GitHub. There is also a Google Colab notebook demo of the model.

About the Author

Anthony Alford
