Roland Meertens on the Unreasonable Effectiveness of Zero Shot Learning

At the recent QCon Plus online conference, Roland Meertens gave a talk on developing AI-based applications titled "The Unreasonable Effectiveness of Zero Shot Learning." He demonstrated two examples of using foundation models and zero shot learning to rapidly deploy prototype applications and gain feedback without needing to gather large datasets and train models.

Meertens, a product manager at Annotell, began by describing the typical development cycle for an AI-based app: collecting and annotating data, training and evaluating a model, and deploying the model, all before any users can provide feedback on the app. In most cases, user feedback will refine the requirements, and the process must be repeated. Meertens proposed shortening this cycle by using so-called foundation models (powerful pre-trained models that can handle many downstream tasks) to provide the AI functionality in the app. Developers can call a foundation model from their apps with only a few lines of code, which lets them deploy and iterate quickly by taking advantage of the model's ability to perform zero shot learning; that is, to perform a task without being trained to do it. Meertens demonstrated the use of two of these foundation models, GPT-3 and CLIP, to create two prototype apps: one for identifying the language of song lyrics and another for identifying fruits and vegetables from images.

Meertens described a common scenario that occurs during development of an AI-based application. Developers begin by gathering a training dataset; for example, if the app is intended to identify objects in images, the dataset will contain many hundreds or thousands of example images. Next, the images must be annotated or labelled by hand to indicate the objects they represent. This expensive and time-consuming process is followed by training an AI model, or sometimes several models in order to find the one that performs best, which adds further time and cost. Developers should also expect requirements to change during the software development lifecycle, which means more training data must be collected and new models trained.

As an alternative to this approach, Meertens suggested using foundation models: models which are "trained on broad data at scale and are adaptable to a wide range of downstream tasks." The term was coined by researchers at Stanford University, who earlier this year launched the Center for Research on Foundation Models to "make fundamental advances in the study, development, and deployment of foundation models." Because foundation models have already been trained on extremely large datasets, they can be used "out of the box" in many applications. Although some of these models are available to download and run locally, others are too large to fit in the memory of a typical developer workstation and must be invoked using a web API. Meertens focused his presentation on two of these models, both of them created by OpenAI: GPT-3 and CLIP.

GPT-3 is a well-known generative model for natural language processing (NLP); Meertens described it as essentially an autocompletion algorithm. Given a sequence of tokens as input (for example, a series of words), GPT-3 will output its prediction for the next tokens in the sequence. Meertens noted that many tasks can be framed as completing a text string; for example, a simple arithmetic problem can be defined as completing a string: the string "2 + 2 =" would be completed with the string "4." Because GPT-3 was trained on such a large amount of data, it can actually produce the correct output for many of these tasks. Meertens joked that GPT-3 could do "anything that a bored high-schooler can do on a test."

Meertens then demonstrated that GPT-3 can identify a song lyric's language with zero-shot learning simply by "prompting" GPT-3 with the phrase "this is a program which determines which language song lyrics are written in," followed by the lyric itself. GPT-3 then outputs the language. Meertens also provided several "pro-tips" for using GPT-3. The first was to give GPT-3 a set of possible outputs; for example, for song lyrics, list all the possible languages (e.g., French, German, English, etc.). Another was to use "few shot" learning if possible; that is, to give GPT-3 one or more example inputs *and* desired outputs. He also pointed out that GPT-3 can return multiple "autocompleted" outputs for a single input and recommended using a heuristic to choose the best one.
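The prompting tips above can be illustrated with a short Python sketch. This is not Meertens's demo code: the function names `build_prompt` and `pick_best`, the exact prompt layout, and the language list are hypothetical, and the completions are passed in as plain strings rather than fetched from any API. Majority voting over valid answers is just one possible heuristic for choosing among multiple outputs; the talk did not specify which heuristic to use.

```python
from collections import Counter

# Hypothetical label set; the talk suggests listing the possible outputs.
LANGUAGES = ["English", "French", "German"]

INSTRUCTION = (
    "This is a program which determines which language "
    "song lyrics are written in."
)


def build_prompt(lyric, languages=LANGUAGES, examples=()):
    """Assemble a completion prompt: instruction, allowed outputs,
    optional few-shot (lyric, language) examples, then the lyric to classify."""
    lines = [INSTRUCTION, "Possible languages: " + ", ".join(languages) + ".", ""]
    for example_lyric, example_language in examples:
        lines += ["Lyrics: " + example_lyric, "Language: " + example_language, ""]
    # End on an open "Language:" line so the model completes it.
    lines += ["Lyrics: " + lyric, "Language:"]
    return "\n".join(lines)


def pick_best(completions, languages=LANGUAGES):
    """Assumed heuristic: keep completions whose first word is an allowed
    language, then return the most frequent one (None if none are valid)."""
    allowed = {language.lower(): language for language in languages}
    votes = Counter()
    for text in completions:
        words = text.strip().split()
        if not words:
            continue
        key = words[0].strip(".,;:!?\"'").lower()
        if key in allowed:
            votes[allowed[key]] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]
```

Passing an empty `examples` tuple gives the zero-shot prompt, while supplying one or more pairs turns the same prompt into a few-shot one. The string returned by `build_prompt` would be sent to the model as the text to complete.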

Next, Meertens demonstrated how to use OpenAI's CLIP model to recognize fruits and vegetables in an image. CLIP is a deep-learning model that combines GPT's natural language capabilities with computer vision (CV); it was trained on a dataset of images paired with text scraped from the internet. CLIP maps both images and text into an embedding space, with images and text that were paired in its training data mapped to similar values. Meertens showed how, with a few lines of code, CLIP can be used as an image classifier. When given an image along with a list of possible text descriptions, CLIP will output the text description that best matches the image. For example, when shown an image of an apple and given a list of fruits and vegetables such as "apple," "carrot," and "grapes," CLIP outputs "apple."
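The matching step can be sketched in plain Python. This is a simplified illustration of classifying by embedding similarity, not the CLIP library itself: the three-dimensional vectors below are made-up stand-ins for the embeddings a real image and text encoder would produce, and the function names and the `scale` temperature are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(image_embedding, label_embeddings, scale=100.0):
    """Return the label whose text embedding is closest to the image
    embedding, together with its softmax probability.
    `scale` is an assumed temperature that sharpens the distribution."""
    labels = list(label_embeddings)
    sims = [cosine_similarity(image_embedding, label_embeddings[name])
            for name in labels]
    probs = softmax([scale * s for s in sims])
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs[best]


# Toy vectors standing in for encoder outputs (hypothetical values).
LABEL_EMBEDDINGS = {
    "apple": [0.9, 0.1, 0.0],
    "carrot": [0.1, 0.9, 0.1],
    "grapes": [0.0, 0.2, 0.9],
}
IMAGE_EMBEDDING = [0.8, 0.2, 0.1]  # pretend this came from a photo of an apple

label, confidence = classify(IMAGE_EMBEDDING, LABEL_EMBEDDINGS)
```

Because the labels are just text, changing what the classifier recognizes only requires editing the list of descriptions; no retraining is involved, which is what makes this approach useful for quick prototypes.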

Meertens concluded his talk by answering several questions from the audience. Several attendees asked about the range of objects recognized by CLIP; in response, Meertens suggested using CLIP as a starting point and saving the images where CLIP was wrong or had low confidence, in order to build a training dataset for a more app-specific model. When asked about the cost of using foundation models, he said that GPT-3 does have a cost to use and that the cost varies by model size: the larger the model, the higher the cost. The larger models perform better, but the smaller models may work well enough for some applications, such as identifying a song's language.
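The suggestion to save images that CLIP got wrong or was unsure about could look something like the following sketch. The `Prediction` record, the `corrected_label` field for user feedback, and the 0.6 confidence threshold are all hypothetical choices made for illustration; the talk did not prescribe a particular mechanism.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Prediction:
    image_path: str
    label: str                  # what the zero-shot classifier answered
    confidence: float           # probability assigned to that answer
    corrected_label: Optional[str] = None  # set if a user reported the right answer


def needs_review(prediction: Prediction, threshold: float = 0.6) -> bool:
    """True if the prediction was reported wrong or was low confidence."""
    wrong = (
        prediction.corrected_label is not None
        and prediction.corrected_label != prediction.label
    )
    return wrong or prediction.confidence < threshold


def select_for_annotation(
    predictions: List[Prediction], threshold: float = 0.6
) -> List[Prediction]:
    """Pick the predictions worth saving for a future training dataset."""
    return [p for p in predictions if needs_review(p, threshold)]
```

Images selected this way are the cases where the prototype struggles, so they are likely to be the most useful examples to annotate when training an app-specific model later.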

Meertens's demo app code is available on GitHub.
