Google's New Imagen AI Outperforms DALL-E on Text-to-Image Generation Benchmarks

Researchers from Google's Brain Team have announced Imagen, a text-to-image AI model that can generate photorealistic images of a scene given a textual description. Imagen outperforms DALL-E 2 on the COCO benchmark and, unlike many similar models, uses a text encoder that is pre-trained only on text data.

The model and several experiments were described in a paper published on arXiv. Imagen uses a Transformer language model to convert the input text into a sequence of embedding vectors. A series of three diffusion models then converts the embeddings into a 1024x1024 pixel image. As part of their work, the team developed an improved diffusion model architecture called Efficient U-Net, as well as a new benchmark suite for text-to-image models called DrawBench. On the COCO benchmark, Imagen achieved a zero-shot Fréchet Inception Distance (FID) score of 7.27 (lower is better), outperforming DALL-E 2, the previous best-performing model. The researchers also discussed the potential societal impact of their work, noting:

Our primary aim with Imagen is to advance research on generative methods, using text-to-image synthesis as a test bed. While end-user applications of generative methods remain largely out of scope, we recognize the potential downstream applications of this research are varied and may impact society in complex ways... In future work we will explore a framework for responsible externalization that balances the value of external auditing with the risks of unrestricted open-access.
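For readers unfamiliar with the FID metric quoted above: it measures the Fréchet distance between two Gaussian distributions, one fitted to features of real images and one fitted to features of generated images, with lower values indicating a closer match. The real metric works on high-dimensional features from an Inception network; the sketch below is only a one-dimensional illustration of the same distance, and the function name and sample values are made up for this example.

```python
import statistics


def frechet_distance_1d(real, generated):
    """Squared Fréchet distance between two 1-D Gaussians fitted to the samples.

    For univariate Gaussians the distance reduces to
    (mu_r - mu_g)^2 + (sigma_r - sigma_g)^2.
    """
    mu_r, mu_g = statistics.fmean(real), statistics.fmean(generated)
    # Population standard deviation, chosen here only to keep the toy simple.
    sigma_r, sigma_g = statistics.pstdev(real), statistics.pstdev(generated)
    return (mu_r - mu_g) ** 2 + (sigma_r - sigma_g) ** 2


# Made-up scalar "features"; a real evaluation would use Inception activations.
real_features = [0.0, 2.0]
generated_features = [1.0, 3.0]
print(frechet_distance_1d(real_features, generated_features))
```

Identical sample sets give a distance of zero, which is why a lower FID is read as generated images being statistically closer to real ones.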

In recent years, several researchers have investigated training multimodal AI models: systems that operate on different types of data, such as text and images. In 2021, OpenAI announced CLIP, a deep-learning model that can map both text and images into the same embedding space, allowing users to tell whether a textual description is a good match for a given image. This model has proven effective at many computer-vision tasks, and OpenAI also used it to create DALL-E, a model that can generate realistic-looking images from text descriptions. CLIP and similar models were trained on datasets of image-text pairs scraped from the internet, similar to the LAION-5B dataset that InfoQ reported on earlier this year.
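The idea of matching text to images in a shared embedding space can be sketched in a few lines. This is a toy illustration, not CLIP's actual API: the three-dimensional vectors and the helper names are invented for the example, and cosine similarity is assumed as the measure of closeness.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_caption(image_embedding, caption_embeddings):
    """Return the caption whose embedding is most similar to the image's."""
    return max(
        caption_embeddings,
        key=lambda caption: cosine_similarity(
            image_embedding, caption_embeddings[caption]
        ),
    )


# Made-up embeddings; a real model would compute these with its
# image encoder and text encoder.
image_embedding = [0.9, 0.1, 0.2]
caption_embeddings = {
    "a photo of a corgi": [0.8, 0.2, 0.1],
    "a plate of sushi": [0.1, 0.9, 0.3],
}
print(best_caption(image_embedding, caption_embeddings))
```

Because both modalities land in the same space, the same comparison works in either direction: ranking captions for an image or ranking images for a caption.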

Rather than training a text encoder on an image-text dataset, the Google team simply used an "off-the-shelf" text encoder, T5, to convert input text into embeddings. To convert the embeddings into an image, Imagen uses a sequence of diffusion models. These generative AI models use an iterative denoising process to convert Gaussian noise into samples from a data distribution: in this case, images. The denoising is conditioned on some input. For the first diffusion model, the condition is the input text embedding; this model outputs a 64x64 pixel image. This image is then up-sampled by passing it through two "super-resolution" diffusion models, which increase the resolution to 1024x1024. For these models, Google developed a new deep-learning architecture called Efficient U-Net, which is "simpler, converges faster, and is more memory efficient" than previous U-Net implementations.
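The conditioned, iterative denoising described above can be illustrated with a deliberately simplified loop. This is a toy refinement procedure, not the sampler or noise schedule used by Imagen: `toy_denoiser` is a hypothetical stand-in for a trained network, and the "condition" is simply the target values themselves rather than a text embedding.

```python
import random


def toy_denoiser(noisy, condition):
    """Hypothetical stand-in for a trained network that predicts the clean sample.

    A real diffusion model learns this prediction from data; here the
    condition already contains the answer, so the prediction is exact.
    """
    return list(condition)


def sample(condition, steps=50, seed=0):
    """Start from Gaussian noise and refine it step by step."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in condition]  # pure Gaussian noise
    for step in range(steps, 0, -1):
        predicted_clean = toy_denoiser(x, condition)
        blend = 1.0 / step                # the final step lands on the prediction
        noise_scale = (step - 1) / steps  # injected noise shrinks to zero
        x = [
            xi + blend * (pi - xi) + 0.1 * noise_scale * rng.gauss(0.0, 1.0)
            for xi, pi in zip(x, predicted_clean)
        ]
    return x


target = [0.2, -0.5, 1.0, 0.0]
result = sample(condition=target)
print([round(v, 3) for v in result])
```

In a cascade like the one the article describes, the output of one such sampling stage would become part of the condition for the next, higher-resolution stage.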

Image generated by Imagen from the prompt "A cute corgi lives in a house made out of sushi" - image source: https://imagen.research.google

In addition to evaluating Imagen on the COCO validation set, the researchers developed a new image-generation benchmark, DrawBench. The benchmark consists of a collection of text prompts that are "designed to probe different semantic properties of models," including composition, cardinality, and spatial relations. DrawBench uses human evaluators to compare two different models. First, each model generates images from the prompts. Then, the evaluators compare the results from the two, indicating which model produced the better image. Using DrawBench, the Brain team evaluated Imagen against DALL-E 2 and three other similar models; the team found that the judges "exceedingly" preferred the images generated by Imagen over the other models.
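A pairwise human evaluation of this kind boils down to counting which model each evaluator preferred. The snippet below is a minimal sketch of that tally; the function name, the model labels, and the votes are invented for illustration and are not results from the paper.

```python
from collections import Counter


def preference_rates(votes):
    """Fraction of votes won by each model in a pairwise comparison."""
    if not votes:
        return {}
    counts = Counter(votes)
    total = len(votes)
    return {model: count / total for model, count in counts.items()}


# Made-up votes: each entry records which model's image one evaluator
# preferred for one prompt.
votes = ["model_a", "model_a", "model_b", "model_a", "model_a"]
rates = preference_rates(votes)
print(rates)
```

Grouping the votes by prompt category (for example, cardinality or spatial relations) before tallying would show where one model is preferred over the other, rather than only the overall rate.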

On Twitter, Google product manager Sharon Zhou discussed the work, noting that:

As always, [the] conclusion is that we need to keep scaling up [large language models]

In another thread, Google Brain team lead Douglas Eck posted a series of images generated by Imagen, all from variations on a single prompt; Eck modified the prompt by adding words to adjust the style, lighting, and other aspects of the image. Several other example images generated by Imagen can be found on the Imagen project site.

About the Author

Anthony Alford
