Berkeley Open-Sources AI Image-Editing Model InstructPix2Pix

Researchers from the Berkeley Artificial Intelligence Research (BAIR) Lab have open-sourced InstructPix2Pix, a deep-learning model that follows human instructions to edit images. InstructPix2Pix was trained on synthetic data and outperforms a baseline AI image-editing model.

The BAIR team presented their work at the recent IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023. They first generated a synthetic training dataset, where the training examples are pairs of images along with an editing instruction to convert the first image to the second. This dataset was used to train an image generation diffusion model. The result is a model that can, given a source image, accept text-based instructions on how to edit the image; for example, given an image of a person riding a horse and the prompt "Have her ride a dragon," it will output the original image with the horse replaced by a dragon. According to the BAIR researchers:

Despite being trained entirely on synthetic examples, our model achieves zero-shot generalization to both arbitrary real images and natural human-written instructions. Our model enables intuitive image editing that can follow human instructions to perform a diverse collection of edits: replacing objects, changing the style of an image, changing the setting, the artistic medium, among others.
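The released model can be run locally through the Hugging Face diffusers library. The snippet below is a minimal sketch of that workflow, using the horse-to-dragon example above; the checkpoint name, file names, and guidance-scale values are assumptions based on the public release rather than details taken from the paper.

```python
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from PIL import Image

# Load the publicly released InstructPix2Pix checkpoint (name assumed).
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

# Hypothetical source image: a photo of a person riding a horse.
source = Image.open("horse_rider.jpg").convert("RGB")

# Apply a natural-language edit instruction.
edited = pipe(
    "Have her ride a dragon",
    image=source,
    num_inference_steps=50,
    guidance_scale=7.5,        # how strongly to follow the instruction
    image_guidance_scale=1.5,  # how closely to preserve the source image
).images[0]

edited.save("dragon_rider.jpg")
```

In this pipeline, guidance_scale controls how strongly the edit instruction is followed, while image_guidance_scale controls how faithfully the output preserves the original image.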

Earlier AI approaches to image editing have often been based on style transfer, and popular text-to-image generation models such as DALL-E and Stable Diffusion also support image-to-image style transfer operations; however, targeted editing with these models is challenging. More recently, InfoQ covered Microsoft's Visual ChatGPT, which can invoke external tools to edit images given a textual description of the desired edit.

To train InstructPix2Pix, BAIR first created a synthetic dataset. To do this, the team fine-tuned GPT-3 on a small dataset of human-written examples, each consisting of an input caption, an editing instruction, and a desired output caption. The fine-tuned model was then given a large dataset of input image captions, from which it generated over 450k editing instructions and output captions. The team then fed the input and output caption pairs to a pre-trained Prompt-to-Prompt model, which generated pairs of similar images based on the captions.
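In outline, the data-generation pipeline chains a text-editing stage to an image-generation stage. The sketch below is illustrative only: the EditTriplet structure, the propose_edit and paired_sample interfaces, and the example captions are hypothetical stand-ins for the fine-tuned GPT-3 model and the Prompt-to-Prompt model described above.

```python
from dataclasses import dataclass

@dataclass
class EditTriplet:
    input_caption: str   # e.g. "photograph of a girl riding a horse"
    instruction: str     # e.g. "have her ride a dragon"
    output_caption: str  # e.g. "photograph of a girl riding a dragon"

def generate_text_edits(input_captions, language_model):
    """Stage 1: a fine-tuned language model proposes an editing instruction
    and an edited caption for each input caption (hypothetical interface)."""
    triplets = []
    for caption in input_captions:
        instruction, output_caption = language_model.propose_edit(caption)
        triplets.append(EditTriplet(caption, instruction, output_caption))
    return triplets

def generate_image_pairs(triplets, prompt_to_prompt):
    """Stage 2: a Prompt-to-Prompt model renders each caption pair as two
    similar images that differ only in the edited content (hypothetical)."""
    for t in triplets:
        before, after = prompt_to_prompt.paired_sample(
            t.input_caption, t.output_caption
        )
        yield before, after, t.instruction  # one training example
```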

InstructPix2Pix architecture. Image source: https://arxiv.org/abs/2211.09800

Given this dataset, the researchers trained InstructPix2Pix, which is based on Stable Diffusion. To evaluate its performance, the team compared its output with that of a baseline model, SDEdit, on the tradeoff between two metrics: consistency, the cosine similarity between the CLIP embeddings of the input image and the edited image; and directional similarity, which measures how well the change from the input caption to the edited caption agrees with the change from the input image to the edited image. In their experiments, for a given level of directional similarity, InstructPix2Pix produced more consistent images than SDEdit did.
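Both metrics are defined in terms of CLIP embeddings: consistency compares the two images directly, while directional similarity compares the change between the images with the change between the captions. The following is a minimal sketch of those definitions using the transformers CLIP model; the checkpoint name is an assumption, and this is not the authors' evaluation code.

```python
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def embed_image(image):
    # L2-normalized CLIP image embedding
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        return F.normalize(model.get_image_features(**inputs), dim=-1)

def embed_text(text):
    # L2-normalized CLIP text embedding
    inputs = processor(text=[text], return_tensors="pt", padding=True)
    with torch.no_grad():
        return F.normalize(model.get_text_features(**inputs), dim=-1)

def consistency(input_image, edited_image):
    """Cosine similarity between CLIP embeddings of the two images."""
    return F.cosine_similarity(embed_image(input_image),
                               embed_image(edited_image)).item()

def directional_similarity(input_image, edited_image,
                           input_caption, edited_caption):
    """Agreement between the change in image space and caption space."""
    image_delta = embed_image(edited_image) - embed_image(input_image)
    text_delta = embed_text(edited_caption) - embed_text(input_caption)
    return F.cosine_similarity(image_delta, text_delta).item()
```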

In his deep-learning newsletter The Batch, AI researcher Andrew Ng commented on InstructPix2Pix:

This work simplifies — and provides more coherent results when — revising both generated and human-made images. Clever use of pre-existing models enabled the authors to train their model on a new task using a relatively small number of human-labeled examples.

The InstructPix2Pix code is available on GitHub. The model and a web-based demo are available on Hugging Face.
