
Meta Announces Generative AI Models Emu Video and Emu Edit

Meta AI Research announced two new generative AI models: Emu Video, which can generate short videos given a text prompt, and Emu Edit, which can edit images given text-based instructions. Both models are based on Meta's Emu foundation model and exhibit state-of-the-art performance on several benchmarks.

Emu Video uses a factorized, or two-step, approach to video generation: it first generates an image from the text prompt, then generates a video conditioned on both the prompt and the generated image. Both steps use a single fine-tuned Emu diffusion model, unlike previous methods such as Make-A-Video, which use a pipeline of distinct models. Emu Edit is also based on the Emu diffusion model but adds a task-embedding layer that converts the text instruction into an additional conditioning vector.

Both Emu Video and Emu Edit were evaluated by human judges, who rated the models' outputs on image quality and faithfulness to the instructions. Both models outperformed baseline models a majority of the time; Emu Video, for example, won 91.8% of the time on quality and 86.6% on faithfulness. According to Meta,
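The factorized approach can be sketched as two calls to the same conditioned generator. The sketch below is purely structural: a random-noise stand-in replaces the real diffusion model, and all function names, sizes, and signatures are hypothetical, not Meta's actual API.

```python
# Structural sketch of Emu Video's two-step (factorized) generation.
# "diffusion_step" is a hypothetical stand-in for the single fine-tuned
# Emu diffusion model; it returns random noise instead of real images.
import random

FPS = 16      # frames per second of the generated clip
SECONDS = 4   # clip length reported for Emu Video


def _fake_frame():
    """Tiny 4x4 noise grid standing in for a 512x512 image."""
    return [[random.random() for _ in range(4)] for _ in range(4)]


def diffusion_step(text, image=None):
    """One pass of the (stand-in) diffusion model.

    With text only, it produces a single starting frame (step 1).
    With text AND an image, it produces the full sequence of frames
    conditioned on both inputs (step 2).
    """
    if image is None:
        return _fake_frame()
    return [image] + [_fake_frame() for _ in range(FPS * SECONDS - 1)]


def generate_video(prompt):
    """Factorized generation: text -> first frame -> full video."""
    first_frame = diffusion_step(prompt)               # step 1
    return diffusion_step(prompt, image=first_frame)   # step 2


clip = generate_video("a corgi surfing a wave")
# 4 seconds at 16 fps yields 64 frames
```

The point of the factorization is that both steps reuse one model: the second call simply receives an extra conditioning input (the generated first frame) rather than routing through a separate model in a pipeline.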

While certainly no replacement for professional artists and animators, Emu Video, Emu Edit, and new technologies like them could help people express themselves in new ways—from an art director ideating on a new concept or a creator livening up their latest reel to a best friend sharing a unique birthday greeting. And we think that’s something worth celebrating.

The Emu foundation model was announced earlier this year at the Meta Connect event. It is a latent diffusion model pre-trained on over 1 billion image-text pairs, then fine-tuned on "a few thousand carefully selected high-quality images." Emu can generate "highly visually appealing" images, with human judges preferring its output to that of Stable Diffusion XL over 70% of the time.

To create Emu Video, the researchers further fine-tuned an Emu foundation model on a dataset of 34 million video-text pairs; the model learned to predict several future video frames given an initial frame image. The resulting model can produce four-second videos at 512x512 pixels and 16 fps. In addition to text-to-video, the model can generate a video from a user-supplied image; on this task, its output was preferred 96% of the time over that of the baseline VideoComposer model.

To train Emu Edit, the Meta team created a synthetic dataset of 10 million samples. Each sample consists of an input image, a textual instruction, a desired output image, and a task index. The index indicates which one of sixteen predefined tasks the instruction represents, such as removing an object or changing the image style. During training, the model learns an embedding for each task. The model can learn a new task by fine-tuning the embedding layer on just a "handful" of new examples.
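The task-index mechanism amounts to a small lookup table of learned vectors that is supplied to the model as extra conditioning alongside the instruction text. Below is a minimal sketch of that idea; the variable names, embedding size, and dictionary format are illustrative assumptions, not Meta's implementation.

```python
# Sketch of per-task conditioning as described above: one learned
# embedding vector per predefined edit task. The vectors here are just
# random initializations; in the real model they are trained weights.
import random

NUM_TASKS = 16   # sixteen predefined tasks (e.g., remove object, change style)
EMBED_DIM = 8    # illustrative only; real embeddings are far larger

# One learnable vector per task index, randomly initialized.
task_embeddings = [
    [random.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
    for _ in range(NUM_TASKS)
]


def edit_conditioning(instruction, task_index):
    """Pair the text instruction with the task's embedding vector.

    The embedding tells the diffusion model WHICH kind of edit the
    instruction describes; fine-tuning only this table on a few
    examples is how a new task could be added cheaply.
    """
    return {
        "text": instruction,
        "task_vector": task_embeddings[task_index],
    }


cond = edit_conditioning("remove the hat", task_index=3)
```

Because the per-task parameters live in this small table rather than in the model backbone, adapting to a new task only requires learning one new row, which is consistent with the "handful of examples" claim above.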

In a discussion on Reddit, one user commented:

The most interesting thing here is [the] appendix where they describe how they create the training dataset. They use a toolchain involving LLaMA, DINO, Segment Anything, and an image generator to create millions of image -> instruction -> output pairs. This is a real success story for synthetic data.

In a discussion on Hacker News, several users expressed disappointment that the models have not been open-sourced, with one noting that "Meta had been on an open source roll lately." Meta did create a demo website for both Emu Video and Emu Edit, and also released the Emu Edit benchmark dataset on Hugging Face.
