Google Announces Video Generation LLM VideoPoet

Google Research recently published their work on VideoPoet, a large language model (LLM) that can generate video. VideoPoet was trained on 2 trillion tokens of text, audio, image, and video data, and in evaluations by human judges its output was preferred over that of other models.

Unlike many image and video generation AI systems that use diffusion models, VideoPoet uses a Transformer architecture trained to handle multiple input and output modalities, which it does by using different tokenizers. After training, VideoPoet can perform a variety of zero-shot generative tasks, including text-to-video, image-to-video, video inpainting, and video style transfer. When evaluated on a variety of benchmarks, VideoPoet achieves "competitive" performance compared to state-of-the-art baselines. According to Google:

Through VideoPoet, we have demonstrated LLMs’ highly-competitive video generation quality across a wide variety of tasks, especially in producing interesting and high quality motions within videos. Our results suggest the promising potential of LLMs in the field of video generation. For future directions, our framework should be able to support "any-to-any" generation, e.g., extending to text-to-audio, audio-to-video, and video captioning should be possible, among many others.

Although OpenAI's ground-breaking DALL-E model was an early example of using Transformers or LLMs to generate images from text prompts, diffusion models such as Imagen and Stable Diffusion soon became the standard architecture for generating images. More recently, researchers have trained diffusion models to generate short videos; examples include Meta's Emu and Stability AI's Stable Video Diffusion, which InfoQ covered in 2023.

With VideoPoet, Google returns to the Transformer architecture, citing the advantage of re-using infrastructure and optimizations developed for LLMs. The architecture also supports multiple modalities and tasks, in contrast to diffusion models, which according to Google require "architectural changes and adapter modules" to perform different tasks.

The key to VideoPoet's support for multiple modalities is a set of tokenizers. The Google team used a video tokenizer called MAGVIT-v2 and an audio tokenizer called SoundStream; for text they used T5's pre-trained text embeddings. From there, a decoder-only autoregressive Transformer generates a sequence of tokens, which the tokenizers then convert into audio and video streams.
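To make the idea of a shared token sequence concrete, the following is a minimal Python sketch and not Google's implementation. It assumes each modality's tokenizer emits integer ids that are shifted into a single shared vocabulary, and that generation is a simple loop appending one token at a time. The names ModalityVocab, toy_next_token, and generate, the vocabulary sizes, and the end-of-sequence id are all hypothetical; the "model" is a deterministic stand-in rule rather than a Transformer, and text conditioning via T5 embeddings is left out.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class ModalityVocab:
    """A contiguous slice of a shared token vocabulary (hypothetical layout)."""
    name: str
    offset: int
    size: int

    def to_shared(self, local_ids: Sequence[int]) -> List[int]:
        """Map tokenizer-local ids into the shared id space."""
        if any(not 0 <= i < self.size for i in local_ids):
            raise ValueError(f"{self.name} id out of range")
        return [self.offset + i for i in local_ids]

    def from_shared(self, shared_ids: Sequence[int]) -> List[int]:
        """Keep only this modality's tokens and map them back to local ids."""
        end = self.offset + self.size
        return [i - self.offset for i in shared_ids if self.offset <= i < end]


# Toy sizes chosen for illustration; real tokenizer codebooks are far larger.
VIDEO = ModalityVocab("video", offset=0, size=8)
AUDIO = ModalityVocab("audio", offset=8, size=4)
EOS_ID = VIDEO.size + AUDIO.size  # one id past both slices


def toy_next_token(seq: Sequence[int]) -> int:
    """Stand-in for the Transformer: a deterministic rule, not a learned model."""
    last = seq[-1] if seq else 0
    return (last * 3 + len(seq)) % (EOS_ID + 1)


def generate(
    next_token: Callable[[Sequence[int]], int],
    prefix: Sequence[int],
    max_new_tokens: int,
    eos_id: int,
) -> List[int]:
    """Autoregressive decoding: each new token is chosen from all tokens so far."""
    seq = list(prefix)
    for _ in range(max_new_tokens):
        token = next_token(seq)
        if token == eos_id:
            break
        seq.append(token)
    return seq[len(prefix):]


# Conditioning prefix made of two video tokens and one audio token.
prefix = VIDEO.to_shared([1, 2]) + AUDIO.to_shared([0])
generated = generate(toy_next_token, prefix, max_new_tokens=6, eos_id=EOS_ID)

# Split the generated stream back into per-modality ids for the decoders.
video_ids = VIDEO.from_shared(generated)
audio_ids = AUDIO.from_shared(generated)
```

In this sketch the per-modality ids would be handed back to the corresponding tokenizer's decoder to reconstruct pixels or waveforms; that decoding step is omitted because it depends on the tokenizer internals.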

VideoPoet was trained to perform eight different tasks: unconditioned video generation, text-to-video, video prediction, image-to-video, video inpainting, video stylization, audio-to-video, and video-to-audio. The model was trained on 2 trillion tokens, from a mix of 1 billion image-text pairs and 270 million videos.

The research team also discovered that the model exhibits several emergent capabilities when operations are chained together. For example, VideoPoet can use image-to-video to animate a single image, then use stylization to add visual effects. It can also generate long-form video, maintain consistent 3D structure, and apply camera motion from text prompts.
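The chaining described above can be pictured as feeding one task's output back in as the next task's conditioning input. The Python sketch below illustrates only that control flow; toy_generator, chain, and the task strings are hypothetical names, and the placeholder transformations (repeating and shifting token ids) merely stand in for real model calls.

```python
from typing import Callable, List

Tokens = List[int]
Generator = Callable[[str, Tokens], Tokens]


def toy_generator(task: str, conditioning: Tokens) -> Tokens:
    """Stand-in for the model: deterministic placeholder output per task."""
    if task == "image_to_video":
        # Pretend to extend one frame's tokens into three frames.
        return conditioning * 3
    if task == "stylization":
        # Pretend to restyle by shifting each token id; length is preserved.
        return [t + 100 for t in conditioning]
    raise ValueError(f"unknown task: {task}")


def chain(generator: Generator, tasks: List[str], initial: Tokens) -> Tokens:
    """Run tasks in order, using each output as the next task's conditioning."""
    out = list(initial)
    for task in tasks:
        out = generator(task, out)
    return out


image_tokens = [1, 2, 3]
stylized_video = chain(
    toy_generator, ["image_to_video", "stylization"], image_tokens
)
```

Because every task consumes and produces tokens in the same format, the steps compose without any task-specific glue, which is the property the chaining relies on.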

In a Hacker News discussion about VideoPoet, one user wrote:

The results look very impressive. The prompting however, is a bit weird - there's suspiciously many samples with an "8k"-suffix, presumably to get more photorealistic results? I really don't like that kind of stuff, when prompting becomes more like reciting sacred incantations instead of actual descriptions of what you want.

The VideoPoet demo site contains several examples of the model's output, including a one-minute video short-story.

About the Author

Anthony Alford
