Phenaki

Realistic video generation from open-domain textual descriptions

Abstract

We present Phenaki, a model that can synthesize realistic videos from textual prompt sequences.

Generating videos from text is particularly challenging due to factors such as high computational cost, variable video lengths, and the limited availability of high-quality text-video data.

To address the first two issues, Phenaki is built from two main components:

  1. An encoder-decoder model that compresses videos to discrete embeddings, or tokens, with a tokenizer that can work with variable-length videos thanks to its use of causal attention in time (see the first sketch after this list).

  2. A transformer model that translates text embeddings to video tokens: we use a bi-directional masked transformer conditioned on pre-computed text tokens to generate video tokens from text, which are subsequently de-tokenized to create the actual video (see the second sketch after this list).
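To make the first component concrete, the sketch below shows, in JAX, how a causal temporal attention mask can be built: tokens belonging to frame t may attend to all tokens of frames up to and including t, but never to future frames, which is what lets the tokenizer encode and extend videos of variable length. The shapes, names, and single-head attention here are our own minimal illustration, not the released implementation; the actual model also factorizes attention into separate spatial and temporal layers, which this sketch collapses into a single mask purely to show the causal constraint.

    import jax
    import jax.numpy as jnp

    def causal_temporal_mask(num_frames: int, tokens_per_frame: int) -> jnp.ndarray:
        """Boolean [T*P, T*P] mask: position i may attend to position j
        only if j's frame is not later than i's frame."""
        frame_ids = jnp.repeat(jnp.arange(num_frames), tokens_per_frame)
        return frame_ids[None, :] <= frame_ids[:, None]

    def masked_attention(q, k, v, mask):
        """Single-head scaled dot-product attention with a boolean mask."""
        scores = q @ k.T / jnp.sqrt(q.shape[-1])
        scores = jnp.where(mask, scores, -1e9)  # block attention to future frames
        return jax.nn.softmax(scores, axis=-1) @ v

    # Example: 4 frames, 6 spatial tokens per frame, 32-dim embeddings.
    T, P, D = 4, 6, 32
    x = jax.random.normal(jax.random.PRNGKey(0), (T * P, D))
    out = masked_attention(x, x, x, causal_temporal_mask(T, P))  # (24, 32)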
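The second component decodes non-autoregressively: starting from a fully masked token sequence, the bi-directional transformer predicts all video tokens in parallel, keeps the most confident predictions, re-masks the rest, and repeats for a fixed number of steps (in the spirit of MaskGIT-style masked token modeling). The following is a hedged sketch of that sampling loop; the cosine schedule, vocabulary size, and toy_model stand-in are our assumptions for illustration, not the paper's exact procedure.

    import math
    import jax
    import jax.numpy as jnp

    VOCAB = 8192     # assumed video-token vocabulary size
    MASK_ID = VOCAB  # assumed id of the special [MASK] token

    def sample_video_tokens(model_apply, text_emb, seq_len, steps=12, seed=0):
        """Iterative parallel decoding of masked video tokens."""
        key = jax.random.PRNGKey(seed)
        tokens = jnp.full((seq_len,), MASK_ID)
        for step in range(1, steps + 1):
            # Cosine schedule: how many tokens stay masked after this step.
            n_masked_next = int(math.cos(math.pi / 2 * step / steps) * seq_len)
            logits = model_apply(tokens, text_emb)           # [seq_len, VOCAB]
            key, sub = jax.random.split(key)
            sampled = jax.random.categorical(sub, logits)    # [seq_len]
            probs = jax.nn.softmax(logits, axis=-1)
            conf = jnp.take_along_axis(probs, sampled[:, None], axis=1)[:, 0]
            # Already-revealed tokens get infinite confidence so they stay fixed.
            conf = jnp.where(tokens == MASK_ID, conf, jnp.inf)
            tokens = jnp.where(tokens == MASK_ID, sampled, tokens)
            if n_masked_next > 0:
                # Re-mask the least confident positions for the next iteration.
                tokens = tokens.at[jnp.argsort(conf)[:n_masked_next]].set(MASK_ID)
        return tokens

    # Toy stand-in for the text-conditioned transformer (illustration only).
    def toy_model(tokens, text_emb):
        return jax.random.normal(jax.random.PRNGKey(1), (tokens.shape[0], VOCAB))

    print(sample_video_tokens(toy_model, text_emb=None, seq_len=16))

Because every step predicts all positions at once, the number of forward passes is fixed (here, 12) regardless of sequence length, which makes sampling much faster than token-by-token autoregressive decoding.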

To address the data issues, we demonstrate that joint training on a large corpus of image-text pairs and a smaller number of video-text examples can result in generalization beyond what is available in the video datasets alone.

Compared to prior video generation methods, Phenaki can generate arbitrarily long videos conditioned on an open-domain sequence of prompts in the form of time-variable text, i.e. a story. To the best of our knowledge, this is the first time a paper has studied generating videos from such time-variable prompts.

Furthermore, our video encoder-decoder outperforms all per-frame baselines currently used in the literature, achieving better spatio-temporal quality while using fewer tokens per video.

Resources

Paper

Example videos generated by Phenaki

Video generated from prompts: “A photorealistic teddy bear is swimming in the ocean in San Francisco. The teddy bear goes underwater. The teddy bear keeps swimming under the water with colorful fishes. A panda bear is swimming underwater.”

Video generated from prompts: “A teddy bear dives in the ocean. A teddy bear emerges from the water. A teddy bear walks on the beach. Camera zooms out to the teddy bear in the campfire by the beach.”

Video generated from prompts: “Side view of an astronaut who is walking through a puddle on Mars. The astronaut is dancing on Mars. The astronaut walks his dog on Mars. The astronaut and his dog watch fireworks.”

This 2:28-minute video was generated by feeding a long sequence of prompts to an older version of Phenaki and then passing the output through a super-resolution model.

Prompts:

“First person view of riding a motorcycle through a busy street.”

“First person view of riding a motorcycle through a busy road in the woods.”

“First person view of very slowly riding a motorcycle in the woods.”

“First person view braking in a motorcycle in the woods.”

“Running through the woods.”

“First person view of running through the woods towards a beautiful house.”

“First person view of running towards a large house.”

“Running through houses between the cats.”

“The backyard becomes empty.”

“An elephant walks into the backyard.”

“The backyard becomes empty.”

“A robot walks into the backyard.”

“A robot dances tango.”

“First person view of running between houses with robots.”

“First person view of running between houses; in the horizon, a lighthouse.”

“First person view of flying on the sea over the ships.”

“Zoom towards the ship.”

“Zoom out quickly to show the coastal city.”

“Zoom out quickly from the coastal city.”

This 2-minute story was likewise generated by feeding a long sequence of prompts to an older version of Phenaki and then passing the output through a super-resolution model.

Prompts:

“Lots of traffic in futuristic city.”

“An alien spaceship arrives to the futuristic city.”

“The camera gets inside the alien spaceship.”

“The camera moves forward until showing an astronaut in the blue room.”

“The astronaut is typing in the keyboard.”

“The camera moves away from the astronaut.”

“The astronaut leaves the keyboard and walks to the left.”

“The astronaut leaves the keyboard and walks away.”

“The camera moves beyond the astronaut and looks at the screen.”

“The screen behind the astronaut displays fish swimming in the sea.”

“Crash zoom into the blue fish.”

“We follow the blue fish as it swims in the dark ocean.”

“The camera points up to the sky through the water.”

“The ocean and the coastline of a futuristic city.”

“Crash zoom towards a futuristic skyscraper.”

“The camera zooms into one of the many windows.”

“We are in an office room with empty desks.”

“A lion runs on top of the office desks.”

“The camera zooms into the lion's face, inside the office.”

“Zoom out to the lion wearing a dark suit in an office room.”

“The lion wearing a dark suit looks at the camera and smiles.”

“The camera zooms out slowly to the skyscraper exterior.”

“Timelapse of sunset in the modern city.”

Phenaki can create coherent long-form visual stories from a chain of prompts, with a core resolution of 128x128 pixels.
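These long-form stories work because generation can be chained: since the tokenizer is causal in time, the tokens of the last few generated frames can be frozen as a prefix, and the masked transformer fills in the continuation under the next prompt. Below is a minimal sketch of that loop; encode_text, generate_tokens, and detokenize are hypothetical stand-ins for the model components described above, and the clip length and overlap are illustrative values, not the paper's settings.

    import jax.numpy as jnp

    def generate_story(prompts, encode_text, generate_tokens, detokenize,
                       frames_per_clip=11, overlap=5):
        """Chain prompts into one long video by conditioning each clip on
        the last `overlap` frames of the previous clip."""
        all_frames = []
        prefix_tokens = None  # frozen token prefix from the previous clip
        for prompt in prompts:
            text_emb = encode_text(prompt)
            # The transformer only fills in tokens after the frozen prefix.
            clip_tokens = generate_tokens(text_emb, prefix=prefix_tokens,
                                          num_frames=frames_per_clip)
            frames = detokenize(clip_tokens)  # [frames, H, W, 3]
            # Skip frames already emitted as part of the previous clip.
            start = 0 if prefix_tokens is None else overlap
            all_frames.append(frames[start:])
            # Freeze the last `overlap` frames' tokens for the next prompt
            # (assumes clip_tokens is indexed by frame along its first axis).
            prefix_tokens = clip_tokens[-overlap:]
        return jnp.concatenate(all_frames, axis=0)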

We wanted to understand whether we could combine the strengths of the two approaches: leveraging Imagen Video’s ability to generate videos of unprecedented photorealistic fidelity, and using its underlying super-resolution modules to enhance Phenaki’s output into beautiful, high-resolution visual stories.

To achieve this, we feed Phenaki’s output generated at a given time (plus the corresponding text prompt) to Imagen Video, which then performs spatial super-resolution. A distinct strength of Imagen Video, compared to other super-resolution systems, is its ability to incorporate the text into the super-resolution module. 
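In code, this amounts to a simple per-clip loop. The sketch below is our own illustration of the interface, not the released pipeline; sr_apply is a hypothetical stand-in for Imagen Video's spatial super-resolution module, and the pairing of each clip with its prompt is the point being shown.

    def super_resolve_story(clips, prompts, sr_apply):
        """Upscale each low-resolution Phenaki clip with a text-conditioned
        super-resolution model (`sr_apply` is a hypothetical interface)."""
        assert len(clips) == len(prompts)
        # Passing the prompt alongside the frames lets the upsampler use the
        # text, e.g. to resolve details the 128x128 frames only hint at.
        return [sr_apply(clip, prompt) for clip, prompt in zip(clips, prompts)]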

For an example of how the end-to-end system works in practice, see the video above.

The prompts corresponding to this example are the following:

“very close up of Penguin riding wave on yellow surfboard”

“penguin rides surf yellow surfboard unto beach. Penguin leaves yellow surfboard and keeps walking.”

“Penguin quickly walking on beach and camera following. Penguin waves to camera. Feet go by camera in foreground”

“A penguin runs into a 100 colorful bouncy balls”

“slow zoom out. penguin sitting on bird nest with a single colorful egg”

“zoom out. Aerial view of penguin sitting on bird nest in rainbow antarctic glacier”

Authors

Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, Dumitru Erhan

Acknowledgements

We give special thanks to the Imagen Video team for their collaboration and for providing their system to do super resolution. We thank our artist friends Irina Blok and Alonso Martinez for their extensive creative exploration of the system and for using Phenaki to generate some of the videos showcased here. We also want to thank Niki Parmar for initial discussions. Special thanks to Gabriel Bender and Thang Luong for reviewing the paper and providing constructive feedback. We appreciate the efforts of Kevin Murphy and David Fleet for advising the project and providing feedback throughout. We are grateful to Evan Rapoport, Douglas Eck and Zoubin Ghahramani for supporting this work in a variety of ways. Tim Salimans and Chitwan Saharia helped us with brainstorming and coming up with shared benchmarks. Jason Baldridge was instrumental in bouncing ideas. Alex Rizkowsky was very helpful in keeping things organized, while Erica Moreira and Victor Gomes ensured smooth resourcing for the project. Sarah Laszlo and Kathy Meier-Hellstern greatly helped us incorporate important responsible AI practices into this project, for which we are immensely grateful. Finally, Blake Hechtman and Anselm Levskaya were generous in helping us debug a number of JAX issues.

Credit for Phenakistoscope asset:

Creator: Muybridge, Eadweard, 1830-1904, artist

Title: The zoopraxiscope* - a couple waltzing (No. 35., title from item.)

Edits made: Extended background and converted file format to mp4