VideoGPT: Video Generation using VQ-VAE and Transformers
We present VideoGPT: a conceptually simple architecture for scaling likelihood-based generative modeling to natural videos. VideoGPT uses a VQ-VAE that learns downsampled discrete latent representations of a raw video by employing 3D convolutions and axial self-attention. A simple GPT-like architecture is then used to autoregressively model the discrete latents using spatio-temporal position encodings. Despite the simplicity in formulation and ease of training, our architecture is able to generate samples competitive with state-of-the-art GAN models for video generation on the BAIR Robot dataset, and generate high fidelity natural images from UCF-101 and the Tumblr GIF Dataset (TGIF). We hope our proposed architecture serves as a reproducible reference for a minimalistic implementation of transformer-based video generation models. Samples and code are available at this https URL
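To make the two-stage idea in the abstract concrete, here is a toy NumPy sketch of the vector-quantization step: the VQ-VAE encoder's downsampled latents are snapped to their nearest codebook entries, yielding a grid of discrete token indices that the GPT-like prior then models autoregressively. The codebook size, latent dimension, and latent grid shape below are illustrative assumptions, not the paper's actual hyperparameters.

```python
import numpy as np

def vector_quantize(z, codebook):
    """Map each latent vector to the index of its nearest codebook entry.

    z:        (T, H, W, D) downsampled spatio-temporal latents
    codebook: (K, D) learned code vectors
    returns:  (T, H, W) integer token indices
    """
    flat = z.reshape(-1, z.shape[-1])                     # (T*H*W, D)
    # squared Euclidean distances via ||a||^2 - 2 a.b + ||b||^2
    d = ((flat ** 2).sum(-1, keepdims=True)
         - 2.0 * flat @ codebook.T
         + (codebook ** 2).sum(-1))                       # (T*H*W, K)
    idx = d.argmin(axis=1)
    return idx.reshape(z.shape[:-1])

rng = np.random.default_rng(0)
codebook = rng.normal(size=(1024, 64))       # K=1024 codes, D=64 (assumed)
latents = rng.normal(size=(4, 16, 16, 64))   # e.g. a clip downsampled to 4x16x16

tokens = vector_quantize(latents, codebook)  # discrete latent grid: (4, 16, 16)
seq = tokens.reshape(-1)                     # raster-scan sequence for the prior
```

The flattened `seq` (here 4×16×16 = 1024 tokens) is what the transformer prior consumes, with spatio-temporal position encodings telling it where each token sits in the latent grid.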
FINALLY! A breakthrough in true video synthesis.
We've had to settle for pseudo video synthesis for a couple of years now, with existing GAN and autoencoder methods being used to produce stitched-together animations (you can pull this off in Artbreeder, in fact!).
But while some of these are indeed novel video synthesis, they really aren't what I've been thinking about when I consider that term. Faces abstractly bleeding into each other or fractals reforming constantly like a DMT trip are certainly interesting to watch, but when I think of "novel video synthesis", I'm thinking of something much more coherent. Take a photograph or a screenshot.
Heck, take a screenshot of this page I'm typing on. A novel video synthesis algorithm would be able to "scroll up and down" the page. It doesn't have to know what the hell is on the page besides what's in the screenshot. It just has to remain consistent, creating forum posts that look at least somewhat legitimate. It's not scraping this forum for that data, either. More that it's "imagining" what the page could be and turning that into a moving sequence of images.
And you can get more complex from there, such as synthesizing an image of a typical street crowd and then animating it, having the people walk and cars move and bikes roll on, maybe even some clouds slowly moving on by. It looks like a video shot on a camera, but this video does not exist.
This is the first major step towards it!
I mean, DVD-GAN was interesting, but it's becoming clear that transformers are taking over this front from GANs.