This concept started in funkervogt's thread
Far Beyond DADABots | The never-ending movies of tomorrow
I've decided to expand on it by explaining what we'd need. Original thread: https://www.reddit.c...he_neverending/
Here's what we need to make a rudimentary 24/7 movie:
- Novel video synthesis. By this, I mean "a generative network produces full-motion video that is not directly based on an existing piece of data." That excludes deepfakes, which work by transferring one face onto another. That also excludes style transfer: making a pre-existing video look like a Van Gogh painting or pixel art doesn't count. It has to be novel, the way ThisPersonDoesNotExist is for human faces. As far as I know, novel video synthesis remains at least a few good papers away; it needs another year or two.
- Text-to-image and text-to-video synthesis. We have text-to-image models, but they are rudimentary, and text-to-video synthesis is utterly experimental at best. It might be best described as "where novel image synthesis was in 2014" (back when GANs generated fuzzy, ethereal black-and-white images of human faces, a very far cry from ThisPersonDoesNotExist). This might need two or more years.
- Superior natural language generation abilities. NLG is actually quite a bit more advanced than some people presume. Networks like Transformer-LM, XLNet, and Baidu's ERNIE excel at semantic sentence-pair understanding, showing that these networks can derive meaning and understanding from at least a short paragraph of text. GPT-2 scores around 70% on the Winograd Schema Challenge, which tests an AI's capacity for commonsense reasoning by asking it to resolve an ambiguous pronoun (e.g., in "The trophy doesn't fit in the suitcase because it is too big," deciding whether "it" refers to the trophy or the suitcase). A human reliably scores 92% to 95%; Baidu's latest ERNIE model scores 90.1%. This is fantastic evidence of commonsense reasoning in one area of natural language processing, and it tells me that SOTA language models can indeed generate text that makes sense. Of course, the Winograd Schema Challenge mostly tests whether a model can work out what a sentence means when the meaning is not immediately clear (still a massive skill necessary for proper NLU), so simply matching humans at figuring out a confusing sentence's unclear subject isn't going to lead to perfectly coherent scripts tomorrow. What's more, I don't believe the SOTA models are available for public use the way GPT-2 is. But that's beside the point, because we're discussing what ought to be possible in more than a few years: coherent scripts, as long as you're referring to SOTA natural language models.
- Audio synthesis. We're already capable of generating speech that almost perfectly matches a human's, and we can generate raw waveforms for music too (that is to say, computers can 'imagine' the sounds of instruments rather than play back MIDI files). With further improvements, text-to-speech ought to reach a level that's nearly indistinguishable from natural speech. This is all possible today.
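To make the waveform-vs-MIDI distinction concrete: instead of emitting a note event for a synthesizer to interpret, a waveform model computes every individual audio sample. A neural vocoder like WaveNet does conceptually this, predicting one sample at a time; the sine generator below is just a toy stand-in for that idea, not anyone's actual model.

```python
import math

SAMPLE_RATE = 16000  # samples per second, a common rate for speech models

def sine_wave(freq_hz: float, seconds: float):
    """Compute raw audio samples directly, one by one --
    the 'imagine the waveform' approach, in miniature."""
    n = int(SAMPLE_RATE * seconds)
    return [math.sin(2 * math.pi * freq_hz * t / SAMPLE_RATE)
            for t in range(n)]

samples = sine_wave(440.0, 0.5)  # half a second of an A4 tone
print(len(samples))  # 8000 raw samples
```

A MIDI file would store this same half-second as a single "note on" event; the waveform approach instead produces 8,000 numbers, which is why neural audio models are so compute-hungry and so flexible.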
Of course, for the first 24/7 movies, we won't need scripts that are necessarily coherent, nor will we need video synthesis networks that can generate an infinite amount of detail. What I can foresee is something like a video being posted to YouTube that is run by a generative adversarial network with some simple instructions: "take this endlessly-prompted script and generate video from it." It might only use the last couple of sentences of the script as the prompt for the next generated portion, which will greatly reduce its long-term coherency. However, it will still function.
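The loop described above can be sketched in a few lines. Everything here is hypothetical scaffolding: `generate_text` and `generate_video` are placeholders standing in for a real language model and a real text-to-video model, which don't exist in usable form yet.

```python
from collections import deque

def generate_text(prompt: str) -> str:
    """Placeholder language model: emits a new 'sentence' per call."""
    return f"Scene continuing from: {prompt[-40:]}"

def generate_video(script_fragment: str) -> bytes:
    """Placeholder video model: pretends to render a clip."""
    return script_fragment.encode("utf-8")  # stand-in for video frames

def endless_movie(steps: int, window: int = 2):
    """Keep only the last `window` sentences as the next prompt --
    cheap to run, at the cost of long-term coherence."""
    recent = deque(["A man walks into frame."], maxlen=window)
    clips = []
    for _ in range(steps):
        prompt = " ".join(recent)               # sliding context window
        sentence = generate_text(prompt)        # extend the script
        recent.append(sentence)                 # old sentences fall off
        clips.append(generate_video(sentence))  # render that fragment
    return clips

clips = endless_movie(steps=5)
print(len(clips))  # 5 generated "clips"
```

The `maxlen` on the deque is the whole trick: the movie can run forever because the prompt never grows, but anything that fell out of the window is forgotten, which is exactly why long-term coherency suffers.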
This, I can absolutely see being done by 2022 at the latest. We're but a few papers away from a team demonstrating this live.
And yes, it will definitely be surreal and likely overly literal. The novel video generator might break on ambiguous phrasing, like "the man takes off."
By 2025, considering the rate at which compute is increasing (which means larger models trained on more data, which means greater accuracy and more competent outputs), it would be bizarre if we couldn't generate a surrealist "indie" movie.
And yes, I will hold to the claim that it will become coherent by 2030.
One of the guys from DADAbots commented and revealed that they're already working on a really rudimentary version of this: