Meta researchers have made a big leap in AI art generation with Make-A-Video, the creatively named new technique for – you guessed it – creating a video from nothing but a text prompt. The results are impressive and varied, and all of them, without exception, a little frightening.
We’ve seen text-to-video models before – they’re a natural extension of text-to-image models like DALL-E, which produce still images from prompts. But while the conceptual jump from a still image to a moving one is small for the human brain, it is far from trivial to implement in a machine learning model.
Make-A-Video doesn’t change the game much on the back end – as the researchers note in the paper describing it, a model that has only seen text describing images is surprisingly effective at generating short videos.
The AI uses the popular and effective diffusion technique to create images, which essentially works in reverse from pure visual noise, “denoising” it step by step toward the target prompt. What’s added here is that the model also received unsupervised training (i.e. it examined the data itself without strong human guidance) on a set of unlabeled video content.
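To make the “working in reverse from noise” idea concrete, here is a minimal toy sketch of the reverse-diffusion direction of travel. This is not Meta’s implementation: a real diffusion model predicts the noise with a trained neural network conditioned on the text prompt, while this sketch cheats by nudging directly toward a known target. The function name `toy_reverse_diffusion` is purely illustrative.

```python
import numpy as np

def toy_reverse_diffusion(target, steps=50, seed=0):
    """Start from pure noise and iteratively 'denoise' toward the target.

    In a real diffusion model, the per-step correction comes from a trained
    network's noise estimate; here we use (x - target) as a stand-in for
    that estimate, just to illustrate the direction of the process.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=np.shape(target))      # pure noise: the starting point
    for t in range(steps):
        predicted_noise = x - target           # stand-in for the model's noise estimate
        x = x - predicted_noise / (steps - t)  # remove a fraction of the noise
    return x

target = np.array([0.2, -0.5, 1.0])
result = toy_reverse_diffusion(target)
```

After enough steps the sample converges on the target; in a real model there is no known target, only the prompt-conditioned network steering each denoising step.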
What it knows from the first is how to make a realistic image; what it knows from the second is what the sequential frames of a video look like. Amazingly, it is able to put these pieces together very effectively without any special training on how to combine them.
“In all aspects – spatio-temporal resolution, text fidelity, and quality – Make-A-Video sets the new state of the art in text-to-video generation, as determined by both qualitative and quantitative metrics,” the researchers write.
It’s hard not to agree. Previous text-to-video systems used a different approach, and the results were unimpressive but promising. Now Make-A-Video blows them out of the water, achieving resolution roughly in line with images from perhaps 18 months ago on the original DALL-E or other previous-generation systems.
But it must be said: there is definitely something off about them. Not that we should expect photorealism or completely natural motion, but the results all have a kind of… well, there’s no other word for it – they’re a bit creepy, aren’t they?
There is a certain quality to them that is dreamlike and off-putting at the same time. The motion is strange, as if in a stop-motion film. The corruption and artifacts give each piece a surreal, furry feel, as if things are leaking into one another. People blend into each other – there’s no understanding of the boundaries of objects, or of where something should end or how it should connect.
I’m not saying all this as some kind of AI snob who only wants the best photorealistic, high-definition imagery. I just think it’s interesting that, however realistic these videos are in some ways, they are all deeply strange and unsettling in others. That they can be generated quickly and arbitrarily is incredible – and it will only get better. But even the best image generators still have that surreal quality that’s hard to put your finger on.
Make-A-Video also allows converting still images and other videos into variants or extensions of them, much as image generators can themselves be prompted with images. The results are slightly less unsettling.
This really is a huge step forward from what came before, and the team deserves congratulations. It’s not publicly available yet, but you can register here to get on the list for whatever form of access they decide on later.