Despite the evocative nature of words like “movie” and “film”, the short snippets of video footage discussed below are, regrettably, nowhere near the level of what people commonly refer to as “movies”.
However, the new technique could eventually be used to train other machine learning algorithms, and even help witnesses reconstruct the scene of a crime.
Furthermore, while artificial intelligence has been getting better at identifying the content of images and providing labels, and so-called “generative” algorithms have been improving at the reverse task of producing images from labels, this is the first time an algorithm has managed to generate video from text with results this convincing.
“As far as I know, it’s the first text-to-video work that gives such good results. They are not perfect, but at least they start to look like real videos. It’s really nice work,” said Tinne Tuytelaars, a computer scientist at the Katholieke Universiteit Leuven in Belgium.
The algorithm operates in two stages. First, it uses the text to generate a text-conditioned background colour and object layout (the static features extracted from the text). It then combines these with dynamic features by filtering the input, producing a short, one-second video.
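To make the two-stage idea concrete, here is a deliberately crude Python sketch. It is not the researchers' method: the real system uses trained neural networks, whereas this toy hashes the caption to choose a background level and an object position (the static stage), then shifts that image sideways at a caption-dependent speed (a stand-in for the dynamic stage). Every name in it (`text_to_seed`, `make_gist`, `make_video`) is hypothetical; only the 64-by-64 frame size and the 32-frame length come from the article.

```python
import hashlib
from typing import List

Frame = List[List[float]]  # rows of grayscale intensities in [0, 1]

SIZE = 64    # frame width and height in pixels (figure from the article)
FRAMES = 32  # frames per clip (figure from the article)


def text_to_seed(text: str) -> int:
    """Hypothetical stand-in for a text encoder: hash the caption to an int."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def make_gist(text: str) -> Frame:
    """Stage 1 (static): a caption-dependent background level plus one
    bright 16x16 square marking where an object would sit."""
    seed = text_to_seed(text)
    background = (seed % 100) / 200.0      # always below 0.5
    top = (seed >> 8) % (SIZE - 16)
    left = (seed >> 16) % (SIZE - 16)
    gist = [[background] * SIZE for _ in range(SIZE)]
    for y in range(top, top + 16):
        for x in range(left, left + 16):
            gist[y][x] = 1.0
    return gist


def make_video(text: str) -> List[Frame]:
    """Stage 2 (dynamic): rotate each row of the gist to the right by a
    caption-dependent number of pixels per frame."""
    gist = make_gist(text)
    speed = 1 + text_to_seed(text) % 3     # 1 to 3 pixels per frame
    video = []
    for t in range(FRAMES):
        shift = (t * speed) % SIZE
        video.append([row[-shift:] + row[:-shift] for row in gist])
    return video
```

Calling `make_video("sailing on the sea")` returns 32 frames of 64 by 64 values. The first frame is the static image itself, and later frames are the same image displaced, which mirrors the split between what stays fixed in a scene and what moves.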
During training, the algorithm is overseen by a second network acting as a kind of “judge”. The judge sees the resulting video and compares it to a “real” one depicting the same general idea (such as “sailing on the sea” or “playing golf on grass”). As this iterative process continues, the judge’s criticism improves the algorithm’s generative capacity.
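This generator-and-judge arrangement is commonly known as adversarial training. The Python sketch below shows the feedback loop on a toy problem, under heavy assumptions: the “videos” are single numbers, the generator has one parameter (`theta`), and the judge is a simple logistic classifier. The function `train_toy_gan` and its parameters are hypothetical and have nothing to do with the researchers' actual networks.

```python
import math
import random
from typing import List


def sigmoid(x: float) -> float:
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def train_toy_gan(steps: int = 500, lr: float = 0.05, batch: int = 64,
                  real_mean: float = 4.0, seed: int = 0) -> List[float]:
    """Train a one-parameter generator against a logistic judge.

    Returns the generator parameter theta after every step.
    """
    rng = random.Random(seed)
    theta = 0.0      # generator: fake sample = theta + noise
    w, b = 0.0, 0.0  # judge: D(x) = sigmoid(w * x + b), "probability x is real"
    history = []
    for _ in range(steps):
        real = [rng.gauss(real_mean, 1.0) for _ in range(batch)]
        fake = [theta + rng.gauss(0.0, 1.0) for _ in range(batch)]

        # Judge step: gradient ascent on log D(real) + log(1 - D(fake)),
        # i.e. learn to score real samples high and fakes low.
        gw = gb = 0.0
        for r, f in zip(real, fake):
            d_real = sigmoid(w * r + b)
            d_fake = sigmoid(w * f + b)
            gw += (1.0 - d_real) * r - d_fake * f
            gb += (1.0 - d_real) - d_fake
        w += lr * gw / batch
        b += lr * gb / batch

        # Generator step: gradient ascent on log D(fake), i.e. shift theta
        # in the direction the judge currently considers "more real".
        gt = sum((1.0 - sigmoid(w * f + b)) * w for f in fake) / batch
        theta += lr * gt
        history.append(theta)
    return history
```

In this toy, the judge first learns that real samples are larger than fakes, and its feedback then pushes `theta` upward from zero. Adversarial training can oscillate rather than settle, so the sketch makes no promise about where `theta` ends up.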
The algorithm was trained on 10 types of scenes which it then approximated by producing a rough video image resembling grainy VHS footage. The algorithm was even capable of “directing” movies based on nonsensical actions like “sailing on snow” and “playing golf at swimming pool”.
For now, the videos are only 32 frames long and not much larger than a US postage stamp — 64 by 64 pixels — because larger videos reduce accuracy.
The next step for the team will be to feed the algorithm human skeletal models to improve the appearance of human figures, which currently look like distorted, vaguely humanoid blobs.
An accompanying paper will be published after a meeting of the Association for the Advancement of Artificial Intelligence in New Orleans, Louisiana this month.