Text-to-video model

From Wikipedia, the free encyclopedia
A video generated using OpenAI's Sora text-to-video model, using the prompt: "A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about."

A text-to-video model is a machine learning model that uses a natural language description as input to produce a video relevant to the input text.[1] Advancements during the 2020s in the generation of high-quality, text-conditioned videos have largely been driven by the development of video diffusion models.[2]
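
As a rough illustration of how a text-conditioned video diffusion model produces a clip, the sketch below repeatedly denoises a random spatio-temporal latent while conditioning on an embedding of the prompt, then notes where a decoder would turn the latent into frames. The tensor shapes, the stand-in denoiser, and the simplified update rule are placeholder assumptions chosen for brevity, not any specific published system.

    # Schematic only: a toy reverse-diffusion loop for a text-conditioned video latent.
    import torch

    frames, channels, height, width = 16, 4, 32, 32         # assumed latent video shape
    text_embedding = torch.randn(1, 77, 768)                 # placeholder text-encoder output

    def denoiser(latent, t, cond):
        """Stand-in for a 3D U-Net or transformer that predicts the noise in `latent`."""
        return torch.zeros_like(latent)

    latent = torch.randn(1, frames, channels, height, width)    # start from pure noise
    for t in torch.linspace(1.0, 0.0, steps=50):
        predicted_noise = denoiser(latent, t, text_embedding)
        latent = latent - 0.02 * predicted_noise             # simplified update; real samplers differ

    # A latent decoder (e.g. a VAE) would then map `latent` to RGB frames.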

Models

There are different models, including open-source models. The demo version of CogVideo is an early text-to-video model with 9.4 billion parameters, with its code available on GitHub.[3] Meta Platforms has a partial text-to-video[note 1] model called "Make-A-Video".[4][5][6] Google Brain has released a research paper introducing Imagen Video, a text-to-video model with a 3D U-Net.[7][8][9][10][11]

In March 2023, Alibaba published a research paper that applied many of the principles found in latent image diffusion models to video generation.[12][13] The following year, Alibaba released ModelScope.[14] Services such as Kaiber and Reemix subsequently adopted similar approaches to video generation in their respective products.[citation needed]
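
For context on how such a latent video diffusion model is typically invoked, the sketch below runs a ModelScope-family text-to-video checkpoint through the open-source Hugging Face diffusers library. The model identifier, pipeline loading, and output handling follow the diffusers documentation but may differ between library versions, so this should be read as a hedged example rather than a definitive recipe.

    # Sketch: sampling a short clip from a ModelScope-style text-to-video checkpoint.
    # Assumes a CUDA GPU plus the diffusers, transformers and accelerate packages.
    import torch
    from diffusers import DiffusionPipeline
    from diffusers.utils import export_to_video

    pipe = DiffusionPipeline.from_pretrained(
        "damo-vilab/text-to-video-ms-1.7b",   # checkpoint ID as published on Hugging Face
        torch_dtype=torch.float16,
        variant="fp16",
    ).to("cuda")

    prompt = "A panda eating bamboo on a rock"
    result = pipe(prompt, num_inference_steps=25)
    frames = result.frames[0]                  # recent diffusers versions return a batch of frame lists
    video_path = export_to_video(frames)       # writes the frames to an .mp4 and returns its path
    print(video_path)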

Matthias Niessner and Lourdes Agapito at AI company Synthesia work on developing 3D neural rendering techniques that can synthesise realistic video using 2D and 3D neural representations of shape, appearance, and motion for controllable video synthesis of avatars.[15] In June 2024, Luma AI launched its Dream Machine video tool.[16][17] That same month, Kuaishou extended its Kling AI text-to-video model to international users.[18] In July 2024, TikTok owner ByteDance released Jimeng AI in China through its subsidiary Faceu Technology.[19]

Alternative approaches to text-to-video models include[20] Google's Phenaki, Hour One, Colossyan,[21] Runway's Gen-3 Alpha,[22][23] and OpenAI's Sora,[24] which as of August 2024 remained unreleased and available only to alpha testers.[25]

Footnotes

  1. ^ It can also generate videos from images, fill in video between two input images, and produce variations of existing videos.

References

  1. ^ Artificial Intelligence Index Report 2023 (PDF) (Report). Stanford Institute for Human-Centered Artificial Intelligence. p. 98. Multiple high quality text-to-video models, AI systems that can generate video clips from prompted text, were released in 2022.
  2. ^ Melnik, Andrew; Ljubljanac, Michal; Lu, Cong; Yan, Qi; Ren, Weiming; Ritter, Helge (2024-05-06). "Video Diffusion Models: A Survey". arXiv:2405.03150 [cs.CV].
  3. ^ CogVideo, THUDM, 2022-10-12, retrieved 2022-10-12
  4. ^ Davies, Teli (2022-09-29). "Make-A-Video: Meta AI's New Model For Text-To-Video Generation". Weights & Biases. Retrieved 2022-10-12.
  5. ^ Monge, Jim Clyde (2022-08-03). "This AI Can Create Video From Text Prompt". Medium. Retrieved 2022-10-12.
  6. ^ "Meta's Make-A-Video AI creates videos from text". www.fonearena.com. Retrieved 2022-10-12.
  7. ^ "google: Google takes on Meta, introduces own video-generating AI". The Economic Times. 6 October 2022. Retrieved 2022-10-12.
  8. ^ Monge, Jim Clyde (2022-08-03). "This AI Can Create Video From Text Prompt". Medium. Retrieved 2022-10-12.
  9. ^ "Nuh-uh, Meta, we can do text-to-video AI, too, says Google". www.theregister.com. Retrieved 2022-10-12.
  10. ^ "Papers with Code - See, Plan, Predict: Language-guided Cognitive Planning with Video Prediction". paperswithcode.com. Retrieved 2022-10-12.
  11. ^ "Papers with Code - Text-driven Video Prediction". paperswithcode.com. Retrieved 2022-10-12.
  12. ^ "Home - DAMO Academy". damo.alibaba.com. Retrieved 2023-08-12.
  13. ^ Luo, Zhengxiong; Chen, Dayou; Zhang, Yingya; Huang, Yan; Wang, Liang; Shen, Yujun; Zhao, Deli; Zhou, Jingren; Tan, Tieniu (2023). "VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation". arXiv:2303.08320 [cs.CV].
  14. ^ "Alibaba Cloud unleashes thousands of Chinese AI models to the world". The Register. Retrieved 2024-08-16.
  15. ^ "Text to Speech for Videos". Retrieved 2023-10-17.
  16. ^ "Luma AI debuts 'Dream Machine' for realistic video generation, heating up AI media race". VentureBeat. Retrieved 2024-08-16.
  17. ^ "Apple Debuts Intelligence, Mistral Raises $600 Million, New AI Text-To-Video". Forbes. Retrieved 2024-08-16.
  18. ^ "What you need to know about Kling, the AI video generator rival to Sora that’s wowing creators". VentureBeat. Retrieved 2024-08-16.
  19. ^ "ByteDance joins OpenAI's Sora rivals with AI video app launch". Reuters. Retrieved 2024-08-16.
  20. ^ Text2Video-Zero, Picsart AI Research (PAIR), 2023-08-12, retrieved 2023-08-12
  21. ^ "Text-to-Video Generative AI Models: The Definitive List". AI Business. Retrieved 2024-08-16.
  22. ^ "Runway's Sora competitor Gen-3 Alpha now available". The Decoder. Retrieved 2024-08-16.
  23. ^ "Generative AI's Next Frontier Is Video". Bloomberg. Retrieved 2024-08-16.
  24. ^ "OpenAI teases 'Sora,' its new text-to-video AI model". NBC News. Retrieved 2024-08-16.
  25. ^ "Toys R Us creates first brand film to use OpenAI’s text-to-video tool". Marketing Dive. Retrieved 2024-08-16.