FusionFrames: Efficient Architectural Aspects for Text-to-Video Generation Pipeline
Abstract
Multimedia generation approaches occupy a prominent place in artificial intelligence research. Text-to-image models have achieved high-quality results over the last few years, whereas video synthesis methods have only recently started to develop. This paper presents a new two-stage latent diffusion text-to-video generation architecture based on a text-to-image diffusion model. The first stage synthesizes keyframes to outline the storyline of a video, while the second generates interpolation frames to make the movements of the scene and objects smooth. We compare several temporal conditioning approaches for keyframe generation. The results show the advantage of using separate temporal blocks over temporal layers in terms of metrics reflecting video generation quality aspects and human preference. The design of our interpolation model significantly reduces computational costs compared to other masked frame interpolation approaches. Furthermore, we evaluate different configurations of a MoVQ-based video decoding scheme to improve consistency and achieve better PSNR, SSIM, MSE, and LPIPS scores. Finally, we compare our pipeline with existing solutions and achieve top-2 scores overall and top-1 among open-source solutions: CLIPSIM = 0.2976 and FVD = 433.054. Project page: https://ai-forever.github.io/kandinsky-video/
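For readers unfamiliar with the reconstruction metrics named in the abstract, the following is a minimal sketch of how MSE and PSNR are conventionally computed between a reference frame and a decoded frame. It is not the authors' evaluation code: the function names `mse` and `psnr`, the flattened-list frame representation, and the sample pixel values are all assumptions made for illustration. Note that lower is better for MSE (and LPIPS), while higher is better for PSNR (and SSIM).

```python
import math


def mse(reference, decoded):
    """Mean squared error between two equally sized, flattened pixel sequences."""
    if len(reference) != len(decoded) or not reference:
        raise ValueError("frames must be non-empty and the same size")
    return sum((r - d) ** 2 for r, d in zip(reference, decoded)) / len(reference)


def psnr(reference, decoded, max_value=255.0):
    """Peak signal-to-noise ratio in dB; infinite when the frames are identical."""
    err = mse(reference, decoded)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)


# Hypothetical 2x2 grayscale frame (flattened) and a slightly distorted decode.
original = [52, 55, 61, 59]
decoded = [54, 55, 60, 58]

print(mse(original, decoded))   # 1.5
print(psnr(original, decoded))  # roughly 46.37 dB
```

In practice these metrics would be averaged over all frames of a decoded video; SSIM and LPIPS need dedicated implementations and are omitted here.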
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- MoVideo: Motion-Aware Video Generation with Diffusion Models (2023)
- LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation (2023)
- VideoCrafter1: Open Diffusion Models for High-Quality Video Generation (2023)
- DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors (2023)
- ConditionVideo: Training-Free Condition-Guided Text-to-Video Generation (2023)
a red car is drifting on the mountain road, close view, fast movement
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- MoVideo: Motion-Aware Video Generation with Diffusion Models (2023)
- LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation (2023)
- VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models (2023)
- Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation (2023)
- VideoCrafter1: Open Diffusion Models for High-Quality Video Generation (2023)
FusionFrames: Revolutionizing Text-to-Video with Efficient Pipelines
Links:
Subscribe: https://www.youtube.com/@Arxflix
Twitter: https://x.com/arxflix
LMNT (Partner): https://lmnt.com/