Highlights


Tanvir Mahmud, Mustafa Munir, Radu Marculescu, and Diana Marculescu
Our method substantially improves visual quality while maintaining the temporal consistency of videos. Since all videos are compressed, please view them in full screen for the original quality. We use a shorter video (16 frames) for the Text2Video-Zero baseline due to its extensive memory requirements.
"A marble sculpture running" | Ours | TokenFlow ([2]) |
---|---|---|
ControlVideo ([1]) | SDEdit ([4]) | T2VZero ([3]) |
StreamV2V ([5]) | MotionTransfer ([6]) | |
"waterpainting of woman running" | Ours | TokenFlow ([2]) |
---|---|---|
ControlVideo ([1]) | SDEdit ([4]) | T2VZero ([3]) |
StreamV2V ([5]) | MotionTransfer ([6]) | |
"Shiny silver robot doing moonwalk" | Ours | TokenFlow ([2]) |
---|---|---|
ControlVideo ([1]) | SDEdit ([4]) | T2VZero ([3]) |
StreamV2V ([5]) | MotionTransfer ([6]) | |
"Car running in winter days" | Ours | TokenFlow ([2]) |
---|---|---|
ControlVideo ([1]) | SDEdit ([4]) | T2VZero ([3]) |
StreamV2V ([5]) | MotionTransfer ([6]) | |
"Watercolor painting of car running" | Ours | TokenFlow ([2]) |
---|---|---|
ControlVideo ([1]) | SDEdit ([4]) | T2VZero ([3]) |
MotionTransfer ([6]) | ||
"Shiny silver robotic wolf" | Ours | TokenFlow ([2]) |
---|---|---|
ControlVideo ([1]) | SDEdit ([4]) | T2VZero ([3]) |
StreamV2V ([5]) | MotionTransfer ([6]) | |
"Cutting Bricks" | Ours | TokenFlow ([2]) |
---|---|---|
ControlVideo ([1]) | SDEdit ([4]) | T2VZero ([3]) |
StreamV2V ([5]) | MotionTransfer ([6]) | |
"Cutting bread, Van Gogh style" | Ours | TokenFlow ([2]) |
---|---|---|
ControlVideo ([1]) | SDEdit ([4]) | T2VZero ([3]) |
StreamV2V ([5]) | MotionTransfer ([6]) | |
We conduct an ablation study on various extensions of the self-attention mechanism for consistent video editing, jointly editing 40 frames from each video within the same framework.
We observe significant improvements in quality and consistency with denser key-frame sampling for these extensions. Notably, our adaptive method preserves the performance of each baseline extension while achieving a significant speed-up.
Input Video | First Frame | First Frame + Prev Frame | First Frame + Two Prev Frames |
---|---|---|---|
Sampled at 3 | Sampled at 3 + Ada-VE (ours) | Sampled at 1 | Sampled at 1 + Ada-VE (ours) |
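The attention extensions compared above can be sketched as cross-frame self-attention, where each frame's queries attend to key/value tokens gathered from the first (anchor) frame, a window of previous frames, and the frame itself. The function below is our own minimal illustration under those assumptions, not the authors' implementation; the `[T, N, D]` tensor layout and the `num_prev` parameter are hypothetical.

```python
import torch
import torch.nn.functional as F

def cross_frame_attention(q, k, v, num_prev=1):
    """Sketch of extended self-attention for video editing (assumed layout).

    q, k, v: [T, N, D] tensors (frames, tokens per frame, channels).
    Each frame t attends to tokens from the first frame, up to `num_prev`
    preceding frames, and itself. num_prev=0 corresponds to the
    "First Frame" extension; num_prev=1 to "First Frame + Prev Frame".
    """
    T, N, D = q.shape
    out = torch.empty_like(q)
    for t in range(T):
        # Reference frames: anchor (0) + previous window + current frame.
        ref = sorted(set([0] + list(range(max(0, t - num_prev), t)) + [t]))
        k_ext = torch.cat([k[i] for i in ref], dim=0)  # [len(ref)*N, D]
        v_ext = torch.cat([v[i] for i in ref], dim=0)  # [len(ref)*N, D]
        # Standard scaled dot-product attention over the extended context.
        attn = F.softmax(q[t] @ k_ext.T / D ** 0.5, dim=-1)  # [N, len(ref)*N]
        out[t] = attn @ v_ext
    return out
```

Denser key-frame sampling (e.g. "Sampled at 1" vs "Sampled at 3") simply applies this joint attention to more frames at once, which is where the adaptive method recovers the cost.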
[1] Zhang, Yabo, Yuxiang Wei, Dongsheng Jiang, Xiaopeng Zhang, Wangmeng Zuo, and Qi Tian. "ControlVideo: Training-free Controllable Text-to-video Generation." In The Twelfth International Conference on Learning Representations. 2024.
[2] Geyer, Michal, Omer Bar-Tal, Shai Bagon, and Tali Dekel. "TokenFlow: Consistent Diffusion Features for Consistent Video Editing." In The Twelfth International Conference on Learning Representations. 2024.
[3] Khachatryan, Levon, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. "Text2video-zero: Text-to-image diffusion models are zero-shot video generators." In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 15954-15964. 2023.
[4] Meng, Chenlin, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. "SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations." In International Conference on Learning Representations. 2022.
[5] Liang, Feng, et al. "Looking Backward: Streaming Video-to-Video Translation with Feature Banks." arXiv preprint arXiv:2405.15757 (2024).
[6] Yatim, Danah, et al. "Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.