Ada-VE: Training-Free Consistent Video Editing Using Adaptive Motion Prior (WACV 2025)

Tanvir Mahmud, Mustafa Munir, Radu Marculescu, and Diana Marculescu


Highlights

Ada-VE Overview
Figure 1: Ada-VE Overview: (i) Preprocessing: DDIM inversion extracts the inversion noise, and optical flow is used to compute motion masks. (ii) Sampled reference frames are edited iteratively with motion-guided sparse self-attention, and their key/value (KV) features are cached. (iii) The cached KVs are reused to edit the intermediate frames for enhanced temporal consistency.
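A minimal sketch of how the motion masks in step (i) could be derived from optical flow, by thresholding the flow magnitude and resizing it to the attention resolution. The function name, threshold value, and interpolation choice below are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def motion_mask_from_flow(flow, latent_hw, thresh=1.0):
    """Convert dense optical flow into a token-level motion mask.

    flow:      [frames, 2, H, W] optical flow between consecutive source frames (e.g., from RAFT).
    latent_hw: (h, w) spatial size of the UNet attention layer the mask is used in.
    thresh:    flow-magnitude threshold (in pixels) above which a region counts as moving.
    Returns a [frames, h*w] boolean mask aligned with the attention tokens.
    """
    mag = flow.norm(dim=1, keepdim=True)                                  # [frames, 1, H, W]
    mag = F.interpolate(mag, size=latent_hw, mode="bilinear", align_corners=False)
    return (mag > thresh).flatten(1)                                      # [frames, h*w]

# Example: masks for 8 frames at a 32x32 latent resolution.
flow = torch.randn(8, 2, 512, 512)
mask = motion_mask_from_flow(flow, latent_hw=(32, 32))                    # [8, 1024] bool
```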
Self-Attention Comparison
Figure 2: Self-Attention Comparison: (i) Basic self-attention applies Q, K, and V within each frame independently. (ii) Fully extended self-attention combines the K and V of all frames for cross-frame attention. (iii) The proposed sparsely extended self-attention extends K and V only for moving regions, guided by motion masks, enhancing detail without increasing complexity.
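A minimal sketch of variant (iii), assuming a [frames, heads, tokens, dim] layout for the Q/K/V projections and PyTorch's scaled_dot_product_attention. Static tokens attend only within their own frame, while tokens flagged by the motion mask additionally attend to the reference frames' K/V; the reference-frame selection and KV caching details here are simplified assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def sparsely_extended_self_attention(q, k, v, motion_mask, ref_idx=(0,)):
    """Sparse cross-frame attention: only moving tokens see the reference frames' K/V.

    q, k, v:     [frames, heads, tokens, dim] projections of one attention layer.
    motion_mask: [frames, tokens] boolean mask, True where a token lies in a moving region.
    ref_idx:     indices of the reference frames whose K/V are shared across frames.
    """
    frames, heads, tokens, dim = q.shape
    # Concatenate the reference frames' K/V along the token axis (shared by all frames).
    k_ref = k[list(ref_idx)].permute(1, 0, 2, 3).reshape(heads, -1, dim)
    v_ref = v[list(ref_idx)].permute(1, 0, 2, 3).reshape(heads, -1, dim)

    out = torch.empty_like(q)
    for f in range(frames):
        moving = motion_mask[f]
        # Static tokens: plain per-frame self-attention.
        out[f] = F.scaled_dot_product_attention(q[f], k[f], v[f])
        if moving.any():
            # Moving tokens: attend to [own frame ; reference frames] keys/values.
            q_m = q[f][:, moving]                          # [heads, n_moving, dim]
            k_ext = torch.cat([k[f], k_ref], dim=1)
            v_ext = torch.cat([v[f], v_ref], dim=1)
            out[f][:, moving] = F.scaled_dot_product_attention(q_m, k_ext, v_ext)
    return out

# Example: 8 frames, 4 heads, 32x32 latent tokens, ~20% of tokens marked as moving.
q = k = v = torch.randn(8, 4, 1024, 64)
mask = torch.rand(8, 1024) > 0.8
out = sparsely_extended_self_attention(q, k, v, mask, ref_idx=(0,))
```

Because the extended K/V are used only for the masked tokens, the extra cost in this sketch scales with the number of moving tokens rather than with the full frame, which is what keeps the complexity close to basic self-attention.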

Comparisons to Existing Baselines

Our method substantially improves visual quality while maintaining the temporal consistency of the edited videos. Since all videos are compressed, please view them in full screen for the original quality. We use shorter videos (16 frames) for the Text2Video-Zero baseline due to its high memory requirements.




"A marble sculpture running" Ours TokenFlow ([2])
ControlVideo ([1]) SDEdit ([4]) T2VZero ([3])
StreamV2V ([5]) MotionTransfer ([6])



"waterpainting of woman running" Ours TokenFlow ([2])
ControlVideo ([1]) SDEdit ([4]) T2VZero ([3])
StreamV2V ([5]) MotionTransfer ([6])



"Shiny silver robot doing moonwalk" Ours TokenFlow ([2])
ControlVideo ([1]) SDEdit ([4]) T2VZero ([3])
StreamV2V ([5]) MotionTransfer ([6])



"Car running in winter days" Ours TokenFlow ([2])
ControlVideo ([1]) SDEdit ([4]) T2VZero ([3])
StreamV2V ([5]) MotionTransfer ([6])



"Watercolor painting of car running" Ours TokenFlow ([2])
ControlVideo ([1]) SDEdit ([4]) T2VZero ([3])
MotionTransfer ([6])



"Shiny silver robotic wolf" Ours TokenFlow ([2])
ControlVideo ([1]) SDEdit ([4]) T2VZero ([3])
StreamV2V ([5]) MotionTransfer ([6])



"Cutting Bricks" Ours TokenFlow ([2])
ControlVideo ([1]) SDEdit ([4]) T2VZero ([3])
StreamV2V ([5]) MotionTransfer ([6])



"Cutting bread, Van Gogh style" Ours TokenFlow ([2])
ControlVideo ([1]) SDEdit ([4]) T2VZero ([3])
StreamV2V ([5]) MotionTransfer ([6])


Ablations on Various Self-Attention Extensions

We conduct an ablation study on various extensions of the self-attention mechanism for consistent video editing, jointly editing 40 frames from each video within the same framework.

We observe significant improvements in quality and consistency with denser sampling of reference frames for the extensions. Notably, our adaptive method preserves the performance of the dense baseline extensions while achieving a significant speed-up.
Compared settings: Input Video | First Frame | First Frame + Prev Frame | First Frame + Two Prev Frames | Sampled at 3 | Sampled at 3 + Ada-VE (ours) | Sampled at 1 | Sampled at 1 + Ada-VE (ours)
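To make the settings above concrete, here is a small illustrative mapping from the ablation variants to frame indices; the function names and exact indexing are our own assumptions rather than the released implementation.

```python
def reference_frames(num_frames, stride=3):
    """Indices of jointly edited reference frames ("Sampled at 3" / "Sampled at 1")."""
    return list(range(0, num_frames, stride))

def kv_source_frames(f, variant="first+prev"):
    """Frames whose K/V frame f attends to under each extension variant."""
    if variant == "first":                         # "First Frame"
        return [0]
    if variant == "first+prev":                    # "First Frame + Prev Frame"
        return [0, max(f - 1, 0)]
    if variant == "first+two_prev":                # "First Frame + Two Prev Frames"
        return [0, max(f - 1, 0), max(f - 2, 0)]
    raise ValueError(f"unknown variant: {variant}")

# With the 40 jointly edited frames used in this ablation:
print(len(reference_frames(40, stride=3)))         # 14 reference frames
print(kv_source_frames(10, "first+two_prev"))      # [0, 9, 8]
```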




[1] Zhang, Yabo, Yuxiang Wei, Dongsheng Jiang, Xiaopeng Zhang, Wangmeng Zuo, and Qi Tian. "ControlVideo: Training-Free Controllable Text-to-Video Generation." In The Twelfth International Conference on Learning Representations, 2024.

[2] Geyer, Michal, Omer Bar-Tal, Shai Bagon, and Tali Dekel. "TokenFlow: Consistent Diffusion Features for Consistent Video Editing." In The Twelfth International Conference on Learning Representations. 2024.

[3] Khachatryan, Levon, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. "Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators." In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 15954-15964, 2023.

[4] Meng, Chenlin, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. "SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations." In International Conference on Learning Representations, 2022.

[5] Liang, Feng, et al. "Looking Backward: Streaming Video-to-Video Translation with Feature Banks." arXiv preprint arXiv:2405.15757 (2024).

[6] Yatim, Danah, et al. "Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.