Ada-VE: Training-Free Consistent Video Editing Using Adaptive Motion Prior (WACV 2025)

Tanvir Mahmud, Mustafa Munir, Radu Marculescu, and Diana Marculescu


Highlights

Ada-VE Overview
Figure 1: Ada-VE Overview: (i) Preprocessing: DDIM inversion extracts the inversion noise, and optical flow is used to compute motion masks. (ii) Sampled reference frames are edited iteratively with motion-guided sparse self-attention, and their key/value (KV) features are cached. (iii) The cached KVs are reused to edit the intermediate frames for enhanced temporal consistency.
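A minimal sketch of how the motion masks in step (i) could be derived from optical flow, by thresholding the flow magnitude and resizing it to the attention resolution. The function name, threshold value, and interpolation choice below are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def motion_mask_from_flow(flow, latent_hw, thresh=1.0):
    """Convert dense optical flow into a token-level motion mask.

    flow:      [frames, 2, H, W] optical flow between consecutive source frames (e.g., from RAFT).
    latent_hw: (h, w) spatial size of the UNet attention layer the mask is used in.
    thresh:    flow-magnitude threshold (in pixels) above which a region counts as moving.
    Returns a [frames, h*w] boolean mask aligned with the attention tokens.
    """
    mag = flow.norm(dim=1, keepdim=True)                                  # [frames, 1, H, W]
    mag = F.interpolate(mag, size=latent_hw, mode="bilinear", align_corners=False)
    return (mag > thresh).flatten(1)                                      # [frames, h*w]

# Example: masks for 8 frames at a 32x32 latent resolution.
flow = torch.randn(8, 2, 512, 512)
mask = motion_mask_from_flow(flow, latent_hw=(32, 32))                    # [8, 1024] bool
```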
Self-Attention Comparison
Figure 2: Self-Attention Comparison: (i) Basic self-attention applies Q, K, and V within each frame independently. (ii) Fully extended self-attention combines the K and V of all frames for cross-frame attention. (iii) The proposed sparsely extended self-attention extends K and V only for moving regions, guided by motion masks, enhancing detail without increasing complexity.
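A minimal sketch of variant (iii), assuming a [frames, heads, tokens, dim] layout for the Q/K/V projections and PyTorch's scaled_dot_product_attention. Static tokens attend only within their own frame, while tokens flagged by the motion mask additionally attend to the reference frames' K/V; the reference-frame selection and KV caching details here are simplified assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def sparsely_extended_self_attention(q, k, v, motion_mask, ref_idx=(0,)):
    """Sparse cross-frame attention: only moving tokens see the reference frames' K/V.

    q, k, v:     [frames, heads, tokens, dim] projections of one attention layer.
    motion_mask: [frames, tokens] boolean mask, True where a token lies in a moving region.
    ref_idx:     indices of the reference frames whose K/V are shared across frames.
    """
    frames, heads, tokens, dim = q.shape
    # Concatenate the reference frames' K/V along the token axis (shared by all frames).
    k_ref = k[list(ref_idx)].permute(1, 0, 2, 3).reshape(heads, -1, dim)
    v_ref = v[list(ref_idx)].permute(1, 0, 2, 3).reshape(heads, -1, dim)

    out = torch.empty_like(q)
    for f in range(frames):
        moving = motion_mask[f]
        # Static tokens: plain per-frame self-attention.
        out[f] = F.scaled_dot_product_attention(q[f], k[f], v[f])
        if moving.any():
            # Moving tokens: attend to [own frame ; reference frames] keys/values.
            q_m = q[f][:, moving]                          # [heads, n_moving, dim]
            k_ext = torch.cat([k[f], k_ref], dim=1)
            v_ext = torch.cat([v[f], v_ref], dim=1)
            out[f][:, moving] = F.scaled_dot_product_attention(q_m, k_ext, v_ext)
    return out

# Example: 8 frames, 4 heads, 32x32 latent tokens, ~20% of tokens marked as moving.
q = k = v = torch.randn(8, 4, 1024, 64)
mask = torch.rand(8, 1024) > 0.8
out = sparsely_extended_self_attention(q, k, v, mask, ref_idx=(0,))
```

Because the extended K/V are used only for the masked tokens, the extra cost in this sketch scales with the number of moving tokens rather than with the full frame, which is what keeps the complexity close to basic self-attention.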

Comparisons to Existing Baselines

Our method substantially improves visual quality while maintaining the temporal consistency of the edited videos. Since all videos are compressed, please view them in full screen for the original quality. We use shorter videos (16 frames) for the Text2Video-Zero baseline due to its high memory requirements.




"A marble sculpture running" Ours TokenFlow ([2])
ControlVideo ([1]) SDEdit ([4]) T2VZero ([3])
StreamV2V ([5]) MotionTransfer ([6])



"waterpainting of woman running" Ours TokenFlow ([2])
ControlVideo ([1]) SDEdit ([4]) T2VZero ([3])
StreamV2V ([5]) MotionTransfer ([6])



"Shiny silver robot doing moonwalk" Ours TokenFlow ([2])
ControlVideo ([1]) SDEdit ([4]) T2VZero ([3])
StreamV2V ([5]) MotionTransfer ([6])



"Car running in winter days" Ours TokenFlow ([2])
ControlVideo ([1]) SDEdit ([4]) T2VZero ([3])
StreamV2V ([5]) MotionTransfer ([6])



"Watercolor painting of car running" Ours TokenFlow ([2])
ControlVideo ([1]) SDEdit ([4]) T2VZero ([3])
MotionTransfer ([6])



"Shiny silver robotic wolf" Ours TokenFlow ([2])
ControlVideo ([1]) SDEdit ([4]) T2VZero ([3])
StreamV2V ([5]) MotionTransfer ([6])



"Cutting Bricks" Ours TokenFlow ([2])
ControlVideo ([1]) SDEdit ([4]) T2VZero ([3])
StreamV2V ([5]) MotionTransfer ([6])



"Cutting bread, Van Gogh style" Ours TokenFlow ([2])
ControlVideo ([1]) SDEdit ([4]) T2VZero ([3])
StreamV2V ([5]) MotionTransfer ([6])


Ablations on Various Self-Attention Extensions

We conduct an ablation study on various extensions of the self-attention mechanism for consistent video editing, jointly editing 40 frames from each video within the same framework.

We observe significant improvements in quality and consistency with denser sampling of reference frames for the extensions. Notably, our adaptive method preserves the performance of the dense baseline extensions while achieving a significant speed-up.
Compared settings: Input Video | First Frame | First Frame + Prev Frame | First Frame + Two Prev Frames | Sampled at 3 | Sampled at 3 + Ada-VE (ours) | Sampled at 1 | Sampled at 1 + Ada-VE (ours)
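To make the settings above concrete, here is a small illustrative mapping from the ablation variants to frame indices; the function names and exact indexing are our own assumptions rather than the released implementation.

```python
def reference_frames(num_frames, stride=3):
    """Indices of jointly edited reference frames ("Sampled at 3" / "Sampled at 1")."""
    return list(range(0, num_frames, stride))

def kv_source_frames(f, variant="first+prev"):
    """Frames whose K/V frame f attends to under each extension variant."""
    if variant == "first":                         # "First Frame"
        return [0]
    if variant == "first+prev":                    # "First Frame + Prev Frame"
        return [0, max(f - 1, 0)]
    if variant == "first+two_prev":                # "First Frame + Two Prev Frames"
        return [0, max(f - 1, 0), max(f - 2, 0)]
    raise ValueError(f"unknown variant: {variant}")

# With the 40 jointly edited frames used in this ablation:
print(len(reference_frames(40, stride=3)))         # 14 reference frames
print(kv_source_frames(10, "first+two_prev"))      # [0, 9, 8]
```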




[1] Zhang, Yabo, Yuxiang Wei, Dongsheng Jiang, Xiaopeng Zhang, Wangmeng Zuo, and Qi Tian. "ControlVideo: Training-Free Controllable Text-to-Video Generation." In The Twelfth International Conference on Learning Representations, 2024.

[2] Geyer, Michal, Omer Bar-Tal, Shai Bagon, and Tali Dekel. "TokenFlow: Consistent Diffusion Features for Consistent Video Editing." In The Twelfth International Conference on Learning Representations. 2024.

[3] Khachatryan, Levon, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. "Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators." In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 15954-15964, 2023.

[4] Meng, Chenlin, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. "SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations." In International Conference on Learning Representations, 2022.

[5] Liang, Feng, et al. "Looking Backward: Streaming Video-to-Video Translation with Feature Banks." arXiv preprint arXiv:2405.15757 (2024).

[6] Yatim, Danah, et al. "Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.