
However, it is still unclear how to extend such success to the video editing realm. Given an input video and text prompts, a text-driven video editing algorithm is required to output an edited video that satisfies the given prompt.

However, training requires both significant amounts of paired text-video data and computational resources, which are often inaccessible.

Problem statement: diffusion models have succeeded in the image domain, but that success has not yet carried over to video. Training also demands too much data and compute. Solutions have been proposed, but their quality falls short.

Different from these methods, we aim at performing zero-shot video editing using off-the-shelf image diffusion models, without training on any video.

Unlike prior (fine-tuning) methods, the focus here is on editing with off-the-shelf models, without any video training.

Frame-wise image editing produces temporally inconsistent results when altering the video style, due to the lack of temporal modeling.

To tackle this problem, we propose a simple yet effective pipeline, termed vid2vid-zero, for zero-shot video editing.

Problem statement 2: this approach yields inconsistent results due to the lack of temporal modeling. To address this, the authors propose vid2vid-zero, a simple yet clearly effective method.

1. Method


Real Video Inversion

DDIM Inversion.

Motivated by their success in image editing [23, 25], we first invert each frame of the input video into the noise space, through the commonly used deterministic inversion method, DDIM inversion [37].

As in image editing, the authors invert the input video into the noise space.
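The deterministic DDIM inversion step can be sketched in a few lines of numpy. This is a minimal illustration of the update rule, not the paper's implementation: the model's noise prediction is supplied as a callable `eps_fn` (a hypothetical stand-in for the U-Net), and `alpha_bars` is the cumulative noise schedule.

```python
import numpy as np

def ddim_inversion_step(x_t, eps, alpha_bar_t, alpha_bar_next):
    """One deterministic DDIM inversion step: map x_t to x_{t+1}.

    `eps` is the noise predicted by the diffusion model at timestep t;
    in practice it comes from a U-Net conditioned on the source prompt.
    """
    # Predict the clean sample x0 from x_t and the noise estimate.
    x0_pred = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)
    # Re-noise deterministically toward the next (noisier) timestep.
    return np.sqrt(alpha_bar_next) * x0_pred + np.sqrt(1.0 - alpha_bar_next) * eps

def invert(x0, eps_fn, alpha_bars):
    """Run inversion from a clean frame x0 up to the final noise level."""
    x = x0
    for t in range(len(alpha_bars) - 1):
        x = ddim_inversion_step(x, eps_fn(x, t), alpha_bars[t], alpha_bars[t + 1])
    return x
```

In vid2vid-zero this step is applied to every frame independently, giving a per-frame latent that the later sampling stage starts from.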

Null-text Optimization.

When sampling with the latent x_T^inv and the source prompt, the reconstructed video may differ significantly from the original video.

When the noised video is mapped back, it may differ from the original video.

We resort to prompt-tuning [21] to learn a soft text embedding that aligns with the video content. Although more sophisticated prompt-tuning methods could be used, we find that optimizing the null-text embedding [23] achieves promising text-to-video alignment.
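The idea behind null-text optimization can be illustrated with a toy example: hold the model fixed and run gradient descent on the unconditional ("null") embedding so that the reconstruction matches the inversion trajectory. The linear map `W` below is a deliberately simplified stand-in for the real U-Net with classifier-free guidance; `target` plays the role of a latent from the inversion trajectory. This is a sketch of the optimization pattern only.

```python
import numpy as np

rng = np.random.default_rng(0)

dim = 8
W = rng.normal(size=(dim, dim))   # toy stand-in for the (frozen) model
target = rng.normal(size=dim)     # latent from the inversion trajectory

null_emb = np.zeros(dim)          # the null-text embedding being optimized
lr = 0.005
for _ in range(1000):
    residual = W @ null_emb - target   # reconstruction error at this step
    grad = 2.0 * W.T @ residual        # gradient of the squared error
    null_emb -= lr * grad
```

In the actual method this optimization is performed per timestep along the DDIM trajectory, so the optimized embeddings steer sampling back toward the source video while the text prompt remains free for editing.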