While that is much more efficient than previous approaches, it still requires an optimization process. In addition, the generation abilities of Tune-A-Video are limited to text-guided video editing applications; video synthesis from scratch, however, remains out of its reach.
Problem statement: stable video generation still requires an optimization process, and the ability to edit via text is still limited.
In this paper, we take one step forward in studying the novel problem of zero-shot, “training-free” text-to-video synthesis, which is the task of generating videos from textual prompts without requiring any optimization or finetuning.
This paper studies the novel zero-shot, "training-free" problem: generating videos from textual prompts without any fine-tuning or optimization.
A key concept of our approach is to modify a pre-trained text-to-image model (e.g., Stable Diffusion), enriching it with temporally consistent generation. By building upon already trained text-to-image models, our method takes advantage of their excellent image generation quality and enhances their applicability to the video domain without performing additional training.
Key idea: approach the problem by modifying a t2i model. Building on an already trained t2i model, the authors' method brings its image quality and generation ability to the video domain without any additional training.
(1) we first enrich the latent codes of generated frames with motion information to keep the global scene and the background time consistent; (2) we then use cross-frame attention of each frame on the first frame to preserve the context, appearance, and identity of the foreground object throughout the entire sequence.
(1) Enrich the latent codes of the generated frames with motion information so that the global scene and background stay consistent over time, and (2) then apply cross-frame attention of each frame on the first frame to preserve information such as context, appearance, and identity throughout the entire sequence.
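Step (2) above can be sketched in plain PyTorch. This is a simplified illustration, not the authors' implementation: the function name `cross_frame_attention` and the flat `(frames, tokens, dim)` feature layout are assumptions.

```python
import torch

def cross_frame_attention(frame_feats: torch.Tensor) -> torch.Tensor:
    """Sketch of cross-frame attention: every frame queries the FIRST frame.

    frame_feats: (m, tokens, dim) features, one row per video frame.
    Keys and values are taken from frame 0 only, so the foreground
    object's appearance and identity are anchored to the first frame.
    """
    q = frame_feats                    # queries come from each frame
    k = frame_feats[:1].expand_as(q)   # keys from the first frame only
    v = frame_feats[:1].expand_as(q)   # values from the first frame only
    scale = q.shape[-1] ** 0.5
    attn = torch.softmax(q @ k.transpose(-1, -2) / scale, dim=-1)
    return attn @ v                    # (m, tokens, dim)
```

Because keys and values always come from frame 0, the result for frame 0 reduces to ordinary self-attention, while every later frame mixes frame-0 features, which keeps the foreground consistent across the sequence.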
To make video generation cheaper and easier, we propose a new problem: zero-shot text-to-video synthesis. Formally, given a text description τ and a positive integer m ∈ ℕ, the goal is to design a function F that outputs video frames 𝒱 ∈ ℝ^{m×H×W×3} (for a predefined resolution H × W) that exhibit temporal consistency. To determine the function F, no training or fine-tuning may be performed on a video dataset.
To make video generation cheap and easy, the authors propose a new zero-shot synthesis problem. Given a text description τ and a positive integer m (m samples are generated and joined into a video), design a function F that outputs temporally consistent video frames 𝒱. No training or fine-tuning is performed on video data.
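The formal definition above amounts to a function signature. A hypothetical stub (names, defaults, and the random placeholder body are all assumptions, not the paper's code) makes the contract explicit:

```python
import numpy as np

def F(tau: str, m: int, H: int = 64, W: int = 64) -> np.ndarray:
    """Zero-shot text-to-video: text + frame count -> (m, H, W, 3) video.

    A real implementation would run a frozen text-to-image diffusion
    model m times with shared motion-enriched latents and cross-frame
    attention; here a random placeholder stands in for that process.
    """
    rng = np.random.default_rng(0)
    return rng.random((m, H, W, 3), dtype=np.float32)

video = F("a cat walking on grass", m=8)
```

The point of the definition is the constraint, not the signature: F may use a pre-trained image model, but nothing about F may be learned from video data.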
As we need to generate videos instead of images, SD should operate on sequences of latent codes. However, as shown in Fig. 10 (first row), this leads to completely random generation of images sharing only the semantics described by τ but neither object appearance nor motion coherence.
Sampling with DDIM yields random results that share only the description τ; see Fig. 10.
To address this issue, we propose to (i) introduce motion dynamics between the latent codes x_T^1, …, x_T^m to keep the global scene time-consistent and (ii) use a cross-frame attention mechanism to preserve the appearance and the identity of the foreground object.
To solve this problem, the authors (1) introduce motion dynamics between the latent codes x_T^1, …, x_T^m to keep the global scene time-consistent, and (2) use cross-frame attention to preserve the appearance and identity of the foreground object.
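Point (1) can be sketched as warping the first frame's initial latent by a growing global translation, so consecutive frames share one scene that drifts coherently instead of being resampled at random. This is a simplified sketch: `latents_with_motion`, the `delta`/`lam` parameters, and the use of `torch.roll` as a stand-in for the paper's warping function are assumptions.

```python
import torch

def latents_with_motion(x1_T: torch.Tensor, m: int,
                        delta=(1, 1), lam: float = 1.0) -> torch.Tensor:
    """Build m initial latents from frame 1's latent via cumulative shifts.

    x1_T:  (C, h, w) initial latent code of the first frame.
    delta: per-frame translation direction (dy, dx) in latent pixels.
    lam:   global motion-strength scale.
    """
    latents = []
    for k in range(m):
        dy = int(round(lam * k * delta[0]))
        dx = int(round(lam * k * delta[1]))
        # torch.roll stands in for a warping operation: frame k sees the
        # same latent content translated by k * lam * delta.
        latents.append(torch.roll(x1_T, shifts=(dy, dx), dims=(-2, -1)))
    return torch.stack(latents)  # (m, C, h, w)
```

Frame 0 keeps the original latent, and each later frame is the same latent shifted a bit further, which is what makes the background move consistently rather than change identity frame to frame.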