We investigate generative models suited for interactive applications in video editing.

We aim to circumvent expensive per-video training and correspondence calculation to achieve fast inference for arbitrary videos.

The authors target a model that is expensive to train but fast at inference.

We propose a controllable structure and content-aware video diffusion model trained on a large-scale dataset of uncaptioned videos and paired text-image data. We opt to represent structure with monocular depth estimates and content with embeddings predicted by a pre-trained neural network.

The authors propose a controllable, structure- and content-aware video diffusion model. Structure is represented with monocular depth estimates, and content with embeddings predicted by a pretrained neural network.
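A sketch of the two conditioning signals. A real pipeline would use a monocular depth network for structure and a pretrained image encoder for content; the two extractors below are hypothetical stand-ins, only meant to show the shapes involved.

```python
import numpy as np

# Hypothetical stand-ins: a real pipeline would run a monocular depth
# network for s and a pretrained encoder for c.

def estimate_depth(frame: np.ndarray) -> np.ndarray:
    """Stand-in depth estimator: one value per pixel (here, mean luminance)."""
    return frame.mean(axis=-1)                      # shape (H, W)

def content_embedding(frame: np.ndarray, dim: int = 32) -> np.ndarray:
    """Stand-in pretrained encoder: a fixed random projection to one vector."""
    rng = np.random.default_rng(0)                  # fixed weights for the sketch
    proj = rng.standard_normal((frame.size, dim))
    return frame.reshape(-1) @ proj                 # shape (dim,)

frame = np.zeros((16, 16, 3), dtype=np.float32)     # one video frame
s = estimate_depth(frame)       # structure: per-pixel geometry, shape (16, 16)
c = content_embedding(frame)    # content: global appearance vector, shape (32,)
```

The key point is the asymmetry: structure `s` stays spatial (per-pixel), while content `c` is a global vector with no spatial layout.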

Method

For our purposes, it will be helpful to think of a video in terms of its content and structure. By structure, we refer to characteristics describing its geometry and dynamics, e.g. shapes and locations of subjects as well as their temporal changes. We define content as features describing the appearance and semantics of the video, such as the colors and styles of objects and the lighting of the scene.

The authors find it helpful to think of a video in terms of content and structure: structure means geometric characteristics and dynamics (shapes and locations of subjects over time), while content means appearance, such as colors and styles.

To achieve this, we aim to learn a generative model p(x|s, c) of videos x, conditioned on representations of structure, denoted by s, and content, denoted by c.

A video is generated conditioned on both structure and content.
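A minimal sketch of one training step for the conditional objective p(x|s, c), assuming epsilon-prediction and a cosine noise schedule (both are assumptions for the sketch; the paper's exact parameterization is not reproduced here). `denoiser` is a hypothetical placeholder network.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoiser(z_t, t, s, c):
    """Hypothetical noise-prediction network eps_theta(z_t, t, s, c)."""
    return np.zeros_like(z_t)                      # untrained placeholder

def diffusion_step(x0, s, c, T=1000):
    """One step of the conditional denoising objective (epsilon-prediction)."""
    t = rng.integers(1, T)                         # random timestep
    alpha_bar = np.cos(0.5 * np.pi * t / T) ** 2   # assumed cosine schedule
    eps = rng.standard_normal(x0.shape)
    z_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    eps_hat = denoiser(z_t, t, s, c)               # conditioned on s and c
    return float(np.mean((eps - eps_hat) ** 2))

loss = diffusion_step(x0=np.zeros((2, 3, 8, 8)), s=None, c=None)
```

Structure and content enter only through the denoiser's conditioning arguments; the noising process itself is unconditional.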

Spatio-temporal Latent Diffusion

we extend an image architecture by introducing temporal layers, which are only active for video inputs. All other layers are shared between the image and video model. The autoencoder remains fixed and processes each frame in a video independently.

The distribution over video frames must capture the relations between frames. The authors want to jointly train an image model with shared parameters, so the video model benefits from the better generalization gained by training on large image datasets. They add temporal layers that are only active for video inputs (presumably to carry the generalization learned from images over to video). All other layers are shared between the video and image models. The autoencoder stays fixed and processes each frame of a video independently.
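The gating of temporal layers can be sketched as follows. This is a toy numpy version, assuming a (batch, time, channels, height, width) layout; the temporal averaging stands in for the paper's learned temporal layers.

```python
import numpy as np

def spatial_layer(x):
    """Shared image-model layer: acts on each frame independently (identity here)."""
    return x

def temporal_layer(x):
    """Video-only layer: mixes information across the time axis (toy averaging)."""
    return 0.5 * x + 0.5 * x.mean(axis=1, keepdims=True)

def block(x, is_video):
    """Image layers always run; temporal layers are active only for videos."""
    x = spatial_layer(x)
    if is_video:
        x = temporal_layer(x)
    return x

image = np.ones((2, 1, 4, 8, 8))   # an image treated as a one-frame video
video = np.ones((2, 5, 4, 8, 8))   # (batch, T, C, H, W), layout assumed
out_i = block(image, is_video=False)
out_v = block(video, is_video=True)
```

Because the image path never touches the temporal layer, image and video batches can share all spatial weights during joint training.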

The UNet consists of two main building blocks: Residual blocks and transformer blocks (see Fig. 3). Similar to [17, 49], we extend them to videos by adding both 1D convolutions across time and 1D self-attentions across time.

The UNet splits into two block types, residual blocks and transformer blocks, and both are extended across time.
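A shape-level sketch of the two temporal extensions, again assuming a (batch, T, channels, H, W) layout. These are toy numpy versions, not the paper's trained layers: the 1D convolution slides a shared filter over the time axis at each spatial location, and the 1D self-attention lets each pixel attend over its own T frames.

```python
import numpy as np

def temporal_conv1d(x, kernel):
    """1D convolution across time, applied independently per spatial location.
    x: (B, T, C, H, W); kernel: (K,) shared temporal filter (toy version)."""
    B, T, C, H, W = x.shape
    pad = len(kernel) // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (0, 0), (0, 0), (0, 0)))
    out = np.zeros_like(x)
    for k, w in enumerate(kernel):
        out += w * xp[:, k:k + T]                  # shifted copies along time
    return out

def temporal_self_attention(x):
    """1D self-attention across time: each pixel attends over its T frames."""
    B, T, C, H, W = x.shape
    seq = x.transpose(0, 3, 4, 1, 2).reshape(-1, T, C)   # (B*H*W, T, C)
    scores = seq @ seq.transpose(0, 2, 1) / np.sqrt(C)   # (N, T, T)
    attn = np.exp(scores - scores.max(-1, keepdims=True))
    attn /= attn.sum(-1, keepdims=True)                  # softmax over time
    out = attn @ seq                                     # (N, T, C)
    return out.reshape(B, H, W, T, C).transpose(0, 3, 4, 1, 2)

x = np.ones((1, 4, 2, 3, 3))
y = temporal_conv1d(x, np.array([0.25, 0.5, 0.25]))
z = temporal_self_attention(x)
```

Both operations preserve the input shape, so they can be dropped into existing residual and transformer blocks without changing the spatial layers around them.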

Representing Content and Structure