Despite their recent success in text-to-image generation tasks, the application of diffusion-based generative models to video generation is still under-explored, due to the following difficulties:
- Data scarcity
- Complex temporal dynamics
- High computation cost
Problem statement: video generation still has these open problems.
Due to these challenges, recent diffusion-based video generation models propose to deploy a cascaded pipeline [15], which generates low-resolution video frames first, followed by a super-resolution module.
So, at first, a cascaded pipeline was proposed: train at low resolution and progressively raise the resolution. Even so, the cost is still huge.
To improve data efficiency and relieve the requirement for video-text paired training data, instead of building our models with 3D convolutions, we choose to adopt 2D convolutions together with temporal computation operators to model the spatial and temporal video features.
To further reduce the memory cost, we share the same 2D convolutions for processing all the frames. However, this deteriorates the generation quality of the video's temporal dynamics (e.g., object motion), because features change across different frames while the convolution is shared. Therefore, we introduce a new and lightweight adaptor module to adjust the distribution of each frame's features.
The authors' solution: (1) build with 2D convs instead of 3D convs; (2) share the same 2D conv across all frames. But sharing alone actually hurts temporal quality, so they introduce a lightweight adaptor module.
Let $x_t$ denote the video frames corrupted by Gaussian noise at intermediate time step $t$; $x_t$ is short for $x_t = [x_t^1, \ldots, x_t^F]$, where $x_t^i$ represents the $i$-th frame in the sequence. The encoder and decoder of the variational auto-encoder are denoted by $E(\cdot)$ and $D(\cdot)$, respectively. The video frames are mapped into the latent space one by one, i.e., $z_t = [E(x_t^1), \ldots, E(x_t^F)]$. We use CLIP [29] to encode the given text prompt $y$, and the obtained embedding is denoted as $\tau(y)$. We use $\epsilon_\theta(z_t, t, \tau(y))$ to denote the denoiser of the diffusion model in the latent space.
In short, $\tau(y)$ = CLIP(text prompt); the rest is standard diffusion notation.
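A minimal PyTorch sketch of this per-frame latent encoding, just to make the shapes concrete; `encode_video` and the pooling stand-in for $E(\cdot)$ are illustrative names and not the paper's code:

```python
import torch

def encode_video(frames: torch.Tensor, encode) -> torch.Tensor:
    """Map the F frames into the latent space one by one: z_t = [E(x_t^1), ..., E(x_t^F)].

    frames: (F, C, H, W) tensor of frames at a given diffusion step.
    encode: any callable standing in for the VAE encoder E(.).
    """
    return torch.stack([encode(frames[i]) for i in range(frames.shape[0])], dim=0)

# Toy usage: 8x average pooling stands in for E(.), purely to illustrate the shapes.
frames = torch.randn(16, 3, 256, 256)   # 16 frames of 256x256 RGB
latents = encode_video(
    frames, lambda x: torch.nn.functional.avg_pool2d(x.unsqueeze(0), 8).squeeze(0)
)
print(latents.shape)                     # torch.Size([16, 3, 32, 32])
```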
Same as using Latent Diffusion; in the end, a novel 3D U-Net decoder is used.
#### 2D Convolution with Distribution Adaptor
However, the computation complexity and hardware compatibility of 3D convolution are significantly worse than those of 2D convolution. Thus, to reduce the high computational cost and redundancy, recent video processing models typically replace 3D convolution with a 2D convolution along the spatial dimensions followed by a 1D convolution [44] along the temporal dimension (termed “2D+1D”).
3D conv is too heavy, so it is replaced with a 2D conv followed by a 1D conv.
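A rough PyTorch sketch of such a “2D+1D” factorization; the layer sizes and where the temporal 1D convolution is applied are assumptions made for illustration, not the cited models' exact architecture:

```python
import torch
import torch.nn as nn

class Factorized2D1D(nn.Module):
    """Sketch of the '2D+1D' factorization: a spatial 2D conv shared across frames,
    followed by a 1D conv along the temporal (frame) axis."""

    def __init__(self, channels: int):
        super().__init__()
        self.spatial = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.temporal = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, F, C, H, W) — a batch of F-frame feature maps.
        b, f, c, h, w = x.shape
        x = self.spatial(x.reshape(b * f, c, h, w))           # 2D conv over each frame
        x = x.reshape(b, f, c, h, w).permute(0, 3, 4, 2, 1)   # -> (B, H, W, C, F)
        x = self.temporal(x.reshape(b * h * w, c, f))         # 1D conv over the frame axis
        return x.reshape(b, h, w, c, f).permute(0, 4, 3, 1, 2)

x = torch.randn(2, 8, 16, 32, 32)        # 2 clips, 8 frames, 16 channels, 32x32
print(Factorized2D1D(16)(x).shape)       # torch.Size([2, 8, 16, 32, 32])
```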
In this work, we further simplify this process from the form of “2D+1D” to “2D+adaptor”, where the adaptor is an even simpler operator compared to the 1D convolution. Specifically, given a set of F video frames, we apply a shared 2D convolution for all the frames to extract their spatial features. After that, we assign a set of distribution adjustment parameters to adjust the mean and variance for the intermediate features of every single frame via:
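The equation itself is not reproduced in these notes, so the following is only one plausible reading of “2D+adaptor”: a single 2D convolution shared by all F frames, followed by per-frame scale and shift parameters that adjust the mean and variance of each frame's intermediate features. The affine form and parameter shapes below are assumptions, not the paper's exact formula:

```python
import torch
import torch.nn as nn

class Conv2DWithAdaptor(nn.Module):
    """Sketch of the '2D+adaptor' idea: one 2D conv shared by all F frames,
    followed by a lightweight per-frame affine adaptor (assumed form)."""

    def __init__(self, channels: int, num_frames: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # One (scale, shift) pair per frame and channel; far cheaper than a temporal conv.
        self.scale = nn.Parameter(torch.ones(num_frames, channels, 1, 1))
        self.shift = nn.Parameter(torch.zeros(num_frames, channels, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, F, C, H, W)
        b, f, c, h, w = x.shape
        feats = self.conv(x.reshape(b * f, c, h, w)).reshape(b, f, c, h, w)
        return feats * self.scale + self.shift   # broadcasts over batch, H, W

x = torch.randn(2, 8, 16, 32, 32)
print(Conv2DWithAdaptor(16, 8)(x).shape)   # torch.Size([2, 8, 16, 32, 32])
```

In this sketch the adaptor adds only 2·F·C parameters per block, which is why such an operator can be described as even lighter than a temporal 1D convolution.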