The two main challenges in T2V generation are the lack of high-quality text-video data at scale and the complexity of modeling the temporal dimension. There are two mainstream frameworks: 1) Transformer with Variational Auto Encoders (VAE); 2) diffusion models with UNet.
Problem statement: The two main challenges in T2V generation are the complexity of modeling the temporal dimension and the lack of high-quality text-video data at scale. There are two mainstream approaches: VAE-based and U-Net-based.
However, due to the complexity of video modeling, pixel-based T2V diffusion models must compromise to generate a low-resolution video first (64 × 64 in Make-A-Video and 40 × 24 in Imagen Video), followed by a sequence of super-resolution and frame interpolation models (see Tab. 4 for details). This makes the entire pipeline complicated and computationally expensive.
Problem statement: Because of the complexity of video modeling, pixel-based T2V diffusion models must compromise by first generating a low-resolution video and then applying a sequence of super-resolution and frame interpolation models. This makes the pipeline complicated and computationally expensive.
In this paper, we propose Latent-Shift, an efficient model that can generate a two-second video clip at 256 × 256 resolution without additional super-resolution or frame interpolation models.
Proposal: Latent-Shift, an efficient model that can generate a two-second video clip at 256 × 256 resolution without any super-resolution or frame interpolation models.
We use a parameter-free temporal shift module, motivated by [19,22]. During training, we shift a few channels of the spatial U-Net feature maps forward and backward along the temporal dimension. This allows the shifted features of the current frame to observe the features from the previous and the subsequent frames and thus helps to learn temporal coherence.
A parameter-free temporal shift module is used. During training, a few channels of the spatial U-Net feature maps are shifted forward and backward along the temporal dimension. This lets the shifted features of the current frame observe features from the previous and subsequent frames, which helps the model learn temporal coherence.
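The channel-shift operation described above can be sketched as follows. This is a minimal illustration in the spirit of TSM, not the paper's implementation: the `shift_fraction` value and the zero-padding at the sequence boundaries are assumptions.

```python
import numpy as np

def temporal_shift(feats, shift_fraction=0.25):
    """Parameter-free temporal shift over features of shape (T, C, H, W).

    A fraction of the channels is shifted forward in time, an equal
    fraction backward, and the remaining channels are left in place.
    Positions vacated at the sequence boundaries are zero-padded
    (an assumption; boundary handling is not specified in the notes).
    """
    t, c, h, w = feats.shape
    n = int(c * shift_fraction)           # channels per shift direction
    out = np.zeros_like(feats)
    out[1:, :n] = feats[:-1, :n]          # forward shift: frame t sees frame t-1
    out[:-1, n:2 * n] = feats[1:, n:2 * n]  # backward shift: frame t sees frame t+1
    out[:, 2 * n:] = feats[:, 2 * n:]     # untouched channels
    return out
```

Because the shift is a pure re-indexing, it adds no parameters and negligible compute, which is why it can be dropped into a pretrained spatial U-Net.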
There are two training stages in latent image diffusion models: 1) an autoencoder is trained to compress images into compact latent representations; 2) a diffusion model based on the U-Net architecture is trained on text-image pairs to learn T2I generation in the latent space.
There are two stages: 1) an autoencoder learns to compress images into compact latent representations; 2) a U-Net-based diffusion model is trained on text-image pairs to learn T2I generation in the latent space.
Latent Representation Learning.
The latent space is learned by an autoencoder that consists of an encoder and a decoder.
This is the autoencoder training used in the Stable Diffusion Model (SDM).
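As a rough illustration of the compression this stage buys, here is the shape bookkeeping under Stable Diffusion's typical settings (downsampling factor f = 8 and 4 latent channels; these values are assumptions and may differ in the paper):

```python
def latent_shape(h, w, f=8, z_channels=4):
    """Shape (C, H, W) of the latent z for an h x w RGB input image.

    With f = 8 and 4 latent channels, a 256 x 256 x 3 image maps to a
    4 x 32 x 32 latent, so the diffusion model operates on a tensor
    48x smaller than the pixel space.
    """
    return (z_channels, h // f, w // f)
```

This smaller latent space is what makes training and sampling in latent diffusion so much cheaper than pixel-space diffusion.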
Conditional Latent Diffusion Model
Specifically, given an image x that is encoded to the latent space z, we add Gaussian noise into z defined as:
Given an image x encoded into the latent space as z, Gaussian noise is added to z, defined as above.
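The noising step referenced above is not reproduced in these notes; in the standard DDPM formulation (an assumption, with $\bar{\alpha}_t$ the cumulative noise schedule) the noisy latent at step $t$ would be

$$
z_t = \sqrt{\bar{\alpha}_t}\, z + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)
$$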
The conditional latent diffusion model is trained to estimate the noise ε given a noisy input, conditioned on the text representation. A mean squared error loss is used:
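The loss equation itself is not reproduced in these notes; in the usual latent-diffusion formulation (an assumption, with $\epsilon_\theta$ the U-Net noise predictor, $z_t$ the noisy latent at step $t$, and $c$ the text representation) it is

$$
\mathcal{L} = \mathbb{E}_{z,\, c,\, \epsilon \sim \mathcal{N}(0, I),\, t}
\left[ \left\| \epsilon - \epsilon_\theta(z_t, t, c) \right\|_2^2 \right]
$$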