
However, in text-to-video generation there are still limited applications, due to the scarcity of high-quality video datasets and video generative prior models.

Problem statement: T2V generation is constrained by the lack of high-quality video datasets and of video generative prior models.

In this work, we design an efficient training scheme using easily obtained datasets and the pre-trained T2I models for controllable text-to-video generation.

The authors design an efficient training scheme that uses easily obtained datasets together with pretrained T2I models for controllable T2V generation.


Specifically, given image-pose pairs and pose-free videos, we design a novel two-stage training strategy with carefully tuned blocks on top of a pretrained text-to-image model.

Given pose-free videos and image-pose pairs, the authors design a two-stage training strategy with carefully tuned blocks built on a pretrained text-to-image model.

Overall, our proposed method is equipped with delicate designs to generate videos that allow for flexible control through pose sequence, without the need for large video generation models. Moreover, our model inherits the robust editing, composition, and generalization capabilities of the pre-trained T2I model.

Overall, the method has a careful design that generates videos with flexible control via pose sequences, without needing a large video generation model. Moreover, the model inherits the robust editing, composition, and generalization capabilities of the pretrained T2I model.

1. Method

Pose-guided Text-to-Video Generation


Due to the scarcity of qualified video-pose pairs in various datasets, we decide to decouple the temporal and control conditions, so that our model learns pose control capability from images and temporal consistency from videos. Therefore, we train our model in two different stages.

Because qualified video-pose pairs are scarce across datasets, the authors decouple the temporal and control conditions: the model learns pose control from images and temporal consistency from videos. Training therefore proceeds in two stages (one learning pose control, one learning temporal consistency).
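The decoupling described above can be summarized as a parameter-freezing plan per stage. This is a hypothetical sketch for illustration; the module names and data descriptions are assumptions, not the paper's actual code:

```python
# Hypothetical two-stage training plan for the decoupled setup:
# stage 1 learns pose control from image-pose pairs, stage 2 learns
# temporal consistency from pose-free videos. The pretrained T2I U-Net
# stays frozen throughout; block names below are illustrative.
stage_plan = {
    "stage1": {
        "data": "image-pose pairs",
        "trainable": ["pose_encoder_blocks"],
        "frozen": ["pretrained_T2I_unet"],
        "learns": "pose controllability",
    },
    "stage2": {
        "data": "pose-free videos",
        "trainable": ["temporal_attention_blocks"],
        "frozen": ["pretrained_T2I_unet", "pose_encoder_blocks"],
        "learns": "temporal consistency",
    },
}

# Each stage optimizes a disjoint set of blocks, so neither stage
# needs scarce video-pose pairs.
for stage, plan in stage_plan.items():
    print(stage, "->", plan["learns"], "from", plan["data"])
```

The key design choice is that the trainable sets are disjoint, so pose control and temporal consistency are learned independently from two cheap data sources.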

Base Model Architecture.

The widely-used diffusion model [25] for image synthesis employs U-Net [26] for denoising, which is a multi-stage neural network architecture that involves spatial downsampling followed by an upsampling.

Spatial self-attention captures correlations between the spatial locations of the latent representation, while cross-attention models the correspondence between the latent and conditional inputs (such as text).

The same U-Net is used, with spatial self-attention. While cross-attention considers the correspondence between the latent and the conditional input, spatial self-attention captures correlations among the spatial locations of the latent representation.
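The difference between the two attention types is only where Q, K, and V come from. A minimal NumPy sketch (shapes and dimensions are illustrative assumptions, not the model's actual sizes):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

rng = np.random.default_rng(0)
latent = rng.standard_normal((16, 8))  # 16 flattened spatial locations, 8 channels
text = rng.standard_normal((4, 8))     # 4 conditioning tokens (e.g. text embeddings)

# Spatial self-attention: Q, K, V all come from the latent itself,
# so each spatial location attends to every other location.
self_out = attention(latent, latent, latent)

# Cross-attention: Q from the latent, K and V from the condition,
# so each spatial location attends over the conditioning tokens.
cross_out = attention(latent, text, text)

print(self_out.shape, cross_out.shape)  # (16, 8) (16, 8)
```

Note that both produce an output with the latent's shape; only the set of keys/values being attended over changes, which is why the text condition can be injected without altering the spatial layout.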

Training Stage 1: Pose-Controllable Text-to-Image Generation.