
As T2I models pretrained with large-scale image-text data already capture knowledge of open-domain concepts, an intuitive question arises: can they infer other novel videos from a single video example, like humans? A new T2V generation setting is therefore introduced, namely, One-Shot Video Tuning, where only a single text-video pair is used to train a T2V generator. The generator is expected to capture essential motion information from the input video and synthesize novel videos with edited prompts.

Question raised: can T2I models infer other novel videos from a single video? This T2V generation setting is introduced under the name One-Shot Video Tuning, where only a single text-video pair is used to train the T2V generator. The generator captures the motion information from the input video and synthesizes novel videos with edited prompts.


Regarding motion: T2I models are able to generate images that align well with the text, including the verb terms.

T2I models generate images that align well with the text, including verbs; see the first row of Fig. 2.

This serves as evidence that T2I models can properly attend to verbs via cross-modal attention for static motion generation.

This is evidence that T2I models can properly attend to verbs via cross-modal attention when generating static motion.
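The cross-modal attention mentioned above can be sketched in a few lines: image tokens act as queries against text-token keys/values, which is why individual words such as verbs can receive attention weight. This is a minimal numpy illustration of the mechanism, not the actual LDM implementation (the learned projection matrices are omitted for brevity).

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(img_tokens, text_tokens):
    """Cross-modal attention: image tokens query text tokens.

    img_tokens:  (n_img, d)  spatial tokens of the image
    text_tokens: (n_text, d) embedded prompt tokens (incl. verbs)
    Returns the attended context and the per-token attention weights.
    """
    scores = img_tokens @ text_tokens.T / np.sqrt(img_tokens.shape[-1])
    weights = softmax(scores)  # each image token: distribution over words
    return weights @ text_tokens, weights
```

Inspecting `weights` column-wise shows how strongly each word (e.g. a verb) is attended to by each spatial location.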

Regarding consistent objects: Simply extending the spatial self-attention in the T2I model from one image to multiple images produces consistent content across frames.

Simply extending the spatial self-attention in a T2I model from one image to multiple images produces consistent content across frames.
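The extension above amounts to letting every frame's tokens attend to the tokens of all frames instead of only their own frame. A minimal numpy sketch of this "inflated" self-attention (projection matrices and multi-head details omitted; shapes are illustrative assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # scaled dot-product attention: q (n_q, d), k/v (n_kv, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def inflated_self_attention(frames):
    """Spatial self-attention extended from one image to multiple images.

    frames: (f, n, d) -- f frames, n spatial tokens per frame, d channels.
    Concatenating frames along the token axis lets every token attend to
    every frame, which shares appearance and keeps content consistent.
    """
    f, n, d = frames.shape
    tokens = frames.reshape(f * n, d)        # stack all frames' tokens
    out = attention(tokens, tokens, tokens)  # full cross-frame attention
    return out.reshape(f, n, d)
```

Note that the token count grows with the number of frames f, which is exactly what makes full spatio-temporal attention quadratically expensive, as discussed next.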

We implement our findings into a simple yet effective method called Tune-A-Video.

However, using full attention in spacetime inevitably leads to quadratic growth in computation. It is thus infeasible for generating videos with increasing frames. Additionally, employing a naive fine-tuning strategy that updates all the parameters can jeopardize the preexisting knowledge of T2I models and hinder the generation of videos with new concepts.

The findings are implemented into a simple yet effective method called Tune-A-Video. However, using full attention in spacetime makes computation grow quadratically, so video generation becomes infeasible as the number of frames increases. Moreover, a naive fine-tuning strategy that updates all parameters can jeopardize the T2I model's pre-existing knowledge and hinder the generation of videos with new concepts.

To tackle these problems, we introduce a sparse spatio-temporal attention mechanism that only visits the first and the former video frame, as well as an efficient tuning strategy that only updates the projection matrices in attention blocks.

To tackle these problems, the authors introduce a sparse spatio-temporal attention mechanism that visits only the first and the former video frames, together with an efficient tuning strategy that updates only the projection matrices in the attention blocks.
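The sparse spatio-temporal attention can be sketched as follows: each frame's queries attend only to keys/values gathered from the first frame and the previous frame, so cost grows linearly rather than quadratically with the frame count. A minimal numpy sketch under some assumptions: projection matrices are omitted (the paper's efficient tuning updates only those projections), and the first frame is assumed to attend to itself since it has no predecessor.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def sparse_st_attention(frames):
    """Sparse spatio-temporal attention over video frames.

    frames: (f, n, d). Queries come from the current frame; keys/values
    come only from the FIRST and the PREVIOUS frame, so the key/value set
    stays at 2n tokens regardless of video length (linear total cost).
    """
    f, n, d = frames.shape
    out = np.empty_like(frames)
    for i in range(f):
        prev = max(i - 1, 0)  # assumption: frame 0 falls back to itself
        kv = np.concatenate([frames[0], frames[prev]], axis=0)  # (2n, d)
        out[i] = attention(frames[i], kv, kv)
    return out
```

Anchoring on the first frame preserves global appearance, while the previous frame provides local temporal smoothness.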

1. Method

Network Inflation


A T2I diffusion model (e.g., LDM [37]) typically employs a U-Net [38], which is a neural network architecture based on a spatial downsampling pass followed by an upsampling pass with skip connections.
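The downsampling-then-upsampling structure with skip connections can be sketched with shapes alone. This is a toy numpy stand-in (average pooling for downsampling, nearest-neighbour repetition for upsampling, addition for the skip) to show the data flow, not the actual learned LDM U-Net, which uses convolutions and channel concatenation.

```python
import numpy as np

def downsample(x):
    # x: (c, h, w) -> (c, h//2, w//2) via 2x2 average pooling
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def upsample(x):
    # x: (c, h, w) -> (c, 2h, 2w) via nearest-neighbour repetition
    return x.repeat(2, axis=1).repeat(2, axis=2)

def unet_pass(x, depth=2):
    """U-Net-style pass: spatial downsampling, then upsampling with skips."""
    skips = []
    for _ in range(depth):       # downsampling pass
        skips.append(x)          # remember activations for skip connections
        x = downsample(x)
    for _ in range(depth):       # upsampling pass
        x = upsample(x) + skips.pop()  # skip connection (addition here;
                                       # the real U-Net concatenates channels)
    return x
```

The skip connections reinject high-resolution detail lost during downsampling, which is why the output recovers the input's spatial resolution.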