We show that high quality videos can be generated using essentially the standard formulation of the Gaussian diffusion model [46], with little modification other than straightforward architectural changes to accommodate video data within the memory constraints of deep learning accelerators.

The authors show that high-quality videos can be generated using essentially the standard Gaussian diffusion model formulation, with only minor architectural changes.

We additionally show the benefits of joint training on video and image modeling objectives.

The authors also demonstrate the benefits of jointly training on video and image modeling objectives.

1. Video diffusion models


We propose to extend this image diffusion model architecture to video data, given by a block of a fixed number of frames, using a particular type of 3D U-Net [13] that is factorized over space and time. First, we modify the image model architecture by changing each 2D convolution into a space-only 3D convolution, for instance, we change each 3x3 convolution into a 1x3x3 convolution. The attention in each spatial attention block remains as attention over space; i.e., the first axis is treated as a batch axis. Second, after each spatial attention block, we insert a temporal attention block that performs attention over the first axis and treats the spatial axes as batch axes. We use relative position embeddings [45] in each temporal attention block so that the network can distinguish ordering of frames in a way that does not require an absolute notion of video time.

The authors propose a space-time factorized 3D U-Net. 1. Each 2D convolution is replaced with a space-only 3D convolution; for example, each 3x3 convolution becomes a 1x3x3 convolution. The attention in each spatial attention block remains attention over space, with the time axis treated as a batch axis. 2. After each spatial attention block, a temporal attention block is inserted that attends over the time axis and treats the spatial axes as batch axes. Relative position embeddings are used in each temporal attention block, so the network can distinguish the ordering of frames without needing an absolute notion of video time.
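
As a rough illustration of this factorization, below is a minimal PyTorch-style sketch of one block, assuming an input of shape (batch, channels, time, height, width). The module and argument names are illustrative only, not the paper's actual code, and the relative position embeddings are omitted for brevity.

```python
import torch
import torch.nn as nn

class FactorizedSpaceTimeBlock(nn.Module):
    """One factorized block: space-only 3D conv, spatial attention with
    time folded into the batch axis, then temporal attention with space
    folded into the batch axis. Shapes and names are illustrative."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # A 3x3 2D convolution becomes a 1x3x3 space-only 3D convolution,
        # so the conv mixes no information across frames.
        self.conv = nn.Conv3d(channels, channels,
                              kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.spatial_attn = nn.MultiheadAttention(channels, num_heads,
                                                  batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(channels, num_heads,
                                                   batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W)
        B, C, T, H, W = x.shape
        x = self.conv(x)

        # Spatial attention: treat the time axis as a batch axis and
        # attend over the H*W spatial positions of each frame.
        s = x.permute(0, 2, 3, 4, 1).reshape(B * T, H * W, C)
        s = s + self.spatial_attn(s, s, s, need_weights=False)[0]

        # Temporal attention: treat the spatial axes as batch axes and
        # attend over the T frames at each spatial position.
        t = s.reshape(B, T, H * W, C).permute(0, 2, 1, 3).reshape(B * H * W, T, C)
        t = t + self.temporal_attn(t, t, t, need_weights=False)[0]

        # Restore (B, C, T, H, W).
        return t.reshape(B, H * W, T, C).permute(0, 3, 2, 1).reshape(B, C, T, H, W)
```

For example, `FactorizedSpaceTimeBlock(64)` can be applied to a tensor of shape `(2, 64, 16, 32, 32)`, i.e. two 16-frame videos at 32x32 resolution with 64 channels (the channel count must be divisible by `num_heads`).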

The use of factorized space-time attention is known to be a good choice in video transformers for its computational efficiency [2, 5, 21]. An advantage of our factorized space-time architecture, which is unique to our video generation setting, is that it is particularly straightforward to mask the model to run on independent images rather than a video, simply by removing the attention operation inside each time attention block and fixing the attention matrix to exactly match each key and query vector at each video timestep.

Factorized space-time attention is used for its computational efficiency. An advantage of this design is that it is simple to mask the model so that it runs on independent images rather than a video: the attention operation inside each temporal attention block is removed and the attention matrix is fixed to the identity, so each query attends only to the key at its own video timestep.
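
Continuing the same hypothetical sketch, this masking can be expressed by fixing the temporal attention matrix to the identity when the input is a batch of independent images, for instance with a boolean attention mask. The helper below is an assumed interface, not the paper's code.

```python
import torch
import torch.nn as nn

def run_temporal_attention(t_attn: nn.MultiheadAttention,
                           tokens: torch.Tensor,
                           image_only: bool) -> torch.Tensor:
    """tokens: (B*H*W, T, C). When image_only is True, the attention
    matrix is fixed to the identity, so each frame attends only to
    itself and no information flows across the time axis; the frames
    then behave as independent images."""
    if image_only:
        T = tokens.shape[1]
        # In a boolean attn_mask, True entries are blocked. Masking all
        # off-diagonal entries leaves exactly the identity attention pattern.
        mask = ~torch.eye(T, dtype=torch.bool, device=tokens.device)
        out, _ = t_attn(tokens, tokens, tokens,
                        attn_mask=mask, need_weights=False)
    else:
        out, _ = t_attn(tokens, tokens, tokens, need_weights=False)
    return tokens + out
```

In this sketch, joint video and image training would simply pass image batches through with `image_only=True`.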

Reconstruction-guided sampling for improved conditional generation

Videos typically run at a frame rate of at least 24 frames per second. To manage the computational requirements of training our models, we only train on a small subset of, say, 16 frames at a time. However, at test time we can generate longer videos by extending our samples.

Video requires at least 24 frames per second, but training uses only 16 frames at a time. At test time, however, longer videos are generated by extending the samples.

Both approaches require one to sample from a conditional model, $p_\theta(x^b \mid x^a)$. This conditional model could be trained explicitly, but it can also be derived approximately from our unconditional model $p_\theta(x)$ by imputation, which has the advantage of not requiring a separately trained model.

We will refer to both of these approaches as the replacement method for conditional sampling from diffusion models.

Here $x^a$ denotes the frames generated first (the conditioning video) and $x^b$ the frames to be generated next. Both approaches sample from a conditional model. This conditional model could be trained explicitly, but it can also be approximated from the unconditional model by imputation: at each sampling step the latents corresponding to $x^a$ are forcibly replaced with exact samples from the forward process, so they have the correct marginal distribution, and the hope is that $z^b_s$ is influenced to match. This is what is meant by the replacement method.
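
A rough sketch of how the replacement method slots into a generic sampling loop is shown below. The model interface, the `alphas`/`sigmas` schedule, and the simplified latent update are all illustrative assumptions, not the paper's implementation.

```python
import torch

def sample_with_replacement(model, x_a, shape_b, alphas, sigmas, steps):
    """Conditional sampling by imputation (the replacement method).
    x_a:     conditioning frames, shape (B, C, Ta, H, W).
    shape_b: shape of the frames to generate, (B, C, Tb, H, W).
    alphas, sigmas: per-step signal/noise levels (assumed schedule).
    model(z, t) is assumed to return the denoised prediction x_hat."""
    z_b = torch.randn(shape_b)
    for t in reversed(range(steps)):
        # Replace the latents of the conditioning frames with an exact
        # sample from the forward process q(z^a_t | x^a), so that they
        # always have the correct marginal distribution.
        z_a = alphas[t] * x_a + sigmas[t] * torch.randn_like(x_a)
        z = torch.cat([z_a, z_b], dim=2)      # concatenate along time
        x_hat = model(z, t)                   # denoise the full video
        x_hat_b = x_hat[:, :, x_a.shape[2]:]  # keep the unknown frames
        if t > 0:
            # Simplified ancestral update: re-noise the prediction to the
            # previous noise level and continue.
            z_b = alphas[t - 1] * x_hat_b + sigmas[t - 1] * torch.randn_like(z_b)
        else:
            z_b = x_hat_b
    return z_b
```

Note that only $z^a$ is ever forced to match $x^a$; nothing in the update of $z^b$ explicitly uses $x^a$, which is exactly the problem described next.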

When we tried the replacement method for conditional sampling, we found it to not work well for our video models: although samples $x^b$ looked good in isolation, they were often not coherent with $x^a$. This is caused by a fundamental problem with this replacement sampling method: the latents $z^b_s$ are updated in the direction provided by $\hat{x}^b_\theta(z_t) \approx E_q[x^b \mid z_t]$, while what is needed instead is $E_q[x^b \mid z_t, x^a]$. Writing this in terms of the score of the data distribution, we get $E_q[x^b \mid z_t, x^a] = E_q[x^b \mid z_t] + (\sigma_t^2/\alpha_t)\,\nabla_{z_t^b} \log q(x^a \mid z_t)$, where the second term is missing in the replacement method.

With this replacement method, however, the samples $x^b$ often look fine in isolation but are not coherent with $x^a$. This is a fundamental problem of the method: the latents are updated using only $E_q[x^b|z_t]$, whereas what is actually needed is $E_q[x^b|z_t, x^a]$, which also conditions on $x^a$. Written in terms of the score of the data distribution, this gives the equation below.

$$E_q[x^b \mid z_t, x^a] = E_q[x^b \mid z_t] + \frac{\sigma_t^2}{\alpha_t}\,\nabla_{z_t^b} \log q(x^a \mid z_t)$$

Since $q(x^a \mid z_t)$ is not available in closed form, however, we instead propose to approximate it using a Gaussian of the form $q(x^a \mid z_t) \approx \mathcal{N}\!\left[\hat{x}^a_\theta(z_t),\ (\sigma_t^2/\alpha_t^2)\, I\right]$.
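
To make the resulting guidance explicit, here is a short worked step under the Gaussian approximation just stated (a sketch, not a quote from the paper). The log of this Gaussian is $-\frac{\alpha_t^2}{2\sigma_t^2}\|x^a - \hat{x}^a_\theta(z_t)\|_2^2$ up to a constant, so

$$\nabla_{z_t^b} \log q(x^a \mid z_t) \approx -\frac{\alpha_t^2}{2\sigma_t^2}\,\nabla_{z_t^b}\big\|x^a - \hat{x}^a_\theta(z_t)\big\|_2^2,$$

and substituting this into the equation above gives

$$E_q[x^b \mid z_t, x^a] \approx \hat{x}^b_\theta(z_t) - \frac{\alpha_t}{2}\,\nabla_{z_t^b}\big\|x^a - \hat{x}^a_\theta(z_t)\big\|_2^2.$$

In other words, the prediction for $x^b$ is adjusted in the direction that makes the model's reconstruction of the conditioning frames, $\hat{x}^a_\theta(z_t)$, agree with the given $x^a$; in practice a weighting factor (say $w_r$) can scale this correction term to strengthen the reconstruction guidance.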