A major barrier remains to practical adoption of diffusion models: sampling speed
Problem statement: sampling speed must be improved.
In this paper, we reduce the sampling time of diffusion models by orders of magnitude in unconditional and class-conditional image generation, which represent the setting in which diffusion models have been slowest in previous work. We present a procedure to distill the behavior of a N-step DDIM sampler (Song et al., 2021a) for a pretrained diffusion model into a new model with N/2 steps, with little degradation in sample quality.
To solve this problem, the authors reduce sampling time by orders of magnitude for unconditional and class-conditional image generation. The N-step DDIM sampler is distilled into an N/2-step sampler, with only a small drop in sample quality.
To make diffusion models more efficient at sampling time, we propose progressive distillation: an algorithm that iteratively halves the number of required sampling steps by distilling a slow teacher diffusion model into a faster student model.
Distillation is applied repeatedly, halving the number of sampling steps each time.
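The repeated-halving loop can be sketched as follows. This is a minimal sketch, not the paper's code: `distill_fn` is a hypothetical training routine standing in for one round of distillation, and the model is treated as an opaque object.

```python
import copy

def progressive_distillation(teacher, n_steps, n_halvings, distill_fn):
    """Sketch of the outer loop of progressive distillation.

    Each round: copy the teacher into a student, train the student so one
    of its DDIM steps matches two teacher steps (done by the hypothetical
    `distill_fn`), then promote the student to teacher with half the steps.
    """
    for _ in range(n_halvings):
        student = copy.deepcopy(teacher)      # student starts as an exact copy
        distill_fn(teacher, student, n_steps) # assumed training routine
        teacher, n_steps = student, n_steps // 2
    return teacher, n_steps
```

With `n_steps = 1024` and three halvings, the returned sampler uses 128 steps.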
We start the progressive distillation procedure with a teacher diffusion model that is obtained by training in the standard way. At every iteration of progressive distillation, we then initialize the student model with a copy of the teacher, using both the same parameters and same model definition. Like in standard training, we then sample data from the training set and add noise to it, before forming the training loss by applying the student denoising model to this noisy data zt.
The progressive distillation procedure starts the same way as standard training. The student model is initialized as a copy of the teacher, with the same parameters and the same model definition; data is then sampled from the training set and noised as usual.
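A minimal sketch of the forward noising step that produces the student's training input zt, assuming the standard variance-preserving form zt = αt·x + σt·ε with ε ~ N(0, I):

```python
import numpy as np

def add_noise(x, alpha_t, sigma_t, rng):
    """Forward diffusion: z_t = alpha_t * x + sigma_t * eps, eps ~ N(0, I).

    This produces the noisy input z_t on which the student denoising
    model is evaluated during (distillation) training.
    """
    eps = rng.standard_normal(x.shape)
    return alpha_t * x + sigma_t * eps, eps
```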
The main difference in progressive distillation is in how we set the target for the denoising model: instead of the original data x, we have the student model denoise towards a target x˜ that makes a single student DDIM step match 2 teacher DDIM steps. We calculate this target value by running 2 DDIM sampling steps using the teacher, starting from zt and ending at zt−1/N , with N being the number of student sampling steps.
The key difference is the target: instead of the original data x, the student denoises toward x˜, chosen so that a single student DDIM step (of twice the size) matches two teacher DDIM steps.
By inverting a single step of DDIM, we then calculate the value the student model would need to predict in order to move from zt to zt−1/N in a single step, as we show in detail in Appendix G. The resulting target value x˜(zt) is fully determined given the teacher model and starting point zt, which allows the student model to make a sharp prediction when evaluated at zt.
In standard training, many images can map to the same noisy input, but the DDIM target x˜(zt) is fully determined by the teacher and zt, so the student can make a sharp prediction when evaluated at zt.
Looking at the algorithm: each round uses the same teacher model but samples t at the new (halved) step discretization, producing different noisy inputs for training.
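The target computation can be sketched as below, using the deterministic DDIM update z_s = α_s·x̂ + (σ_s/σ_t)·(z_t − α_t·x̂) for the VP process and the cosine schedule from the next section. `teacher_x` is an assumed interface for the teacher's x-prediction; the final inversion step mirrors the derivation the paper gives in Appendix G. This is a sketch of the idea, not the authors' implementation.

```python
import numpy as np

def alpha_sigma(t):
    """Cosine schedule: alpha_t = cos(0.5*pi*t), sigma_t = sqrt(1 - alpha_t^2)."""
    a = np.cos(0.5 * np.pi * t)
    return a, np.sqrt(1.0 - a * a)

def ddim_step(x_pred, z_t, t, s):
    """One deterministic DDIM step from time t to s < t (VP process)."""
    a_t, s_t = alpha_sigma(t)
    a_s, s_s = alpha_sigma(s)
    return a_s * x_pred + (s_s / s_t) * (z_t - a_t * x_pred)

def distill_target(teacher_x, z_t, t, n_student):
    """Target x~ for the student at (z_t, t).

    Run 2 teacher DDIM steps of size 1/(2N) from t down to t - 1/N,
    then invert a single student DDIM step of size 1/N to solve for the
    x-prediction that would land the student at the same z_{t-1/N}.
    `teacher_x(z, t)` is a hypothetical teacher interface.
    """
    t_mid, t_end = t - 0.5 / n_student, t - 1.0 / n_student
    z_mid = ddim_step(teacher_x(z_t, t), z_t, t, t_mid)
    z_end = ddim_step(teacher_x(z_mid, t_mid), z_mid, t_mid, t_end)
    a_t, s_t = alpha_sigma(t)
    a_e, s_e = alpha_sigma(t_end)
    # Solve z_end = a_e * x~ + (s_e / s_t) * (z_t - a_t * x~) for x~:
    return (z_end - (s_e / s_t) * z_t) / (a_e - (s_e / s_t) * a_t)
```

As a sanity check, if the teacher always predicts the true x, the recovered target equals that x, since two exact DDIM steps compose into one exact larger step.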
In this section, we discuss how to parameterize the denoising model xˆθ, and how to specify the reconstruction loss weight w(λt). We assume a standard variance-preserving diffusion process for which σt² = 1 − αt². This is without loss of generality, as shown by (Kingma et al., 2021, appendix G): different specifications of the diffusion process, such as the variance-exploding specification, can be considered equivalent to this specification, up to rescaling of the noisy latents zt. We use a cosine schedule αt = cos(0.5πt), similar to that introduced by Nichol & Dhariwal (2021).
We must choose a parameterization for the denoising model and the loss weight w(λt). Assuming a variance-preserving process loses no generality: other specifications of the diffusion process (e.g. variance-exploding) are equivalent to it up to rescaling of the noisy latents zt. A cosine schedule αt = cos(0.5πt) is used.
In this case, the training loss is also usually defined as mean squared error in the ε-space: