This paper addresses the slow sampling time issue in a similar manner to the methods of Saharia et al. (2021) and Ho et al. (2022a), which refine low-resolution images to high resolution using cascaded applications of multiple diffusion models. However, in contrast to (Saharia et al., 2021; Ho et al., 2022a), our model does not need to train multiple models and can be implemented with a much lighter single architecture, which results in speed enhancement in both training and inference without compromising generation quality.
Problem statement and approach: addresses the slow sampling time problem. The difference is that there is no need to train multiple models and the architecture is much lighter, while generation quality is maintained.
Specifically, in contrast to the existing diffusion models that adopt an encoder-decoder architecture with same-dimensional input and output, here we propose a new conditional training method for the score function using positional information, which gives flexibility in the sampling process of reverse diffusion. In particular, our pyramidal DDPM can generate multiple-resolution images using a single score function by utilizing positional information as a condition for training and inference.
Also, whereas encoder-decoder models have input and output of the same dimension, a new conditional training method for the score function using positional information is proposed. The pyramidal DDPM generates images using a single score function that takes positional information as the condition for training and inference.
As the wave is continuous and periodic, low dimensional information can be expanded to a high dimensional space of different frequencies. In particular, the distance between periodically encoded vectors can be easily calculated by a simple dot product, so that the relative positional information of the data is provided without any additional effort.
Because the wave is continuous and periodic, low-dimensional information can be expanded into a high-dimensional space of different frequencies. The distance between periodically encoded vectors can be computed with a simple dot product, so the relative positional information of the data is provided without any additional effort.
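A minimal sketch of this property, assuming a hypothetical `sinusoidal_encode` helper (not the paper's code): a scalar position in [0, 1] is expanded into sines and cosines at several frequencies, and the dot product of two encodings reduces to a sum of cos(a − b) terms, so it depends only on the relative offset of the two positions.

```python
import numpy as np

def sinusoidal_encode(pos, dim=16, max_freq=10.0):
    """Expand a scalar position in [0, 1] into a `dim`-dimensional
    vector of sines and cosines at geometrically spaced frequencies."""
    freqs = max_freq ** (np.arange(dim // 2) / (dim // 2))
    angles = 2 * np.pi * pos * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

# sin(a)sin(b) + cos(a)cos(b) = cos(a - b), summed over frequencies,
# so the dot product depends only on the offset between positions:
d1 = sinusoidal_encode(0.2) @ sinusoidal_encode(0.3)  # offset 0.1
d2 = sinusoidal_encode(0.6) @ sinusoidal_encode(0.7)  # same offset 0.1
# d1 and d2 agree up to floating-point error
```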
Leveraging this simple but strong characteristic of the architecture, our goal is to train the diffusion model such that it can understand different scales of the input by giving coordinate information as a condition. Specifically, we concatenate an input image and the coordinate values (i, j) of each pixel, where i, j ∈ [0, 1] are the normalized values of its location. Then random resizing to a target resolution (64/128/256 in our case) is applied to the merged input. The resized coordinate values are encoded with a sinusoidal wave, expanded to a high-dimensional space, and act as conditions during training, as shown in Fig. 2.
Leveraging this strong characteristic of the architecture, the diffusion model can understand different scales of the input when coordinate information is given as a condition. The input image is concatenated with the coordinate values of each pixel (i, j normalized to values in [0, 1]), and then random resizing is applied. The resized coordinate values are encoded with a sinusoidal wave, expanded to a high-dimensional space, and fed in as conditions during training, as shown in Fig. 2.
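A sketch of this preprocessing step, with assumed helper names (`add_coord_channels`, `resize_nearest`) and nearest-neighbor resizing standing in for whatever interpolation the paper actually uses:

```python
import numpy as np

def add_coord_channels(img):
    """Concatenate normalized (i, j) coordinate channels to an
    (H, W, C) image; i indexes rows, j indexes columns, both in [0, 1]."""
    h, w, _ = img.shape
    i = np.linspace(0.0, 1.0, h)[:, None].repeat(w, axis=1)
    j = np.linspace(0.0, 1.0, w)[None, :].repeat(h, axis=0)
    return np.concatenate([img, i[..., None], j[..., None]], axis=-1)

def resize_nearest(x, size):
    """Nearest-neighbor resize of (H, W, C) to (size, size, C) --
    a stand-in for the random resizing to 64/128/256."""
    h, w, _ = x.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return x[rows][:, cols]

img = np.random.rand(256, 256, 3)
merged = add_coord_channels(img)           # (256, 256, 5)
target = int(np.random.choice([64, 128, 256]))
small = resize_nearest(merged, target)     # coords resized with the image
```

The key point is that the coordinate channels are resized together with the image, so after resizing each pixel still carries its original normalized location, ready to be sinusoidally encoded as the condition.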
The formulation is similar to the positional encoding used in the Transformer.
By denoting $\mathbf{i} = [i_1\ i_2\ \cdots\ i_N]^\top$, $\mathbf{j} = [j_1\ j_2\ \cdots\ j_N]^\top$, with $N$ referring to the dimension of $x_t$, the training cost function in (5) can be converted as
Benefiting from the UNet-like model structure (Ronneberger et al., 2015), the cost function Eq. (8) is invariant across all different resolutions, so that the optimization can be performed with only a single network. This simple idea of scale-free training of the score network significantly improves the flexibility of the sampling process, which will be discussed later. Importantly, it can also alleviate the problems of slow training and small batch sizes, especially when training with limited resources; the latter is significant for higher-performance generative tasks.
The loss is computed as in Eq. (8). Thanks to the U-Net structure, the cost function Eq. (8) is invariant across different resolutions, so optimization is possible with a single network. This simple idea of scale-free training of the score network increases the flexibility of the sampling process. Importantly, it also reduces the slow-training and small-batch-size problems caused by limited resources.
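A toy sketch of why one objective covers every scale, assuming a placeholder `score_net` (a fully convolutional UNet in the paper accepts any spatial resolution; here it is a dummy that returns zeros) and a simplified one-step DDPM noise-prediction loss standing in for Eq. (8):

```python
import numpy as np

rng = np.random.default_rng(0)

def score_net(x_noisy, coords_enc, t):
    """Placeholder for the shared score network. The real model is a
    fully convolutional UNet, so it accepts any spatial resolution;
    this dummy just returns zeros of matching shape."""
    return np.zeros_like(x_noisy)

def ddpm_loss(x0, t, alpha_bar=0.5):
    """Simplified noise-prediction loss (an Eq. (8) analogue): the
    expression is identical at every resolution, so a single network
    can be optimized across all scales."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1 - alpha_bar) * eps
    eps_hat = score_net(x_t, coords_enc=None, t=t)
    return np.mean((eps - eps_hat) ** 2)

# The same objective evaluates unchanged at each sampled resolution:
losses = [ddpm_loss(rng.standard_normal((s, s, 3)), t=10)
          for s in (64, 128, 256)]
```

Because the loss formula never references the spatial size, the random-resolution training above amounts to optimizing one network on a mixture of scales, which is what enables the flexible sampling discussed later.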