In this paper, we propose a novel single image super-resolution diffusion probabilistic model (SRDiff) to tackle the over-smoothing, mode collapse and huge footprint problems in previous SISR models.

Problem statement and proposed improvements: the paper proposes achieving image super-resolution (SR) with a diffusion model that addresses the problems of earlier GAN-based and flow-based models.

Specifically, 1) to extract the image information in LR image, SRDiff exploits a pretrained low-resolution encoder to convert LR image into hidden condition. 2) To generate the HR image conditioned on LR image, SRDiff employs a conditional noise predictor to recover x0 iteratively. 3) To speed up convergence and stabilize training, SRDiff introduces residual prediction by taking the difference between the HR and LR image as the input x0 in the first diffusion step, making SRDiff focus on restoring high-frequency details.

In detail:

  1. To extract the LR image's information, a pretrained low-resolution encoder converts the LR image into a hidden condition.
  2. To generate the HR image, the LR image is used as the condition; SRDiff uses a conditional noise predictor to recover x0 iteratively.
  3. To speed up convergence and stabilize training, SRDiff introduces residual prediction: the difference between the HR and LR images is used as the input x0 in the first diffusion step, letting SRDiff focus on restoring high-frequency details.
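The residual setup in step 3 can be sketched end-to-end in pure Python (lists stand in for tensors; `upsample_nearest`, `residual_input`, and `reconstruct_sr` are illustrative names, not the paper's code):

```python
def upsample_nearest(lr, scale):
    """Nearest-neighbor upsampling of a 2D image given as a list of rows."""
    hr_rows = []
    for r in range(len(lr) * scale):
        src = lr[r // scale]
        hr_rows.append([src[c // scale] for c in range(len(src) * scale)])
    return hr_rows

def residual_input(hr, lr, scale):
    """x0 for the first diffusion step: the residual xH - up(xL)."""
    up = upsample_nearest(lr, scale)
    return [[h - u for h, u in zip(hr_row, up_row)]
            for hr_row, up_row in zip(hr, up)]

def reconstruct_sr(residual, lr, scale):
    """SR output: generated residual xr added back onto up(xL)."""
    up = upsample_nearest(lr, scale)
    return [[x + u for x, u in zip(r_row, up_row)]
            for r_row, up_row in zip(residual, up)]
```

Because up(xL) already carries the low-frequency content, the residual x0 is mostly high-frequency detail, which is what the diffusion model is left to generate.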

1. SRDiff

Instead of predicting the HR image directly, we apply residual prediction to predict the difference between the HR image xH and the upsampled LR image up(xL) and denote the difference as input residual image x0.

Instead of predicting the HR image directly, residual prediction is used: the difference between the HR image xH and the upsampled LR image up(xL) is taken as the input residual image x0. This residual mostly contains the high-frequency detail (edges, textures) that SR needs to restore.

According to Eq. (3) and (7), the reverse process is determined by εθ, which is a conditional noise predictor with an RRDB-based [Wang et al., 2018] low-resolution encoder (LR encoder for short) D, as shown in Figure 3. The reverse process converts a latent variable xT to a residual image xr by iteratively denoising in finite step T using the conditional noise predictor εθ, conditioned on the hidden states encoded from LR image by the LR encoder D. The SR image is reconstructed by adding the generated residual image xr to the upsampled LR image up(xL).

Per Eqs. (3) and (7), the reverse process is determined by the conditional noise predictor εθ together with the RRDB-based (Residual-in-Residual Dense Block) LR encoder. The reverse process converts the latent variable xT into a residual image xr, using the noise predictor εθ conditioned on the hidden states that the LR encoder D extracts from the LR image. The SR image is then up(xL) plus the generated residual xr.
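A minimal sketch of this reverse process in pure Python, with a scalar standing in for the image (the DDPM-style update rule and the signature of the noise predictor are assumptions for illustration, not the paper's code):

```python
import math
import random

def reverse_process(x_T, lr_hidden, predict_noise, betas, rng):
    """Iteratively denoise x_T into the residual x_r over T steps,
    conditioned on the LR encoder's hidden states (DDPM-style update)."""
    alphas = [1.0 - b for b in betas]
    alpha_bars, prod = [], 1.0
    for a in alphas:
        prod *= a
        alpha_bars.append(prod)
    x = x_T
    for t in range(len(betas) - 1, -1, -1):
        eps = predict_noise(x, t, lr_hidden)          # εθ(x_t, t, D(xL))
        mean = (x - betas[t] / math.sqrt(1.0 - alpha_bars[t]) * eps) \
               / math.sqrt(alphas[t])
        z = rng.gauss(0.0, 1.0) if t > 0 else 0.0     # no noise at t = 0
        x = mean + math.sqrt(betas[t]) * z
    return x   # residual x_r; the SR image is up(xL) + x_r
```

With a trained εθ this loop yields the residual image; adding it to up(xL) gives the SR output.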

Conditional Noise Predictor

The conditional noise predictor εθ predicts the noise added in each timestep of the diffusion process, conditioned on the LR image information.

In other words, the predictor εθ predicts the noise, conditioned on the LR image and the timestep.
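Concretely, during training the forward process adds known noise at each timestep, and εθ is trained to recover exactly that noise (a standard DDPM-style sketch with a scalar standing in for the image; not the paper's code):

```python
import math
import random

def diffuse(x0, t, alpha_bars, rng):
    """Forward step: sample x_t from x0 and return the exact noise added.
    εθ(x_t, t, condition) is trained to predict this eps."""
    eps = rng.gauss(0.0, 1.0)
    x_t = math.sqrt(alpha_bars[t]) * x0 \
          + math.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, eps
```

Since the true eps is known, the training loss is simply a distance between eps and the predictor's output.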

First, xt is transformed to hidden through a 2D-convolution block which consists of one 2D-convolutional layer and Mish activation [Misra, 2019]. Then the LR information is fused with the 2D-convolution block output hidden.

This is easiest to follow alongside the architecture in Figure 3.

xt is transformed into hidden by a 2D-convolution block consisting of one 2D-convolutional layer and a Mish activation. The LR information is then fused with the 2D-convolution block's output hidden.
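The Mish activation used in that block is simple to write down (a minimal pure-Python sketch; the overflow guard in `softplus` is my addition):

```python
import math

def softplus(x):
    # guard against overflow in exp for large x, where softplus(x) ≈ x
    return x if x > 20.0 else math.log1p(math.exp(x))

def mish(x):
    """Mish activation (Misra, 2019): x * tanh(softplus(x)).
    Smooth, non-monotonic, unbounded above, bounded below."""
    return x * math.tanh(softplus(x))
```

Its smoothness (unlike ReLU's kink at zero) is one reason it is popular in diffusion-model backbones.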

Then the last output hidden and te are fed into the contracting path, one middle step and the expansive path successively. The contracting path and expansive path both consist of four steps, each of which successively applies two residual blocks and one downsampling/upsampling layer.

The last output hidden and $t_e$ (the time embedding) then pass through the contracting path → one middle step → the expansive path. Both the contracting and expansive paths consist of four steps, each applying two residual blocks and one downsampling/upsampling layer.

Our conditional noise predictor is easy and stable to train due to the multi-scale skip connection. Moreover, it combines local and global information through the contracting and expansive path.

The conditional noise predictor is stable and easy to train thanks to the multi-scale skip connections. In addition, the contracting and expansive paths let it combine local and global information.
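The U-Net-style control flow described above can be sketched schematically (scalars stand in for feature maps, and `res_block`/`downsample`/`upsample` are caller-supplied stand-ins; this mirrors the structure, not the actual layers):

```python
def unet_forward(hidden, res_block, downsample, upsample, steps=4):
    """Schematic U-Net pass: each contracting step saves its features
    for the matching expansive step (multi-scale skip connections)."""
    skips = []
    x = hidden
    for _ in range(steps):               # contracting path
        x = res_block(res_block(x))      # two residual blocks per step
        skips.append(x)                  # saved for the skip connection
        x = downsample(x)
    x = res_block(x)                     # one middle step
    for _ in range(steps):               # expansive path
        x = upsample(x)
        x = x + skips.pop()              # multi-scale skip connection
        x = res_block(res_block(x))
    return x
```

The skips give each expansive step a direct path to same-scale features, which is what keeps gradients well-behaved and training stable.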

LR Encoder