However, these approaches often suffer from various limitations; e.g., autoregressive models are prohibitively expensive for high-resolution image generation, NFs and VAEs often yield sub-optimal sample quality, and GANs require carefully designed regularization and optimization tricks to tame optimization instability [2, 15] and mode collapse [29, 38].

Problem: existing SR approaches have many issues: they are too expensive, give sub-optimal sample quality, or need careful regularization and optimization to stay stable.

SR3 works by learning to transform a standard normal distribution into an empirical data distribution through a sequence of refinement steps, resembling Langevin dynamics.

Solution: SR3, a diffusion-model-based approach, learns to transform a standard normal distribution into the empirical data distribution through a sequence of refinement steps.

The key is a U-Net architecture [42] that is trained with a denoising objective to iteratively remove various levels of noise from the output. We adapt DDPMs to conditional image generation by proposing a simple and effective modification to the U-Net architecture.

The key is a U-Net trained to iteratively denoise various levels of noise from the output. The authors propose a slightly modified U-Net to make the DDPM conditional.
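A minimal sketch of one way to condition a U-Net on the source image: upsample the low-resolution x to the target size and concatenate it with the noisy target along the channel axis. The quoted passage does not spell out the modification, so treat the concatenation scheme as an assumption based on my reading of SR3. The helper names (`upsample_nearest`, `make_unet_input`) are hypothetical, and nearest-neighbour upsampling is only a stand-in for a proper interpolation.

```python
import numpy as np

def upsample_nearest(x, factor):
    """Upsample a (C, H, W) image by an integer factor (nearest neighbour stand-in)."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def make_unet_input(x_low, y_noisy):
    """Stack the upsampled source image and the noisy target along the channel axis."""
    factor = y_noisy.shape[1] // x_low.shape[1]
    x_up = upsample_nearest(x_low, factor)
    return np.concatenate([x_up, y_noisy], axis=0)

rng = np.random.default_rng(0)
x_low = rng.random((3, 16, 16))             # hypothetical 16x16 RGB source image
y_noisy = rng.standard_normal((3, 64, 64))  # hypothetical 64x64 noisy target
unet_input = make_unet_input(x_low, y_noisy)
print(unet_input.shape)  # (6, 64, 64)
```

With this scheme the network only needs a wider first layer (6 input channels instead of 3); the rest of the U-Net is untouched.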

These quality scores often penalize synthetic high-frequency details, such as hair texture, because synthetic details do not perfectly align with the reference details.

Problem 2: quality scores such as PSNR and SSIM do not reflect things like human preference.
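A toy illustration of why PSNR can penalize sharp but misaligned detail. The checkerboard "texture" and both candidates are made up for this example and are not from the paper: one candidate has the right texture shifted by one pixel, the other is a flat gray image with no detail at all.

```python
import numpy as np

def psnr(reference, candidate, max_value=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_value]."""
    mse = np.mean((reference - candidate) ** 2)
    return 10.0 * np.log10(max_value ** 2 / mse)

# Toy "high-frequency texture": an 8x8 checkerboard of 0s and 1s.
rows, cols = np.indices((8, 8))
reference = ((rows + cols) % 2).astype(float)

# Candidate A: same texture, shifted by one pixel (sharp but misaligned).
sharp_but_shifted = np.roll(reference, shift=1, axis=1)
# Candidate B: flat gray image (no texture at all).
blurry = np.full_like(reference, 0.5)

psnr_shifted = psnr(reference, sharp_but_shifted)
psnr_blurry = psnr(reference, blurry)
print(psnr_shifted, psnr_blurry)
```

The flat image gets the higher PSNR (MSE 0.25 versus 1.0), even though the shifted one has the correct texture.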

We resort to human evaluation to compare the quality of super-resolution methods. We adopt a 2-alternative forced-choice (2AFC) paradigm in which human subjects are shown a low-resolution input and are required to select between a model output and a ground truth image (cf. [63]).

A two-alternative setup is used: subjects are shown the low-resolution input and choose between the ground truth image and the model output.
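A small sketch of how 2AFC responses could be summarized: the fraction of trials in which the subject picked the model output over the ground truth. The name `fool_rate` and the trial data are illustrative assumptions, not values from the paper.

```python
def fool_rate(picked_model_output):
    """Fraction of 2AFC trials in which the subject chose the model output."""
    if not picked_model_output:
        raise ValueError("need at least one trial")
    return sum(picked_model_output) / len(picked_model_output)

# Hypothetical responses: True means the subject picked the model output.
trials = [True, False, False, True, False, True, False, False]
rate = fool_rate(trials)
print(f"fool rate: {rate:.1%}")
```

A rate near 50% would mean subjects cannot tell the model output from the ground truth.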

1. Conditional Denoising Diffusion Model

The conditional DDPM model generates a target image y_0 in T refinement steps. Starting with a pure noise image y_T ∼ N(0, I), the model iteratively refines the image through successive iterations (y_{T−1}, y_{T−2}, ..., y_0) according to learned conditional transition distributions p_θ(y_{t−1} | y_t, x) such that y_0 ∼ p(y | x) (see Figure 2).

The conditional DDPM generates the target image y_0 in T refinement steps. Naturally, the process runs from y_T to y_0, with the input image x as the condition.
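A rough sketch of the y_T → y_0 refinement loop. The update rule is the usual DDPM-style ancestral step with a noise-predicting network, which I assume matches what SR3 does. The noise schedule, `dummy_denoiser`, and image sizes are placeholders and not taken from the paper.

```python
import numpy as np

def sample(denoise_fn, x, shape, alphas, rng):
    """Run T refinement steps from pure noise, conditioned on the source image x."""
    gammas = np.cumprod(alphas)          # gamma_t = product of alpha_1..alpha_t
    y = rng.standard_normal(shape)       # y_T ~ N(0, I)
    for t in range(len(alphas) - 1, -1, -1):
        eps_hat = denoise_fn(x, y, gammas[t])   # predicted noise
        y = (y - (1 - alphas[t]) / np.sqrt(1 - gammas[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:                        # no noise is added on the final step
            y = y + np.sqrt(1 - alphas[t]) * rng.standard_normal(shape)
    return y

def dummy_denoiser(x, y, gamma):
    """Placeholder for the trained U-Net f_theta(x, y, gamma); predicts zero noise."""
    return np.zeros_like(y)

rng = np.random.default_rng(0)
alphas = 1.0 - np.linspace(1e-4, 0.02, 50)   # hypothetical 50-step schedule
x = rng.random((3, 8, 8))                    # hypothetical conditioning image
y0 = sample(dummy_denoiser, x, (3, 8, 8), alphas, rng)
print(y0.shape)  # (3, 8, 8)
```

With a real trained network in place of `dummy_denoiser`, each step would remove a bit of the predicted noise while x steers the result toward a plausible high-resolution version of the input.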

Optimizing the Denoising Model

To help reverse the diffusion process, we take advantage of additional side information in the form of a source image x and optimize a neural denoising model fθ that takes as input this source image x and a noisy target image ỹ,

ỹ = √γ y_0 + √(1 − γ) ε,   ε ∼ N(0, I)

For a better reverse diffusion process, side information in the form of a source image x is used, and the neural denoising model f_θ is optimized with the source image x and the noisy target image ỹ as inputs. The goal is to recover the noiseless target image y_0.

In addition to a source image x and a noisy target image ỹ, the denoising model fθ(x, ỹ, γ) takes as input the sufficient statistics for the variance of the noise γ, and is trained to predict the noise vector ε.

In addition to the source image x and the noisy target image ỹ, the model takes as input the sufficient statistics for the noise variance γ, and is trained to predict the noise vector ε.
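A minimal sketch of the noise-prediction objective described above. It assumes the DDPM-style noising ỹ = √γ y_0 + √(1 − γ) ε and a plain squared-error loss; the exact norm and the way γ is sampled are not stated in these notes. The two denoisers are stand-ins for the U-Net, used only to sanity-check the loss.

```python
import numpy as np

def noisy_target(y0, gamma, eps):
    """Mix the clean target with Gaussian noise at noise level gamma."""
    return np.sqrt(gamma) * y0 + np.sqrt(1.0 - gamma) * eps

def denoising_loss(denoise_fn, x, y0, gamma, rng):
    """Squared error between the predicted and the true noise vector."""
    eps = rng.standard_normal(y0.shape)
    y_tilde = noisy_target(y0, gamma, eps)
    eps_hat = denoise_fn(x, y_tilde, gamma)
    return np.mean((eps_hat - eps) ** 2)

rng = np.random.default_rng(0)
x = rng.random((3, 8, 8))        # hypothetical low-resolution source image
y0 = rng.random((3, 32, 32))     # hypothetical clean target image
gamma = 0.6                      # hypothetical noise level

def zero_denoiser(x, y_tilde, gamma):
    """Stand-in that always predicts zero noise."""
    return np.zeros_like(y_tilde)

def oracle_denoiser(x, y_tilde, gamma):
    """Stand-in that cheats by knowing y0, so it recovers the noise exactly."""
    return (y_tilde - np.sqrt(gamma) * y0) / np.sqrt(1.0 - gamma)

loss_zero = denoising_loss(zero_denoiser, x, y0, gamma, rng)
loss_oracle = denoising_loss(oracle_denoiser, x, y0, gamma, rng)
print(loss_zero, loss_oracle)
```

The oracle drives the loss to essentially zero, while predicting no noise leaves it near 1 (the variance of ε), which is the gap training is meant to close.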