Blended Latent Diffusion

While diffusion models have shown impressive results on generation, editing, and other tasks (Section 2), they suffer from long inference times, due to the iterative diffusion process that is applied at the pixel level to generate each result.

We first show how to adapt the Blended Diffusion approach of Avrahami et al. [2022b] to work in the latent space of LDM, instead of working at the pixel level.

Problem and solution: diffusion models achieve impressive results, but their long inference times are a drawback. The approach is therefore applied in the latent space of LDM instead of at the pixel level.

Next, we address the imperfect reconstruction inherent to LDM, due to the use of VAE-based lossy latent encodings.

To overcome this issue, we propose a solution that starts with a dilated mask, and gradually shrinks it as the diffusion process progresses.

Problem 2: the VAE-based lossy latent encoding loses information. One could apply a thin mask in the latent space for local edits, but this is also difficult because of the latent space's low spatial resolution. To overcome this, they start with a dilated mask and gradually shrink it as the diffusion process progresses.
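As a rough illustration of this idea, the sketch below dilates a binary latent mask and shrinks the dilation radius back to zero as the diffusion timestep t decreases from T to 0. The max-pooling-based `dilate` and the linear radius schedule are my own assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def dilate(mask: torch.Tensor, radius: int) -> torch.Tensor:
    """Binary dilation via max-pooling with a (2r+1) x (2r+1) window."""
    if radius == 0:
        return mask
    k = 2 * radius + 1
    return F.max_pool2d(mask, kernel_size=k, stride=1, padding=radius)

def shrunk_mask(mask: torch.Tensor, t: int, T: int, max_radius: int = 4) -> torch.Tensor:
    """Return the mask used at diffusion timestep t (t runs from T down to 0):
    heavily dilated early on, back to the original mask by the end."""
    radius = round(max_radius * t / T)  # hypothetical linear schedule
    return dilate(mask, radius)
```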

Finally, we evaluate our method against the baselines both qualitatively and quantitatively, using new metrics for text-driven editing methods that we propose: precision and diversity.

They evaluate the method both qualitatively and quantitatively.

1. LATENT DIFFUSION AND BLENDED DIFFUSION

Blended Diffusion [Avrahami et al. 2022b] addresses zero-shot text-guided local image editing. This approach utilizes a diffusion model trained on ImageNet [Deng et al. 2009], which serves as a prior for the manifold of the natural images, and a CLIP model [Radford et al. 2021], which navigates the diffusion model towards the desired text-specified outcome.

Blended Diffusion performs local image editing by combining a diffusion model with a CLIP model.
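For intuition, here is a heavily hedged sketch of one pixel-level Blended Diffusion step: the denoiser is steered by the gradient of a CLIP loss, and the background is re-noised from the source image and blended back in. The names `eps_model` and `clip_loss` (a differentiable CLIP image-text distance over the masked region) and the DDIM-style update are assumptions for illustration, not the authors' exact code.

```python
import torch

def blended_diffusion_step(x_t, t, x_source, mask, eps_model, clip_loss,
                           alphas_cumprod, guidance_scale=1000.0):
    # 1. Predict the clean image and compute a CLIP-guidance gradient
    #    that pulls the masked region toward the text prompt.
    with torch.enable_grad():
        x_in = x_t.detach().requires_grad_(True)
        eps = eps_model(x_in, t)  # assumed noise-prediction network
        a_t = alphas_cumprod[t]
        x0_pred = (x_in - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        grad = torch.autograd.grad(clip_loss(x0_pred, mask), x_in)[0]
    eps = eps.detach() + (1 - a_t).sqrt() * guidance_scale * grad

    # 2. One DDIM-style denoising step (eta = 0) with the guided estimate.
    a_prev = alphas_cumprod[t - 1] if t > 0 else torch.ones_like(a_t)
    x0 = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()
    x_fg = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps

    # 3. Blend: noise the source image to the same level and keep it
    #    everywhere outside the mask.
    x_bg = a_prev.sqrt() * x_source + (1 - a_prev).sqrt() * torch.randn_like(x_source)
    return mask * x_fg + (1 - mask) * x_bg
```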

2. METHOD

Given an image 𝑥, a guiding text prompt 𝑑, and a binary mask 𝑚 that marks the region of interest in the image, our goal is to produce a modified image 𝑥ˆ, s.t. the content 𝑥ˆ ⊙ 𝑚 is consistent with the text description 𝑑, while the complementary area remains close to the source image, i.e., 𝑥 ⊙ (1−𝑚) ≈ 𝑥ˆ ⊙ (1−𝑚), where ⊙ is element-wise multiplication. Furthermore, the transition between the two areas of 𝑥ˆ should ideally appear seamless.

Similar to Blended Diffusion: apply the mask 𝑚 to the image 𝑥 and modify the masked region to match the text prompt 𝑑. The background of the edited image must stay the same as the original background, and of course the transition should look seamless.
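The constraint can be summarized in two lines of hypothetical helper code: compose the output from the generated foreground and the original background, and check that the background is preserved.

```python
import torch

def compose(x_fg: torch.Tensor, x: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
    """x_hat = x_fg ⊙ m + x ⊙ (1 − m): generated content inside the mask,
    original pixels everywhere else."""
    return x_fg * m + x * (1 - m)

def background_preserved(x_hat: torch.Tensor, x: torch.Tensor, m: torch.Tensor) -> bool:
    """Check the constraint x ⊙ (1 − m) ≈ x_hat ⊙ (1 − m)."""
    return torch.allclose(x_hat * (1 - m), x * (1 - m), atol=1e-6)
```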

Blended Latent Diffusion


However, the text-to-image LDM lacks the capability to edit an existing image in a local fashion; hence, we propose to incorporate Blended Diffusion [Avrahami et al. 2022b] into the text-to-image LDM.

Honestly, Algorithm 1 alone is enough to understand the whole method.

The problem with latent diffusion is that it lacks the ability to edit an existing image locally. The authors therefore propose integrating Blended Diffusion into LDM.
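A condensed sketch of what Algorithm 1 boils down to, assuming a diffusers-style stack: a VAE whose `encode`/`decode` return plain latent/pixel tensors, a text-conditioned `unet`, and a scheduler with `set_timesteps`/`step`/`add_noise`. All interface details here are assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def downsample_mask(mask, size):
    """Nearest-neighbor resize of the binary pixel mask to the latent grid."""
    return F.interpolate(mask, size=size, mode="nearest")

def blended_latent_diffusion(x, mask, text_emb, vae, unet, scheduler, steps=50):
    z_init = vae.encode(x)                        # source image -> latent space
    m = downsample_mask(mask, z_init.shape[-2:])  # mask at latent resolution
    scheduler.set_timesteps(steps)
    z = torch.randn_like(z_init)                  # foreground starts from noise
    timesteps = scheduler.timesteps
    for i, t in enumerate(timesteps):
        # Denoise the full latent toward the text prompt.
        eps = unet(z, t, encoder_hidden_states=text_emb).sample
        z_fg = scheduler.step(eps, t, z).prev_sample
        # Noise the source latent to the same (previous) level as z_fg.
        t_prev = timesteps[i + 1] if i + 1 < len(timesteps) else torch.tensor(0)
        z_bg = scheduler.add_noise(z_init, torch.randn_like(z_init), t_prev)
        # Keep the generated content inside the mask, the source outside.
        z = z_fg * m + z_bg * (1 - m)
    return vae.decode(z)                          # latent -> pixel space
```

The key design point is that the blending happens at every denoising step, at matching noise levels, which is what makes the transition between edited and preserved regions seamless.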