pix2pix-zero: Zero-shot Image-to-Image Translation

However, repurposing such models for editing real images remains challenging.

First, images do not naturally come with text descriptions. Second, even with initial and target text prompts (e.g., changing the word from cat to dog), existing text-to-image models tend to synthesize completely new content that fails to follow the layout, shape, and object pose of the input image. After all, editing the text prompt only tells us what we want to change, but does not convey what we intend to preserve. Finally, users may want to perform all kinds of edits on a diverse set of real images.

Problem statement: 1. Images do not naturally come with text descriptions. 2. Models struggle to faithfully preserve the pose and shape of the input image. 3. Users may want to perform all kinds of edits on diverse real images.

To overcome the above issues, we introduce pix2pix-zero, a diffusion-based image-to-image translation approach that is training-free and prompt-free. A user only needs to specify the edit direction in the form of source domain → target domain (e.g., cat → dog) on-the-fly, without manually creating text prompts for the input image.

In this work, we make two key contributions: (1) An efficient, automatic editing direction discovery mechanism without input text prompting.

Given the source word and the target word we want to swap in, the edit direction between the two is computed in CLIP embedding space.
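The direction-discovery idea above can be sketched as follows. This is a simplified illustration, not the paper's implementation: `embed_text` stands in for a real CLIP text encoder (here replaced by a toy deterministic embedding), and the sentence templates are assumed examples. The direction is the mean difference of embeddings of templated sentences for the target vs. the source word.

```python
import zlib
import numpy as np

def edit_direction(src, tgt, embed_text, templates=None):
    """Unit-norm edit direction: mean(target embeddings) - mean(source embeddings).

    embed_text: callable mapping a string to a 1-D embedding
    (a CLIP text encoder in the actual method).
    """
    templates = templates or ["a photo of a {}", "a picture of a {}",
                              "an image of a {}"]
    src_emb = np.mean([embed_text(t.format(src)) for t in templates], axis=0)
    tgt_emb = np.mean([embed_text(t.format(tgt)) for t in templates], axis=0)
    d = tgt_emb - src_emb
    return d / np.linalg.norm(d)

def toy_embed(text, dim=8):
    # Toy stand-in for a text encoder: a deterministic random projection
    # seeded by a stable hash of the string.
    seed = zlib.crc32(text.encode("utf-8"))
    return np.random.default_rng(seed).standard_normal(dim)

d = edit_direction("cat", "dog", toy_embed)
```

Averaging over several templates makes the direction less sensitive to any single phrasing, which is why a bank of sentences is used rather than the two bare words.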

(2) Content preservation via cross-attention guidance
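The core of cross-attention guidance is to keep the cross-attention maps of the editing pass close to reference maps recorded during reconstruction of the input. A minimal sketch of such a guidance loss (the function name and map shapes are illustrative, not from the paper's code):

```python
import numpy as np

def cross_attention_guidance_loss(attn_edit, attn_ref):
    """L2 distance between cross-attention maps from the editing pass and
    reference maps from the reconstruction pass.

    In the actual method, the gradient of a loss like this w.r.t. the
    latent steers sampling so the edit keeps the input's structure.
    """
    return float(np.mean((attn_edit - attn_ref) ** 2))
```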

In Figure 1, we show various editing results using our method while preserving the structure of input images.

We further improve our results and inference speed with a suite of techniques: (1) Autocorrelation regularization: When applying inversion via DDIM [55], we observe that DDIM inversion is prone to making the intermediate predicted noise less Gaussian. Hence, we introduce an autocorrelation regularization to keep the noise close to Gaussian during inversion. (2) Conditional GAN distillation: Diffusion models are slow due to the multi-step inference of a costly diffusion process. To enable interactive editing, we distill the diffusion model to a fast conditional GAN model, given paired data of the original and edited images from the diffusion model, enabling real-time inference.

Inversion is done with DDIM. The diffusion model is distilled into a conditional GAN, applying the diffusion model's knowledge to the conditional GAN.
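The autocorrelation idea can be sketched in a few lines. This is a simplified illustration of the principle only (the paper's regularizer also uses an autocorrelation pyramid over scales plus a divergence term): white Gaussian noise has near-zero correlation with shifted copies of itself, so we penalize the squared mean of the product between the noise map and its spatial shifts.

```python
import numpy as np

def autocorr_loss(noise, max_shift=4):
    """Penalize spatial autocorrelation of a 2-D noise map.

    For white Gaussian noise, mean(noise * roll(noise, s)) is ~0 for any
    shift s, so the loss is near zero; spatially correlated noise (a sign
    that inversion drifted away from Gaussian) scores higher.
    """
    loss = 0.0
    for s in range(1, max_shift + 1):
        for axis in (-1, -2):  # horizontal and vertical shifts
            shifted = np.roll(noise, s, axis=axis)
            loss += np.mean(noise * shifted) ** 2
    return loss
```

During inversion, a term like this would be added to the objective at each step so the intermediate noise stays close to Gaussian.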

1. Method

We propose to edit an input image along an edit direction (e.g., cat → dog).

Inverting Real Images

Deterministic inversion.

Inversion entails finding a noise map x_inv that reconstructs the input latent code x_0 upon sampling.

We adopt the deterministic DDIM [55] reverse process, as shown below:

x_{t+1} = \sqrt{\bar{\alpha}_{t+1}} \, f_\theta(x_t, t, c) + \sqrt{1 - \bar{\alpha}_{t+1}} \, \epsilon_\theta(x_t, t, c)

where \epsilon_\theta(x_t, t, c) is the noise predicted by the denoiser given text features c, \bar{\alpha}_{t+1} is the noise scaling factor as defined in DDIM [55], and f_\theta(x_t, t, c) predicts the final denoised latent code x_0:

f_\theta(x_t, t, c) = \frac{x_t - \sqrt{1 - \bar{\alpha}_t} \, \epsilon_\theta(x_t, t, c)}{\sqrt{\bar{\alpha}_t}}
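One deterministic DDIM inversion step, as described above, can be sketched directly from the definitions (a numpy illustration; in practice `eps` comes from the trained denoiser ε_θ(x_t, t, c)):

```python
import numpy as np

def ddim_inversion_step(x_t, eps, alpha_bar_t, alpha_bar_next):
    """One deterministic DDIM step taking x_t toward higher noise.

    eps: predicted noise epsilon_theta(x_t, t, c) from the denoiser.
    f0:  predicted fully-denoised latent x_0 (f_theta in the text).
    """
    f0 = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)
    return np.sqrt(alpha_bar_next) * f0 + np.sqrt(1.0 - alpha_bar_next) * eps
```

A sanity check on the algebra: if alpha_bar_next equals alpha_bar_t, the step reconstructs x_t exactly, since the eps terms cancel.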
