
Figure 1(a-c) briefly illustrates the existing approaches. Image guidance mixes the latent variables of the input image with unconditional latent variables.

In this paper, we propose an asymmetric reverse process (Asyrp) that discovers the semantic latent space of a frozen diffusion model, where modifications in the space synthesize various attributes on input images. Our semantic latent space, named h-space, has the following practical properties for editing applications. The same shift in this space results in the same attribute change in all images. Linear changes in this space lead to linear changes in the attributes. The changes do not degrade the quality of the resulting images. The changes across timesteps are almost identical to one another for a desired attribute change. Figure 1(d) illustrates some of these properties and § 5.3 provides detailed analyses. To the best of our knowledge, this is the first attempt to discover a semantic latent space in frozen pretrained diffusion models. Spoiler alert: our semantic latent space is different from the intermediate latent variables in the diffusion process. Moreover, we introduce a principled design of the generative process for versatile editing and quality boosting based on quantifiable measures: the editing strength of an interval and the quality deficiency at a timestep.

1. BACKGROUND: IMAGE MANIPULATION WITH CLIP

CLIP learns multi-modal embeddings with an image encoder $E_I$ and a text encoder $E_T$ whose similarity indicates semantic similarity between images and texts (Radford et al., 2021). Compared to directly minimizing the cosine distance between the edited image and the target description (Patashnik et al., 2021), the directional loss with cosine distance achieves homogeneous editing without mode collapse (Gal et al., 2021):

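$$\mathcal{L}_{\text{direction}}\left(x^{\text{edit}}, y^{\text{target}}; x^{\text{source}}, y^{\text{source}}\right) := 1 - \frac{\Delta I \cdot \Delta T}{\lVert \Delta I \rVert \, \lVert \Delta T \rVert}$$

where $\Delta T = E_T(y^{\text{target}}) - E_T(y^{\text{source}})$ and $\Delta I = E_I(x^{\text{edit}}) - E_I(x^{\text{source}})$ for edited image $x^{\text{edit}}$, target description $y^{\text{target}}$, original image $x^{\text{source}}$, and source description $y^{\text{source}}$. We use the prompts ‘smiling face’ and ‘face’ as the target and source descriptions for the facial attribute smiling.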
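As a minimal sketch of the directional loss described above, the function below takes four embedding vectors that are assumed to have been computed already by the CLIP image and text encoders. The function name, the plain-list representation, and the small `eps` stabilizer are illustrative assumptions, not part of the paper.

```python
import math

def directional_clip_loss(img_edit, img_src, txt_tgt, txt_src, eps=1e-8):
    """Return 1 - cos(delta_I, delta_T) for precomputed embedding vectors.

    All four arguments are equal-length sequences of floats standing in for
    CLIP embeddings (hypothetical inputs; no encoder is called here).
    """
    # Shift in image-embedding space: edited image minus source image.
    delta_i = [a - b for a, b in zip(img_edit, img_src)]
    # Shift in text-embedding space: target prompt minus source prompt.
    delta_t = [a - b for a, b in zip(txt_tgt, txt_src)]
    dot = sum(a * b for a, b in zip(delta_i, delta_t))
    norm_i = math.sqrt(sum(a * a for a in delta_i))
    norm_t = math.sqrt(sum(a * a for a in delta_t))
    # eps (an assumption of this sketch) avoids division by zero for a null shift.
    return 1.0 - dot / (norm_i * norm_t + eps)
```

The loss is close to 0 when the image shift points in the same direction as the text shift, 1 when the two are orthogonal, and close to 2 when they point in opposite directions. Only the direction of the shift matters, not its length.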

2. DISCOVERING SEMANTIC LATENT SPACE IN DIFFUSION MODELS

We use an abbreviated version of Eq. (3), which is the DDIM sampling step:

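$$x_{t-1} = \sqrt{\alpha_{t-1}} \underbrace{\left( \frac{x_t - \sqrt{1-\alpha_t}\,\epsilon_t^\theta(x_t)}{\sqrt{\alpha_t}} \right)}_{\text{predicted } x_0} + \underbrace{\sqrt{1-\alpha_{t-1}-\sigma_t^2}\;\epsilon_t^\theta(x_t)}_{\text{direction pointing to } x_t} + \sigma_t z_t \quad (3)$$

$$x_{t-1} = \sqrt{\alpha_{t-1}}\,P_t\!\left(\epsilon_t^\theta(x_t)\right) + D_t\!\left(\epsilon_t^\theta(x_t)\right) + \sigma_t z_t \quad (4)$$

where $P_t$ denotes the predicted $x_0$ and $D_t$ denotes the direction pointing to $x_t$.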
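A minimal sketch of the abbreviated sampling step may help keep the two terms apart. It assumes the standard DDIM form, in which $P_t$ is the predicted $x_0$, $(x_t - \sqrt{1-\alpha_t}\,\epsilon)/\sqrt{\alpha_t}$, and $D_t$ is the direction pointing to $x_t$, $\sqrt{1-\alpha_{t-1}-\sigma_t^2}\,\epsilon$. Scalars stand in for images, and all function names are hypothetical.

```python
import math

def predicted_x0(x_t, eps, alpha_t):
    """P_t: the x_0 implied by x_t and the predicted noise eps."""
    return (x_t - math.sqrt(1.0 - alpha_t) * eps) / math.sqrt(alpha_t)

def direction_to_xt(eps, alpha_prev, sigma_t=0.0):
    """D_t: the component that points back toward x_t."""
    return math.sqrt(1.0 - alpha_prev - sigma_t ** 2) * eps

def ddim_step(x_t, eps, alpha_t, alpha_prev, sigma_t=0.0, z_t=0.0):
    """One reverse step: sqrt(alpha_prev) * P_t + D_t + sigma_t * z_t.

    eps would come from the noise-prediction network; here it is passed in
    directly so the sketch stays self-contained. sigma_t = 0 gives the
    deterministic step.
    """
    return (math.sqrt(alpha_prev) * predicted_x0(x_t, eps, alpha_t)
            + direction_to_xt(eps, alpha_prev, sigma_t)
            + sigma_t * z_t)
```

If `x_t` was formed as `sqrt(alpha_t) * x0 + sqrt(1 - alpha_t) * eps`, then `predicted_x0` recovers `x0` exactly, and the deterministic step returns the same mixture at the previous noise level `alpha_prev`.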

PROBLEM

The most straightforward way to manipulate $x_0$ is to update $x_T$ directly so as to optimize the directional CLIP loss for the given text prompts, sampling with Eq. (4). However, this leads to distorted images or incorrect manipulation.

An alternative approach is to shift the noise $\epsilon_t^\theta$ predicted by the network at each sampling step. However, this fails to manipulate $x_0$: the intermediate changes in $P_t$ and $D_t$ cancel each other out, resulting in the same $p_\theta(x_{0:T})$, similarly to destructive interference.

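Theorem 1. Let $\epsilon_t^\theta$ be the noise predicted during the original reverse process at $t$ and $\tilde{\epsilon}_t^\theta$ be its shifted counterpart. Then $\Delta x_t = \tilde{x}_{t-1} - x_{t-1}$ is negligible, where

$$\tilde{x}_{t-1} = \sqrt{\alpha_{t-1}}\,P_t\!\left(\tilde{\epsilon}_t^\theta(x_t)\right) + D_t\!\left(\tilde{\epsilon}_t^\theta(x_t)\right).$$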

Appendix B proves the theorem above. Figure 13(a-b) shows that $\tilde{x}_0$ is almost identical to $x_0$.
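The cancellation between $P_t$ and $D_t$ can also be checked numerically. The sketch below assumes the deterministic case ($\sigma_t = 0$), the standard DDIM forms of $P_t$ and $D_t$, adjacent timesteps, and a 1000-step linear $\beta$ schedule from 1e-4 to 0.02. The schedule and all names are assumptions of this sketch, not taken from the notes. Under these assumptions, shifting the predicted noise by $\Delta\epsilon$ changes $x_{t-1}$ by $c_t\,\Delta\epsilon$ with $c_t = \sqrt{1-\alpha_{t-1}} - \sqrt{\alpha_{t-1}}\sqrt{1-\alpha_t}/\sqrt{\alpha_t}$.

```python
import math

def linear_alphas(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products alpha_0 = 1, alpha_1, ..., alpha_T (assumed schedule)."""
    alphas = [1.0]
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        alphas.append(alphas[-1] * (1.0 - beta))
    return alphas

def shift_coefficient(alpha_t, alpha_prev):
    """c_t such that x~_{t-1} - x_{t-1} = c_t * delta_eps when sigma_t = 0."""
    # Contribution through sqrt(alpha_prev) * P_t (negative sign).
    through_p = -math.sqrt(alpha_prev) * math.sqrt(1.0 - alpha_t) / math.sqrt(alpha_t)
    # Contribution through D_t (positive sign).
    through_d = math.sqrt(1.0 - alpha_prev)
    return through_p + through_d

alphas = linear_alphas()
coeffs = [shift_coefficient(alphas[t], alphas[t - 1]) for t in range(1, len(alphas))]
max_abs = max(abs(c) for c in coeffs)
print(f"largest |c_t| over all timesteps: {max_abs:.4f}")
```

Under these assumptions the two contributions nearly cancel at every timestep, so only a small fraction of the shift survives into $x_{t-1}$. With coarser timestep spacing the residual grows, so this sketch illustrates the mechanism and does not replace the proof in Appendix B.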