Based on the full inversion capability and the high-quality image generation power of recent diffusion models, our method performs zero-shot image manipulation successfully even between unseen domains, and takes another step towards general application by manipulating images from the widely varying ImageNet dataset. Furthermore, we propose a novel noise combination method that allows straightforward multi-attribute manipulation.
Inspired by this, we propose DiffusionCLIP, a CLIP-guided robust image manipulation method based on diffusion models. Here, an input image is first converted to latent noise through a forward diffusion process.
The key idea of DiffusionCLIP is to fine-tune the score function in the reverse diffusion process using a CLIP loss that controls the attributes of the generated image based on the text prompts.
CLIP Guidance for Image Manipulation
In CLIP, a text encoder and an image encoder are jointly pretrained to identify which texts are matched with which images in a dataset. Accordingly, we use a pretrained CLIP model for our text-driven image manipulation.
To effectively extract knowledge from CLIP, two different losses have been proposed: a global target loss [39] and a local directional loss [20]. The global CLIP loss tries to minimize the cosine distance in the CLIP space between the generated image and a given target text as follows:
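A plausible reconstruction of the omitted display, using the symbols defined in the following sentence:

$$\mathcal{L}_{\text{global}}(x_{\text{gen}}, y_{\text{tar}}) = D_{\text{CLIP}}(x_{\text{gen}}, y_{\text{tar}})$$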
Here, ytar is a text description of a target, xgen denotes the generated image, and DCLIP returns the cosine distance in the CLIP space between their encoded vectors. In contrast, the local directional loss [20] is designed to alleviate the issues of the global CLIP loss, such as low diversity and susceptibility to adversarial attacks.
The local directional CLIP loss induces the direction between the embeddings of the reference and generated images to be aligned with the direction between the embeddings of a pair of reference and target texts in the CLIP space as follows:
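A plausible reconstruction of the omitted display, term-by-term consistent with the definitions in the surrounding text (EI, ET, and the reference/target pairs are introduced in the next sentence):

$$\mathcal{L}_{\text{direction}}\left(x_{\text{gen}}, y_{\text{tar}}; x_{\text{ref}}, y_{\text{ref}}\right) := 1 - \frac{\langle \Delta I, \Delta T \rangle}{\|\Delta I\| \, \|\Delta T\|},$$

where $\Delta T = E_T(y_{\text{tar}}) - E_T(y_{\text{ref}})$ and $\Delta I = E_I(x_{\text{gen}}) - E_I(x_{\text{ref}})$.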
Here, EI and ET are CLIP's image and text encoders, respectively, and yref, xref are the source domain text and image, respectively. The manipulated images guided by the directional CLIP loss are known to be robust to the mode-collapse issue because, by aligning the direction between the image representations with the direction between the reference and target texts, distinct images should be generated. It is also more robust to adversarial attacks because the perturbation differs from image to image [41].
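Given precomputed CLIP embeddings, the directional loss reduces to one minus the cosine similarity between the image-space and text-space direction vectors. A minimal sketch (the embedding arguments are assumed to come from CLIP's encoders EI and ET; the function name is ours):

```python
import numpy as np

def directional_clip_loss(e_img_gen, e_img_ref, e_txt_tar, e_txt_ref):
    """Local directional CLIP loss: 1 - cos(Delta I, Delta T).

    All arguments are 1-D embedding vectors, e.g. from CLIP's
    image encoder E_I and text encoder E_T.
    """
    d_img = e_img_gen - e_img_ref   # Delta I = E_I(x_gen) - E_I(x_ref)
    d_txt = e_txt_tar - e_txt_ref   # Delta T = E_T(y_tar) - E_T(y_ref)
    cos = float(np.dot(d_img, d_txt) /
                (np.linalg.norm(d_img) * np.linalg.norm(d_txt) + 1e-8))
    return 1.0 - cos
```

The loss is 0 when the image-embedding direction is perfectly aligned with the text-embedding direction and 2 when they are opposed, which is what forces per-image-distinct edits rather than a single collapsed output.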
Here, the input image x0 is first converted to the latent xt0 (θ) using a pretrained diffusion model ϵθ. Then, guided by the CLIP loss, the diffusion model on the reverse path is fine-tuned to generate samples driven by the target text ytar. The deterministic forward and reverse processes are based on DDIM [52]. For translation between unseen domains, the latent can also be generated by the stochastic DDPM [23] forward process, as will be explained later.
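The deterministic (eta = 0) DDIM update is the same map whether it is run image-to-latent (inversion) or latent-to-image (generation). A minimal numerical sketch of both directions; the trained network ϵθ is replaced here by a hypothetical x-independent lookup table so that the round trip is exact, whereas with a real network the inversion is only approximately exact:

```python
import numpy as np

T = 50
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)        # cumulative product \bar{alpha}_t

rng = np.random.default_rng(0)
eps_table = rng.standard_normal((T, 4))   # toy stand-in for eps_theta

def eps_model(x, t):
    # Hypothetical noise predictor; deliberately independent of x so the
    # forward and reverse DDIM passes below invert each other exactly.
    return eps_table[t]

def ddim_step(x, a_from, a_to, eps):
    # One deterministic DDIM update between noise levels a_from -> a_to.
    x0_pred = (x - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)
    return np.sqrt(a_to) * x0_pred + np.sqrt(1.0 - a_to) * eps

def invert(x0):
    # Forward DDIM: image x_0 -> latent x_T
    x = x0
    for t in range(T):
        a_prev = alpha_bar[t - 1] if t > 0 else 1.0
        x = ddim_step(x, a_prev, alpha_bar[t], eps_model(x, t))
    return x

def generate(xT):
    # Reverse DDIM: latent x_T -> image x_0
    x = xT
    for t in reversed(range(T)):
        a_prev = alpha_bar[t - 1] if t > 0 else 1.0
        x = ddim_step(x, alpha_bar[t], a_prev, eps_model(x, t))
    return x
```

This invertibility of the deterministic process is what lets the method recover the input image from its latent before any fine-tuning, and it is why the stochastic DDPM forward process is reserved for the unseen-domain setting where faithful reconstruction of the source is not required.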
Specifically, to fine-tune the reverse diffusion model ϵθ, we use the following objective composed of the directional CLIP loss Ldirection and the identity loss LID:
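A plausible reconstruction of the omitted objective, combining the two named terms (with $\hat{x}_0(\hat{\theta})$ denoting the image generated by the fine-tuned reverse process from the latent of $x_0$):

$$\mathcal{L}_{\text{direction}}\left(\hat{x}_0(\hat{\theta}), y_{\text{tar}}; x_0, y_{\text{ref}}\right) + \mathcal{L}_{\text{ID}}\left(\hat{x}_0(\hat{\theta}), x_0\right)$$

The identity loss keeps the manipulated image close to the input (e.g., via pixel-wise and identity-preserving terms), so that only the attribute named by ytar changes.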