Recently, large-scale language-image (LLI) models, such as Imagen [38], DALL·E 2 [33] and Parti [48], have shown phenomenal generative semantic and compositional power, and gained unprecedented attention from the research community and the public eye.
However, these models do not provide simple editing means, and generally lack control over specific semantic regions of a given image. In particular, even the slightest change in the textual prompt may lead to a completely different output image.
In this paper, we introduce an intuitive and powerful textual editing method to semantically edit images in pre-trained text-conditioned diffusion models via Prompt-to-Prompt manipulations.
Our key idea is that we can edit images by injecting the cross-attention maps during the diffusion process, controlling which pixels attend to which tokens of the prompt text during which diffusion steps. We show several methods to control the cross-attention maps through a simple and semantic interface (see fig. 1).
Our approach constitutes an intuitive image editing interface that operates through editing only the textual prompt, and is therefore called Prompt-to-Prompt. It does not require model training, fine-tuning, extra data, or optimization.
Let $I$ be an image generated by a text-guided diffusion model [38] using the text prompt $P$ and a random seed $s$. Our goal is to edit the input image guided only by the edited prompt $P^*$, resulting in an edited image $I^*$.
As opposed to previous works, we wish to avoid relying on any user-defined mask to assist or signify where the edit should occur. A simple, but unsuccessful, attempt is to fix the internal randomness and regenerate using the edited text prompt. Unfortunately, as fig. 2 shows, this results in a completely different image with a different structure and composition.
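To see why fixing the randomness is not enough, note that it only pins down the initial noise $z_T$; every subsequent denoising step is still conditioned on the prompt, so an edited prompt steers the trajectory to a different image. A minimal NumPy sketch of this point (the function name `initial_latent` and the latent shape are illustrative, not from the paper):

```python
import numpy as np

def initial_latent(seed, shape=(4, 64, 64)):
    """Sample the initial noise z_T from a fixed seed.

    Fixing the seed fixes z_T, but in a text-guided diffusion model
    the denoising trajectory still depends on the text conditioning,
    so the final images for two prompts diverge anyway (cf. fig. 2)."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal(shape)

# The same seed yields an identical starting point for both prompts:
z_T_a = initial_latent(42)
z_T_b = initial_latent(42)
assert np.array_equal(z_T_a, z_T_b)
```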
Our key observation is that the structure and appearance of the generated image depend not only on the random seed, but also on the interaction between the pixels and the text embedding throughout the diffusion process. By modifying this pixel-to-text interaction, which occurs in the cross-attention layers, we provide Prompt-to-Prompt image editing capabilities.
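The core mechanism can be sketched as follows: during the early (high-noise) diffusion steps, the cross-attention maps computed for the edited prompt are overridden with those of the source prompt, preserving the original spatial layout; later steps use the edited prompt's own maps. This is a simplified sketch, not the paper's implementation; the function name and the step threshold `tau` are illustrative:

```python
import numpy as np

def inject_attention(M_source, M_edit, t, tau):
    """Cross-attention injection (simplified): for diffusion steps
    t > tau (steps are indexed from T down to 0, so these are the
    early, high-noise steps), reuse the source prompt's attention
    maps; for t <= tau, let the edited prompt's maps take over."""
    return M_source if t > tau else M_edit

# Toy maps: 16 pixels attending to 5 prompt tokens, rows sum to 1.
rng = np.random.default_rng(0)
M_src = np.full((16, 5), 0.2)
M_edt = rng.random((16, 5))
M_edt /= M_edt.sum(axis=1, keepdims=True)

assert inject_attention(M_src, M_edt, t=40, tau=25) is M_src
assert inject_attention(M_src, M_edt, t=10, tau=25) is M_edt
```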
Recall that each diffusion step $t$ consists of predicting the noise from a noisy image $z_t$ and text embedding $\psi(P)$ using a U-shaped network [37]. At the final step, this process yields the generated image $I = z_0$. Most importantly, the interaction between the two modalities occurs during the noise prediction, where the embeddings of the visual and textual features are fused using cross-attention layers that produce spatial attention maps for each textual token.
More formally, as illustrated in fig. 3 (top), the deep spatial features of the noisy image $\phi(z_t)$ are projected to a query matrix $Q = \ell_Q(\phi(z_t))$, and the textual embedding is projected to a key matrix $K = \ell_K(\psi(P))$ and a value matrix $V = \ell_V(\psi(P))$, via learned linear projections $\ell_Q, \ell_K, \ell_V$. The attention maps are then
$$M = \mathrm{Softmax}\left(\frac{QK^T}{\sqrt{d}}\right),$$
where the cell $M_{ij}$ defines the weight of the $j$-th token on the pixel $i$, and $d$ is the projection dimension of the keys and queries.
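The projection-and-attention computation above can be sketched in a few lines of NumPy. This is a toy single-head layer with illustrative shapes and randomly initialized projection matrices, not the network's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(phi_zt, psi_P, W_q, W_k, W_v):
    """One cross-attention layer (sketch).

    phi_zt: image features, shape (n_pixels, d_img)
    psi_P:  text embeddings, shape (n_tokens, d_txt)
    Returns the attended output M @ V and the attention maps M,
    where M[i, j] is the weight of token j on pixel i."""
    Q = phi_zt @ W_q                      # (n_pixels, d)
    K = psi_P @ W_k                       # (n_tokens, d)
    V = psi_P @ W_v                       # (n_tokens, d)
    d = Q.shape[-1]
    M = softmax(Q @ K.T / np.sqrt(d))     # (n_pixels, n_tokens)
    return M @ V, M

# Toy dimensions: 16 pixels, 5 tokens, projection dim 8.
rng = np.random.default_rng(0)
phi = rng.standard_normal((16, 32))
psi = rng.standard_normal((5, 64))
Wq, Wk, Wv = (rng.standard_normal(s) * 0.1
              for s in [(32, 8), (64, 8), (64, 8)])
out, M = cross_attention(phi, psi, Wq, Wk, Wv)
assert M.shape == (16, 5) and np.allclose(M.sum(axis=1), 1.0)
```

Each row of $M$ is a probability distribution over prompt tokens, which is exactly what makes the maps editable: replacing or re-weighting a token's column changes where and how strongly it influences the image.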