Recently, large-scale language-image (LLI) models, such as Imagen [38], DALL·E 2 [33] and Parti [48], have shown phenomenal generative semantic and compositional power, and gained unprecedented attention from the research community and the public eye.
However, these models do not provide simple editing means, and generally lack control over specific semantic regions of a given image. In particular, even the slightest change in the textual prompt may lead to a completely different output image.
In this paper, we introduce an intuitive and powerful textual editing method to semantically edit images in pre-trained text-conditioned diffusion models via Prompt-to-Prompt manipulations.
Our key idea is that we can edit images by injecting the cross-attention maps during the diffusion process, controlling which pixels attend to which tokens of the prompt text during which diffusion steps. We show several methods to control the cross-attention maps through a simple and semantic interface (see fig. 1).
Our approach constitutes an intuitive image editing interface that operates through editing only the textual prompt, and is therefore called Prompt-to-Prompt. It does not require model training, fine-tuning, extra data, or optimization.
Let $I$ be an image generated by a text-guided diffusion model [38] using the text prompt $P$ and a random seed $s$. Our goal is to edit the input image guided only by the edited prompt $P^*$, resulting in an edited image $I^*$.
As opposed to previous works, we wish to avoid relying on any user-defined mask to assist or signify where the edit should occur. A simple, but unsuccessful, attempt is to fix the internal randomness and regenerate using the edited text prompt. Unfortunately, as fig. 2 shows, this results in a completely different image with a different structure and composition.
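To see why fixing the randomness is not enough, note that it only pins down the initial noise $z_T$; every subsequent denoising step is still conditioned on the prompt, so an edited prompt steers the trajectory to a different image. A minimal NumPy sketch of this point (the function name `initial_latent` and the latent shape are illustrative, not from the paper):

```python
import numpy as np

def initial_latent(seed, shape=(4, 64, 64)):
    """Sample the initial noise z_T from a fixed seed.

    Fixing the seed fixes z_T, but in a text-guided diffusion model
    the denoising trajectory still depends on the text conditioning,
    so the final images for two prompts diverge anyway (cf. fig. 2)."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal(shape)

# The same seed yields an identical starting point for both prompts:
z_T_a = initial_latent(42)
z_T_b = initial_latent(42)
assert np.array_equal(z_T_a, z_T_b)
```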
Our key observation is that the structure and appearance of the generated image depend not only on the random seed, but also on the interaction between the pixels and the text embedding throughout the diffusion process. By modifying this pixel-to-text interaction, which occurs in the cross-attention layers, we provide Prompt-to-Prompt image editing capabilities.
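The core mechanism can be sketched as follows: during the early (high-noise) diffusion steps, the cross-attention maps computed for the edited prompt are overridden with those of the source prompt, preserving the original spatial layout; later steps use the edited prompt's own maps. This is a simplified sketch, not the paper's implementation; the function name and the step threshold `tau` are illustrative:

```python
import numpy as np

def inject_attention(M_source, M_edit, t, tau):
    """Cross-attention injection (simplified): for diffusion steps
    t > tau (steps are indexed from T down to 0, so these are the
    early, high-noise steps), reuse the source prompt's attention
    maps; for t <= tau, let the edited prompt's maps take over."""
    return M_source if t > tau else M_edit

# Toy maps: 16 pixels attending to 5 prompt tokens, rows sum to 1.
rng = np.random.default_rng(0)
M_src = np.full((16, 5), 0.2)
M_edt = rng.random((16, 5))
M_edt /= M_edt.sum(axis=1, keepdims=True)

assert inject_attention(M_src, M_edt, t=40, tau=25) is M_src
assert inject_attention(M_src, M_edt, t=10, tau=25) is M_edt
```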
Recall that each diffusion step $t$ consists of predicting the noise from a noisy image $z_t$ and text embedding $\psi(P)$ using a U-shaped network [37]. At the final step, this process yields the generated image $I = z_0$. Most importantly, the interaction between the two modalities occurs during the noise prediction, where the embeddings of the visual and textual features are fused using cross-attention layers that produce spatial attention maps for each textual token.
More formally, as illustrated in fig. 3 (top), the deep spatial features of the noisy image $\phi(z_t)$ are projected to a query matrix $Q = \ell_Q(\phi(z_t))$, and the textual embedding is projected to a key matrix $K = \ell_K(\psi(P))$ and a value matrix $V = \ell_V(\psi(P))$, via learned linear projections $\ell_Q, \ell_K, \ell_V$. The attention maps are then
$$M = \mathrm{Softmax}\left(\frac{QK^T}{\sqrt{d}}\right),$$
where the cell $M_{ij}$ defines the weight of the $j$-th token on the pixel $i$, and $d$ is the projection dimension of the keys and queries.
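The projection-and-attention computation above can be sketched in a few lines of NumPy. This is a toy single-head layer with illustrative shapes and randomly initialized projection matrices, not the network's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(phi_zt, psi_P, W_q, W_k, W_v):
    """One cross-attention layer (sketch).

    phi_zt: image features, shape (n_pixels, d_img)
    psi_P:  text embeddings, shape (n_tokens, d_txt)
    Returns the attended output M @ V and the attention maps M,
    where M[i, j] is the weight of token j on pixel i."""
    Q = phi_zt @ W_q                      # (n_pixels, d)
    K = psi_P @ W_k                       # (n_tokens, d)
    V = psi_P @ W_v                       # (n_tokens, d)
    d = Q.shape[-1]
    M = softmax(Q @ K.T / np.sqrt(d))     # (n_pixels, n_tokens)
    return M @ V, M

# Toy dimensions: 16 pixels, 5 tokens, projection dim 8.
rng = np.random.default_rng(0)
phi = rng.standard_normal((16, 32))
psi = rng.standard_normal((5, 64))
Wq, Wk, Wv = (rng.standard_normal(s) * 0.1
              for s in [(32, 8), (64, 8), (64, 8)])
out, M = cross_attention(phi, psi, Wq, Wk, Wv)
assert M.shape == (16, 5) and np.allclose(M.sum(axis=1), 1.0)
```

Each row of $M$ is a probability distribution over prompt tokens, which is exactly what makes the maps editable: replacing or re-weighting a token's column changes where and how strongly it influences the image.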