We observe two key semantic issues in state-of-the-art text-based image generation models: (i) “catastrophic neglect”, where one or more of the subjects of the prompt are not generated; and (ii) incorrect “attribute binding”, where the model binds attributes to the wrong subjects or fails to bind them entirely.
We introduce the concept of “Generative Semantic Nursing” (GSN). In the GSN process, one slightly shifts the latent code at each timestep of the denoising process so that the latent is encouraged to better capture the semantic information conveyed by the input text prompt.
We propose a form of GSN dubbed Attend-and-Excite, which leverages the powerful cross-attention maps of a pretrained diffusion model.
Thus, intuitively, in order for a subject to be present in the generated image, the model should assign at least one image patch to the subject’s token. Attend-and-Excite embodies this intuition by demanding that each subject token is dominant in some patch in the image. We carefully guide the latent at each denoising timestep and encourage the model to attend to all subject tokens and strengthen — or excite — their activations. Importantly, our approach is applied on the fly during inference time and requires no additional training or fine-tuning. We instead choose to preserve the strong semantics already learned by the pre-trained generative model and text encoder.
Latent Diffusion Models
Instead of operating in the image space, the Stable Diffusion (SD) model operates in the latent space of an autoencoder.
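For concreteness, here is a minimal sketch of this latent-space round trip using the diffusers library; the checkpoint name is one common SD v1.x choice, and image preprocessing is elided:

```python
# Sketch: encode an image into SD's latent space and decode it back.
# Assumes the diffusers library; 0.18215 is the SD v1.x latent scaling factor.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="vae"
)

image = torch.randn(1, 3, 512, 512)  # stand-in for an RGB image scaled to [-1, 1]

with torch.no_grad():
    # Encode: 512x512 pixels -> 4x64x64 latent (8x spatial downsampling).
    latent = vae.encode(image).latent_dist.sample() * 0.18215
    # Decode: latent -> pixels. The diffusion process itself runs on `latent`.
    reconstruction = vae.decode(latent / 0.18215).sample
```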
Text-Conditioning Via Cross-Attention
The denoising UNet network consists of self-attention layers followed by cross-attention layers at resolutions of 64, 32, 16, and 8.
Denote by $P$ the spatial dimension of the intermediate feature map (i.e., $P \in \{64, 32, 16, 8\}$), and by $N$ the number of text tokens in the prompt. An attention map $A_t \in \mathbb{R}^{P \times P \times N}$ is calculated over linear projections of the intermediate features ($Q$) and the text embedding ($K$), as illustrated in the second row of Figure 3. $A_t$ defines a distribution over the text tokens for each spatial patch $(i, j)$. Specifically, $A_t[i, j, n]$ denotes the probability assigned to token $n$ for the $(i, j)$-th spatial patch of the intermediate feature map.
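For illustration, this computation can be sketched as scaled dot-product attention followed by a Softmax over the token axis; the sketch below assumes a single attention head and omits batching:

```python
# Schematic of the cross-attention map A_t (not the actual UNet code).
import torch
import torch.nn.functional as F

P, N, d = 16, 77, 64                 # patch resolution, token count, head dimension
Q = torch.randn(P * P, d)            # projected intermediate image features
K = torch.randn(N, d)                # projected text embeddings

# Softmax over the token axis: each patch gets a distribution over tokens.
A_t = F.softmax(Q @ K.T / d**0.5, dim=-1)   # shape (P*P, N)
A_t = A_t.reshape(P, P, N)                  # A_t[i, j, n]: prob. of token n at patch (i, j)
```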
At each denoising step $t$, we consider the attention maps of the subject tokens in the prompt $\mathcal{P}$. Intuitively, for a subject to be present in the synthesized image, it should have a high influence on some patch in the image. As such, we define a loss objective that attempts to maximize the attention values for each subject token. We then update the noised latent at time $t$ according to the gradient of the computed loss. This encourages the latent at the next timestep to better incorporate all subject tokens in its representation. This manipulation occurs on the fly during inference (i.e., no additional training is performed).
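The update can be sketched as follows. Here, attention_for is a hypothetical helper that re-runs the UNet on the current latent and returns the aggregated attention map, and the particular loss shown (penalizing the most neglected subject token) is one natural instantiation of the objective described above:

```python
import torch

def update_latent(z_t, prompt, subject_indices, step_size=0.1):
    """One gradient step on the noised latent z_t at denoising step t."""
    z_t = z_t.detach().requires_grad_(True)
    A_t = attention_for(z_t, prompt)  # hypothetical helper -> (P, P, N) map
    # For each subject token, take its strongest patch activation; the loss
    # focuses on the subject token that is currently most neglected.
    max_per_token = [A_t[:, :, s].max() for s in subject_indices]
    loss = torch.stack([1.0 - m for m in max_per_token]).max()
    grad = torch.autograd.grad(loss, z_t)[0]
    return z_t - step_size * grad     # shift the latent toward attending to all subjects
```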
Extracting the Cross-Attention Maps
Given the input text prompt $\mathcal{P}$, we consider the set of all subject tokens (e.g., nouns) $S = \{s_1, \dots, s_k\}$ present in $\mathcal{P}$.
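In practice, such subject tokens can be identified with an off-the-shelf part-of-speech tagger; the sketch below uses spaCy as one possible choice (the resulting indices are word positions and may still need to be mapped to the SD tokenizer's token positions):

```python
# Hedged sketch: find noun indices in a prompt with spaCy's POS tagger.
import spacy

nlp = spacy.load("en_core_web_sm")

def subject_indices(prompt: str) -> list[int]:
    doc = nlp(prompt)
    return [i for i, tok in enumerate(doc) if tok.pos_ in ("NOUN", "PROPN")]

subject_indices("a cat and a frog")  # -> [1, 4]
```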
We aggregate the cross-attention maps across the UNet's cross-attention layers; the resulting aggregated map $A_t$ contains $N$ spatial attention maps, one for each of the tokens of $\mathcal{P}$. For each subject token $s \in S$, the slice $A_t^s = A_t[:, :, s]$ then serves as its attention map, indicating the influence of the token $s$ on each image patch.
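A simple aggregation, assuming the per-layer, per-head maps have been collected at a common spatial resolution, is to average them:

```python
# Sketch: average cross-attention maps over layers and heads.
# `maps` is assumed to be a list of tensors of shape (heads, P, P, N),
# one entry per cross-attention layer at the chosen resolution.
import torch

def aggregate_attention(maps: list[torch.Tensor]) -> torch.Tensor:
    stacked = torch.cat(maps, dim=0)   # (total_heads, P, P, N)
    return stacked.mean(dim=0)         # (P, P, N): one spatial map per token
```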
Stable Diffusion prepends a specially-designated token, <sot>, to the prompt; this token tends to obtain a high probability in the token distribution defined in $A_t$. Since we are interested in enhancing the actual prompt tokens, we re-weigh the attention values by ignoring the attention of <sot> and performing a Softmax operation on the remaining tokens (Step 2 in Algorithm 1). After the Softmax operation, the $(i, j, n)$-th entry of the resulting matrix $A_t$ defines the probability assigned to token $n$ for the $(i, j)$-th spatial patch among the remaining prompt tokens.
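Step 2 can be sketched as follows, assuming <sot> occupies token index 0 (any temperature applied before the Softmax is omitted):

```python
# Sketch: drop <sot>'s attention and renormalize over the remaining tokens.
import torch
import torch.nn.functional as F

def reweigh(A_t: torch.Tensor) -> torch.Tensor:
    A_no_sot = A_t[:, :, 1:]            # ignore the <sot> token's attention
    # Per-patch Softmax over the remaining tokens; note that token indices
    # are shifted down by one relative to the original prompt positions.
    return F.softmax(A_no_sot, dim=-1)
```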