
In this paper, we focus on attaining control over the generated structure and semantic layout of the scene – an imperative component in various real-world content creation tasks, ranging from visual branding and marketing to digital art. That is, our goal is to take text-to-image generation to the realm of text-guided Image-to-Image (I2I) translation, where an input image guides the layout (e.g., the structure of the horse in Fig. 1), and the text guides the perceived semantics and appearance of the scene (e.g., “robot horse” in Fig. 1).

Our method does not require any training or fine-tuning, but rather leverages a pre-trained and fixed text-to-image diffusion model.

Specifically, spatial features and their self-attentions are extracted from the guidance image and directly injected into the text-guided generation process of the target image. We demonstrate that our approach is applicable not only when the guidance image is generated from text, but also for real-world images that are inverted into the model.

We demonstrate that fine-grained control over the generated layout is difficult to achieve solely through interaction with the text. Intuitively, since the cross-attention is formed by associating spatial features with words, it can capture rough regions at the object level, yet localized spatial information that is not expressed in the source text prompt (e.g., object parts) is not guaranteed to be preserved by P2P. Instead, our method focuses only on spatial features and their self-affinities – we show that such features exhibit high granularity of spatial information, allowing us to control the generated structure without restricting the interaction with the text.

To summarize, we make the following key contributions: (i) We provide new empirical insights about internal spatial features formed during the diffusion process. (ii) We introduce an effective framework that leverages the power of a pre-trained and fixed guided diffusion model, allowing us to perform high-quality text-guided I2I translation without any training or fine-tuning. (iii) We show, both quantitatively and qualitatively, that our method outperforms existing baselines, achieving a significantly better balance between preserving the guidance layout and deviating from its appearance.

Preliminary


In this work, we leverage a pre-trained text-conditioned Latent Diffusion Model (LDM), a.k.a. Stable Diffusion, in which the diffusion process is applied in the latent space of a pre-trained image autoencoder. The model is based on a U-Net architecture [37] conditioned on the guiding prompt P. Each layer of the U-Net comprises a residual block, a self-attention block, and a cross-attention block, as illustrated in Fig. 2(b). The residual block convolves image features $φ^{l-1}_t$ from the previous layer $l-1$ to produce intermediate features $f^l_t$. In the self-attention block, features are projected into queries $q^l_t$, keys $k^l_t$, and values $v^l_t$, and the output of the block is given by:

$$\hat{f}^l_t = A^l_t\, v^l_t, \qquad A^l_t = \mathrm{softmax}\!\left(\frac{q^l_t\,(k^l_t)^{T}}{\sqrt{d}}\right)$$

This operation allows for long-range interactions between image features. Finally, cross-attention is computed between the spatial image features and the token embedding of the text prompt P.
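The self-attention operation above can be sketched in a few lines (a minimal single-head version; the projection matrices `W_q`, `W_k`, `W_v` are illustrative stand-ins for the U-Net's learned linear layers):

```python
import torch

def self_attention(f, W_q, W_k, W_v):
    """Single-head self-attention over N flattened spatial features of dim d.

    f: (N, d) intermediate features f_t^l.
    W_q, W_k, W_v: (d, d) projection matrices (hypothetical stand-ins
    for the learned attention weights).
    """
    q, k, v = f @ W_q, f @ W_k, f @ W_v  # queries, keys, values
    d = q.shape[-1]
    # A_t^l: affinities between every pair of spatial locations,
    # which is what enables the long-range interactions
    A = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
    return A @ v  # \hat{f}_t^l = A_t^l v_t^l
```

Each row of `A` is a distribution over all spatial positions, so every output feature is a convex combination of the value vectors from the whole feature map.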

Method

Given an input guidance image $I^G$ and a target prompt P, our goal is to generate a new image $I^*$ that complies with P while preserving the structure and semantic layout of $I^G$. We consider Stable Diffusion [36], a state-of-the-art pre-trained and fixed text-to-image LDM, denoted by $θ(x_t, P, t)$.

Specifically, we observe and empirically demonstrate that: (i) spatial features extracted from intermediate decoder layers encode localized semantic information and are less affected by appearance information, and (ii) the self-attention maps, which represent the affinities between the spatial features, retain fine layout and shape details.

Based on our findings, we devise a simple framework that extracts features from the generation process of the guidance image $I^G$ and directly injects them, along with P, into the generation process of $I^*$, requiring no training or fine-tuning (Fig. 2). Our approach is applicable to both text-generated and real-world guidance images; for the latter we apply DDIM inversion [44] to obtain the initial noise $x^G_T$.
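A minimal sketch of this injection mechanism, assuming a PyTorch model and a hypothetical layer name (not the actual Stable Diffusion API): activations produced while denoising the guidance image are cached via forward hooks, then substituted for the corresponding activations while denoising the target.

```python
import torch
import torch.nn as nn

feature_cache = {}  # layer name -> features saved from the guidance pass

def save_hook(name):
    """Cache a layer's output during the guidance image's generation."""
    def hook(module, inputs, output):
        feature_cache[name] = output.detach()
    return hook

def inject_hook(name):
    """Replace a layer's output with the cached guidance features."""
    def hook(module, inputs, output):
        return feature_cache.get(name, output)
    return hook

# Toy stand-in for a decoder layer; in practice hooks would be registered
# on chosen residual / self-attention blocks of the U-Net decoder, and the
# cache would additionally be keyed by the diffusion timestep t.
layer = nn.Linear(4, 4)

h = layer.register_forward_hook(save_hook("dec_layer_4"))
_ = layer(torch.randn(2, 4))    # "guidance" pass: features are cached
h.remove()

h = layer.register_forward_hook(inject_hook("dec_layer_4"))
out = layer(torch.randn(2, 4))  # "target" pass: cached features injected
h.remove()
```

Because the hook returns the cached tensor, the target pass proceeds with the guidance image's features at that layer while the rest of the network is still conditioned on the target prompt.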

Spatial features.

In text-to-image generation, one can use descriptive text prompts to specify various scene and object properties, including those related to shape, pose, and scene layout, e.g., “a photo of a horse galloping in the forest”. However, the exact scene layout, the shape of the object, and its fine-grained pose often vary significantly across images generated from the same prompt under different initial noise $x_T$. This suggests that the diffusion process itself and the resulting spatial features play a role in forming such fine-grained spatial information. This hypothesis is strengthened by [5], which demonstrated that semantic part segments can be estimated from spatial features in an unconditional diffusion model.


We perform a simple PCA analysis that allows us to reason about the visual properties dominating the high-dimensional features of the model. Specifically, we generated a diverse set of images containing various humanoids in different styles, including both real and text-generated images; sample images are shown in Fig. 3. For each image, we extract features $f^l_t$ from each layer of the decoder at each time step t, as illustrated in Fig. 2(b). We then apply PCA on $f^l_t$ across all images.
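This analysis can be sketched as follows (a simplified NumPy stand-in; collecting the decoder features $f^l_t$ is assumed to have happened already, and function and argument names are illustrative):

```python
import numpy as np

def pca_project(features, n_components=3):
    """Project stacked per-position decoder features onto their top
    principal components.

    features: (M, d) array of features f_t^l pooled over spatial positions
    and across all images, at a fixed layer l and timestep t.
    Returns an (M, n_components) array; reshaping each image's rows back to
    (H, W, n_components) and normalizing to [0, 1] yields RGB-style
    visualizations of the dominant feature components.
    """
    X = features - features.mean(axis=0, keepdims=True)
    # Right singular vectors of the centered data are the principal axes,
    # ordered by explained variance
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:n_components].T
```

Fitting the PCA jointly across all images (rather than per image) is what makes the component colors comparable between images, so that matching semantic regions receive similar colors.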
