Despite the general capability of such models, users often wish to synthesize specific concepts from their own personal lives.

We propose Custom Diffusion, a fine-tuning technique for text-to-image diffusion models that is computationally and memory efficient. To overcome the above-mentioned challenges, we identify a small subset of model weights, namely the key and value mappings from text to latent features in the cross-attention layers. Fine-tuning these is sufficient to update the model with the new concept. To prevent model forgetting, we use a small set of real images with captions similar to those of the target images. We also introduce augmentation during fine-tuning, which leads to faster convergence and improved results. To inject multiple concepts, our method supports either training on all of them simultaneously or training them separately and then merging.
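As a minimal sketch of what "fine-tune only the cross-attention key and value mappings" means in practice, the snippet below filters a parameter dictionary down to those weights. The diffusers-style names such as `attn2.to_k` and the shapes are illustrative assumptions, not the paper's actual code:

```python
# Hypothetical U-Net parameter names; the "attn2.to_k"/"attn2.to_v" naming for
# cross-attention key/value projections follows diffusers conventions and is an
# assumption for illustration. Values are just (out_dim, in_dim) shape tuples.
params = {
    "down.0.attn1.to_q.weight": (320, 320),   # self-attention: frozen
    "down.0.attn2.to_q.weight": (320, 320),   # cross-attention query: frozen
    "down.0.attn2.to_k.weight": (320, 768),   # cross-attention key: fine-tuned
    "down.0.attn2.to_v.weight": (320, 768),   # cross-attention value: fine-tuned
    "down.0.conv.weight":       (320, 320),   # convolution: frozen
}

def trainable_param_names(params):
    """Keep only the key/value projections of the cross-attention layers."""
    return [name for name in params
            if "attn2" in name and (".to_k." in name or ".to_v." in name)]

trainable = trainable_param_names(params)
```

In a real training loop, every parameter outside this list would have its gradient disabled, so only a small fraction of the network is updated and stored per concept.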

1. Method


Single-Concept Fine-tuning

We aim to embed a new concept in the model given as few as four images and a corresponding text description. The fine-tuned model should retain its prior knowledge, allowing for novel generations with the new concept based on the text prompt. This can be challenging, as the updated text-to-image mapping can easily overfit the few available images.

In our experiments, we use Stable Diffusion [1] as our backbone model, which is built on the Latent Diffusion Model (LDM) [60]. LDM first encodes images into a latent representation using a hybrid objective combining VAE [34], PatchGAN [30], and LPIPS [85] losses, such that running the encoder-decoder can recover the input image. A diffusion model [28] is then trained on this latent representation, with the text condition injected into the model via cross-attention.

Learning objective of diffusion models

This is just the standard DDPM objective: the model ε_θ is trained to predict the noise added to the (latent) image, minimizing E_{x,c,ε,t}[ w_t ‖ε − ε_θ(x_t, c, t)‖² ], where x_t is the noised image at timestep t, c is the text condition, and w_t is a timestep-dependent weight.

Rate of change of weights


We analyze the change in parameters for each layer of the fine-tuned model on the target dataset with the loss in Eqn. 2,

∆_l = ‖θ′_l − θ_l‖ / ‖θ_l‖,

where θ′_l and θ_l are the updated and pretrained model parameters of layer l, respectively. These parameters come from three types of layers: (1) cross-attention (between the text and image), (2) self-attention (within the image itself), and (3) the rest of the parameters, including convolutional blocks and normalization layers in the diffusion model U-Net.


Figure 3 shows the mean ∆_l for the three categories when the model is fine-tuned on “moongate” images. The cross-attention parameters change the most relative to the rest of the network despite making up only a small fraction of its parameters, which motivates restricting fine-tuning to them.
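The change metric above is straightforward to compute; a minimal numpy sketch over (hypothetical) pretrained and fine-tuned parameter dictionaries:

```python
import numpy as np

def layer_deltas(pretrained, finetuned):
    """Compute Delta_l = ||theta'_l - theta_l|| / ||theta_l|| for each layer."""
    return {name: np.linalg.norm(finetuned[name] - w) / np.linalg.norm(w)
            for name, w in pretrained.items()}

# Toy example: the cross-attention key weights move a lot, the conv barely moves.
pretrained = {"attn2.to_k.weight": np.ones((2, 2)),
              "conv.weight":       np.ones((2, 2))}
finetuned  = {"attn2.to_k.weight": np.ones((2, 2)) * 1.5,
              "conv.weight":       np.ones((2, 2)) * 1.01}
deltas = layer_deltas(pretrained, finetuned)
```

Averaging these per-layer values within each of the three layer categories gives the bars plotted in Figure 3.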

Model fine-tuning.

The cross-attention block modifies the latent features of the network according to the condition features, i.e., text features in the case of text-to-image diffusion models. Given text features c ∈ ℝ^{s×d} and latent image features f ∈ ℝ^{(h×w)×l}, a single-head cross-attention [73] operation computes Q = W^q f, K = W^k c, V = W^v c, and a weighted sum over the value features:

Attention(Q, K, V) = Softmax(QKᵀ / √d′) V, where d′ is the output dimension of the key and query features.
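This operation can be sketched in a few lines of numpy; the shapes below are illustrative, and the projections are written as right-multiplications for row-major feature matrices:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(f, c, Wq, Wk, Wv):
    """Single-head cross-attention: latent image features f attend to text features c."""
    Q, K, V = f @ Wq, c @ Wk, c @ Wv           # queries from image, keys/values from text
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))          # (h*w, s) attention weights over tokens
    return A @ V                               # weighted sum over value features

# Illustrative shapes: 16 latent positions (h*w) of dim l=8, s=5 text tokens of dim d=12,
# projected into a d'=6 dimensional attention space.
f  = rng.standard_normal((16, 8))
c  = rng.standard_normal((5, 12))
Wq = rng.standard_normal((8, 6))
Wk = rng.standard_normal((12, 6))
Wv = rng.standard_normal((12, 6))
out = cross_attention(f, c, Wq, Wk, Wv)
```

Note that only Wk and Wv touch the text features c, which is exactly why updating the key and value mappings suffices to bind a new text token to a new visual concept.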