Custom Diffusion — Reading Notes

However, despite the diverse, general capability of such models, users often wish to synthesize specific concepts from their own personal lives.

Problem statement: users want the model to incorporate their own specific, personal concepts, but the results are unsatisfactory. This work aims to address that.

This motivates a need for model customization. Given the few user-provided images, can we augment existing text-to-image diffusion models with the new concept (for example, their pet dog or a "moongate" as shown in Figure 1)? The fine-tuned model should be able to generalize and compose them with existing concepts to generate new variations.

So the diffusion model is fine-tuned on the user-provided images, but this raises two problems: the model can forget the concepts it already knows, and it can lose sampling variation and overfit to the few training samples. In addition, the work supports not only a single concept ("Single-concept generation" in Fig. 1) but also multi-concept composition.

In this work, we propose a fine-tuning technique, Custom Diffusion for text-to-image diffusion models. Our method is computationally and memory efficient. To overcome the above-mentioned challenges, we identify a small subset of model weights, namely the key and value mapping from text to latent features in the cross-attention layers [5, 73].

1. Method


Our proposed method for model fine-tuning, as shown in Figure 2, only updates a small subset of weights in the cross-attention layers of the model.
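To make "only updates a small subset of weights in the cross-attention layers" concrete, here is a minimal sketch of freezing everything except the key/value projections that map text features into the latent space. The `ToyCrossAttention` module and its `to_q`/`to_k`/`to_v` names are illustrative stand-ins (implementations such as the diffusers U-Net typically name these `attn2.to_k` / `attn2.to_v`), not the paper's code.

```python
import torch.nn as nn

# Toy stand-in for a cross-attention layer in a diffusion U-Net.
# Queries come from image latents; keys/values come from the text embedding.
class ToyCrossAttention(nn.Module):
    def __init__(self, dim=8, text_dim=4):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)        # query: image latents -> latents
        self.to_k = nn.Linear(text_dim, dim)   # key:   text -> latents
        self.to_v = nn.Linear(text_dim, dim)   # value: text -> latents

model = ToyCrossAttention()

# Freeze all parameters, then unfreeze only the text-to-latent K/V maps.
for name, p in model.named_parameters():
    p.requires_grad = name.startswith(("to_k", "to_v"))

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # only the to_k / to_v weights and biases remain trainable
```

Only these unfrozen parameters would then be passed to the optimizer, which is what makes the fine-tuning memory- and compute-efficient.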

Single-Concept Fine-tuning

Learning objective of diffusion models.

$$\mathbb{E}_{\epsilon, x, c, t}\left[ w_t \left\| \epsilon - \epsilon_\theta(x_t, c, t) \right\|^2 \right]$$

Since this is a diffusion model, the loss is naturally the standard denoising objective above: predict the noise added to the latent at timestep $t$, conditioned on the text $c$.
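A minimal sketch of that training objective, with a plain `nn.Linear` as a hypothetical stand-in for the U-Net denoiser $\epsilon_\theta$ and a toy continuous noise schedule (both are illustrative assumptions, not the paper's setup):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy eps_theta(x_t, t, c): input = noised latent + timestep + text condition.
denoiser = nn.Linear(16 + 1 + 4, 16)

x0  = torch.randn(2, 16)        # clean latents
c   = torch.randn(2, 4)         # text conditioning
eps = torch.randn_like(x0)      # sampled Gaussian noise
t   = torch.rand(2, 1)          # timestep (toy continuous version)

alpha = (1 - t).sqrt()          # toy noise schedule
x_t = alpha * x0 + (1 - alpha**2).sqrt() * eps   # forward-noised latent

eps_pred = denoiser(torch.cat([x_t, t, c], dim=-1))
loss = (eps - eps_pred).pow(2).mean()   # MSE between true and predicted noise
print(loss.item())
```

In full fine-tuning, this loss is backpropagated through every parameter of the denoiser, which is exactly what the next paragraph argues against.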

This can be computationally inefficient for large-scale models and can easily lead to overfitting when training on a few images. Therefore, we aim to identify a minimal set of weights that is sufficient for the task of fine-tuning.

The authors note that fine-tuning all weights with this loss is computationally inefficient and likely to overfit, so they aim to identify a minimal set of weights that suffices for fine-tuning.

Rate of change of weights.

Following Li et al. [39], we analyze the change in parameters for each layer in the fine-tuned model on the target dataset with the loss in Eqn. 2,

$$\Delta_l = \frac{\left\| \theta'_l - \theta_l \right\|}{\left\| \theta_l \right\|}$$

where $\theta'_l$ and $\theta_l$ are the updated and pretrained model parameters of layer $l$. These parameters come from three types of layers – (1) cross-attention (between the text and image), (2) self-attention (within the image itself), and (3) the rest of the parameters, including convolutional blocks and normalization layers in the diffusion model U-Net. Figure 3 shows the mean $\Delta_l$ for the three categories when the model is fine-tuned on "moongate" images.

This metric measures, for each layer, how much the parameters change between the pretrained and updated models. The U-Net's parameters are grouped into cross-attention, self-attention, and everything else; comparing $\Delta_l$ across these groups shows that the cross-attention layers change the most during fine-tuning, which is why they are the subset chosen for updating.
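The layer-change metric $\Delta_l = \|\theta'_l - \theta_l\| / \|\theta_l\|$ is easy to compute from two state dicts. The sketch below uses made-up layer names and synthetic "fine-tuned" weights purely for illustration:

```python
import torch

torch.manual_seed(0)

# Illustrative pretrained weights for one layer of each category.
pretrained = {
    "attn2.to_k.weight": torch.ones(4, 4),   # cross-attention (text <-> image)
    "attn1.to_q.weight": torch.ones(4, 4),   # self-attention (within image)
    "conv_in.weight":    torch.ones(4, 4),   # other (conv / normalization)
}
# Synthetic "fine-tuned" weights: pretrained plus a small perturbation.
finetuned = {k: v + 0.1 * torch.randn_like(v) for k, v in pretrained.items()}

def delta(name):
    # Delta_l = ||theta'_l - theta_l|| / ||theta_l||
    diff = finetuned[name] - pretrained[name]
    return (diff.norm() / pretrained[name].norm()).item()

deltas = {name: delta(name) for name in pretrained}
print(deltas)
```

Averaging these values per category (as in the paper's Figure 3) is what reveals that cross-attention parameters move the most.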