However, the generated image fails to accurately preserve the input image details.
Nevertheless, requiring additional masks for image editing makes the process less intuitive, since the user must provide a perfect mask, limiting their flexibility.
However, these methods still struggle to conduct complex image editing, since the regularization they use is applied to the entire image.
Problem statement: even after adding masks, other components, and extra layers, these methods still fail to preserve the input image details.
Transferring knowledge of diffusion models to specific domains with real images has been studied.
However, they suffer from the following challenges: (I) They lead to unsatisfactory results for selected regions and unexpected changes in non-selected regions (see Fig. 1 (Null-text)). (II) They require the user to provide an accurate text prompt that describes every visual object, and the relationships between them, in the input image (see Fig. 2).
Knowledge transfer such as fine-tuning has been studied. But there are two problems: 1. the selected regions are unsatisfactory and unexpected changes occur elsewhere (Fig. 1); 2. a text prompt describing every visual object, and the relationships between objects in the input image, is required (Fig. 2).
To overcome the above-mentioned challenges, we analyze the role of the attention mechanism (and specifically the roles of keys, queries and values herein) in the diffusion process. This leads to the observation that the key dominates the output image structure (where), whereas the value determines the object style (what). Therefore, we propose to learn the textual embedding with a mapping network, namely the input of the value linear mapping network in the cross-attention layers [4, 36].
To overcome these problems, the attention mechanism is analyzed. The key dominates the output image structure (where), and the value determines the style (what). Therefore, they propose learning the textual embedding with a mapping network.
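A toy numpy sketch of this key/value split (not the paper's implementation; shapes, weight matrices, and the "learned" embedding here are made up for illustration). If only the value-path embedding is learned while the key path stays frozen on the source prompt, the attention maps, and hence the layout (where), are unchanged, while the outputs (what) change:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(img_feats, key_emb, value_emb, Wq, Wk, Wv):
    """Single-head cross-attention.
    key_emb   -> keys   (frozen: controls *where*, the layout)
    value_emb -> values (learned embedding: controls *what*, the style)"""
    Q = img_feats @ Wq                    # (n_pixels, d)
    K = key_emb @ Wk                      # (n_tokens, d)
    V = value_emb @ Wv                    # (n_tokens, d)
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)
    return attn @ V, attn

# Toy shapes: 4 "pixels", 3 text tokens, dim 8
rng = np.random.default_rng(0)
d = 8
img = rng.normal(size=(4, d))
txt = rng.normal(size=(3, d))        # frozen embedding of the source prompt
learned = rng.normal(size=(3, d))    # optimized embedding, fed only to the value path
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

out_a, attn_a = cross_attention(img, txt, txt, Wq, Wk, Wv)
out_b, attn_b = cross_attention(img, txt, learned, Wq, Wk, Wv)

# Same keys -> identical attention maps (structure preserved);
# different values -> different outputs (content/style changes).
assert np.allclose(attn_a, attn_b)
assert not np.allclose(out_a, out_b)
```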
We freeze the input of the key linear mapping network with the corresponding textual embedding of the real image. Furthermore, we observe that DDIM inversion not only provides an approximate trajectory to reconstruct the real image [15, 26], but also object-like attention maps (e.g., Fig. 3 (left corner)).
The input of the key linear mapping network is frozen. Moreover, DDIM inversion provides not only an approximate trajectory for reconstructing the real image, but also object-like attention maps (Fig. 3, left corner).
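A minimal numpy sketch of deterministic DDIM inversion and reconstruction, under toy assumptions: the noise schedule and the ε model below are stand-ins, not the paper's. The ε model here is z-independent, which makes the round trip exact; with a real U-Net ε_θ(z, t) depends on z, which is why the inversion trajectory is only approximate:

```python
import numpy as np

alpha_bar = np.linspace(0.9999, 0.02, 50)   # toy noise schedule ᾱ_t

rng = np.random.default_rng(1)
FIXED_EPS = rng.normal(size=(4, 4))

def eps_model(z, t):
    # Stand-in for the U-Net's noise prediction. Because it ignores z,
    # inversion below is exactly reversible; a real ε_θ(z, t) is not.
    return FIXED_EPS

def ddim_step(z, t_from, t_to):
    """One deterministic DDIM transition between timesteps.
    t_to > t_from inverts (adds noise along the trajectory);
    t_to < t_from denoises. Same formula either way."""
    a_f, a_t = alpha_bar[t_from], alpha_bar[t_to]
    eps = eps_model(z, t_from)
    x0 = (z - np.sqrt(1 - a_f) * eps) / np.sqrt(a_f)   # predicted clean latent
    return np.sqrt(a_t) * x0 + np.sqrt(1 - a_t) * eps

z0 = rng.normal(size=(4, 4))     # "real image" latent

# Invert: z0 -> zT
z = z0
for t in range(len(alpha_bar) - 1):
    z = ddim_step(z, t, t + 1)

# Reconstruct: zT -> z0 by running the same steps backwards
for t in range(len(alpha_bar) - 1, 0, -1):
    z = ddim_step(z, t, t - 1)

assert np.allclose(z, z0)   # the trajectory reconstructs the input
```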
Finally, in the widely used classifier-free diffusion model, the outputs come from both the conditional and unconditional branches, combined via a guidance scale.
Thus, we propose to perform the attention map exchange in the unconditional branch (called P2P+), as well as in the conditional branch, as in P2P [15]. This technique enables us to obtain more accurate editing capabilities.
Finally, with a classifier-free diffusion model, the outputs come from both the conditional and unconditional branches, combined via a guidance scale. But since P2P has a problem here, P2P+ is proposed: attention maps are exchanged not only in the conditional branch but also in the unconditional branch. This technique yields more accurate editing capabilities.
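A short sketch of the classifier-free guidance combination (standard CFG formula; the scale 7.5 and the toy ε arrays are illustrative, not from the paper). It makes the P2P+ motivation concrete: at a large guidance scale the unconditional branch carries a large (negative) weight, so editing attention maps only in the conditional branch leaves much of the output driven by the unedited branch:

```python
import numpy as np

def cfg_noise(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

rng = np.random.default_rng(2)
eps_u = rng.normal(size=(4, 4))   # unconditional branch ε_θ(z_t, ∅)
eps_c = rng.normal(size=(4, 4))   # conditional branch  ε_θ(z_t, c)

eps = cfg_noise(eps_u, eps_c, guidance_scale=7.5)

# Equivalent weighting: 7.5·ε_cond − 6.5·ε_uncond. The unconditional
# branch contributes heavily, which is why P2P+ exchanges attention
# maps in both branches rather than the conditional one alone.
assert np.allclose(eps, 7.5 * eps_c - 6.5 * eps_u)
```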