23.05.27 Rereview

  1. Because paired training data is scarce, training data is obtained by randomly cropping objects out of the input image. This self-reference setting does not generalize on its own, so two remedies are proposed to get around the problem: a. fine-tuning — long fine-tuning makes the model lose its prior knowledge, so an information bottleneck is introduced for the self-reference condition; b. aggressive augmentation on the self-reference image, which effectively reduces the train-test gap.

  2. Compressed representation: CLIP turns the 224x224x3 reference image into a single 1024-dimensional vector that is injected as the image condition. An image is easier for the model to memorize than to understand (unlike text), so its content can simply be copied. This compressed representation preserves semantics while discarding high-frequency details, which forces the model to understand the reference content and keeps the generator from reaching the copy-paste shortcut that would otherwise be the optimum during training.
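A minimal sketch of this information bottleneck, with random weights standing in for the real pretrained CLIP encoder and MLP (the token layout, hidden size, and 768-dim output are illustrative assumptions, not the paper's exact dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)

def clip_image_tokens(image, n_patches=256, dim=1024):
    # Hypothetical stub standing in for a pretrained CLIP ViT encoder:
    # row 0 is the global [CLS] token, the rest are spatial patch tokens.
    assert image.shape == (224, 224, 3)
    return rng.standard_normal((1 + n_patches, dim))

def bottleneck_condition(tokens, w1, w2):
    # Information bottleneck: drop every spatial token and keep only the
    # global image embedding, then project it through a small MLP.
    cls = tokens[0]                   # (dim,) global embedding only
    h = np.maximum(w1 @ cls, 0.0)     # ReLU hidden layer
    return w2 @ h                     # final condition vector

image = rng.integers(0, 256, (224, 224, 3)).astype(np.float32)
tokens = clip_image_tokens(image)
w1 = rng.standard_normal((512, 1024)) * 0.02
w2 = rng.standard_normal((768, 512)) * 0.02
cond = bottleneck_condition(tokens, w1, w2)
print(cond.shape)  # (768,)
```

The key design point is that `bottleneck_condition` never sees the patch tokens, so no spatial detail of the reference can leak straight into the condition.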

  3. Image prior

    To avoid the trivial solution of memorizing the reference image, the model is initialized with a strong image prior. Stable Diffusion is used for two reasons: first, it generates high-quality results on in-the-wild images; second, the pretrained model already understands language and also provides image embeddings.

  4. Aug

    Augmentation techniques such as flips and rotations are used. The condition is obtained through the pipeline Aug→CLIP→MLP.

    An interesting point is that the mask itself is augmented, by distorting the bounding box. The motivation: during training the mask fits the object almost perfectly, but in reality user-drawn masks do not — people draw outside the bounding box. So a Bessel (i.e. Bézier) curve is fitted to the box and its points are randomly shifted by 1-5 pixels.
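The mask distortion can be sketched roughly as follows. This is a simplified stand-in: the paper fits a curve to the bounding box, whereas this sketch only jitters points sampled on the box edges by 1-5 pixels, and all names and parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

def distorted_mask_boundary(x0, y0, x1, y1, pts_per_edge=20):
    # Sample points along the four bounding-box edges, then shift each
    # point by a random 1-5 pixel offset to imitate a sloppy user mask.
    xs = np.linspace(x0, x1, pts_per_edge)
    ys = np.linspace(y0, y1, pts_per_edge)
    top    = np.stack([xs,                 np.full_like(xs, y0)], axis=1)
    right  = np.stack([np.full_like(ys, x1), ys                ], axis=1)
    bottom = np.stack([xs[::-1],           np.full_like(xs, y1)], axis=1)
    left   = np.stack([np.full_like(ys, x0), ys[::-1]          ], axis=1)
    boundary = np.concatenate([top, right, bottom, left])
    # 1-5 pixel jitter, random sign per coordinate
    jitter = rng.integers(1, 6, boundary.shape) * rng.choice([-1, 1], boundary.shape)
    return boundary + jitter

pts = distorted_mask_boundary(50, 60, 180, 200)
print(pts.shape)  # (80, 2)
```

Filling the polygon defined by `pts` then yields the distorted training mask.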

  5. Control similarity degree

    Inspired by classifier-free guidance. At each inference step, the prediction is corrected according to the formula below.

    (formula image missing; presumably the standard classifier-free guidance form)

    $\tilde{\epsilon} = \epsilon(y_t, \varnothing) + s\,\big(\epsilon(y_t, c) - \epsilon(y_t, \varnothing)\big)$
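In code, the per-step correction is the standard classifier-free guidance extrapolation (a sketch under the assumption that `c` is the exemplar condition and `scale` is the similarity-control strength `s`):

```python
import numpy as np

def guided_prediction(eps_uncond, eps_cond, scale):
    # Classifier-free-guidance-style correction applied at every sampling
    # step: extrapolate from the unconditional prediction toward the
    # exemplar-conditioned one; larger `scale` means the output follows
    # the reference more closely.
    return eps_uncond + scale * (eps_cond - eps_uncond)

eps_u = np.zeros((4, 4))   # toy unconditional prediction
eps_c = np.ones((4, 4))    # toy exemplar-conditioned prediction
out = guided_prediction(eps_u, eps_c, 2.0)
print(out[0, 0])  # 2.0
```

With `scale = 1.0` the output reduces exactly to the conditional prediction, and `scale = 0.0` ignores the exemplar entirely.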



We propose an exemplar-based image editing approach that allows accurate semantic manipulation on the image content according to an exemplar image provided by users or retrieved from a database.

To achieve our goal, we train a diffusion model conditioned on the exemplar image. Different from text-guided models, the core challenge is that it is infeasible to collect enough triplet training pairs comprising the source image, exemplar, and corresponding editing ground truth. One workaround is to randomly crop the objects from the input image, which serve as the reference when training the inpainting model.

The model trained from such a self-reference setting, however, cannot generalize to real exemplars, since the model simply learns to copy and paste the reference object into the final output. We identify several key factors that circumvent this issue. The first is to utilize a generative prior. Specifically, a pretrained text-to-image model has the ability to generate high-quality desired results; we leverage it as initialization to avoid falling into the copy-and-paste trivial solution. However, a long time of finetuning may still cause the model to deviate from the prior knowledge and ultimately degenerate again. Hence, we introduce an information bottleneck for self-reference conditioning, in which we drop the spatial tokens and only regard the global image embedding as the condition. In this way, we enforce the network to understand the high-level semantics of the exemplar image and the context from the source image, thus preventing trivial results during the self-supervised training. Moreover, we apply aggressive augmentation on the self-reference image, which can effectively reduce the training-test gap.

To the best of our knowledge, we are the first to address this semantic image composition problem, where the reference is semantically transformed and harmonized before blending into another image, as shown in Figure 1 and Figure 2.

Method

Denote the source image as $x_s \in \mathbb{R}^{H \times W \times 3}$, with $H$ and $W$ being the height and width respectively. The edit region is represented as a binary mask $m \in \{0, 1\}^{H \times W}$, where value 1 specifies the editable positions in $x_s$. Given a reference image $x_r$, our goal is to synthesize an image $y$ from $\{x_s, x_r, m\}$, so that the region where $m = 0$ remains as close as possible to the source image $x_s$, while the region where $m = 1$ depicts an object as similar as possible to the reference image $x_r$ and fits in harmoniously.
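The constraint on $y$ can be written as a simple mask blend — a sketch of the desired output behavior only, not of the diffusion model that produces the edited content:

```python
import numpy as np

def composite(x_s, gen, m):
    # Target behavior of the editing task: keep x_s where m == 0,
    # place the generated (reference-driven) content where m == 1.
    m3 = m[..., None].astype(x_s.dtype)   # broadcast mask over channels
    return m3 * gen + (1.0 - m3) * x_s

x_s = np.zeros((4, 4, 3))          # toy source image
gen = np.ones((4, 4, 3))           # toy generated content
m = np.zeros((4, 4)); m[1:3, 1:3] = 1
y = composite(x_s, gen, m)
print(y[2, 2, 0], y[0, 0, 0])  # 1.0 0.0
```

In the real method the harmonization happens inside the generator, so the boundary between the two regions is blended rather than hard-cut as here.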

This task is very challenging and complex because it implicitly involves several non-trivial procedures.