However, the current leading methods suffer from, to varying degrees, several drawbacks: (i) they are limited to a specific set of edits such as painting over the image, adding an object, or transferring style [6, 33]; (ii) they can operate only on images from a specific domain or synthetically generated images [20, 43]; or (iii) they require auxiliary inputs in addition to the input image, such as image masks indicating the desired edit location, multiple images of the same subject, or a text describing the original image [6, 17, 39, 47, 51].
Problem statement: text-based image editing has advanced a lot, but current methods still have drawbacks.
In this paper, we propose a semantic image editing method that mitigates all the above problems.
Solves the problems mentioned above: takes only a text prompt and an image as input, and delivers sophisticated, precise image editing.
We adapt them in our work to edit real images instead of synthesizing new ones. We do so in a simple 3-step process, as depicted in Figure 3:
Instead of synthesizing a new image, edits are applied to a real image; this works in 3 steps, shown in Fig. 3.
We first optimize a text embedding so that it results in images similar to the input image. Then, we fine-tune the pre-trained generative diffusion model (conditioned on the optimized embedding) to better reconstruct the input image. Finally, we linearly interpolate between the target text embedding and the optimized one, resulting in a representation that combines both the input image and the target text.
First, the text embedding is optimized so that it produces images similar to the input image. Then the generative diffusion model is fine-tuned for better reconstruction. Finally, linearly interpolating between the target text embedding and the optimized one gives a representation that combines both the input image and the target text.
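The third step above reduces to a one-line interpolation. A minimal sketch (the function name and `eta`, the interpolation weight, are placeholders; the actual embeddings come from the paper's earlier stages):

```python
import torch

def interpolate_embedding(e_opt, e_tgt, eta):
    # Step 3 of Imagic: linearly interpolate between the optimized
    # embedding e_opt and the target text embedding e_tgt.
    # eta = 0 keeps the input image; larger eta moves toward the target text.
    return eta * e_tgt + (1.0 - eta) * e_opt
```

In practice an intermediate `eta` is chosen so the result is faithful to the input image while still applying the edit; step 2 (fine-tuning the model on `e_opt`) is what makes small `eta` values reconstruct the input well.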
This is further supported by a human perceptual evaluation study, where raters strongly prefer Imagic over other methods on a novel benchmark called TEdBench – Textual Editing Benchmark.
As additional support, a human perceptual evaluation study: raters strongly preferred Imagic over other methods on a new benchmark called TEdBench.
To achieve this feat, we utilize the text embedding layer of the diffusion model to perform semantic manipulations.
To produce results that match the given text, the authors use the text embedding layer to perform semantic manipulations.
Text embedding optimization
The target text is first passed through a text encoder [46], which outputs its corresponding text embedding e_tgt ∈ ℝ^(T×d), where T is the number of tokens in the given target text, and d is the token embedding dimension. We then freeze the parameters of the generative diffusion model f_θ, and optimize the target text embedding e_tgt using the denoising diffusion objective [22]:
The target text is passed through a text encoder to get its embedding; then the parameters of the generative diffusion model f_θ are frozen, and the text embedding e_tgt is optimized with the objective above.
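A sketch of this optimization stage, assuming a standard DDPM-style denoising objective L = ||f_θ(x_t, t, e) − ε||². The `model` and `alphas_bar` interfaces are assumptions for illustration, not the paper's actual implementation:

```python
import torch

def optimize_text_embedding(model, alphas_bar, x0, e_tgt, steps=50, lr=1e-1):
    # Freeze f_theta: only the text embedding is trained in this stage.
    for p in model.parameters():
        p.requires_grad_(False)
    e = e_tgt.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([e], lr=lr)
    T = alphas_bar.shape[0]
    for _ in range(steps):
        t = torch.randint(0, T, (x0.shape[0],))
        eps = torch.randn_like(x0)
        a = alphas_bar[t].view(-1, *([1] * (x0.dim() - 1)))
        # DDPM forward process: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps
        x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * eps
        # Denoising objective: predict the noise from the noisy image,
        # conditioned on the (trainable) embedding e.
        loss = ((model(x_t, t, e) - eps) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return e.detach()
```

Stage 2 of the paper is the mirror image of this loop: the optimized embedding is held fixed and the model parameters θ are trained instead, with the same objective.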