Untitled

Existing approaches for this task [34, 6, 43] are usually built upon pre-trained text-to-image models [33, 35] to fully utilize the image generation capacity of big models.

The first direction is to invert the input images to the textual space so that the pre-trained models have a deep understanding of the concept.

However, the pre-trained model needs to be finetuned for many steps to learn each new concept, and the learned model weights must be stored per concept. This is both time- and storage-consuming, which greatly limits the scalability of these approaches.

The second direction is to learn an image-to-image mapping with text guidance directly.

However, it can primarily perform pixel-to-pixel mapping while failing to generate objects with large pose variations or change the location of objects.

Existing t2i models for this task fall into two directions:

  1. Invert the input images into the textual space, so the pretrained model gains a deep understanding of the concept. Problem 1: many finetuning steps are needed to learn each new concept, and the model weights must be stored per concept. This costs both time and storage, so scalability is greatly limited.

  2. Learn an image-to-image mapping directly with text guidance. Problem 2: it fails to generate objects with large pose variations or to change the location of objects.

This paper aims to address these challenges by eliminating test-time finetuning for personalized image generation.

The paper tackles these problems by removing the need for test-time finetuning in personalized image generation.

Instead of inverting the input images to a token with inefficient online optimization, we propose to learn the general concept of the input images by learning an image encoder that maps them to a compact textual embedding.

Instead of inverting the input images into a token via inefficient online optimization, an image encoder is trained to capture the general concept of the input images and map them to a compact textual embedding.
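A minimal sketch of this idea, in numpy: pool features of a few input images (e.g. from a frozen vision backbone) and project them into the text encoder's embedding space. The dimensions, the projection, and the pooling are all assumptions for illustration — the paper's actual encoder architecture is not given in these notes.

```python
import numpy as np

# Hypothetical dimensions (assumed, not from the paper).
FEAT_DIM = 768   # feature dim of a frozen vision backbone
TEXT_DIM = 1024  # token embedding dim of the text encoder

rng = np.random.default_rng(0)
W = rng.normal(0, 0.02, size=(FEAT_DIM, TEXT_DIM))  # learnable projection

def encode_concept(image_feats: np.ndarray) -> np.ndarray:
    """Map features of a few images (n, FEAT_DIM) to one compact
    textual embedding (TEXT_DIM,): pool across images, then project."""
    pooled = image_feats.mean(axis=0)  # aggregate the concept over the images
    return pooled @ W                  # project into text embedding space

feats = rng.normal(size=(3, FEAT_DIM))  # features of 3 input images
concept_emb = encode_concept(feats)
print(concept_emb.shape)  # (1024,)
```

Because the projection is a learned feed-forward map rather than per-concept optimization, a new concept costs one forward pass instead of many finetuning steps.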

Our new components are trained on only text-image pairs without using any paired images of the same concept.

The new components are trained only on text-image pairs, without using any paired images of the same concept.

1. Method


Overall Framework

Given a few images of a concept, the goal is to generate new high-quality images of this concept from text description p. The generated image variations should preserve the identity of the input concept.

Given a few images of a concept, generate new high-quality images of that concept from a text description p. The generated image variations should preserve the identity of the input concept.

The overall framework of our model is shown in Fig. 2. Our model is built upon a pre-trained text-to-image model. We first inject a unique identifier Vˆ to the input prompt to represent the object concept, then use a learnable image encoder to map the input images to a concept textual embedding.

The overall framework of the model is shown in Fig. 2. A unique identifier V^ is injected into the input prompt to represent the object concept, then a learnable image encoder maps the input images to a concept textual embedding.
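The injection step can be sketched as follows: the prompt is embedded as usual, and at the position of the identifier token the image-derived concept embedding is substituted before the sequence reaches the text encoder. The toy vocabulary, the `<v^>` token name, and the dimensions are assumptions for illustration only.

```python
import numpy as np

TEXT_DIM = 1024  # token embedding dim (assumed)
rng = np.random.default_rng(1)

# Toy vocabulary; "<v^>" stands in for the unique identifier V^.
vocab = {"a": 0, "photo": 1, "of": 2, "<v^>": 3, "dog": 4}
embed_table = rng.normal(size=(len(vocab), TEXT_DIM))

def embed_prompt(tokens, concept_emb):
    """Embed the prompt, substituting the concept embedding at the
    identifier token so the frozen model sees the new concept."""
    embs = np.stack([embed_table[vocab[t]] for t in tokens])
    for i, t in enumerate(tokens):
        if t == "<v^>":
            embs[i] = concept_emb  # inject the image-derived embedding
    return embs

concept_emb = rng.normal(size=TEXT_DIM)  # output of the image encoder
seq = embed_prompt(["a", "photo", "of", "<v^>"], concept_emb)
print(seq.shape)  # (4, 1024)
```

Only the token embedding at the identifier's position changes; the rest of the prompt and the pre-trained text-to-image model stay untouched.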