
However, these approaches are still prone to forgetting prior knowledge, or face difficulties in accessing it concurrently with newly learned concepts.

Problem statement: the paper points out that prior approaches such as re-training or freezing the model easily forget previously learned knowledge and have difficulty accessing it concurrently with newly learned concepts.

We propose to overcome these challenges by finding new words in the textual embedding space of pre-trained text-to-image models. We consider the first stage of the text encoding process (Figure 2). Here, an input string is first converted to a set of tokens. Each token is then replaced with its own embedding vector, and these vectors are fed through the downstream model. Our goal is to find new embedding vectors that represent new, specific concepts.

We represent a new embedding vector with a new pseudo-word (Rathvon, 2004) which we denote by S∗. This pseudo-word is then treated like any other word, and can be used to compose novel textual queries for the generative models ("a photograph of S∗ on the beach", etc.).

Following the process in Figure 2, this is solved by finding new words in the textual embedding space of a pretrained text-to-image model. A pseudo-word S∗ is assigned to the new embedding vector, which lets the method exploit the model's rich textual understanding and generative capability.

Method

Text embeddings.

Typical text encoder models, such as BERT, begin with a text processing step (Figure 2, left). First, each word or sub-word in an input string is converted to a token, which is an index in some pre-defined dictionary. Each token is then linked to a unique embedding vector that can be retrieved through an index-based lookup. These embedding vectors are typically learned as part of the text encoder cθ.
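
As a minimal illustration of this lookup step, a toy PyTorch sketch (the vocabulary and embedding dimension here are made up for the example, not taken from any real encoder):

```python
import torch
import torch.nn as nn

# Toy pre-defined dictionary mapping words to token indices.
vocab = {"a": 0, "photo": 1, "of": 2, "cat": 3}

# Embedding table: one learned vector per token (dimension is illustrative).
embedding_table = nn.Embedding(num_embeddings=len(vocab), embedding_dim=768)

# String -> tokens -> per-token embedding vectors fed to the downstream model.
tokens = torch.tensor([vocab[w] for w in "a photo of cat".split()])
vectors = embedding_table(tokens)
print(vectors.shape)  # torch.Size([4, 768])
```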

In our work, we choose this embedding space as the target for inversion. Specifically, we designate a placeholder string, S∗, to represent the new concept we wish to learn. We intervene in the embedding process and replace the vector associated with the tokenized string with a new, learned embedding v∗, in essence “injecting” the concept into our vocabulary.

A text encoder such as BERT is used, and inversion is performed in its embedding space using the placeholder S∗, which lets the model learn the new concept. The vector associated with the tokenized placeholder is replaced with the newly learned embedding v∗, thereby injecting the concept into the vocabulary.
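
For concreteness, one way this injection could look with the Hugging Face transformers BERT classes (the checkpoint name, the "<s*>" placeholder string, and the initializer word are assumptions of this sketch, not the paper's code):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text_encoder = BertModel.from_pretrained("bert-base-uncased")

# Register the placeholder string S* as a new entry in the dictionary.
tokenizer.add_tokens(["<s*>"])
text_encoder.resize_token_embeddings(len(tokenizer))

# The row of the embedding table behind "<s*>" is the learned vector v*;
# here it is initialized from a coarse descriptor of the concept.
placeholder_id = tokenizer.convert_tokens_to_ids("<s*>")
init_id = tokenizer.convert_tokens_to_ids("sculpture")
with torch.no_grad():
    emb = text_encoder.get_input_embeddings().weight
    emb[placeholder_id] = emb[init_id].clone()
```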

Textual inversion.

To find these new embeddings, we use a small set of images (typically 3-5), which depicts our target concept across multiple settings such as varied backgrounds or poses. We find v∗ through direct optimization, by minimizing the LDM loss of Equation (1) over images sampled from the small set. To condition the generation, we randomly sample neutral context texts, derived from the CLIP ImageNet templates (Radford et al., 2021). These contain prompts of the form “A photo of S∗”, “A rendition of S∗”, etc. The full list of templates is provided in the supplementary materials.
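
A sketch of that conditioning step with a few representative templates (this is only a small illustrative subset; the full list is in the paper's supplementary materials):

```python
import random

# Neutral context templates in the style of the CLIP ImageNet set.
templates = [
    "a photo of a {}",
    "a rendition of a {}",
    "a cropped photo of the {}",
    "a good photo of a {}",
]

def sample_prompt(placeholder: str = "S*") -> str:
    """Randomly sample a neutral context text around the pseudo-word."""
    return random.choice(templates).format(placeholder)

print(sample_prompt())  # e.g. "a rendition of a S*"
```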

Our optimization goal can then be defined as:

$$v_* = \arg\min_{v}\; \mathbb{E}_{z \sim \mathcal{E}(x),\, y,\, \epsilon \sim \mathcal{N}(0,1),\, t}\left[\lVert \epsilon - \epsilon_\theta(z_t, t, c_\theta(y)) \rVert_2^2\right],$$

and is realized by re-using the same training scheme as the original LDM model, while keeping both cθ and εθ fixed. Notably, this is a reconstruction task. As such, we expect it to motivate the learned embedding to capture fine visual details unique to the concept.

Summary: v∗ is found through direct optimization, minimizing the LDM loss over images sampled from a small set (typically 3-5) showing the concept against varied backgrounds and poses. To condition the generation, neutral context texts randomly sampled from the CLIP ImageNet templates are used.
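
Putting the pieces together, a condensed, self-contained sketch of this loop: tiny stand-in networks replace the frozen LDM components (cθ as text encoder, εθ as denoiser), and only v∗ receives gradients. Shapes, the noise schedule, and the modules are toy assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, latent = 64, 16
c_theta = nn.Linear(dim, dim)                    # stand-in for the frozen text encoder
eps_theta = nn.Linear(latent + dim + 1, latent)  # stand-in for the frozen denoiser
for p in list(c_theta.parameters()) + list(eps_theta.parameters()):
    p.requires_grad_(False)                      # keep c_theta and eps_theta fixed

v_star = nn.Parameter(torch.randn(dim) * 0.01)   # the learned embedding v*
opt = torch.optim.Adam([v_star], lr=5e-3)

z = torch.randn(8, latent)                       # toy latents of the concept images
for step in range(100):
    t = torch.rand(8, 1)                         # toy diffusion timestep
    eps = torch.randn_like(z)
    z_t = (1 - t) * z + t * eps                  # noised latents (illustrative schedule)
    cond = c_theta(v_star).expand(8, dim)        # conditioning derived from v*
    pred = eps_theta(torch.cat([z_t, cond, t], dim=1))
    loss = F.mse_loss(pred, eps)                 # the LDM reconstruction loss
    opt.zero_grad()
    loss.backward()
    opt.step()                                   # only v* is updated
```

The design point mirrors the text: because cθ and εθ are frozen, the reconstruction loss can only be reduced by shaping v∗, which pushes the embedding to encode the concept's fine visual details.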