Introducing new concepts into large-scale models is often difficult. Re-training a model with an expanded dataset for each new concept is prohibitively expensive, and fine-tuning on a few examples typically leads to catastrophic forgetting (Ding et al., 2022; Li et al., 2022).
We propose to overcome these challenges by finding new words in the textual embedding space of pre-trained text-to-image models. We consider the first stage of the text encoding process (Figure 2). Here, an input string is first converted to a set of tokens. Each token is then replaced with its own embedding vector, and these vectors are fed through the downstream model. Our goal is to find new embedding vectors that represent new, specific concepts.
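As a minimal illustration of this first stage, the sketch below registers a placeholder pseudo-word and looks up its embedding. It assumes a Hugging Face CLIP tokenizer and text encoder; the placeholder string "<S*>" and the variable names are illustrative, not part of our method.

```python
# Minimal sketch of the first stage of text encoding: a string is split into tokens,
# and each token id is mapped to a vector via the encoder's embedding table.
# Assumes a Hugging Face CLIP text encoder; "<S*>" is an illustrative placeholder.
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Register the new pseudo-word and grow the embedding table by one row.
tokenizer.add_tokens(["<S*>"])
text_encoder.resize_token_embeddings(len(tokenizer))

# The newly added row of the embedding table is the vector we aim to learn.
token_ids = tokenizer("A photo of <S*>", return_tensors="pt").input_ids
token_embeddings = text_encoder.get_input_embeddings()(token_ids)  # (1, seq_len, dim)
```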
We represent a new embedding vector with a new pseudo-word (Rathvon, 2004) which we denote by S∗. This pseudo-word is then treated like any other word, and can be used to compose novel textual queries for the generative models. One can therefore ask for “a photograph of S∗ on the beach”, “an oil painting of an S∗ hanging on the wall”, or even compose two concepts, such as “a drawing of S¹∗ in the style of S²∗”. Importantly, this process leaves the generative model untouched. In doing so, we retain the rich textual understanding and generalization capabilities that are typically lost when fine-tuning vision and language models on new tasks.
To find these pseudo-words, we frame the task as one of inversion. We are given a fixed, pre-trained text-to-image model and a small set of images (3-5) depicting the concept. We aim to find a single word embedding, such that sentences of the form “A photo of S∗” will lead to the reconstruction of images from our small set. This embedding is found through an optimization process, which we refer to as “Textual Inversion”.
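A schematic version of this optimization, reusing the tokenizer and text encoder from the sketch above, is shown below. It is not the exact training code: `diffusion_loss` and `small_image_set` are placeholders for the frozen model's reconstruction objective and the 3-5 concept images.

```python
# Schematic Textual Inversion loop (a sketch, not the exact implementation):
# every network weight stays frozen, and only the embedding row of "<S*>"
# is effectively updated. `diffusion_loss(image, ids)` is a placeholder that
# runs the frozen text-to-image model on the prompt and returns its loss.
import torch

for p in text_encoder.parameters():
    p.requires_grad_(False)

embedding = text_encoder.get_input_embeddings()            # token lookup table
embedding.weight.requires_grad_(True)
new_token_id = tokenizer.convert_tokens_to_ids("<S*>")
frozen_rows = embedding.weight.detach().clone()            # remember original vectors

optimizer = torch.optim.Adam(embedding.parameters(), lr=5e-3)

for image in small_image_set:                              # the 3-5 concept images
    ids = tokenizer("A photo of <S*>", return_tensors="pt").input_ids
    loss = diffusion_loss(image, ids)                       # frozen generator's objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Restore every other row so only the pseudo-word's embedding actually changes.
    with torch.no_grad():
        keep = torch.ones(embedding.weight.shape[0], dtype=torch.bool)
        keep[new_token_id] = False
        embedding.weight[keep] = frozen_rows[keep]
```

In practice the loop runs for many iterations over varied prompt templates, but the essential point is that the optimization target is a single embedding vector.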
In summary, our contributions are as follows:
• We introduce the task of personalized text-to-image generation, where we synthesize novel scenes of user-provided concepts guided by natural-language instructions.
• We present the idea of “Textual Inversion” in the context of generative models. Here, the goal is to find new pseudo-words in the embedding space of a text encoder that capture both high-level semantics and fine visual details.
• We analyze the embedding space in light of GAN-inspired inversion techniques and demonstrate that it also exhibits a tradeoff between distortion and editability. We show that our approach resides on an appealing point on the tradeoff curve.
• We evaluate our method against images generated using user-provided captions of the concepts, and demonstrate that our embeddings provide higher visual fidelity and enable more robust editing.
Text-guided synthesis.
Rather than training conditional models, several approaches employ test-time optimization to explore the latent spaces of a pre-trained generator (Crowson et al., 2022; Murdock, 2021; Crowson, 2021). These approaches typically guide the optimization by minimizing a text-to-image similarity score derived from an auxiliary model such as CLIP (Radford et al., 2021).
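The pattern is roughly the following hedged sketch: `generator` stands in for any frozen pre-trained image generator (e.g. a GAN or VQGAN decoder), and CLIP's input normalization is omitted for brevity.

```python
# Sketch of CLIP-guided test-time optimization: a frozen generator's latent code is
# optimized so the output image matches a text prompt under CLIP similarity.
# `generator` is a placeholder for a pre-trained model.
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP

clip_model, _ = clip.load("ViT-B/32", device="cpu")
clip_model.requires_grad_(False)

with torch.no_grad():
    text_feat = clip_model.encode_text(clip.tokenize(["a watercolor painting of a fox"]))

latent = torch.randn(1, 256, requires_grad=True)      # latent shape is model-specific
optimizer = torch.optim.Adam([latent], lr=0.05)

for _ in range(300):
    image = generator(latent)                          # frozen pre-trained generator
    image_224 = F.interpolate(image, size=(224, 224), mode="bilinear")
    image_feat = clip_model.encode_image(image_224)    # CLIP normalization omitted here
    loss = -F.cosine_similarity(image_feat, text_feat).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```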
GAN inversion.
Manipulating images with generative networks often requires one to find a corresponding latent representation of the given image, a process referred to as inversion (Zhu et al., 2016; Xia et al., 2021).
Optimization methods directly optimize a latent vector, such that feeding it through the GAN will re-create a target image. Encoders leverage a large image set to train a network that maps images to their latent representations.
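A minimal sketch of the optimization route is given below, with `G` and `target` as placeholders for a frozen pre-trained generator and the image to invert.

```python
# Optimization-based GAN inversion: find a latent code whose generated image
# reproduces the target. `G` and `target` are placeholders; real pipelines
# typically add a perceptual (e.g. LPIPS) term and invert into StyleGAN's W space.
import torch
import torch.nn.functional as F

z = torch.randn(1, 512, requires_grad=True)    # latent dimensionality is model-specific
optimizer = torch.optim.Adam([z], lr=0.01)

for _ in range(1000):
    reconstruction = G(z)                       # frozen pre-trained generator
    loss = F.mse_loss(reconstruction, target)   # pixel loss; often combined with LPIPS
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```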
In our work, we follow the optimization approach, as it can better adapt to unseen concepts. Encoders face harsher generalization requirements, and would likely need to be trained on web-scale data to offer the same freedom. We further analyze our embedding space in light of the GAN-inversion literature, outlining the core principles that remain and those that do not.
Personalization.
Adapting models to a specific individual or object is a long-standing goal in machine learning research. Personalized models are typically found in the realms of recommendation systems (Benhamdi et al., 2017; Amat et al., 2018; Martinez et al., 2009; Cho et al., 2002) or in federated learning (Mansour et al., 2020; Jiang et al., 2019; Fallah et al., 2020; Shamsian et al., 2021).