InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning

This paper proposes InstantBooth, an approach to personalized text-to-image generation that requires no test-time finetuning. InstantBooth builds on a pre-trained text-to-image model and adds two components: a learnable image encoder and a few adapter layers. The image encoder learns the general concept of the input images, while the adapter layers learn rich visual feature representations to keep the fine details of the identity. Both components are trained only on text-image pairs, without paired images of the same concept. Experiments show that, on unseen concepts, InstantBooth generates competitive results in language-image alignment, image fidelity, and identity preservation while being 100 times faster than existing test-time finetuning-based methods.
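
The two components described above can be illustrated with a toy, dependency-free sketch. It is not the authors' implementation. The function names are hypothetical. The mean-pooling encoder stands in for a trained network. The placeholder-token replacement and the zero-initialized tanh gate are assumptions borrowed from common adapter designs, not details stated in this summary.

```python
import math
from typing import List

Vector = List[float]


def encode_concept(image_features: List[Vector]) -> Vector:
    """Stand-in for the learnable image encoder (assumption: mean pooling).

    Collapses the features of several input images into one concept embedding.
    """
    n = len(image_features)
    dim = len(image_features[0])
    return [sum(f[i] for f in image_features) / n for i in range(dim)]


def inject_concept_token(prompt_embeddings: List[Vector],
                         placeholder_index: int,
                         concept: Vector) -> List[Vector]:
    """Return a copy of the prompt embeddings with the placeholder token
    replaced by the concept embedding (assumed conditioning mechanism)."""
    out = [list(v) for v in prompt_embeddings]
    out[placeholder_index] = list(concept)
    return out


def attend(query: Vector, keys: List[Vector]) -> Vector:
    """Single-query softmax attention; keys double as values for brevity."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * key[i] for w, key in zip(weights, keys)) / total
            for i in range(len(query))]


def gated_adapter(hidden: Vector, visual_features: List[Vector],
                  gate: float) -> Vector:
    """Hypothetical residual adapter: hidden + tanh(gate) * attention output.

    With gate = 0 the frozen model's hidden state passes through unchanged,
    so training can start from the pre-trained model's behavior.
    """
    update = attend(hidden, visual_features)
    g = math.tanh(gate)
    return [h + g * u for h, u in zip(hidden, update)]
```

In this sketch only the encoder and the gate would be trained, while the pre-trained text-to-image model stays frozen, which is consistent with inference needing a single forward pass per concept instead of a finetuning run.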

Key insights and lessons learned from the paper:

Questions for the authors:

Related topics or future research directions:

References: