Our model directly performs the image edit in the forward pass, and does not require any additional example images, full descriptions of the input/output images, or per-example finetuning.

Despite being trained entirely on synthetic examples (i.e., both generated written instructions and generated imagery), our model achieves zero-shot generalization to both arbitrary real images and natural human-written instructions.

1 Prior work

Learning to follow instructions

Our method differs from existing text-based image editing works [6,13,17,28,39,53] in that it enables editing from instructions that tell the model what action to perform, as opposed to text labels, captions, or descriptions of the input/output images.

A key benefit of following editing instructions is that the user can simply tell the model exactly what to do in natural written text. There is no need for the user to provide extra information, such as example images or full descriptions of the input and output images.

Training data generation with generative models

As generative models continue to improve, there is growing interest in their use as a source of cheap and plentiful training data for downstream tasks [33, 47, 50, 58, 65, 66]. In this paper, we use two different off-the-shelf generative models (language and text-to-image) to produce training data for our editing model.

2 Method

Generating a Multi-modal Training Dataset

We first operate entirely in the text domain, where we leverage a large language model to take in image captions and produce editing instructions and the resulting text captions after the edit. For example, as shown in Figure 2a, provided the input caption “photograph of a girl riding a horse”, our language model can generate both a plausible edit instruction “have her ride a dragon” and an appropriately modified output caption “photograph of a girl riding a dragon”. Operating in the text domain enables us to generate a large and diverse collection of edits, while maintaining correspondence between the image changes and text instructions.

We obtain this instruction-generation model by finetuning GPT-3 on a relatively small human-written dataset of editing triplets: (1) input captions, (2) edit instructions, and (3) output captions.
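As a concrete illustration, each human-written triplet can be serialized into the prompt/completion JSONL format used for GPT-3 finetuning. The helper name and the exact serialization below are assumptions for illustration, not the paper's actual format:

```python
import json

# Hypothetical helper (not from the paper): serialize one human-written
# editing triplet as a prompt/completion pair for GPT-3 finetuning.
def triplet_to_example(input_caption, instruction, output_caption):
    return {
        "prompt": input_caption,
        "completion": f"{instruction}\n{output_caption}",
    }

triplets = [
    ("photograph of a girl riding a horse",
     "have her ride a dragon",
     "photograph of a girl riding a dragon"),
]

# One JSON object per line, as expected by JSONL-based finetuning tools.
jsonl = "\n".join(json.dumps(triplet_to_example(*t)) for t in triplets)
print(jsonl)
```

Once finetuned, the model is given only a new input caption as the prompt and completes it with a plausible instruction and edited caption.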

Generating Paired Images from Paired Captions

Next, we use a pretrained text-to-image model to transform a pair of captions (referring to the image before and after the edit) into a pair of images. One challenge in turning a pair of captions into a pair of corresponding images is that text-to-image models provide no guarantees about image consistency, even under very minor changes to the conditioning prompt. This is unsuitable for our purposes, where we intend to use this paired data as supervision for training a model to edit images (and not produce a different random image). We therefore use Prompt-to-Prompt [17], a recent method aimed at encouraging multiple generations from a text-to-image diffusion model to be similar. This is done by borrowing the cross-attention weights of the first generation for some number of denoising steps. Figure 3 shows a comparison of sampled images with and without Prompt-to-Prompt.
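The attention-borrowing control flow can be sketched in a toy form. The real method operates on the cross-attention maps inside a diffusion U-Net; the sketch below replaces those with random arrays and only illustrates the schedule: for the first fraction `p` of denoising steps, the edited generation reuses the original generation's attention maps (the value of `p` here is an arbitrary assumption; the paper samples it per example):

```python
import numpy as np

num_steps = 50   # denoising steps
p = 0.8          # fraction of steps that share cross-attention (assumed value)

def cross_attention(prompt_seed, step):
    # Stand-in for the U-Net's cross-attention maps at one denoising step;
    # a real implementation would read these out of the diffusion model.
    return np.random.default_rng(prompt_seed * 1000 + step).random((8, 8))

shared_steps = 0
for step in range(num_steps):
    attn_original = cross_attention(prompt_seed=1, step=step)
    if step < int(p * num_steps):
        # Prompt-to-Prompt: inject the original prompt's attention maps
        # into the edited generation, keeping layout and identity consistent.
        attn_edited = attn_original
        shared_steps += 1
    else:
        # Remaining steps use the edited prompt's own attention,
        # letting the instructed change actually take effect.
        attn_edited = cross_attention(prompt_seed=2, step=step)

print(shared_steps)  # 40 of the 50 steps borrow attention
```

Varying `p` trades off similarity between the two images against the strength of the edit, which is why a range of values is useful when generating training pairs.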