Despite exciting progress, existing large-scale text-to-image generation models cannot be conditioned on other input modalities apart from text, and thus lack the ability to precisely localize concepts or use reference images to control the generation process.

Problem statement: text-to-image models cannot take input modalities other than text, so they lack the ability to localize concepts or control the generation process.

As shown in Figure 1, we still retain the text caption as input, but also enable other input modalities such as bounding boxes for grounding concepts, grounding reference images, and grounding part keypoints.

To prevent knowledge forgetting, we propose to freeze the original model weights and add new trainable gated Transformer layers [65] that take in the new grounding input (e.g., bounding box). During training, we gradually fuse the new grounding information into the pretrained model using a gated mechanism [1]. This design enables flexibility in the sampling process during generation for improved quality and controllability.

To preserve the huge amount of concept knowledge already in the pretrained model when new grounding information is added, a freezing scheme is proposed: the original weights are frozen, and during training the new grounding information is gradually fused into the pretrained model.
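The gated mechanism can be sketched as a small trainable attention layer whose output is scaled by tanh(γ), with γ initialized to zero so the frozen model's behaviour is untouched at the start of training. The sketch below is illustrative, not the paper's code; the class name `GatedSelfAttention`, the use of `nn.MultiheadAttention`, and the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class GatedSelfAttention(nn.Module):
    """Illustrative gated layer added alongside the frozen layers of a pretrained block.

    Visual tokens attend over [visual tokens; grounding tokens]; the attention output
    is scaled by tanh(gamma), with gamma initialized to 0 so training starts from the
    unmodified pretrained behaviour.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gamma = nn.Parameter(torch.zeros(1))  # gate, starts closed

    def forward(self, visual: torch.Tensor, grounding: torch.Tensor) -> torch.Tensor:
        # visual: (B, N_v, dim) image tokens; grounding: (B, N_g, dim) grounding tokens
        x = self.norm(torch.cat([visual, grounding], dim=1))
        out, _ = self.attn(x, x, x)
        out = out[:, : visual.shape[1]]                # keep only the visual-token outputs
        return visual + torch.tanh(self.gamma) * out   # gated residual update

# Only the new layers are trained; the pretrained weights stay frozen, e.g.:
#   for p in pretrained_unet.parameters(): p.requires_grad = False
#   for p in gated_layer.parameters():     p.requires_grad = True
```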

1. Open-set Grounded Image Generation

Grounding Instruction Input

We denote as e the grounding entity described either through text or an example image, and as l the grounding spatial configuration described with e.g., a bounding box or a set of keypoints. We define the instruction to a grounded text-to-image model as a composition of the caption and grounded entities:

$y = (c, e), \quad e = [(e_1, l_1), \cdots, (e_N, l_N)]$

where $N$ is the number of grounded entities.

We process both caption and grounding entities as input tokens to the diffusion model, as described in detail below.

c is the caption and l is the grounding spatial configuration; both are fed to the diffusion model as tokens.
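As a concrete picture of the instruction format, a minimal data structure could look like the sketch below; the class name, example caption, and box coordinates are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class GroundingInstruction:
    caption: str                               # c
    entities: List[Tuple[str, List[float]]]    # e = [(e_1, l_1), ..., (e_N, l_N)]

instruction = GroundingInstruction(
    caption="a dog playing with a ball in a park",
    entities=[
        ("a dog",  [0.10, 0.35, 0.45, 0.90]),  # l as a normalized bounding box
        ("a ball", [0.55, 0.60, 0.70, 0.75]),
    ],
)
```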


Caption Tokens

The caption c is processed in the same way as in LDM. Specifically, we obtain the caption feature sequence (yellow tokens in Figure 2(b)) using $h^c = [h^c_1, \cdots, h^c_L] = f_{\text{text}}(c)$, where $h^c_\ell$ is the contextualized text feature for the $\ell$-th word in the caption.

This is Figure 2(b). Caption tokens are obtained in exactly the same way as in LDM.
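Assuming the CLIP text encoder that Stable-Diffusion-style LDMs use as f_text, the caption tokens can be obtained roughly as below; the checkpoint name is an assumption, not something specified in this section.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# f_text: the (frozen) text encoder of the pretrained LDM
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").eval()

caption = "a dog playing with a ball in a park"
inputs = tokenizer(caption, padding="max_length", truncation=True, return_tensors="pt")
with torch.no_grad():
    h_c = text_encoder(**inputs).last_hidden_state  # (1, L, 768): one contextualized feature per token
```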

Grounding Tokens

For each grounded text entity denoted with a bounding box, we represent the location information as $l = [\alpha_{\min}, \beta_{\min}, \alpha_{\max}, \beta_{\max}]$ with its top-left and bottom-right coordinates. For the text entity e, we use the same pretrained text encoder to obtain its text feature $f_{\text{text}}(e)$ (light green token in Figure 2(b)), and then fuse it with its bounding box information to produce a grounding token (dark green token in Figure 2(b)):

l is the location information; the text entity e is passed through the text encoder f_text(e) to get its feature, which is then fused with the box to obtain the grounding token (dark green token in Figure 2(b)).

$h^e = \text{MLP}(f_{\text{text}}(e), \text{Fourier}(l))$

Fourier is the Fourier embedding [42], and MLP(·, ·) is a multi-layer perceptron that first concatenates the two inputs across the feature dimension. The grounding token sequence is represented as $h^e = [h^e_1, \cdots, h^e_N]$.
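A minimal sketch of the grounding-token computation $h^e = \text{MLP}(f_{\text{text}}(e), \text{Fourier}(l))$ is shown below; the number of Fourier frequencies and the MLP hidden size are assumptions, not values taken from the paper.

```python
import math
import torch
import torch.nn as nn

def fourier_embedding(x: torch.Tensor, num_freqs: int = 8) -> torch.Tensor:
    """Fourier features of the box coordinates: sin/cos at geometrically spaced frequencies."""
    freqs = (2.0 ** torch.arange(num_freqs, dtype=x.dtype)) * math.pi       # (F,)
    angles = x[..., None] * freqs                                           # (..., 4, F)
    return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(-2)      # (..., 8F)

class GroundingTokenizer(nn.Module):
    def __init__(self, text_dim: int = 768, num_freqs: int = 8, out_dim: int = 768):
        super().__init__()
        self.num_freqs = num_freqs
        box_dim = 4 * 2 * num_freqs
        self.mlp = nn.Sequential(                      # MLP(., .): concatenate, then project
            nn.Linear(text_dim + box_dim, 512), nn.SiLU(), nn.Linear(512, out_dim)
        )

    def forward(self, text_feat: torch.Tensor, box: torch.Tensor) -> torch.Tensor:
        # text_feat: (N, text_dim) = f_text(e); box: (N, 4) = [a_min, b_min, a_max, b_max]
        fused = torch.cat([text_feat, fourier_embedding(box, self.num_freqs)], dim=-1)
        return self.mlp(fused)                         # h^e: (N, out_dim), one token per entity
```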