We can refer to the conditioning space defined by the token embedding space of the language model as P space. In other words, P is the textual conditioning space, where during synthesis, an instance p ∈ P (after passing through a text encoder) is injected into all attention layers of a U-net, as illustrated in Figure 1 (left). In this paper, we introduce the Extended Textual Conditioning space. This space, referred to as P+ space, consists of n textual conditions {p1, p2, ..., pn}, where each pi is injected into the corresponding layer i in the U-net (see Figure 1 (right)).
Summary: P space is the conditioning space defined by the language model's token embedding space. An instance p ∈ P is injected into every attention layer of the U-Net (Fig. 1, left). This paper proposes a way to extend this textual conditioning space, called P+ space: n textual conditions (p1, p2, ..., pn), with one condition injected into each layer.
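The difference between P and P+ conditioning can be sketched in a few lines of Python. The layer count and function names below are illustrative assumptions, not the paper's actual API:

```python
NUM_CROSS_ATTN_LAYERS = 16  # illustrative; depends on the U-Net architecture

def p_space_condition(prompt_emb):
    # P: a single encoded prompt p, broadcast to every cross-attention layer
    return [prompt_emb] * NUM_CROSS_ATTN_LAYERS

def p_plus_condition(per_layer_embs):
    # P+: n textual conditions {p1, ..., pn}; layer i receives p_i
    assert len(per_layer_embs) == NUM_CROSS_ATTN_LAYERS
    return per_layer_embs
```

In P every layer sees the same embedding; in P+ each layer can be conditioned independently.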
This learned token can then be employed in a text prompt to produce diverse and novel images related to the concept provided by the user. This technique of learning tokens is referred to as Textual Inversion (TI) [15].
Summary: The learned token can then be used in a text prompt to generate diverse, novel images of the user-provided concept. This token-learning technique is called Textual Inversion (TI).
In our work, we introduce Extended Textual Inversion (XTI), where we invert the input images into a set of token embeddings, one per layer, namely, inversion into P+. Our findings reveal that the expanded inversion process in P+ is not only faster than TI, but also more expressive and precise, owing to the increased number of tokens that provide superior reconstruction capabilities.
Summary: This work introduces Extended Textual Inversion (XTI), which inverts the input images into a set of token embeddings, one per layer, i.e., inversion into P+. XTI is not only faster than TI but also more expressive and precise; the larger number of tokens yields better reconstruction.
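A minimal sketch of the two parameterizations (embedding width, layer count, and initialization are illustrative assumptions): TI optimizes a single token embedding shared by all layers, while XTI optimizes one token embedding per cross-attention layer.

```python
import numpy as np

def ti_tokens(d: int = 768, seed: int = 0) -> np.ndarray:
    """TI: one learnable token embedding, shared by all layers."""
    return np.random.default_rng(seed).normal(scale=0.02, size=(1, d))

def xti_tokens(n_layers: int = 16, d: int = 768, seed: int = 0) -> np.ndarray:
    """XTI: one learnable token embedding per cross-attention layer."""
    return np.random.default_rng(seed).normal(scale=0.02, size=(n_layers, d))
```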
Specifically, we insert inverted tokens of diverse subjects into the different layers to capitalize on the inherent shape-style disentanglement exhibited by these layers. This approach enables us to achieve previously unattainable results, as shown in Figure 2.
Summary: Inverted tokens of different subjects are inserted into different layers to exploit the inherent shape-style disentanglement those layers exhibit, enabling results that were previously hard to obtain (Fig. 2).
We partitioned the cross-attention layers of the denoising U-net into two subsets: coarse layers with low spatial resolution and fine layers with high spatial resolution. We then used two conditioning prompts: "red cube" and "green lizard", and injected one prompt into one subset of cross-attention layers, while injecting the second prompt into the other subset. The resulting generated images are provided in Figure 3.
Summary: The cross-attention layers are split into two subsets: coarse layers with low spatial resolution and fine layers with high spatial resolution. Two conditioning prompts were used, with one prompt injected into one subset and the other prompt into the other subset. As Fig. 3 shows, "red cube" went only into one subset of layers; each layer subset receives exactly one prompt.
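The prompt routing in this experiment can be sketched as follows. The specific resolution values and the function name are illustrative assumptions, not values from the paper:

```python
COARSE_RESOLUTIONS = {8, 16}   # low spatial resolution -> "coarse" layers
FINE_RESOLUTIONS = {32, 64}    # high spatial resolution -> "fine" layers

def assign_prompt(layer_resolution: int, coarse_prompt: str, fine_prompt: str) -> str:
    """Route one of two prompts to a layer according to its spatial resolution."""
    if layer_resolution in COARSE_RESOLUTIONS:
        return coarse_prompt
    if layer_resolution in FINE_RESOLUTIONS:
        return fine_prompt
    raise ValueError(f"unknown resolution: {layer_resolution}")
```

With the paper's prompts, the coarse layers would receive "red cube" while the fine layers receive "green lizard" (or vice versa for the second configuration).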
This experiment suggests that the conditioning mechanism at different resolutions processes prompts differently, with different attributes exerting greater influence at different levels.
Summary: This experiment suggests that the conditioning mechanism processes prompts differently at different resolutions, with different attributes exerting greater influence at different levels.
In our work, we define P as the set of individual token embeddings that are passed to the text encoder. The process of injecting a text prompt into the network for a particular cross-attention layer is illustrated in Figure 4.
Summary: P is defined as the set of individual token embeddings passed to the text encoder. Fig. 4 shows how a text prompt is injected into the network at a particular cross-attention layer.
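The injection mechanism at a single layer is standard cross-attention: the layer's spatial features form the queries, and the encoded prompt tokens form the keys and values. A minimal single-head NumPy sketch (dimensions and names are illustrative, not the model's actual shapes):

```python
import numpy as np

def cross_attention(x, text_emb, w_q, w_k, w_v):
    """Single-head cross-attention: spatial features attend to prompt tokens.

    x        : (n_pixels, d)    flattened spatial features of one U-Net layer
    text_emb : (n_tokens, d_t)  text-encoder output for the prompt p
    """
    q = x @ w_q                                   # (n_pixels, d_k) queries
    k = text_emb @ w_k                            # (n_tokens, d_k) keys
    v = text_emb @ w_v                            # (n_tokens, d_v) values
    scores = q @ k.T / np.sqrt(k.shape[1])        # (n_pixels, n_tokens)
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)       # softmax over prompt tokens
    return attn @ v                               # (n_pixels, d_v)
```

In P+, each layer would call this with its own `text_emb` (the encoded p_i) instead of a shared one.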