We propose LayoutTransformer, a simple yet powerful auto-regressive model that can synthesize new layouts, complete partial layouts, and compute the likelihood of existing layouts. The self-attention approach allows us to visualize which existing elements are important for generating the next category in the sequence.
LayoutTransformer is a model that can generate new layouts. It can also complete partial layouts and compute the likelihood of existing layouts. The existing elements matter for generating the next category in the sequence, and self-attention lets us visualize which of them are important.
We model different attributes of layout elements separately - doing so allows the attention module to more easily focus on the attributes that matter. This is especially important in datasets with inherent symmetries, such as documents or apps, and contrasts with existing approaches that concatenate or fuse the different attributes of layout primitives.
Each attribute of a layout element is modeled separately - this makes it easier for the attention module to focus on the attributes that matter. This is especially important for datasets with inherent symmetries, in contrast to approaches that mix or fuse the attributes together; a minimal sketch of this design choice follows below.
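To make this concrete, here is a minimal sketch (not the authors' code; `d_model`, `num_categories`, `num_bins`, and the toy values are assumptions) of giving each attribute its own embedding token instead of concatenating all attributes into one fused vector:

```python
import torch
import torch.nn as nn

# Assumed sizes, for illustration only
d_model = 512            # embedding dimension d
num_categories = 25      # number of element categories (dataset dependent)
num_bins = 32            # number of discrete coordinate bins

# Separate embedding tables: category and geometry are not fused together
category_emb = nn.Embedding(num_categories, d_model)
coord_emb = nn.Embedding(num_bins, d_model)   # shared by x, y, h, w

# One primitive: category 3 with discretized geometry (x, y, h, w) = (5, 7, 12, 20)
cat, x, y, h, w = torch.tensor([3, 5, 7, 12, 20])

# Five separate tokens, so self-attention can focus on individual attributes
tokens = torch.stack([category_emb(cat), coord_emb(x), coord_emb(y),
                      coord_emb(h), coord_emb(w)])   # shape: (5, d_model)
```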
We present an exciting finding – encouraging a model to understand layouts results in feature representations that capture the semantic relationships between objects automatically (without explicitly using semantic embeddings, like word2vec [35]). This demonstrates the utility of layout generation as a proxy task for learning semantic representations.
Training a model to understand layouts leads it to learn feature representations that automatically capture the semantic relationships between objects in a layout.
LayoutTransformer shows good performance with essentially the same architecture and hyperparameters across very diverse domains. We show the adaptability of our model on four layout datasets: MNIST Layout [29], Rico Mobile App Wireframes [9], PubLayNet Documents [66], and COCO Bounding Boxes [32]. To the best of our knowledge, LayoutTransformer is the first framework to perform competitively with state-of-the-art approaches in four diverse data domains.
Given a dataset of layouts, a single layout instance can be defined as a graph $G$ with $n$ nodes, where each node $i \in \{1, \dots, n\}$ is a graphical primitive. We assume that the graph is fully connected, and let the attention network learn the relationships between nodes. The nodes can have structural or semantic information associated with them. For each node, we project the information associated with it to a $d$-dimensional space represented by the feature vector $s_i$. Note that the information itself can be discrete (e.g., part category), continuous (e.g., color), or multidimensional vectors (e.g., signed distance function of the part) on some manifold.
A single layout instance is defined as a graph $G$ with $n$ nodes. The graph is assumed to be fully connected, and the attention network learns the relationships between the nodes. Nodes can carry structural or semantic information, and for each node the associated information is projected into a $d$-dimensional space represented by the feature vector $s_i$.
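As a rough illustration (the projection layers, vocabulary size, and the choice to sum the projections are assumptions, not the paper's implementation), discrete and continuous node information could be projected into the same $d$-dimensional space like this:

```python
import torch
import torch.nn as nn

d = 512  # assumed embedding dimension

# Discrete information (e.g., a part category) -> lookup table
category_proj = nn.Embedding(num_embeddings=100, embedding_dim=d)

# Continuous information (e.g., an RGB color) -> linear projection
color_proj = nn.Linear(in_features=3, out_features=d)

category = torch.tensor(7)               # discrete attribute
color = torch.tensor([0.2, 0.5, 0.9])    # continuous attribute

# One way (among several) to combine them into the node feature s_i in R^d
s_i = category_proj(category) + color_proj(color)
```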
Each primitive also carries geometric information $g_i$, which we factorize into a position vector and a scale vector. For layouts in $\mathbb{R}^2$ such as images or documents, $g_i = [x_i, y_i, h_i, w_i]$, where $(x_i, y_i)$ are the coordinates of the centroid of the primitive and $(h_i, w_i)$ are the height and width of the bounding box containing the primitive, normalized with respect to the dimensions of the entire layout.
Each primitive also carries geometric information $g_i$, consisting of position and scale. For a 2D layout such as an image, $g_i=[x_i,y_i,h_i,w_i]$: the centroid coordinates plus the height and width of the bounding box, normalized by the layout dimensions.
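A small sketch of computing $g_i$ from a pixel-space bounding box (the function name and example numbers are made up for illustration):

```python
def geometry_vector(x0, y0, x1, y1, layout_w, layout_h):
    """Return [cx, cy, h, w]: centroid and size of a box, normalized to [0, 1]."""
    cx = (x0 + x1) / 2.0 / layout_w
    cy = (y0 + y1) / 2.0 / layout_h
    h = (y1 - y0) / layout_h
    w = (x1 - x0) / layout_w
    return [cx, cy, h, w]

# e.g., a 100x50 box with top-left (200, 300) in an 800x600 document
g_i = geometry_vector(200, 300, 300, 350, layout_w=800, layout_h=600)
# -> [0.3125, 0.5417, 0.0833, 0.125]
```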
Representing geometry with discrete variables
We observe that even though discretizing coordinates introduces approximation errors, it allows us to express arbitrary distributions, which we find particularly important for layouts with strong symmetries such as documents and app wireframes. We project each geometric field of the primitive independently to the same $d$-dimensional space, such that the $i$-th primitive in $\mathbb{R}^2$ can be represented as $(s_i, x_i, y_i, h_i, w_i)$. We concatenate all the elements into a flattened sequence of their parameters. We also append embeddings of two additional parameters, $s_{\langle bos \rangle}$ and $s_{\langle eos \rangle}$, to denote the start and end of the sequence. A layout in $\mathbb{R}^2$ can now be represented by a sequence of $5n + 2$ latent vectors.
Discretizing coordinates introduces approximation errors, but it allows the model to express arbitrary distributions, which is especially important for layouts with strong symmetries such as documents and apps. Each geometric field is independently projected into the same $d$-dimensional space, so the $i$-th primitive in $\mathbb{R}^2$ is represented as $(s_i, x_i, y_i, h_i, w_i)$. All elements are concatenated into one flattened sequence of parameters, and the authors add embeddings of two extra parameters, $s_{\langle bos \rangle}$ and $s_{\langle eos \rangle}$, to mark the start and end of the sequence. A 2D layout can therefore be represented by a sequence of $5n+2$ latent vectors.
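A minimal sketch of this discretize-and-flatten step (the bin count, token names, and helper functions are assumptions, not the paper's code):

```python
def discretize(value, num_bins=32):
    """Map a coordinate in [0, 1] to one of num_bins integer bins."""
    return min(int(value * num_bins), num_bins - 1)

def flatten_layout(elements, num_bins=32):
    """elements: list of (category, [cx, cy, h, w]) with normalized geometry.
    Returns a flat token sequence: <bos>, then 5 tokens per element, then <eos>."""
    seq = ["<bos>"]
    for category, (cx, cy, h, w) in elements:
        seq += [category] + [discretize(v, num_bins) for v in (cx, cy, h, w)]
    seq.append("<eos>")
    return seq  # length 5 * len(elements) + 2

layout = [("title", [0.5, 0.1, 0.08, 0.8]), ("figure", [0.5, 0.55, 0.4, 0.6])]
print(flatten_layout(layout))
# ['<bos>', 'title', 16, 3, 2, 25, 'figure', 16, 17, 12, 19, '<eos>']  -> 5*2 + 2 = 12 tokens
```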
For brevity, we use $\theta_j$, $j \in \{1, \dots, 5n+2\}$, to represent any element in the above sequence. We can now pose the problem of modeling this joint distribution as a product over a series of conditional distributions using the chain rule:
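Written out, the chain-rule factorization referred to above takes the standard auto-regressive form (the notation may differ slightly from the paper):

$$ p(\theta_1, \ldots, \theta_{5n+2}) = \prod_{j=1}^{5n+2} p(\theta_j \mid \theta_1, \ldots, \theta_{j-1}) $$

where each conditional is modeled by the Transformer given all preceding elements of the sequence.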