However, such operations do not always lead to intuitive editing. In particular, spatial control for diverse layouts (e.g., pose and shape) is difficult to handle via 1D operations based on a slider UI.

In addition, in most existing techniques, spatial control is limited to basic transformations, such as translation, rotation, and scaling.

Problem statement: handling the latent space through 1D slider-UI operations is difficult. In addition, spatial control is limited to basic transformations (e.g., translation, rotation, scaling).


In this paper, we tackle the novel problem of controlling the spatial layout of StyleGAN images by manipulating latent codes in accordance with user inputs directly specified on the images.

The paper tackles the novel problem of manipulating the spatial layout of StyleGAN images by manipulating latent codes according to user inputs specified directly on the images. As shown in Fig. 1, editing via drag, anchor key + drag, and even 3D motion is possible.

1. Method

We formulate this task as the problem of constructing a latent transformer T, which transforms initial latent codes w_before in accordance with user inputs U:

ŵ_after = T(w_before, U) = w_before + α · f(w_before, U)

where α is a parameter that adjusts the degree of manipulation for the latent codes w_before, and f is an arbitrary function based on a neural network. The user inputs are defined as U = {(v_i, p_i)}_{i=1}^{K}, consisting of K motion vectors v_i ∈ ℝ³ in the xyz-directions and pixel positions p_i ∈ ℤ² of the start points for v_i.

The task is formulated as the problem of a latent transformer T that transforms the latent codes w_before according to the user input U. The user inputs consist of motion vectors v_i and pixel positions p_i.
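A minimal sketch of this formulation, showing the α-scaled residual update on the latent code. The shapes, the flattening of U, and the two-layer MLP standing in for f are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

D_W = 512   # dimensionality of a latent code w (512 in StyleGAN)
K = 3       # number of user-supplied motion vectors

def encode_user_inputs(U):
    """Flatten the K (motion vector, pixel position) pairs into one vector.
    (Illustrative; the paper feeds U through a transformer instead.)"""
    return np.concatenate([np.concatenate([v, p]) for v, p in U])

def f(w_before, U, W1, W2):
    """Hypothetical neural network f: a tiny two-layer MLP on [w; U]."""
    x = np.concatenate([w_before, encode_user_inputs(U)])
    h = np.maximum(W1 @ x, 0.0)          # ReLU hidden layer
    return W2 @ h                         # offset in latent space

# Toy user input: K motion vectors v_i in R^3 with start pixels p_i in Z^2.
U = [(rng.normal(size=3), np.array([64.0, 32.0])) for _ in range(K)]

d_in = D_W + K * 5                        # w plus K flattened (v_i, p_i) pairs
W1 = rng.normal(scale=0.01, size=(128, d_in))
W2 = rng.normal(scale=0.01, size=(D_W, 128))

w_before = rng.normal(size=D_W)
alpha = 1.0                               # degree of manipulation

# The latent transformer T: w_after = w_before + alpha * f(w_before, U).
w_after = w_before + alpha * f(w_before, U, W1, W2)
print(w_after.shape)  # (512,)
```

Setting α = 0 recovers w_before unchanged, which is why α acts as an edit-strength knob.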


First, the user annotates the output image of StyleGAN from the initial latent codes w_before (Section 3.3). Next, we inject the user input U, initial latent codes w_before, and StyleGAN feature map into the latent transformer to compute the edited latent codes ŵ_after (Section 3.1). Finally, we obtain the resulting image by injecting the latent codes ŵ_after into the StyleGAN generator.

  1. The user annotates the StyleGAN output image generated from the initial latent codes w_before. Next, the user input U, the latent codes w_before, and the StyleGAN feature map are fed into the latent transformer to compute the edited latent codes ŵ_after. Finally, ŵ_after is injected into the StyleGAN generator to obtain the resulting image.
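The editing pipeline above can be sketched end-to-end as follows. The generator, feature-map extractor, and latent transformer here are stand-in stubs (assumptions for illustration), not the paper's trained models:

```python
import numpy as np

rng = np.random.default_rng(1)
D_W = 512

# --- Stand-in stubs for the real components (assumptions) ---------------
def stylegan_generate(w):
    """Stub StyleGAN generator: latent code -> small 'image' array."""
    return np.tanh(w[:3 * 8 * 8]).reshape(3, 8, 8)

def stylegan_feature_map(w):
    """Stub for an intermediate StyleGAN feature map."""
    return np.tanh(w).reshape(32, 4, 4)

def latent_transformer(w_before, U, feat):
    """Stub latent transformer T(w_before, U): returns edited codes."""
    offset = 0.01 * (np.sum([v.sum() for v, _ in U]) + feat.mean())
    return w_before + offset

# --- The editing pipeline ----------------------------------------------
w_before = rng.normal(size=D_W)
image = stylegan_generate(w_before)              # 1. show image to user

# 2. user annotates the image with motion vectors and start pixels
U = [(np.array([1.0, 0.0, 0.0]), np.array([4, 4]))]

feat = stylegan_feature_map(w_before)
w_after = latent_transformer(w_before, U, feat)  # 3. edit the latent codes
edited_image = stylegan_generate(w_after)        # 4. regenerate the image
```

The key point is that only the latent codes change between the two generator calls; the generator itself is untouched.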

Network architecture


To handle a different number of user inputs at each test time, we incorporate a transformer encoder-decoder architecture, which can handle variable-length inputs, into our latent transformer.

To handle a varying number of user inputs each time, a transformer encoder-decoder architecture is incorporated so that variable-length inputs can be processed.
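Why a transformer suits this: self-attention applies the same projection weights to a token sequence of any length K. A minimal single-head self-attention sketch (dimensions are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 16  # token dimensionality (illustrative)

# One set of projection weights, independent of sequence length.
Wq, Wk, Wv = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))

def self_attention(X):
    """Single-head self-attention over a (K, d) token matrix; K may vary."""
    Q, Kmat, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ Kmat.T / np.sqrt(d)                  # (K, K) similarities
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)                 # row-wise softmax
    return A @ V                                      # (K, d) outputs

# The same layer processes K=1 and K=5 user-input tokens unchanged.
out1 = self_attention(rng.normal(size=(1, d)))
out5 = self_attention(rng.normal(size=(5, d)))
print(out1.shape, out5.shape)  # (1, 16) (5, 16)
```

This is the property the latent transformer relies on: one user drag or five, the encoder consumes them with the same parameters.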


On the transformer encoder side, given user inputs U, it extracts a sequence of feature vectors that is passed to the transformer decoder. Because the pixel positions p_i themselves do not contain semantic information about what is to be moved, we instead use semantic feature vectors extracted from the StyleGAN feature map as input.
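A sketch of this encoder-side input construction: sample the StyleGAN feature map at each (scaled) position p_i and pair the resulting semantic vector with the motion vector v_i to form one token per user input. The shapes, image resolution, and nearest-neighbor sampling here are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

C, H_f, W_f = 64, 16, 16      # feature-map channels / resolution (illustrative)
IMG_RES = 256                 # output image resolution (illustrative)
feature_map = rng.normal(size=(C, H_f, W_f))

def encoder_tokens(feature_map, U):
    """Build one token per user input: the feature vector sampled at p_i,
    concatenated with the motion vector v_i (nearest-neighbor sampling)."""
    tokens = []
    scale = feature_map.shape[1] / IMG_RES
    for v, p in U:
        x, y = (np.asarray(p) * scale).astype(int)   # image -> feature coords
        tokens.append(np.concatenate([feature_map[:, y, x], v]))
    return np.stack(tokens)                          # (K, C + 3)

U = [(np.array([0.5, 0.0, 0.0]), np.array([120, 40])),
     (np.array([0.0, -0.3, 0.1]), np.array([200, 220]))]
tokens = encoder_tokens(feature_map, U)
print(tokens.shape)  # (2, 67)
```

Each token thus encodes *what* sits at the start point (the sampled feature vector) together with *how* it should move (v_i), which is exactly the semantic grounding that raw pixel coordinates lack.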