However, these methods have limited capability for fine-grained control at the object level, owing to the difficulty of describing the shape and appearance of multiple objects simultaneously with text.

Problem 1: Existing image editing methods still have limited ability to perform fine-grained control at the object level, because properties such as shape are hard to describe in text.

However, most of these works either fall into the prompt engineering pitfall or fail to independently manipulate multiple objects.

Problem 2: Prior works either fall into the prompt-engineering pitfall or fail to manipulate multiple objects independently.

To tackle the aforementioned issues, we propose a novel framework, dubbed Structure-and-Appearance Paired Diffusion Models (PAIR-Diffusion). Specifically, we perceive an image as an amalgamation of diverse objects, each described by various factors such as shape, category, texture, illumination, and depth. Then we further identified two crucial macro properties of an object: structure and appearance.

The paper proposes PAIR-Diffusion to address these problems. An image is perceived as a composition of multiple objects, each described by various factors such as shape, category, and texture. The authors then identify two crucial macro properties of an object: structure and appearance.

1. Method

In the case of image generation, this is usually designed as a holistic process, where the chosen model learns to generate images as a whole in the pixel space [31, 50, 59]. Nevertheless, natural images can be seen as a composition of objects [65], each described by different factors, e.g. shape, style, texture, depth, illumination, etc. Under this formulation, p(x) can be written as:

p(x) = ∫ p(x | z) p(z) dz,   z = (o1, o2, . . . , oN)

In image generation this is usually designed as a holistic process: the model learns to generate the image as a whole in pixel space. z = (o1, o2, …, oN) is the collection of object descriptions, meaning the image contains N objects, where oi describes the i-th object in terms of factors such as style and depth.

In this work, we aim for fine-grained image editing. We thus assume that we already have the information about the objects that we want in the final image from references provided by the user. Therefore, we are only interested in the distribution p(x|z) = p(x|(o1, o2, . . . , oN )).

The authors focus on fine-grained image editing, so the information about the objects desired in the final image is assumed to be given by user-provided references. They are therefore only interested in the distribution p(x|z).

Structure and Appearance Paired Diffusion

We thus represent an object as oi = {si , fi , πi}, where si = {ci , mi} represents the structure, with category ci and shape mi , and fi the appearance. πi denotes a latent variable that captures all the aspects of the i-th object which we do not wish to control; we represent with π = {πi} the collection of these variables for all the objects in the image.

oi consists of the structure si = {ci (category), mi (shape)} and the appearance fi. πi is a latent variable capturing all aspects of the i-th object that we do not wish to control.
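The object representation oi = {si, fi, πi} can be sketched as a simple data structure. This is an illustrative sketch, not code from the paper; the class and field names (Structure, ObjectRepr, appearance) are my own, and πi is left implicit since it is never controlled explicitly.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Structure:
    category: int        # c_i: class label predicted by the segmentation network
    shape: np.ndarray    # m_i: binary mask of shape (H, W)


@dataclass
class ObjectRepr:
    structure: Structure   # s_i = {c_i, m_i}
    appearance: np.ndarray # f_i: appearance feature vector
    # pi_i (uncontrolled factors) stays latent and is not materialized here


# An image is then described by z = [o_1, ..., o_N]
o1 = ObjectRepr(
    structure=Structure(category=3, shape=np.zeros((4, 4), dtype=np.uint8)),
    appearance=np.ones(8),
)
z = [o1]
```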

Structure

We represent the structure information using a segmentation map, due to the fine-grained control it provides and the ease of computation. Given an off-the-shelf network ES(·), we obtain S = ES(x), with S ∈ ℕ^(H×W). Letting ci be the class of the i-th object as predicted by the segmentation network, we extract the shape as:

mi = 1[S = ci]   (the binary mask of pixels labeled with class ci)
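The shape extraction step can be sketched in a few lines of NumPy: given a segmentation map S of integer class labels, the mask of each object is simply a per-class equality test. The function name extract_shapes and the use of np.unique to enumerate classes are my assumptions, not details from the paper.

```python
import numpy as np


def extract_shapes(S: np.ndarray) -> dict:
    """Given a segmentation map S (H x W, integer class labels),
    return the binary mask m_i = 1[S == c_i] for every class c_i present."""
    return {int(c): (S == c).astype(np.uint8) for c in np.unique(S)}


# Toy 2x3 segmentation map containing two classes, 0 and 7
S = np.array([[0, 0, 7],
              [7, 7, 0]])
masks = extract_shapes(S)
# masks[7] is the binary shape mask of the object with category 7
```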