However, these specialist segmentation models are limited to specific tasks, classes, granularities, and data types.
This requires expensive annotation efforts and is not sustainable for a large number of segmentation tasks.
Problem statement: specialist segmentation models are restricted to specific tasks, classes, granularities, and data types. Training a separate model for each setting takes too much effort and does not scale to the large number of segmentation tasks.
In this work, we aim to train a single model that is capable of solving diverse and unlimited segmentation tasks. The main challenges are twofold: (1) to incorporate those very different data types in training, e.g., part, semantic, instance, panoptic, person, medical image, aerial image, etc.; (2) to design a generalizable training scheme that differs from conventional multi-task learning, which is flexible on task definition and is capable of handling out-of-domain tasks.
To solve this, the authors (1) unify the very different data types in training, and (2) design a generalizable training scheme, distinct from conventional multi-task learning, that remains flexible on task definition.
SegGPT is a specialized version of the Painter [46] framework that enables segmenting everything with a generalist Painter, hence the name of the model, SegGPT.
SegGPT is not that GPT; it is a generalist Painter.
In the original Painter framework, the color space for each task is pre-defined, causing the solution to collapse into multi-task learning.
In the original Painter framework, each task's color space is already defined, so the solution reduces to multi-task learning.
To address this limitation, we propose a random coloring scheme for in-context coloring. We begin by randomly sampling another image that shares a similar context with the input image, such as the same semantic category or object instance. Next, we randomly sample a set of colors from the target image and map each color to a random one. This results in a re-coloring of the corresponding pixels. As a result, we get two pairs of images, which are defined as an in-context pair. In addition, we introduce the mix-context training method which trains the model using mixed examples. This involves stitching together multiple images with the same color mapping. The resulting image is then randomly cropped and resized to form a mixed-context training sample.
To address this limitation, the authors propose a random coloring scheme for in-context coloring. Another image sharing a similar context is randomly sampled; a set of colors is sampled from the target image, and each is mapped to a random color, re-coloring the corresponding pixels. The result is two pairs of images, defined as an in-context pair. In addition, the authors use mix-context training: the model is trained on mixed examples, formed by stitching together multiple images that share the same color mapping. The stitched image is then randomly cropped and resized to form a mixed-context training sample.
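The random coloring step can be sketched as follows. This is a minimal illustration, not the SegGPT implementation; the function name and signature are ours, and the target is assumed to be an (H, W, 3) color-coded segmentation map.

```python
import numpy as np

def random_coloring(target, num_colors=3, rng=None):
    """Re-color a segmentation target with a random color mapping.

    Illustrative sketch: sample a subset of the colors present in
    `target` (an (H, W, 3) uint8 map) and map each one to a freshly
    sampled random color, re-coloring the corresponding pixels.
    """
    rng = rng or np.random.default_rng()
    h, w, _ = target.shape
    flat = target.reshape(-1, 3)
    # Randomly sample a subset of the colors present in the target.
    colors = np.unique(flat, axis=0)
    k = min(num_colors, len(colors))
    chosen = colors[rng.choice(len(colors), size=k, replace=False)]
    # Map each chosen color to a random replacement color.
    recolored = flat.copy()
    for color in chosen:
        new_color = rng.integers(0, 256, size=3, dtype=np.uint8)
        mask = (flat == color).all(axis=1)
        recolored[mask] = new_color
    return recolored.reshape(h, w, 3)
```

Because the mapping is random per sample, the model cannot memorize a fixed task-to-color assignment and must instead read the coloring rule from the in-context example.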
Once training is finished, its full power can be unleashed during inference. SegGPT enables arbitrary segmentation in context, e.g., with an example of a single image and its target image. The target image can be of a single color (excluding the background) or of multiple colors, e.g., segmenting several categories or objects of interest in one shot. Specifically, given an input image to be tested, we stitch it with the example image and feed it to SegGPT to get the corresponding in-context prediction.
Once training is done, its full power is available at inference. SegGPT can perform arbitrary segmentation in context. The target image can use a single color or multiple colors. Given a test image, it is stitched with the example image and fed to SegGPT to obtain the matching in-context prediction.
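A rough sketch of this stitch-and-predict inference loop, under assumed names: `model` is any callable that in-paints a stitched canvas, and the quadrant layout below is our illustrative choice, not the exact SegGPT arrangement.

```python
import numpy as np

def in_context_predict(model, example_img, example_target, query_img):
    """Sketch of in-context inference via stitching (hypothetical API).

    The example image/target pair and the query are stitched into one
    canvas; the model fills in the blank quadrant, which is read out
    as the prediction for the query image.
    """
    h, w, c = query_img.shape
    canvas = np.zeros((2 * h, 2 * w, c), dtype=query_img.dtype)
    canvas[:h, :w] = example_img        # top-left: example image
    canvas[:h, w:] = example_target     # top-right: example target
    canvas[h:, :w] = query_img          # bottom-left: query image
    # bottom-right is left blank for the model to in-paint
    output = model(canvas)
    return output[h:, w:]               # predicted target for the query
```

The key point is that the task is specified purely by the example pair on the canvas, so swapping in a different example changes what gets segmented without retraining.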
To efficiently leverage multiple examples for a SegGPT model, we propose two context ensemble approaches. One is called Spatial Ensemble: multiple examples are concatenated in an n × n grid and then subsampled to the same size as a single example.
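Spatial Ensemble can be sketched like this, assuming a square number of same-sized examples; the helper name is ours, and the nearest-neighbor subsampling via strided indexing is one plausible choice, not necessarily the paper's.

```python
import numpy as np

def spatial_ensemble(examples, out_size):
    """Spatial Ensemble sketch: tile n*n example images into a grid,
    then subsample the grid back down to a single example's size.

    `examples` is a list of (H, W, C) arrays. Subsampling here is
    nearest-neighbor via index selection.
    """
    n = int(np.sqrt(len(examples)))
    assert n * n == len(examples), "needs a square number of examples"
    rows = [np.concatenate(examples[i * n:(i + 1) * n], axis=1)
            for i in range(n)]
    grid = np.concatenate(rows, axis=0)           # (n*H, n*W, C)
    h, w = out_size
    ys = np.linspace(0, grid.shape[0] - 1, h).astype(int)
    xs = np.linspace(0, grid.shape[1] - 1, w).astype(int)
    return grid[np.ix_(ys, xs)]                   # (h, w, C)
```

Because the ensembled canvas has the same size as a single example, the model can consume it with no architectural change.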
Another approach is Feature Ensemble: multiple examples are combined in the batch dimension and computed independently, except that the features of the query image are averaged after each attention layer.
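A minimal sketch of one Feature Ensemble step, with assumed names: `attn_layer` stands in for any attention layer mapping (B, N, D) to (B, N, D), and `query_slice` selects the tokens belonging to the query image. None of these names come from the SegGPT codebase.

```python
import numpy as np

def feature_ensemble_layer(attn_layer, features, query_slice):
    """Feature Ensemble sketch: run B examples independently through
    an attention layer, then average the query-image features across
    the batch and broadcast the mean back to every example.
    """
    out = attn_layer(features)                       # (B, N, D)
    query_mean = out[:, query_slice].mean(axis=0)    # (n_q, D)
    out[:, query_slice] = query_mean                 # share across batch
    return out
```

Repeating this after every attention layer lets the query aggregate evidence from all examples while each example's own features stay independent.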