1. Motivation and Observation

Why do we need Inpaint Anything?

However, they typically need fine annotations for each mask, which are essential for training and inference.

However, their mask segmentation predictions have not been fully-explored.

Problem statement: SOTA image inpainting works such as LaMa require fine annotations for each mask, and while SAM produces segmentation predictions, those predictions have not been fully explored.

Therefore, by combining the advantages of SAM [7], the SOTA image inpainters [13], and AI generated content (AIGC) models [11], we provide a powerful and user-friendly pipeline for solving more general inpainting-related problems, such as object removal, new content filling, and background replacing.

By combining the advantages of SAM, SOTA image inpainters, and AI-generated content (AIGC) models (which previously could only fill in the removed region, but can now assist in newly generating the content people actually want and handle that huge demand), more general inpainting-related problems can be solved.

What can Inpaint Anything do?

SAM + SOTA inpainters for removing anything:

With IA, users can easily remove specific objects from the interface by simply clicking on them. Furthermore, IA provides an option for users to fill the resulting “hole” with contextual data.

Specific objects can easily be removed by clicking on them, and there is an option to fill the resulting hole.
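The Remove Anything flow above (click → SAM mask → inpaint the hole) can be sketched as a small pipeline. The `segment_fn` and `inpaint_fn` below are hypothetical stand-ins for SAM and a SOTA inpainter like LaMa, replaced here by toy functions so the sketch runs without model weights:

```python
import numpy as np

def remove_anything(image, click_xy, segment_fn, inpaint_fn):
    """Sketch of the Remove Anything pipeline.

    segment_fn stands in for SAM (click -> binary mask) and
    inpaint_fn for a SOTA inpainter such as LaMa (image + mask -> filled image).
    Both are hypothetical injected callables, not the real model APIs.
    """
    mask = segment_fn(image, click_xy)   # 1. click-to-mask segmentation
    return inpaint_fn(image, mask)       # 2. fill the resulting "hole"

# --- toy stand-ins so the sketch is runnable ---
def toy_segment(image, click_xy):
    """Pretend SAM: select a fixed 3x3 patch around the click."""
    mask = np.zeros(image.shape[:2], dtype=bool)
    y, x = click_xy
    mask[y - 1:y + 2, x - 1:x + 2] = True
    return mask

def toy_inpaint(image, mask):
    """Pretend inpainter: fill masked pixels with the unmasked mean color."""
    out = image.copy()
    out[mask] = image[~mask].mean(axis=0)
    return out

img = np.full((8, 8, 3), 100.0)
img[3:6, 3:6] = 255.0  # a bright "object" to remove
result = remove_anything(img, (4, 4), toy_segment, toy_inpaint)
```

The real pipeline only differs in what is injected: SAM turns the click into a mask, and the inpainter fills the masked region from surrounding context.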

SAM + AIGC models for filling or replacing anything:

(1) After removing objects, IA provides users the option to fill the resulting “hole” either with contextual data or “new content”.

After removing objects, IA optionally fills the resulting "hole" (i.e., the missing region) with either contextual data or new content. With models such as Stable Diffusion, this filling can also be driven by a text prompt.

(2) In addition, users have another option to take IA to retain the click-selected object and replace the remaining background with the newly generated scene. This scene replacement process of IA supports various ways of prompting AIGC models, such as using a different image as a visual prompt or using a short caption as a text prompt.

Additionally, users have the option to keep the click-selected object and replace the remaining background with a newly generated scene; this replacement supports various ways of prompting AIGC models.
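The Replace Anything option is essentially a masked composite: keep the object pixels selected by SAM and swap everything else for a generated scene. A minimal sketch, where `generate_fn` is a hypothetical stand-in for a prompted AIGC model (e.g., Stable Diffusion):

```python
import numpy as np

def replace_background(image, mask, generate_fn, prompt):
    """Sketch of Replace Anything: keep the click-selected object (mask=True)
    and composite it over a newly generated scene.

    generate_fn is a hypothetical stand-in for an AIGC model driven by a
    text (or visual) prompt; here it is a toy function."""
    new_scene = generate_fn(image.shape, prompt)        # generated background
    # where mask is True keep the original object, elsewhere use the new scene
    return np.where(mask[..., None], image, new_scene)

def toy_generate(shape, prompt):
    """Pretend AIGC model: a flat scene whose brightness depends on prompt length."""
    return np.full(shape, float(len(prompt)))

img = np.full((4, 4, 3), 255.0)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True  # the object the user clicked and wants to keep
result = replace_background(img, mask, toy_generate, "beach")
```

Note the symmetry with object removal: removal inpaints inside the mask, while background replacement regenerates everything outside it.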

Methodology

Preliminary

Segment Anything Model (SAM).

SAM has demonstrated promising segmentation capabilities in various scenarios and the great potential of foundation models for computer vision.