On the one hand, the massive human labor force is hidden behind huge amounts of labeled data. Moreover, current initialization settings, especially the semi-supervised VOS, need specific object mask groundtruth for model initialization. How to liberate researchers from labor-intensive annotation and initialization is of great importance.
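Problem statement: current VOS (video object segmentation) relies on datasets that humans labeled one item at a time. This takes an enormous amount of labor, and on top of that the initialization settings need a specific object mask ground truth to initialize the model. The report studies how to get rid of this annotation and initialization burden.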
We conclude that SAM has the following advantages that can assist interactive tracking: 1) Strong image segmentation ability. Trained on 11 million images and 1.1 billion masks, SAM can produce high-quality masks and do zero-shot segmentation in generic scenarios. 2) High interactivity with different kinds of prompts. With input user-friendly prompts of points, boxes, or language, SAM can give satisfactory segmentation masks on specific image areas.
However, using SAM in videos directly did not give us an impressive performance due to its deficiency in temporal correspondence.
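Using SAM as an aid for interactive tracking has two advantages: 1) its image segmentation ability is strong, so it can produce high-quality masks and do zero-shot segmentation; 2) it is highly interactive and accepts different kinds of prompts.
However, SAM applied directly to video did not perform well, because it lacks temporal correspondence.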
In this technical report, we introduce our Track-Anything project, which develops an efficient toolkit for high-performance object tracking and segmentation in videos. With a user-friendly interface, the Track Anything Model (TAM) can track and segment any objects in a given video with only one-pass inference. Figure 1 shows the one-pass interactive process in the proposed TAM.
The report proposes the Track Anything Model (TAM). It targets object tracking and segmentation, and with a user-friendly interface TAM can track and segment any object in a given video with only one pass of inference.
The Track Anything task aims at flexible object tracking in arbitrary videos. Here we define that the target objects can be flexibly selected, added, or removed in any way according to the users’ interests. Also, the video length and types can be arbitrary rather than limited to trimmed or natural videos.
The Track Anything task aims at flexible object tracking in arbitrary videos. Target objects can be flexibly selected, added, or removed. The video length and type are also arbitrary, rather than limited to trimmed or natural videos.
XMem
Given the mask description of the target object at the first frame, XMem can track the object and generate corresponding masks in the subsequent frames.
The drawbacks of XMem are also obvious: 1) as a semi-supervised VOS model, it requires a precise mask to initialize; 2) for long videos, it is difficult for XMem to recover from tracking or segmentation failure.
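Given a mask description of the target object in the first frame, XMem tracks the object and generates the corresponding masks in the subsequent frames.
Drawbacks: 1) it needs a precise mask to initialize; 2) on long videos it has trouble recovering from a tracking or segmentation failure.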
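The semi-supervised VOS setting described above (first-frame mask given, later masks predicted) can be sketched as a simple propagation loop. This is a toy illustration, not XMem itself: `propagate_masks`, `toy_step`, and the 1-D "frames" are all hypothetical names and data made up for these notes, and the value-matching rule only stands in for a learned tracker with a memory.

```python
from typing import Callable, List, Sequence, Set, Tuple

Frame = Sequence[int]              # toy 1-D "image": one intensity per pixel
Mask = Set[int]                    # toy mask: indices of pixels on the object
Memory = List[Tuple[Frame, Mask]]  # frames seen so far, with their masks


def propagate_masks(
    frames: Sequence[Frame],
    first_mask: Mask,
    step: Callable[[Memory, Frame], Mask],
) -> List[Mask]:
    """Semi-supervised VOS loop: the first-frame mask is given, the rest are predicted."""
    if not frames:
        return []
    memory: Memory = [(frames[0], first_mask)]
    for frame in frames[1:]:
        mask = step(memory, frame)     # predict from everything stored so far
        memory.append((frame, mask))   # the prediction itself becomes memory
    return [mask for _, mask in memory]


def toy_step(memory: Memory, frame: Frame) -> Mask:
    """Hypothetical stand-in for a tracker: match pixel values seen under the latest mask."""
    last_frame, last_mask = memory[-1]
    object_values = {last_frame[i] for i in last_mask}
    return {i for i, value in enumerate(frame) if value in object_values}


# An "object" with intensity 9 moving one pixel to the right per frame.
frames = [[0, 9, 9, 0, 0], [0, 0, 9, 9, 0], [0, 0, 0, 9, 9]]
masks = propagate_masks(frames, first_mask={1, 2}, step=toy_step)

# The object disappears for one frame and then comes back.
lost = propagate_masks([[0, 9, 0], [0, 0, 0], [9, 0, 0]], first_mask={1}, step=toy_step)
```

Both drawbacks show up in the structure of the loop: nothing happens without `first_mask`, and because each prediction is fed back as memory, an empty or wrong mask carries forward. In this toy the `lost` sequence never picks the object up again once its mask is empty; that is only a crude analogue and says nothing about how XMem actually fails.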
Step 1: Initialization with SAM [5]
As SAM provides us an opportunity to segment a region of interest with weak prompts, e.g., points and bounding boxes, we use it to give an initial mask of the target object.
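A minimal sketch of this initialization step, assuming a predictor object that exposes `set_image` and `predict` with the same keyword names as `SamPredictor` in the `segment_anything` package. `initial_mask_from_click` and `StubPredictor` are hypothetical names made up for these notes; the stub returns fixed masks so the sketch runs without SAM weights. Picking the highest-scoring candidate is also an assumption here, not something the report states.

```python
def initial_mask_from_click(predictor, image, point_coords, point_labels):
    """Ask a SAM-style predictor for candidate masks and keep the best-scored one."""
    predictor.set_image(image)
    masks, scores, _ = predictor.predict(
        point_coords=point_coords,   # (x, y) clicks on the target
        point_labels=point_labels,   # 1 = foreground click, 0 = background click
        multimask_output=True,       # request several candidate masks
    )
    # Assumption: choose the candidate with the highest predicted quality score.
    best = max(range(len(scores)), key=lambda i: scores[i])
    return masks[best]


class StubPredictor:
    """Hypothetical stand-in for a real predictor; returns made-up masks and scores."""

    def set_image(self, image):
        self.image = image

    def predict(self, point_coords=None, point_labels=None, multimask_output=True):
        masks = [
            [[1, 0], [0, 0]],
            [[1, 1], [0, 0]],
            [[1, 1], [1, 1]],
        ]
        scores = [0.70, 0.95, 0.60]
        return masks, scores, None


first_mask = initial_mask_from_click(
    StubPredictor(),
    image=[[0, 0], [0, 0]],
    point_coords=[[0, 0]],
    point_labels=[1],
)
```

With the real library, a `SamPredictor` would be passed in place of the stub, and the image, point coordinates, and labels would be NumPy arrays rather than nested lists. The resulting mask is what a semi-supervised tracker such as XMem would then take as its first-frame input.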