These methods are prone to drifting and struggle with occlusions. Recent state-of-the-art VOS methods use attention [36,18,54,9,60] to link representations of past frames stored in the feature memory with features extracted from the newly observed query frame which needs to be segmented. Despite the high performance of these methods, they require a large amount of GPU memory to store past frame representations. In practice, they usually struggle to handle videos longer than a minute on consumer-grade hardware.

Problem statement: Video object segmentation (VOS) methods are prone to drifting and struggle with occlusions. Recent VOS methods use attention to link the representations of past frames stored in the feature memory with the features extracted from the newly observed query frame that needs to be segmented. These methods perform well, but they need a lot of GPU memory, so they struggle with videos longer than a minute.
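
As a rough sketch of what this attention-based memory reading looks like (tensor names and shapes are my own illustration, not the exact formulation of any of the cited methods):

```python
import torch
import torch.nn.functional as F

def attention_readout(query_key, memory_keys, memory_values):
    """Aggregate stored past-frame features for every query location.

    query_key:     (C_k, HW)  key features of the current query frame
    memory_keys:   (C_k, N)   keys of all memorized past-frame elements
    memory_values: (C_v, N)   values of all memorized past-frame elements
    returns:       (C_v, HW)  memory readout for each query location
    """
    affinity = memory_keys.t() @ query_key   # (N, HW) similarity of each memory element to each query pixel
    weights = F.softmax(affinity, dim=0)     # normalize over memory elements
    return memory_values @ weights           # (C_v, HW) weighted sum of stored values
```

The memory-consumption issue is visible directly in the shapes: N grows with every memorized frame, so both the affinity matrix and the stored keys/values grow linearly with video length.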

Methods that instead compress features into a compact representation use less GPU memory, but as high-resolution features are compressed right away, they produce less accurate segmentations. Figure 1 shows the relation between GPU memory consumption and segmentation quality on short/long video datasets (details are given in Section 4.1).

Because high-resolution features are compressed right away, the resulting segmentations are less accurate. Fig. 1 shows the relation between GPU memory consumption and segmentation quality.

We think this undesirable connection of performance and GPU memory consumption is a direct consequence of using a single feature memory type. To address this limitation we propose a unified memory architecture, dubbed XMem. Inspired by the Atkinson–Shiffrin memory model [1] which hypothesizes that the human memory consists of three components, XMem maintains three independent yet deeply-connected feature memory stores: a rapidly updated sensory memory, a high-resolution working memory, and a compact thus sustained long-term memory.

We think this undesirable coupling of performance and GPU memory consumption is a direct consequence of using a single type of feature memory. To address this limitation, a unified memory architecture dubbed XMem is proposed. Inspired by the Atkinson-Shiffrin model, which hypothesizes that human memory consists of three components, XMem maintains three independent yet deeply-connected feature memory stores: a rapidly updated sensory memory, a high-resolution working memory, and a compact, long-lasting long-term memory.
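
A minimal way to picture the three stores side by side (field names and update cadences are my paraphrase of the design, not XMem's actual code):

```python
from dataclasses import dataclass, field
from typing import List, Optional
import torch

@dataclass
class XMemStores:
    # Sensory memory: a single hidden state, refreshed every frame (GRU-style in the paper).
    sensory: Optional[torch.Tensor] = None
    # Working memory: high-resolution keys/values for the most recently memorized frames.
    working_keys: List[torch.Tensor] = field(default_factory=list)
    working_values: List[torch.Tensor] = field(default_factory=list)
    # Long-term memory: a small, compact set of prototypes, grown only at consolidation time.
    longterm_keys: List[torch.Tensor] = field(default_factory=list)
    longterm_values: List[torch.Tensor] = field(default_factory=list)
```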

To control the size of the working memory, XMem routinely consolidates its representations into the long-term memory, inspired by the consolidation mechanism in the human memory [46].

To control the size of the working memory, XMem periodically consolidates its representations into the long-term memory, mirroring the consolidation mechanism in human memory.
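
One way to sketch that consolidation step is to select a few prototype keys from the working memory and summarize the remaining values onto them with attention; the usage-based top-k selection here is an assumption for illustration, not a faithful reproduction of the paper's procedure:

```python
import torch
import torch.nn.functional as F

def consolidate(work_keys, work_values, usage, num_prototypes):
    """Compress working-memory elements into a small set of long-term prototypes.

    work_keys:   (C_k, N)  working-memory keys
    work_values: (C_v, N)  working-memory values
    usage:       (N,)      how often each element was attended to during memory reading
    """
    # Keep the most frequently used elements as prototype keys (illustrative heuristic).
    idx = usage.topk(min(num_prototypes, usage.numel())).indices
    proto_keys = work_keys[:, idx]            # (C_k, P)

    # Summarize all working-memory values onto the prototypes via attention.
    affinity = work_keys.t() @ proto_keys     # (N, P)
    weights = F.softmax(affinity, dim=0)      # normalize over working-memory elements
    proto_values = work_values @ weights      # (C_v, P)
    return proto_keys, proto_values
```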

1. XMem

Overview

Figure 2 provides an overview of XMem.

Given the image and target object mask at the first frame (top-left of Figure 2), XMem tracks the object and generates corresponding masks for subsequent query frames. For this, we first initialize the different feature memory stores using the inputs. For each subsequent query frame, we perform memory reading (Section 3.2) from the long-term memory (Section 3.3), the working memory (Section 3.4), and the sensory memory (Section 3.5), and use the result to generate a segmentation mask.

Given the image and the target object mask in the first frame, XMem tracks the object and generates the corresponding masks for subsequent query frames. The different feature memory stores are first initialized from these inputs. For each subsequent query frame, memory reading is performed from the long-term memory, the working memory, and the sensory memory, and a segmentation mask is then generated.
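
Put together, the per-frame loop can be sketched roughly as below; `nets` is a hypothetical bundle of the encoders, decoder, and memory routines described in this section, so this is a control-flow illustration rather than the authors' implementation:

```python
def segment_video(frames, first_mask, nets, t_max=10, mem_every=5):
    """Hypothetical driver loop mirroring Figure 2 (not XMem's actual code)."""
    memory = nets.init_memory(frames[0], first_mask)   # initialize all three stores
    masks = [first_mask]
    for t, frame in enumerate(frames[1:], start=1):
        query = nets.encode_query(frame)               # query encoder features
        readout = nets.read_memory(query, memory)      # read long-term, working, and sensory memory
        mask = nets.decode(readout, memory)            # decoder produces the object mask
        nets.update_sensory(memory, frame, mask)       # sensory memory refreshed every frame
        if t % mem_every == 0:                         # new working-memory entry every few frames
            nets.add_working(memory, frame, mask)
        if nets.working_size(memory) > t_max:          # bound the working memory
            nets.consolidate(memory)                   # move compact features into long-term memory
        masks.append(mask)
    return masks
```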

When the working memory reaches a pre-defined maximum of Tmax frames, we consolidate features from the working memory into the long-term memory in a highly compact form. When the long-term memory is also full (which only happens after processing thousands of frames), we discard obsolete features to bound the maximum GPU memory usage.

When the working memory reaches its maximum of Tmax frames, features from the working memory are consolidated into the long-term memory in a highly compact form. When the long-term memory also becomes full, obsolete features are discarded to bound the maximum GPU memory usage.
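
When the long-term memory hits its own budget, "discarding obsolete features" amounts to a least-frequently-used eviction; a hedged sketch (the usage bookkeeping and the element budget are assumptions on my side):

```python
import torch

def evict_obsolete(lt_keys, lt_values, usage, max_elements):
    """Drop the least-used long-term elements once a size budget is exceeded.

    lt_keys:   (C_k, N)  long-term memory keys
    lt_values: (C_v, N)  long-term memory values
    usage:     (N,)      accumulated read frequency per element (assumed bookkeeping)
    """
    n = lt_keys.shape[1]
    if n <= max_elements:
        return lt_keys, lt_values, usage
    keep = usage.topk(max_elements).indices    # retain the most frequently read elements
    return lt_keys[:, keep], lt_values[:, keep], usage[keep]
```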

XMem consists of three end-to-end trainable convolutional networks as shown in Figure 3: a query encoder that extracts query-specific image features, a decoder that takes the output of the memory reading step to generate an object mask, and a value encoder that combines the image with the object mask to extract new memory features.

XMem consists of three end-to-end trainable convolutional networks, as shown in Fig. 3:
1. a query encoder that extracts query-specific image features
2. a decoder that takes the output of the memory reading step and generates an object mask
3. a value encoder that combines the image with the object mask to extract new memory features
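
To make the three roles concrete, here are stub modules with the same inputs and outputs; the single conv layers only stand in for the real backbones (the paper uses ResNet encoders), so treat the internals as placeholders:

```python
import torch
import torch.nn as nn

class QueryEncoder(nn.Module):
    """Image -> query-specific features."""
    def __init__(self, c_img=3, c_feat=64):
        super().__init__()
        self.backbone = nn.Conv2d(c_img, c_feat, 3, padding=1)      # placeholder for a ResNet backbone
    def forward(self, image):
        return self.backbone(image)

class Decoder(nn.Module):
    """Memory readout -> object mask logits."""
    def __init__(self, c_feat=64):
        super().__init__()
        self.head = nn.Conv2d(c_feat, 1, 3, padding=1)              # placeholder refinement/upsampling head
    def forward(self, readout):
        return self.head(readout)

class ValueEncoder(nn.Module):
    """Image + object mask -> new memory value features."""
    def __init__(self, c_img=3, c_feat=64):
        super().__init__()
        self.backbone = nn.Conv2d(c_img + 1, c_feat, 3, padding=1)  # mask concatenated as an extra channel
    def forward(self, image, mask):
        return self.backbone(torch.cat([image, mask], dim=1))
```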