Who among us has not wished, either as a child or as an adult, to see such figures come to life and move around on the page? Sadly, while it is relatively fast to produce a single drawing, creating the sequence of images necessary for animation is a much more tedious endeavor, requiring discipline, skill, patience, and sometimes complicated software. As a result, most of these figures remain static upon the page.
Problem statement: the system handles standardized forms well, but struggles to recognize highly variable drawings such as those made by children.
Inspired by the importance and appeal of the drawn human figure, we design and build a system to automatically animate it given an in-the-wild photograph of a child’s drawing. Our system is fast, intuitive, and robust to much of the variation present in these types of drawings, making it well-suited to allow our target audience, children, to see their own characters coming to life.
Motivated by the expressiveness and importance of the drawn human figure, the authors build a system that automatically animates a given child's drawing. The system is fast, intuitive, and robust.
We describe each stage and identify common causes of failure in each. For object detection and pose estimation, we make use of existing computer vision models designed to detect human figures and joints in photographs; we fine-tune these models for use with children’s drawings.
The authors animate in stages. Each stage has common causes of failure; they identify these causes and fine-tune the models accordingly.
We first detect a bounding box around the human figure within the drawing. This step is necessary because many children’s drawings portray human figures as part of a larger scene [Kellogg 1967] and because the photograph may include background, either drawn or outside the bounds of the piece of paper, such as a table surface.
A bounding box around the human figure in the drawing must be detected first. This step is needed because children often draw the human figure as part of a larger scene, and because the photograph may include background outside the paper, such as a table surface.
We utilize pretrained weights derived from the MS-COCO dataset, one of the largest publicly available semantic segmentation datasets.
However, MS-COCO consists primarily of photographs of real-world objects, not artistic renderings, and does not contain a category for drawings of human figures. Therefore, we fine-tune the model: the model’s backbone weights are frozen, and a new head is attached that predicts a single class, human figure.
A model pretrained on the MS-COCO dataset is used. However, MS-COCO focuses on real-world photographs and is far from drawings, so fine-tuning is done: the model's backbone weights are frozen, and a new head is attached that predicts only one class, human figure.
With the bounding box identified, we next obtain a segmentation mask, separating the figure from the background.
While Mask R-CNN does predict a segmentation mask for each detection, we found them to be inadequate in many cases. Because this mask will be used to create a 2D textured mesh of the figure, it must be a single polygon that tightly conforms to the edges of the figure, includes all body parts, and excludes extraneous background elements.
With the bounding box identified, a segmentation mask is obtained next, separating the figure from the background. This turns out to be genuinely difficult: Mask R-CNN was tried, but its masks proved inadequate in many cases. Because the mask will be used to build a 2D textured mesh of the figure, it must be a single polygon that tightly conforms to the figure's edges, includes all body parts, and excludes as much background as possible.
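The paper's exact mask-extraction procedure is not shown here, but one generic cleanup that enforces the "single polygon, no stray background" requirement is to keep only the largest connected component and fill interior holes. A minimal sketch using `scipy.ndimage` (an assumed toolchain, not necessarily the authors'):

```python
import numpy as np
from scipy import ndimage

def clean_mask(mask: np.ndarray) -> np.ndarray:
    """Reduce a noisy binary mask to a single filled region.

    Keeps only the largest connected component (dropping stray
    background blobs) and fills interior holes so the result can be
    meshed as one polygon. Generic cleanup sketch, not the paper's
    exact procedure.
    """
    labeled, num = ndimage.label(mask)
    if num == 0:
        return mask.astype(bool)
    # Component sizes for labels 1..num (0 is background).
    sizes = ndimage.sum(mask, labeled, range(1, num + 1))
    largest = np.argmax(sizes) + 1
    single = labeled == largest
    # Fill holes, e.g. the unpainted interior of an outlined head.
    return ndimage.binary_fill_holes(single)

# Toy example: a large blob with a hole, plus a small stray blob.
m = np.zeros((10, 10), dtype=bool)
m[1:7, 1:7] = True
m[3, 3] = False          # hole inside the large blob
m[8:10, 8:10] = True     # stray background blob
cleaned = clean_mask(m)
```

After cleanup, `cleaned` contains exactly one filled region, satisfying the single-polygon constraint required for meshing.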