We close this gap and study the behaviors of image classifiers trained with natural language supervision at large scale. Enabled by the large amounts of publicly available data of this form on the internet, we create a new dataset of 400 million (image, text) pairs and demonstrate that a simplified version of ConVIRT trained from scratch, which we call CLIP, for Contrastive Language-Image Pre-training, is an efficient method of learning from natural language supervision. We study the scalability of CLIP by training a series of eight models spanning almost 2 orders of magnitude of compute and observe that transfer performance is a smoothly predictable function of compute (Hestness et al., 2017; Kaplan et al., 2020). We find that CLIP, similar to the GPT family, learns to perform a wide set of tasks during pre-training including OCR, geo-localization, action recognition, and many others.
The authors built CLIP. They collected an enormous amount of data from the internet to create a dataset of 400 million (image, text) pairs and trained CLIP on it. Like the GPT family, CLIP learns a wide set of tasks during pre-training, and its transfer performance scales smoothly and predictably with compute.
It’s much easier to scale natural language supervision compared to standard crowd-sourced labeling for image classification since it does not require annotations to be in a classic “machine learning compatible format” such as the canonical 1-of-N majority vote “gold label”. Instead, methods which work on natural language can learn passively from the supervision contained in the vast amount of text on the internet.
Natural language supervision is much easier to scale than classic crowd-sourced labeling because it does not require annotations in a "machine learning compatible format" such as 1-of-N gold labels. Instead, the model can learn passively from the supervision contained in the vast amount of text on the internet.
Learning from natural language also has an important advantage over most unsupervised or self-supervised learning approaches in that it doesn’t “just” learn a representation but also connects that representation to language which enables flexible zero-shot transfer.
Unlike most unsupervised or self-supervised approaches, this method does not just learn a representation; it also connects that representation to language, which is what enables flexible zero-shot transfer.
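A minimal PyTorch-style sketch of what that zero-shot transfer looks like in practice: class names become natural-language prompts, both modalities are embedded, and the most similar text description wins. The `model.encode_image` / `model.encode_text` interfaces, the tokenizer, and the prompt template are assumptions for illustration, not the paper's exact code.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(model, tokenizer, image, class_names):
    # Build one natural-language prompt per candidate class (template is an assumption).
    prompts = [f"a photo of a {name}" for name in class_names]
    text_tokens = tokenizer(prompts)                                    # (C, seq_len)

    with torch.no_grad():
        image_emb = F.normalize(model.encode_image(image), dim=-1)      # (1, d)
        text_emb = F.normalize(model.encode_text(text_tokens), dim=-1)  # (C, d)

    # Cosine similarity between the image and every class description;
    # the most similar description gives the zero-shot prediction.
    logits = image_emb @ text_emb.T                                     # (1, C)
    probs = logits.softmax(dim=-1)
    return class_names[probs.argmax().item()]
```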
Both these approaches share a key similarity. They try to predict the exact words of the text accompanying each image. This is a difficult task due to the wide variety of descriptions, comments, and related text that co-occur with images. Recent work in contrastive representation learning for images has found that contrastive objectives can learn better representations than their equivalent predictive objective (Tian et al., 2019). Other work has found that although generative models of images can learn high quality image representations, they require over an order of magnitude more compute than contrastive models with the same performance (Chen et al., 2020a). Noting these findings, we explored training a system to solve the potentially easier proxy task of predicting only which text as a whole is paired with which image and not the exact words of that text.
Earlier approaches tried to predict the exact words accompanying each image, which is a hard objective and learns slowly. The authors instead train on the easier proxy task of predicting which text as a whole is paired with which image, not the exact words of that text. Figure 2 shows this is about 4x more efficient than the predictive baseline. Truly impressive..
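Below is a sketch of that pairing objective as I understand it: a symmetric cross-entropy over the N x N similarity matrix of a batch, where the correct (image, text) pairs lie on the diagonal. This is a PyTorch reconstruction for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, logit_scale):
    """Symmetric cross-entropy over pairwise similarities: each image's positive
    is the text it was paired with, and vice versa."""
    image_emb = F.normalize(image_emb, dim=-1)       # (N, d)
    text_emb = F.normalize(text_emb, dim=-1)         # (N, d)

    logits = logit_scale * image_emb @ text_emb.T    # (N, N) pairwise similarities
    targets = torch.arange(logits.size(0), device=logits.device)  # matched pairs on the diagonal

    loss_i = F.cross_entropy(logits, targets)        # image -> text direction
    loss_t = F.cross_entropy(logits.T, targets)      # text -> image direction
    return (loss_i + loss_t) / 2
```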
Due to the large size of our pre-training dataset, over-fitting is not a major concern and the details of training CLIP are simplified compared to the implementation of Zhang et al. (2020). We train CLIP from scratch without initializing the image encoder with ImageNet weights or the text encoder with pre-trained weights.
The pre-training dataset is so large that over-fitting is not a major concern. CLIP is trained from scratch, without initializing the image encoder with ImageNet weights or the text encoder with pre-trained weights.
We instead use only a linear projection to map from each encoder’s representation to the multi-modal embedding space. We did not notice a difference in training efficiency between the two versions and speculate that non-linear projections may be co-adapted with details of current image-only self-supervised representation learning methods. We also remove the text transformation function t_u from Zhang et al. (2020) which samples a single sentence at uniform from the text, since many of the (image, text) pairs in CLIP’s pre-training dataset are only a single sentence.
Only a linear projection maps each encoder's representation into the multi-modal embedding space; there was little difference compared to a non-linear projection. The text transformation function was also removed, since most of the pre-training captions are only a single sentence anyway.
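As a rough sketch of that simplification, the projection from each encoder into the joint space is a single linear layer rather than an MLP head. The dimensions below are illustrative assumptions, not the paper's configuration.

```python
import torch.nn as nn

# One linear layer per modality into the shared multi-modal embedding space.
image_proj = nn.Linear(in_features=2048, out_features=512, bias=False)  # image encoder output -> joint space
text_proj = nn.Linear(in_features=768, out_features=512, bias=False)    # text encoder output -> joint space

# image_emb = image_proj(image_encoder(pixels))
# text_emb  = text_proj(text_encoder(tokens))
```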
We also simplify the image transformation function t_v. A random square crop from resized images is the only data augmentation used during training. Finally, the temperature parameter which controls the range of the logits in the softmax, τ, is directly optimized during training as a log-parameterized multiplicative scalar to avoid tuning it as a hyper-parameter.
A random square crop from resized images is the only image augmentation. The temperature parameter τ, which controls the range of the softmax logits, is optimized directly during training as a log-parameterized multiplicative scalar instead of being tuned as a hyper-parameter.
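A small sketch of that log-parameterization: the learnable parameter stores log(1/τ), so the multiplicative scale applied to the logits stays positive after exponentiation and τ never has to be hand-tuned. The 0.07 initialization is the commonly used value and is an assumption here, not necessarily the paper's exact setting.

```python
import numpy as np
import torch
import torch.nn as nn

# Learnable log(1/τ); exponentiating recovers the positive scale applied to the logits.
logit_scale = nn.Parameter(torch.tensor(np.log(1 / 0.07), dtype=torch.float32))

# At each training step:
# logits = logit_scale.exp() * image_emb @ text_emb.T
```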