
Both towers are trained to minimize a contrastive loss, which encourages representations of paired images and texts to be similar, and representations of non-paired images and texts to be dissimilar. At test time, the resulting model can be used for zero-shot image classification by comparing the image embedding with embeddings of textual class descriptions.

Properly paired image-text examples are trained to have similar representations, while non-paired ones are pushed apart.


In this paper, we adopt a contrastive learning framework and propose a more data- and compute-efficient strategy named contrastive-tuning. The key idea is to tune the text tower using image-text data, while using a pre-trained, strong image model as the image tower. During training, both towers’ weights can be locked or unlocked, leading to different design choices that are illustrated in Figure 2. Specifically, we find that locking the image tower works best, as shown in Figure 1. We call this specific instance of contrastive-tuning “Locked-image Tuning” (LiT), which just teaches a text model to read out suitable representations from a pre-trained image model.

The authors adopt a contrastive learning framework and propose a more data- and compute-efficient strategy called contrastive-tuning. The key idea is to tune the text tower on image-text data while using a strong pre-trained image model as the image tower; during training, each tower's weights can be locked or unlocked (Figure 2). Locking the image tower works best, and this instance is called "Locked-image Tuning" (LiT).

1. Methods

Contrastive pre-training

Contrastive pre-training is one particularly effective approach for training models from image-text data, which was recently proven to work well in practice [31, 46]. We take a closer look at this approach and propose a simple, yet highly effective recipe to significantly enhance contrastive pre-training from image-text data.

Contrastive pre-training is an effective approach for training models from image-text data.

The key idea behind the contrastive pre-training approach is to learn two embedding models: an image model and a text model, both of which produce representations of the same dimensionality. These models are trained using a contrastive loss.

The key idea of contrastive pre-training is to learn two embedding models (an image model and a text model) that produce representations of the same dimensionality. They are trained with a contrastive loss, which pulls paired image-text embeddings close together and pushes non-paired embeddings far apart.
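The symmetric contrastive loss described above can be sketched in a few lines. This is a minimal numpy version of a CLIP-style (InfoNCE) loss, not the paper's actual implementation; the function name and `temperature` default are illustrative.

```python
import numpy as np

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss over a batch of paired embeddings.

    img_emb, txt_emb: (batch, dim) arrays; row i of each is a matched pair.
    """
    # L2-normalize so the dot product is cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (batch, batch) similarity matrix

    def cross_entropy(logits):
        # The correct "class" for row i is column i (its paired example).
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image->text and text->image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Matched pairs on the diagonal get high similarity, so well-aligned embeddings yield a low loss; shuffling the pairing raises it.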

After image and text towers are trained, they can be readily used for zero-shot classification: class names or descriptions are embedded with the text model. Then, for a given image the label is selected that has the embedding closest to the embedding of the image.

After training, the towers can be used for zero-shot classification: class names or descriptions are embedded with the text model, and the label whose embedding is closest to the image embedding is selected.

Contrastive-tuning

Contrastive pre-training can be viewed as learning two tasks at the same time: (1) learning an image embedding and (2) learning a text embedding to align with the image embedding space. While contrastive pre-training on image-text data works well for solving both of these tasks simultaneously, it may not be the optimal approach.

Contrastive pre-training solves two tasks at the same time: (1) learning an image embedding, and (2) learning a text embedding that aligns with the image embedding space. Solving both simultaneously may not be the optimal approach.

However, this common approach has a clear weakness: it is limited to a predefined set of categories and, thus, the resulting models can only reason about these categories. In contrast, image-text data does not have this limitation, as it learns from the free-form text that potentially spans a broad range of real-life concepts. On the other hand, image-text data that is available may be of lower quality (for learning image embeddings) than carefully curated datasets.

Pre-training on a predefined set of categories limits the model to reasoning only about those categories. Image-text data has no such limit, since free-form text can span a broad range of real-life concepts. On the other hand, the available image-text data may be of lower quality than carefully curated datasets.

We propose contrastive-tuning to combine advantages of both sources of data. One specific way of doing this is to initialise the contrastive pre-training with an image model that was already pre-trained using cleaner (semi-)manually labeled data. This way the image-text alignment is learned independently of image embedding, enabling benefit from both data sources.
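The "locked image tower" idea above can be sketched as a single training step where only the text tower's weights receive a gradient. This is a toy numpy version with linear towers and a simplified alignment loss standing in for the contrastive loss; it is illustrative of the locking mechanic, not the paper's actual setup.

```python
import numpy as np

def lit_training_step(img_features, token_features, W_img, W_txt, lr=0.1):
    """One contrastive-tuning step with a locked image tower (LiT sketch).

    W_img: pre-trained image projection; frozen, used only in the forward pass.
    W_txt: text projection; the only parameter that is updated.
    """
    img_emb = img_features @ W_img    # locked tower: no gradient computed for W_img
    txt_emb = token_features @ W_txt  # trainable text tower

    # Simplified loss: mean squared distance between paired embeddings
    # (stands in for the contrastive loss to keep the gradient short).
    diff = txt_emb - img_emb
    loss = np.mean(diff ** 2)

    # Gradient flows only into W_txt; W_img stays fixed ("locked").
    grad_W_txt = 2.0 * token_features.T @ diff / diff.size
    W_txt = W_txt - lr * grad_W_txt
    return loss, W_img, W_txt
```

Iterating this step drives the text embeddings toward the frozen image embedding space, which is exactly what LiT does: the text model learns to "read out" representations from the pre-trained image model, which never changes.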