The data sets used to evaluate retrieval performance are often small, e.g., the 1,000-image test set of Flickr-30k. As a result, retrieval performance fluctuates sharply with changes in the training data distribution. Although current methods achieve good retrieval performance, they often do not perform well on ImageNet classification tasks.

Problem statement: when the test set is small, retrieval performance changes sharply as the data distribution shifts. Current methods achieve good retrieval results, but they do not work well on ImageNet classification tasks.


We propose a bilingual model named Alter ego CLIP (AltCLIP), which achieves strong performance on ImageNet and multimodal retrieval tasks in both English and Chinese. AltCLIP learns a strong bilingual language-image representation under a two-stage framework (see Figure 1 for an overview). In the first stage, we use Teacher Learning to distill the knowledge learned by CLIP. In the second stage, we train the model via Contrastive Learning (Hadsell et al., 2006) on a relatively small amount of Chinese and English text-image pairs.

AltCLIP is trained in two stages. First, knowledge is distilled from CLIP via Teacher Learning. Second, the model is trained with Contrastive Learning on a small amount of Chinese and English text-image pairs.

1 Methodology

Teacher Learning Stage

In this stage, we perform Teacher Learning (Hinton et al., 2015) on text encoders. We use the text encoder from CLIP (Radford et al., 2021) as the teacher text encoder, and the XLM-R (Conneau et al., 2020) model pretrained on multilingual data as the student encoder. A fully-connected layer is added to transform the output of the XLM-R model into the same output dimension as the teacher encoder. We use parallel text data in both English and Chinese to distill the knowledge of text-image alignment.

The first stage is distillation. XLM-R (a cross-lingual model), pretrained on multilingual data, serves as the student encoder.

Given parallel text input (sent1, sent2), the teacher text encoder generates the learning target from input sent1, namely the embedding of the [TOS] token, denoted x^t_tos. The student text encoder generates the embedding x^s_cls from input sent2. We minimize the Mean Squared Error (MSE) between x^t_tos and x^s_cls.

Given a sentence pair, the teacher text encoder produces the learning target from sent1, and the student text encoder produces an embedding from sent2. MSE minimizes the difference between the two embeddings.

At inference time, only the student encoder is used as the text encoder.
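The Teacher Learning stage can be sketched in PyTorch as below. This is a hedged illustration, not the authors' code: the class name `StudentTextEncoder`, the stand-in backbone, and the hidden/output dimensions (1024 for XLM-R-large, 768 for CLIP's text encoder) are assumptions for the sketch; in practice the backbone would be the pretrained XLM-R model.

```python
import torch
import torch.nn as nn

class StudentTextEncoder(nn.Module):
    """Sketch of the student: an XLM-R-like backbone plus the added
    fully-connected layer projecting to the teacher's output dim."""
    def __init__(self, xlmr_hidden=1024, teacher_dim=768):
        super().__init__()
        # Stand-in for the pretrained XLM-R backbone (assumption:
        # it returns per-token features of size xlmr_hidden).
        self.backbone = nn.Linear(xlmr_hidden, xlmr_hidden)
        # FC layer mapping the [CLS] output to the teacher's dim.
        self.proj = nn.Linear(xlmr_hidden, teacher_dim)

    def forward(self, token_feats):
        h = self.backbone(token_feats)   # (batch, seq, hidden)
        cls = h[:, 0]                    # [CLS] embedding -> x^s_cls
        return self.proj(cls)            # (batch, teacher_dim)

def distill_loss(x_t_tos, x_s_cls):
    # MSE between the teacher's [TOS] embedding (from sent1)
    # and the student's projected [CLS] embedding (from sent2).
    return nn.functional.mse_loss(x_s_cls, x_t_tos)
```

At inference time only the student is kept, so the teacher embedding `x_t_tos` appears only in the loss, never in the deployed model.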

Contrastive Learning Stage

This stage of training aims to further improve text-image alignment by contrastive learning on multilingual text-image pairs.

We use Contrastive Loss (Hadsell et al., 2006) between the output projection of the image encoder and text encoder, as done similarly in previous work (Radford et al., 2021). We follow LiT (Zhai et al., 2022) to freeze the image encoder at training time and only update the parameters in the text encoder.

This stage trains on multilingual image-text pairs. The Contrastive Loss is computed between the output projections of the image encoder and the text encoder. Following LiT, the image encoder is frozen and only the text encoder is trained. (Plan to read the LiT paper; added to the reading list.)
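The second stage can be sketched as a symmetric CLIP-style contrastive loss with the image tower frozen, LiT-style. A minimal illustration, assuming cosine-similarity logits and a fixed temperature of 0.07 (the function names and the temperature value are assumptions, not taken from the paper):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched
    image/text projection pairs (row i matches row i)."""
    img = F.normalize(img_emb, dim=-1)     # unit-norm projections
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature   # (batch, batch) similarities
    targets = torch.arange(img.size(0))    # diagonal = positive pairs
    # Cross-entropy in both directions: image->text and text->image.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

def freeze(module):
    """LiT-style freezing: stop gradients to the image encoder so
    only the text encoder's parameters are updated."""
    for p in module.parameters():
        p.requires_grad = False
```

With the image encoder frozen, the optimizer receives gradients only through the text tower, which is what lets this stage get away with a relatively small amount of text-image pairs.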