We follow the procedures described in the Conceptual Captions dataset (Sharma et al., 2018) to obtain a large noisy dataset. But instead of applying the complex filtering and post-processing steps proposed by Sharma et al. (2018) to clean the dataset, we only apply simple frequency-based filtering. The resulting dataset is noisy, but is two orders of magnitude larger than the Conceptual Captions dataset.
Rather than applying complex filtering and post-processing to obtain a clean dataset, they apply simple frequency-based filtering. The resulting dataset is noisy, but much larger in scale than the Conceptual Captions dataset.
We name our model ALIGN: A Large-scale ImaGe and Noisy-text embedding. Image and text encoders are learned via a contrastive loss (formulated as normalized softmax) that pushes the embeddings of matched image-text pairs together while pushing those of non-matched image-text pairs apart.
The model is named ALIGN, and the image and text encoders are trained with a contrastive loss. It is effective for both supervised and unsupervised learning.
The key difference is that the text encoder generates the “label” weights.
The biggest difference from other approaches is that the text encoder generates the “label” weights.
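The normalized-softmax contrastive objective above can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation: matched pairs sit on the diagonal of the in-batch similarity matrix, and the loss is the symmetric image-to-text plus text-to-image cross-entropy. The temperature value here is illustrative only.

```python
import numpy as np

def align_contrastive_loss(image_emb, text_emb, temperature=0.05):
    """Sketch of a symmetric normalized-softmax contrastive loss.

    image_emb, text_emb: (N, D) arrays where row i of each is a matched pair.
    temperature: illustrative scaling of the logits (not the paper's value).
    """
    # L2-normalize so dot products become cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # (N, N); diagonal = matches

    def xent_diag(l):
        # softmax cross-entropy with the diagonal as the target class
        l = l - l.max(axis=1, keepdims=True)       # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # symmetric loss: image-to-text plus text-to-image
    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))
```

The loss pulls each matched pair together (maximizing its diagonal logit) while pushing it away from every other caption/image in the batch, which is why large batch sizes help this objective.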
Here, we trade quality for scale by relaxing most of the cleaning steps in the original work. Instead, we only apply minimal frequency-based filtering as detailed below. The result is a much larger (1.8B image-text pairs) but noisier dataset.
Whereas the original work filtered strictly, the authors filter less strictly, so more data survives filtering and the dataset is larger than others, but also noisier.
Image-based filtering.
We remove pornographic images and keep only images whose shorter dimension is larger than 200 pixels and whose aspect ratio is smaller than 3. To ensure that we don’t train on test images, we also remove duplicates or near-duplicates of test images in all downstream evaluation datasets.
Images with an aspect ratio smaller than 3 and a shorter dimension larger than 200 pixels are kept, while pornographic images are removed. Duplicates and near-duplicates of test images are also removed.
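The image-filtering rules above can be sketched as a single predicate. The pornography and near-duplicate checks are assumed to come from separate classifiers not shown here; this is an illustrative sketch, not the paper's pipeline code.

```python
def keep_image(width, height, is_pornographic=False, is_test_duplicate=False):
    """Sketch of the image-based filtering rules.

    is_pornographic / is_test_duplicate are assumed outputs of separate
    detectors (hypothetical placeholders here).
    """
    if is_pornographic or is_test_duplicate:
        return False
    if min(width, height) <= 200:  # shorter dimension must exceed 200 px
        return False
    aspect_ratio = max(width, height) / min(width, height)
    return aspect_ratio < 3        # discard extreme panoramas/banners
```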
Text-based filtering.
We exclude alt-texts that are shared by more than 10 images. These alt-texts are often irrelevant to the content of the images (e.g., “1920x1080”, “alt img”, and “cristina”).
Alt-texts shared by more than 10 images (e.g., “1920x1080”) are excluded, since they are often irrelevant to the image content. Alt-texts containing rare tokens are also discarded, as are those shorter than 3 unigrams or longer than 20 unigrams.
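The text-side filters can be sketched as below. Whitespace tokenization stands in for the paper's unigram handling, and the rare-token filter is omitted for brevity; this is an assumed simplification, not the actual pipeline.

```python
from collections import Counter

def filter_alt_texts(pairs):
    """Sketch of the text-based filtering.

    pairs: list of (image_id, alt_text).
    Drops alt-texts shared by more than 10 images and those shorter than
    3 or longer than 20 whitespace tokens (standing in for unigrams).
    """
    counts = Counter(text for _, text in pairs)
    kept = []
    for image_id, text in pairs:
        if counts[text] > 10:
            continue  # boilerplate alt-text such as "1920x1080"
        n_tokens = len(text.split())
        if not (3 <= n_tokens <= 20):
            continue  # too short or too long to describe the image
        kept.append((image_id, text))
    return kept
```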
We pre-train ALIGN using a dual-encoder architecture. The model consists of a pair of image and text encoders with a cosine-similarity combination function at the top. We use EfficientNet with global pooling (without training the 1x1 conv layer in the classification head) as the image encoder and BERT with the [CLS] token embedding as the text encoder (we generate a 100k wordpiece vocabulary from our training dataset).
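The dual-encoder layout can be sketched as two independent towers whose outputs interact only through cosine similarity. The linear "encoders" below are placeholders standing in for EfficientNet and BERT; everything else (dimensions, seeding) is illustrative.

```python
import numpy as np

class DualEncoder:
    """Minimal sketch of a dual-encoder with a cosine-similarity head.

    The random linear maps are hypothetical stand-ins for the real
    EfficientNet image tower and BERT text tower.
    """

    def __init__(self, image_dim, text_dim, embed_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.w_img = rng.normal(size=(image_dim, embed_dim))
        self.w_txt = rng.normal(size=(text_dim, embed_dim))

    def encode_image(self, x):
        z = x @ self.w_img
        return z / np.linalg.norm(z, axis=-1, keepdims=True)  # L2-normalize

    def encode_text(self, x):
        z = x @ self.w_txt
        return z / np.linalg.norm(z, axis=-1, keepdims=True)

    def similarity(self, image_feats, text_feats):
        # cosine similarity = dot product of L2-normalized embeddings
        return self.encode_image(image_feats) @ self.encode_text(text_feats).T
```

Because the two towers never attend to each other, image and text embeddings can be precomputed independently, which is what makes this architecture practical for large-scale retrieval.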