In this context, given a target level of performance, the preferred model is not the fastest to train but the fastest at inference, and although it may be cheaper to train a large model to reach a certain level of performance, a smaller one trained longer will ultimately be cheaper at inference.

Problem statement: what matters is a model that is fast at inference. A smaller model takes longer to train (by running over more data/epochs?), but it guarantees cheaper inference, so such models are preferred.

The focus of this work is to train a series of language models that achieve the best possible performance at various inference budgets, by training on more tokens than what is typically used. The resulting models, called LLaMA, range from 7B to 65B parameters with competitive performance compared to the best existing LLMs.

So Meta trained models that reach the best possible performance at a range of inference budgets, by training on more tokens than usual; the resulting models are called LLaMA.

In the rest of this paper, we present an overview of the modifications we made to the transformer architecture (Vaswani et al., 2017), as well as our training method. Finally, we expose some of the biases and toxicity encoded in our models, using some of the most recent benchmarks from the responsible AI community.

The paper presents the changes to the model architecture and also shows how biased and toxic the models' outputs are; reporting this seems to be becoming both an obligation and a responsibility.

1 Approach

Pre-training Data

English CommonCrawl [67%]. We preprocess five CommonCrawl dumps, ranging from 2017 to 2020, with the CCNet pipeline (Wenzek et al., 2020).

Five CommonCrawl dumps spanning 2017-2020 are preprocessed with the CCNet pipeline: deduplication, removal of non-English pages, filtering of low-quality content, and a step that keeps pages resembling those referenced by Wikipedia.
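The paper delegates these steps to the CCNet pipeline; as a rough illustration of what such a pipeline does, here is a minimal Python sketch of line-level deduplication and page-level filtering. The helper callables (`lang_id`, `quality_score`, `wiki_ref_classifier`) and the 0.5 threshold are hypothetical stand-ins for the fastText language identifier, the n-gram quality model, and the Wikipedia-reference classifier described in the paper, not CCNet's actual code.

```python
# Simplified sketch of CCNet-style preprocessing (illustrative only).
import hashlib


def dedup_lines(pages, seen_hashes=None):
    """Line-level deduplication: drop lines whose normalized hash was already seen."""
    seen_hashes = set() if seen_hashes is None else seen_hashes
    for page in pages:
        kept = []
        for line in page.splitlines():
            digest = hashlib.sha1(line.strip().lower().encode("utf-8")).hexdigest()
            if digest not in seen_hashes:
                seen_hashes.add(digest)
                kept.append(line)
        yield "\n".join(kept)


def keep_page(page, lang_id, quality_score, wiki_ref_classifier):
    """Keep a page only if it is English, scores well enough under a quality model,
    and looks like the pages cited as references on Wikipedia (all helpers assumed)."""
    return (
        lang_id(page) == "en"
        and quality_score(page) > 0.5      # hypothetical threshold
        and wiki_ref_classifier(page)      # True if "reference-like"
    )
```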

C4 [15%].

We thus included the publicly available C4 dataset (Raffel et al., 2020) in our data. The main difference with CCNet is the quality filtering, which mostly relies on heuristics such as presence of punctuation marks or the number of words and sentences in a webpage.

The C4 dataset also goes through steps like deduplication, but unlike CCNet its quality filtering relies on heuristics such as the presence of punctuation marks and the number of words and sentences in a webpage.
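To make the contrast with CCNet concrete, here is a minimal sketch of the kind of surface heuristics C4-style filtering uses (terminal punctuation, word and sentence counts). The actual C4 rules are more involved; the thresholds below are assumptions for illustration.

```python
import re


def c4_like_filter(page: str, min_words: int = 50, min_sentences: int = 3) -> bool:
    """Keep a page only if its lines end in terminal punctuation and the page has
    enough words and sentences -- surface heuristics in the spirit of C4."""
    lines = [l.strip() for l in page.splitlines() if l.strip()]
    if not lines:
        return False
    # Require terminal punctuation on every line (simplified assumption).
    if not all(l.endswith((".", "!", "?", '"')) for l in lines):
        return False
    words = re.findall(r"\w+", page)
    sentences = [s for s in re.split(r"[.!?]+", page) if s.strip()]
    return len(words) >= min_words and len(sentences) >= min_sentences
```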

Github [4.5%]. We use the public GitHub dataset available on Google BigQuery. We only kept projects that are distributed under the Apache, BSD and MIT licenses.

A deduplication filter is of course applied here too; the interesting point is that rather than training on everything indiscriminately, only code under permissive licenses such as Apache, BSD, and MIT is used.
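A minimal sketch of what such a license filter plus exact-match deduplication could look like, assuming each repository record exposes a `license` field as in the BigQuery GitHub dataset; the record layout and license identifier spellings are illustrative assumptions.

```python
import hashlib

# Illustrative set of permissive license identifiers (assumed spellings).
PERMISSIVE_LICENSES = {"apache-2.0", "bsd-2-clause", "bsd-3-clause", "mit"}


def keep_repo(record: dict) -> bool:
    """Keep only projects distributed under Apache, BSD, or MIT licenses."""
    return record.get("license", "").lower() in PERMISSIVE_LICENSES


def dedup_files(files):
    """Exact-match deduplication at the file level, keyed on a content hash."""
    seen, kept = set(), []
    for path, content in files:
        digest = hashlib.sha1(content.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append((path, content))
    return kept
```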

Wikipedia [4.5%]. We use Wikipedia dumps covering 20 languages, which use either the Latin or Cyrillic scripts: bg, ca, cs, da, de, en, es, fr, hr, hu, it, nl, pl, pt, ro, ru, sl, sr, sv, uk. We process the data to remove hyperlinks, comments and other formatting boilerplate.

Wikipedia data in many different languages is pulled into a single corpus for training.
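As an illustration of what removing hyperlinks, comments, and formatting boilerplate can look like, here is a simplified regex-based sketch over wiki markup. Real pipelines use dedicated extractors; these patterns are simplified assumptions and do not handle nested templates.

```python
import re


def clean_wiki_markup(text: str) -> str:
    """Strip hyperlinks, comments, and common formatting boilerplate from wiki markup."""
    text = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)          # HTML comments
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)                       # {{templates}} (non-nested)
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)    # [[link|label]] -> label
    text = re.sub(r"\[https?://\S+ ?([^\]]*)\]", r"\1", text)        # [url label] -> label
    text = re.sub(r"'{2,}", "", text)                                # ''italic'' / '''bold''' quotes
    return re.sub(r"\n{3,}", "\n\n", text).strip()
```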

Gutenberg and Books3 [4.5%].

We include two book corpora: the Gutenberg Project, which contains books that are in the public domain, and the Books3 section of ThePile (Gao et al., 2020), a publicly available dataset for training large language models. We perform deduplication at the book level, removing books with more than 90% content overlap.

Books come from the Gutenberg Project and Books3; any book that overlaps another by more than 90% of its content is removed.
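The paper does not spell out how "content overlap" is measured, so the sketch below uses word n-gram (shingle) Jaccard similarity as an assumed stand-in, with the 90% threshold taken from the text.

```python
def shingles(text: str, n: int = 5) -> set:
    """Set of word n-grams ("shingles") representing a book's content."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap(a: set, b: set) -> float:
    """Jaccard similarity between two shingle sets (assumed overlap measure)."""
    return len(a & b) / max(1, len(a | b))


def dedup_books(books: dict, threshold: float = 0.9) -> list:
    """Greedily keep books, dropping any whose overlap with a kept book exceeds the threshold."""
    kept_titles, kept_shingles = [], []
    for title, text in books.items():
        s = shingles(text)
        if all(overlap(s, other) <= threshold for other in kept_shingles):
            kept_titles.append(title)
            kept_shingles.append(s)
    return kept_titles
```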