To advance the state of the art of instruction-tuning for LLMs, we propose for the first time to use GPT-4 as a teacher for self-instruct tuning. Our paper makes the following contributions:
Instruction tuning has been producing strong results; here the authors use GPT-4 as the teacher for self-instruct tuning.

- GPT-4 data. We release data generated by GPT-4, including the 52K instruction-following dataset in both English and Chinese, and the GPT-4-generated feedback data that rate the outputs of three instruction-tuned models.
- Models & Evaluation. Based on the GPT-4-generated data, we have developed instruction-tuned LLaMA models and reward models. To evaluate the quality of instruction-tuned LLMs, we use three metrics evaluated on test samples (i.e., unseen instructions): human evaluation on three alignment criteria, automatic evaluation using GPT-4 feedback, and ROUGE-L on un-natural*
- The main contributions are as above: they build the GPT-4 data, and collect GPT-4-generated feedback data that rates the outputs of three instruction-tuned models.
- Models & evaluation: based on the GPT-4-generated data, the authors tune LLaMA models and reward models. To measure quality, outputs are evaluated with three metrics (a minimal ROUGE-L scoring sketch follows these notes).
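As a hedged illustration of the third metric only, here is a minimal sketch of scoring a tuned model's outputs against GPT-4 reference answers with ROUGE-L, using the open-source `rouge_score` package; the file name and JSON fields are assumptions, not the paper's actual pipeline.

```python
# Minimal sketch: score a tuned model's outputs against GPT-4 reference
# answers with ROUGE-L, via the open-source rouge-score package
# (pip install rouge-score). File name and fields are assumptions.
import json
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

with open("unnatural_eval.json") as f:  # hypothetical eval file
    examples = json.load(f)             # [{"reference": ..., "prediction": ...}, ...]

f1s = [
    scorer.score(ex["reference"], ex["prediction"])["rougeL"].fmeasure
    for ex in examples
]
print(f"mean ROUGE-L F1: {sum(f1s) / len(f1s):.4f}")
```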
1. DATASET
Data Collection.
Each instruction describes the task the model should perform. We follow the same prompting strategy to consider cases with and without input, where the input is the optional context for the task. The output is the answer to the instruction instance, generated using LLMs.
Each instruction tells the model what it should do. The authors use the same prompting strategy to cover both cases, with and without the optional context/input, depending on the task. The output is the LLM's answer (the templates are sketched below).
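For reference, this is the prompting strategy inherited from Alpaca; the two templates (with and without the optional input field) can be sketched in Python as below. The `build_prompt` helper is an illustrative addition, not part of the paper.

```python
# The two Alpaca-style templates: one for instructions that come with an
# optional input/context field, one for instructions without it.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:"
)

def build_prompt(instruction: str, inp: str = "") -> str:
    """Pick the template based on whether the optional input is present."""
    if inp:
        return PROMPT_WITH_INPUT.format(instruction=instruction, input=inp)
    return PROMPT_NO_INPUT.format(instruction=instruction)
```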
- English Instruction-Following Data: For the 52K instructions collected in Alpaca (Taori et al., 2023), one English GPT-4 answer is provided for each. The details are described in Algorithm 1.
- Chinese Instruction-Following Data: We use ChatGPT to translate the 52K instructions into Chinese and ask GPT-4 to answer them in Chinese.
- Comparison Data: We ask GPT-4 to rate its own response from 1 to 10. Furthermore, we ask GPT-4 to compare and rate the responses from the three models, including GPT-4, GPT-3.5 and OPT-IML (Iyer et al., 2022).
- Answers on Unnatural Instructions: The GPT-4 answers are decoded on the core dataset of 68K instruction-input-output triplets (Honovich et al., 2022). The subset is used to quantify the gap between GPT-4 and our instruction-tuned models at scale.
- The English data is built by collecting one GPT-4 answer for each of the 52K Alpaca instructions.
- For Chinese, ChatGPT translates the instructions into Chinese and GPT-4 answers them in Chinese (this could also be done with a LLaMA model).
- For the comparison data, GPT-4 rates responses from 1 to 10, and additionally compares and rates the responses of three models (GPT-4, GPT-3.5, and OPT-IML).
- For answers on unnatural instructions, GPT-4 answers are decoded on the core 68K instruction dataset; this subset is used to quantify the gap between GPT-4 and the authors' instruction-tuned models (a one-pass generation sketch follows this list).
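A minimal sketch of what the one-pass answer generation (Algorithm 1) plus self-rating could look like with the OpenAI Python client, reusing `build_prompt` from the template sketch above. The model name, prompts, file names, and rating format are assumptions, not the paper's released code.

```python
# Sketch of one-pass data generation: GPT-4 answers each Alpaca
# instruction, then rates its own answer from 1 to 10. Prompts, decoding
# settings, and file names are assumptions, not the paper's released code.
# Requires `pip install openai` and an OPENAI_API_KEY in the environment;
# build_prompt comes from the template sketch above.
import json
from openai import OpenAI

client = OpenAI()

def ask_gpt4(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
    )
    return resp.choices[0].message.content

with open("alpaca_52k_instructions.json") as f:  # hypothetical file name
    tasks = json.load(f)

dataset = []
for t in tasks:
    output = ask_gpt4(build_prompt(t["instruction"], t.get("input", "")))
    rating = ask_gpt4(
        "Rate the following response to the instruction on a scale of 1 "
        "to 10. Reply with a single number.\n\n"
        f"Instruction:\n{t['instruction']}\n\nResponse:\n{output}"
    )
    dataset.append({**t, "output": output, "gpt4_rating": rating.strip()})

with open("gpt4_instruction_data.json", "w") as f:
    json.dump(dataset, f, ensure_ascii=False, indent=2)
```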
Data Statistics.

GPT-4 tends to generate longer sequences than GPT-3.5. The GPT-3.5 data in Alpaca exhibits an output distribution with a longer tail than our GPT-4-generated output distribution, probably because the Alpaca dataset involves an iterative data collection process that removes similar instruction instances at each iteration, which is absent in our current one-time data generation. Despite this simple process, the GPT-4-generated instruction-following data demonstrates more favorable alignment performance, as shown in experiments later.
GPT-4 generates longer sequences than GPT-3.5. The GPT-3.5 (Alpaca) outputs show a longer-tailed length distribution than the GPT-4 outputs, probably because of how each dataset was collected. Even with the simpler one-time collection, the GPT-4 data achieves better alignment performance (a quick length-comparison sketch is below).
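A quick, hedged way to reproduce this length comparison; the two JSON file names and whitespace token counts are simplifying assumptions.

```python
# Quick sketch: compare output-length distributions of the Alpaca
# (GPT-3.5) data and the GPT-4-generated data. Whitespace token counts
# and file names are simplifying assumptions.
import json
import statistics

def output_lengths(path: str) -> list[int]:
    with open(path) as f:
        return sorted(len(ex["output"].split()) for ex in json.load(f))

for name, path in [("GPT-3.5 (Alpaca)", "alpaca_data.json"),
                   ("GPT-4", "gpt4_instruction_data.json")]:
    ls = output_lengths(path)
    print(f"{name}: mean={statistics.mean(ls):.1f} "
          f"median={ls[len(ls) // 2]} p99={ls[int(0.99 * len(ls))]}")
```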
2. INSTRUCTION-TUNING LANGUAGE MODELS