Thus, we say that the language modeling objective is misaligned. Averting these unintended behaviors is especially important for language models that are deployed and used in hundreds of applications.
Because it can produce toxic content, the language modeling objective is considered misaligned. Avoiding these unintended behaviors matters especially for language models that are deployed and used across hundreds of applications.
We make progress on aligning language models by training them to act in accordance with the user’s intention (Leike et al., 2018). This encompasses both explicit intentions such as following instructions and implicit intentions such as staying truthful, and not being biased, toxic, or otherwise harmful.
The authors make progress on aligning language models by training them to act according to the user's intent. This covers both explicit intentions, such as following instructions, and implicit intentions, such as staying truthful and not being biased, toxic, or otherwise harmful.
We focus on fine-tuning approaches to aligning language models. Specifically, we use reinforcement learning from human feedback (RLHF; Christiano et al., 2017; Stiennon et al., 2020) to fine-tune GPT-3 to follow a broad class of written instructions (see Figure 2).
The authors focus on fine-tuning approaches to aligning language models. Specifically, they use reinforcement learning from human feedback (RLHF) to fine-tune GPT-3 so that it follows a broad class of written instructions (see Figure 2). The GPT-3 model trained through this process is InstructGPT.
We then apply the following three steps (Figure 2).
Three steps are applied (see Figure 2); they are described in detail below.
Step 1: Collect demonstration data, and train a supervised policy. Our labelers provide demonstrations of the desired behavior on the input prompt distribution (see Section 3.2 for details on this distribution). We then fine-tune a pretrained GPT-3 model on this data using supervised learning.
Labelers provide demonstrations (i.e., they write the desired outputs) of the desired behavior on the input prompt distribution. The authors then fine-tune a pretrained GPT-3 model on this data using supervised learning.
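Step 1 reduces to ordinary language-modeling loss on the labeler-written demonstrations. A minimal sketch of that objective, using made-up per-token probabilities rather than a real model (all names and values here are illustrative):

```python
import math

# Hypothetical per-token probabilities a model assigns to the tokens of one
# labeler-written demonstration (toy values, not from a real GPT-3).
demo_token_probs = [0.9, 0.7, 0.8, 0.6]

# Supervised fine-tuning minimizes the mean negative log-likelihood of the
# demonstration tokens -- the standard language-modeling loss on demo data.
sft_loss = -sum(math.log(p) for p in demo_token_probs) / len(demo_token_probs)
```

Fine-tuning then just runs gradient descent on this loss over the whole demonstration dataset.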
Step 2: Collect comparison data, and train a reward model. We collect a dataset of comparisons between model outputs, where labelers indicate which output they prefer for a given input. We then train a reward model to predict the human-preferred output.
A dataset of comparisons between model outputs is collected, where labelers indicate which output they prefer for a given input. A reward model (RM) is then trained to predict the human-preferred output.
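The RM is trained so that the labeler-preferred output receives a higher scalar reward. A common way to write this (a sketch of the standard pairwise ranking loss used for RLHF reward models; function names are my own):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pairwise_rm_loss(r_chosen, r_rejected):
    # Pairwise ranking loss: minimized when the reward of the
    # labeler-preferred output exceeds that of the rejected one.
    return -math.log(sigmoid(r_chosen - r_rejected))

# Toy scalar rewards (illustrative, not from a real model).
loss_ranked_right = pairwise_rm_loss(2.0, 0.5)  # preferred output scored higher
loss_ranked_wrong = pairwise_rm_loss(0.5, 2.0)  # ranking inverted
```

Because the inverted ranking yields a much larger loss, gradient descent pushes the RM to score the human-preferred output higher.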
Step 3: Optimize a policy against the reward model using PPO. We use the output of the RM as a scalar reward. We fine-tune the supervised policy to optimize this reward using the PPO algorithm (Schulman et al., 2017).
The output of the RM is used as a scalar reward. The supervised policy is fine-tuned to optimize this reward using the PPO algorithm.
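PPO (Schulman et al., 2017) optimizes the reward while keeping the new policy close to the old one via a clipped probability ratio. A minimal sketch of that clipped surrogate objective for a single action (the function name and inputs are illustrative):

```python
def ppo_clipped_objective(advantage, ratio, eps=0.2):
    # ratio = pi_new(a|s) / pi_old(a|s); clipping to [1-eps, 1+eps]
    # limits how far one update can move the policy.
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps) * advantage
    # PPO maximizes the pessimistic (smaller) of the two terms.
    return min(unclipped, clipped)
```

For example, with a positive advantage the benefit of increasing the ratio is capped at `1 + eps`, so the fine-tuned policy cannot drift arbitrarily far from the supervised policy in one step.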
Steps 2 and 3 can be iterated continuously; more comparison data is collected on the current best policy, which is used to train a new RM and then a new policy.
Steps 2 and 3 can be iterated continuously: more comparison data is collected on the current best policy, which is then used to train a new RM and a new policy.
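The iteration of steps 2 and 3 can be sketched as a loop. All functions below are placeholder stubs, not a real API; they only show the data flow (comparisons from the current policy feed a new RM, which trains the next policy):

```python
# Hypothetical stubs illustrating the step-2/step-3 loop (illustrative only).
def collect_comparisons(policy):
    # Labelers rank outputs sampled from the current best policy.
    return [("output_a", "output_b", "a")]  # (candidate, candidate, preferred)

def train_reward_model(comparisons):
    return {"trained_on": len(comparisons)}

def train_policy_with_ppo(policy, reward_model):
    return {"iteration": policy["iteration"] + 1}

policy = {"iteration": 0}  # start from the supervised policy of step 1
for _ in range(3):  # steps 2 and 3 repeat
    comparisons = collect_comparisons(policy)
    reward_model = train_reward_model(comparisons)
    policy = train_policy_with_ppo(policy, reward_model)
```

Each pass through the loop corresponds to one round of collecting comparisons and retraining both the RM and the policy.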