We would like to apply reinforcement learning to complex tasks defined only by human judgment, where we can only tell whether a result is good or bad by asking humans.
We want to apply reinforcement learning to complex tasks that are defined only by human judgment. The authors therefore use human labels to train a reward model and then optimize the policy against that reward model.
In this paper, we combine the pretraining advances in natural language processing with human preference learning. We fine-tune pretrained language models with reinforcement learning rather than supervised learning, using a reward model trained from human preferences on text continuations.
The authors combine pretraining advances in NLP with human preference learning: they fine-tune pretrained language models with reinforcement learning rather than supervised learning.
We begin with a vocabulary Σ and a language model ρ which defines a probability distribution over sequences of tokens Σ^n via

ρ(x_0 ⋯ x_{n−1}) = ∏_{0 ≤ k < n} ρ(x_k | x_0 ⋯ x_{k−1})
Starting from the vocabulary Σ, the language model ρ defines a probability distribution over sequences of tokens via the formula above.
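As a rough illustration of what this factorization means (my own sketch, not from the paper): the probability of a sequence is just the product of per-token conditionals. Here `cond_prob` is a hypothetical stand-in for a trained language model's conditional distribution over the vocabulary.

```python
import math

VOCAB = ["<s>", "the", "cat", "sat", "."]

def cond_prob(token, prefix):
    # Dummy uniform conditional; a real rho would be a neural LM over Sigma.
    return 1.0 / len(VOCAB)

def sequence_log_prob(tokens):
    """log rho(x_0 ... x_{n-1}) = sum_k log rho(x_k | x_0 ... x_{k-1})."""
    logp = 0.0
    for k, tok in enumerate(tokens):
        logp += math.log(cond_prob(tok, tokens[:k]))
    return logp

print(sequence_log_prob(["the", "cat", "sat", "."]))  # 4 * log(1/5)
```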
We will apply this model to a task with input space X = Σ^≤m, data distribution D over X, and output space Y = Σ^n. For example, x ∈ X could be an article of up to 1000 words and y ∈ Y could be a 100-word summary.
For example, the task can be restricted so that the input is an article of up to 1000 words and the output is a 100-word summary.
We initialize a policy π = ρ, and then fine-tune π to perform the task well using RL. If the task was defined by a reward function r : X × Y → ℝ, then we could use RL to directly optimize the expected reward:

E_{x∼D, y∼π(·|x)} [r(x, y)]
We start with π = ρ and fine-tune π with RL so that it performs the task well. If the task were defined by a known reward function r, we could use RL to directly optimize the expected reward.
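A minimal sketch of what "directly optimize the expected reward" could look like, assuming the reward function is known. This uses a simple REINFORCE-style estimator rather than the PPO setup the authors actually use; `sample_continuation`, `policy_logprob`, and `reward_fn` are hypothetical helpers, not names from the paper.

```python
import torch

def reinforce_step(policy, optimizer, prompts, sample_continuation,
                   policy_logprob, reward_fn):
    """One gradient step on E_{x~D, y~pi(.|x)}[r(x, y)] via REINFORCE."""
    optimizer.zero_grad()
    losses = []
    for x in prompts:
        y = sample_continuation(policy, x)      # y ~ pi(.|x)
        logp = policy_logprob(policy, x, y)     # log pi(y|x), differentiable
        r = reward_fn(x, y)                     # scalar reward r(x, y)
        losses.append(-r * logp)                # maximize E[r] <=> minimize -r*logp
    loss = torch.stack(losses).mean()
    loss.backward()
    optimizer.step()
    return loss.item()
```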
To do this, we will first use human labels to train a reward model, and then optimize that reward model.
However, the tasks here are defined by human judgment, so the reward can only be learned by asking humans. To do this, the authors first train a reward model on human labels and then optimize against that reward model.
We ask human labelers to pick which of several values of y_i is the best response to a given input x. We ask humans to choose between four options (y_0, y_1, y_2, y_3); considering more options allows a human to amortize the cost of reading and understanding the prompt x. Let b ∈ {0, 1, 2, 3} be the option they select.
Human labelers read the prompt x and pick the best response y among the four options. Given the collected dataset S of such comparisons, the reward model r is then fit with a softmax (cross-entropy) loss over the chosen option b:

loss(r) = −E_{(x, {y_i}_i, b) ∼ S} [ log ( e^{r(x, y_b)} / Σ_i e^{r(x, y_i)} ) ]
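Since the chosen option b acts as a class label over the four candidate rewards, this loss is just a 4-way softmax cross-entropy. A hedged PyTorch sketch with made-up reward values (the `rewards` here would come from a reward model applied to each (x, y_i) pair):

```python
import torch
import torch.nn.functional as F

def preference_loss(rewards, b):
    """
    rewards: tensor of shape (batch, 4), r(x, y_i) for each candidate i
    b:       tensor of shape (batch,), index of the human-chosen option
    Returns the mean negative log softmax probability of the chosen option.
    """
    return F.cross_entropy(rewards, b)

# Toy usage with made-up numbers:
rewards = torch.tensor([[0.2, 1.3, -0.5, 0.0]])
b = torch.tensor([1])
print(preference_loss(rewards, b))  # -log softmax(rewards)[0, 1]
```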
Since the reward model needs to understand language, we initialize it as a random linear function of the final embedding output of the language model policy ρ. To keep the scale of the reward model consistent across training, we normalize it so that it has mean 0 and variance 1 for x ∼ D, y ∼ ρ(·|x).
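A hedged sketch of that initialization and normalization, assuming the policy exposes a final embedding per (x, y) pair; `RewardHead` and its `calibrate` method are illustrative names, not the authors' code.

```python
import torch
import torch.nn as nn

class RewardHead(nn.Module):
    """Randomly initialized linear reward on top of the policy's final embedding."""

    def __init__(self, embed_dim):
        super().__init__()
        self.linear = nn.Linear(embed_dim, 1)           # random init, as in the excerpt
        self.register_buffer("shift", torch.zeros(1))
        self.register_buffer("scale", torch.ones(1))

    def forward(self, final_embedding):                 # (batch, embed_dim) -> (batch,)
        r = self.linear(final_embedding).squeeze(-1)
        return (r - self.shift) / self.scale            # normalized reward

    @torch.no_grad()
    def calibrate(self, sample_embeddings):
        """Set shift/scale so rewards have mean 0, variance 1 on samples x ~ D, y ~ rho(.|x)."""
        r = self.linear(sample_embeddings).squeeze(-1)
        self.shift.fill_(r.mean().item())
        self.scale.fill_(max(r.std().item(), 1e-6))

# Toy usage: random vectors stand in for embeddings of sampled (x, y) pairs.
head = RewardHead(embed_dim=768)
head.calibrate(torch.randn(1024, 768))
```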