Leveraging more than word-level information from unlabeled text, however, is challenging for two main reasons. First, it is unclear what type of optimization objectives are most effective at learning text representations that are useful for transfer. Second, there is no consensus on the most effective way to transfer these learned representations to the target task.
Problem statement: leveraging more than word-level information from unlabeled text is hard, for two reasons: 1. it is unclear which optimization objective is most effective for learning text representations that transfer well; 2. there is no consensus on the most effective way to transfer the learned representations to the target task.
In this paper, we explore a semi-supervised approach for language understanding tasks using a combination of unsupervised pre-training and supervised fine-tuning. Our goal is to learn a universal representation that transfers with little adaptation to a wide range of tasks.
Approach: a semi-supervised method for language understanding that combines unsupervised pre-training with supervised fine-tuning. The goal is to learn a universal representation that transfers to a wide range of tasks with little adaptation.
We employ a two-stage training procedure. First, we use a language modeling objective on the unlabeled data to learn the initial parameters of a neural network model. Subsequently, we adapt these parameters to a target task using the corresponding supervised objective.
Two-stage training: 1. learn the model's initial parameters on unlabeled data with a language modeling objective, then 2. adapt those parameters to the target task using the corresponding supervised objective.
Our training procedure consists of two stages. The first stage is learning a high-capacity language model on a large corpus of text. This is followed by a fine-tuning stage, where we adapt the model to a discriminative task with labeled data.
Two stages: 1. learn a high-capacity language model on a large text corpus; 2. a fine-tuning stage, adapting the model to a discriminative task with labeled data.
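To make the two-stage flow concrete, here is a minimal runnable sketch, assuming toy sizes, synthetic data, and a stand-in embedding "body" instead of the paper's Transformer; the variable names and shapes are my own, not from the paper.

```python
import torch
import torch.nn.functional as F

# Stand-in components: a toy embedding body plays the role of the shared network,
# with one head per stage. Sizes and data are synthetic.
vocab_size, hidden, num_classes = 100, 32, 2
body = torch.nn.Embedding(vocab_size, hidden)         # shared representation
lm_head = torch.nn.Linear(hidden, vocab_size)         # stage-1 (language modeling) head
clf_head = torch.nn.Linear(hidden, num_classes)       # stage-2 head added for the target task

# Stage 1: unsupervised pre-training -- next-token prediction on unlabeled token ids.
unlabeled = torch.randint(0, vocab_size, (8, 16))
opt = torch.optim.SGD(list(body.parameters()) + list(lm_head.parameters()), lr=0.1)
for _ in range(3):
    h = body(unlabeled[:, :-1])                       # (batch, seq-1, hidden)
    loss = F.cross_entropy(lm_head(h).reshape(-1, vocab_size),
                           unlabeled[:, 1:].reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: supervised fine-tuning -- same body parameters, supervised objective of the task.
tokens = torch.randint(0, vocab_size, (8, 16))
labels = torch.randint(0, num_classes, (8,))
opt = torch.optim.SGD(list(body.parameters()) + list(clf_head.parameters()), lr=0.1)
for _ in range(3):
    h = body(tokens)[:, -1, :]                        # last-position activation
    loss = F.cross_entropy(clf_head(h), labels)
    opt.zero_grad(); loss.backward(); opt.step()
```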
Given an unsupervised corpus of tokens U = {u1, . . . , un}, we use a standard language modeling objective to maximize the following likelihood:

L1(U) = Σ_i log P(u_i | u_{i−k}, . . . , u_{i−1}; Θ)    (1)
where k is the size of the context window, and the conditional probability P is modeled using a neural network with parameters Θ. These parameters are trained using stochastic gradient descent [51].
The unsupervised stage maximizes this likelihood (Eq. 1) over the unlabeled corpus.
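A small sketch of what Eq. 1 computes, assuming a toy stand-in model for P(u_i | context; Θ); in the paper this conditional comes from the Transformer decoder described next. The helper names here are hypothetical.

```python
import torch
import torch.nn.functional as F

vocab_size, k = 100, 4                                 # k = context window size
emb = torch.nn.Embedding(vocab_size, vocab_size)       # toy parameters Θ

def log_prob(context):                                 # context: (k,) token ids
    scores = emb(context).mean(dim=0)                  # toy scoring of the window
    return F.log_softmax(scores, dim=-1)               # log P(· | u_{i-k}..u_{i-1}; Θ)

def lm_likelihood(corpus):                             # corpus U: (n,) token ids
    total = torch.tensor(0.0)
    for i in range(k, len(corpus)):                    # sum over positions i
        total = total + log_prob(corpus[i - k:i])[corpus[i]]
    return total                                       # L1(U), to be maximized w.r.t. Θ

corpus = torch.randint(0, vocab_size, (50,))
print(lm_likelihood(corpus))                           # scalar log-likelihood
```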
We use a multi-layer Transformer decoder [34] for the language model, which is a variant of the transformer [62]. This model applies a multi-headed self-attention operation over the input context tokens followed by position-wise feedforward layers to produce an output distribution over target tokens:

h_0 = U W_e + W_p
h_l = transformer_block(h_{l−1}) ∀ l ∈ [1, n]    (2)
P(u) = softmax(h_n W_e^T)

where U = (u_{−k}, . . . , u_{−1}) is the context vector of tokens, n is the number of layers, W_e is the token embedding matrix, and W_p is the position embedding matrix.
A multi-layer Transformer decoder is used as the language model.
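A rough sketch of the forward pass in Eq. 2, using PyTorch's TransformerEncoderLayer with a causal mask as a stand-in for the paper's transformer_block; layer counts and dimensions are toy values, not the paper's configuration.

```python
import torch
import torch.nn.functional as F

vocab_size, d_model, n_layers, n_heads, max_len = 100, 64, 2, 4, 32

W_e = torch.nn.Embedding(vocab_size, d_model)          # token embedding matrix
W_p = torch.nn.Embedding(max_len, d_model)             # position embedding matrix
blocks = torch.nn.ModuleList([
    torch.nn.TransformerEncoderLayer(d_model, n_heads,
                                     dim_feedforward=4 * d_model, batch_first=True)
    for _ in range(n_layers)])

def forward(tokens):                                   # tokens: (batch, T) ids
    T = tokens.size(1)
    h = W_e(tokens) + W_p(torch.arange(T))             # h_0 = U W_e + W_p
    causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
    for block in blocks:                               # h_l = transformer_block(h_{l-1})
        h = block(h, src_mask=causal)
    return F.softmax(h @ W_e.weight.T, dim=-1)         # P(u) = softmax(h_n W_e^T)

probs = forward(torch.randint(0, vocab_size, (2, 16)))
print(probs.shape)                                     # (2, 16, vocab_size)
```

Note the tied output projection: the same token embedding matrix W_e is reused (transposed) to map the final hidden states back to vocabulary logits.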
After training the model with the objective in Eq. 1, we adapt the parameters to the supervised target task. We assume a labeled dataset C, where each instance consists of a sequence of input tokens, x^1, . . . , x^m, along with a label y. The inputs are passed through our pre-trained model to obtain the final transformer block's activation h_l^m, which is then fed into an added linear output layer with parameters W_y to predict y:

P(y | x^1, . . . , x^m) = softmax(h_l^m W_y)    (3)
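A minimal sketch of the fine-tuning computation, assuming a placeholder `pretrained` module and toy sizes of my own choosing; only the last-position activation (standing in for h_l^m) is fed to the added linear layer W_y, and the supervised objective is the usual cross-entropy, i.e. maximizing log P(y | x^1, . . . , x^m) over C.

```python
import torch
import torch.nn.functional as F

d_model, vocab_size, num_classes, m = 64, 100, 3, 16
pretrained = torch.nn.Embedding(vocab_size, d_model)   # placeholder for the pre-trained decoder
W_y = torch.nn.Linear(d_model, num_classes)            # added linear output layer

tokens = torch.randint(0, vocab_size, (8, m))          # x^1 ... x^m for each instance in C
labels = torch.randint(0, num_classes, (8,))           # label y for each instance

h = pretrained(tokens)                                 # stand-in for the transformer activations
h_lm = h[:, -1, :]                                     # final block's last-position activation, h_l^m
probs = F.softmax(W_y(h_lm), dim=-1)                   # P(y | x^1, ..., x^m) = softmax(h_l^m W_y)

# Supervised objective: maximize log P(y | x) over C, i.e. minimize cross-entropy.
loss = F.cross_entropy(W_y(h_lm), labels)
print(probs.shape, loss.item())
```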