BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

https://arxiv.org/pdf/1810.04805.pdf

Abstract

BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers
The pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks
It obtains new state-of-the-art results on eleven natural language processing tasks

Introduction

Two existing strategies for applying pre-trained language representations
1. Feature-based : ELMo (Peters et al., 2018a) uses task-specific architectures that include the pre-trained representations as additional features.
  
  A feature-based approach performs the downstream task using features extracted from a pre-trained language model. Unlike fine-tuning the whole model, it extracts features from the intermediate layers of the pre-trained model and uses them as input.
2. Fine-tuning : Generative Pre-trained Transformer (OpenAI GPT) (Radford et al., 2018) introduces minimal task-specific parameters and is trained on the downstream tasks by simply fine-tuning all pre-trained parameters.
Difference between the two : training the entire model vs. extracting features from intermediate layers and using them
  - In the fine-tuning approach, the entire model is fine-tuned on the downstream task. It therefore tends to take longer to train than the feature-based approach, and there is a trade-off between the amount of data and performance.
  - The feature-based approach uses features extracted from the pre-trained model, so it takes less training time and can perform well even with little data.
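
As a rough illustration of this difference, the sketch below marks which parameters a training step is allowed to update under each strategy. The parameter names (`encoder.layer0`, `task_head`), the toy scalar weights, and both helper functions are hypothetical; real models hold tensors, and this is not the actual setup of ELMo or GPT.

```python
def trainable_names(strategy, encoder_params, head_params):
    """Return the parameter names that get updated under each strategy."""
    if strategy == "feature-based":
        # The pre-trained encoder is frozen; only the task-specific part learns.
        return set(head_params)
    if strategy == "fine-tuning":
        # Every pre-trained parameter is updated along with the new head.
        return set(encoder_params) | set(head_params)
    raise ValueError(f"unknown strategy: {strategy}")


def sgd_step(params, grads, trainable, lr=0.5):
    """Apply one gradient step, leaving non-trainable parameters untouched."""
    return {
        name: value - lr * grads[name] if name in trainable else value
        for name, value in params.items()
    }


# Toy scalar "weights" (hypothetical names and values).
encoder = {"encoder.layer0": 1.0, "encoder.layer1": 2.0}
head = {"task_head": 3.0}
params = {**encoder, **head}
grads = {name: 1.0 for name in params}

frozen = sgd_step(params, grads, trainable_names("feature-based", encoder, head))
tuned = sgd_step(params, grads, trainable_names("fine-tuning", encoder, head))
```

In `frozen` only `task_head` moves, while in `tuned` the encoder weights move as well, which is why fine-tuning usually costs more compute per step.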
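A unidirectional model could be very harmful when applying fine-tuning based approaches to token-level tasks

⇒ In token-level tasks such as NER (Named Entity Recognition), the model must decide which entity each token belongs to. With a unidirectional (left-to-right) model, each token can only attend to the tokens before it, so information from the following tokens cannot reach the current token, which can make it hard to recognize entities accurately. → The paper argues that a bidirectional model can achieve better performance than a unidirectional one on token-level tasks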
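
The constraint can be pictured as an attention visibility mask. The following is a minimal sketch, assuming a toy four-token sentence; `visibility_mask` and `visible_context` are hypothetical helper names, not functions from the paper or any library.

```python
def visibility_mask(num_tokens, bidirectional):
    """mask[i][j] is True when position i may attend to position j."""
    return [
        [bidirectional or j <= i for j in range(num_tokens)]
        for i in range(num_tokens)
    ]


def visible_context(tokens, position, bidirectional):
    """List the tokens that the token at `position` can draw information from."""
    mask = visibility_mask(len(tokens), bidirectional)
    return [tok for j, tok in enumerate(tokens) if mask[position][j]]


tokens = ["New", "York", "is", "large"]

# Left-to-right: "New" sees only itself, so it cannot use "York" as evidence.
left_to_right = visible_context(tokens, 0, bidirectional=False)

# Bidirectional: "New" sees the whole sentence, including "York".
both_sides = visible_context(tokens, 0, bidirectional=True)
```

Tagging "New" as the start of a location name plausibly depends on the following "York", which only the bidirectional mask exposes.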
To alleviate the previous unidirectionality constraint → use MLM (Masked Language Model)

MLM : randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of the masked word based only on its context.
- enables the representation to fuse the left and the right context
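
The masking step can be sketched as follows. This is a simplified illustration, not the authors' implementation: it uses the selection rate (15%) and the 80% `[MASK]` / 10% random token / 10% unchanged split reported in the paper, keeps labels as token strings rather than vocabulary ids, and the function name `mask_tokens` and the toy vocabulary are assumptions.

```python
import random

MASK = "[MASK]"
SPECIAL_TOKENS = ("[CLS]", "[SEP]")


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted, labels).

    labels[i] holds the original token at positions chosen for prediction
    and None everywhere else.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in SPECIAL_TOKENS:
            continue  # special tokens are never selected
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK  # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return corrupted, labels


# Toy vocabulary and sentence (hypothetical).
vocab = ["the", "dog", "cat", "sat", "on", "mat"]
sentence = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
corrupted, labels = mask_tokens(sentence, vocab)
```

The model is trained to recover `labels[i]` at the selected positions, and because the target is removed from the input it can safely look at both the left and the right context.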
+) using the next sentence prediction task → BERT can improve its ability to understand the interaction between two sentences.

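→ By performing MLM and the next sentence prediction task together, BERT learns word representations within a sentence and the ability to understand the relationship between two sentences at the same time ⇒ strong performance on a variety of downstream NLP tasks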
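
A minimal sketch of how next sentence prediction examples might be built is shown below. It follows the paper's description (50% of the time sentence B is the actual next sentence, otherwise a random one) and the `[CLS] A [SEP] B [SEP]` input layout with segment ids. The function names are hypothetical, and drawing the random sentence from the same small list, rather than from a large corpus, is a simplifying assumption.

```python
import random


def make_nsp_example(sentences, index, rng):
    """Build one (sentence_a, sentence_b, is_next) example.

    Assumes at least three distinct sentences and index < len(sentences) - 1.
    """
    sentence_a = sentences[index]
    if rng.random() < 0.5:
        # Positive example: B really follows A.
        return sentence_a, sentences[index + 1], True
    # Negative example: B is any sentence other than A or its true successor.
    candidates = [s for j, s in enumerate(sentences) if j not in (index, index + 1)]
    return sentence_a, rng.choice(candidates), False


def pack_pair(tokens_a, tokens_b):
    """Join two token lists into one input with segment ids (0 for A, 1 for B)."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


# Toy document (hypothetical).
document = ["he went to the store", "he bought milk", "penguins cannot fly", "it rained"]
rng = random.Random(0)
sentence_a, sentence_b, is_next = make_nsp_example(document, 0, rng)
tokens, segment_ids = pack_pair(sentence_a.split(), sentence_b.split())
```

The classifier then predicts `is_next` from the packed pair, while the MLM objective is applied to the same input.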

Related work - problems with previous work

ELMo - achieved SOTA results on several NLP benchmarks (QA, sentiment analysis, NER), but the model is feature-based and not deeply bidirectional.

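Two approaches appear in the Unsupervised Fine-tuning part

- Fine-tuning an existing language model on downstream tasks → using a pre-trained model can yield good performance across a variety of tasks. ex) GPT
- The from-scratch approach trains a language model without pre-training → that is, training from the beginning, which here means building a new model using the BERT architecture.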

BERT

There are two steps in the BERT framework: pre-training and fine-tuning

Model Architecture