BERT Explained: A Complete Guide with Theory and Tutorial

  1. Why was BERT needed?
  2. What is the core idea behind it?
  3. How does it work?
  4. When can we use it and how to fine-tune it?
  5. How can we use it? Using BERT for Text Classification — Tutorial

Part I

1. Why was BERT needed?

2. What is the core idea behind it?

3. How does it work?

  1. Token embeddings: A [CLS] token is added to the input word tokens at the beginning of the first sentence and a [SEP] token is inserted at the end of each sentence.
  2. Segment embeddings: A marker indicating Sentence A or Sentence B is added to each token. This allows the encoder to distinguish between sentences.
  3. Positional embeddings: A positional embedding is added to each token to indicate its position in the input sequence.
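
Putting these three together, here is a minimal sketch of how a pair of sentences is represented before it enters the encoder (the token sequence is illustrative; the real model looks up learned 768-dimensional vectors for each ID and sums them):

# Illustrative input: "my dog is cute" / "he likes playing", WordPiece-tokenized.
tokens = ["[CLS]", "my", "dog", "is", "cute", "[SEP]", "he", "likes", "play", "##ing", "[SEP]"]

# Segment IDs: 0 for Sentence A (up to and including the first [SEP]), 1 for Sentence B.
segment_ids = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# Position IDs: simply the index of each token in the sequence.
position_ids = list(range(len(tokens)))

# The final input embedding for position i is the element-wise sum of three lookups:
#   token_embedding[tokens[i]] + segment_embedding[segment_ids[i]] + position_embedding[position_ids[i]]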

1. Masked LM (MLM)

  • 80% of the tokens are actually replaced with the token [MASK].
  • 10% of the time tokens are replaced with a random token.
  • 10% of the time tokens are left unchanged.
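
As a rough sketch of this corruption rule (illustrative Python; the function name and the 15% selection rate used in the original BERT pre-training setup are assumptions, and the real implementation works on WordPiece IDs rather than strings):

import random

def mask_tokens(tokens, vocab, select_prob=0.15):
    """Select ~15% of (non-special) tokens and corrupt them with the 80/10/10 rule."""
    corrupted = list(tokens)
    targets = [None] * len(tokens)            # original tokens the model must predict
    for i, token in enumerate(tokens):
        if token in ("[CLS]", "[SEP]"):
            continue                          # special tokens are never masked
        if random.random() < select_prob:
            targets[i] = token
            r = random.random()
            if r < 0.8:
                corrupted[i] = "[MASK]"               # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted[i] = random.choice(vocab)   # 10%: replace with a random token
            # remaining 10%: leave the token unchanged
    return corrupted, targets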

2. Next Sentence Prediction (NSP)

  • 50% of the time the second sentence comes after the first one.
  • 50% of the time it is a random sentence from the full corpus.
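
A minimal sketch of how such sentence pairs could be assembled for pre-training (the helper name and corpus structure are assumptions for illustration):

import random

def make_nsp_pair(document, corpus_sentences):
    """Build one (sentence A, sentence B, label) example for Next Sentence Prediction."""
    i = random.randrange(len(document) - 1)
    sentence_a = document[i]
    if random.random() < 0.5:
        # 50%: sentence B really is the sentence that follows sentence A.
        return sentence_a, document[i + 1], "IsNext"
    # 50%: sentence B is a random sentence drawn from the full corpus.
    return sentence_a, random.choice(corpus_sentences), "NotNext"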

Architecture

4. When can we use it and how to fine-tune it?

Part II

5. How can we use it? Using BERT for Text Classification — Tutorial

1. Installation

2. Preparing the data

  • Column 0: An ID for the row
  • Column 1: The label for the row (should be an integer: class labels such as 0, 1, 2, 3)
  • Column 2: A column containing the same letter for all rows; this is a throwaway column that we need to include because BERT expects it.
  • Column 3: The text examples we want to classify
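
For example, with pandas (a sketch; the DataFrame contents, column names, and file paths are illustrative stand-ins for your own data):

import pandas as pd

# Stand-in for the real data: an integer "label" column and a "text" column.
df = pd.DataFrame({"label": [0, 1], "text": ["awful service", "great food"]})

df_bert = pd.DataFrame({
    "id": range(len(df)),          # Column 0: an ID for the row
    "label": df["label"],          # Column 1: the integer class label
    "alpha": ["a"] * len(df),      # Column 2: throw-away column of the same letter
    "text": df["text"],            # Column 3: the text to classify
})

# train.tsv and dev.tsv are written tab-separated, without a header or index.
df_bert.to_csv("data/train.tsv", sep="\t", index=False, header=False)

# test.tsv only needs the ID and text columns and, unlike train/dev, keeps a header row.
df_bert[["id", "text"]].to_csv("data/test.tsv", sep="\t", index=False, header=True)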

3. Training the model using the pre-trained BERT model

  • All the .tsv files should be in a folder called “data” in the “BERT directory”.
  • We should have created a folder “bert_output” where the fine-tuned model will be saved.
  • The pre-trained BERT model should have been saved in the “BERT directory”.
  • The paths in the command are relative paths, starting with “./”.

python run_classifier.py \
--task_name=cola \
--do_train=true \
--do_eval=true \
--do_predict=true \
--data_dir=./data/ \
--vocab_file=./cased_L-12_H-768_A-12/vocab.txt \
--bert_config_file=./cased_L-12_H-768_A-12/bert_config.json \
--init_checkpoint=./cased_L-12_H-768_A-12/bert_model.ckpt \
--max_seq_length=128 \
--train_batch_size=32 \
--learning_rate=2e-5 \
--num_train_epochs=3.0 \
--output_dir=./bert_output/ \
--do_lower_case=False

4. Making predictions on new data

export TRAINED_MODEL_CKPT=./bert_output/model.ckpt-[highest checkpoint number]

python run_classifier.py \
--task_name=cola \
--do_predict=true \
--data_dir=./data \
--vocab_file=./cased_L-12_H-768_A-12/vocab.txt \
--bert_config_file=./cased_L-12_H-768_A-12/bert_config.json \
--init_checkpoint=$TRAINED_MODEL_CKPT \
--max_seq_length=128 \
--output_dir=./bert_output
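
run_classifier.py writes the predictions to bert_output/test_results.tsv, one row of tab-separated class probabilities per test example. A short sketch for turning those into predicted labels (the pandas usage is illustrative):

import pandas as pd

# Each row holds one probability per class, in label order, with no header.
probs = pd.read_csv("bert_output/test_results.tsv", sep="\t", header=None)

# The predicted class for each example is the column with the highest probability.
predicted_labels = probs.idxmax(axis=1)
print(predicted_labels.head())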

5. Taking it a step further

  • Here is a tutorial for doing just that on this same Yelp reviews dataset in PyTorch.
  • Alternatively, there is a great Colab notebook created by Google researchers that shows in detail how to predict whether an IMDB movie review is positive or negative by putting a new layer on top of the pre-trained BERT model in TensorFlow.
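
As a taste of that approach, here is a minimal sketch using the Hugging Face transformers library (not used in this tutorial; the model name and example input are illustrative), which loads the pre-trained encoder and adds a classification layer on top:

from transformers import BertTokenizer, BertForSequenceClassification

# Load the pre-trained cased BERT-Base model with a fresh classification head on top.
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

# Score a single (illustrative) review; in practice the head is fine-tuned first.
inputs = tokenizer("The food was fantastic!", return_tensors="pt", truncation=True, max_length=128)
logits = model(**inputs).logits
print(logits.argmax(dim=-1))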

Final Thoughts