In October 2018, Google released a new language representation model called BERT, which stands for “Bidirectional Encoder Representations from Transformers”. According to the paper, it obtains new state-of-the-art results on a wide range of natural language processing tasks such as text classification, named entity recognition, and question answering.

In December 2017, I participated in a HackerEarth challenge, “Predict the Happiness”, where I built a multi-layered fully connected neural network for this text classification problem. My submitted solution achieved 87.8% accuracy on the test data, which placed me at rank 66 on the challenge leaderboard. I penned down that solution at the time in my blog post here.

With so much discussion about BERT on the internet, I decided to apply it to the same competition to see whether fine-tuning BERT could take me to the top of the challenge leaderboard. I recommend reading my previous blog post to learn about the data set, the problem statement, and my earlier solution. So, let’s start.

1. Installation

As far as a TensorFlow-based installation is concerned, the experiment is easy to set up. In your Python TensorFlow environment, just follow these two steps.

  1. Clone the BERT GitHub repository onto your own machine. On your terminal, type
    git clone https://github.com/google-research/bert.git
  2. Download a pre-trained model from the official BERT GitHub page here. There are 4 types of pre-trained models:
    BERT-Base, Uncased
    : 12-layer, 768-hidden, 12-heads, 110M parameters
    BERT-Large, Uncased
    : 24-layer, 1024-hidden, 16-heads, 340M parameters
    BERT-Base, Cased
    : 12-layer, 768-hidden, 12-heads, 110M parameters
    BERT-Large, Cased
    : 24-layer, 1024-hidden, 16-heads, 340M parameters

I downloaded BERT-Base, Cased for the experiment, as the text data set contains cased words. Also, the base models are only 12 layers deep (as opposed to BERT-Large, which is 24 layers deep), so they can run on a GTX 1080 Ti (11 GB VRAM). The BERT-Large models cannot run in 11 GB of GPU memory and require more space (64 GB would suffice).

2. Preparing Data for Model

We need to prepare the text data in a format that complies with the BERT model. The code released by Google for applying BERT accepts tab-separated files in the following format.

train.tsv or dev.tsv

  • an ID for the row
  • the label for the row as an int (class labels: 0,1,2,3 etc)
  • a column with the same letter in every row (a throwaway column expected by BERT)
  • the text examples you want to classify

test.tsv

  • an ID for the row
  • the text sentences/paragraphs you want to test

The Python code snippet below reads the HackerEarth training data (train.csv) and prepares it in compliance with the BERT input format.

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from pandas import DataFrame

le = LabelEncoder()

df = pd.read_csv("data/train.csv")

# Creating train and dev dataframes according to BERT:
# row ID, integer label, constant throwaway column, text
df_bert = pd.DataFrame({'user_id':df['User_ID'],
            'label':le.fit_transform(df['Is_Response']),
            'alpha':['a']*df.shape[0],
            'text':df['Description'].replace(r'\n',' ',regex=True)})

df_bert_train, df_bert_dev = train_test_split(df_bert, test_size=0.01)

# Creating test dataframe according to BERT
df_test = pd.read_csv("data/test.csv")
df_bert_test = pd.DataFrame({'User_ID':df_test['User_ID'],
                 'text':df_test['Description'].replace(r'\n',' ',regex=True)})

# Saving dataframes to .tsv format as required by BERT
df_bert_train.to_csv('data/train.tsv', sep='\t', index=False, header=False)
df_bert_dev.to_csv('data/dev.tsv', sep='\t', index=False, header=False)
df_bert_test.to_csv('data/test.tsv', sep='\t', index=False, header=True)
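As a sanity check, the expected four-column layout of train.tsv can be illustrated on a couple of toy rows (the IDs and review texts here are made up for illustration; note that BERT reads the training file without a header):

```python
import pandas as pd
from io import StringIO

# Two toy rows in the four-column BERT training layout:
# row ID, integer label, throwaway letter column, review text
toy = pd.DataFrame({
    'user_id': [101, 102],
    'label': [0, 1],                 # e.g. 0 = happy, 1 = not_happy
    'alpha': ['a', 'a'],             # constant throwaway column
    'text': ['the room was clean', 'the staff was rude'],
})

buf = StringIO()
toy.to_csv(buf, sep='\t', index=False, header=False)
lines = buf.getvalue().strip().split('\n')
print(lines[0])
```

Each line of the resulting file carries exactly four tab-separated fields, which is what the BERT data loader expects.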

The image below shows the head() of the pandas dataframe “df”, which holds the actual training data from the challenge (train.csv).

Training Data Happiness prediction
Hackathon Training Data

With the above Python code, we have converted the train.csv data into the BERT-compliant format shown in the image below.

BERT input dataframe
BERT input format data

Similarly, test.csv is read into a data frame and converted as described earlier. Finally, all the data frames are written out as tab-separated “.tsv” files.

3. Training Model using Pre-trained BERT model

Following the blog post up to this point finishes half of the job. Just recheck the following things:

  • All the .tsv files are in a folder named “data”
  • Make sure you have created a folder “bert_output” where the fine-tuned model will be saved and the test results will be generated under the name “test_results.tsv”
  • Check that you have downloaded the pre-trained BERT model into the current directory (“cased_L-12_H-768_A-12”)
  • Also, ensure that the paths in the command are relative paths (starting with “./”)

One can now fine-tune the downloaded pre-trained model on our problem’s data set by running run_classifier.py from the cloned BERT repository. A typical invocation (using the CoLA processor, whose input format matches the files prepared above, with the fine-tuning hyperparameters suggested in the BERT README) looks like this:

python run_classifier.py --task_name=cola --do_train=true --do_eval=true --do_predict=true --data_dir=./data/ --vocab_file=./cased_L-12_H-768_A-12/vocab.txt --bert_config_file=./cased_L-12_H-768_A-12/bert_config.json --init_checkpoint=./cased_L-12_H-768_A-12/bert_model.ckpt --max_seq_length=128 --train_batch_size=32 --learning_rate=2e-5 --num_train_epochs=3.0 --output_dir=./bert_output/

It generates “test_results.tsv” in the output directory as a result of the predictions on the test data set. That file contains the predicted probability values for all the classes, one class per column.

4. Preparing Results for Submission

The Python code below converts the results from the BERT model into the .csv format required for submission to the HackerEarth challenge.

df_results = pd.read_csv("bert_output/test_results.tsv",sep="\t",header=None)
df_results_csv = pd.DataFrame({'User_ID':df_test['User_ID'],
                   'Is_Response':df_results.idxmax(axis=1)})

# Replacing index with string as required for submission
df_results_csv['Is_Response'].replace(0, 'happy',inplace=True)
df_results_csv['Is_Response'].replace(1, 'not_happy',inplace=True)

# writing into .csv
df_results_csv.to_csv('data/result.csv', sep=",", index=False)
The above figure shows the conversion of probability values into submission results.

Power of BERT

I submitted the file result.csv to the HackerEarth “Predict the Happiness” challenge and reached leaderboard rank 4. Imagine the accuracy and rank if I had used the Large model instead of the Base model.

Leaderboard “Happiness Predictor”

Finally about BERT

BERT is a new method of pre-training language representations that obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks. The research paper from Google that proposes BERT can be found here. It is a must-read.

Training ?

The model is pre-trained using two novel unsupervised prediction tasks: masked language modeling and next-sentence prediction.

For masked language modeling, BERT uses a simple approach: mask out 15% of the words in the input, run the entire sequence through a deep bidirectional Transformer encoder, and then predict only the masked words. For example:

Input: the man went to the [MASK1] . he bought a [MASK2] of milk.
Labels: [MASK1] = store; [MASK2] = gallon
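A toy version of this masking step can be sketched as follows. This is an illustration only, not BERT’s actual preprocessing (which works on WordPiece tokens and sometimes keeps the original token or substitutes a random one instead of [MASK]):

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Mask roughly mask_rate of the tokens and remember the originals."""
    rng = random.Random(seed)
    n = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n))
    masked = list(tokens)
    targets = {}                      # position -> original token
    for p in positions:
        targets[p] = masked[p]
        masked[p] = '[MASK]'
    return masked, targets

tokens = 'the man went to the store . he bought a gallon of milk .'.split()
masked, targets = mask_tokens(tokens)
print(' '.join(masked))
```

The model only receives the corrupted sequence; the `targets` dictionary plays the role of the labels it must predict for the masked positions.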

In order to learn relationships between sentences, BERT also trains on a simple task that can be generated from any monolingual corpus: given two sentences A and B, is B the actual next sentence that comes after A, or just a random sentence from the corpus?

Sentence A: the man went to the store.
Sentence B: he bought a gallon of milk.
Label: IsNextSentence

Sentence A: the man went to the store.
Sentence B: penguins are flightless.
Label: NotNextSentence
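Generating such training pairs from a list of sentences can be sketched like this (a toy illustration under the same 50/50 split the paper describes, not BERT’s actual data pipeline):

```python
import random

def make_nsp_pairs(sentences, seed=0):
    """For each sentence, pair it with its true successor half the time
    and with a random non-successor the other half."""
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], 'IsNextSentence'))
        else:
            j = rng.randrange(len(sentences))
            while j == i + 1:          # avoid accidentally picking the true successor
                j = rng.randrange(len(sentences))
            pairs.append((sentences[i], sentences[j], 'NotNextSentence'))
    return pairs

corpus = ['the man went to the store .',
          'he bought a gallon of milk .',
          'penguins are flightless .',
          'they live in the southern hemisphere .']
pairs = make_nsp_pairs(corpus)
```

Each emitted triple is one classification example: sentence A, sentence B, and the binary label the model is trained to predict.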

Architecture ?

Two pre-trained models are provided, depending on the scale of the model architecture, namely BASE and LARGE.


BERT-Base:
Number of Layers = 12
No. of hidden nodes = 768
No. of Attention heads = 12
Total Parameters = 110M


BERT-Large:
Number of Layers = 24
No. of hidden nodes = 1024
No. of Attention heads = 16
Total Parameters = 340M
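These parameter counts can be roughly sanity-checked with back-of-the-envelope arithmetic. The sketch below ignores biases, layer norms, and position/segment embeddings, and assumes the 30,522-entry WordPiece vocabulary of the released models:

```python
def approx_params(layers, hidden, vocab=30522):
    """Each Transformer layer has ~4*H^2 attention-projection weights
    (Q, K, V, output) plus ~8*H^2 feed-forward weights (H -> 4H -> H);
    add the token-embedding matrix on top."""
    per_layer = 4 * hidden * hidden + 8 * hidden * hidden
    embeddings = vocab * hidden
    return layers * per_layer + embeddings

base = approx_params(12, 768)     # close to the quoted 110M
large = approx_params(24, 1024)   # close to the quoted 340M
```

The estimates land within a few percent of the official 110M and 340M figures, with the gap accounted for by the terms ignored above.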

The TensorFlow code and pre-trained models for BERT are available at the GitHub link here.

Fine Tuning ?

For sequence-level classification tasks, BERT fine-tuning is straightforward. The only new parameters added during fine-tuning are those of a classification layer W ∈ (K×H), where ‘K’ is the number of classifier labels and ‘H’ is the size of the final hidden state. The label probabilities for the K classes are computed with a standard softmax. All of the parameters of BERT and ‘W’ are fine-tuned jointly to maximize the log-probability of the correct label.
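That classification layer amounts to a few lines of NumPy. The shapes follow the description above; the random initialization and the stand-in hidden vector are illustrative, since in practice both W and BERT’s own weights are learned jointly:

```python
import numpy as np

def label_probabilities(h, W, b):
    """Softmax over K class logits, given the final hidden vector h (H,),
    classification weights W (K, H), and bias b (K,)."""
    logits = W @ h + b
    logits = logits - logits.max()   # subtract max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

H, K = 768, 2                        # BERT-Base hidden size, two labels
rng = np.random.default_rng(0)
h = rng.standard_normal(H)           # stand-in for the final [CLS] hidden state
W = rng.standard_normal((K, H)) * 0.02
b = np.zeros(K)
p = label_probabilities(h, W, b)
```

Training then maximizes log(p[correct_label]) over the data, back-propagating through both W and the encoder.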

Use of Transformers ?

I found the explanations in this and this link useful for understanding BERT and how it builds on the idea of transformers with attention. Transformers have changed the usual encoder-decoder (RNN/LSTM) implementations. “Attention Is All You Need” is a must-read paper on Transformers: it fundamentally replaced the recurrent encoder-decoder architecture, as transformers are superior in quality (they remove the shortcomings of training on long sequences in RNNs/LSTMs) while being more parallelizable and requiring significantly less time to train.

If you liked the post, follow this blog to get updates about upcoming articles. Also, share it so that it can reach the readers who can actually benefit from it. Please feel free to discuss anything regarding the post. I would love to hear your feedback.

Happy Deep Learning 🙂