I recently participated in the “Innoplexus Online Hiring Hackathon: Saving lives with AI” on Analytics Vidhya, and this tutorial walks through the solution I submitted, which reached a relaxed (partial) F1 score of 77.8% on the private leaderboard. I was ranked 40 on both the public and the private leaderboard of the challenge.

Problem Description

Clinical studies often require detailed patient information documented in clinical narratives. Named Entity Recognition (NER) is a fundamental Natural Language Processing (NLP) task that extracts entities of interest (e.g., disease names, medication names and lab tests) from clinical narratives to support clinical and translational research. Because clinical notes carry rich, detailed medical information, they are increasingly analyzed in depth for clinical research and other healthcare operations.

In this challenge, hackers are invited to extract all disease names from the given paragraphs/documents. The data-set can be downloaded from here.

  1. Test-set : 20000 documents
  2. Train-set : 30000 documents with labelled entities (diseases).

For example, here is a sentence from a clinical report:

We compared the inter-day reproducibility of post-occlusive reactive hyperemia (PORH) assessed by single-point laser Doppler flowmetry (LDF) and laser speckle contrast analysis (LSCI).

In the sentence given, “reactive hyperemia” is the named entity with the type disease/indication.

Data Description

The train file has the following structure:

Variable : Definition
id : Unique ID for a token/word
Doc_ID : Unique ID for a Document/Paragraph
Sent_ID : Unique ID for a Sentence
Word : Exact word/token
tag (Target) : Named Entity Tag

The target ‘tag’ follows the inside-outside-beginning (IOB) format, a common scheme for tagging tokens in named-entity recognition. The target ‘tag’ takes three values.

  1. B-indications : Beginning tag indicates that the token is the beginning of a disease entity (disease name in this case).
  2. I-indications : Inside tag indicates that the token is inside an entity.
  3. O : Outside tag indicates that a token is outside a disease entity
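To make the tagging scheme concrete, here is a minimal sketch of how a tagged sentence can be turned back into disease names. The helper `extract_entities` is a hypothetical name used only for illustration, and the tags below are hand-written for a fragment of the example sentence, not taken from the data-set.

```python
def extract_entities(tokens, tags):
    """Collect the disease mentions encoded by B-/I-indications tags."""
    entities, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B-indications":
            # A B- tag always starts a new entity, closing any open one
            if current:
                entities.append(" ".join(current))
            current = [token]
        elif tag == "I-indications" and current:
            # An I- tag extends the entity that is currently open
            current.append(token)
        else:
            # "O" (or a stray I- tag with nothing open) ends the open entity
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities


# Tokens from the example sentence; tags written by hand for illustration
tokens = ["reproducibility", "of", "post-occlusive", "reactive", "hyperemia", "(", "PORH", ")"]
tags = ["O", "O", "O", "B-indications", "I-indications", "O", "O", "O"]
print(extract_entities(tokens, tags))
```

Two diseases that follow each other directly stay separate because the second one starts with its own B-indications tag.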


Before going ahead with the deep learning and Python-based implementation, it is important to clearly understand the kind of problem NER is. Beginners may confuse it with a sentence parsing problem or a classical classification problem. Unlike sentence or document classification, NER is a word classification problem: each word of the sentence has to be assigned one of the labelled tags. Any word which is not part of a disease name has to be classified as “O”. The first word of a disease name has to be classified as “B-indications” and the following words of the same disease name as “I-indications”.

An obvious question is which kind of classifier can be used for such a problem. What is needed is a model that treats the words of a sentence as a sequence of states/nodes and assigns a class tag to each of them, so that it learns entities from their context and classifies every word at the same time. Among non deep learning models, the Conditional Random Field (CRF) has been an obvious and popular choice for modelling NER problems. Readers can get an overview of CRFs from here. To implement a CRF for an NER application, one can use the popular sklearn-crfsuite implementation from here.
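For readers curious about the CRF route, the sketch below shows the kind of hand-crafted, per-word features such a model usually consumes. The function names and the chosen features are my own assumptions for illustration, not the ones used by any participant. As far as I know, sklearn-crfsuite takes one list of such feature dicts per sentence, but please check its documentation before relying on that.

```python
def word2features(sentence, i):
    """Build a feature dict for the i-th word of a tokenised sentence."""
    word = sentence[i]
    features = {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "word.isdigit": word.isdigit(),
        "suffix3": word[-3:],
    }
    # Context features: the neighbouring words carry most of the signal
    if i > 0:
        features["prev.lower"] = sentence[i - 1].lower()
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(sentence) - 1:
        features["next.lower"] = sentence[i + 1].lower()
    else:
        features["EOS"] = True  # end of sentence
    return features


def sentence2features(sentence):
    """One feature dict per word, in sentence order."""
    return [word2features(sentence, i) for i in range(len(sentence))]


sentence = ["post-occlusive", "reactive", "hyperemia"]
features = sentence2features(sentence)
print(features[1])
```

The deep learning model in this post skips this step entirely and learns its own word representation through an embedding layer.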

There are open source packages that implement deep learning based NER and are becoming popular in industry, for example spaCy. This blog-post demonstrates a deep learning model that can be used for NER problems. The motivation is to train a custom NER model from scratch using Python and Keras, which makes it possible to learn domain specific entities such as the disease names here. So, let’s get started with the following steps.

  1. Importing Libraries
  2. Reading Data
  3. Creating Word & Tag dictionary
  4. Getting Train & Test Sentences
  5. Feature Extraction for DL Model
  6. Building Bidirectional LSTM Model
  7. Prediction on Test Set
  8. Prepare Submission Data
  9. Writing the Submission File
  10. Leaderboard Score

1. Importing libraries

import pandas as pd
import numpy as np
from tqdm import tqdm, trange
import unicodedata

from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.models import Model
from keras.layers import Input, LSTM, Embedding, Dense
from keras.layers import TimeDistributed, Dropout, Bidirectional

# Defining Constants

# Maximum length of text sentences
MAXLEN = 180
# Number of LSTM units
LSTM_N = 150
# Batch size (assumed value: the original setting is not shown, 32 is the Keras default)
BS = 32

2. Reading Data

# Reading the training set
data = pd.read_csv("train.csv", encoding="latin1")


# Reading the test set
test_data = pd.read_csv("test.csv", encoding="latin1")


3. Creating Word & Tag dictionary

print("Number of unique docs, sentences and words in Training set:\n",data.nunique())
print("\nNumber of unique docs, sentences and words in Test set:\n",test_data.nunique())

# Creating a vocabulary from the words of both the train and the test set
# (pd.concat is used because Series.append was removed in pandas 2.0)
words = list(set(pd.concat([data["Word"], test_data["Word"]]).values))

# Stripping accents so that words reduce to ASCII, e.g. 'naïve café' becomes 'naive cafe'
# (characters without an ASCII decomposition, such as Greek letters, are dropped)
words = [unicodedata.normalize('NFKD', str(w)).encode('ascii','ignore') for w in words]
n_words = len(words)
print("\nLength of vocabulary = ",n_words)

tags = list(set(data["tag"].values))
n_tags = len(tags)
print("\nnumber of tags = ",n_tags)

# Creating words to indices dictionary.
word2idx = {w: i for i, w in enumerate(words)}
# Creating tags to indices dictionary.
tag2idx = {t: i for i, t in enumerate(tags)}
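The normalisation step is easy to misread, so here is a small self-contained check of what it does. The helper name `to_ascii` and the sample words are mine and do not come from the data-set. Note that the result is a `bytes` object, which is why the dictionary keys in this post are bytes.

```python
import unicodedata


def to_ascii(word):
    """Decompose accented characters (NFKD) and drop anything that is not ASCII."""
    return unicodedata.normalize("NFKD", str(word)).encode("ascii", "ignore")


print(to_ascii("naïve café"))   # b'naive cafe'
print(to_ascii("α-synuclein"))  # b'-synuclein' (the Greek letter is dropped)

# Distinct source words can collapse onto the same key after normalisation;
# the dictionary then keeps the index of the last one.
vocab = ["café", "cafe"]
toy_word2idx = {to_ascii(w): i for i, w in enumerate(vocab)}
print(toy_word2idx)              # {b'cafe': 1}
```

So the step removes accents and silently discards characters that have no ASCII form. It does not transliterate Greek letters.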


4. Getting Train & Test Sentences

def get_tagged_sentences(data):
    """
    Objective: To get a list of sentences along with labelled tags.
    Returns a list of lists of (word, tag) tuples.
    Each inner list contains the words of a sentence along with their tags.
    """
    agg_func = lambda s: [(w, t) for w, t in zip(s["Word"].values.tolist(), s["tag"].values.tolist())]
    grouped = data.groupby("Sent_ID").apply(agg_func)
    sentences = [s for s in grouped]
    return sentences

def get_test_sentences(data):
    """
    Objective: To get a list of sentences.
    Returns a list of lists of words.
    Each inner list contains the words of a sentence.
    """
    agg_func = lambda s: [w for w in s["Word"].values.tolist()]
    grouped = data.groupby("Sent_ID").apply(agg_func)
    sentences = [s for s in grouped]
    return sentences

# Getting training sentences in a list
sentences = get_tagged_sentences(data)
print("First 2 sentences in a word list format:\n",sentences[0:2])


# Getting test sentences in a list
test_sentences = get_test_sentences(test_data)
print("First 2 sentences in a word list format:\n",test_sentences[0:2])
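To see what these two helpers return without loading the data, the same grouping can be sketched in plain Python. The rows below are made up for illustration. The sketch sorts by `Sent_ID` because pandas `groupby` sorts the group keys by default.

```python
# (Sent_ID, Word, tag) rows - hypothetical, mimicking the train file layout
rows = [
    (1, "reactive", "B-indications"),
    (1, "hyperemia", "I-indications"),
    (1, "was", "O"),
    (2, "No", "O"),
    (2, "fever", "B-indications"),
]

# Collect the (word, tag) pairs of every sentence, keeping the word order
grouped = {}
for sent_id, word, tag in rows:
    grouped.setdefault(sent_id, []).append((word, tag))

# One inner list per sentence, ordered by sentence id
toy_sentences = [grouped[sent_id] for sent_id in sorted(grouped)]
print(toy_sentences)
```

The test-set helper produces the same nesting, only with bare words instead of (word, tag) tuples.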


5. Feature Extraction for DL Model

# Converting words to indices for training sentences (features)
# Stripping accents in the train set, e.g. 'naïve café' becomes 'naive cafe'
X = [[word2idx[unicodedata.normalize('NFKD', str(w[0])).encode('ascii','ignore')]
      for w in s] for s in sentences]

# Converting words to indices for test sentences (features)
# Stripping accents in the test set, e.g. 'naïve café' becomes 'naive cafe'
X_test = [[word2idx[unicodedata.normalize('NFKD', str(w)).encode('ascii','ignore')]
           for w in s] for s in test_sentences]

# Padding train and test sentences to 180 words.
# Sentences longer than 180 words are truncated to their first 180 words
# (truncating="post"; the Keras default "pre" would drop words from the start instead).
# Sentences shorter than 180 words are padded with the highest word index.
X = pad_sequences(maxlen=MAXLEN, sequences=X, padding="post", truncating="post", value=n_words - 1)
X_test = pad_sequences(maxlen=MAXLEN, sequences=X_test, padding="post", truncating="post", value=n_words - 1)

# Converting tags to indices for training sentences (labels)
y = [[tag2idx[w[1]] for w in s] for s in sentences]
# Padding tag labels to 180 words.
y = pad_sequences(maxlen=MAXLEN, sequences=y, padding="post", truncating="post", value=tag2idx["O"])

# Making labels in one hot encoded form for DL model
y = [to_categorical(i, num_classes=n_tags) for i in y]
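The shapes involved in this step are easier to follow on a toy example. The sketch below re-implements post-padding, post-truncation and one-hot encoding in plain Python. `pad_post` and `one_hot` are hypothetical helpers written for illustration, not Keras functions. One thing to keep in mind: Keras `pad_sequences` truncates from the front by default (`truncating="pre"`), so `truncating="post"` has to be passed to keep the first words of a long sentence, which is what the submission step later assumes.

```python
def pad_post(seq, maxlen, value):
    """Keep the first `maxlen` items and right-pad shorter sequences with `value`."""
    return seq[:maxlen] + [value] * max(0, maxlen - len(seq))


def one_hot(index, num_classes):
    """Vector of zeros with a single 1 at position `index`."""
    return [1 if k == index else 0 for k in range(num_classes)]


toy_maxlen = 4
short = pad_post([5, 6], toy_maxlen, value=9)           # padded on the right
long_ = pad_post([1, 2, 3, 4, 5], toy_maxlen, value=9)  # extra word dropped at the end
print(short, long_)

# Labels: one row of length n_tags per timestep
toy_labels = [one_hot(t, num_classes=3) for t in pad_post([1, 2], toy_maxlen, value=0)]
print(toy_labels)
```

With the real data, every sentence becomes 180 word indices and every label sequence a 180 x n_tags matrix.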

6. Building Bidirectional LSTM Model

# 180 dimensional word indices as input
inputs = Input(shape=(MAXLEN,))

# Embedding layer whose output size equals the sentence length (a 180 dim embedding per word)
model = Embedding(input_dim=n_words, output_dim=MAXLEN, input_length=MAXLEN)(inputs)

# Adding dropout layer
model = Dropout(0.2)(model)

# Bidirectional LSTM to learn from both forward as well as backward context
model = Bidirectional(LSTM(units=LSTM_N, return_sequences=True, recurrent_dropout=0.1))(model)

# Adding a TimeDistributed layer to apply the same Dense layer at each of the 180 timesteps
out = TimeDistributed(Dense(n_tags, activation="softmax"))(model) # softmax output layer
model = Model(inputs, out)

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
history = model.fit(X, np.array(y), batch_size=BS, epochs=2, validation_split=0.05, verbose=1)
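To get a feel for where the capacity of this network sits, the trainable parameters of each layer can be worked out by hand. The sketch below assumes a standard LSTM with one bias vector per gate and uses a made-up vocabulary size of 1000 words. The function name `bilstm_tagger_params` is hypothetical.

```python
def bilstm_tagger_params(n_words, n_tags, emb_dim=180, lstm_units=150):
    """Count trainable parameters of Embedding -> BiLSTM -> Dense(softmax)."""
    # One emb_dim-sized vector per vocabulary entry
    embedding = n_words * emb_dim
    # 4 gates, each with input weights, recurrent weights and a bias
    lstm_one_direction = 4 * (lstm_units * (emb_dim + lstm_units) + lstm_units)
    # The forward and backward LSTMs have separate weights
    bilstm = 2 * lstm_one_direction
    # The Dense layer sees the concatenated forward/backward outputs, plus a bias
    dense = (2 * lstm_units + 1) * n_tags
    return {
        "embedding": embedding,
        "bilstm": bilstm,
        "dense": dense,
        "total": embedding + bilstm + dense,
    }


# Hypothetical vocabulary of 1000 words and the 3 IOB tags
params = bilstm_tagger_params(n_words=1000, n_tags=3)
print(params)
```

With a realistic vocabulary the embedding layer dominates the parameter count, since it grows linearly with the number of distinct words.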




7. Prediction on Test Set

# Predicting on trained model
pred = model.predict(X_test)
print("Predicted Probabilities on Test Set:\n",pred.shape)
# taking tag class with maximum probability
pred_index = np.argmax(pred, axis=-1)
print("Predicted tag indices: \n",pred_index.shape)


# Flatten both the features and predicted tags for submission
ids,tagids = X_test.flatten().tolist(), pred_index.flatten().tolist()

# converting each word indices back to words
words_test = [words[ind].decode('utf-8') for ind in ids]
# converting each predicted tag indices back to tags
tags_test = [tags[ind] for ind in tagids]
print("Length of words in Padded test set:",len(words_test))
print("Length of tags in Padded test set:",len(tags_test))
print("\nCheck few of words and predicted tags:\n",words_test[:10],tags_test[:10])
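The decoding step is just an argmax over the last axis followed by a lookup in the tag list. A plain Python sketch with made-up probabilities shows the idea. The tag order below is an assumption, because in the post it comes from a `set` and is therefore arbitrary.

```python
def decode(pred, tag_list):
    """Pick the most probable tag at every timestep of every sentence."""
    decoded = []
    for sentence in pred:
        row = []
        for probs in sentence:
            # Index of the largest probability, i.e. argmax over the tag axis
            best = max(range(len(probs)), key=lambda k: probs[k])
            row.append(tag_list[best])
        decoded.append(row)
    return decoded


toy_tags = ["O", "B-indications", "I-indications"]
# One sentence, three timesteps, one probability per tag (hypothetical values)
toy_pred = [[[0.1, 0.8, 0.1], [0.2, 0.1, 0.7], [0.9, 0.05, 0.05]]]
decoded = decode(toy_pred, toy_tags)
print(decoded)
```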


8. Prepare Submission Data

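# The task here is to convert the padded, fixed 180 dimensional predicted tags
# back to the variable length test set sentences.
# 1. If a sentence is shorter than 180 words,
#    the predicted tags for the padded positions are skipped.
# 2. If a sentence is longer than 180 words,
#    all extra words are tagged with the "O" tag class.

predicted_tags = []
counts = test_data.groupby('Sent_ID')['id'].count().tolist()

for index,count in enumerate(counts):
    start = index*MAXLEN
    if count <= MAXLEN:
        # keep only the tags of the real (unpadded) words
        predicted_tags.append(tags_test[start:start+count])
    else:
        # keep all 180 predicted tags and tag the truncated words with "O"
        predicted_tags.append(tags_test[start:start+MAXLEN])
        out = ['O']*(count-MAXLEN)
        predicted_tags.append(out)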


predictions_final = [item for sublist in predicted_tags for item in sublist]
print("\nLength of test set words and predicted tags should match.")
print("Length of predicted tags:",len(predictions_final))
print("Length of words in test set:",test_data['Word'].size)
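The un-padding described in this step can be checked on a toy example with a maximum length of 3 instead of 180. The function `unpad` is a hypothetical helper that follows the two rules of this step: drop the predictions made on padding, and give "O" to words that were cut off.

```python
def unpad(flat_tags, counts, maxlen):
    """Map flattened, fixed-length predictions back to the true sentence lengths."""
    restored = []
    for index, count in enumerate(counts):
        start = index * maxlen
        # Tags for the words the model actually saw (at most maxlen of them)
        restored.extend(flat_tags[start:start + min(count, maxlen)])
        # Words beyond maxlen were truncated, so they default to "O"
        restored.extend(["O"] * max(0, count - maxlen))
    return restored


toy_maxlen = 3
toy_counts = [2, 4]  # true sentence lengths: one shorter, one longer than maxlen
flat = [
    "B-indications", "I-indications", "O",  # sentence 1 (last tag is padding)
    "O", "B-indications", "I-indications",  # sentence 2 (4th word was truncated)
]
restored = unpad(flat, toy_counts, toy_maxlen)
print(restored)
```

The length of the result equals the total number of words, which is exactly the check printed above for the real test set.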


9. Writing the Submission File

df = pd.read_csv("sample_submission.csv", encoding="latin1")
# Creating a dataframe in the submission format
df_results = pd.DataFrame({'id':df['id'],'Sent_ID':df['Sent_ID'],'tag':predictions_final})
# writing csv submission file
df_results.to_csv('submission_final.csv',sep=",", index=None)


10. Leaderboard Score

# Relaxed/Partial F1 score on private leaderboard was 77.8%
# Partial F1 score: F1 score with considering partial disease name detection
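The exact scoring script of the hackathon is not reproduced here, so the following is only one plausible reading of a relaxed F1: a predicted disease counts as correct if it overlaps a labelled disease by at least one token. The span format (start, end) with an exclusive end and the function names are my assumptions.

```python
def overlaps(a, b):
    """True if two (start, end) token spans, end exclusive, share at least one token."""
    return a[0] < b[1] and b[0] < a[1]


def relaxed_f1(gold_spans, pred_spans):
    """F1 where any overlap between a predicted and a gold span counts as a match."""
    matched_pred = sum(any(overlaps(p, g) for g in gold_spans) for p in pred_spans)
    matched_gold = sum(any(overlaps(g, p) for p in pred_spans) for g in gold_spans)
    precision = matched_pred / len(pred_spans) if pred_spans else 0.0
    recall = matched_gold / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = [(3, 5), (10, 12)]  # two labelled diseases (hypothetical token positions)
pred = [(4, 5), (20, 21)]  # one partial hit, one false alarm
score = relaxed_f1(gold, pred)
print(score)
```

Under a strict exact-match F1 the same prediction would score zero, which is why partial scores look more forgiving.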


At The End

I hope the tutorial was easy to follow, as I have tried to keep it short and simple. Beginners who are interested in text analytics can start with this application. Readers are strongly encouraged to download the data-set and check whether they can reproduce the results. I also hope the comments in each code block are sufficient to understand the code. Readers can ask in the comments if a more explicit explanation is needed. A few more variants that can be tried out as extensions are as follows:

  1. CRF Suite : A Conditional Random Field model along with feature engineering was used by a few participants in the hackathon to get a better score than the one posted here.
  2. Feature Engineering : Pre-processing the text paragraphs or engineering additional features may improve the models. This blog-post does not involve any preprocessing except reducing accented characters to ASCII.
  3. Deep Models : There are papers which report SOTA results for NER problems. Readers can try implementing the deep model from this paper. The implementation of this paper in Keras can be found here.

You can get the full Python implementation of this blog-post as a Jupyter Notebook from the GitHub link here.

If you liked the post, follow this blog to get updates about upcoming articles. Also, share it so that it can reach out to the readers who can actually gain from this. Please feel free to discuss anything regarding the post. I would love to hear feedback from you.

Happy Deep Learning 🙂