Named-entity recognition (NER), also known as entity extraction, is a sub-task of information extraction that seeks to locate and classify named-entity mentions in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values and percentages. I recently participated in the “Innoplexus Online Hiring Hackathon: Saving lives with AI” on Analytics Vidhya, which was a named entity recognition task. Hence, I am writing this tutorial on the solution I submitted, which achieves a relaxed/partial F1 score of about 77% on the test data. I was ranked 40 on both the public and private leaderboards of the challenge.

Problem Description

Clinical studies often require detailed patient information documented in clinical narratives. Named Entity Recognition (NER) is a fundamental Natural Language Processing (NLP) task for extracting entities of interest (e.g., disease names, medication names and lab tests) from clinical narratives, and thereby supports clinical and translational research. Clinical notes are analyzed in great detail to harness important information for clinical research and other healthcare operations, as they contain rich, detailed medical information.

In this challenge, hackers are invited to extract all disease names from given paragraphs/documents. The data-set can be downloaded from here.

  1. Test-set : 20000 documents
  2. Train-set : 30000 documents with labelled entities (diseases).

For example, here is a sentence from a clinical report:

We compared the inter-day reproducibility of post-occlusive reactive hyperemia (PORH) assessed by single-point laser Doppler flowmetry (LDF) and laser speckle contrast analysis (LSCI).

In the given sentence, “reactive hyperemia” is the named entity with the type disease/indication.

Data Description

The train file has the following structure:

Variable        Definition
id              Unique ID for a token/word
Doc_ID          Unique ID for a Document/Paragraph
Sent_ID         Unique ID for a Sentence
Word            Exact word/token
tag (Target)    Named Entity Tag

The target ‘tag’ follows the Inside-Outside-Beginning (IOB) format, a common tagging scheme for tokens in named entity recognition. The target ‘tag’ has three kinds of tags:

  1. B-indications : Beginning tag indicates that the token is the beginning of a disease entity (disease name in this case).
  2. I-indications : Inside tag indicates that the token is inside an entity.
  3. O : Outside tag indicates that a token is outside a disease entity.

Therefore, any word which does not represent a disease name has to be tagged as “O”. Similarly, the first word of a disease name has to be tagged as “B-indications” and the following words of the disease name as “I-indications”.
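To make the scheme concrete, the sample sentence from the problem description would be tagged token by token roughly as follows (the token split shown here is an assumption; the actual tokenisation in the dataset may differ slightly):

# Illustrative IOB tagging of the sample sentence (tokenisation assumed)
tagged_sentence = [
    ("We", "O"), ("compared", "O"), ("the", "O"), ("inter-day", "O"),
    ("reproducibility", "O"), ("of", "O"), ("post-occlusive", "O"),
    ("reactive", "B-indications"),   # first token of the disease entity
    ("hyperemia", "I-indications"),  # continuation of the same entity
    ("(", "O"), ("PORH", "O"), (")", "O"),
    # ... all remaining tokens in the sentence are tagged "O"
]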

Approach

Before going ahead with the deep learning and Python based implementation, it is important to clearly understand what kind of problem NER is. Beginners may confuse it with a sentence parsing problem or a classical classification problem. Essentially, unlike sentence or document classification techniques (as in this and this post), NER is a word-level classification problem where each word of a sentence has to be classified into one of the labelled tags.

An obvious question is what kind of classifier can be used for such a problem. We need a classification model that can treat the words of a sentence as a sequence of states/nodes while tagging each of these words with a class tag; this allows contextual learning of entities and classification of each word at the same time. Among non-deep-learning models, the Conditional Random Field (CRF) has been an obvious and popular choice for modelling NER problems. Readers can get an overview of CRFs from here. To implement a CRF for an NER application, one can use the popular sklearn-crfsuite implementation from here.
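For reference, a minimal sketch with the sklearn-crfsuite package could look like the following. The per-token feature function here is deliberately simple and only illustrative (real CRF solutions typically add prefixes, suffixes and neighbouring words), and `sentences` is assumed to be the list of (word, tag) sentences built in step 4 below:

import sklearn_crfsuite

# A deliberately simple per-token feature function (illustrative only)
def word_features(sentence, i):
    word = str(sentence[i][0])
    return {
        "lower": word.lower(),
        "is_upper": word.isupper(),
        "is_title": word.istitle(),
        "is_digit": word.isdigit(),
    }

# `sentences` : list of [(word, tag), ...] per sentence, as built in step 4
X_crf = [[word_features(s, i) for i in range(len(s))] for s in sentences]
y_crf = [[tag for _, tag in s] for s in sentences]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_crf, y_crf)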

There are also open-source packages which implement deep learning based NER and are becoming popular in industry, for example spaCy. This blog-post demonstrates a deep learning model that can be utilized for NER problems. The motivation of this blog-post is to train a custom NER model from scratch using Python and Keras, so that it can learn domain-specific entities such as the disease names here. So, let's cover the following steps.

  1. Importing Libraries
  2. Reading Data
  3. Creating Word & Tag dictionary
  4. Getting Train & Test Sentences
  5. Feature Extraction for DL Model
  6. Building Bidirectional LSTM Model
  7. Prediction on Test Set
  8. Prepare Submission Data
  9. Writing the Submission File
  10. Leaderboard Score

1. Importing Libraries

import pandas as pd
import numpy as np
from tqdm import tqdm, trange
import unicodedata

from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.models import Model, Input
from keras.layers import LSTM, Embedding, Dense
from keras.layers import TimeDistributed, Dropout, Bidirectional

# Defining Constants

# Maximum length of text sentences
MAXLEN = 180
# Number of LSTM units
LSTM_N = 150
# batch size
BS=48

2. Reading Data

# Reading the training set
data = pd.read_csv("train.csv", encoding="latin1")
data.head(10)
[Image: NER training data with tag]
# Reading the test set
test_data = pd.read_csv("test.csv", encoding="latin1")
test_data.head(10)
[Image: NER test data samples]

3. Creating Word & Tag dictionary

print("Number of uniques docs, sentences and words in Training set:\n",data.nunique())
print("\nNumber of uniques docs, sentences and words in Test set:\n",test_data.nunique())

# Creating a vocabulary
words = list(set(data["Word"].append(test_data["Word"]).values))
words.append("ENDPAD")

# Converting accented/non-ASCII characters to ASCII, e.g. 'naïve café' to 'naive cafe'
words = [unicodedata.normalize('NFKD', str(w)).encode('ascii','ignore') for w in words]
n_words = len(words)
print("\nLength of vocabulary = ",n_words)

tags = list(set(data["tag"].values))
n_tags = len(tags)
print("\nnumber of tags = ",n_tags)

# Creating words to indices dictionary.
word2idx = {w: i for i, w in enumerate(words)}
# Creating tags to indices dictionary.
tag2idx = {t: i for i, t in enumerate(tags)}
[Image: Information about train and test data]

4. Getting Train & Test Sentences


def get_tagged_sentences(data):
    '''
    Objective: To get a list of sentences along with their labelled tags.
    Returns a list of lists of (word, tag) tuples.
    Each inner list contains the words of one sentence along with their tags.
    '''
    agg_func = lambda s: [(w, t) for w, t in zip(s["Word"].values.tolist(), s["tag"].values.tolist())]
    grouped = data.groupby("Sent_ID").apply(agg_func)
    sentences = [s for s in grouped]
    return sentences

def get_test_sentences(data):
    '''
    Objective: To get a list of sentences.
    Returns a list of lists of words.
    Each inner list contains the words of one sentence.
    '''
    agg_func = lambda s: [w for w in s["Word"].values.tolist()]
    grouped = data.groupby("Sent_ID").apply(agg_func)
    sentences = [s for s in grouped]
    return sentences
# Getting training sentences in a list
sentences = get_tagged_sentences(data)
print("First 2 sentences in a word list format:\n",sentences[0:2])
[Image: Sample train sentences in word-tag format]
# Getting test sentences in a list
test_sentences = get_test_sentences(test_data)
print("First 2 sentences in a word list format:\n",test_sentences[0:2])
[Image: Test sample sentences in word list format]

5. Feature Extraction for DL Model


# Converting words to indices for train sentences (features)
# Converting accented/non-ASCII characters to ASCII in the train set, e.g. 'naïve café' to 'naive cafe'
X = [[word2idx[unicodedata.normalize('NFKD', str(w[0])).encode('ascii', 'ignore')]
     for w in s] for s in sentences]

# Converting words to indices for test sentences (features)
# Converting accented/non-ASCII characters to ASCII in the test set, e.g. 'naïve café' to 'naive cafe'
X_test = [[word2idx[unicodedata.normalize('NFKD', str(w)).encode('ascii', 'ignore')]
           for w in s] for s in test_sentences]

'''
Padding train and test sentences to 180 words.
Sentences longer than 180 words are truncated.
Sentences shorter than 180 words are padded at the end with
the index n_words - 1, which corresponds to the "ENDPAD" token.
'''
X = pad_sequences(maxlen=MAXLEN, sequences=X, padding="post", value=n_words - 1)
X_test = pad_sequences(maxlen=MAXLEN, sequences=X_test, padding="post", value=n_words - 1)

# Converting tags to indices for train sentences (labels)
y = [[tag2idx[w[1]] for w in s] for s in sentences]
# Padding tag labels to 180 words with the index of the "O" tag
y = pad_sequences(maxlen=MAXLEN, sequences=y, padding="post", value=tag2idx["O"])

# Making labels in one hot encoded form for DL model
y = [to_categorical(i, num_classes=n_tags) for i in y]
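To make the padding behaviour concrete, here is a tiny, self-contained illustration of pad_sequences on made-up index sequences (the numbers are arbitrary and not from the dataset):

from keras.preprocessing.sequence import pad_sequences  # already imported above

# Toy example: pad/truncate two made-up sequences to length 5
demo = pad_sequences(maxlen=5, sequences=[[7, 2, 9], [1, 2, 3, 4, 5, 6]],
                     padding="post", value=0)
print(demo)
# [[7 2 9 0 0]   <- shorter sequence padded at the end with the given value
#  [2 3 4 5 6]]  <- longer sequence truncated (default truncating='pre' drops tokens from the front)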

6. Building Bidirectional LSTM Model


# 180 dimensional word indices as input
input = Input(shape=(MAXLEN,))

# Embedding layer: each of the 180 word indices is mapped to a 180-dimensional embedding vector
model = Embedding(input_dim=n_words, output_dim=MAXLEN, input_length=MAXLEN)(input)

# Adding dropout layer
model = Dropout(0.2)(model)

# Bidirectional LSTM to learn from both forward as well as backward context
model = Bidirectional(LSTM(units=LSTM_N, return_sequences=True, recurrent_dropout=0.1))(model)

# Adding a TimeDistributed Dense layer, i.e. applying a Dense layer to each of the 180 timesteps
out = TimeDistributed(Dense(n_tags, activation="softmax"))(model) # softmax output layer
model = Model(input, out)

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
history = model.fit(X, np.array(y), batch_size=BS, epochs=2, validation_split=0.05, verbose=1)
[Image: Training epochs for the NER model]
model.summary()
[Image: Model summary]

7. Prediction on Test Set

# Predicting on trained model
pred = model.predict(X_test)
print("Predicted Probabilities on Test Set:\n",pred.shape)
# taking tag class with maximum probability
pred_index = np.argmax(pred, axis=-1)
print("Predicted tag indices: \n",pred_index.shape)


# Flatten both the features and predicted tags for submission
ids,tagids = X_test.flatten().tolist(), pred_index.flatten().tolist()

# converting each word index back to its word
words_test = [words[ind].decode('utf-8') for ind in ids]
# converting each predicted tag index back to its tag
tags_test = [tags[ind] for ind in tagids]
print("Length of words in Padded test set:",len(words_test))
print("Length of tags in Padded test set:",len(tags_test))
print("\nCheck few of words and predicted tags:\n",words_test[:10],tags_test[:10])


8. Prepare Submission Data

'''
The task here is to convert the padded, fixed-length (180) predicted tags
back to the variable-length sentences of the test set.
1. If a sentence has fewer than 180 words,
   the predicted tags for the padded positions are skipped.
2. If a sentence has more than 180 words,
   all extra words are tagged with the "O" tag class.
'''

i = 0
j = 1
predicted_tags = []
counts = test_data.groupby('Sent_ID')['id'].count().tolist()

for count in counts:
    if count <= MAXLEN:
        predicted_tags.append(tags_test[i:i+count])
    else:
        predicted_tags.append(tags_test[i:i+MAXLEN])
        extra_tags = ['O'] * (count - MAXLEN)
        predicted_tags.append(extra_tags)

    i = j * MAXLEN
    j = j + 1

predictions_final = [item for sublist in predicted_tags for item in sublist]
print("\nLength of test set words and predicted tags should match.")
print("Length of predicted tags:",len(predictions_final))
print("Length of words in test set:",test_data['Word'].size)


9. Writing the Submission File

df = pd.read_csv("sample_submission.csv", encoding="latin1")
# Creating a dataframe in the submission format
df_results = pd.DataFrame({'id':df['id'],'Sent_ID':df['Sent_ID'],'tag':predictions_final})
# writing csv submission file
df_results.to_csv('submission_final.csv',sep=",", index=None)
df_results.head()
[Image: Sample predictions with tags]

10. Leaderboard Score

# Relaxed/Partial F1 score on private leaderboard was 77.8%
# Partial F1 score: F1 score that also gives credit for partially detected disease names
from IPython.display import Image
Image(filename='/home/abhijeet/Pictures/F1_score.png')
[Image: Hackathon leaderboard]

At The End

I hope it was easy to go through this tutorial, as I have tried to keep it short and simple. Beginners interested in text analytics can start with this application. Readers are strongly encouraged to download the data-set and check whether they can reproduce the results. I also hope the comments in each code block are sufficient to understand the code. Readers can ask in the comments if an explicit explanation is needed. A few more variants that can be tried as extensions are as follows:

  1. CRF Suite : Participants used a Conditional Random Field model along with feature engineering in the Hackathon and obtained better accuracy than posted here.
  2. Feature Engineering : Text paragraphs can be pre-processed and features engineered to improve the models. This blog-post does not involve any preprocessing except converting accented characters to ASCII.
  3. Deep Models : There are papers which report state-of-the-art results for NER problems. Readers can try implementing the deep model from this paper; an implementation of this paper in Keras can be found here. A rough BiLSTM-CRF sketch is given below.
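For the third point, here is a rough BiLSTM-CRF sketch. It assumes the keras-contrib package (which provides keras_contrib.layers.CRF) is installed and simply swaps the softmax output layer of the model above for a CRF layer; the hyper-parameters are carried over unchanged and are not tuned:

# A BiLSTM-CRF sketch (assumes the keras-contrib package is installed)
from keras.models import Model, Input
from keras.layers import LSTM, Embedding, Dropout, Bidirectional
from keras_contrib.layers import CRF

input_layer = Input(shape=(MAXLEN,))
x = Embedding(input_dim=n_words, output_dim=MAXLEN, input_length=MAXLEN)(input_layer)
x = Dropout(0.2)(x)
x = Bidirectional(LSTM(units=LSTM_N, return_sequences=True, recurrent_dropout=0.1))(x)

crf = CRF(n_tags)   # CRF output layer replaces the TimeDistributed softmax layer
out = crf(x)

crf_model = Model(input_layer, out)
crf_model.compile(optimizer="adam", loss=crf.loss_function, metrics=[crf.accuracy])
# crf_model.fit(X, np.array(y), batch_size=BS, epochs=2, validation_split=0.05, verbose=1)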

Finally, you can get the full python implementation of this blog-post in a Jupyter Notebook from GitHub link here.

If you liked the post, follow this blog to get updates about upcoming articles. Also, share it so that it can reach the readers who can actually benefit from it. Please feel free to discuss anything regarding the post; I would love to hear your feedback.

Happy Deep Learning 🙂