Named-entity recognition (NER), also known as entity extraction, is a sub-task of information extraction that seeks to locate and classify named entity mentions in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values and percentages. I recently participated in the “Innoplexus Online Hiring Hackathon: Saving lives with AI” on Analytics Vidhya, which was a named entity recognition task. Hence, I am writing this tutorial on the solution I submitted, which gives 77% accuracy on the test data. I was ranked 40th on both the public and private leaderboards of the challenge.
Problem Description
Clinical studies often require detailed patient information documented in clinical narratives. Named Entity Recognition (NER) is a fundamental Natural Language Processing (NLP) task that extracts entities of interest (e.g., disease names, medication names and lab tests) from clinical narratives, and thus supports clinical and translational research. Clinical notes have been analyzed in great detail to harness important information for clinical research and other healthcare operations, as they contain rich, detailed medical information.
In this challenge, participants are invited to extract all disease names from the given paragraphs/documents. The data-set can be downloaded from here.
- Test-set : 20000 documents
- Train-set : 30000 documents with labelled entities (diseases).
For example, here is a sentence from a clinical report:
We compared the inter-day reproducibility of post-occlusive reactive hyperemia (PORH) assessed by single-point laser Doppler flowmetry (LDF) and laser speckle contrast analysis (LSCI).
In the sentence above, “reactive hyperemia” is the named entity, with the type disease/indication.
Data Description
The train file has the following structure:
Variable | Definition |
---|---|
id | Unique ID for a token/word |
Doc_ID | Unique ID for a Document/Paragraph |
Sent_ID | Unique ID for a Sentence |
Word | Exact word/token |
tag (Target) | Named Entity Tag |
The target ‘tag’ follows the Inside-Outside-Beginning (IOB) format, a common scheme for tagging tokens in named entity recognition. The target ‘tag’ takes three kinds of values:
- B-indications : Beginning tag indicates that the token is the beginning of a disease entity (disease name in this case).
- I-indications : Inside tag indicates that the token is inside an entity.
- O : Outside tag indicates that a token is outside a disease entity.
Therefore, any word which does not represent a disease name has to be classified with the “O” tag. Similarly, the first word of a disease name has to be classified as “B-indications” and the following words of the disease name as “I-indications”.
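For illustration only (these tags are hand-assigned, not taken from the dataset), the example sentence from earlier would be tagged word by word roughly like this:

# Hand-made illustration of IOB tagging for (part of) the example sentence above;
# only "reactive hyperemia" is a disease entity.
tagged = [("We", "O"), ("compared", "O"), ("the", "O"), ("inter-day", "O"),
          ("reproducibility", "O"), ("of", "O"), ("post-occlusive", "O"),
          ("reactive", "B-indications"), ("hyperemia", "I-indications"),
          ("(", "O"), ("PORH", "O"), (")", "O")]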
Approach
Before going ahead with the deep learning and Python based implementation, it is important to clearly understand what kind of problem NER is. Beginners may confuse it with a sentence parsing problem or a classical classification problem. Essentially, unlike sentence or document classification techniques (as in this and this post), NER is a word classification problem where each word of the sentence has to be classified into one of the labelled tags.
An obvious question that arises is what kind of classifier can be used for such a problem: we need a classification model that can treat the words of a sentence as a sequence of states/nodes and, at the same time, tag each of these words with a class. This allows contextual learning of entities and classification of each word simultaneously. Among non-deep-learning models, Conditional Random Fields (CRFs) have been an obvious and popular choice for modelling NER problems. Readers can get an overview of CRFs from here. In order to implement a CRF for an NER application, one can use the popular sklearn-crfsuite implementation from here; a minimal sketch is shown below.
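To make the CRF route concrete, here is a minimal, illustrative sklearn-crfsuite sketch. This is not the model used in this post; the toy feature function and the train_sents variable are assumptions for illustration (train_sents would be a list of (word, tag) tuple lists, like the one built in section 4).

# Minimal sklearn-crfsuite sketch (illustrative only, not the model used below).
# train_sents is assumed to be a list of sentences, each a list of (word, tag) tuples.
import sklearn_crfsuite

def word2features(sent, i):
    # Toy feature set: the word itself plus simple shape features
    word = str(sent[i][0])
    return {
        "word.lower()": word.lower(),
        "word.isupper()": word.isupper(),
        "word.istitle()": word.istitle(),
        "word.isdigit()": word.isdigit(),
    }

X_crf = [[word2features(s, i) for i in range(len(s))] for s in train_sents]
y_crf = [[tag for _, tag in s] for s in train_sents]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1,
                           max_iterations=100, all_possible_transitions=True)
crf.fit(X_crf, y_crf)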
There are also open-source packages, such as spaCy, that implement deep-learning-based NER and are becoming popular in industry. This blog-post demonstrates a deep learning model that can be utilized for NER problems. The motivation of this blog-post is to train a custom NER model from scratch using Python and Keras, which allows the model to learn domain-specific entities such as the disease names here. So, let's start with the following steps:
- Importing Libraries
- Reading Data
- Creating Word & Tag dictionary
- Getting Train & Test Sentences
- Feature Extraction for DL Model
- Building Bidirectional LSTM Model
- Prediction on Test Set
- Prepare Submission Data
- Writing the Submission File
- Leaderboard Score
1. Importing Libraries
import pandas as pd
import numpy as np
from tqdm import tqdm, trange
import unicodedata
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.models import Model, Input
from keras.layers import LSTM, Embedding, Dense
from keras.layers import TimeDistributed, Dropout, Bidirectional

# Defining constants
# Maximum length of text sentences
MAXLEN = 180
# Number of LSTM units
LSTM_N = 150
# Batch size
BS = 48
2. Reading Data
# Reading the training set
data = pd.read_csv("train.csv", encoding="latin1")
data.head(10)
# Reading the test set
test_data = pd.read_csv("test.csv", encoding="latin1")
test_data.head(10)
3. Creating Word & Tag dictionary
print("Number of uniques docs, sentences and words in Training set:\n",data.nunique()) print("\nNumber of uniques docs, sentences and words in Test set:\n",test_data.nunique()) # Creating a vocabulary words = list(set(data["Word"].append(test_data["Word"]).values)) words.append("ENDPAD") # Converting greek characters to ASCII characters eg. 'naïve café' to 'naive cafe' words = [unicodedata.normalize('NFKD', str(w)).encode('ascii','ignore') for w in words] n_words = len(words) print("\nLength of vocabulary = ",n_words) tags = list(set(data["tag"].values)) n_tags = len(tags) print("\nnumber of tags = ",n_tags) # Creating words to indices dictionary. word2idx = {w: i for i, w in enumerate(words)} # Creating tags to indices dictionary. tag2idx = {t: i for i, t in enumerate(tags)}
4. Getting Train & Test Sentences
def get_tagged_sentences(data):
    '''
    Objective: To get the list of sentences along with their labelled tags.
    Returns a list of lists of (word, tag) tuples.
    Each inner list contains the words of a sentence along with their tags.
    '''
    agg_func = lambda s: [(w, t) for w, t in zip(s["Word"].values.tolist(),
                                                 s["tag"].values.tolist())]
    grouped = data.groupby("Sent_ID").apply(agg_func)
    sentences = [s for s in grouped]
    return sentences


def get_test_sentences(data):
    '''
    Objective: To get the list of sentences.
    Returns a list of lists of words.
    Each inner list contains the words of a sentence.
    '''
    agg_func = lambda s: [w for w in s["Word"].values.tolist()]
    grouped = data.groupby("Sent_ID").apply(agg_func)
    sentences = [s for s in grouped]
    return sentences


# Getting training sentences in a list
sentences = get_tagged_sentences(data)
print("First 2 sentences in a word list format:\n", sentences[0:2])
# Getting test sentences in a list
test_sentences = get_test_sentences(test_data)
print("First 2 sentences in a word list format:\n", test_sentences[0:2])
5. Feature Extraction for DL Model
# Converting words to indices for train sentences (features)
# Converting Greek/accented characters to ASCII characters in the train set, e.g. 'naïve café' to 'naive cafe'
X = [[word2idx[unicodedata.normalize('NFKD', str(w[0])).encode('ascii', 'ignore')]
      for w in s] for s in sentences]

# Converting words to indices for test sentences (features)
# Converting Greek/accented characters to ASCII characters in the test set, e.g. 'naïve café' to 'naive cafe'
X_test = [[word2idx[unicodedata.normalize('NFKD', str(w)).encode('ascii', 'ignore')]
           for w in s] for s in test_sentences]

'''
Padding train and test sentences to 180 words.
Sentences longer than 180 words are truncated.
Sentences shorter than 180 words are padded with the index of "ENDPAD" (n_words - 1).
'''
X = pad_sequences(maxlen=MAXLEN, sequences=X, padding="post", value=n_words - 1)
X_test = pad_sequences(maxlen=MAXLEN, sequences=X_test, padding="post", value=n_words - 1)

# Converting tags to indices for train sentences (labels)
y = [[tag2idx[w[1]] for w in s] for s in sentences]

# Padding tag labels to 180 words
y = pad_sequences(maxlen=MAXLEN, sequences=y, padding="post", value=tag2idx["O"])

# Converting labels to one-hot encoded form for the DL model
y = [to_categorical(i, num_classes=n_tags) for i in y]
6. Building Bidirectional LSTM Model
# 180-dimensional word indices as input
input = Input(shape=(MAXLEN,))

# Embedding layer (a 180-dimensional embedding vector is generated for each word index)
model = Embedding(input_dim=n_words, output_dim=MAXLEN, input_length=MAXLEN)(input)

# Adding a dropout layer
model = Dropout(0.2)(model)

# Bidirectional LSTM to learn from both forward and backward context
model = Bidirectional(LSTM(units=LSTM_N, return_sequences=True,
                           recurrent_dropout=0.1))(model)

# Adding a TimeDistributed Dense layer with softmax output, applied to each of the 180 timesteps
out = TimeDistributed(Dense(n_tags, activation="softmax"))(model)

model = Model(input, out)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
history = model.fit(X, np.array(y), batch_size=BS, epochs=2,
                    validation_split=0.05, verbose=1)
model.summary()
7. Prediction on Test Set
# Predicting with the trained model
pred = model.predict(X_test)
print("Predicted Probabilities on Test Set:\n", pred.shape)

# Taking the tag class with maximum probability
pred_index = np.argmax(pred, axis=-1)
print("Predicted tag indices: \n", pred_index.shape)
# Flatten both the features and predicted tags for submission
ids, tagids = X_test.flatten().tolist(), pred_index.flatten().tolist()

# Converting each word index back to the word
words_test = [words[ind].decode('utf-8') for ind in ids]

# Converting each predicted tag index back to the tag
tags_test = [tags[ind] for ind in tagids]

print("Length of words in Padded test set:", len(words_test))
print("Length of tags in Padded test set:", len(tags_test))
print("\nCheck few of words and predicted tags:\n", words_test[:10], tags_test[:10])
8. Prepare Submission Data
'''
The task here is to map the padded, fixed 180-dimensional predicted tags back to
the variable-length sentences of the test set.
1. If a sentence is shorter than 180 words, the predicted tags for the padded
   positions are skipped.
2. If a sentence is longer than 180 words, all words beyond 180 are tagged with
   the "O" class.
'''
i = 0
j = 1
predicted_tags = []
counts = test_data.groupby('Sent_ID')['id'].count().tolist()

for index, count in enumerate(counts):
    if count <= MAXLEN:
        predicted_tags.append(tags_test[i:i + count])
    else:
        predicted_tags.append(tags_test[i:i + MAXLEN])
        out = ['O'] * (count - MAXLEN)
        predicted_tags.append(out)
    i = j * MAXLEN
    j = j + 1

predictions_final = [item for sublist in predicted_tags for item in sublist]
print("\nLength of test set words and predicted tags should match.")
print("Length of predicted tags:", len(predictions_final))
print("Length of words in test set:", test_data['Word'].size)
9. Writing the Submission File
df = pd.read_csv("sample_submission.csv", encoding="latin1")

# Creating a dataframe in the submission format
df_results = pd.DataFrame({'id': df['id'], 'Sent_ID': df['Sent_ID'], 'tag': predictions_final})

# Writing the csv submission file
df_results.to_csv('submission_final.csv', sep=",", index=None)
df_results.head()
10. Leaderboard Score
# The relaxed/partial F1 score on the private leaderboard was 77.8%
# Partial F1 score: an F1 score that also gives credit for partial disease name detection
from IPython.display import Image
Image(filename='/home/abhijeet/Pictures/F1_score.png')
At The End
I hope it was easy to go through the tutorial, as I have tried to keep it short and simple. Beginners interested in text analytics can start with this application. Readers are strongly encouraged to download the data-set and check whether they can reproduce the results. I also hope the comments in each code block are sufficient to understand the code. Readers can discuss in the comments if an explicit explanation is needed. A few more variants that can be tried out as extensions are as follows:
- CRF Suite : Participants used a Conditional Random Field model along with feature engineering in the Hackathon to get better accuracy than reported here.
- Feature Engineering : A survey can be done on how to pre-process the text paragraphs or do some feature engineering to improve the models. This blog-post does not involve any kind of preprocessing except converting Greek/accented characters to ASCII.
- Deep Models : There are papers which report SOTA results for NER problems. Readers can try implementing the deep model from this paper; an implementation of this paper in Keras can be found here. A rough sketch of the idea is shown after this list.
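As a rough sketch of that direction (assuming the keras-contrib CRF layer, which is only one possible implementation and not what was used in this post), the softmax output layer of section 6 could be swapped for a CRF layer roughly like this:

# Rough sketch (assumption: the now-unmaintained keras-contrib package provides
# the CRF layer). This replaces the TimeDistributed softmax output of section 6.
from keras_contrib.layers import CRF

inp = Input(shape=(MAXLEN,))
emb = Embedding(input_dim=n_words, output_dim=MAXLEN, input_length=MAXLEN)(inp)
emb = Dropout(0.2)(emb)
bilstm = Bidirectional(LSTM(units=LSTM_N, return_sequences=True,
                            recurrent_dropout=0.1))(emb)
crf = CRF(n_tags)                     # CRF output layer instead of softmax
out = crf(bilstm)

crf_model = Model(inp, out)
crf_model.compile(optimizer="adam", loss=crf.loss_function, metrics=[crf.accuracy])
crf_model.fit(X, np.array(y), batch_size=BS, epochs=2, validation_split=0.05, verbose=1)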
Finally, you can get the full Python implementation of this blog-post as a Jupyter Notebook from the GitHub link here.
If you liked the post, follow this blog to get updates about upcoming articles. Also, share it so that it can reach readers who can actually gain from it. Please feel free to discuss anything regarding the post; I would love to hear your feedback.
Happy Deep Learning 🙂
I tried your code but I guess it is tagging every word with an ‘O’ tag.
Hey, maybe because this is an unbalanced dataset you are seeing mostly “O” tags. Find some disease names and you should see B tags too. I remember disease names were tagged correctly, hence the F1 score of 0.77.
Hey, could you please share your code for checking the F1 score of the model?
Oh, that was calculated by the Hackathon platform.
For now, you can calculate precision, recall and F1 score with stratified k-fold cross-validation on the train set using sklearn.
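A rough sketch of what I mean is below (word-level metrics only, not the hackathon's partial-matching F1; it reuses the X, y, tags and tag2idx objects from the post, and the model is re-trained on the training split only to avoid evaluating on data it has already seen):

# Rough sketch (illustrative): word-level precision/recall/F1 on a held-out split
# of the training data, using sklearn. Padded "O" positions inflate these numbers.
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X_tr, X_val, y_tr, y_val = train_test_split(X, np.array(y), test_size=0.1, random_state=42)
model.fit(X_tr, y_tr, batch_size=BS, epochs=2, verbose=1)   # re-train on the split only

val_pred = np.argmax(model.predict(X_val), axis=-1).flatten()
val_true = np.argmax(y_val, axis=-1).flatten()
print(classification_report(val_true, val_pred,
                            labels=[tag2idx[t] for t in tags if t != "O"],
                            target_names=[t for t in tags if t != "O"]))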
Hi,
I tested your model, it works fine. I have 2 questions:
1. Do you have an idea how to deal with unbalanced data for the robustness of the model (high occurrence of the outside tag)?
2. If I add a CRF layer to the model, do you think it will improve the F1 score?
Thank you!
Hey Ahmed,
1. That's the case here also; occurrences of entities are always relatively rare.
2. I have seen people use an ensemble of CRF and deep learning based models to improve accuracy.
Thanks
Hi Kumar,
I am a beginner and doing a project similar to yours – NER.
1. You mentioned “Motivation of this blog-post is to train a custom NER model from scratch using Python and Keras.” Do you mean to say that you customized the BiLSTM? If so, how is it different from the original BiLSTM?
2. Are you using a BiLSTM alone, without using RNN and CNN?
3. My train data needs I, O, B tags on individual words. What word embedding do you recommend for word tagging?
Thanks for your contribution. This article is greatly helpful for my project.
Hi Joe,
1. “Train a custom NER” means I did not use a pre-trained NER like NLTK or spaCy; instead I trained a BiLSTM model on the training data provided.
2. Yes, only a BiLSTM, check the model summary (section 6). BiLSTMs are better variants of RNNs.
3. Similar tagging is also there in this demonstration. If you check section 6, I have used an embedding layer to generate a 180-dimensional embedding vector. Also, I kept the max length of an input sentence at 180 so that we get a tag prediction for each word.
Thanks.
Is there any way to save and load this model? I have tried model.save and load, but after saving and loading the model I'm getting random accuracy, like an untrained model.
You can simply save the model after training is done; it should work. If you are testing it on some other data, it may not work and can give random accuracy.
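A minimal sketch of what I mean, assuming the model from section 6 (the file name is just an example):

# Minimal sketch: saving and re-loading the trained Keras model from section 6.
model.save("ner_bilstm.h5")          # hypothetical file name; saves weights + architecture

from keras.models import load_model
restored = load_model("ner_bilstm.h5")
pred = restored.predict(X_test)      # should match the predictions from before saving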
How did you save the model?
Hi,
How do we handle disease names that were not seen during training?