Question answering is a field of information retrieval and natural language processing concerned with building systems that automatically answer questions posed by humans. Ideally, the task would look like an English reading comprehension task: given a passage or paragraph, the system should be able to process the text, understand it, and correctly answer questions about the passage, just as we humans do.
I will be covering this task in a couple of blog-posts. The purpose of the series is to build a toy question answering application step by step, and to give readers an insight into how each step is implemented.
The Task
In 2015, Facebook introduced the bAbI data-set and 20 toy tasks for testing text understanding and reasoning as part of the bAbI project. The tasks are described in detail in the paper here and on GitHub here. Each task tests a unique aspect of text understanding and reasoning, and hence probes a different capability of learning models. In this blog-post, we will demonstrate one of the toy tasks, named “Basic factoid QA with single supporting fact“. The task is very simple and is illustrated below.
Sandra travelled to the kitchen.
Sandra travelled to the hallway.
Mary went to the bathroom.
Sandra moved to the garden.
Where is Sandra?
Ground Truth: Garden (based on single supporting fact 4)
This reasoning task per se is not generic in nature; the answer is based on a single supporting fact from the story. The learning model should be able to learn the sequence of events and then answer the question “Where is the actor?“. Factoid QA means the answer is a single word.
bAbI Dataset
For the aforementioned task, we have 10,000 training examples and 1,000 testing examples.
The file format for the task is as follows:
ID text
ID text
ID text
ID question[tab]answer[tab]supporting_fact_IDs
...
Each sentence is given an ID. The IDs for a given “story” start at 1 and increase. When the IDs in a file reset back to 1, you can consider the following sentences a new “story”. The supporting fact ID refers to a sentence within a “story”.
1 Mary moved to the bathroom.
2 John went to the hallway.
3 Where is Mary? bathroom 1
4 Daniel went back to the hallway.
5 Sandra moved to the garden.
6 Where is Daniel? hallway 4
7 John moved to the office.
8 Sandra journeyed to the bathroom.
9 Where is Daniel? hallway 4
10 Mary moved to the hallway.
11 Daniel travelled to the office.
12 Where is Daniel? office 11
13 John went back to the garden.
14 John moved to the bedroom.
15 Where is Sandra? bathroom 8
1 Sandra travelled to the office.
2 Sandra went to the bathroom.
3 Where is Sandra? bathroom 2
In the following sections, we will go step by step through:
- Getting data
- Getting User Stories : Pre-Processing
- Feature Extraction
- Building Memory Network model
- Train the Model
- Tests & Results
1. Getting Data
Importing Libraries
import re
import tarfile

import numpy as np
from functools import reduce
from keras.utils.data_utils import get_file
from keras.preprocessing.sequence import pad_sequences
Downloading Data
try:
    # the tar.gz data-set gets saved under "~/.keras/datasets/"
    path = get_file('babi-tasks-v1-2.tar.gz',
                    origin='https://s3.amazonaws.com/text-datasets/babi_tasks_1-20_v1-2.tar.gz')
except:
    print('Error downloading dataset, please download it manually:\n'
          '$ wget http://www.thespermwhale.com/jaseweston/babi/tasks_1-20_v1-2.tar.gz\n'
          '$ mv tasks_1-20_v1-2.tar.gz ~/.keras/datasets/babi-tasks-v1-2.tar.gz')
    raise

# reading the tar.gz archive
tar = tarfile.open(path)
If you are not able to download it this way, the data-set can also be downloaded manually from here.
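If you go the manual route, you can point tarfile directly at the local archive instead; a minimal sketch, assuming the archive was saved to the default Keras datasets directory (adjust the path to wherever you placed the file):

import os

# open the manually downloaded archive instead of calling get_file()
tar = tarfile.open(os.path.expanduser('~/.keras/datasets/babi-tasks-v1-2.tar.gz'))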
2. Getting User Stories : Pre-processing
This is a crucial step where we ingest the train and test files and extract stories from them. The raw format in which the text stories, questions and answers are kept is shown above. First, we will write some helper functions.
def tokenize(sent):
    '''
    argument: a sentence string
    returns a list of tokens (words)
    '''
    # r'(\W+)' keeps punctuation as separate tokens; the raw, non-optional
    # pattern also avoids the empty-match pitfall of '(\W+)?' on Python 3.7+
    return [x.strip() for x in re.split(r'(\W+)', sent) if x.strip()]


def parse_stories(lines):
    '''
    - Parse stories provided in the bAbI tasks format
    - A story runs from line ID 1 to line ID 15; every 3rd line is a
      question & answer
    - Extracts the sub-stories within a story and creates tuples
    '''
    data = []
    story = []
    for line in lines:
        line = line.decode('utf-8').strip()
        nid, line = line.split(' ', 1)
        nid = int(nid)
        if nid == 1:
            # reset story when line ID = 1 (start of a new story)
            story = []
        if '\t' in line:
            # this line is a tab-separated question, answer & supporting fact ID
            q, a, supporting = line.split('\t')
            # tokenize the words of the question
            q = tokenize(q)
            # provide all the sub-story sentences seen up to this question
            substory = [x for x in story if x]
            # a sub-story ends here and is appended to the data-set
            data.append((substory, q, a))
            story.append('')
        else:
            # this line is a sentence of the story
            sent = tokenize(line)
            story.append(sent)
    return data


def get_stories(f):
    '''
    argument: file handle
    returns a list of all stories in the given data-set file
    '''
    # read the data file and parse the stories
    data = parse_stories(f.readlines())
    # lambda to flatten a list of sentences into one list of tokens
    flatten = lambda data: reduce(lambda x, y: x + y, data)
    # create a (story, question, answer) tuple for each story
    data = [(flatten(story), q, answer) for story, q, answer in data]
    return data
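To get a feel for these helpers, here is what tokenize produces on one of the story sentences (a quick sanity check, not part of the pipeline):

# punctuation is kept as its own token
tokenize('Mary moved to the bathroom.')
# ['Mary', 'moved', 'to', 'the', 'bathroom', '.']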
Next, we use the above helper functions to get the train and test stories.
challenge = 'tasks_1-20_v1-2/en-10k/qa1_single-supporting-fact_{}.txt'
print('Extracting stories for the challenge: single_supporting_fact_10k')

# extracting train stories
train_stories = get_stories(tar.extractfile(challenge.format('train')))
# extracting test stories
test_stories = get_stories(tar.extractfile(challenge.format('test')))
To validate that the stories were extracted properly, we can simply check the number of stories in the train and test variables. Let us also see what a story looks like at this point.
print('Number of training stories:', len(train_stories))
print('Number of test stories:', len(test_stories))
train_stories[0]
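Given the data-set sizes mentioned earlier and the sample story shown above, the output should look roughly like this (the tuple is the first sub-story, cut off at its question):

# Number of training stories: 10000
# Number of test stories: 1000
#
# (['Mary', 'moved', 'to', 'the', 'bathroom', '.',
#   'John', 'went', 'to', 'the', 'hallway', '.'],
#  ['Where', 'is', 'Mary', '?'],
#  'bathroom')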
3. Feature Extraction
Let us first write a helper function to vectorize each story so that it can be fed to the memory network model we will create later.
def vectorize_stories(data, word_idx, story_maxlen, query_maxlen):
    # story vectors
    X = []
    # query vectors
    Xq = []
    # answer vectors
    Y = []
    for story, query, answer in data:
        # list of word indices for the story
        x = [word_idx[w] for w in story]
        # list of word indices for the query
        xq = [word_idx[w] for w in query]
        # let's not forget that index 0 is reserved for padding
        y = np.zeros(len(word_idx) + 1)
        # set label 1 at the answer word's index
        y[word_idx[answer]] = 1
        X.append(x)
        Xq.append(xq)
        Y.append(y)
    return (pad_sequences(X, maxlen=story_maxlen),
            pad_sequences(Xq, maxlen=query_maxlen),
            np.array(Y))
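The padding step is worth a quick illustration. By default, pad_sequences left-pads each sequence with zeros up to maxlen, which is exactly why index 0 is reserved and never assigned to any word; a minimal sketch:

from keras.preprocessing.sequence import pad_sequences

# sequences shorter than maxlen are padded with 0 at the front
pad_sequences([[4, 7, 2]], maxlen=6)
# array([[0, 0, 0, 4, 7, 2]])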
The following snippet creates a vocabulary dictionary and extracts word index vectors as input features.
# creating vocabulary of words in train and test set
vocab = set()
for story, q, answer in train_stories + test_stories:
    vocab |= set(story + q + [answer])

# sorting the vocabulary
vocab = sorted(vocab)

# reserve index 0 for masking via pad_sequences
vocab_size = len(vocab) + 1

# maximum story length
story_maxlen = max(map(len, (x for x, _, _ in train_stories + test_stories)))
# maximum question/query length
query_maxlen = max(map(len, (x for _, x, _ in train_stories + test_stories)))

# word -> index dictionary
word_idx = dict((c, i + 1) for i, c in enumerate(vocab))
# index -> word dictionary
idx_word = dict((i + 1, c) for i, c in enumerate(vocab))

# vectorize train stories, queries and answers using the vocabulary
inputs_train, queries_train, answers_train = vectorize_stories(
    train_stories, word_idx, story_maxlen, query_maxlen)
# vectorize test stories, queries and answers using the vocabulary
inputs_test, queries_test, answers_test = vectorize_stories(
    test_stories, word_idx, story_maxlen, query_maxlen)
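Since the vocabulary for this task is tiny and sorted, word_idx simply maps each unique token to a small integer starting at 1. It should look roughly like this (shown for illustration; the exact indices depend on the sorted order):

print(word_idx)
# roughly: {'.': 1, '?': 2, 'Daniel': 3, 'John': 4, 'Mary': 5, 'Sandra': 6,
#           'Where': 7, 'back': 8, 'bathroom': 9, ...}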
We have now prepared the feature vectors for our toy question answering system. Let us display some analysis of the extracted vectors. The Python snippets below explore what these vectors and the vocabulary look like.
print('-------------------------')
print('Vocabulary:\n', vocab, '\n')
print('Vocab size:', vocab_size, 'unique words')
print('Story max length:', story_maxlen, 'words')
print('Query max length:', query_maxlen, 'words')
print('Number of training stories:', len(train_stories))
print('Number of test stories:', len(test_stories))
print('-------------------------')
print('-------------------------')
print('inputs: integer tensor of shape (samples, max_length)')
print('inputs_train shape:', inputs_train.shape)
print('inputs_test shape:', inputs_test.shape)
print('input train sample', inputs_train[0, :])
print('-------------------------')
print('-------------------------')
print('queries: integer tensor of shape (samples, max_length)')
print('queries_train shape:', queries_train.shape)
print('queries_test shape:', queries_test.shape)
print('query train sample', queries_train[0, :])
print('-------------------------')
print('-------------------------')
print('answers: binary (1 or 0) tensor of shape (samples, vocab_size)')
print('answers_train shape:', answers_train.shape)
print('answers_test shape:', answers_test.shape)
print('answer train sample', answers_train[0, :])
print('-------------------------')
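As a final sanity check, since each answer is a one-hot vector over the vocabulary, we can decode it back to a word with the idx_word dictionary built earlier (assuming the first training answer is 'bathroom', as in the sample story):

# np.argmax finds the index of the 1 in the one-hot answer vector
print('decoded answer:', idx_word[np.argmax(answers_train[0])])
# decoded answer: bathroom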
What’s Next?
I hope this tutorial was easy to follow, as I have tried to keep each step precise and explainable so that it can be understood and reproduced (up to feature extraction). Most of the code is commented to make the logic clear. In the subsequent blog-post (PART 2), we will walk through the following steps, using Keras to train a neural network model for the question answering system.
- Understanding & Building Memory Network Model
- Training the Model
- Visualization
- Tests Results & Demo
If you liked the post, follow this blog to get updates about upcoming articles. Also, share it so that it reaches readers who can actually gain from it. Please feel free to discuss anything regarding the post. I would love to hear your feedback.
Happy deep learning 🙂