Question answering is a field of information retrieval and natural language processing concerned with building systems that automatically answer questions posed by humans. Ideally, the task works like an English reading comprehension task: given a passage or paragraph, the system should be able to process the text, understand it, and correctly answer questions about the passage, just as we humans do.

I will be covering this task in a couple of blog posts. The purpose of the series is to build a toy question answering application, explaining each step and giving readers an insight into its implementation.

The Task

In 2015, Facebook released the bAbI dataset and 20 toy tasks for testing text understanding and reasoning as part of the bAbI project. The tasks are described in detail in the paper here and on GitHub here. The aim is that each task tests a unique aspect of text understanding and reasoning, and hence a different capability of learning models. In this blog-post, we will demonstrate one of the toy tasks, named “Basic factoid QA with single supporting fact“. The task is very simple, as illustrated below.

Sandra travelled to the kitchen. 
Sandra travelled to the hallway. 
Mary went to the bathroom. 
Sandra moved to the garden. 

Where is Sandra ?
Ground Truth: Garden (based on the single supporting fact, sentence 4)

This reasoning task is not generic in nature; the answer is based on a single supporting fact from the story. The learning model should be able to learn the sequence of events and then answer the question “Where is Actor?“. “Factoid QA” means the answer is a single word.

bAbI Dataset

For the aforementioned task, we have 10,000 training examples and 1,000 test examples.

The file format for the task is as follows:

ID text
ID text
ID text
ID question[tab]answer[tab]supporting_fact ID.
...

Each sentence is given an ID. The IDs for a given “story” start at 1 and increase. When the IDs in a file reset back to 1, the following sentences belong to a new “story”. The supporting fact ID refers to a sentence within the same “story”.

1 Mary moved to the bathroom.
2 John went to the hallway.
3 Where is Mary?        bathroom        1
4 Daniel went back to the hallway.
5 Sandra moved to the garden.
6 Where is Daniel?      hallway         4
7 John moved to the office.
8 Sandra journeyed to the bathroom.
9 Where is Daniel?      hallway         4
10 Mary moved to the hallway.
11 Daniel travelled to the office.
12 Where is Daniel?     office          11
13 John went back to the garden.
14 John moved to the bedroom.
15 Where is Sandra?     bathroom        8
1 Sandra travelled to the office.
2 Sandra went to the bathroom.
3 Where is Sandra?      bathroom        2

In the following sections, we will go step by step through:

  • Getting Data
  • Getting User Stories: Pre-Processing
  • Feature Extraction
  • Building the Memory Network Model
  • Training the Model
  • Tests & Results

1. Getting Data

Importing Libraries

import re
import tarfile
import numpy as np
from functools import reduce
from keras.utils.data_utils import get_file
from keras.preprocessing.sequence import pad_sequences

Downloading Data

try:
    # the tar.gz data-set gets saved under the "~/.keras/datasets/" path
    path = get_file('babi-tasks-v1-2.tar.gz', origin='https://s3.amazonaws.com/text-datasets/babi_tasks_1-20_v1-2.tar.gz')
except:
    print('Error downloading dataset, please download it manually:\n'
          '$ wget http://www.thespermwhale.com/jaseweston/babi/tasks_1-20_v1-2.tar.gz\n'
          '$ mv tasks_1-20_v1-2.tar.gz ~/.keras/datasets/babi-tasks-v1-2.tar.gz')
    raise

#reading a tar.gz file
tar = tarfile.open(path)

If you are not able to download it this way, the dataset can also be downloaded manually from here.
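If you go the manual route, you can point tarfile at the local file instead; a minimal sketch, assuming the archive was saved to the default Keras datasets directory (adjust the path to wherever you placed it):

import os
import tarfile

# assumed location of the manually downloaded archive; adjust as needed
local_path = os.path.expanduser('~/.keras/datasets/babi-tasks-v1-2.tar.gz')
tar = tarfile.open(local_path)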

2. Getting User Stories: Pre-Processing

This is a crucial step in which we ingest the train and test files and extract stories from them. The raw format of the stories, questions, and answers is shown above. First, we will write some helper functions.

def tokenize(sent):
    '''
    argument: a sentence string
    returns a list of tokens (words and punctuation)
    '''
    return [x.strip() for x in re.split(r'(\W+)', sent) if x.strip()]
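# A quick sanity check (illustrative example): the tokenizer keeps
# punctuation as a separate token, e.g.
#   tokenize('Mary moved to the bathroom.')
#   -> ['Mary', 'moved', 'to', 'the', 'bathroom', '.']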

def parse_stories(lines):
    '''
    - Parse stories provided in the bAbI tasks format
    - A story spans lines 1 to 15; every 3rd line
      is a question & answer.
    - Function extracts sub-stories within a story and
      creates tuples
    '''
    data = []
    story = []
    for line in lines:
        line = line.decode('utf-8').strip()
        nid, line = line.split(' ', 1)
        nid = int(nid)
        if nid == 1:
            # reset story when line ID=1 (start of new story)
            story = []
        if '\t' in line:
            # this line is tab separated Q, A & support fact ID
            q, a, supporting = line.split('\t')
            # tokenize the words of question
            q = tokenize(q)
            # collect all story sentences seen so far (the sub-story for this question)
            substory = [x for x in story if x]
            # the (sub-story, question, answer) tuple is appended to the data-set
            data.append((substory, q, a))
            # append an empty placeholder so list positions keep matching the line IDs
            story.append('')
        else:
            # this line is a sentence of story
            sent = tokenize(line)
            story.append(sent)
    return data

def get_stories(f):
    '''
    argument: filename
    returns list of all stories in the argument data-set file
    '''
    # read the data file and parse 10k stories
    data = parse_stories(f.readlines())
    # lambda func to flatten the list of sentences into one list
    flatten = lambda data: reduce(lambda x, y: x + y, data)
    # creating list of tuples for each story
    data = [(flatten(story), q, answer) for story, q, answer in data]
    return data
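The flatten lambda above simply concatenates the tokenized sentences of a story into one flat list of tokens. A small standalone example (the sentences are made up):

from functools import reduce

# flatten concatenates a list of tokenized sentences into one token list
flatten = lambda data: reduce(lambda x, y: x + y, data)
print(flatten([['Mary', 'moved', '.'], ['John', 'went', '.']]))
# -> ['Mary', 'moved', '.', 'John', 'went', '.']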

Next, we use the above helper functions to get the train and test stories.

challenge = 'tasks_1-20_v1-2/en-10k/qa1_single-supporting-fact_{}.txt'
print('Extracting stories for the challenge: single_supporting_fact_10k')
# Extracting train stories
train_stories = get_stories(tar.extractfile(challenge.format('train')))
# Extracting test stories
test_stories = get_stories(tar.extractfile(challenge.format('test')))

To validate that the stories were extracted properly, we can simply check the number of stories in the train and test variables. Let us also see what a story looks like at this point.

print('Number of training stories:', len(train_stories))
print('Number of test stories:', len(test_stories))
train_stories[0]
[Image: bAbI data-set story sample]
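For reference, the first training story should look roughly like the tuple below (exact tokens depend on the tokenizer):

# train_stories[0] is a (story_tokens, question_tokens, answer) tuple, roughly:
# (['Mary', 'moved', 'to', 'the', 'bathroom', '.',
#   'John', 'went', 'to', 'the', 'hallway', '.'],
#  ['Where', 'is', 'Mary', '?'],
#  'bathroom')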

3. Feature Extraction

Let us first write a helper function to vectorize each story so that it can be fed to the memory network model which we will create later.

def vectorize_stories(data, word_idx, story_maxlen, query_maxlen):
    # story vector initialization
    X = []
    # query vector initialization
    Xq = []
    # answer vector initialization
    Y = []
    for story, query, answer in data:
        # creating list of story word indices
        x = [word_idx[w] for w in story]
        # creating list of query word indices
        xq = [word_idx[w] for w in query]
        # let's not forget that index 0 is reserved
        y = np.zeros(len(word_idx) + 1)
        # creating label 1 for the answer word index (one-hot answer vector)
        y[word_idx[answer]] = 1
        X.append(x)
        Xq.append(xq)
        Y.append(y)
    return (pad_sequences(X, maxlen=story_maxlen),
            pad_sequences(Xq, maxlen=query_maxlen), np.array(Y))

The following snippet creates a vocabulary dictionary and extracts word-index vectors as input features.

# creating vocabulary of words in train and test set
vocab = set()
for story, q, answer in train_stories + test_stories:
    vocab |= set(story + q + [answer])

# sorting the vocabulary
vocab = sorted(vocab)

# Reserve 0 for masking via pad_sequences
vocab_size = len(vocab) + 1

# calculate maximum length of story
story_maxlen = max(map(len, (x for x, _, _ in train_stories + test_stories)))

# calculate maximum length of question/query
query_maxlen = max(map(len, (x for _, x, _ in train_stories + test_stories)))

# creating word to index dictionary
word_idx = dict((c, i + 1) for i, c in enumerate(vocab))

# creating index to word dictionary
idx_word = dict((i+1, c) for i,c in enumerate(vocab))

# vectorize train story, query and answer sentences/word using vocab
inputs_train, queries_train, answers_train = vectorize_stories(train_stories,
                                                               word_idx,
                                                               story_maxlen,
                                                               query_maxlen)
# vectorize test story, query and answer sentences/word using vocab
inputs_test, queries_test, answers_test = vectorize_stories(test_stories,
                                                            word_idx,
                                                            story_maxlen,
                                                            query_maxlen)
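As a quick sanity check (a minimal sketch, assuming the variables above are in scope), we can decode the first vectorized story and query back into words with the idx_word dictionary, skipping the 0 index reserved for padding:

# decode the first vectorized story back into words; 0 is the padding index
decoded_story = [idx_word[int(i)] for i in inputs_train[0] if i != 0]
print(' '.join(decoded_story))

# decode the corresponding query in the same way
decoded_query = [idx_word[int(i)] for i in queries_train[0] if i != 0]
print(' '.join(decoded_query))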

We have now prepared the feature vectors for our toy question answering system. Let us display some analysis of the extracted vectors. The Python snippets below explore what these vectors and the vocabulary look like.

print('-------------------------')
print('Vocabulary:\n',vocab,"\n")
print('Vocab size:', vocab_size, 'unique words')
print('Story max length:', story_maxlen, 'words')
print('Query max length:', query_maxlen, 'words')
print('Number of training stories:', len(train_stories))
print('Number of test stories:', len(test_stories))
print('-------------------------')
[Image: vocabulary list]
print('-------------------------')
print('inputs: integer tensor of shape (samples, max_length)')
print('inputs_train shape:', inputs_train.shape)
print('inputs_test shape:', inputs_test.shape)
print('input train sample', inputs_train[0,:])
print('-------------------------')
[Image: story vector with word indices]
print('-------------------------')
print('queries: integer tensor of shape (samples, max_length)')
print('queries_train shape:', queries_train.shape)
print('queries_test shape:', queries_test.shape)
print('query train sample', queries_train[0,:])
print('-------------------------')
[Image: question vector with word indices]
print('-------------------------')
print('answers: binary (1 or 0) tensor of shape (samples, vocab_size)')
print('answers_train shape:', answers_train.shape)
print('answers_test shape:', answers_test.shape)
print('answer train sample', answers_train[0,:])
print('-------------------------')
[Image: one-hot encoded answer vector]
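Since each answer is one-hot encoded over the vocabulary, the idx_word dictionary built earlier can map it back to a word; a minimal sketch (the same trick will work on predicted probability vectors later, via argmax):

# recover the answer word from its one-hot vector
answer_idx = int(np.argmax(answers_train[0]))
print('decoded answer:', idx_word[answer_idx])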

What’s Next?

I hope this tutorial was easy to follow; I have tried to keep each step precise and explainable so that it can be understood and reproduced (up to feature extraction). Most of the code is commented to make the logic clear. In the subsequent blog post (Part 2), we will walk through the following steps using Keras to train a neural network model for the question answering system.

  1. Understanding & Building Memory Network Model
  2. Training the Model
  3. Visualization
  4. Tests Results & Demo

If you liked the post, follow this blog to get updates about upcoming articles. Also, share it so that it can reach readers who might actually gain from it. Please feel free to discuss anything regarding the post. I would love to hear your feedback.

Happy deep learning 🙂