Question answering is a field of information retrieval and natural language processing concerned with building systems that automatically answer questions posed by humans. Ideally, the task works like an English reading comprehension task: given a passage or paragraph, the system should be able to process the text, understand it, and correctly answer questions about the passage, just as we humans do.

I will be covering this task in a couple of blog posts. The purpose of the series is to build a toy question answering application, explaining each step and giving readers an insight into its implementation.

The Task

In 2015, Facebook released the bAbI dataset and 20 toy tasks for testing text understanding and reasoning as part of the bAbI project. The tasks are described in detail in the paper here and on GitHub here. The aim is that each task tests a unique aspect of text understanding and reasoning, and hence a different capability of learning models. In this blog-post, we will demonstrate one of the toy tasks, named “Basic factoid QA with single supporting fact“. The task is very simple, as illustrated below.

Sandra travelled to the kitchen. 
Sandra travelled to the hallway. 
Mary went to the bathroom. 
Sandra moved to the garden. 

Where is Sandra ?
Ground Truth: Garden (based on the single supporting fact, sentence 4)

This reasoning task is not generic in nature; the answer is based on a single supporting fact from the story. The learning model should be able to learn the sequence of events and then answer the question “Where is Actor?“. “Factoid QA” means the answer is a single word.

bAbI Dataset

For the aforementioned task, we have 10,000 training examples and 1,000 test examples.

The file format for the task is as follows:

ID text
ID text
ID text
ID question[tab]answer[tab]supporting_fact ID.
...

Each sentence is given an ID. The IDs for a given “story” start at 1 and increase. When the IDs in a file reset back to 1, the following sentences belong to a new “story”. The supporting fact ID refers to a sentence within the same “story”.

1 Mary moved to the bathroom.
2 John went to the hallway.
3 Where is Mary?        bathroom        1
4 Daniel went back to the hallway.
5 Sandra moved to the garden.
6 Where is Daniel?      hallway         4
7 John moved to the office.
8 Sandra journeyed to the bathroom.
9 Where is Daniel?      hallway         4
10 Mary moved to the hallway.
11 Daniel travelled to the office.
12 Where is Daniel?     office          11
13 John went back to the garden.
14 John moved to the bedroom.
15 Where is Sandra?     bathroom        8
1 Sandra travelled to the office.
2 Sandra went to the bathroom.
3 Where is Sandra?      bathroom        2

In the following sections, we will go step by step through:

  • Getting Data
  • Getting User Stories: Pre-Processing
  • Feature Extraction
  • Building the Memory Network Model
  • Training the Model
  • Tests & Results

1. Getting Data

Importing Libraries

import re
import tarfile
import numpy as np
from functools import reduce
from keras.utils.data_utils import get_file
from keras.preprocessing.sequence import pad_sequences

Downloading Data

try:
    # the tar.gz data-set gets saved under the "~/.keras/datasets/" path
    path = get_file('babi-tasks-v1-2.tar.gz', origin='https://s3.amazonaws.com/text-datasets/babi_tasks_1-20_v1-2.tar.gz')
except:
    print('Error downloading dataset, please download it manually:\n'
          '$ wget http://www.thespermwhale.com/jaseweston/babi/tasks_1-20_v1-2.tar.gz\n'
          '$ mv tasks_1-20_v1-2.tar.gz ~/.keras/datasets/babi-tasks-v1-2.tar.gz')
    raise

#reading a tar.gz file
tar = tarfile.open(path)

If you are not able to download it this way, the dataset can also be downloaded manually from here.
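If you go the manual route, you can point tarfile at the local file instead; a minimal sketch, assuming the archive was saved to the default Keras datasets directory (adjust the path to wherever you placed it):

import os
import tarfile

# assumed location of the manually downloaded archive; adjust as needed
local_path = os.path.expanduser('~/.keras/datasets/babi-tasks-v1-2.tar.gz')
tar = tarfile.open(local_path)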

2. Getting User Stories: Pre-Processing

This is a crucial step in which we ingest the train and test files and extract stories from them. The raw format of the stories, questions, and answers is shown above. First, we will write some helper functions.

def tokenize(sent):
    '''
    argument: a sentence string
    returns a list of tokens (words and punctuation)
    '''
    return [x.strip() for x in re.split(r'(\W+)', sent) if x.strip()]
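# A quick sanity check (illustrative example): the tokenizer keeps
# punctuation as a separate token, e.g.
#   tokenize('Mary moved to the bathroom.')
#   -> ['Mary', 'moved', 'to', 'the', 'bathroom', '.']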

def parse_stories(lines):
    '''
    - Parse stories provided in the bAbI tasks format
    - A story spans lines 1 to 15; every 3rd line
      is a question & answer.
    - Function extracts sub-stories within a story and
      creates tuples
    '''
    data = []
    story = []
    for line in lines:
        line = line.decode('utf-8').strip()
        nid, line = line.split(' ', 1)
        nid = int(nid)
        if nid == 1:
            # reset story when line ID=1 (start of new story)
            story = []
        if '\t' in line:
            # this line is tab separated Q, A & support fact ID
            q, a, supporting = line.split('\t')
            # tokenize the words of question
            q = tokenize(q)
            # collect all story sentences seen so far (the sub-story for this question)
            substory = [x for x in story if x]
            # the (sub-story, question, answer) tuple is appended to the data-set
            data.append((substory, q, a))
            # append an empty placeholder so list positions keep matching the line IDs
            story.append('')
        else:
            # this line is a sentence of story
            sent = tokenize(line)
            story.append(sent)
    return data

def get_stories(f):
    '''
    argument: filename
    returns list of all stories in the argument data-set file
    '''
    # read the data file and parse 10k stories
    data = parse_stories(f.readlines())
    # lambda func to flatten the list of sentences into one list
    flatten = lambda data: reduce(lambda x, y: x + y, data)
    # creating list of tuples for each story
    data = [(flatten(story), q, answer) for story, q, answer in data]
    return data
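The flatten lambda above simply concatenates the tokenized sentences of a story into one flat list of tokens. A small standalone example (the sentences are made up):

from functools import reduce

# flatten concatenates a list of tokenized sentences into one token list
flatten = lambda data: reduce(lambda x, y: x + y, data)
print(flatten([['Mary', 'moved', '.'], ['John', 'went', '.']]))
# -> ['Mary', 'moved', '.', 'John', 'went', '.']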

Next, we use the above helper functions to get the train and test stories.

challenge = 'tasks_1-20_v1-2/en-10k/qa1_single-supporting-fact_{}.txt'
print('Extracting stories for the challenge: single_supporting_fact_10k')
# Extracting train stories
train_stories = get_stories(tar.extractfile(challenge.format('train')))
# Extracting test stories
test_stories = get_stories(tar.extractfile(challenge.format('test')))

To validate that the stories were extracted properly, we can simply check the number of stories in the train and test variables. Let us also see what a story looks like at this point.

print('Number of training stories:', len(train_stories))
print('Number of test stories:', len(test_stories))
train_stories[0]
[Image: bAbI data-set story sample]
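For reference, the first training story should look roughly like the tuple below (exact tokens depend on the tokenizer):

# train_stories[0] is a (story_tokens, question_tokens, answer) tuple, roughly:
# (['Mary', 'moved', 'to', 'the', 'bathroom', '.',
#   'John', 'went', 'to', 'the', 'hallway', '.'],
#  ['Where', 'is', 'Mary', '?'],
#  'bathroom')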

3. Feature Extraction

Let us first write a helper function to vectorize each story so that it can be fed to the memory network model which we will create later.

def vectorize_stories(data, word_idx, story_maxlen, query_maxlen):
    # story vector initialization
    X = []
    # query vector initialization
    Xq = []
    # answer vector initialization
    Y = []
    for story, query, answer in data:
        # creating list of story word indices
        x = [word_idx[w] for w in story]
        # creating list of query word indices
        xq = [word_idx[w] for w in query]
        # let's not forget that index 0 is reserved
        y = np.zeros(len(word_idx) + 1)
        # creating label 1 for the answer word index (one-hot answer vector)
        y[word_idx[answer]] = 1
        X.append(x)
        Xq.append(xq)
        Y.append(y)
    return (pad_sequences(X, maxlen=story_maxlen),
            pad_sequences(Xq, maxlen=query_maxlen), np.array(Y))

The following snippet creates a vocabulary dictionary and extracts word-index vectors as input features.

# creating vocabulary of words in train and test set
vocab = set()
for story, q, answer in train_stories + test_stories:
    vocab |= set(story + q + [answer])

# sorting the vocabulary
vocab = sorted(vocab)

# Reserve 0 for masking via pad_sequences
vocab_size = len(vocab) + 1

# calculate maximum length of story
story_maxlen = max(map(len, (x for x, _, _ in train_stories + test_stories)))

# calculate maximum length of question/query
query_maxlen = max(map(len, (x for _, x, _ in train_stories + test_stories)))

# creating word to index dictionary
word_idx = dict((c, i + 1) for i, c in enumerate(vocab))

# creating index to word dictionary
idx_word = dict((i+1, c) for i,c in enumerate(vocab))

# vectorize train story, query and answer sentences/word using vocab
inputs_train, queries_train, answers_train = vectorize_stories(train_stories,
                                                               word_idx,
                                                               story_maxlen,
                                                               query_maxlen)
# vectorize test story, query and answer sentences/word using vocab
inputs_test, queries_test, answers_test = vectorize_stories(test_stories,
                                                            word_idx,
                                                            story_maxlen,
                                                            query_maxlen)
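As a quick sanity check (a minimal sketch, assuming the variables above are in scope), we can decode the first vectorized story and query back into words with the idx_word dictionary, skipping the 0 index reserved for padding:

# decode the first vectorized story back into words; 0 is the padding index
decoded_story = [idx_word[int(i)] for i in inputs_train[0] if i != 0]
print(' '.join(decoded_story))

# decode the corresponding query in the same way
decoded_query = [idx_word[int(i)] for i in queries_train[0] if i != 0]
print(' '.join(decoded_query))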

We have now prepared the feature vectors for our toy question answering system. Let us display some analysis of the extracted vectors. The Python snippets below explore what these vectors and the vocabulary look like.

print('-------------------------')
print('Vocabulary:\n',vocab,"\n")
print('Vocab size:', vocab_size, 'unique words')
print('Story max length:', story_maxlen, 'words')
print('Query max length:', query_maxlen, 'words')
print('Number of training stories:', len(train_stories))
print('Number of test stories:', len(test_stories))
print('-------------------------')
[Image: vocabulary list]
print('-------------------------')
print('inputs: integer tensor of shape (samples, max_length)')
print('inputs_train shape:', inputs_train.shape)
print('inputs_test shape:', inputs_test.shape)
print('input train sample', inputs_train[0,:])
print('-------------------------')
[Image: story vector with word indices]
print('-------------------------')
print('queries: integer tensor of shape (samples, max_length)')
print('queries_train shape:', queries_train.shape)
print('queries_test shape:', queries_test.shape)
print('query train sample', queries_train[0,:])
print('-------------------------')
[Image: question vector with word indices]
print('-------------------------')
print('answers: binary (1 or 0) tensor of shape (samples, vocab_size)')
print('answers_train shape:', answers_train.shape)
print('answers_test shape:', answers_test.shape)
print('answer train sample', answers_train[0,:])
print('-------------------------')
[Image: one-hot encoded answer vector]
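Since each answer is one-hot encoded over the vocabulary, the idx_word dictionary built earlier can map it back to a word; a minimal sketch (the same trick will work on predicted probability vectors later, via argmax):

# recover the answer word from its one-hot vector
answer_idx = int(np.argmax(answers_train[0]))
print('decoded answer:', idx_word[answer_idx])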

What’s Next?

I hope this tutorial was easy to follow; I have tried to keep each step precise and explainable so that it can be understood and reproduced (up to feature extraction). Most of the code is commented to make the logic clear. In the subsequent blog post (Part 2), we will walk through the following steps using Keras to train a neural network model for the question answering system.

  1. Understanding & Building Memory Network Model
  2. Training the Model
  3. Visualization
  4. Tests Results & Demo

If you liked the post, follow this blog to get updates about upcoming articles. Also, share it so that it can reach readers who might actually gain from it. Please feel free to discuss anything regarding the post. I would love to hear your feedback.

Happy deep learning 🙂