I took part in the HackerEarth challenge “Predict the Happiness”, and this tutorial walks through the solution I submitted, which reaches about 88% accuracy on the test data. I was ranked among the top 70 in the challenge.

In the past several years, Neural Networks (NNs) have become the state-of-the-art solution for many applications. Until now, I have mostly posted articles on applications based on traditional machine learning methods. So, it’s time to explore the much-hyped NN models for our readers.

Keras is an easy-to-use Python library and a good place for beginners to start. It is built on top of NN frameworks like Theano and TensorFlow. Here, I am assuming that you have some familiarity with NN terminology. In this basic tutorial, we will learn how to build a multi-layered, fully connected Neural Network for a text processing application (the HackerEarth challenge). Let’s get started.

Overview

Problem Statement

“TripAdvisor is the world’s largest travel site where you can compare and book hotels, flights, restaurants etc. The data set provided in this challenge consists of a sample of hotel reviews provided by the customers. Analyzing customers reviews will help them understand about the hotels listed on their website i.e. if they are treating customers well or if they are providing hospitality services as expected.

In this challenge, you have to predict if a customer is happy or not happy.” 

Data-set Description

You are given three files to download: train.csv, test.csv and sample_submission.csv. The training data has 38932 rows, while the test data has 29404 rows. You can download the .csv files from here.

Variable        Description
User_ID         unique ID of the customer
Description     description of the review posted
Browser_Used    browser used to post the review
Device_Used     device used to post the review
Is_Response     target variable

We are interested in only two columns: ‘Description’, which contains the hotel reviews given by different users, and ‘Is_Response’, which records whether the customer was ‘happy’ or ‘not_happy’. So, in essence, this is simply a 2-class sentiment analysis problem.

The steps we are going to follow in this blog-post are as follows:

  1. Prepare Data
  2. Feature Extraction
  3. Build the Model
  4. Train the Model
  5. Checking Performance

1. Prepare Data

The training data is provided in .csv format, which can be ingested easily with pandas as shown in the code below. The ‘Is_Response’ field carries strings, i.e. ‘happy’ and ‘not_happy’, which need to be encoded as integers, i.e. 0 and 1. Here, this is done with the LabelEncoder class of the scikit-learn library, which numbers the classes in sorted order (‘happy’ becomes 0 and ‘not_happy’ becomes 1). The function returns the list of hotel reviews and their respective happiness labels.

def data_prepare(training_file_path):
    dataset = pd.read_csv(training_file_path)
    reviews = []
    labels = []    
    
    # Encoding categorical data
    labelencoder_y = LabelEncoder()
    dataset['Is_Response'] = labelencoder_y.fit_transform(dataset['Is_Response'])
    cLen = len(dataset['Description'])
        
    for i in range(0,cLen):
        review = dataset['Description'][i]
        reviews.append(review) 
        label = dataset["Is_Response"][i]
        labels.append(label)    
    labels = np.asarray(labels)
    return reviews,labels
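
As a small illustration of the encoding step, the sketch below mimics in plain Python what LabelEncoder does for this column: it sorts the distinct class names and numbers them from 0. The helper name encode_labels is hypothetical and used only for illustration; the tutorial itself relies on scikit-learn.

```python
def encode_labels(values):
    """Map string classes to integers in sorted order.

    A plain-Python stand-in for LabelEncoder; the function name is
    hypothetical and not part of any library.
    """
    classes = sorted(set(values))
    class_to_index = {name: idx for idx, name in enumerate(classes)}
    return [class_to_index[v] for v in values], classes


encoded, classes = encode_labels(['not_happy', 'happy', 'happy', 'not_happy'])
print(classes)   # ['happy', 'not_happy']
print(encoded)   # [1, 0, 0, 1]
```

Knowing this order matters later, because index 0 of the network output then corresponds to ‘happy’ and index 1 to ‘not_happy’.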

2. Feature Extraction

In this task, words are features, hence the bag-of-words model can be used to create a feature vector. It can be done in following steps:

1. Make a dictionary: We create a dictionary containing word-index pairs for all the distinct words in the training reviews. We assume that the ordering of words is not important.

2. Convert the words of each text review into an array of word indices and store the index array of each review in a global array. Here is an example of a text review and its index array:

The room was kind of clean but had a VERY strong smell of dogs. Generally below average but ok for a overnight stay if you're not too fussy. Would consider staying again if the price was right. Breakfast was free and just about better than nothing.
[1, 14, 5, 436, 9, 52, 17, 25, 3, 22, 1735, 628, 9, 1727, 1109, 943, 492, 17, 322, 11, 3, 1010, 34, 42, 411, 24, 131, 3754, 40, 941, 181, 72, 42, 1, 126, 5, 117, 60, 5, 89, 2, 56, 64, 172, 100, 268]

3. Convert the global array of indices into a feature matrix. Each text review is represented by a sparse binary vector of the size of the vocabulary, with 1 in the entries for words that occur in the review and 0 in all other entries. We cap the number of features at 10,000, so the final feature matrix has shape (38932, 10000), one row per training review.
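
The three steps above can be sketched in plain Python on a toy example. This is only a minimal illustration under assumptions, not the Keras implementation: the tokenizer is a simplified regular expression, the helper names (tokenize, build_dictionary, to_binary_matrix) are hypothetical, and the vocabulary limit is set to 6 instead of 10,000 so the result is easy to read. As in Keras, more frequent words get smaller indices and index 0 is left unused.

```python
import re
from collections import Counter


def tokenize(text):
    # Simplified tokenizer (an assumption): lowercase the text and keep
    # runs of letters, digits and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def build_dictionary(texts):
    # Step 1: more frequent words get smaller indices, starting at 1.
    counts = Counter(word for text in texts for word in tokenize(text))
    return {word: idx
            for idx, (word, _) in enumerate(counts.most_common(), start=1)}


def to_binary_matrix(texts, dictionary, max_words):
    # Steps 2 and 3: look up each word's index and set that column to 1.
    matrix = []
    for text in texts:
        row = [0] * max_words
        for word in tokenize(text):
            idx = dictionary.get(word)
            # Words outside the most frequent ones are ignored.
            if idx is not None and idx < max_words:
                row[idx] = 1
        matrix.append(row)
    return matrix


toy_reviews = ["The room was clean",
               "The room was dirty and the staff was rude"]
toy_dictionary = build_dictionary(toy_reviews)
toy_matrix = to_binary_matrix(toy_reviews, toy_dictionary, max_words=6)
print(toy_dictionary)
print(toy_matrix)   # [[0, 1, 1, 1, 1, 0], [0, 1, 1, 1, 0, 1]]
```

Note that a word appearing several times in a review still contributes a single 1, which is why this is a binary bag-of-words representation rather than a count-based one.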

3. Build the Model

Finally, let’s talk about the neural network model which we will be building for this task. It is very easy to build an NN model using Keras. In this solution, I have used a fully connected neural network with 2 hidden layers. We need the Sequential module for initializing the NN, the Dense module to add the hidden layers, and the Dropout module to add dropout after each hidden layer. In the output layer, there are 2 nodes, one for the positive and another for the negative sentiment class. The Python code for building the model is shown below:

# Creating a Dense Neural Network Model 
model = Sequential() 
model.add(Dense(256, input_shape=(max_words,), activation='elu')) 
model.add(Dropout(0.5))
model.add(Dense(128, activation='elu'))
model.add(Dropout(0.5)) 
model.add(Dense(2, activation='softmax'))
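
For readers who have not met the ‘elu’ activation used in the hidden layers, here is a minimal plain-Python sketch of the function, assuming alpha = 1.0. It only illustrates the formula and is not the Keras code.

```python
import math


def elu(x, alpha=1.0):
    # Identity for positive inputs; for negative inputs a smooth
    # exponential curve that saturates at -alpha.
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


for value in (-2.0, -1.0, 0.0, 1.0):
    print(value, round(elu(value), 4))
```

Unlike ReLU, the output for negative inputs is not clipped to zero, so those units still pass a small gradient during training.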

The summary of the built model can be obtained by the model.summary() statement:

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_1 (Dense)              (None, 256)               2560256
_________________________________________________________________
dropout_1 (Dropout)          (None, 256)               0
_________________________________________________________________
dense_2 (Dense)              (None, 128)               32896
_________________________________________________________________
dropout_2 (Dropout)          (None, 128)               0
_________________________________________________________________
dense_3 (Dense)              (None, 2)                 258
=================================================================
Total params: 2,593,410
Trainable params: 2,593,410
Non-trainable params: 0
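
The parameter counts in the summary can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit, and Dropout layers have no parameters. The short sketch below redoes that arithmetic; the helper name dense_params is hypothetical.

```python
def dense_params(n_inputs, n_units):
    # One weight per (input, unit) pair plus one bias per unit.
    return n_inputs * n_units + n_units


max_words = 10000
layer_sizes = [max_words, 256, 128, 2]
per_layer = [dense_params(n_in, n_out)
             for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
print(per_layer)        # [2560256, 32896, 258]
print(sum(per_layer))   # 2593410
```

Almost all of the parameters sit in the first layer, because its input is the 10,000-dimensional bag-of-words vector.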

You must be wondering what a typical fully connected neural network with 2 hidden layers looks like. It is shown in the figure below:

[Figure: A typical 2-hidden layered fully connected NN]

4. Train the Model

Now, we have completed both feature extraction and model building. To train the model, we first need to compile it with the categorical_crossentropy loss function and the stochastic gradient descent (SGD) optimizer. Once compiled, the model can be trained, and training can make use of a GPU if one is available on the system. In Keras, it can be done as follows:

model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy']) 
model.fit(train_X, labels, batch_size=32, epochs=5, verbose=1, validation_split=0.1, shuffle=True)
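
The categorical_crossentropy loss expects the labels as one-hot vectors, which is why the full code further down converts the 0/1 labels with to_categorical. As a minimal plain-Python sketch of what the loss computes for a single review (the helper names are hypothetical, and the small eps guard is an assumption added to avoid log(0)):

```python
import math


def one_hot(label, num_classes=2):
    # 0 -> [1.0, 0.0], 1 -> [0.0, 1.0]
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # -sum(y_true * log(y_pred)); only the true class contributes.
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


target = one_hot(1)                                    # a 'not_happy' review
confident = categorical_crossentropy(target, [0.1, 0.9])
unsure = categorical_crossentropy(target, [0.6, 0.4])
print(target, round(confident, 4), round(unsure, 4))
```

The loss is small when the softmax output puts most of its probability on the correct class and grows as the model becomes less sure or wrong.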

While training the model, we pass the feature matrix, the labels, the batch size, the number of epochs, etc. as parameters. We also save the dictionary and the NN model in order to use them later while performing predictions on the test data. Once the NN model has been trained, we can check the performance of the model on the test.csv data.

Below is the entire code for training the NN model for the sentiment analysis application. You have to include the data preparation function we defined before in Step 1.

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
import json
import keras
import keras.preprocessing.text as kpt
from keras.preprocessing.text import Tokenizer
from keras.models import Sequential
from keras.layers import Dense, Dropout

def convert_text_to_index_array(text):
    return [dictionary[word] for word in kpt.text_to_word_sequence(text)]

train_file_path = "./train.csv"
[reviews,labels] = data_prepare(train_file_path)

# Create Dictionary of words and their indices
max_words = 10000
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(reviews)
dictionary = tokenizer.word_index

# save dictionary
with open('dictionary.json','w') as dictionary_file:
    json.dump(dictionary,dictionary_file)

# Replace words of each text review to indices
allWordIndices = []
for num,text in enumerate(reviews):
    wordIndices = convert_text_to_index_array(text)
    allWordIndices.append(wordIndices)

# Convert the index sequences into binary bag of words vector (one hot encoding) 
# sequences_to_matrix accepts a plain list of index lists; converting ragged lists with np.asarray fails on recent NumPy versions
train_X = tokenizer.sequences_to_matrix(allWordIndices, mode='binary')
labels = keras.utils.to_categorical(labels,num_classes=2)

# Creating Dense Neural Network Model
model = Sequential()
model.add(Dense(256, input_shape=(max_words,), activation='elu'))
model.add(Dropout(0.5))
model.add(Dense(128, activation='elu'))
model.add(Dropout(0.5))
model.add(Dense(2, activation='softmax'))

model.compile(loss='categorical_crossentropy',
  optimizer='sgd',
  metrics=['accuracy'])

# Training the Model
model.fit(train_X, labels,
  batch_size=32,
  epochs=5,
  verbose=1,
  validation_split=0.1,
  shuffle=True)

# Save model to disk
model_json = model.to_json()
with open('model.json', 'w') as json_file:
    json_file.write(model_json)
model.save_weights('model.h5')    

The IPython console will show the training accuracy and the validation accuracy for each epoch like this:

Train on 35038 samples, validate on 3894 samples
Epoch 1/5
35038/35038 [==============================] - 7s 207us/step - loss: 0.3985 - acc: 0.8233 - val_loss: 0.3168 - val_acc: 0.8675
Epoch 2/5
35038/35038 [==============================] - 6s 161us/step - loss: 0.3162 - acc: 0.8701 - val_loss: 0.3019 - val_acc: 0.8760
Epoch 3/5
35038/35038 [==============================] - 6s 163us/step - loss: 0.2982 - acc: 0.8780 - val_loss: 0.2934 - val_acc: 0.8770
Epoch 4/5
35038/35038 [==============================] - 6s 163us/step - loss: 0.2849 - acc: 0.8853 - val_loss: 0.2960 - val_acc: 0.8837
Epoch 5/5
35038/35038 [==============================] - 6s 163us/step - loss: 0.2776 - acc: 0.8877 - val_loss: 0.2971 - val_acc: 0.8765
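
The sample counts in the log follow from validation_split=0.1 applied to the 38932 training rows. The sketch below reproduces that arithmetic, assuming the training count is truncated to an integer (which is consistent with the numbers printed above), and also works out how many batches of 32 make up one epoch.

```python
import math

n_rows = 38932
validation_split = 0.1
batch_size = 32

# Assumption: the training share is truncated to a whole number of rows.
n_train = int(n_rows * (1.0 - validation_split))
n_val = n_rows - n_train
batches_per_epoch = math.ceil(n_train / batch_size)

print(n_train, n_val, batches_per_epoch)   # 35038 3894 1095
```

So each of the 5 epochs performs 1095 weight updates, and the validation accuracy is measured on the 3894 held-out reviews.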

5. Checking Performance

In the HackerEarth challenge, the test.csv file is provided and it consists of 29404 hotel reviews. We will now predict the sentiment for all of these hotel reviews. To find the accuracy (score) of the model, one needs to upload the prediction csv file on the portal here.

To check the performance of the “Predict the Happiness” system, the saved dictionary and the trained NN model are loaded. For each hotel review, we extract the bag-of-words features in the same way as in training; words that are not in the dictionary are skipped. The softmax scores of the output layer are calculated by feeding the input features forward through the trained NN model. A higher score means a higher probability of that sentiment. Finally, the prediction csv file is written with User_ID and the predicted response. The Python code for performing predictions on the test data is shown below.

import json
import numpy as np
import keras.preprocessing.text as kpt
from keras.preprocessing.text import Tokenizer
from keras.models import model_from_json
import pandas as pd

def convert_text_to_index_array(text):
    words = kpt.text_to_word_sequence(text)
    wordIndices = []
    for word in words:
        if word in dictionary:
            wordIndices.append(dictionary[word])
    return wordIndices

# Class names in the order assigned by LabelEncoder (0 = 'happy', 1 = 'not_happy')
labels = ['happy','not_happy']
# Load the dictionary
with open('dictionary.json', 'r') as dictionary_file:
    dictionary = json.load(dictionary_file)

# Load trained model
with open('model.json', 'r') as json_file:
    loaded_model_json = json_file.read()
model = model_from_json(loaded_model_json)
model.load_weights('model.h5')

testset = pd.read_csv("./test.csv")    
cLen = len(testset['Description'])
tokenizer = Tokenizer(num_words=10000)

# Predict happiness for each review in test.csv
y_pred = []   
for i in range(0,cLen):
    review = testset['Description'][i]
    testArr = convert_text_to_index_array(review)   
    # 'features' avoids shadowing the built-in input()
    features = tokenizer.sequences_to_matrix([testArr], mode='binary')
    pred = model.predict(features)
    # print(pred[0][np.argmax(pred)] * 100, labels[np.argmax(pred)])
    y_pred.append(labels[np.argmax(pred)])


# Write the results in submission csv file
raw_data = {'User_ID': testset['User_ID'], 
        'Is_Response': y_pred}
df = pd.DataFrame(raw_data, columns = ['User_ID', 'Is_Response'])
df.to_csv('submission_model1.csv', sep=',',index=False)
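
The last step of the loop above, turning the two output scores into a label, can be sketched in plain Python as follows. The raw scores are made-up numbers for illustration, and the softmax helper is a hand-written stand-in for what the output layer computes.

```python
import math


def softmax(scores):
    # Subtract the maximum first for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class_names = ['happy', 'not_happy']
raw_scores = [0.3, 1.7]          # made-up pre-softmax scores for one review
probabilities = softmax(raw_scores)

# Same idea as np.argmax: pick the index of the largest probability.
best = max(range(len(probabilities)), key=probabilities.__getitem__)
predicted = class_names[best]
print(predicted, round(probabilities[best], 3))
```

Because the class names are listed in the same order that was used for encoding the training labels, the index of the largest probability maps directly to the predicted response.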

Test Results

After evaluating the prediction file, the achieved accuracy score was 87.79. I submitted it a few more times and the score mostly lay between 87 – 88 due to the random initialization of weights when training the NN model. The top scorer on the leaderboard had a score of 90.

[Figure: Accuracy score (evaluation) – 87.79%]

Final Thoughts

I hope this Keras tutorial was easy to follow. If you are totally new to neural networks, it will probably be a little burdensome. I would suggest first going through the basics of Neural Networks using the abundant material available online. Before signing off, here are a few more thoughts on the post:

  1. I experimented with training the NN model with more layers and different activation functions, but the results were more or less the same.
  2. I ran the model on a PC with an Nvidia GeForce GTX 1080 GPU (8 GB) and 32 GB RAM. It took barely 1 – 1.5 minutes to train the NN model.
  3. For most natural language processing (NLP) applications, NN architectures like LSTM or CNN (1-dim) are very popular. I am looking forward to experimenting with these architectures on the same task.
  4. Exploring more ways of doing feature extraction may produce better results. Word2Vec is a popular feature in NLP that can be used as input to NN models.
  5. Accuracy can possibly be further improved by fusing or boosting various NN models (maybe fusing at the softmax score level).
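
To make the last point concrete, fusing at the softmax score level could look like the sketch below, where the scores of several models for the same review are averaged before picking the class. This is only an illustration of the idea, not something I evaluated in the challenge; the function name fuse_scores and the three score vectors are made up.

```python
def fuse_scores(score_lists, weights=None):
    """Weighted average of per-class softmax scores from several models."""
    n_models = len(score_lists)
    if weights is None:
        weights = [1.0 / n_models] * n_models   # plain average by default
    n_classes = len(score_lists[0])
    return [sum(w * scores[c] for w, scores in zip(weights, score_lists))
            for c in range(n_classes)]


class_names = ['happy', 'not_happy']
model_a = [0.55, 0.45]   # made-up softmax outputs for one review
model_b = [0.20, 0.80]
model_c = [0.40, 0.60]

fused = fuse_scores([model_a, model_b, model_c])
fused_label = class_names[fused.index(max(fused))]
print([round(s, 4) for s in fused], fused_label)
```

Here the first model alone would say ‘happy’, but the averaged scores favour ‘not_happy’, which is the kind of correction fusion is meant to provide.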

You can get the full Python implementation from the GitHub link here.

If you liked the post, follow this blog to get updates about upcoming articles. Also, share it so that it can reach out to the readers who can actually gain from this. Please feel free to discuss anything regarding the post. I would love to hear feedback from you.

Happy deep learning 🙂
