I participated in a HackerEarth challenge, “Predict the Happiness”, and this tutorial walks through the solution I submitted, which achieves 88% accuracy on the test data and placed me among the top 70 in the challenge.
In the past several years, Neural Networks (NNs) have become the state-of-the-art solution for many applications. Until now, I have posted articles mostly based on traditional machine learning methods, so it is time to explore the much-hyped NN models with our readers.
Keras is an easy-to-use Python library and the right place for beginners to start. It is built on top of NN frameworks like Theano and TensorFlow. Here, I am assuming that you have some familiarity with NN terminology. In this basic tutorial, we will learn how to build a multi-layered, fully connected neural network for a text processing application (the HackerEarth challenge). Let’s get started.
Overview
Problem Statement
“TripAdvisor is the world’s largest travel site where you can compare and book hotels, flights, restaurants etc. The data set provided in this challenge consists of a sample of hotel reviews provided by the customers. Analyzing customer reviews will help them understand the hotels listed on their website, i.e. whether they are treating customers well and providing hospitality services as expected.
In this challenge, you have to predict if a customer is happy or not happy.”
Data-set Description
You are given three files to download: train.csv, test.csv and sample_submission.csv. The training data has 38932 rows, while the test data has 29404 rows. You can download the `.csv` files from here.
| Variable | Description |
|---|---|
| User_ID | unique ID of the customer |
| Description | description of the review posted |
| Browser_Used | browser used to post the review |
| Device_Used | device used to post the review |
| Is_Response | target variable |
We are interested in only 2 columns: ‘Description’, which contains the hotel reviews given by different users, and ‘Is_Response’, which records whether the user was ‘happy’ or ‘not_happy’. So, in essence, this is simply a 2-class sentiment analysis problem.
The steps we are going to follow in this blog post are:
- Prepare Data
- Feature Extraction
- Build the Model
- Train the Model
- Checking Performance
1. Prepare Data
Training data is provided in `.csv` format, which can be ingested easily with `pandas` as shown in the code below. The ‘Is_Response’ field carries strings, i.e. ‘happy’ and ‘not_happy’, which need to be encoded as integers, i.e. 0 and 1. Here, this is done with the `LabelEncoder` class of the scikit-learn library. The function returns a list of hotel reviews and their respective happiness labels.
```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def data_prepare(training_file_path):
    dataset = pd.read_csv(training_file_path)
    reviews = []
    labels = []
    # Encoding categorical data: 'happy'/'not_happy' -> 0/1
    labelencoder_y = LabelEncoder()
    dataset['Is_Response'] = labelencoder_y.fit_transform(dataset['Is_Response'])
    cLen = len(dataset['Description'])
    for i in range(0, cLen):
        review = dataset['Description'][i]
        reviews.append(review)
        label = dataset['Is_Response'][i]
        labels.append(label)
    labels = np.asarray(labels)
    return reviews, labels
```
2. Feature Extraction
In this task, words are the features, so the bag-of-words model can be used to create the feature vectors. This is done in the following steps:
1. Make a dictionary: we create a dictionary mapping each distinct word in the training text reviews to an index. We assume that the ordering of words is not important.
2. Convert the words of each text review into an array of word indices, and store each review’s index array in a global array. An example text review and its index array:

“The room was kind of clean but had a VERY strong smell of dogs. Generally below average but ok for a overnight stay if you're not too fussy. Would consider staying again if the price was right. Breakfast was free and just about better than nothing.”

`[1, 14, 5, 436, 9, 52, 17, 25, 3, 22, 1735, 628, 9, 1727, 1109, 943, 492, 17, 322, 11, 3, 1010, 34, 42, 411, 24, 131, 3754, 40, 941, 181, 72, 42, 1, 126, 5, 117, 60, 5, 89, 2, 56, 64, 172, 100, 268]`
3. Convert the global array of indices into a feature matrix. Each text review is represented by a sparse vector of the size of the vocabulary, with 1 in the entries corresponding to words present in the review and 0 everywhere else. We cap the number of features at 10,000, so the final feature matrix has shape (38932, 10000). A condensed sketch of these three steps is given below.
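To make these steps concrete, here is a minimal sketch using the Keras `Tokenizer`, assuming the `reviews` list from Step 1. It condenses what the full training script in Step 4 does, with `texts_to_sequences` standing in for that script’s manual word-to-index conversion:

```python
from keras.preprocessing.text import Tokenizer

max_words = 10000
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(reviews)                    # step 1: build the word -> index dictionary
sequences = tokenizer.texts_to_sequences(reviews)  # step 2: each review -> array of word indices
train_X = tokenizer.sequences_to_matrix(sequences, mode='binary')  # step 3: binary bag-of-words matrix
```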
3. Build the Model
Finally, let’s talk about the neural network model we will build for this task. It is very easy to build a NN model using Keras. In this solution, I have used a fully connected neural network with 2 hidden layers. We need the `Sequential` module for initializing the NN and the `Dense` module to add the hidden layers. The output layer has 2 nodes, one for the positive and one for the negative sentiment class. The Python code for building the model is shown below:
```python
# Creating a Dense Neural Network Model
model = Sequential()
model.add(Dense(256, input_shape=(max_words,), activation='elu'))
model.add(Dropout(0.5))
model.add(Dense(128, activation='elu'))
model.add(Dropout(0.5))
model.add(Dense(2, activation='softmax'))
```
The summary of the built model can be obtained with the `model.summary()` statement:
```
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_1 (Dense)              (None, 256)               2560256
_________________________________________________________________
dropout_1 (Dropout)          (None, 256)               0
_________________________________________________________________
dense_2 (Dense)              (None, 128)               32896
_________________________________________________________________
dropout_2 (Dropout)          (None, 128)               0
_________________________________________________________________
dense_3 (Dense)              (None, 2)                 258
=================================================================
Total params: 2,593,410
Trainable params: 2,593,410
Non-trainable params: 0
_________________________________________________________________
```
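As a sanity check on these numbers: the first `Dense` layer has (10,000 inputs + 1 bias) × 256 units = 2,560,256 parameters, the second has (256 + 1) × 128 = 32,896, and the output layer has (128 + 1) × 2 = 258, which together give the 2,593,410 total (the Dropout layers add no parameters).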
You may be wondering what a typical fully connected neural network with 2 hidden layers looks like. It is shown in the figure below:
4. Train the Model
Now that both feature extraction and model building are complete, we can train the model. First, the model must be compiled with the `categorical_crossentropy` loss function and the stochastic gradient descent (SGD) optimizer. Once compiled, the model can be trained, utilizing the system’s GPU if available. In Keras, it is done as:
```python
model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
model.fit(train_X, labels, batch_size=32, epochs=5, verbose=1,
          validation_split=0.1, shuffle=True)
```
While training the model, we pass the feature matrix, the labels, the input batch size, the number of epochs, etc. as parameters. We also save the dictionary and the NN model so we can use them later when performing predictions on the test data. Once the NN model has been trained, we can check its performance on the test `.csv` data.
Below is the entire code for training the NN model for the sentiment analysis application. You have to include the data preparation function defined in Step 1.
```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
import json
import keras
import keras.preprocessing.text as kpt
from keras.preprocessing.text import Tokenizer
from keras.models import Sequential
from keras.layers import Dense, Dropout

def convert_text_to_index_array(text):
    return [dictionary[word] for word in kpt.text_to_word_sequence(text)]

train_file_path = "./train.csv"
[reviews, labels] = data_prepare(train_file_path)

# Create dictionary of words and their indices
max_words = 10000
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(reviews)
dictionary = tokenizer.word_index

# Save dictionary
with open('dictionary.json', 'w') as dictionary_file:
    json.dump(dictionary, dictionary_file)

# Replace words of each text review with indices
allWordIndices = []
for num, text in enumerate(reviews):
    wordIndices = convert_text_to_index_array(text)
    allWordIndices.append(wordIndices)

# Convert the index sequences into binary bag-of-words vectors (one-hot encoding)
allWordIndices = np.asarray(allWordIndices)
train_X = tokenizer.sequences_to_matrix(allWordIndices, mode='binary')
labels = keras.utils.to_categorical(labels, num_classes=2)

# Creating Dense Neural Network Model
model = Sequential()
model.add(Dense(256, input_shape=(max_words,), activation='elu'))
model.add(Dropout(0.5))
model.add(Dense(128, activation='elu'))
model.add(Dropout(0.5))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])

# Training the Model
model.fit(train_X, labels, batch_size=32, epochs=5, verbose=1,
          validation_split=0.1, shuffle=True)

# Save model to disk
model_json = model.to_json()
with open('model.json', 'w') as json_file:
    json_file.write(model_json)
model.save_weights('model.h5')
```
The IPython console will show the training and validation accuracy for each epoch, like this:
```
Train on 35038 samples, validate on 3894 samples
Epoch 1/5
35038/35038 [==============================] - 7s 207us/step - loss: 0.3985 - acc: 0.8233 - val_loss: 0.3168 - val_acc: 0.8675
Epoch 2/5
35038/35038 [==============================] - 6s 161us/step - loss: 0.3162 - acc: 0.8701 - val_loss: 0.3019 - val_acc: 0.8760
Epoch 3/5
35038/35038 [==============================] - 6s 163us/step - loss: 0.2982 - acc: 0.8780 - val_loss: 0.2934 - val_acc: 0.8770
Epoch 4/5
35038/35038 [==============================] - 6s 163us/step - loss: 0.2849 - acc: 0.8853 - val_loss: 0.2960 - val_acc: 0.8837
Epoch 5/5
35038/35038 [==============================] - 6s 163us/step - loss: 0.2776 - acc: 0.8877 - val_loss: 0.2971 - val_acc: 0.8765
```
5. Checking Performance
In the HackerEarth challenge, a `test.csv` file containing 29404 hotel reviews is provided. We will now predict the sentiment for all of these reviews. To find the accuracy (score) of the model, one needs to upload the prediction `csv` file on the portal here.
To check the performance of the “Predict the Happiness” system, the trained dictionary and the NN model are loaded. For each hotel review, we extract bag-of-words features in the same way as in training. The `softmax` scores of the output layer are computed by feeding the input features forward through the trained NN model; a higher score indicates a higher probability of that sentiment. Finally, the prediction `csv` file is written with the User_ID and the predicted response. The Python code for performing predictions on the test data is shown below.
```python
import json
import numpy as np
import keras.preprocessing.text as kpt
from keras.preprocessing.text import Tokenizer
from keras.models import model_from_json
import pandas as pd

def convert_text_to_index_array(text):
    words = kpt.text_to_word_sequence(text)
    wordIndices = []
    for word in words:
        if word in dictionary:
            wordIndices.append(dictionary[word])
    return wordIndices

# Load the dictionary
labels = ['happy', 'not_happy']
with open('dictionary.json', 'r') as dictionary_file:
    dictionary = json.load(dictionary_file)

# Load trained model
json_file = open('model.json', 'r')
loaded_model_json = json_file.read()
json_file.close()
model = model_from_json(loaded_model_json)
model.load_weights('model.h5')

testset = pd.read_csv("./test.csv")
cLen = len(testset['Description'])
tokenizer = Tokenizer(num_words=10000)

# Predict happiness for each review in test.csv
y_pred = []
for i in range(0, cLen):
    review = testset['Description'][i]
    testArr = convert_text_to_index_array(review)
    input = tokenizer.sequences_to_matrix([testArr], mode='binary')
    pred = model.predict(input)
    # print(pred[0][np.argmax(pred)] * 100, labels[np.argmax(pred)])
    y_pred.append(labels[np.argmax(pred)])

# Write the results to the submission csv file
raw_data = {'User_ID': testset['User_ID'], 'Is_Response': y_pred}
df = pd.DataFrame(raw_data, columns=['User_ID', 'Is_Response'])
df.to_csv('submission_model1.csv', sep=',', index=False)
```
Test Results
After evaluating the prediction file, the achieved accuracy score was 87.79. I submitted a few more times, and the score mostly lay between 87 and 88 due to the random initialization of weights when training the NN model. The top scorer on the leaderboard had a score of 90.
Final Thoughts
I hope it was easy to follow this Keras tutorial. If you are totally new to neural networks, it will probably be a little burdensome to follow; I would suggest first going through the basics of neural networks from the abundant material available online. Before signing off, a few more thoughts on the post:
- I experimented with training the NN model with more layers and different activation functions, but the results were more or less the same.
- I ran the model on a PC with an Nvidia GeForce GTX 1080 GPU (8 GB) and 32 GB RAM. It took hardly 1 to 1.5 minutes to train the NN model.
- For most natural language processing (NLP) applications, NN architectures like LSTMs or 1-D CNNs are very popular. I am looking forward to experimenting with these architectures on the same task; a rough sketch of the LSTM variant is given after this list.
- Exploring more feature extraction approaches may produce better results. `Word2Vec` is a popular word representation in NLP that can be used as input to NN models.
- Accuracy can possibly be improved further by fusing or boosting various NN models (perhaps by fusing at the softmax score level).
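For the curious, below is a minimal sketch of what the LSTM variant could look like. This is not part of the submitted solution: the sequence length, embedding size, LSTM width and optimizer are illustrative assumptions, and the sketch reuses `tokenizer`, `reviews`, `max_words` and the one-hot `labels` from the training script above.

```python
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout
from keras.preprocessing.sequence import pad_sequences

maxlen = 200  # assumed cap on review length, in tokens

# texts_to_sequences respects the tokenizer's num_words cap, so every
# index fits inside the Embedding layer's vocabulary of max_words
train_seq = pad_sequences(tokenizer.texts_to_sequences(reviews), maxlen=maxlen)

lstm_model = Sequential()
lstm_model.add(Embedding(max_words, 128, input_length=maxlen))  # learn 128-dim word vectors
lstm_model.add(LSTM(64))                       # summarize each review as a 64-dim vector
lstm_model.add(Dropout(0.5))
lstm_model.add(Dense(2, activation='softmax'))
lstm_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
lstm_model.fit(train_seq, labels, batch_size=32, epochs=5, validation_split=0.1)
```

Fusing at the softmax level could then be as simple as averaging the `predict()` outputs of this model and the dense model before taking the argmax.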
You can get the full Python implementation from the GitHub link here.
If you liked the post, follow this blog to get updates about upcoming articles. Also, share it so that it can reach readers who can actually benefit from it. Please feel free to discuss anything regarding the post; I would love to hear your feedback.
Happy deep learning 🙂
nice article.
Hi, nice article. But on Python 3 I am stuck in the convert_text_to_index_array method; actually, I do not understand the idea behind it.

```python
def convert_text_to_index_array(text):
    return [ word for word in kpt.text_to_word_sequence(text) ]
```

Can you please help me out?
Finally done. But before building the bag-of-words model, can we do lemmatization to improve results?
You can try it out, but with deep learning models it’s not required.
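If you want to try it anyway, a quick sketch using NLTK’s WordNetLemmatizer (a hypothetical preprocessing step, not part of the posted solution) would be:

```python
# Hypothetical preprocessing: lemmatize reviews before fitting the tokenizer
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # one-time download of the WordNet data
lemmatizer = WordNetLemmatizer()

def lemmatize_review(text):
    # Lemmatize each lowercased, whitespace-separated token
    return ' '.join(lemmatizer.lemmatize(w) for w in text.lower().split())

reviews = [lemmatize_review(r) for r in reviews]
```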
It was a nice article. There is a typo in “data_prepare” function. It should return “reviews, labels” instead of “label” (note the missing ‘s’) as mentioned in your post.
Thanks. Will edit.