Sentiment analysis in text mining is the process of categorizing opinions expressed in a piece of text. A basic form of such analysis is to predict whether the opinion about something is positive or negative (polarity). There are other forms of sentiment analysis or opinion mining as well, such as predicting the rating scale of a product review, predicting polarity for individual aspects of a product, and detecting subjectivity and objectivity in sentences.
Our objective: to perform sentiment analysis on movie reviews. In other words, to classify the opinion expressed in a text review (document) in order to determine whether the reviewer's sentiment towards the movie is positive or negative.
The corpus used here is polarity dataset v2.0. This corpus contains 2000 labelled files of movie reviews, with 1000 files for each of the two sentiments. The sentences in the files are already processed and lowercased, so we do not need to do any preprocessing and can get started with building the application directly. As in any other classification problem, we have to train a classifier on features extracted for the sentiment classes. So there are basically two sub-tasks:
1. Feature extraction process
2. Training the classifier
In this blog-post, we will focus mainly on the most popular and widely adopted word weighting scheme in text mining problems, known as term frequency–inverse document frequency (tf-idf). Further, we will train a Support Vector Machine (SVM) classifier and a Multinomial Naive Bayes classifier on tf-idf weighted word frequency features. Finally, we will analyse the effect of using this scheme by checking the performance of the trained models on the test movie review files.
Tf-Idf weighted Word Count: Feature Extraction
Conventionally, histograms of words are the features for text classification problems. In general, we first build the vocabulary of the corpus and then generate a word count vector from each file, which is simply the frequency of each vocabulary word in that file. Most entries will be zero, since a single file will not contain all the words in the vocabulary.
For example, suppose we have 500 words in the vocabulary. Each word count vector will then contain the frequencies of the 500 vocabulary words in the text file. Suppose the text in a file is "Get the work done, work done". A fixed-length encoding will be generated such as [0,0,0,0,0,…….0,0,2,0,0,0,……,0,0,1,0,0,…0,0,1,0,0,……2,0,0,0,0,0]. Here, the word counts are placed at the 296th, 359th, 415th and 495th indices of the 500-length word count vector and the rest are zero. A document classification application using this conventional approach is presented in an earlier blog-post (a short code sketch of the approach also follows the list below). But there are limitations to this conventional way of extracting features, as listed below:
a) Frequently occurring words that appear in all files of the corpus irrespective of sentiment (in this case words like 'movie', 'acting', etc.) are treated the same as genuinely distinguishing words in the document.
b) Stop words will be present in the vocabulary if they are not removed during preprocessing.
c) Rare words or key words that could be highly distinguishing do not receive any special weight.
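For reference, here is a minimal sketch of the conventional word-count features described above, using scikit-learn's CountVectorizer on a made-up two-line corpus (this is only an illustration, not necessarily how the earlier post builds its dictionary):

from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = ["get the work done work done", "the movie was great"]   # illustrative documents
cv = CountVectorizer()
counts = cv.fit_transform(toy_corpus)   # sparse matrix of raw word counts
print(cv.get_feature_names_out())       # learned vocabulary (get_feature_names() in older scikit-learn)
print(counts.toarray())                 # each row is one document's count vector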
This is where the tf-idf weighting factor comes in and addresses these limitations. The first question that comes to mind is: what does tf-idf do to these conventional features?
Term frequency
It increases the weight of terms (words) that occur more frequently in the document. Quite intuitive, right? So it can be defined as tf(t,d) = F(t,d), where F(t,d) is the number of occurrences of term 't' in document 'd'. Practically, however, it seems unlikely that thirty occurrences of a term in a document truly carry thirty times the significance of a single occurrence. So, to make it more pragmatic, we scale tf logarithmically, so that as the frequency of a term grows exponentially, its weight only increases additively:
tf(t,d) = 1 + log(F(t,d))
This is the sublinear scaling that scikit-learn applies when sublinear_tf=True is set.
Inverse document frequency
It diminishes the weight of terms that occur in all or most documents of the corpus and, conversely, increases the weight of terms that occur in only a few documents. Basically, rare keywords get special treatment and stop words/non-distinguishing words are penalized. We define idf as:
idf(t,D) = log(N / N_t)
Here, 'N' is the total number of files in the corpus 'D' and 'N_t' is the number of files in which term 't' is present. By now we can agree that tf is an intra-document factor, which depends on the individual document, while idf is a per-corpus factor, which is constant for a given corpus. Finally, we calculate tf-idf as:
tf-idf(t,d,D) = tf(t,d) · idf(t,D)
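To make these formulas concrete, here is a small hand-computed sketch; the corpus size and term counts below are made up purely for illustration:

import math

N = 4        # assumed total number of documents in the corpus
N_t = 2      # assumed number of documents containing the term
F_td = 3     # assumed occurrences of the term in document d

tf = 1 + math.log(F_td)    # sublinear term frequency
idf = math.log(N / N_t)    # inverse document frequency
print(tf * idf)            # tf-idf weight, roughly 1.45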
Enough with the theory; let's get hands-on and write Python code for extracting such features using the scikit-learn machine learning library. It is an open-source Python ML library which comes bundled with the third-party Anaconda distribution, or can be installed separately by following this.
Library: sklearn.feature_extraction.text
sklearn.feature_extraction.text is a module of the scikit-learn library for extracting text features. We can extract tf-idf weighted features with the help of its TfidfVectorizer class. Let's recall the size of the polarity movie review data-set here. We will divide the corpus in a 90:10 split, so that 1800 review files are used as the training set and the remaining 200 review files as the test set. The code snippet below shows how to extract features from the text files.
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=5, max_df=0.8, sublinear_tf=True, use_idf=True, stop_words='english')
train_corpus_tf_idf = vectorizer.fit_transform(X_train)
test_corpus_tf_idf = vectorizer.transform(X_test)
Class Parameters
Let us understand this. We can initialize an object of the TfidfVectorizer class with the following parameters:
- min_df – ignore words that appear in fewer than min_df files of the corpus.
- max_df – ignore words that appear in more than max_df (a fraction of the total number of files in the corpus).
- sublinear_tf – apply logarithmic (sublinear) scaling to the term frequency, as discussed earlier.
- stop_words – remove the predefined stop words of the given language, if present.
- use_idf – enable inverse-document-frequency weighting (obviously).
- token_pattern – a regular expression describing which tokens are kept in the vocabulary. The default, r'(?u)\b\w\w+\b', keeps only words with 2 or more alphanumeric characters. If you want to keep only words with 2 or more alphabetic characters (no digits), set token_pattern to r'(?u)\b[^\W\d][^\W\d]+\b'.
- max_features – keep at most this many vocabulary words, ordered by term frequency across the corpus.
- vocabulary – if you have built your own vocabulary, pass it here; otherwise the vocabulary is generated from the training data.
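As a small illustration of these parameters (the two documents below are made up), we can inspect the vocabulary the vectorizer actually learns:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the acting was brilliant", "the plot was dull and the acting felt wooden"]
vec = TfidfVectorizer(min_df=1, max_df=1.0, sublinear_tf=True, stop_words='english',
                      token_pattern=r'(?u)\b[^\W\d][^\W\d]+\b')   # words of 2+ letters, no digits
X = vec.fit_transform(docs)
print(vec.get_feature_names_out())   # learned vocabulary after stop-word removal
print(X.shape)                       # (2, vocabulary size)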
fit_transform()
After initialization, the fit_transform() method is called on the vectorizer object with the parameter X_train. X_train is a list (iterable) of strings where each string represents the content of one document, so the length of this list equals the number of training documents. Here, fit_transform(X_train) does the following:
1. Tokenizes each string into words and preprocesses it, removing special characters, stop words, etc. Words that do not match the token_pattern regex are also removed.
2. Builds a vocabulary of words with their counts in the training set, taking max_features, min_df and max_df into consideration.
3. Finally, for each string (document), it creates the tf-idf word count vector: a vector over all vocabulary words whose frequencies are weighted by term frequency and inverse document frequency.
It returns a feature matrix containing a fixed-length tf-idf weighted word count vector for each document in the training set. This is also called a term-document matrix. With this, we are ready to train our SVM and Multinomial NB classifiers. These classifiers take two arguments: the term-document matrix and the polarity labels of the 1800 training files. This completes our training process. Similarly, vectorizer.transform(X_test) generates a term-document matrix for the 200 test files using the same vocabulary that was generated during training.
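As a quick sanity check (variable names follow the snippet above; the exact column count depends on the vocabulary that was learned):

print(train_corpus_tf_idf.shape)     # (1800, vocabulary size) -- a sparse matrix
print(test_corpus_tf_idf.shape)      # (200, vocabulary size) -- same columns as training
print(len(vectorizer.vocabulary_))   # number of words kept in the vocabulary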
Python implementation: Sentiment Analysis
Now we can check the performance of the trained models on the term-document matrix of the test set. Below is the full code for sentiment analysis on the movie review polarity data-set using tf-idf features.
import os
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold


def make_Corpus(root_dir):
    # Create a corpus with each document held as one string
    polarity_dirs = [os.path.join(root_dir, f) for f in os.listdir(root_dir)]
    corpus = []
    for polarity_dir in polarity_dirs:
        reviews = [os.path.join(polarity_dir, f) for f in os.listdir(polarity_dir)]
        for review in reviews:
            doc_string = ""
            with open(review) as rev:
                for line in rev:
                    doc_string = doc_string + line
            corpus.append(doc_string)
    return corpus


root_dir = 'txt_sentoken'
corpus = make_Corpus(root_dir)

# Stratified 10-fold cross validation with SVM and Multinomial NB
labels = np.zeros(2000)
labels[0:1000] = 0      # 1000 positive reviews
labels[1000:2000] = 1   # 1000 negative reviews

kf = StratifiedKFold(n_splits=10)
totalsvm = 0                     # accuracy count on 2000 files (SVM)
totalNB = 0                      # accuracy count on 2000 files (Multinomial NB)
totalMatSvm = np.zeros((2, 2))   # aggregated confusion matrix (SVM)
totalMatNB = np.zeros((2, 2))    # aggregated confusion matrix (Multinomial NB)

for train_index, test_index in kf.split(corpus, labels):
    X_train = [corpus[i] for i in train_index]
    X_test = [corpus[i] for i in test_index]
    y_train, y_test = labels[train_index], labels[test_index]

    vectorizer = TfidfVectorizer(min_df=5, max_df=0.8, sublinear_tf=True,
                                 use_idf=True, stop_words='english')
    train_corpus_tf_idf = vectorizer.fit_transform(X_train)
    test_corpus_tf_idf = vectorizer.transform(X_test)

    model1 = LinearSVC()
    model2 = MultinomialNB()
    model1.fit(train_corpus_tf_idf, y_train)
    model2.fit(train_corpus_tf_idf, y_train)
    result1 = model1.predict(test_corpus_tf_idf)
    result2 = model2.predict(test_corpus_tf_idf)

    totalMatSvm = totalMatSvm + confusion_matrix(y_test, result1)
    totalMatNB = totalMatNB + confusion_matrix(y_test, result2)
    totalsvm = totalsvm + sum(y_test == result1)
    totalNB = totalNB + sum(y_test == result2)

print(totalMatSvm, totalsvm / 2000.0, totalMatNB, totalNB / 2000.0)
There are two things here that may seem unexpected. Firstly, make_Corpus() reads each file of the data-set to convert the multiple lines of text in a document into one string per document. Secondly, kf = StratifiedKFold(n_splits=10) sets up K-fold cross validation so that the data-set is partitioned into 10 parts. In each fold, 1 part is used to test the model while the other 9 parts are used for training. The validation process is repeated K times, so every partition is used exactly once as the test set without being included in its own training set.
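The "stratified" part means each fold preserves the positive/negative class balance. A tiny made-up example shows this:

import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0] * 6 + [1] * 6)              # toy labels: 6 of each class
skf = StratifiedKFold(n_splits=3)
for train_idx, test_idx in skf.split(np.zeros(12), y):
    print(test_idx, y[test_idx])             # every test fold contains 2 samples of each class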
Checking Performance: Sentiment analysis
10-fold cross validation allows us to test all the files of the corpus. There are 1000 text reviews for each of the two sentiments (positive and negative). The results below show the number of text files whose sentiment was correctly predicted by each classifier. One can compare the two classifiers (Multinomial NB and SVM), and also compare against the results obtained when the same task was implemented with conventional word count features (Github link).
Features/Models         | Multinomial NB | SVM
Conventional word count | 1646 (82.3%)   | 1636 (81.8%)
Tf-idf weighted factor  | 1665 (83.25%)  | 1748 (87.4%)
SVM outperforms Multinomial NB when tf-idf weighted features are used. We can also see an improvement of about 5.6 percentage points in the correct-classification rate for SVM (81.8% to 87.4%) when the word count features are given tf-idf weights. The confusion matrices for both Multinomial NB and SVM using tf-idf features are shown below:
Multinomial NB | Negative | Positive
Negative       | 856      | 144
Positive       | 191      | 809

SVM (Linear)   | Negative | Positive
Negative       | 874      | 126
Positive       | 126      | 874
Concluding Remarks
I hope I have done justice to tf-idf features in this blog. I have tried to explain the usefulness of these features through a sentiment analysis application. Beginners are encouraged to implement it and to match their outputs with the results shown here. Also, try to analyse the difference between conventional word count features and tf-idf weighted features. One can read my previous post to learn how to implement conventional features for a classification problem. In order to get more insight into the TfidfVectorizer class of sklearn, we should play with its various parameters like token_pattern, vocabulary, stop_words, max_features, etc.
The machine learning models (Multinomial NB and SVM) have been used here without presenting their mathematical background, as doing so might overwhelm readers with too much information in a single blog-post. One may apply other variants of these classifiers in order to compare them and analyse the underlying differences. Here, the purpose was to present an understanding of term frequency and inverse document frequency and their importance in text mining applications.
In addition, the full Python implementation of sentiment analysis on the polarity movie review data-set using both types of features can be found on the Github link here.
If you liked the post, follow this blog to get updates about the upcoming articles. Also, share this article so that it can reach out to the readers who can actually gain from this. Please feel free to discuss anything regarding the post. I would love to hear feedback from you.
Happy machine learning 🙂
good work sir,
Thank you so much.
yeahh….good to know it helped…follow the blog if you wish to get updates of new posts.
Hi,
I'm sorry, I could not understand the testing process. Here you took around 200 documents for testing and computed their tf-idf, but if I want to classify a single document, what would the approach be? You can't compute idf for a single document, can you? Can you please clarify the approach for testing the model and what is actually happening under the hood? Thank you.
hi,
Testing a single file can easily be done. Look at the variables "test_corpus_tf_idf" and "X_test": they are a set of vectors and a set of strings respectively. If you want to test only one string, you need to transform that one string into a tf-idf vector using the vectorizer.transform() function; the idf values learned during training are reused, so nothing needs to be computed from the single test document itself.
Here, K-fold cross validation was applied in order to test all the files. You can instead have a separate train corpus of, say, 900 files from each category and a test corpus of the rest of the files.
Just execute make_Corpus, fit the tf-idf vectors on the train corpus, and finally predict on the transformed test files.
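A minimal sketch of classifying one new review with the already-fitted vectorizer and model (variable names follow the post's code; the review text is made up):

new_review = "a wonderfully acted film with a gripping plot"   # hypothetical input
new_vec = vectorizer.transform([new_review])    # reuses the idf learned during training
print(model1.predict(new_vec))                  # model1 is the trained LinearSVC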
Here you have assumed the class label for the 0–1000 data is 0 and for 1000–2000 it is 1, but if we already know the class labels then this code, labels[0:1000]=0; labels[1000:2000]=1;, is irrelevant, isn't it?
Please correct me if I am getting this wrong.
Hi,
The polarity dataset contains 1000 positive reviews (labels[0:1000] = 0) and 1000 negative reviews (labels[1000:2000] = 1), so it is not an assumption.
Yes, we already know the ground truth, and that helps us evaluate the accuracy of our approach/solution.
y_train and y_test are the labels used in every iteration. y_train (the ground truth for the 1800 training files) is used while training the SVM or Naive Bayes model. y_test (the ground truth for the 200 test files) is only used for evaluating the confusion matrix: the predicted labels and the y_test labels are matched to find out how many files the models classified correctly.
So, for evaluation purposes, we need to know the ground truth.
Hope it helps. Ask for any further clarification needed.
Thanks.
Traceback (most recent call last):
File “movie-polarity.py”, line 58, in
dictionary = make_Dictionary(root_dir)
File “movie-polarity.py”, line 11, in make_Dictionary
emails_dirs = [os.path.join(root_dir,f) for f in os.listdir(root_dir)]
OSError: [Errno 2] No such file or directory: ‘txt_sentoken’
You should first download the polarity data-set and check out its directory structure.
Keep movie-polarity.py in the same folder, then execute the code.
I would suggest reading the article first; have patience and then check the code. The path of the data-set needs to be set correctly. I hope you have done some coding in Python.
Thanks for sharing this article. It has a nice explanation.
Thanks, and follow the blog to get updates about new blog-posts.
hi
You have really done good work. What I have understood is that here, for every 200 reviews, a confusion matrix is generated and finally they are combined into a single matrix. Am I correct?
My question: is there any way to increase the accuracy even further?
Yes, you are correct; that is what happens with the 10-fold cross validation technique.
There are a lot of things that can be tried. One thing that can increase the accuracy is improving the dictionary; as of now, the dictionary is created by the sklearn library function.
Also, you may find other models in the literature which have given better results than the 87% achieved here.
Using the same model, is it possible to improve the dictionary?
If yes, can you suggest a few ways?
In this algorithm, as I understand it, you have used term frequency and related weights for building the dictionary. So can you suggest any ideas for improving the dictionary?
Building a dictionary is totally different from term frequency and inverse document frequency weighting. The dictionary is simply a list of tuples which keeps a record of the words and their counts.
I would suggest reading another article on this blog named "Email Spam Filtering: A python implementation".
I am doing a project on sentiment analysis. So what can I do to improve the accuracy of the SVM algorithm, and how can I improve the dictionary?
Using the same model, is it possible to improve the dictionary?
If yes, can you suggest a few methods?
Can we employ a kernel for this?
Hi Bhavitha,
It is possible to manually improve the dictionary. In this implementation that does not happen, as TfidfVectorizer() takes care of building the dictionary. You can create a dictionary on your own (as done in another post, "Email spam filtering") and pass it to TfidfVectorizer as the vocabulary parameter.
If you are talking about SVM kernels here, I have tried a few SVM kernels and got the best accuracy with the linear kernel only.
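A small sketch of that suggestion; the word list below is made up and would in practice come from your own dictionary-building step:

from sklearn.feature_extraction.text import TfidfVectorizer

my_vocab = ['excellent', 'boring', 'brilliant', 'terrible', 'plot', 'acting']   # hypothetical hand-built vocabulary
vectorizer = TfidfVectorizer(sublinear_tf=True, use_idf=True, vocabulary=my_vocab)
train_corpus_tf_idf = vectorizer.fit_transform(X_train)   # features only for the listed words
test_corpus_tf_idf = vectorizer.transform(X_test)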
What would be the approach to go back and figure out which words/phrases were significant for better classification?
How do I plot the SVM plane of the above classification process? [using tf-idf and SVM]
I want to plot the graph and get the generated hyperplane that classifies the above points (tf-idf) into two classes. Please help.
Hi haruo,
You can plot the SVM plane, but it can be plotted in 3D at best.
You need to reduce the dimensionality of the feature vectors to 3 using PCA and then plot the SVM plane.
You can follow this:
https://stackoverflow.com/questions/36232334/plotting-3d-decision-boundary-from-linear-svm
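A rough sketch of the first part of that idea (TruncatedSVD is used here instead of plain PCA because the tf-idf matrix is sparse; variable names follow the post's code):

import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import LinearSVC

svd = TruncatedSVD(n_components=3)                  # project sparse tf-idf vectors to 3 dimensions
train_3d = svd.fit_transform(train_corpus_tf_idf)
svm_3d = LinearSVC().fit(train_3d, y_train)         # re-train an SVM in the reduced space

plt.scatter(train_3d[:, 0], train_3d[:, 1], c=y_train, cmap='coolwarm', s=8)
plt.xlabel('SVD component 1')
plt.ylabel('SVD component 2')
plt.show()                                          # the 3D plane itself can be drawn as in the linked answer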
Where can I find the test data for this code?
The train and test sets have been created from the same corpus.
I have performed 10-fold cross validation.
Hi, I have a list of sentences in two separate text files, T1 and T2. I have trained and tested on T1 with your code, and it is working as expected. Now I want to predict the sentences in T2 using the trained model. Please guide me. Thanks.
Pickle the trained model; it will be saved on your disk.
Then you can write code to preprocess the sentences in T2 in the same way as the sentences in T1 and predict them with the trained model you saved.
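A rough sketch of that workflow; the file names are made up, and note that the fitted vectorizer should be saved along with the model so T2 can be transformed with the same vocabulary and idf values:

import pickle

# after training on T1: save both the fitted vectorizer and the model
with open('sentiment_model.pkl', 'wb') as f:
    pickle.dump((vectorizer, model1), f)

# later, in another script: load them back and classify the sentences from T2
with open('sentiment_model.pkl', 'rb') as f:
    vectorizer, model = pickle.load(f)
t2_sentences = open('T2.txt').read().splitlines()     # assuming one sentence per line
print(model.predict(vectorizer.transform(t2_sentences)))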
Thanks, I will try this one.
Could you please guide me on how to implement the same problem (text classification into 2 or 3 classes using tf-idf) using ensemble voting classifiers? I tried searching on the internet but couldn't find a proper example.
Hope this helps
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html
I already tried that one. When I tried to replace
eclf1 = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)], voting='hard')
eclf1 = eclf1.fit(X, y)
(X, y) with (train_corpus_tf_idf, y_train), it shows an error.
It should not. You may want to dig deeper into that error and correct it (one common cause is including an estimator, such as GaussianNB, that cannot handle the sparse tf-idf matrix). You can use several models and use ensemble voting to predict the output class.
Alternatively,
you can simply train, say, 5 models separately and then, for any test case, write a simple piece of logic that counts the votes of the output classes and returns the class which gets the maximum votes.
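A sketch of the VotingClassifier route on the tf-idf features from the post (the estimator choices here are illustrative; all three accept sparse input):

from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

voter = VotingClassifier(estimators=[('lr', LogisticRegression(max_iter=1000)),
                                     ('nb', MultinomialNB()),
                                     ('svm', LinearSVC())],
                         voting='hard')                 # hard voting: majority class wins
voter.fit(train_corpus_tf_idf, y_train)
ensemble_pred = voter.predict(test_corpus_tf_idf)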
Hi, could you guide me in transforming the text with the BM25 term weighting scheme instead of tf-idf, so that I can feed it to the same sklearn classifier in Python? Thanks.
BM25 has not shown much improvement in the machine learning context (classification or clustering), so it has not been implemented in sklearn.
BM25 is more often used in information retrieval. It is implemented in the "Whoosh" PyPI package; you can probably explore that.