Sentiment analysis in text mining is the process of categorizing opinions expressed in a piece of text. A basic form of such analysis is to predict whether the opinion about something is positive or negative (polarity). There are other forms of sentiment analysis or opinion mining as well, such as predicting the rating scale of a product review, predicting polarity for individual aspects of a product, and detecting subjectivity and objectivity in sentences.
Our objective: to perform sentiment analysis on movie reviews. In other words, to classify the opinion expressed in a text review (document) in order to determine whether the reviewer's sentiment towards the movie is positive or negative.
The corpus used here is polarity dataset v2.0. This corpus contains 2000 labelled files of movie reviews, with 1000 files for each of the two sentiments. The sentences in the files are already processed and lowercased, so we do not need to do any preprocessing and can get started with building the application directly. As in any other classification problem, we have to train a classifier on features extracted for the sentiment classes. So there are basically two sub-tasks:
1. Feature extraction process
2. Training the classifier
In this blog-post, we will focus mainly on the most popular and widely adopted word weighting scheme in text mining problems, known as term frequency–inverse document frequency (tf-idf). Further, we will train a Support Vector Machine (SVM) classifier and a Multinomial Naive Bayes classifier on tf-idf weighted word frequency features. Finally, we will analyse the effect of using this scheme by checking the performance of the trained models on the test movie review files.
Tf-Idf weighted Word Count: Feature Extraction
Conventionally, histograms of words are the features for text classification problems. In general, we first build the vocabulary of the corpus and then generate a word count vector from each file, which is simply the frequency of each vocabulary word in that file. Most entries will be zero, since a single file will not contain all the words in the vocabulary.
For example, suppose we have 500 words in the vocabulary. Each word count vector will then contain the frequencies of the 500 vocabulary words in the text file. Suppose the text in a file is "Get the work done, work done". A fixed-length encoding will be generated such as [0,0,0,0,0,…….0,0,2,0,0,0,……,0,0,1,0,0,…0,0,1,0,0,……2,0,0,0,0,0]. Here, the word counts are placed at the 296th, 359th, 415th and 495th indices of the 500-length word count vector and the rest are zero. A document classification application using this conventional approach is presented in an earlier blog-post (a short code sketch of the approach also follows the list below). But there are limitations to this conventional way of extracting features, as listed below:
a) Frequently occurring words that appear in all files of the corpus irrespective of sentiment (in this case words like 'movie', 'acting', etc.) are treated the same as genuinely distinguishing words in the document.
b) Stop words will be present in the vocabulary if they are not removed during preprocessing.
c) Rare words or key words that could be highly distinguishing do not receive any special weight.
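For reference, here is a minimal sketch of the conventional word-count features described above, using scikit-learn's CountVectorizer on a made-up two-line corpus (this is only an illustration, not necessarily how the earlier post builds its dictionary):

from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = ["get the work done work done", "the movie was great"]   # illustrative documents
cv = CountVectorizer()
counts = cv.fit_transform(toy_corpus)   # sparse matrix of raw word counts
print(cv.get_feature_names_out())       # learned vocabulary (get_feature_names() in older scikit-learn)
print(counts.toarray())                 # each row is one document's count vector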
This is where the tf-idf weighting factor comes in and addresses these limitations. The first question that comes to mind is: what does tf-idf do to these conventional features?
Term frequency
It increases the weight of terms (words) that occur more frequently in the document. Quite intuitive, right? So it can be defined as tf(t,d) = F(t,d), where F(t,d) is the number of occurrences of term 't' in document 'd'. Practically, however, it seems unlikely that thirty occurrences of a term in a document truly carry thirty times the significance of a single occurrence. So, to make it more pragmatic, we scale tf logarithmically, so that as the frequency of a term grows exponentially, its weight only increases additively:
tf(t,d) = 1 + log(F(t,d))
This is the sublinear scaling that scikit-learn applies when sublinear_tf=True is set.
Inverse document frequency
It diminishes the weight of terms that occur in all or most documents of the corpus and, conversely, increases the weight of terms that occur in only a few documents. Basically, rare keywords get special treatment and stop words/non-distinguishing words are penalized. We define idf as:
idf(t,D) = log(N / N_t)
Here, 'N' is the total number of files in the corpus 'D' and 'N_t' is the number of files in which term 't' is present. By now we can agree that tf is an intra-document factor, which depends on the individual document, while idf is a per-corpus factor, which is constant for a given corpus. Finally, we calculate tf-idf as:
tf-idf(t,d,D) = tf(t,d) · idf(t,D)
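To make these formulas concrete, here is a small hand-computed sketch; the corpus size and term counts below are made up purely for illustration:

import math

N = 4        # assumed total number of documents in the corpus
N_t = 2      # assumed number of documents containing the term
F_td = 3     # assumed occurrences of the term in document d

tf = 1 + math.log(F_td)    # sublinear term frequency
idf = math.log(N / N_t)    # inverse document frequency
print(tf * idf)            # tf-idf weight, roughly 1.45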
Enough with the theory; let's get hands-on and write Python code for extracting such features using the scikit-learn machine learning library. It is an open-source Python ML library which comes bundled with the third-party Anaconda distribution, or can be installed separately by following this.
Library: sklearn.feature_extraction.text
sklearn.feature_extraction.text is a module of the scikit-learn library for extracting text features. We can extract tf-idf weighted features with the help of its TfidfVectorizer class. Let's recall the size of the polarity movie review data-set here. We will divide the corpus in a 90:10 split, so that 1800 review files are used as the training set and the remaining 200 review files as the test set. The code snippet below shows how to extract features from the text files.
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=5, max_df=0.8, sublinear_tf=True, use_idf=True, stop_words='english')
train_corpus_tf_idf = vectorizer.fit_transform(X_train)
test_corpus_tf_idf = vectorizer.transform(X_test)
Class Parameters
Let us understand this. We can initialize an object of the TfidfVectorizer class with the following parameters:
- min_df – ignore words that appear in fewer than min_df files of the corpus.
- max_df – ignore words that appear in more than max_df (a fraction of the total number of files in the corpus).
- sublinear_tf – apply logarithmic (sublinear) scaling to the term frequency, as discussed earlier.
- stop_words – remove the predefined stop words of the given language, if present.
- use_idf – enable inverse-document-frequency weighting (obviously).
- token_pattern – a regular expression describing which tokens are kept in the vocabulary. The default, r'(?u)\b\w\w+\b', keeps only words with 2 or more alphanumeric characters. If you want to keep only words with 2 or more alphabetic characters (no digits), set token_pattern to r'(?u)\b[^\W\d][^\W\d]+\b'.
- max_features – keep at most this many vocabulary words, ordered by term frequency across the corpus.
- vocabulary – if you have built your own vocabulary, pass it here; otherwise the vocabulary is generated from the training data.
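As a small illustration of these parameters (the two documents below are made up), we can inspect the vocabulary the vectorizer actually learns:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the acting was brilliant", "the plot was dull and the acting felt wooden"]
vec = TfidfVectorizer(min_df=1, max_df=1.0, sublinear_tf=True, stop_words='english',
                      token_pattern=r'(?u)\b[^\W\d][^\W\d]+\b')   # words of 2+ letters, no digits
X = vec.fit_transform(docs)
print(vec.get_feature_names_out())   # learned vocabulary after stop-word removal
print(X.shape)                       # (2, vocabulary size)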
fit_transform()
After initialization, the fit_transform() method is called on the vectorizer object with the parameter X_train. X_train is a list (iterable) of strings where each string represents the content of one document, so the length of this list equals the number of training documents. Here, fit_transform(X_train) does the following:
1. Tokenizes each string into words and preprocesses it, removing special characters, stop words, etc. Words that do not match the token_pattern regex are also removed.
2. Builds a vocabulary of words with their counts in the training set, taking max_features, min_df and max_df into consideration.
3. Finally, for each string (document), it creates the tf-idf word count vector: a vector over all vocabulary words whose frequencies are weighted by term frequency and inverse document frequency.
It returns a feature matrix containing a fixed-length tf-idf weighted word count vector for each document in the training set. This is also called a term-document matrix. With this, we are ready to train our SVM and Multinomial NB classifiers. These classifiers take two arguments: the term-document matrix and the polarity labels of the 1800 training files. This completes our training process. Similarly, vectorizer.transform(X_test) generates a term-document matrix for the 200 test files using the same vocabulary that was generated during training.
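As a quick sanity check (variable names follow the snippet above; the exact column count depends on the vocabulary that was learned):

print(train_corpus_tf_idf.shape)     # (1800, vocabulary size) -- a sparse matrix
print(test_corpus_tf_idf.shape)      # (200, vocabulary size) -- same columns as training
print(len(vectorizer.vocabulary_))   # number of words kept in the vocabulary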
Python implementation: Sentiment Analysis
Now we can check the performance of the trained models on the term-document matrix of the test set. Below is the full code for sentiment analysis on the movie review polarity data-set using tf-idf features.
import os
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold


def make_Corpus(root_dir):
    # Create a corpus with each document held as one string
    polarity_dirs = [os.path.join(root_dir, f) for f in os.listdir(root_dir)]
    corpus = []
    for polarity_dir in polarity_dirs:
        reviews = [os.path.join(polarity_dir, f) for f in os.listdir(polarity_dir)]
        for review in reviews:
            doc_string = ""
            with open(review) as rev:
                for line in rev:
                    doc_string = doc_string + line
            corpus.append(doc_string)
    return corpus


root_dir = 'txt_sentoken'
corpus = make_Corpus(root_dir)

# Stratified 10-fold cross validation with SVM and Multinomial NB
labels = np.zeros(2000)
labels[0:1000] = 0      # 1000 positive reviews
labels[1000:2000] = 1   # 1000 negative reviews

kf = StratifiedKFold(n_splits=10)
totalsvm = 0                     # accuracy count on 2000 files (SVM)
totalNB = 0                      # accuracy count on 2000 files (Multinomial NB)
totalMatSvm = np.zeros((2, 2))   # aggregated confusion matrix (SVM)
totalMatNB = np.zeros((2, 2))    # aggregated confusion matrix (Multinomial NB)

for train_index, test_index in kf.split(corpus, labels):
    X_train = [corpus[i] for i in train_index]
    X_test = [corpus[i] for i in test_index]
    y_train, y_test = labels[train_index], labels[test_index]

    vectorizer = TfidfVectorizer(min_df=5, max_df=0.8, sublinear_tf=True,
                                 use_idf=True, stop_words='english')
    train_corpus_tf_idf = vectorizer.fit_transform(X_train)
    test_corpus_tf_idf = vectorizer.transform(X_test)

    model1 = LinearSVC()
    model2 = MultinomialNB()
    model1.fit(train_corpus_tf_idf, y_train)
    model2.fit(train_corpus_tf_idf, y_train)
    result1 = model1.predict(test_corpus_tf_idf)
    result2 = model2.predict(test_corpus_tf_idf)

    totalMatSvm = totalMatSvm + confusion_matrix(y_test, result1)
    totalMatNB = totalMatNB + confusion_matrix(y_test, result2)
    totalsvm = totalsvm + sum(y_test == result1)
    totalNB = totalNB + sum(y_test == result2)

print(totalMatSvm, totalsvm / 2000.0, totalMatNB, totalNB / 2000.0)
There are two things here that may seem unexpected. Firstly, make_Corpus() reads each file of the data-set to convert the multiple lines of text in a document into one string per document. Secondly, kf = StratifiedKFold(n_splits=10) sets up K-fold cross validation so that the data-set is partitioned into 10 parts. In each fold, 1 part is used to test the model while the other 9 parts are used for training. The validation process is repeated K times, so every partition is used exactly once as the test set without being included in its own training set.
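The "stratified" part means each fold preserves the positive/negative class balance. A tiny made-up example shows this:

import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0] * 6 + [1] * 6)              # toy labels: 6 of each class
skf = StratifiedKFold(n_splits=3)
for train_idx, test_idx in skf.split(np.zeros(12), y):
    print(test_idx, y[test_idx])             # every test fold contains 2 samples of each class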
Checking Performance: Sentiment analysis
10-fold cross validation allows us to test all the files of the corpus. There are 1000 text reviews for each of the two sentiments (positive and negative). The results below show the number of text files whose sentiment was correctly predicted by each classifier. One can compare the two classifiers (Multinomial NB and SVM), and also compare against the results obtained when the same task was implemented with conventional word count features (Github link).
Features/Models         | Multinomial NB | SVM
Conventional word count | 1646 (82.3%)   | 1636 (81.8%)
Tf-idf weighted factor  | 1665 (83.25%)  | 1748 (87.4%)
SVM outperforms Multinomial NB when tf-idf weighted features are used. We can also see an improvement of about 5.6 percentage points in the correct-classification rate for SVM (81.8% to 87.4%) when the word count features are given tf-idf weights. The confusion matrices for both Multinomial NB and SVM using tf-idf features are shown below:
Multinomial NB | Negative | Positive
Negative       | 856      | 144
Positive       | 191      | 809

SVM (Linear)   | Negative | Positive
Negative       | 874      | 126
Positive       | 126      | 874
Concluding Remarks
I hope I have done justice to tf-idf features in this blog. I have tried to explain the usefulness of these features through a sentiment analysis application. Beginners are encouraged to implement it and to match their outputs with the results shown here. Also, try to analyse the difference between conventional word count features and tf-idf weighted features. One can read my previous post to learn how to implement conventional features for a classification problem. In order to get more insight into the TfidfVectorizer class of sklearn, we should play with its various parameters like token_pattern, vocabulary, stop_words, max_features, etc.
The machine learning models (Multinomial NB and SVM) have been used here without presenting their mathematical background, as doing so might overwhelm readers with too much information in a single blog-post. One may apply other variants of these classifiers in order to compare them and analyse the underlying differences. Here, the purpose was to present an understanding of term frequency and inverse document frequency and their importance in text mining applications.
In addition, the full Python implementation of sentiment analysis on the polarity movie review data-set using both types of features can be found on the Github link here.
If you liked the post, follow this blog to get updates about the upcoming articles. Also, share this article so that it can reach out to the readers who can actually gain from this. Please feel free to discuss anything regarding the post. I would love to hear feedback from you.
Happy machine learning 🙂
good work sir,
Thank you so much.
yeahh….good to know it helped…follow the blog if you wish to get updates of new posts.
Hi,
I'm sorry, I could not understand the testing process. Here you took around 200 documents for testing and computed their tf-idf, but if I want to classify a single document, what would the approach be? You can't compute idf for a single document, can you? Can you please clarify the approach for testing the model and what is actually happening under the hood? Thank you.
hi,
Testing a single file can easily be done. Look at the variables "test_corpus_tf_idf" and "X_test": they are a set of vectors and a set of strings respectively. If you want to test only one string, you need to transform that one string into a tf-idf vector using the vectorizer.transform() function; the idf values learned during training are reused, so nothing needs to be computed from the single test document itself.
Here, K-fold cross validation was applied in order to test all the files. You can instead have a separate train corpus of, say, 900 files from each category and a test corpus of the rest of the files.
Just execute make_Corpus, fit the tf-idf vectors on the train corpus, and finally predict on the transformed test files.
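A minimal sketch of classifying one new review with the already-fitted vectorizer and model (variable names follow the post's code; the review text is made up):

new_review = "a wonderfully acted film with a gripping plot"   # hypothetical input
new_vec = vectorizer.transform([new_review])    # reuses the idf learned during training
print(model1.predict(new_vec))                  # model1 is the trained LinearSVC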
Here you have assumed the class label for the 0–1000 data is 0 and for 1000–2000 it is 1, but if we already know the class labels then this code, labels[0:1000]=0; labels[1000:2000]=1;, is irrelevant, isn't it?
Please correct me if I am getting this wrong.
Hi,
The polarity dataset contains 1000 positive reviews (labels[0:1000] = 0) and 1000 negative reviews (labels[1000:2000] = 1), so it is not an assumption.
Yes, we already know the ground truth, and that helps us evaluate the accuracy of our approach/solution.
y_train and y_test are the labels used in every iteration. y_train (the ground truth for the 1800 training files) is used while training the SVM or Naive Bayes model. y_test (the ground truth for the 200 test files) is only used for evaluating the confusion matrix: the predicted labels and the y_test labels are matched to find out how many files the models classified correctly.
So, for evaluation purposes, we need to know the ground truth.
Hope it helps. Ask for any further clarification needed.
Thanks.
Traceback (most recent call last):
File “movie-polarity.py”, line 58, in
dictionary = make_Dictionary(root_dir)
File “movie-polarity.py”, line 11, in make_Dictionary
emails_dirs = [os.path.join(root_dir,f) for f in os.listdir(root_dir)]
OSError: [Errno 2] No such file or directory: ‘txt_sentoken’
You should first download the polarity data-set and check out its directory structure.
Keep movie-polarity.py in the same folder, then execute the code.
I would suggest reading the article first; have patience and then check the code. The path of the data-set needs to be set correctly. I hope you have done some coding in Python.
Thanks for sharing this article. It has a nice explanation.
Thanks, and follow the blog to get updates about new blog-posts.
hi
You have really done good work. What I have understood is that here, for every 200 reviews, a confusion matrix is generated and finally they are combined into a single matrix. Am I correct?
My question: is there any way to increase the accuracy even further?
Yes, you are correct; that is what happens with the 10-fold cross validation technique.
There are a lot of things that can be tried. One thing that can increase the accuracy is improving the dictionary; as of now, the dictionary is created by the sklearn library function.
Also, you may find other models in the literature which have given better results than the 87% achieved here.
Using the same model, is it possible to improve the dictionary?
If yes, can you suggest a few ways?
In this algorithm, as I understand it, you have used term frequency and related weights for building the dictionary. So can you suggest any ideas for improving the dictionary?
Building a dictionary is totally different from term frequency and inverse document frequency weighting. The dictionary is simply a list of tuples which keeps a record of the words and their counts.
I would suggest reading another article on this blog named "Email Spam Filtering: A python implementation".
I am doing a project on sentiment analysis. So what can I do to improve the accuracy of the SVM algorithm, and how can I improve the dictionary?
Using the same model, is it possible to improve the dictionary?
If yes, can you suggest a few methods?
Can we employ a kernel for this?
Hi Bhavitha,
It is possible to manually improve the dictionary. In this implementation that does not happen, as TfidfVectorizer() takes care of building the dictionary. You can create a dictionary on your own (as done in another post, "Email spam filtering") and pass it to TfidfVectorizer as the vocabulary parameter.
If you are talking about SVM kernels here, I have tried a few SVM kernels and got the best accuracy with the linear kernel only.
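A small sketch of that suggestion; the word list below is made up and would in practice come from your own dictionary-building step:

from sklearn.feature_extraction.text import TfidfVectorizer

my_vocab = ['excellent', 'boring', 'brilliant', 'terrible', 'plot', 'acting']   # hypothetical hand-built vocabulary
vectorizer = TfidfVectorizer(sublinear_tf=True, use_idf=True, vocabulary=my_vocab)
train_corpus_tf_idf = vectorizer.fit_transform(X_train)   # features only for the listed words
test_corpus_tf_idf = vectorizer.transform(X_test)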
What would be the approach to go back and figure out which words/phrases were significant for better classification?
How do I plot the SVM plane of the above classification process? [using tf-idf and SVM]
I want to plot the graph and get the generated hyperplane that classifies the above points (tf-idf) into two classes. Please help.
Hi haruo,
You can plot the SVM plane, but it can be plotted in 3D at best.
You need to reduce the dimensionality of the feature vectors to 3 using PCA and then plot the SVM plane.
You can follow this:
https://stackoverflow.com/questions/36232334/plotting-3d-decision-boundary-from-linear-svm
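A rough sketch of the first part of that idea (TruncatedSVD is used here instead of plain PCA because the tf-idf matrix is sparse; variable names follow the post's code):

import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import LinearSVC

svd = TruncatedSVD(n_components=3)                  # project sparse tf-idf vectors to 3 dimensions
train_3d = svd.fit_transform(train_corpus_tf_idf)
svm_3d = LinearSVC().fit(train_3d, y_train)         # re-train an SVM in the reduced space

plt.scatter(train_3d[:, 0], train_3d[:, 1], c=y_train, cmap='coolwarm', s=8)
plt.xlabel('SVD component 1')
plt.ylabel('SVD component 2')
plt.show()                                          # the 3D plane itself can be drawn as in the linked answer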
Where can I find the test data for this code?
The train and test sets have been created from the same corpus.
I have performed 10-fold cross validation.
Hi, I have a list of sentences in two separate text files, T1 and T2. I have trained and tested on T1 with your code, and it is working as expected. Now I want to predict the sentences in T2 using the trained model. Please guide me. Thanks.
Pickle the trained model; it will be saved on your disk.
Then you can write code to preprocess the sentences in T2 in the same way as the sentences in T1 and predict them with the trained model you saved.
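A rough sketch of that workflow; the file names are made up, and note that the fitted vectorizer should be saved along with the model so T2 can be transformed with the same vocabulary and idf values:

import pickle

# after training on T1: save both the fitted vectorizer and the model
with open('sentiment_model.pkl', 'wb') as f:
    pickle.dump((vectorizer, model1), f)

# later, in another script: load them back and classify the sentences from T2
with open('sentiment_model.pkl', 'rb') as f:
    vectorizer, model = pickle.load(f)
t2_sentences = open('T2.txt').read().splitlines()     # assuming one sentence per line
print(model.predict(vectorizer.transform(t2_sentences)))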
Thanks, I will try this one.
Could you please guide me on how to implement the same problem (text classification into 2 or 3 classes using tf-idf) using ensemble voting classifiers? I tried searching on the internet but couldn't find a proper example.
Hope this helps
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html
I already tried that one. When I tried to replace
eclf1 = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)], voting='hard')
eclf1 = eclf1.fit(X, y)
(X, y) with (train_corpus_tf_idf, y_train), it shows an error.
It should not. You may want to dig deeper into that error and correct it (one common cause is including an estimator, such as GaussianNB, that cannot handle the sparse tf-idf matrix). You can use several models and use ensemble voting to predict the output class.
Alternatively,
you can simply train, say, 5 models separately and then, for any test case, write a simple piece of logic that counts the votes of the output classes and returns the class which gets the maximum votes.
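A sketch of the VotingClassifier route on the tf-idf features from the post (the estimator choices here are illustrative; all three accept sparse input):

from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

voter = VotingClassifier(estimators=[('lr', LogisticRegression(max_iter=1000)),
                                     ('nb', MultinomialNB()),
                                     ('svm', LinearSVC())],
                         voting='hard')                 # hard voting: majority class wins
voter.fit(train_corpus_tf_idf, y_train)
ensemble_pred = voter.predict(test_corpus_tf_idf)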
Hi, could you guide me in transforming the text with the BM25 term weighting scheme instead of tf-idf, so that I can feed it to the same sklearn classifier in Python? Thanks.
BM25 has not shown much improvement in the machine learning context (classification or clustering), so it has not been implemented in sklearn.
BM25 is more often used in information retrieval. It is implemented in the "Whoosh" PyPI package; you can probably explore that.