Email Spam Filter: A Python implementation with scikit-learn

Text mining (deriving information from text) is a wide field that has gained popularity with the huge amount of text data being generated. A number of applications, such as sentiment analysis, document classification, topic classification, text summarization, and machine translation, have been automated using machine learning models. An email spam filter is a beginner's example of a document classification task: classifying an email as spam or non-spam (a.k.a. ham). The spam box in your Gmail account is the best-known example of this.

So let's get started building an email spam filter on a publicly available mail corpus. I have extracted an equal number of spam and non-spam emails from the Ling-spam corpus. The extracted subset on which we will be working can be downloaded from here.

We will walk through the following steps to build the email spam filter application:

  1. Preparing the text data
  2. Creating a word dictionary
  3. Feature extraction
  4. Training the classifiers

Finally, we will check the results on the test set of the subset we created.

1. Preparing the text data.

The dataset used here is split into a training set of 702 mails and a test set of 260 mails, each divided equally between spam and ham. You will easily recognize the spam mails, as their filenames contain *spmsg*.

In any text mining problem, text cleaning is the first step: we remove from the documents those words that are unlikely to contribute to the information we want to extract. Emails may contain a lot of undesirable content, such as punctuation marks, stop words, and digits, which is not helpful in detecting spam. The emails in the Ling-spam corpus have already been pre-processed in the following ways:

a) Removal of stop words – Stop words like “and”, “the”, “of”, etc. are very common in all English sentences and are not very meaningful in deciding spam or legitimate status, so they have been removed from the emails.

b) Lemmatization – This is the process of grouping together the different inflected forms of a word so they can be analysed as a single item. For example, “include”, “includes”, and “included” would all be represented as “include”. The context of the sentence is also preserved in lemmatization, as opposed to stemming (another common text-mining technique, which does not consider the meaning of the sentence).

We still need to remove non-words like punctuation marks and special characters from the mail documents. There are several ways to do this. Here, we will remove such words after creating the dictionary, which is very convenient: once you have a dictionary, you need to remove each such word only once. For now, you don't need to do anything.
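
If you are working with a raw corpus instead, a minimal sketch of this kind of pre-processing with NLTK might look like the following (not needed for the Ling-spam subset, which ships already cleaned; it assumes the stop-word and WordNet data have been fetched with nltk.download):

# Minimal pre-processing sketch with NLTK; assumes nltk.download('stopwords')
# and nltk.download('wordnet') have been run once beforehand.
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean(text):
    words = text.lower().split()
    # drop stop words, then reduce each remaining word to its lemma
    return [lemmatizer.lemmatize(w) for w in words if w not in stop_words]

print(clean("the cats were included in the lists"))  # e.g. ['cat', 'included', 'list']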

2. Creating a word dictionary.

A sample email in the data-set looks like this:

Subject: posting

hi , ' m work phonetics project modern irish ' m hard source . anyone recommend book article english ? ' , specifically interest palatal ( slender ) consonant , work helpful too . thank ! laurel sutton ( sutton @ garnet . berkeley . edu

It can be seen that the first line of the mail is the subject and the third line contains the body of the email. We will perform text analytics only on the body to detect spam. As a first step, we need to create a dictionary of words and their frequencies. For this task, the training set of 702 mails is used. The following Python function creates the dictionary for you.

def make_Dictionary(train_dir):
    emails = [os.path.join(train_dir, f) for f in os.listdir(train_dir)]
    all_words = []
    for mail in emails:
        with open(mail) as m:
            for i, line in enumerate(m):
                if i == 2:  # the body of each email is on the third line of the text file
                    words = line.split()
                    all_words += words

    dictionary = Counter(all_words)
    # Paste the code for non-word removal here (snippet given below)
    return dictionary

Once the dictionary is created, we can add the few lines of code below to the above function to remove the non-words we talked about in step 1. I have also removed single characters from the dictionary, since they are irrelevant here. Do not forget to insert the code below into the function make_Dictionary(train_dir).

list_to_remove = list(dictionary.keys())  # take a copy, since we delete entries while iterating
for item in list_to_remove:
    if item.isalpha() == False:
        del dictionary[item]
    elif len(item) == 1:
        del dictionary[item]
dictionary = dictionary.most_common(3000)

The dictionary can be inspected with print(dictionary). You may find some absurd word counts to be high, but don't worry, it's just a dictionary and you always have scope to improve it later. Here I have kept the 3000 most frequently used words in the dictionary. If you are following this blog with the provided dataset, make sure your dictionary has some of the entries given below among its most frequent words.

[('order', 1414), ('address', 1293), ('report', 1216), ('mail', 1127), ('send', 1079), ('language', 1072), ('email', 1051), ('program', 1001), ('our', 987), ('list', 935), ('one', 917), ('name', 878), ('receive', 826), ('money', 788), ('free', 762), ...]

3. Feature extraction process.

Once the dictionary is ready, we can extract a 3000-dimensional word-count vector (our feature here) for each email in the training set. Each word-count vector contains the frequency of the 3000 dictionary words in that training file; as you might have guessed, most of these counts will be zero. Let us take an example. Suppose we have 500 words in our dictionary, so each word-count vector contains the frequency of those 500 words in the training file. If the text in a training file was “Get the work done, work done”, it would be encoded as [0,0,0,0,0,…….0,0,2,0,0,0,……,0,0,1,0,0,…0,0,1,0,0,……2,0,0,0,0,0]: the word counts are placed at the 296th, 359th, 415th, and 495th indices of the 500-length vector, and the rest are zero.
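
To make this encoding concrete, here is a toy version of the same idea with a hypothetical five-word dictionary (the indices in the example above are only illustrative):

import numpy as np

# hypothetical 5-word dictionary of (word, corpus frequency) tuples, as most_common() returns
toy_dictionary = [('done', 10), ('get', 8), ('other', 5), ('the', 4), ('work', 3)]
words = "get the work done work done".split()

vector = np.zeros(len(toy_dictionary))
for wordID, (word, count) in enumerate(toy_dictionary):
    vector[wordID] = words.count(word)  # frequency of each dictionary word in this mail
print(vector)  # [2. 1. 0. 1. 2.]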

The Python code below generates a feature-vector matrix whose rows denote the 702 files of the training set and whose columns denote the 3000 words of the dictionary. The value at index [i, j] is the number of occurrences of the jth dictionary word in the ith file.

def extract_features(mail_dir):
    # sorted() is important here: labels are assigned by file position later,
    # so the files must be listed in a deterministic (alphabetical) order
    files = [os.path.join(mail_dir, fi) for fi in sorted(os.listdir(mail_dir))]
    features_matrix = np.zeros((len(files), 3000))
    docID = 0
    for fil in files:
        with open(fil) as fi:
            for i, line in enumerate(fi):
                if i == 2:  # body of the email is on the third line
                    words = line.split()
                    for word in words:
                        # dictionary is the global (word, count) list built by make_Dictionary
                        for wordID, d in enumerate(dictionary):
                            if d[0] == word:
                                features_matrix[docID, wordID] = words.count(word)
        docID = docID + 1
    return features_matrix

4. Training the classifiers.

For building the email spam filter, we will train a mathematical model that learns a decision boundary in feature space between the two classes. Here, I will be using the scikit-learn ML library for training the classifiers. It is an open-source Python ML library which comes bundled with the third-party Anaconda distribution, or can be installed separately by following this. Once installed, we only need to import it in our program.

Further, I have trained two models here, namely a Naive Bayes classifier and a Support Vector Machine (SVM). Naive Bayes is a conventional and very popular method for document classification. It is a supervised probabilistic classifier based on Bayes' theorem, assuming independence between every pair of features. SVMs are supervised binary classifiers which are very effective when you have a large number of features. The goal of an SVM is to find a separating hyper-plane determined by a small subset of the training points called the support vectors (they lie on the boundary of the separating margin). The decision function of an SVM model, which predicts the class of test data, is based on the support vectors and can make use of a kernel trick.
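
For intuition, the standard Naive Bayes decision rule for a mail containing the words $w_1, \dots, w_n$ is

$$\hat{c} = \underset{c \,\in\, \{\text{spam},\,\text{ham}\}}{\arg\max}\; P(c) \prod_{i=1}^{n} P(w_i \mid c),$$

i.e. pick whichever class makes the observed words most probable, treating each word as independent given the class.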

Once the classifiers are trained, we can check the performance of the models on the test set. We extract the word-count vector for each mail in the test set and predict its class (ham or spam) with the trained NB classifier and SVM model. Below is the full code for the spam-filtering application; you have to include the two functions we defined in step 2 and step 3.

import os
import numpy as np
from collections import Counter
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
from sklearn.svm import SVC, NuSVC, LinearSVC
from sklearn.metrics import confusion_matrix

# Create a dictionary of words with their frequencies
train_dir = 'train-mails'
dictionary = make_Dictionary(train_dir)

# Prepare feature vectors per training mail and their labels
# (with alphabetical listing, the first 351 mails are ham (0), the remaining 351 spam (1))
train_labels = np.zeros(702)
train_labels[351:702] = 1
train_matrix = extract_features(train_dir)

# Training SVM and Naive Bayes classifiers
model1 = MultinomialNB()
model2 = LinearSVC()
model1.fit(train_matrix, train_labels)
model2.fit(train_matrix, train_labels)

# Test the unseen mails for spam
test_dir = 'test-mails'
test_matrix = extract_features(test_dir)
test_labels = np.zeros(260)
test_labels[130:260] = 1  # first 130 test mails are ham, last 130 are spam
result1 = model1.predict(test_matrix)
result2 = model2.predict(test_matrix)
print(confusion_matrix(test_labels, result1))
print(confusion_matrix(test_labels, result2))

Checking Performance: Email spam filter

Let us check the performance of the email spam filter we built. The test set contains 130 spam and 130 non-spam emails. If you have followed along this far, you should see the results below. The confusion matrix of the test set is shown for both models: the diagonal elements represent correctly identified (true) classifications, whereas the off-diagonal elements represent misclassifications (false identifications).

Multinomial NB   Ham   Spam
Ham              129      1
Spam               9    121

SVM (Linear)     Ham   Spam
Ham              126      4
Spam               6    124

(rows: actual class, columns: predicted class)

Both models had similar performance on the test set, except that the SVM's false identifications are slightly more balanced. I must remind you that the test data was used neither in creating the dictionary nor in training the models.
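
If you want more than the raw confusion matrices, scikit-learn can also report accuracy and per-class precision/recall directly; a small optional addition to the script above:

from sklearn.metrics import accuracy_score, classification_report

print(accuracy_score(test_labels, result1))  # overall accuracy of the Naive Bayes model
print(classification_report(test_labels, result2, target_names=['ham', 'spam']))  # per-class precision/recall/F1 for the SVM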

Task for you

Download the pre-processed form of the Enron-spam corpus. The corpus contains 33716 emails in six directories, each of which contains 'ham' and 'spam' folders. The total numbers of non-spam and spam emails are 16545 and 17171 respectively.

Follow the same steps described in this blog post and check how it performs with the Support Vector Machine and Multinomial Naive Bayes models. As the directory structure of this corpus differs from that of the Ling-spam subset used in this post, you may have to either reorganize it or modify the make_Dictionary(dir) and extract_features(dir) functions; a sketch of the latter approach is given below.
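
For example, a minimal sketch of reading the Enron layout directly, assuming the directory structure Enron-data-set/enron1 … enron6 with ham and spam sub-folders (as discussed in the comments below):

import os

def collect_mails(root_dir):
    # returns mail file paths and their labels (0 = ham, 1 = spam)
    files, labels = [], []
    for subdir in sorted(os.listdir(root_dir)):  # enron1 .. enron6
        for label_name in ('ham', 'spam'):
            folder = os.path.join(root_dir, subdir, label_name)
            for name in sorted(os.listdir(folder)):
                files.append(os.path.join(folder, name))
                labels.append(0 if label_name == 'ham' else 1)
    return files, labels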

I divided the Enron-spam corpus into a training set and a test set with a 60:40 split. After performing the same steps as in this post, I got the following results on the 13487 test-set emails. We can see that the SVM performed slightly better than the Naive Bayes classifier at detecting spam emails correctly.

Multinomial NB   Ham    Spam
Ham              6445    225
Spam              137   6680

SVM (Linear)     Ham    Spam
Ham              6490    180
Spam              109   6708

Final Thoughts

I hope it was easy to go through this tutorial, as I have tried to keep it short and simple. Beginners interested in text analytics can start with the email spam filter application demonstrated here.

You might be wondering about the mathematical techniques behind the models used, namely Naive Bayes and SVM. The SVM is a mathematically complex model, whereas Naive Bayes is relatively easy to understand. You are encouraged to study these models from online sources. Apart from that, there are a lot of experiments that can be done to find the effect of various parameters, such as the following (a short sketch follows the list):

a) Amount of training data
b) Dictionary size
c) Variants of the ML techniques used (GaussianNB, BernoulliNB, SVC)
d) Fine-tuning the parameters of the SVM models
e) Improving the dictionary by eliminating insignificant words (maybe manually)
f) Using other features (look up tf-idf)
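
As a starting point, experiments (c) and (f) can be sketched in a few lines, reusing train_matrix, test_matrix, and the label arrays from the script above (TfidfTransformer re-weights the raw counts; this is a sketch, not a tuned setup):

from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.feature_extraction.text import TfidfTransformer

# (c) try another Naive Bayes variant on the same count features
model3 = BernoulliNB()
model3.fit(train_matrix, train_labels)
print(model3.score(test_matrix, test_labels))  # mean accuracy on the test set

# (f) re-weight the word counts with tf-idf, fitting on the training data only
tfidf = TfidfTransformer()
train_tfidf = tfidf.fit_transform(train_matrix)
test_tfidf = tfidf.transform(test_matrix)
model4 = MultinomialNB().fit(train_tfidf, train_labels)
print(model4.score(test_tfidf, test_labels))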

Moreover, I have penned down a mathematical explanation of SVM in a blog post here, and another blog post about Naive Bayes here.

In addition, you can get the full Python implementation for both corpora from the GitHub link here.

If you liked the post, follow this blog to get updates about upcoming articles. Also, share it so that it can reach readers who can actually gain from it. Please feel free to discuss anything regarding the post; I would love to hear your feedback.

Happy machine learning 🙂

151 thoughts on “Email Spam Filter: A Python implementation with scikit-learn”

  2. Hey Abhijeet Nice blog post I must say… But don’t you think False positive rate for both MNV and SVM are increasing exponentially as the data set is increasing? How can we reduce that/avoid this?

    • Thanks for the appreciation !!.
      In the implementations here,
      False positive rate for ling-spam dataset – 3.84% (10 FPs out of 260)
      False positive rate for Euron dataset – 2.68% for MultinomialNB (362 FPs out of 13487)

      False positive rate has decreased as the amount of training data was large in Euron dataset.

  3. Hey Abhijeet, it is really a very useful blog post for me. Thanks for providing the codes too.
    Will you write any blog on Support Vector Machine (with mathematical explanation)?
    Happy Machine Learning 🙂

  4. Thanks a lot for such a detailed explanation. I am completely new to Python and this blog helped me a lot for gaining some understanding. Most of the things were clear except below 2 line:

    train_labels[351:701] = 1
    test_labels[130:260] = 1

    Can you or anyone please let me know what it does and means? Thanks. 🙂

    • Hi Sandy,
      train_labels and test_labels are class labels of training and testing e-mails respectively.
      There are 702 emails in train directory, in which 351 has label 0 (ham) and remaining 351 has label 1 (spam).
      Same is followed with 260 emails of test directory.

      In supervised learning (Naive Bayes and SVM here), you have to give training data as well as their labels (category) for training a model.
      Hope it helps. !!

      • can you tell me where to put those datasets i am using pycharm and don,t know about it so your help is appreciated

  5. Thanks a lot for a great post. I have downloaded the code and execute on my laptop. However, the results on the testing set is very poor. The confusion matrix are:
    [[79 51]
    [72 58]]
    [[89 41]
    [86 44]]

    I have tried to change the dictionary size but have not achieved any as the good result as in your post.

    I wonder if you know there is any problem? I used ubuntu 16.04 and python 2.7, scikit learn 0.18.

    Thank you in advanced.

    • Hey hi !! Just after i saw your comment i re-ran the github code ‘lingspam_filter.py’ and its giving the same result as in blog-post.

      I would suggest you to debug the steps:

      1. Print the dictionary and check if it is getting created.

      2. line 21 and line 43 (if i == 2) in github view may create issue if the train/test mail text files has only 1 line. You can remove it if dictionary is not created.

      3. Make sure your train and test directory contains 700 and 260 mails respectively. Both directories have equal number of spam and non-spam mails.

      Ubuntu or python version won’t be the issue. Apprise if you find the solution.
      Cheers

      • I printed the dictionary and the result is: [(‘order’, 1414), (‘address’, 1293), (‘report’, 1216), (‘mail’, 1127), (‘send’, 1079), (‘language’, 1072), (’email’, 1051), (‘program’, 1001), (‘our’, 987),… It seems that this is correct.

        I have checked the number of training and testing emails and they are 702 and 260 respectively. These value are as described in the post. I am still working on this and trying to find the solution.

  6. I finally could fix this. The problem is listdir does not list the file in the right order. This function list the file in a random order. Subsequently, the assignment for labels lead to te wrong assignment. I fixed this by added sorted function to this and the whole command now become: files = [os.path.join(mail_dir,fi) for fi in sorted(os.listdir(mail_dir))].

    This worked fine for me now.
    Thanks,

  7. For those who are using Python 3, do the following changes:

    list_to_remove = dictionary.keys()
    With
    list_to_remove = list(dictionary)

    And whenever you find a print “test”, change it to print (“test”)

    • This is path of files in Euron-data-set.

      Enron-data-set\enron1\ham\*.txt file
      Enron-data-set\enron1\spam\*.txt file

      Enron-data-set\enron2\ham\*.txt file
      Enron-data-set\enron2\spam\*.txt file

      Enron-data-set\enron3\ham\*.txt file
      Enron-data-set\enron3\spam\*.txt file

      Enron-data-set\enron4\ham\*.txt file
      Enron-data-set\enron4\spam\*.txt file

      Enron-data-set\enron5\ham\*.txt file
      Enron-data-set\enron5\spam\*.txt file

      Enron-data-set\enron6\ham\*.txt file
      Enron-data-set\enron6\spam\*.txt file

  8. I’m having this error:
    File “test.py”, line 84
    print confusion_matrix(test_labels,result1)
    ^
    SyntaxError: invalid syntax

  9. It’s ok now, i just added parenthesis both sides:
    print (confusion_matrix(test_labels,result1))

  10. yup i am using python 3.6.1. can you display the output similar to weka with correctly/incorrectly instance classified in percentage and with detailed accuracy in terms of precision, recall, and roc value. thanks in advance.

  11. I’m stuck here:

    Traceback (most recent call last):
    File “test_enron.py”, line 72, in
    dictionary = make_Dictionary(root_dir)
    File “test_enron.py”, line 21, in make_Dictionary
    dirs = [os.path.join(emails_dir,f) for f in os.listdir(emails_dir)]
    NotADirectoryError: [WinError 267] The directory name is invalid: ‘Enron-data-set\\0001.1999-12-10.farmer.ham.txt’

    I need your help so bad…

    • Hey hi…
      Just check this error “NotADirectoryError: Enron-data-set\\0001.1999-12-10.farmer.ham.txt’ is not a valid directory”

      You need to fix your directory structure buddy. The implementation assumes the following file paths. There will be 6 directories inside Euron-data-set and each of these directories will have two sub-directories (ham and spam). The files will be in ham and spam directories.

      Enron-data-set\enron1\ham\*.txt file
      Enron-data-set\enron1\spam\*.txt file

      Enron-data-set\enron2\ham\*.txt file
      Enron-data-set\enron2\spam\*.txt file

      Enron-data-set\enron3\ham\*.txt file
      Enron-data-set\enron3\spam\*.txt file

      Enron-data-set\enron4\ham\*.txt file
      Enron-data-set\enron4\spam\*.txt file

      Enron-data-set\enron5\ham\*.txt file
      Enron-data-set\enron5\spam\*.txt file

      Enron-data-set\enron6\ham\*.txt file
      Enron-data-set\enron6\spam\*.txt file

  12. Traceback (most recent call last):
    File “test_enron.py”, line 72, in
    dictionary = make_Dictionary(root_dir)
    File “test_enron.py”, line 26, in make_Dictionary
    for line in m:
    File “C:\ProgramData\Anaconda3\lib\encodings\cp1252.py”, line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
    UnicodeDecodeError: ‘charmap’ codec can’t decode byte 0x9d in position 1651: character maps to

  13. Hi, thank you for a helpful post.

    I just wonder the following code (line 47, 48 in your github)
    for i,d in enumerate(dictionary):
    if d[0] == word:

    I think line 48 should be ‘if d == word’. Because for example if dictionary = [‘ham’: 2, ‘spam’: 1, ‘total’: 3], when we loop, the results with d are [ham, spam, total] and the results with d[0] are [h, s, t].

    Please let me know if I am wrong or not. Otherwise could you fix this small thing.
    P.s: I could not download your sub-dataset with dropbox link (702 + 260). Could you check it

    • Glad you liked it.

      for i,d in enumerate(dictionary)
      Here, ‘d’ will be ‘ham’:2 in first iteration as per your example.
      Again,
      d[0] will be ‘ham’ and d[1] will be 2
      Similarly, in next iteration, d[0] will be ‘spam’ and d[1] will be 1

      You can run and check it.

      I don’t know sometimes that download link gives error. I will suggest you save that to your dropbox and then download. If at all you can not download it, provide your email in comment. I will send it to you.

      Thanks.

  14. Thank you for your quick reply.

    But I tested with the following code:

    from collections import Counter
    emails = [‘ham’, ‘spam’, ‘ham’]
    dic = Counter(emails)
    for i, d in enumerate(dic):
    print d
    print d[0]

    —– Results with d are ham, spam. Not ‘ham’: 2, ‘spam’: 1
    —– Results with d[0] are h, s. Not ham, spam.

    Could you please check it, or let me where I misunderstood.

    By saving to my dropbox, now I can download the sub-datasets. Thank you again.

    • You are misunderstanding.

      Your dictionary
      emails = [‘ham’, ‘spam’, ‘ham’]

      My dictionary
      emails = [‘ham’:2, ‘spam’:3, ‘ham’:1]

      Dictionary is a collections of tuples. So, when we loop, the results with d are tuple (‘ham’:2) and the results with d[0] are words (‘ham’).

      You have not made tuples in your dictionary.

      Thanks.

      • I dont think i am misunderstanding, could you carefully check again my above comment.

        emails = [‘ham’, ‘spam’, ‘ham’] is just my input.

        dic = Counter(emails) –> dic = [‘ham’: 2, ‘spam’: 1] (same with your dictionary)

        But when I enumerate(dic), the results is at I mentioned
        —– Results with d are ham, spam. Not ‘ham’: 2, ‘spam’: 1
        —– Results with d[0] are h, s. Not ham, spam.

        Please help me to clear this. Thanks!

    • Hey,

      I ran the code.
      See, most_common() to produce a sequence of the n most frequently encountered input values and their respective counts.

      The dictionary is a list because of what most_common function returns (github code line 33).
      It is different from what Counter(emails) returns.
      I would suggest that you to apply most_common function and then check.

      Actually it returns list, that why the code runs fine.

      Thanks for grilling.. :-p

      • Thank you very much for making it clear.
        Now I understood, just because of Counter() returns different with dic.most_common(…)

        Just another small thing i just found is in line 47, ‘dictionary’ variable is not available in function “extract_features(train_dir)”. So I think this function should be “extract_features(dictionary, train_dir)”. And line 63 should be “train_matrix = extract_features(dictionary, train_dir)”

    • Yeah i know that, Cleaner way would be to pass dictionary but it works even though you do not pass it.

      May be because when dictionary is returned initially, it becomes a global variable and is accessible to all. If you notice, there is no “main” kind of function in my python file.

  15. thanks for your tutorials. It is so clear and helpful. But I couldnt get the file from dropbox. Do you mind upload it to github or share it by other means? Thanks in advance.

  16. I am getting this error while running the exact code as in github

    for item in list_to_remove:

    RuntimeError: dictionary changed size during iteration

  17. I have a requirement to implement gmail like classification of mails. Can you please tell how can we achieve that using this.
    BTW I successfully ran this . It was a very good article.

    • I assume that by gmail like classification you mean primary, social and promotion classes. That becomes an n-class classification problem while spam vs ham is a 2-class problem.

      The fundamental approach remains the same – you need to have an email training corpus for each of the categories that you want to have. Then you can train a NB model like the approach followed in the blog. Once the classes are trained, you can test unseen emails for the performance check.

      There are other document classification approaches also which can fulfil your requirement like topic modelling followed by similarity check etc.

    • You might be doing some silly mistake. Hope you have figured it by now. If not check for following steps:
      1. check train_labels whether the assigned labels are correct or not.
      2. check the os.listdir() method whether it is listing the files in sequential order. It may be the case that linux lists the files in some other order.
      3. print test labels and see the assigned values.

      Thanks

    • Hi,
      If i understand correct, do you mean “Naive Bayes calculation” ?
      I have written a separate blog-post for understanding Naive Bayes as well as SVM.

  18. I was just wondering how we can input Enron dataset for training and testing as we have 6 different directories i.e. Enron 1,…,Enron 6 and in each Enron dataset folder we have ham and spam folders but in case of ling-spam we have test-email and train-mail folders. so for training the classifier we simply enter the path of train-mail folder. Thank you

    • Yeah directory structure of both datasets are different. There are 2 ways.
      1. Manually create train and test folder by copying files from all 6 directories.
      2. In program read all the mails from 6 directory and keep track of the labels. Later you can split it in train and test.

      Please check the github link provided at the end of the blog. You will find the implementation on euron dataset programmatically.

      • Thank you for your reply. Could you please send/email me a full working Enron Corpus python script including the dummy dataset paths for training and testing data in it. I’m still confused and I already tried to run that script which is uploaded on the Github but that doesn’t work for me. I will really appreciate that.

  19. I am a beginner.. Can you please tell me which is the most efficient method in spam filtering? Either Feature Selection or Feature Extraction ? In this blog.. why did you go for feature extraction while feature selection are showing good results(as I inferred from googling) ?

    • If you truly ask, most efficient method would be from deep learning implemented by google.

      Feature selection and feature extraction are completely two different things.
      Feature extraction is a method to generate features from your data.
      Feature selection is a method to select only useful features (remove deteriorating features) from the generated features.

      I have only extracted the word count features here. One may apply feature selection on generated features here and remove the words/features which may be deteriorating the accuracy.

  20. very usefull stuff, i have done similar project but used spambase dataset for testing and training .,
    i need full documentation of your project, can you provide, please..!

  21. I am just trying to figure out at which point you differentiate between the spam and ham emails, I see you say the spam ones are named *spmmsg* but at what point do you have to specify somewhere which ones are spam and which are ham so your train set knows which are good and which are bad?

  22. Hey, I doubt that there is a bug In feature extraction function. So we are trying to find the frequency of a N common words in each email . so we are iterating line by line for each email and then for each line we are finding the word count, But what if the same word is found twice in a line is found once again 1times on other line of the same email ?

    Example:

    suppose this is the message:
    ‘ hello this hitesh! hello ma’am
    I would like to tell you hello’

    lets assume hello is the present in our vocabulary and in the first line we find hello is occurring twice and we assign number of occurrence as 2 but again we encounter hello in the next line and instead of assigning the number 3 as occurrence we assign 1 according to the algorithm. is that right ?

    .

    • Hi Hitesh,

      If you print the each line, you will find that in the data-set, each file has only 2 line.
      The second line contains all the content of the file as one string. So the issue of same word coming in multiple line will never come.

      If your data-set has files with multiple lines, you can change the logic accordingly.

  23. Hi
    Is it possible to actually assign a label to the outputs so it specifies which part is spam and ham? as mine only displays the numbers.

  24. Hello Sir, your work is great. Sir, please tell how can i give path for the training and testing dataset .I have created two folders manually ‘train’ and ‘test’ containing 700 mails and 260 mails respectively.

  25. The content is excellent and very useful. I am very thankful for your posting. It is very useful to understand the implementation point of view. Thank you very much.

  26. Hi,

    I am getting the following error when I run the code:
    UnicodeDecodeError: ‘ascii’ codec can’t decode byte 0xf8 in position 321: ordinal not in range(128)

    What do you think went wrong?
    Thanks!

  27. Getting error..

    Traceback (most recent call last):
    File “C:\Program Files\JetBrains\PyCharm 2018.1\helpers\pydev\pydev_run_in_console.py”, line 53, in run_file
    pydev_imports.execfile(file, globals, locals) # execute the script
    File “C:\Program Files\JetBrains\PyCharm 2018.1\helpers\pydev\_pydev_imps\_pydev_execfile.py”, line 18, in execfile
    exec(compile(contents+”\n”, file, ‘exec’), glob, loc)
    File “C:/Users/user/PycharmProjects/SmartCommSurveillance/SpamEmail-enron.py”, line 72, in
    dictionary = make_Dictionary(root_dir)
    File “C:/Users/user/PycharmProjects/SmartCommSurveillance/SpamEmail-enron.py”, line 25, in make_Dictionary
    for line in m:
    File “C:\Users\user\Anaconda3\envs\untitled\lib\encodings\cp1252.py”, line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
    UnicodeDecodeError: ‘charmap’ codec can’t decode byte 0x9d in position 1651: character maps to
    PyDev console: starting.
    Python 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)] on win32

  28. hello!
    My lingspam code is runing fine but when run th enron spam code it takes too much time to show results even i ran the code even more then 2 and 3 hours but no output shown so i just kill the program. there is no error in program. I think it takes time in those lines np.save……. like after 2 hours it just save 1 file. So i waana ask if there is any other approach other than first saving train mails and lables then loading???
    Sir please Replay. As i took this as my assignment for my Python Course.

    Thanks

    • I wonder why would such a thing happen.
      I would suggest you to run the codes from my github link for euron-spam dataset.

      You can find the GitHub link at the end of blog. The code was written in Python 2. It should run fine.

  29. I am new in python .I am doing master degree in communication & knowledge engineering.& i am doing my project on the topics of Spam mail detection using Support Vector Machine.so i require python code for this project.If you can help me,it will be so fruitful for me.I have read your blog,it made me so easy to do my project . I have also opened your all github link but i haven’t found exact code that suits my project.

    • Hi Praveen,

      Your project topic seems to be exactly what this blog is all about.
      Give yourself little time. Do read about the features and models. Try ti figure out how they work.

      If you will try to reproduce the results without understanding, then it won’t help you in anyways.
      Hope it helps.

  30. Below these are error in spyder module in anaconda…..plz say me suggestion to solve it

    • runfile(‘D:/project/Mail-Spam-Filtering-master/lingspam_filter.py’, wdir=’D:/project/Mail-Spam-Filtering-master’)
      Traceback (most recent call last):

      File “C:\Users\HP\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py”, line 2910, in run_code
      exec(code_obj, self.user_global_ns, self.user_ns)

      File “”, line 1, in
      runfile(‘D:/project/Mail-Spam-Filtering-master/lingspam_filter.py’, wdir=’D:/project/Mail-Spam-Filtering-master’)

      File “C:\Users\HP\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py”, line 705, in runfile
      execfile(filename, namespace)

      File “C:\Users\HP\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py”, line 102, in execfile
      exec(compile(f.read(), filename, ‘exec’), namespace)

      File “D:/project/Mail-Spam-Filtering-master/lingspam_filter.py”, line 83
      print confusion_matrix(test_labels,result1)
      ^
      SyntaxError: invalid syntax

      • Probably you are using Python3 and this code was developed in python 2.7.
        There are few differences in python 2 and python 3.
        For eg: In python 3, print() is the correct syntax where as in Python 2, print can be directly used without paranthesis.

        Try to read the error carefully and then resolve it.
        Follow the blog end to end in order to understand it.
        Thanks

  31. Hey,can you provide me code for graph for SVM in comparison to different evaluation techniques such as for precision,Recall,Confusion matrix,F-score etc. & that can plot in graph.
    Finally I have to give u many many thanks from my heart.because i have get output from ling_spam.py & now i am working on enron-spamfilter.py to get output. And also i have one query that it is not necessary to get output from enron-spamfilter.py..

  32. runfile(‘D:/project/Mail-Spam-Filtering-master/enron-spamfilter.py’, wdir=’D:/project/Mail-Spam-Filtering-master’)
    Traceback (most recent call last):

    File “”, line 1, in
    runfile(‘D:/project/Mail-Spam-Filtering-master/enron-spamfilter.py’, wdir=’D:/project/Mail-Spam-Filtering-master’)

    File “C:\Users\HP\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py”, line 705, in runfile
    execfile(filename, namespace)

    File “C:\Users\HP\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py”, line 102, in execfile
    exec(compile(f.read(), filename, ‘exec’), namespace)

    File “D:/project/Mail-Spam-Filtering-master/enron-spamfilter.py”, line 71, in
    dictionary = make_Dictionary(dir)

    File “D:/project/Mail-Spam-Filtering-master/enron-spamfilter.py”, line 26, in make_Dictionary
    for line in m:

    File “C:\Users\HP\Anaconda3\lib\encodings\cp1252.py”, line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]

    UnicodeDecodeError: ‘charmap’ codec can’t decode byte 0x9d in position 688: character maps to

    Above error is obtained when I run ‘enron-spamfilter.py’ code.what should i have to do to overcome this problem.but ‘lingspam_filter.py’ works correctly & it gives output.plz..reply me soon..

  33. how to solve below provlem..

    File “D:/project/Mail-Spam-Filtering-master/enron-spamfilter.py”, line 25
    with open (mail,encoding=”utf8″) as m
    ^
    SyntaxError: invalid syntax

  34. how i predict either it is spam or not in random email in ling-spam corpus. why training dataset and testing dataset is divided into 73%:27% ratio?what is the reason behind it?
    plz..plz..reply me soon i will be grateful

    • Hi Praveen,

      For any given email. It is easy to predict spam or not.
      1. You need to read the text of email.
      2. Extract the word count vector from the text (as done here in program).
      3. Further , predict the class(spam or non-spam) from the already trained model here.

      The train test split is as per corpora. It’s not necessary to do follow single rule. People go with 70:30, 80:20 or even 90:10 depending on size of dataset.

      I would suggest you to understand the codes instead of just running it for your project.

      Thanks.
      Abhijeet

  35. IF you can provide me code in lingspam.py .In that sense if i enter any sentence then it must identify that as spam or non-spam.
    For example if i enter sentence like”hey you have won free tickets for football match so contact me 9875789067″ then it display as spam or ham………so plz provide me code for this criteria in linspam.py

  36. What are the keywords that make a email Spam? What is the logic behind this algorithm.Can you please explain in brief.I didn’t get the logic.

  37. Hi Abhijeeth,

    I liked your content very much and this is my first project in python and Im finding difficulty in debugging this error .Please help in fixing this .

    —————————————————————————
    IndexError Traceback (most recent call last)
    in ()
    10 if d[0] == word:
    11 wordID = i
    —> 12 features_matrix[docID,wordID]=words.count(word)
    13 docID = docID + 1

    IndexError: index 700 is out of bounds for axis 0 with size 700

  38. Thanx for ur post.Can u pls tell how do we predict for a random email,means which is not in this dataset,if it is spam or not?

    • Hi Prakhar,

      For any given email. It is easy to predict spam or not.
      1. You need to read the text of email.
      2. Extract the word count vector from the text (as done here in program using the dictionary of training data).
      3. Further , predict the class(spam or non-spam) from the already trained model here.

    • Hi Hassan,

      train_labels – There are 702 emails. ‘train_labels’ labels them 0 if it is ham and 1 if it is spam emails. The first half is labels 0 and other half is labelled 1. It is necessary to generate labels to apply a supervised classification model.

      features_matrix – It is a matrix where rows are number of email files and columns are words in dictionary. So, each row represents the count of words (in columns) occurring in that email file. Dimension of features_matrix will be number of emails * words dictionary size

      Thanks.

  39. hiii… I am not able to run this code at all. I don’t know what’s the problem. but can you tell me all the necessary changed I should do for python 3?

      • Many Many thank u to ur response & help from yours ..i have completed this project of master degree (3rd sem). But now i want to upgrade this topics as my master degree on communication& knowledge engineering thesis on ” Comparative & performance Analysis of Spam mail identification using LSTM & SVM” which main objective is:
        -To Analyze the e-mail and classify it into spam ,non-spam,Social,promotion using SVM and LSTM.
        -To compare the result using different dataset & choose best method.
        So i want ur help if you can provide me different dataset above 10,000 & code on python platform.

        • Hello, excuse me for bothering you, maybe you can help me with your code to be able to guide me and be able to complete my project, I thank you for being attentive

    • Can you explain more !!
      Which OS are you using ?
      Are you executing from command promt ? Make sure you have proper rights or make sure you have opened cmd with administrator rights.

  40. I am getting problem like below in enron-spamfilter file…..how to solve it.plz help me

    File “C:\Users\HP\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py”, line 705, in runfile
    execfile(filename, namespace)

    File “C:\Users\HP\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py”, line 102, in execfile
    exec(compile(f.read(), filename, ‘exec’), namespace)

    File “D:/project/Mail-Spam-Filtering-master/enron-spamfilter.py”, line 72, in
    dictionary = make_Dictionary(dir)

    File “D:/project/Mail-Spam-Filtering-master/enron-spamfilter.py”, line 32, in make_Dictionary
    for item in list_to_remove:

    RuntimeError: dictionary changed size during iteration

    • it’s because he is using phyton 2.x and we are using phyton 3.x
      the problem is because of the program used “keys” in list_to_remove = dictionary.keys()… And how to fix that… Right now I also don’t know how to fix this issued 🙁

      • Hi readers,
        This is happening because it’s an old post when I used to code in python2. Just convert it in python3.
        For the above problem just try this.

        list_to_remove = list(dictionary.keys())

        May be I will migrate the whole code to python3 soon.
        Thanks.

        • Yap… it’s work well now… thank you… for phyton 3 just use () in print and your program is still running well

        • And… open (mail) –> open(mail,encoding=”Latin-1″) (for the enron corpus)
          But I still running the program because the enron dataset is quite big… I will update again soon if there are any changes needed. Thanx…

  41. can u provide me detail about enron.py as lingsapm.py…..plz..reply soon if possible provide me also

  42. Hi, I just realize that in step 1 (Preparing the text data) for Enron dataset I think the removal of stop word and lemmatization hasn’t proceeded (or it has proceeded?) Because I check the file inside the dataset and all the stop word still in there, and the first word that inside the dictionary is “the”. So it’s quite different than Ling-spam corpus that used in this article…

  43. Traceback (most recent call last):

    File “”, line 8, in
    train_matrix = extract_features(train_dir)

    File “”, line 15, in extract_features
    features_matrix[docID,wordID] = words.count(word)

    IndexError: index 15049 is out of bounds for axis 1 with size 15000

    why am i getting the above error while training the classifiers?
    please help me with that.
    thank you!!!!!!!!!!!!!
