This blog-post is the third in a series covering applications of "Topic Modelling" to simple Wikipedia articles. Before reading this post, I would suggest reading our earlier two articles here and here. In the first blog-post, we created a corpus of around 70,000 articles in the directory "articles-corpus". In the second blog-post, we discovered 50 topics from the article corpus using the Latent Dirichlet Allocation (LDA) algorithm.
In this final blog-post of the "Topic Modelling" series, we will see the following applications of the knowledge acquired (the topics discovered) during LDA training.
- Document Clustering : Clustering similar wiki-articles into 50 clusters.
- Document Exploration : Given a word, search for related articles.
- Theme Extraction : Find the theme of a given article.
As the first step, we will write a function to clean the test articles, similar to the cleaning we used before training on the corpus. This pre-processing is always required before feeding articles to any of the above-mentioned applications.
def rem_ascii(s):
    return "".join([ch for ch in s if ord(ch) < 128])
In the pre-processing step, we remove the non-ASCII characters, punctuation marks and stop words. Apart from that, we also lemmatize the words of the input articles.
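The clean_doc function used throughout the rest of this post is not shown above; here is a minimal Python 3 sketch of what it does. The small inline stop-word list is a hypothetical stand-in for NLTK's English stop words, and the lemmatization step is omitted for brevity; in the actual pipeline you would reuse the same stop list and WordNetLemmatizer as in Part-2.

```python
import string

# Hypothetical stand-in for NLTK's English stop words; the real
# pipeline uses set(stopwords.words('english')).
STOP = {"the", "a", "an", "is", "was", "in", "of", "and", "to", "on"}

def rem_ascii(s):
    # Drop non-ASCII characters
    return "".join(ch for ch in s if ord(ch) < 128)

def clean_doc(doc):
    # Remove non-ASCII characters, then strip punctuation and lowercase
    doc = rem_ascii(doc)
    doc = doc.translate(str.maketrans("", "", string.punctuation)).lower()
    # Remove stop words; the real pipeline also lemmatizes each token here
    return [tok for tok in doc.split() if tok not in STOP]
```

For example, `clean_doc("The song was released in June!")` returns `["song", "released", "june"]`.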
1. Document Clustering
Document clustering is an unsupervised approach to cluster articles according to the topics discovered in the training phase. It takes a corpus of unlabeled articles as input and categorizes them into groups according to the best-matched word distributions (topics) generated during training. The following steps are performed for document clustering.
- Clean all the articles in the input corpus.
- Convert each of the text articles into bag-of-words features using the dictionary of the trained model.
- Extract the best-matched topic from each article using the trained LDA model. In the gensim implementation, the get_document_topics() function does exactly this.
- Write the article into the directory belonging to the extracted topic if the minimum-probability criterion is satisfied; otherwise push it into the "unknown" directory.
- If the extracted topic (word distribution) is ambiguous, then we choose the 2nd best-matched topic (as some of the 50 discovered topics are content-neutral).
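The selection logic in the last three steps can be sketched with hypothetical get_document_topics output; the (topic_id, probability) pairs below are made-up stand-ins for the model's real results.

```python
from operator import itemgetter

# Hypothetical per-document topic distributions, as returned by
# get_document_topics with minimum_probability=0.20; an empty list
# means no topic passed the threshold.
doc_topics = [
    [(3, 0.61), (17, 0.25)],   # doc 0: topic 3 wins
    [],                        # doc 1: no confident topic -> "unknown"
    [(42, 0.30), (9, 0.55)],   # doc 2: topic 9 wins after sorting
]

clusters = {}
for k, topics in enumerate(doc_topics):
    if topics:
        # Sort by probability, highest first, and take the best topic
        topics.sort(key=itemgetter(1), reverse=True)
        clusters.setdefault(topics[0][0], []).append(k)
    else:
        clusters.setdefault("unknown", []).append(k)

print(clusters)  # {3: [0], 'unknown': [1], 9: [2]}
```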
def cluster_similar_documents(corpus, dirname):
    clean_docs = [clean_doc(doc) for doc in corpus]
    test_term = [ldamodel.id2word.doc2bow(doc) for doc in clean_docs]
    doc_topics = ldamodel.get_document_topics(test_term, minimum_probability=0.20)
    for k, topics in enumerate(doc_topics):
        if topics:
            topics.sort(key=itemgetter(1), reverse=True)
            dir_name = dirname + "/" + str(topics[0][0])
            file_name = dir_name + "/" + str(k) + ".txt"
            if not os.path.exists(dir_name):
                os.makedirs(dir_name)
            fp = open(file_name, "w")
            fp.write(docs_test[k] + "\n\n" + str(topics[0][1]))
            fp.close()
        else:
            if not os.path.exists(dirname + "/unknown"):
                os.makedirs(dirname + "/unknown")
            file_name = dirname + "/unknown/" + str(k) + ".txt"
            fp = open(file_name, "w")
            fp.write(docs_test[k])
            fp.close()
The above Python function follows these steps to perform document clustering on a given article corpus. It also takes a parameter dirname, under which it creates 50 sub-directories containing the clustered articles.
2. Document Exploration
Document exploration is another application that can be built over the trained LDA model. Here, given a word or theme, we extract the documents related to it. It is mainly a two-step process:
- Get the best matched topic cluster (highest probability) for the given word.
- Get “top” most probable related articles from the matched topic cluster in step 1.
get_term_topics() is the function used for getting the best-matched topic cluster for a given theme/word.
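The two-step matching can be illustrated with hypothetical outputs of get_term_topics and get_document_topics; the distributions below are made up for illustration, standing in for the model's real results.

```python
from operator import itemgetter

# Hypothetical topic distribution for the query term, and per-document
# distributions, as (topic_id, probability) pairs.
term_topics = [(7, 0.012), (3, 0.002)]          # topic 7 best matches the term
doc_topics = [
    [(7, 0.66)],                # doc 0: best topic 7 -> related
    [(2, 0.41), (7, 0.21)],     # doc 1: best topic 2 -> not related
    [(7, 0.48), (5, 0.31)],     # doc 2: best topic 7 -> related
]

related = []
for k, topics in enumerate(doc_topics):
    if topics:
        # A document is related if its best topic equals the term's best topic
        topics.sort(key=itemgetter(1), reverse=True)
        if topics[0][0] == term_topics[0][0]:
            related.append((k, topics[0][1]))

# Rank the related documents by probability, highest first
related.sort(key=itemgetter(1), reverse=True)
print(related)  # [(0, 0.66), (2, 0.48)]
```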
def get_related_documents(term, top, corpus):
    clean_docs = [clean_doc(doc) for doc in corpus]
    related_docid = []
    test_term = [ldamodel.id2word.doc2bow(doc) for doc in clean_docs]
    doc_topics = ldamodel.get_document_topics(test_term, minimum_probability=0.20)
    term_topics = ldamodel.get_term_topics(term, minimum_probability=0.000001)
    for k, topics in enumerate(doc_topics):
        if topics:
            topics.sort(key=itemgetter(1), reverse=True)
            if topics[0][0] == term_topics[0][0]:
                related_docid.append((k, topics[0][1]))
    related_docid.sort(key=itemgetter(1), reverse=True)
    for j, doc_id in enumerate(related_docid):
        print doc_id[1], "\n\n", docs_test[doc_id[0]]
        if j == (top - 1):
            break
The above Python function implements a document exploration system where, given a word/theme/topic as input, it prints the "top" most related articles from the simple wiki test corpus. The test corpus is also passed as an input to the function.
3. Theme Extraction
We know that 50 word distributions were discovered in Part-2 of this blog-post series. I have manually assigned a theme name to each of the word distributions. You may assign different theme names depending upon how you interpret the word distributions. If each discovered word distribution accurately belongs to a particular theme, then theme extraction from articles can be another useful application. You can view this file to see the mapping of manually assigned themes to the discovered word distributions.
def get_theme(doc):
    topics = "Electrical_systems_or_Education unknown music unknown Software \
International_event Literature War_or_Church Lingual_or_Research Biology \
Waterbody Wikipedia_or_Icehockey unknown unknown html_tags sports TV_shows \
Terms_and_Services music US_states Timeline Chemistry Germany Location_area \
Film_awards Games US_school unknown Railways Biography Directions_Australlia \
France India_Pakistan Canada_politcs_or_WWE Politics unknown British_Royal_Family \
American_Movies unknown Colors_or_Birds Fauna Chinese_Military unknown unknown \
unknown unknown unknown html_tags US_Govt Music_band".split()
    theme = ""
    cleandoc = clean_doc(doc)
    doc_bow = ldamodel.id2word.doc2bow(cleandoc)
    doc_topics = ldamodel.get_document_topics(doc_bow, minimum_probability=0.20)
    if doc_topics:
        doc_topics.sort(key=itemgetter(1), reverse=True)
        theme = topics[doc_topics[0][0]]
        if theme == "unknown":
            theme = topics[doc_topics[1][0]]
    else:
        theme = "unknown"
    return theme
The above Python function extracts the theme from the article given as an argument. Having written the 3 different functions, we will now see how to call them in the main program. The following Python snippet can be executed to perform these applications.
import cPickle
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from operator import itemgetter
import os

# Initialize WordNetLemmatizer and get the list of English stop words
stop = set(stopwords.words('english'))
lemma = WordNetLemmatizer()

# Load the trained LDA model (described in Part-2 of the blog-post series)
lda_fp = open("lda_model_sym_wiki.pkl", 'rb')
ldamodel = cPickle.load(lda_fp)

# Load the articles corpus and choose 10,000 files for test purposes
docs_fp = open("docs_wiki.pkl", 'rb')
docs_all = cPickle.load(docs_fp)
docs_test = docs_all[60000:]

# Get 'top' related documents given a word (term)
get_related_documents("music", 5, docs_test)

# Perform document clustering on a set of documents
cluster_similar_documents(docs_test, "root")

# Extract the theme of an article
article = "Mohandas Karamchand Gandhi[14] was born on 2 October 1869[1] to a Hindu Modh Baniya family[15] in Porbandar (also known as Sudamapuri), a coastal town on the Kathiawar Peninsula and then part of the small princely state of Porbandar in the Kathiawar Agency of the Indian Empire. His father, Karamchand Uttamchand Gandhi (1822–1885), served as the diwan (chief minister) of Porbandar state.[16] Although he only had an elementary education and had previously been a clerk in the state administration, Karamchand proved a capable chief minister.[17] During his tenure, Karamchand married four times. His first two wives died young, after each had given birth to a daughter, and his third marriage was childless. In 1857, Karamchand sought his third wife's permission to remarry; that year, he married Putlibai (1844–1891), who also came from Junagadh,[18] and was from a Pranami Vaishnava family.[19][20][21][22] Karamchand and Putlibai had three children over the ensuing decade, a son, Laxmidas (c. 1860 – March 1914), a daughter, Raliatbehn (1862–1960) and another son, Karsandas (c. 1866–1913)"
print "For the given article :", "\n"
print "Theme -> ", get_theme(article)
Once the program is executed, one will find the following results:
------------------- 5 top articles related to music -----------------------

Best Friend is a song by Brandy Norwood The song was released in June The song peaked at on the Hot R B Hip Hop Songs Chart It peaked at on the Billboard Hot Billboard Hot The song was used on Norwood s show Moesha music stub
0.884800194836

Losing Grip is a Single single by Avril Lavigne It is the fourth single from her first studio album Let Go On the Billboard Billboard charts Losing Grip was able to peak at on the Mainstream Top on the Adult Top and at on the Billboard Hot Hot charts
0.781855140359

One of the Boys is the first album by pop music pop singer Katy Perry Four singles were released from it The album s first single I Kissed a Girl was no
0.760228311362

The Edge of Glory is a song by American pop singer Lady Gaga from her third album Born This Way It was released as the third single from the album on May
0.755567022942

Indelibly Stamped is the second studio album of Supertramp A M Records released it in June Indelibly Stamped was Supertramp s first album that was released in the United States Supertramp recorded this album in April and May in London
0.721697470636

For the given article:
Theme ->  Biography
It can be seen that the articles most likely to be related to "music" get printed. For the given article, the theme has been extracted ("Biography" in this example) using the trained LDA model. Also, all 10,000 test simple wiki-articles have been grouped into 50 clusters (sub-directories) under the directory "root".
Final Thoughts
I hope you were able to easily follow the applications developed on top of the trained LDA model. In this post, I have tried to demonstrate the applications that can be implemented once we have discovered the latent topics from a text database. We briefly saw how one can implement document clustering on unlabeled data, search for the most probable related documents given a term, and extract topics/themes from text. I would encourage readers to implement these applications after completing Part-1 and Part-2 of this blog-post series.
The full Python implementation of document clustering, document exploration & theme extraction on the simple-wiki articles dataset can be found on GitHub here. This completes the final step of the blog-post series, i.e. applications related to topic modelling. ML enthusiasts can build similar applications in other areas, e.g. on a corpus of research papers or tweets.
If you liked the post, follow this blog to get updates about upcoming articles. Also, share this article so that it can reach readers who can actually gain from it. Please feel free to discuss anything regarding the post. I would love to hear feedback from you.
Happy machine learning 🙂
Hi,
Nice series of articles. Had fun working through them.
What is this doc_test variable which suddenly appears while writing topics to file? It seems to be undeclared.
fp.write(docs_test[k] + "\n\n" + str(topics[0][1]))
——
Yasir
Glad you liked it!
Please find the following code snippet:
# Load the articles corpus to choose 10,000 files for test purpose
docs_fp = open("docs_wiki.pkl", 'rb')
docs_all = cPickle.load(docs_fp)
docs_test = docs_all[60000:]
docs_test are the test articles taken out from the whole corpus. These articles were not included in training.
Thanks
great, thanks
Very well written, helped a lot as I'm new to this, but part 3 seems a little tricky to understand, even though it executed successfully. Can't understand how to interpret the results.
Glad that it helped !! Thanks.
Let me know the specifics of your query. I would suggest spending some time playing around. You will be able to understand the interpretations as well as the applications.
Thanks Abhijeet for your reply. Actually, the problem is that whatever text I put in article in this section: print("Theme -> ", get_theme(article)), it only returns the theme as Waterbody, unknown or something irrelevant.
Please help.
My query is:
The third part/link/blog of your repository wherein you have created functions :
def get_theme(doc):
def get_related_documents(term, top, corpus):
def cluster_similar_documents(corpus, dirname):
Among these, both of the following functions are working perfectly well:
1. get_related_documents("effect", 5, docs_test)
2. cluster_similar_documents(docs_test, "root")
But the problem comes in the first function –
def get_theme(doc):
    topics = "Electrical_systems_or_Education unknown music unknown Software \
International_event Literature War_or_Church Lingual_or_Research Biology \
Waterbody Wikipedia_or_Icehockey unknown unknown html_tags sports TV_shows \
Terms_and_Services music US_states Timeline Chemistry Germany Location_area \
Film_awards Games US_school unknown Railways Biography Directions_Australlia \
France India_Pakistan Canada_politcs_or_WWE Politics unknown British_Royal_Family \
American_Movies unknown Colors_or_Birds Fauna Chinese_Military unknown unknown \
unknown unknown unknown html_tags US_Govt Music_band".split()
    theme = ""
    cleandoc = clean_doc(doc)
    doc_bow = ldamodel.id2word.doc2bow(cleandoc)
    doc_topics = ldamodel.get_document_topics(doc_bow)  # minimum_probability=0.20)
    if doc_topics:
        doc_topics.sort(key=itemgetter(1), reverse=True)
        theme = topics[doc_topics[0][0]]
        # if theme == "unknown":
        #     theme = topics[doc_topics[1][0]]
    else:
        theme = "unknown"
    return theme
After calling this function:
print(article, "\n")
print("Theme -> ", get_theme(article))
This is returning the theme as Waterbody or something irrelevant.
Please help, I'll be very grateful to you.
Thanks
Astha Kaushik (Data Science Fresher)
Are you getting the same theme/topic (Waterbody) every time, for any article you pass?
You may want to check the word distribution of the Waterbody theme and see whether the input article contains the same words.
In the example, I have given a string about the life of Mahatma Gandhi, so it gives "Biography". Are you not able to reproduce that?
If you are not getting results on some other texts, then you can check the top words (distribution) for the extracted theme and compare them with your input text.
The point is that the model has learnt a word distribution for each of the 50 themes. For any free text, it will give the theme that best matches those word distributions.
Also, if you truly want to test this specific model, try giving texts or paragraphs on some topic which it has learnt. I agree that it won't be generic enough. For much better results you would need to train it on a bigger set of data.
Thanks a lot for replying. For this ("As in the example, I have given a string about the life of Mahatma Gandhi, so it gives 'Biography'. Are you not able to reproduce that?"): no, it returns the theme as 'unknown'.
But I will surely figure something out. Thinking of generating 100 topics and assigning more relevant topic names to them.
Thanks Astha Kaushik
Seems like you are not able to reproduce the results. You may want to check whether you are getting the same topics and word distributions.
This is the distribution which I had got.
https://www.dropbox.com/s/8l7i6r4oc68jau9/topics_with_manual_labels.txt?dl=0
Hi Abhijeet,
Thanks for this post. It was very helpful.
Would like to know how to scale up the training set to get better results.
Hi Rohit,
It depends. This blog-post was written for demonstration purposes using 70–80k small articles. The more data you feed, the more accurate and relevant the topic distributions will be.
If you do the same on actual Wikipedia articles (here, it was Simple Wikipedia), which are both more numerous and larger, then ideally you can get topics with all the relevant words.
Let me know the results if you try it on a larger scale. It will be helpful for people here too.
Great work done, brother, and I would like to appreciate that you explained it in a way that is easy for people to understand.
Thanks man 🙂
Really good article. Helped me a lot.
word_id = self.id2word.doc2bow([word_id])[0][0]
IndexError: list index out of range
Can you suggest some possible reason for this error?