This blog post is the third in a series covering applications of “Topic Modelling” on Simple Wikipedia articles. Before reading it, I would suggest reading our two earlier articles here and here. In the first post, we created a corpus of around 70,000 articles in the directory “articles-corpus”. In the second post, we discovered 50 topics from that corpus using the Latent Dirichlet Allocation (LDA) algorithm.
In this final post of the “Topic Modelling” series, we will look at the following uses of the knowledge acquired (the topics discovered) from LDA training.
- Document Clustering: grouping similar wiki articles into 50 clusters.
- Document Exploration: given a word, searching for related articles.
- Theme Extraction: finding the theme of an article.
As the first step, we will write a function to clean a test article, similar to the one we used before training on the corpus. This pre-processing is required before passing articles to any of the applications mentioned above.
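    def rem_ascii(s):
        # Keep only ASCII characters (code points below 128).
        return "".join([c for c in s if ord(c) < 128])

    def clean_doc(doc):
        # Relies on the globals `stop` (stop-word set) and `lemma`
        # (WordNetLemmatizer) created in the main program.
        doc_ascii = rem_ascii(doc)
        # Lower-case, split on whitespace and drop stop words.
        stop_free = [w for w in doc_ascii.lower().split() if w not in stop]
        # Lemmatize every word as a verb ('v').
        normalized = [lemma.lemmatize(w, 'v') for w in stop_free]
        # Discard tokens shorter than three characters.
        return [w for w in normalized if len(w) > 2]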
In the pre-processing step, we remove non-ASCII characters and stop words, lemmatize the remaining words (treating them as verbs) and drop tokens shorter than three characters. Note that the function shown here splits on whitespace only and does not strip punctuation marks itself.
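To see what these cleaning steps do to a sentence, here is a minimal sketch that runs without NLTK. The names `TOY_STOP`, `TOY_LEMMA` and `toy_clean_doc` are hypothetical stand-ins introduced only for illustration; the post itself uses NLTK's English stop-word list and `WordNetLemmatizer`.

```python
# Toy stand-ins (assumptions): the post uses NLTK's stop-word list and
# WordNetLemmatizer instead of these two small hand-made tables.
TOY_STOP = {"the", "is", "a", "of", "by", "was", "in", "and", "it"}
TOY_LEMMA = {"released": "release", "peaked": "peak"}


def rem_non_ascii(text):
    # Keep only characters whose code point is below 128.
    return "".join(c for c in text if ord(c) < 128)


def toy_clean_doc(doc):
    # Same order of steps as clean_doc: ASCII filter, lower-case,
    # stop-word removal, lemmatization, minimum length of three.
    words = rem_non_ascii(doc).lower().split()
    words = [w for w in words if w not in TOY_STOP]
    words = [TOY_LEMMA.get(w, w) for w in words]
    return [w for w in words if len(w) > 2]


tokens = toy_clean_doc("The song was released in June by Beyonc\u00e9")
print(tokens)  # ['song', 'release', 'june', 'beyonc']
```

Note the last token: dropping a non-ASCII character shortens the word rather than removing it, which is worth keeping in mind for names with accents.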
1. Document Clustering
Document clustering is an unsupervised approach that groups articles according to the topics discovered in the training phase. It takes a corpus of unlabeled articles as input and assigns each one to a group according to the best matched word distribution (topic) generated during training. The steps are as follows.
- Clean all the articles in the input corpus.
- Convert each text article into bag-of-words features using the dictionary of the trained model.
- Extract the best matched topic for each article using the trained LDA model. The gensim implementation provides the get_document_topics() function for this.
- Write the article to the directory belonging to the extracted topic if the minimum probability criterion is satisfied; otherwise put it in the “unknown” directory.
- If the extracted topic (word distribution) is ambiguous, choose the 2nd best matched topic instead (some of the 50 discovered topics are content neutral). The clustering function below does not apply this fallback; the theme extraction function in section 3 does.
    def cluster_similar_documents(corpus, dirname):
        clean_docs = [clean_doc(doc) for doc in corpus]
        test_term = [ldamodel.id2word.doc2bow(doc) for doc in clean_docs]
        doc_topics = ldamodel.get_document_topics(test_term, minimum_probability=0.20)
        for k, topics in enumerate(doc_topics):
            if topics:
                # topics is a list of (topic_id, probability); best match first.
                topics.sort(key=itemgetter(1), reverse=True)
                dir_name = os.path.join(dirname, str(topics[0][0]))
                # Store the article followed by its matched topics.
                text = corpus[k] + "\n\n" + str(topics)
            else:
                # No topic reached the 0.20 probability threshold.
                dir_name = os.path.join(dirname, "unknown")
                text = corpus[k]
            if not os.path.exists(dir_name):
                os.makedirs(dir_name)
            with open(os.path.join(dir_name, str(k) + ".txt"), "w") as fp:
                fp.write(text)
The Python function above follows these steps to cluster a given article corpus. It also takes a parameter dirname, the directory under which it creates the sub-directories containing the clustered articles: one per matched topic (up to 50), plus “unknown”.
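The bag-of-words step deserves a closer look, because it explains why the dictionary of the trained model must be reused: only words that the model saw during training receive an id, and everything else is ignored. The sketch below imitates that behaviour in plain Python. `toy_doc2bow` and the three-word `token2id` mapping are hypothetical and only illustrate the idea; the post itself calls `ldamodel.id2word.doc2bow()`.

```python
from collections import Counter


def toy_doc2bow(tokens, token2id):
    # Count each known token by its id; tokens that are missing from the
    # training dictionary are skipped.
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    # Return (token_id, count) pairs ordered by token id.
    return sorted(counts.items())


# Hypothetical three-word dictionary standing in for the trained one.
token2id = {"album": 0, "release": 1, "song": 2}

bow = toy_doc2bow(["song", "release", "june", "song"], token2id)
print(bow)  # [(1, 1), (2, 2)]
```

Here "june" is not in the dictionary, so it contributes nothing to the features that the LDA model receives.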
2. Document Exploration
Document exploration is another application that can be built on the trained LDA model. Here, given a word or theme, we extract the documents related to it. It is a two-step process:
- Get the best matched topic cluster (highest probability) for the given word.
- Get the “top” most probable related articles from the topic cluster matched in step 1.
get_term_topics() is the function used to get the best matched topic cluster for a given theme/word.
    def get_related_documents(term, top, corpus):
        clean_docs = [clean_doc(doc) for doc in corpus]
        related_docid = []
        test_term = [ldamodel.id2word.doc2bow(doc) for doc in clean_docs]
        doc_topics = ldamodel.get_document_topics(test_term, minimum_probability=0.20)
        term_topics = ldamodel.get_term_topics(term, minimum_probability=0.000001)
        if not term_topics:
            return
        # Topic with the highest probability for the given term.
        best_term_topic = max(term_topics, key=itemgetter(1))[0]
        for k, topics in enumerate(doc_topics):
            if topics:
                topics.sort(key=itemgetter(1), reverse=True)
                # Keep the document if its best topic is the term's best topic.
                if topics[0][0] == best_term_topic:
                    related_docid.append((k, topics[0][1]))
        # Most probable documents first; print only the first `top` of them.
        related_docid.sort(key=itemgetter(1), reverse=True)
        for doc_id, prob in related_docid[:top]:
            print("%s\n\n%s" % (prob, corpus[doc_id]))
The Python function above implements a document exploration system: given a word/theme/topic as input, it prints the “top” most related articles from the Simple Wikipedia test corpus, which is also passed to the function.
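The ranking part of this process can be tried out without a trained model. The sketch below assumes that the best (topic_id, probability) pair of every document is already known and uses `heapq.nlargest` to pick the highest scoring matches. The function name `top_related` and the sample values are made up for illustration.

```python
import heapq
from operator import itemgetter


def top_related(doc_best_topics, term_topic, top):
    # doc_best_topics[i] is the best (topic_id, probability) pair of
    # document i, or None if no topic reached the probability threshold.
    matches = []
    for doc_id, best in enumerate(doc_best_topics):
        if best is not None and best[0] == term_topic:
            matches.append((doc_id, best[1]))
    # The `top` matching documents with the highest probability.
    return heapq.nlargest(top, matches, key=itemgetter(1))


# Made-up values: topic 2 plays the role of the topic matched by the term.
doc_best = [(2, 0.88), (29, 0.60), (2, 0.72), None, (2, 0.78)]
print(top_related(doc_best, term_topic=2, top=2))  # [(0, 0.88), (4, 0.78)]
```

Sorting the full list and slicing it, as the function above does, gives the same result; `nlargest` simply avoids a full sort when only a few documents are requested.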
3. Theme Extraction
Recall that 50 word distributions were discovered in Part 2 of this series. I have manually given a theme name to each of them. You may choose different names depending on how you read the word distributions. If each discovered word distribution really does belong to a particular theme, then theme extraction from articles is another useful application that can be implemented. You can view this file to see the mapping of manually assigned themes to the discovered word distributions.
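Before wiring the theme names to the model, the lookup can be sketched on its own. The function below assumes a list of (topic_id, probability) pairs as returned for one document and a list of labels indexed by topic id. `label_for` and the four-entry `toy_labels` list are hypothetical, and trying only the two best topics is one possible reading of the "2nd best matched topic" rule described earlier.

```python
def label_for(doc_topics, labels, min_prob=0.20):
    # Keep topics above the threshold, most probable first.
    ranked = sorted(
        (tp for tp in doc_topics if tp[1] >= min_prob),
        key=lambda tp: tp[1],
        reverse=True,
    )
    # Try the best topic, then the second best, skipping unnamed ones.
    for topic_id, _prob in ranked[:2]:
        if labels[topic_id] != "unknown":
            return labels[topic_id]
    return "unknown"


# Hypothetical four-topic label list; the post uses 50 labels.
toy_labels = ["unknown", "music", "Biography", "html_tags"]

print(label_for([(0, 0.50), (2, 0.30)], toy_labels))  # Biography
print(label_for([(1, 0.10)], toy_labels))             # unknown
```

In the first call the best topic has no name, so the second best one supplies the theme. In the second call no topic reaches the threshold.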
    def get_theme(doc):
        # Manually assigned theme names; position i names LDA topic i.
        topics = ("Electrical_systems_or_Education unknown music unknown Software "
                  "International_event Literature War_or_Church Lingual_or_Research Biology "
                  "Waterbody Wikipedia_or_Icehockey unknown unknown html_tags sports TV_shows "
                  "Terms_and_Services music US_states Timeline Chemistry Germany Location_area "
                  "Film_awards Games US_school unknown Railways Biography Directions_Australlia "
                  "France India_Pakistan Canada_politcs_or_WWE Politics unknown "
                  "British_Royal_Family American_Movies unknown Colors_or_Birds Fauna "
                  "Chinese_Military unknown unknown unknown unknown unknown html_tags US_Govt "
                  "Music_band").split()
        cleandoc = clean_doc(doc)
        doc_bow = ldamodel.id2word.doc2bow(cleandoc)
        doc_topics = ldamodel.get_document_topics(doc_bow, minimum_probability=0.20)
        if not doc_topics:
            # No topic reached the 0.20 probability threshold.
            return "unknown"
        doc_topics.sort(key=itemgetter(1), reverse=True)
        theme = topics[doc_topics[0][0]]
        # Fall back to the 2nd best topic when the best one has no theme name.
        if theme == "unknown" and len(doc_topics) > 1:
            theme = topics[doc_topics[1][0]]
        return theme
The Python function above extracts the theme of the article passed as its argument. Having written three functions, we will now see how to call them from the main program. The following Python snippet can be executed to run these applications.
    # -*- coding: utf-8 -*-
    try:
        import cPickle as pickle  # Python 2
    except ImportError:
        import pickle             # Python 3
    from nltk.corpus import stopwords
    from nltk.stem.wordnet import WordNetLemmatizer
    from operator import itemgetter
    import os

    # The three functions defined earlier in this post must be defined before the calls below.

    # Initialize WordNetLemmatizer and get the list of English stop words
    stop = set(stopwords.words('english'))
    lemma = WordNetLemmatizer()

    # Load trained LDA model (described in Part-2 of blog-post series)
    with open("lda_model_sym_wiki.pkl", 'rb') as lda_fp:
        ldamodel = pickle.load(lda_fp)

    # Load the articles corpus to choose 10,000 files for test purpose
    with open("docs_wiki.pkl", 'rb') as docs_fp:
        docs_all = pickle.load(docs_fp)
    docs_test = docs_all[60000:]

    # Get 'top' related documents given a word (term)
    get_related_documents("music", 5, docs_test)

    # Perform document clustering given a set of documents
    cluster_similar_documents(docs_test, "root")

    # Extract the theme of an article
    article = "Mohandas Karamchand Gandhi was born on 2 October 1869 to a Hindu Modh Baniya family in Porbandar (also known as Sudamapuri), a coastal town on the Kathiawar Peninsula and then part of the small princely state of Porbandar in the Kathiawar Agency of the Indian Empire. His father, Karamchand Uttamchand Gandhi (1822–1885), served as the diwan (chief minister) of Porbandar state. Although he only had an elementary education and had previously been a clerk in the state administration, Karamchand proved a capable chief minister. During his tenure, Karamchand married four times. His first two wives died young, after each had given birth to a daughter, and his third marriage was childless. In 1857, Karamchand sought his third wife's permission to remarry; that year, he married Putlibai (1844–1891), who also came from Junagadh, and was from a Pranami Vaishnava family. Karamchand and Putlibai had three children over the ensuing decade, a son, Laxmidas (c. 1860 – March 1914), a daughter, Raliatbehn (1862–1960) and another son, Karsandas (c. 1866–1913)"
    print("For the given article :\n")
    print("Theme -> %s" % get_theme(article))
Once the program is executed, it produces the following results:
    ------------------- 5 top articles related to music -----------------------

    Best Friend is a song by Brandy Norwood The song was released in June The song peaked at on the Hot R B Hip Hop Songs Chart It peaked at on the Billboard Hot Billboard Hot The song was used on Norwood s show Moesha music stub
    0.884800194836

    Losing Grip is a Single single by Avril Lavigne It is the fourth single from her first studio album Let Go On the Billboard Billboard charts Losing Grip was able to peak at on the Mainstream Top on the Adult Top and at on the Billboard Hot Hot charts
    0.781855140359

    One of the Boys is the first album by pop music pop singer Katy Perry Four singles were released from it The album s first single I Kissed a Girl was no
    0.760228311362

    The Edge of Glory is a song by American pop singer Lady Gaga from her third album Born This Way It was released as the third single from the album on May
    0.755567022942

    Indelibly Stamped is the second studio album of Supertramp A M Records released it in June Indelibly Stamped was Supertramp s first album that was released in the United States Supertramp recorded this album in April and May in London
    0.721697470636

    For the given article:
    Theme -> Biography
As can be seen, the articles most likely to be related to “music” are printed. For the given article, the theme has been extracted (“Biography” in this example) using the trained LDA model. In addition, all 10,000 Simple Wikipedia test articles have been grouped into 50 clusters (sub-directories) under the directory “root”.
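A quick way to inspect the clustering result is to count how many article files ended up in each sub-directory. The helper below is a small sketch that is not part of the original program: `cluster_sizes` is a hypothetical name, and the demo builds a throwaway directory tree instead of reading the real “root” directory.

```python
import os
import tempfile


def cluster_sizes(root):
    # Map each sub-directory name to the number of .txt files it contains.
    sizes = {}
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name)
        if os.path.isdir(path):
            sizes[name] = len([f for f in os.listdir(path) if f.endswith(".txt")])
    return sizes


# Demo on a temporary tree laid out like the clustering output.
demo_root = tempfile.mkdtemp()
for cluster, n_files in [("2", 3), ("29", 1), ("unknown", 2)]:
    os.makedirs(os.path.join(demo_root, cluster))
    for i in range(n_files):
        with open(os.path.join(demo_root, cluster, "%d.txt" % i), "w") as fp:
            fp.write("placeholder article text")

sizes = cluster_sizes(demo_root)
print(sizes)  # {'2': 3, '29': 1, 'unknown': 2}
```

Running `cluster_sizes("root")` after the main program would show how evenly the test articles are spread across topics and how many fell into “unknown”.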
I hope you were able to follow the applications built on the trained LDA model. In this post, I have tried to demonstrate what can be implemented once the latent topics of a text database have been discovered. We briefly saw how to implement document clustering on unlabeled data, how to search for the most probable related documents given a term, and how to extract topics/themes from texts. I would encourage readers to implement these applications after completing Part 1 and Part 2 of this series.
The full Python implementation of document clustering, document exploration and theme extraction on the Simple Wikipedia articles dataset can be found on the GitHub link here. This completes the final step of the series, i.e. applications of topic modelling. ML enthusiasts can also try similar applications in other areas, for example on a corpus of research papers or tweets.
If you liked the post, follow this blog to get updates about upcoming articles. Also, share this article so that it can reach readers who can actually benefit from it. Please feel free to discuss anything regarding the post. I would love to hear your feedback.
Happy machine learning 🙂