This blog post is the second in a series covering “Topic Modelling” on Simple Wikipedia articles. Before reading it, I would suggest reading our first article here. In that first step towards topic modeling, we created a corpus of around 70,000 articles from Simple Wikipedia in the directory “articles-corpus”.
Look at the featured image above: these are some of the topics (word distributions) that are the outcome of the experiment undertaken in this post. Let’s get started with discovering topics from the corpus of wiki articles. We will be using an unsupervised machine learning technique, Latent Dirichlet Allocation (LDA), which automatically groups words that tend to occur together into mixtures, each forming a topic or theme. For such a huge corpus, we have no information about the categories to which the articles belong. This makes it an unsupervised problem: we do not know the labels/classes/categories of the data and aim to find the groups or clusters within the population. With that said, here are the steps we have to perform in order to discover the topics hidden in the 60,000 articles that serve as training data:
- Pre-processing and training corpus creation
- Building dictionary
- Feature extraction
- LDA model training
Later in this blog post, I will discuss how to interpret the results (the discovered topics) that come out of the training process. For the installations required for this application, you can follow this link.
1. Preprocessing & Training data preparation.
As discussed in Part-1, we need to remove the stop words from the articles because they do not contribute to the theme of the article’s content. Similarly, stemming or lemmatization is an effective way to treat the various inflected forms of a word as a single word, as they essentially mean the same thing. I would encourage you to go through the previous post (Part-1) if the above sentences do not make sense to you.
    import os
    import random
    import codecs
    import string
    try:
        import cPickle  # Python 2
    except ImportError:
        import pickle as cPickle  # Python 3
    from nltk.corpus import stopwords
    from nltk.stem.wordnet import WordNetLemmatizer

    # Function to remove stop words from sentences & lemmatize words.
    def clean(doc):
        stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
        normalized = " ".join(lemma.lemmatize(word, 'v') for word in stop_free.split())
        x = normalized.split()
        # Drop tokens that are only one or two characters long
        y = [s for s in x if len(s) > 2]
        return y

    # Remember this folder contains 72,000 articles extracted in Part-1 (previous post)
    corpus_path = "articles-corpus/"
    article_paths = [os.path.join(corpus_path, p) for p in os.listdir(corpus_path)]

    # Read contents of all the articles in a list "doc_complete"
    doc_complete = []
    for path in article_paths:
        with codecs.open(path, 'r', 'utf-8') as fp:
            doc_complete.append(fp.read())

    # Randomly sample 70000 articles from the corpus created from wiki_parser.py
    docs_all = random.sample(doc_complete, 70000)
    with open("docs_wiki.pkl", 'wb') as docs:
        cPickle.dump(docs_all, docs)

    # Use 60000 articles for training.
    docs_train = docs_all[:60000]

    # Cleaning all the 60,000 simplewiki articles
    stop = set(stopwords.words('english'))
    exclude = set(string.punctuation)  # defined for reference; clean() does not use it
    lemma = WordNetLemmatizer()
    doc_clean = [clean(doc) for doc in docs_train]
In the above code, we read all the articles into a list, randomly sample 70,000 of them, and use the first 60,000 of that sample as training data. The remaining 10,000 articles are kept for testing (document clustering/categorization) in Part-3. The training articles are then cleaned by removing stop words and passing each word through “WordNetLemmatizer”. As a result, we get cleaned articles on which we can build the dictionary and train the LDA model for topic modelling.
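To make the sampling and splitting step concrete, here is a minimal sketch on a toy list. The `articles` list and the fixed seed are assumptions for illustration only; the post itself works on the real article texts and does not seed the random generator.

```python
import random

# Hypothetical stand-in for the list of article texts read from disk.
articles = ["article %d" % i for i in range(100)]

random.seed(42)  # assumption: fixed seed so the split can be reproduced
sampled = random.sample(articles, 70)       # analogous to sampling 70,000 articles
train, test = sampled[:60], sampled[60:]    # analogous to the 60,000 / 10,000 split

print(len(train), len(test))
```

Because `random.sample` draws without replacement, no article can end up in both the training and the test part.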
2. Building word dictionary
In this step, we need to build the vocabulary of the corpus, in which every unique word of the article corpus is given an ID and its frequency count is also stored. The following Python code creates the dictionary from the 60,000 randomly sampled cleaned articles. You may note that we are using the gensim library for building the dictionary. In gensim, the words are referred to as “tokens” and the index of each word in the dictionary is called “id”.
    from gensim import corpora

    # Creating term dictionary of corpus, where each unique term is assigned an index.
    dictionary = corpora.Dictionary(doc_clean)

    # Filter terms which occur in fewer than 4 articles or in more than 40% of the articles
    dictionary.filter_extremes(no_below=4, no_above=0.4)

    # List of few words which are removed from dictionary as they are content neutral
    stoplist = set('also use make people know many call include part find become like mean often different \
    usually take wikt come give well get since type list say change see refer actually iii \
    aisne kinds pas ask would way something need things want every str'.split())
    stop_ids = [dictionary.token2id[stopword] for stopword in stoplist if stopword in dictionary.token2id]
    dictionary.filter_tokens(stop_ids)
Also, it can be seen that there are 2 additional steps performed after creating the dictionary:
- All the tokens in the dictionary which have occurred either in fewer than 4 articles or in more than 40% of the articles are removed from the dictionary, as these words will not contribute to the various themes or topics.
- After printing the most frequent words of the dictionary, we found that a few mostly content-neutral words were also present. Such words may lead to a “word distribution” (topic) which is neutral and does not capture any theme or content. We made a list of these words and filtered them all out.
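The document-frequency rule behind the first step can be sketched in plain Python. This is only an illustration of the idea on a toy corpus, not gensim's actual implementation; the function name `filter_vocabulary` and the toy documents are made up for this example.

```python
from collections import Counter

def filter_vocabulary(docs, no_below=2, no_above=0.5):
    # Keep tokens that appear in at least `no_below` documents and in at most
    # a fraction `no_above` of all documents.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = no_above * len(docs)
    return sorted(t for t, df in doc_freq.items() if no_below <= df <= max_docs)

toy_docs = [
    ["state", "river", "bank"],
    ["state", "music", "band"],
    ["state", "river", "music"],
    ["state", "war", "army"],
]
vocab = filter_vocabulary(toy_docs, no_below=2, no_above=0.5)
print(vocab)  # ['music', 'river']
```

Here “state” is dropped because it appears in every document, and the words that appear in only one document are dropped as too rare.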
Once you have built the dictionary, you may find the most frequent words with their respective frequencies like this:
    words frequency
    [(u'state', 10294), (u'one', 9451), (u'unite', 9213), (u'first', 8511), (u'american', 8383), (u'name', 6933), (u'play', 6043), (u'new', 5701), (u'bear', 5624), (u'two', 5614), (u'time', 5523), (u'world', 4949)]
    ids
    [22871, 579, 19641, 3768, 2573, 18650, 19284, 6702, 24598, 17353, 20208, 4284]
Each word is also given a unique id in the vocabulary (dictionary).
3. Feature Extraction (Bag of Words)
Histograms of words are the features used for text representation. In general, we first build the vocabulary of the article corpus and then generate a word count vector for each article, which holds the frequency of every vocabulary word in that particular article. Most of the entries will be zero, as a single article won’t contain all the words in the vocabulary. For example, suppose we have 500 words in the vocabulary. Each word count vector will then contain the frequencies of these 500 vocabulary words in a particular wiki article. Suppose that the text in an article was “Get the work done, work done”. A fixed-length encoding will be generated as [0,0,0,0,0,…….0,0,2,0,0,0,……,0,0,1,0,0,…0,0,1,0,0,……2,0,0,0,0,0]. Here, the four non-zero word counts sit at the 296th, 359th, 415th and 495th indices of the 500-length word count vector, and the rest are zero.
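The same idea can be tried on a much smaller scale. The sketch below assumes a made-up six-word vocabulary instead of 500 words and builds the fixed-length count vector for the example sentence.

```python
from collections import Counter
import re

# Hypothetical toy vocabulary; the real one would hold every word in the dictionary.
vocabulary = ["done", "get", "play", "the", "work", "world"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def count_vector(text):
    # Lowercase the text, keep alphabetic tokens and count them.
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        if word in word_to_index:
            vector[word_to_index[word]] = count
    return vector

vec = count_vector("Get the work done, work done")
print(vec)  # [2, 1, 0, 1, 2, 0]
```

Words of the vocabulary that do not occur in the text (“play”, “world”) keep a count of zero, which is why real vectors over a large vocabulary are mostly zeros.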
    # Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.
    doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]
The above Python code uses gensim to convert all the 60,000 articles into a document term matrix (word count vector for each document).
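Note that gensim does not store the long, mostly-zero vector: `doc2bow` returns a sparse list of (token id, count) pairs for the tokens that actually occur. The following sketch mimics that output format in plain Python; the `token2id` mapping here is a made-up stand-in for `dictionary.token2id`.

```python
from collections import Counter

# Hypothetical token-to-id mapping, standing in for dictionary.token2id.
token2id = {"done": 0, "get": 1, "play": 2, "work": 3}

def to_sparse_bow(tokens):
    # Count only the tokens present in the mapping and return (id, count) pairs.
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

bow = to_sparse_bow(["get", "work", "done", "work", "done", "unknown"])
print(bow)  # [(0, 2), (1, 1), (3, 2)]
```

The token “unknown” is silently ignored because it is not in the mapping, and “play” does not appear in the output at all because its count is zero.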
4. LDA Model Training
We have finally arrived at the training phase of topic modeling. Latent Dirichlet Allocation is an unsupervised probabilistic model which is used to discover latent themes in a collection of documents. Let’s briefly try to understand how the LDA technique works.
The LDA technique makes the following two assumptions:
1. Articles/Documents are produced from a mixture of topics. Each article belongs to each of the topics to a certain degree (each article is made up of some topic distribution).
2. Each topic is a generative model which generates words of the vocabulary with certain probabilities. Words frequently occurring together will have more probability (Each topic is made of some word distribution).
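These two assumptions describe how LDA imagines a document being written. The sketch below plays that story forward with two made-up topics and hand-picked probabilities (all names and numbers are assumptions for illustration): for every word, first a topic is drawn from the document's topic mixture, then a word is drawn from that topic's word distribution.

```python
import random

random.seed(0)

# Hypothetical topics: each one is a probability distribution over words.
topics = {
    "music": {"band": 0.5, "song": 0.4, "army": 0.1},
    "war": {"army": 0.6, "battle": 0.3, "song": 0.1},
}

def weighted_choice(distribution):
    # Draw one key from a {key: probability} mapping.
    r = random.random()
    cumulative = 0.0
    for key, prob in distribution.items():
        cumulative += prob
        if r < cumulative:
            return key
    return key  # guard against floating-point round-off

def generate_document(topic_mixture, length):
    words = []
    for _ in range(length):
        topic = weighted_choice(topic_mixture)         # pick a topic for this word
        words.append(weighted_choice(topics[topic]))   # pick a word from that topic
    return words

doc = generate_document({"music": 0.8, "war": 0.2}, length=10)
print(doc)
```

Training LDA is the reverse problem: only the words are observed, and the topic mixtures and word distributions have to be inferred.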
So, can you guess the input to this algorithm?
Input is the “document-term matrix” which keeps the histograms of words (word counts) present in each wiki article. The dimensions of the matrix are (M,N), i.e. number of documents * number of words in vocabulary. Documents and articles are interchangeable words here. We also provide K as an input, which is the number of topics that have to be discovered.
What is the output of the Latent Dirichlet Allocation algorithm?
The output of the LDA algorithm is 2 smaller matrices: a document to topic matrix and a topic to word matrix. The Document-Topic matrix is of (M,K) dimensions, where M is the number of articles and K is the number of topics. The Topic-Word matrix is of (K,N) dimensions, where N is the number of words in the vocabulary.
The Document-Topic matrix holds, for each article, the probability distribution over the topics present in it. Similarly, the Topic-Word matrix holds, for each topic, the probability distribution over the words that the topic generates. Both these matrices are initialized randomly and then improved in an iterative process. After repeating the update step a large number of times, you’ll eventually reach an approximately steady state where these distributions seem logically correct.
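To see the “initialize randomly, then improve iteratively” idea in action, here is a toy collapsed Gibbs sampler in plain Python. This is a different inference method from the one gensim uses (gensim's LdaModel is based on online variational Bayes), so treat it purely as an illustrative sketch; the function name, the hyperparameter values and the toy documents are all assumptions.

```python
import random

def gibbs_lda(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, sweeps=200, seed=0):
    # docs is a list of documents, each a list of word ids in range(vocab_size).
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # M x K counts
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # K x N counts
    topic_total = [0] * num_topics
    assignments = []

    # Random initialisation: every word occurrence gets a random topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(sweeps):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                # Remove the current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Weight of topic k: how much the document uses k
                # times how much k uses this word.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta) / (topic_total[k] + beta * vocab_size)
                    for k in range(num_topics)
                ]
                # Sample a new topic in proportion to the weights.
                r = rng.random() * sum(weights)
                z = 0
                while z < num_topics - 1 and r >= weights[z]:
                    r -= weights[z]
                    z += 1
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Normalise the counts into the two probability matrices.
    theta = [[(c + alpha) / (len(doc) + alpha * num_topics) for c in row]
             for row, doc in zip(doc_topic, docs)]
    phi = [[(c + beta) / (topic_total[k] + beta * vocab_size) for c in topic_word[k]]
           for k in range(num_topics)]
    return theta, phi

# Word ids 0-2 and 3-5 never share a document in this toy corpus.
toy_docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 2], [5, 4, 3, 5]]
theta, phi = gibbs_lda(toy_docs, num_topics=2, vocab_size=6)
print(len(theta), len(theta[0]))  # 4 2  (M x K)
print(len(phi), len(phi[0]))      # 2 6  (K x N)
```

Here `theta` plays the role of the Document-Topic matrix and `phi` the role of the Topic-Word matrix; every row of either matrix is a probability distribution that sums to 1.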
The following Python code runs the LDA algorithm using the gensim implementation. Once the training is completed, the model is dumped using the cPickle library for future use and all the 50 topics (learned by the model) are printed.
    try:
        import cPickle  # Python 2
    except ImportError:
        import pickle as cPickle  # Python 3
    from gensim.models.ldamodel import LdaModel as Lda

    # Creating the object for LDA model using gensim library & Training LDA model on the document term matrix.
    ldamodel = Lda(doc_term_matrix, num_topics=50, id2word=dictionary, passes=50, iterations=500)

    # dump LDA model using cPickle for future use
    with open('lda_model_sym_wiki.pkl', 'wb') as ldafile:
        cPickle.dump(ldamodel, ldafile)

    # Print all the 50 topics
    for topic in ldamodel.print_topics(num_topics=50, num_words=10):
        # Recent gensim versions return (topic_id, topic_string) pairs, older ones plain strings
        topic_string = topic[1] if isinstance(topic, tuple) else topic
        words = topic_string.split("+")
        print("%s\n" % words)
This completes the pipeline for topic discovery from 60,000 simple wiki articles. You can very easily combine the Python code snippets from the beginning to the end serially to have full implementation for topic modelling.
For the first time, it was really exciting for me to see how the topics have been formed as a mixture of similar words from the same domain which may speak a lot about the theme of the article. These are some of the interesting learned topics resulting from the fitted model:
|Topic 3||Topic 6||Topic 9||Topic 10||Topic 15||Topic 16||Topic 35||Topic 40|
|Topic 41||Topic 30||Topic 50||Topic 1||Topic 5||Topic 37||Topic 35||Topic 38|
The above table shows 16 of the 50 topics after the model is trained, with the top ten terms listed for each topic. With LDA training, the words within the same topic tend to be similar. Formally speaking, they are highly associated. For example: Topic 1 is about music, Topic 6 is about war, Topic 9 is related to literature, Topic 10 is related to medicine, Topic 16 is about games, Topic 35 is about politics, Topic 40 relates to various colors, Topic 30 relates to the biography of a person, another topic is about IT, Topic 35 is related to chemistry and Topic 38 relates to movies.
There are other topics too which have been generated and can describe the theme of the articles. You can check all the 50 topics along with their word probability distribution from this file.
As we analyze the topics discovered by the LDA model, we see that these topics are basically probabilistic word distributions which can describe a particular theme or content very well. After experimenting a number of times with simple wiki articles, I came to the conclusion that the words in the modeled topics may not be perfectly similar but are definitely associated.
A few of the topics generated from unsupervised training are content neutral. For example:
(Topic 44, u'0.031*"women" + 0.020*"drug" + 0.019*"blood" + 0.017*"men" + 0.015*"sing" + 0.014*"sex" + 0.014*"god" + 0.014*"feel" + 0.013*"nuclear" + 0.013*"child"')
(Topic 45, u'0.040*"art" + 0.029*"paint" + 0.027*"heart" + 0.026*"attack" + 0.020*"oil" + 0.018*"business" + 0.018*"street" + 0.017*"horse" + 0.016*"police" + 0.015*"work"')
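Topic strings like the ones above are convenient to read but awkward to process. As a small sketch (the helper name `parse_topic` is made up for this example), a regular expression can turn such a string into a list of (word, probability) pairs:

```python
import re

def parse_topic(topic_string):
    # Each term looks like 0.040*"art"; capture the weight and the quoted word.
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

topic_45 = '0.040*"art" + 0.029*"paint" + 0.027*"heart" + 0.026*"attack"'
terms = parse_topic(topic_45)
print(terms)
```

With the terms in this form, it becomes easy to sort them, compare topics, or spot content-neutral words across all 50 topics.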
Also, there are some topics in which a few words may seem irrelevant to the theme/content, but if you analyze them properly they are somewhat associated. This happens when one word is used in two different contexts; resolving which context is meant is the problem known as “word-sense disambiguation”. Because two contexts are associated with the same word, the modeled topic also contains two themes jointly. For example:
(Topic 29, u'0.079*"die" + 0.060*"age" + 0.059*"bear" + 0.029*"year" + 0.024*"death" + 0.018*"january" + 0.018*"marry" + 0.016*"years" + 0.016*"cancer" + 0.014*"february"')
(Topic 30, u'0.102*"south" + 0.050*"new" + 0.040*"west" + 0.036*"north" + 0.034*"park" + 0.030*"east" + 0.027*"wales" + 0.026*"division" + 0.020*"coast" + 0.016*"australia"')
(Topic 39, u'0.041*"black" + 0.040*"white" + 0.034*"red" + 0.029*"bird" + 0.027*"blue" + 0.024*"green" + 0.020*"ship" + 0.019*"fly" + 0.019*"brown" + 0.018*"wear"')
In topic 29, the word distribution represents a biography theme. Usually, biography articles talk about the birth, marriage and death of the person. The word “cancer” has occurred within the distribution because it may have been the cause of death in many of the articles. Also, the months “January” and “February” have occurred as they might be present in the biography articles to show the timeline of the person's life. Similarly, if you observe topic 30, the distribution has captured the directions, but as the words “south” and “east” are associated with the state of “New South Wales”, which is on the east coast of Australia, related words like “new”, “wales”, “australia” and “coast” are also present. In topic 39, words like “bird” or “wear” are present in the distribution. The probable cause of this is that colors are often mentioned together with birds and clothing in many wiki articles.
Hope it was an easy task for our readers to follow the blog-post till here. In this post, I have tried to explain the pipeline of the topic discovery process, from preparing the training data to the training of the LDA model. I have also tried to briefly explain the Latent Dirichlet Allocation algorithm to provide an idea of what goes into and what comes out from the LDA model. I would encourage readers to implement this series of blog-posts (see Part-1), and match their outputs with the results shown here (though topics discovered can be different at every run).
There are several factors that you can experiment with in order to get even better word distributions forming the topics:
1. Getting more articles : You can try increasing the number of articles by changing the minimum article length from 150 to 100 characters in Part-1. Also see if you can avoid discarding the articles which contain a few non-ASCII characters. More training data may lead to better topic-word distributions.
2. Preprocessing : By analyzing the word distributions of the generated topics, you may find
- pairs that are always juxtaposed (entities) e.g. “Los Angeles” (topic 20), “New York” (topic 27). These pairs should be combined like Los_Angeles or New_York.
- words that are not properly lemmatized like (germany, german), (chinese, china), (america, americans) etc. Lemmatization of nouns may help. Remember, we only lemmatized verbs.
3. Dictionary : The vocabulary of the corpus can be improved by removing the content neutral words. Iteratively running the whole topic discovery process and analyzing the word distributions (topics) can help in finding content neutral words in the dictionary. Some examples are “ing” (topic 10), “per” (Topic 43).
4. Parameters of LDA : There are two parameters of LDA to look at: alpha and beta. Understanding the mathematics behind the LDA model may help in tuning these parameters. I would encourage readers to do so.
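As a starting point for the last item: alpha is the Dirichlet prior on the per-document topic mixtures and beta is the Dirichlet prior on the per-topic word distributions (gensim's LdaModel exposes them as the `alpha` and `eta` arguments). The sketch below, which uses made-up values and a helper written only for this illustration, draws one 5-topic mixture for a few values of alpha so you can see the effect: small values tend to concentrate the mass on a few topics, large values spread it more evenly.

```python
import random

def sample_dirichlet(alpha, size, rng):
    # A symmetric Dirichlet sample can be built by normalising Gamma(alpha, 1) draws.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(1)
samples = {}
for alpha in (0.1, 1.0, 10.0):
    samples[alpha] = sample_dirichlet(alpha, size=5, rng=rng)
    print(alpha, [round(p, 2) for p in samples[alpha]])
```

The same reasoning applies to beta: a smaller value pushes each topic towards using fewer distinct words.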
The full Python implementation of topic modeling on the simple-wiki articles dataset can be found on Github here. This completes the second step towards topic modeling, i.e. topic discovery from training articles. After this step, you will have a dump of 70,000 randomly sampled cleaned wiki articles and an LDA model which consists of 50 discovered topics. We will need both of them while performing article clustering/categorization in Part-3 of this blog-post series.
I will be writing about clustering the test wiki-articles using the modeled topics in the next blog-post soon. So stay tuned till then!!
If you liked the post, follow this blog to get updates about the upcoming articles. Also, share this article so that it can reach out to the readers who can actually gain from this. Please feel free to discuss anything regarding the post. I would love to hear feedback from you.
Happy machine learning 🙂