A huge number of text articles are generated every day by publishing houses, blogs, media outlets, and so on. This gives rise to one of the major tasks in natural language processing: effectively managing, searching and categorizing articles according to their subjects or themes. Typically, these text mining tasks include text clustering, document similarity and text categorization. Broadly, we need a way to extract the theme of an article; in text analytics, this is known as “Topic Modelling”. Also, given a topic, our software should be able to find articles that are similar to it. This is known as “Document Similarity”.

Deriving such meaningful information from text documents is the main objective of this blog-post series. I will be covering the whole topic modelling application across 3 blog-posts. The purpose of the series is to build the system from scratch and give our readers an insight into its implementation. This particular post focuses on creating a corpus of Simple Wikipedia articles from the dumped Simple Wiki XML file. Once the text data (articles) has been retrieved, it can be used by machine learning techniques to train models that discover topics from the text corpus.

There are mainly two steps in retrieving the text data from the Simple Wikipedia dump:

1. XML parsing of the wiki dump
2. Cleaning of the articles’ text

Simple Wikipedia is an edition of the online encyclopedia Wikipedia, written primarily in Basic English. Its articles are usually shorter than their English Wikipedia counterparts, presenting only the basic information. It contains over 127,000 content pages for people to search, explore or even edit. We downloaded the free backup XML file in which all the articles are dumped, and then randomly selected a sample of 60,000 Simple Wikipedia articles for building the application. You can download the same backup XML file (used in this blog) from here, or from the index page of the Simple Wiki dumps website.
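The backup is distributed as a compressed .xml.bz2 archive. Below is a minimal sketch of decompressing it from Python before parsing; the archive file name is an assumption here, so substitute whichever snapshot you actually downloaded (the decompressed name matches the one used in the parsing code later in this post).

import bz2

# assumed archive name; adjust to the snapshot you downloaded
archive_name = 'simplewiki-20170201-pages-articles-multistream.xml.bz2'
xml_name     = 'simplewiki-20170201-pages-articles-multistream.xml'

with bz2.BZ2File(archive_name) as compressed, open(xml_name, 'wb') as extracted:
    # stream in 1 MB chunks so the whole archive never sits in memory
    for chunk in iter(lambda: compressed.read(1024 * 1024), b''):
        extracted.write(chunk)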

1. XML Parsing of Wiki Dump

All the information about an article (title, id, timestamp, contributor, text content, etc.) lies in the page tag of the XML file. There are more than 100,000 such legitimate pages. A typical article in the dumped wiki XML file looks like this:

Figure 1: A sample article (page element) from the wiki dump XML file
The Document Object Model (tree view) of this XML snippet looks like this:

Figure 2: Tree view of the sample page element

From this structure, one can see that the article text has to be taken from the text tag of the XML file, which is a child of the revision tag (revision itself being a child of the page tag). We will use the ElementTree XML API to parse the XML file and extract the text portion of each article. The Python code below traverses down the tree to the content of the text tag. In each iteration, the contents of an article are extracted from the text tag of the corresponding page and can be written to a separate text file.

import xml.etree.ElementTree as ET
import codecs
import re

tree = ET.parse('simplewiki-20170201-pages-articles-multistream.xml')
root = tree.getroot()
path = 'articles-corpus//'

# namespaced tag names used by the MediaWiki export format
page_tag = '{http://www.mediawiki.org/xml/export-0.10/}page'
rev_tag  = '{http://www.mediawiki.org/xml/export-0.10/}revision'
text_tag = '{http://www.mediawiki.org/xml/export-0.10/}text'

for i, page in enumerate(root.findall(page_tag)):
    for p in page:
        if p.tag == rev_tag:
            for x in p:
                if x.tag == text_tag:
                    text = x.text
                    if text is not None:
                        # keep only the introductory text, before the first subheading
                        text = text[:text.find("==")]

                        # Cleaning of Text (described in Section 2)
                        # Printing the article
                        print(text)
                        print('\n====================================\n')
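One caveat: ET.parse() builds the whole tree in memory, and the uncompressed dump runs to hundreds of megabytes. If that is a problem on your machine, a streaming variant along the following lines (a sketch of an alternative, not part of the original script) visits one page element at a time and frees it afterwards:

import xml.etree.ElementTree as ET

ns = '{http://www.mediawiki.org/xml/export-0.10/}'

# iterparse yields each element as soon as its closing tag has been read
for event, elem in ET.iterparse('simplewiki-20170201-pages-articles-multistream.xml'):
    if elem.tag == ns + 'page':
        text_elem = elem.find(ns + 'revision/' + ns + 'text')
        if text_elem is not None and text_elem.text is not None:
            text = text_elem.text[:text_elem.text.find("==")]
            # clean and write "text" here, exactly as in the loop above
        elem.clear()   # discard the processed page to keep memory usage flat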

Also, we are only interested in the introductory text about the title (in the sample above, the title is “Treason”), not its subheadings or other contents such as Responsibilities to Protect and References. To do this, we extract the substring from the start of the text up to the index just before the first subheading. It is implemented by the Python statement given below:
text = text[:text.find("==")]
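One small caveat: str.find() returns -1 when “==” does not occur, so for an article without any subheading the slice above silently drops the last character. A slightly safer variant (my own guard, not from the original post) is:

heading_pos = text.find("==")
text = text[:heading_pos] if heading_pos != -1 else text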

The created text article for the above sample page looks like this:

Figure 3: The extracted (uncleaned) article text for the sample page

2. Cleaning of Article Text

Data pre-processing (a.k.a. data cleaning) is one of the most significant steps in text analytics. The purpose is to remove any unwanted words or characters which are written for human readability but won’t contribute to topic modelling in any way.

There are mainly two steps that need to be performed at the word level:

a) Removal of stop words – Stop words like “and”, “if”, “the”, etc. are very common in English sentences and are not very meaningful in deciding the theme of an article, so these words have been removed from the articles.

b) Lemmatization – It is the process of grouping together the different inflected forms of a word so they can be analysed as a single item. For example, “include”, “includes” and “included” would all be represented as “include”. The context of the sentence is also preserved in lemmatization, as opposed to stemming (another buzzword in text mining, which does not consider the meaning of the sentence).

The following Python code defines a function clean() for cleaning the text article passed as an argument to it:

from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string

stop    = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma   = WordNetLemmatizer()

# pass the article text as string "doc"
def clean(doc):

  # remove stop words & punctuation, and lemmatize words
  s_free  = " ".join([i for i in doc.lower().split() if i not in stop])
  p_free  = ''.join(ch for ch in s_free if ch not in exclude)
  lemm    = " ".join(lemma.lemmatize(word) for word in p_free.split())
  words   = lemm.split()

  # only take words which are greater than 2 characters
  cleaned = [word for word in words if len(word) > 2]
  return cleaned
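As a quick sanity check, here is roughly how clean() behaves on a made-up sentence. This assumes the NLTK stopwords and wordnet corpora have been downloaded via nltk.download(); the exact output may vary slightly with your NLTK version.

sample = "Dogs are included in the list of domesticated animals."
print(clean(sample))
# expected output, roughly: ['dog', 'included', 'list', 'domesticated', 'animal']
# stop words and punctuation are gone, and plurals are reduced to their lemma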

We will plug the above cleaning code into the next blog-post, where the training code for the Latent Dirichlet Allocation (LDA) model will be shown in order to discover hidden topics from the corpus. For now, we are focusing only on creating the wiki corpus of articles.

Especially for Wikipedia articles, one needs to apply several further steps to clean the article text, including removal of file attachments, image attachments, URLs, infoboxes, XML labels, etc. The following Python code applies regular expressions to match such patterns and remove them. These 30 filters are based on my analysis of the wiki text; there may be other patterns which have been missed here.

# remove text written between double curly braces
article_txt = re.sub(r"{{.*}}","",article_txt)

# remove file attachments
article_txt = re.sub(r"\[\[File:.*\]\]","",article_txt)

# remove Image attachments
article_txt = re.sub(r"\[\[Image:.*\]\]","",article_txt)

# remove unwanted lines starting from special characters
article_txt = re.sub(r"\n: \'\'.*","",article_txt)
article_txt = re.sub(r"\n!.*","",article_txt)
article_txt = re.sub(r"^:\'\'.*","",article_txt)

# remove non-breaking space entities
article_txt = re.sub(r"&nbsp;","",article_txt)

# remove URLs link
article_txt = re.sub(r"http\S+","",article_txt)

# remove digits from text
article_txt = re.sub(r"\d+","",article_txt)

# remove text written between small braces
article_txt = re.sub(r"\(.*\)","",article_txt)

# remove sentence which tells category of article
article_txt = re.sub(r"Category:.*","",article_txt)

# remove the sentences inside infobox or taxobox
article_txt = re.sub(r"\| .*","",article_txt)
article_txt = re.sub(r"\n\|.*","",article_txt)
article_txt = re.sub(r"\n \|.*","",article_txt)
article_txt = re.sub(r".* \|\n","",article_txt)
article_txt = re.sub(r".*\|\n","",article_txt)

# remove infobox or taxobox
article_txt = re.sub(r"{{Infobox.*","",article_txt)
article_txt = re.sub(r"{{infobox.*","",article_txt)
article_txt = re.sub(r"{{taxobox.*","",article_txt)
article_txt = re.sub(r"{{Taxobox.*","",article_txt)
article_txt = re.sub(r"{{ Infobox.*","",article_txt)
article_txt = re.sub(r"{{ infobox.*","",article_txt)
article_txt = re.sub(r"{{ taxobox.*","",article_txt)
article_txt = re.sub(r"{{ Taxobox.*","",article_txt)

# remove lines starting from *
article_txt = re.sub(r"\* .*","",article_txt)

# remove text written between angle brackets
article_txt = re.sub(r"<.*>","",article_txt)

# remove new line character
article_txt = re.sub(r"\n","",article_txt)  

# replace all punctuations with space
article_txt = re.sub(r"\!|\"|\#|\$|\%|\&|\'|\(|\)|\*|\+|\,|\-|\.|\/|\:|\;|\|\?|\@|\[|\\|\]|\^|\_|\`|\{|\||\}|\~"," ",article_txt)

# replace consecutive multiple space with single space
article_txt = re.sub(r" +"," ",article_txt)

# replace non-breaking space with regular space
article_txt = article_txt.replace(u'\xa0', u' ')

# helper (define it once, near the imports): keep only articles
# consisting entirely of ASCII (English) characters
def is_ascii(txt):
    return all(ord(ch) < 128 for ch in txt)

# Writing the clean text to a file
if len(article_txt) > 150 and is_ascii(article_txt):
    outfile = path + str(i+1) + "_article.txt"
    f       = codecs.open(outfile, "w", "utf-8")
    f.write(article_txt)
    f.close()

The above snippet of text filters can be plugged into the text extracted from the text tag (Figure 1); a sketch of how the pieces fit together is given after this paragraph. Finally, we keep only those articles whose length exceeds 150 characters, and we write out only those articles that consist entirely of ASCII characters (English characters only).
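For illustration, one way to do that plugging-in (the function name preprocess_article and the overall structure are my own, not taken from the original script) is to wrap the filters in a helper and call it on the text of every page inside the parsing loop:

def preprocess_article(raw_text):
    # keep only the introductory part, before the first subheading
    heading_pos = raw_text.find("==")
    article_txt = raw_text[:heading_pos] if heading_pos != -1 else raw_text

    # a few of the filters shown above, as an illustration
    article_txt = re.sub(r"{{.*}}", "", article_txt)            # templates in double curly braces
    article_txt = re.sub(r"\[\[File:.*\]\]", "", article_txt)   # file attachments
    article_txt = re.sub(r"http\S+", "", article_txt)           # URLs
    # ... the remaining filters from the snippet above go here ...

    return article_txt.strip()

# inside the parsing loop, for the text of each page:
#     article_txt = preprocess_article(text)
#     if len(article_txt) > 150 and is_ascii(article_txt):
#         write article_txt to path + str(i + 1) + "_article.txt"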

This completes the first step towards topic modelling, i.e. creating the corpus of articles from Simple Wikipedia. Once you have followed this blog till here, you will be able to create a corpus of around 70,000 articles in the “articles-corpus” directory used in the Python program. I will be writing about discovering the hidden topics from the created corpus in the next blog-post soon. So stay tuned till then!

You can get the full Python code for parsing, cleaning and creating an article corpus (from the Simple Wiki XML dump file) from the GitHub link here.

If you liked the post, follow this blog to get updates about upcoming articles. Also, share it so that it reaches readers who can actually gain from it. Please feel free to discuss anything regarding the post; I would love to hear your feedback.

Happy machine learning 🙂