A huge number of text articles are generated every day by publishing houses, blogs, media outlets, etc. This leads to one of the major tasks in natural language processing: effectively managing, searching and categorizing articles according to their subjects or themes. Typical text mining tasks here include text clustering, document similarity and text categorization. In short, we need a way to extract the theme of an article. In text analytics, this is known as “Topic Modelling”. Also, given a topic, our software should be able to find articles that are similar to it. This is known as “Document Similarity”.

Deriving such meaningful information from text documents is the main objective of this blog-post series. I will cover the whole application of topic modelling in 3 blog-posts. The purpose of the series is to build the system from scratch and give our readers an insight into its implementation. This particular post focuses on creating a corpus of Simple Wikipedia articles from the Simple Wikipedia XML dump file. Once the text data (articles) has been retrieved, machine learning techniques can use it for model training in order to discover topics in the text corpus.

There are two main steps in retrieving the text data from the Simple Wikipedia dump:

1. XML parsing of the wiki dump
2. Cleaning of the articles’ text

The Simple Wikipedia is an edition of the online encyclopedia Wikipedia, primarily written in Basic English. The articles on Simple Wikipedia are usually shorter than their English Wikipedia counterparts, presenting only the basic information. It contains over 127,000 content pages for people to search, explore or even edit. We downloaded the free backup XML file in which all the articles are dumped. A sample of 60,000 Simple Wikipedia articles was then randomly selected for building the application. You can download the same backup XML file (used in this blog) from here, or from the index of the simple wiki website.

1. XML Parsing of Wiki Dump

All the information about an article, such as its title, id, timestamp, contributor and text content, lies in the page tag of the XML file. There are more than 100,000 such legitimate pages. A typical article in the wiki XML dump looks like this.

The Document Object Model (tree view) represents this XML snippet like this:


From this, one can see that the article text has to be taken from the text tag in the XML file, which is a child of the revision tag (revision itself being a child of the page tag). We will use the ElementTree XML API to parse the XML file and extract the text portion of each article. The Python code below traverses down the tree to get the content of the text tag. The contents of each article are extracted from the text tag of the corresponding page in each iteration and can be written to separate text files.

import xml.etree.ElementTree as ET
import codecs
import re

tree = ET.parse('simplewiki-20170201-pages-articles-multistream.xml')  
root = tree.getroot()  
path = 'articles-corpus//'  
url  = '{http://www.mediawiki.org/xml/export-0.10/}page'

for i, page in enumerate(root.findall(url)):
    for p in page:
        r_tag = "{http://www.mediawiki.org/xml/export-0.10/}revision"
        if p.tag == r_tag:
            for x in p:
                tag = "{http://www.mediawiki.org/xml/export-0.10/}text"
                if x.tag == tag:
                    text = x.text
                    if text is not None:
                        # Extracting the text portion from the article:
                        # keep everything before the first "==" subheading.
                        # find() returns -1 when there is no subheading, so
                        # only slice when a marker was actually found.
                        end = text.find("==")
                        if end != -1:
                            text = text[:end]

                        # Cleaning of Text (described in Section 2)
                        # Printing the article
                        print(text)
                        print('\n====================================\n')
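Since ET.parse() loads the whole dump into memory, a streaming variant may be worth considering for larger dumps. The following is only a sketch using the standard library's ET.iterparse(); the two-page XML sample is made up to mimic the page/revision/text layout described above, and the names NS, SAMPLE and iter_article_texts are hypothetical, not part of the original program.

```python
import io
import xml.etree.ElementTree as ET

# Namespace prefix used by the export-0.10 dump format (same as in the post).
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

# Made-up two-page sample in the same shape as the dump (not real dump content).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Alpha</title>
    <revision><text>Alpha is the first letter.</text></revision>
  </page>
  <page>
    <title>Empty</title>
    <revision><text></text></revision>
  </page>
</mediawiki>"""


def iter_article_texts(source):
    """Yield (title, text) for each page, handling one page at a time."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title")
            text = elem.findtext(NS + "revision/" + NS + "text")
            if text:  # skips pages whose text tag is missing or empty
                yield title, text
            elem.clear()  # release the page's children once processed


articles = list(iter_article_texts(io.BytesIO(SAMPLE)))
print(articles)
```

For the real dump, the file name can be passed to iter_article_texts() in place of the in-memory sample.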

Also, we are only interested in the introductory text about the title (in the above sample, the title is “Treason”), not its subheadings or other content such as Responsibilities to Protect and References. To do this, we extract the substring from the start of the text up to the position just before the first subheading, which wikitext marks with “==”. The core of it is the Python statement given below:
text = text[:text.find("==")]
Note that find() returns -1 when an article has no subheading, in which case this slice would drop the last character of the text, so that case needs a guard.
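As an alternative way to take the introduction, str.partition() can be used, since it returns the whole string unchanged when the separator is absent. This is a sketch rather than the code used in this post; the function name intro_text and the sample strings are made up for illustration.

```python
def intro_text(wikitext):
    """Return the part of `wikitext` that comes before the first "==" marker.

    str.partition() returns (before, separator, after); when "==" does not
    occur, `before` is the whole string, so nothing is cut off.
    """
    return wikitext.partition("==")[0]


with_heading = "Treason is a crime.\n\n== History ==\nMore text."
without_heading = "A short article with no subheadings"

print(repr(intro_text(with_heading)))
print(repr(intro_text(without_heading)))
```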

The created text article for the above sample page looks like this:


2. Cleaning of Article Text

Data pre-processing (a.k.a. data cleaning) is one of the most significant steps in text analytics. The purpose is to remove any unwanted words or characters which are written for human readability, but won’t contribute to topic modelling in any way.

There are two main steps that need to be done at the word level:

a) Removal of stop words – Stop words like “and”, “if”, “the”, etc. are very common in all English sentences and are not very meaningful in deciding the theme of the article, so these words are removed from the articles.

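b) Lemmatization – It is the process of grouping together the different inflected forms of a word so they can be analysed as a single item. For example, “include”, “includes” and “included” would all be represented as “include”. Unlike stemming (another buzz word in text mining), which strips suffixes by rule and can produce non-words, lemmatization maps each word to a dictionary form, taking its part of speech into account. Note that NLTK's WordNetLemmatizer treats every word as a noun unless a part-of-speech argument is passed, so verb forms such as “included” are only reduced when pos="v" is given.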

The following Python code defines a function clean() for cleaning the text article passed as an argument to it:

from nltk.corpus import stopwords 
from nltk.stem.wordnet import WordNetLemmatizer
import string

stop    = set(stopwords.words('english'))
exclude = set(string.punctuation) 
lemma   = WordNetLemmatizer()

# pass the article text as string "doc"
def clean(doc):

  # remove stop words & punctuation, and lemmatize words
  s_free  = " ".join([i for i in doc.lower().split() if i not in stop])
  p_free  = ''.join(ch for ch in s_free if ch not in exclude)
  lemm    = " ".join(lemma.lemmatize(word) for word in p_free.split())
  words   = lemm.split()

  # only take words which are greater than 2 characters
  cleaned = [word for word in words if len(word) > 2]
  return cleaned
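One detail of clean() is worth a closer look: stop words are filtered before punctuation is stripped, so a stop word with punctuation attached, such as “and,”, can slip through. The sketch below (Python 3) shows the effect of the ordering. It uses a tiny hand-written stop list as a stand-in for NLTK's list and leaves lemmatization out, so STOP, PUNCT_TABLE and filter_tokens are illustrative names only.

```python
import string

# Tiny illustrative stop list; the post uses NLTK's English list instead.
STOP = {"the", "and", "if", "a", "of", "is"}
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def filter_tokens(doc, punctuation_first):
    """Lowercase, drop stop words and punctuation, keep words longer than 2 chars."""
    tokens = doc.lower().split()
    if punctuation_first:
        tokens = [t.translate(PUNCT_TABLE) for t in tokens]
        tokens = [t for t in tokens if t not in STOP]
    else:
        # Same order as clean(): stop words first, punctuation second.
        tokens = [t for t in tokens if t not in STOP]
        tokens = [t.translate(PUNCT_TABLE) for t in tokens]
    return [t for t in tokens if len(t) > 2]


sample = "The cat, and, the dog."
print(filter_tokens(sample, punctuation_first=False))
print(filter_tokens(sample, punctuation_first=True))
```

With the stop words removed first, “and” survives in the output; stripping punctuation first removes it.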

We will plug the above cleaning code into the next blog-post, where the training code of the Latent Dirichlet Allocation (LDA) model will be shown in order to discover hidden topics in the corpus. For now, we are focusing only on creating the wiki corpus of articles.

Especially for Wikipedia articles, one needs to apply several steps to clean the article text, including removal of File attachments, Image attachments, URLs, Infoboxes, XML labels, etc. The following Python code applies regular expressions to match such patterns and remove them. These 30 filters are based on my analysis of the wiki text. There may be several other patterns which have been missed here.

# remove text written between double curly braces
article_txt = re.sub(r"{{.*}}","",article_txt)

# remove file attachments
article_txt = re.sub(r"\[\[File:.*\]\]","",article_txt)

# remove Image attachments
article_txt = re.sub(r"\[\[Image:.*\]\]","",article_txt)

# remove unwanted lines starting from special characters
article_txt = re.sub(r"\n: \'\'.*","",article_txt)
article_txt = re.sub(r"\n!.*","",article_txt)
article_txt = re.sub(r"^:\'\'.*","",article_txt)

# remove non-breaking space symbols
article_txt = re.sub(r"&nbsp","",article_txt)

# remove URLs link
article_txt = re.sub(r"http\S+","",article_txt)

# remove digits from text
article_txt = re.sub(r"\d+","",article_txt)

# remove text written between small braces   
article_txt = re.sub(r"\(.*\)","",article_txt)

# remove sentence which tells category of article
article_txt = re.sub(r"Category:.*","",article_txt)

# remove the sentences inside infobox or taxobox
article_txt = re.sub(r"\| .*","",article_txt)
article_txt = re.sub(r"\n\|.*","",article_txt)
article_txt = re.sub(r"\n \|.*","",article_txt)
article_txt = re.sub(r".* \|\n","",article_txt)
article_txt = re.sub(r".*\|\n","",article_txt)

# remove infobox or taxobox
article_txt = re.sub(r"{{Infobox.*","",article_txt)
article_txt = re.sub(r"{{infobox.*","",article_txt)
article_txt = re.sub(r"{{taxobox.*","",article_txt)
article_txt = re.sub(r"{{Taxobox.*","",article_txt)
article_txt = re.sub(r"{{ Infobox.*","",article_txt)
article_txt = re.sub(r"{{ infobox.*","",article_txt)
article_txt = re.sub(r"{{ taxobox.*","",article_txt)
article_txt = re.sub(r"{{ Taxobox.*","",article_txt)

# remove lines starting from *
article_txt = re.sub(r"\* .*","",article_txt)

# remove text written between angle bracket
article_txt = re.sub(r"<.*>","",article_txt)

# replace new line characters with a space so that words on
# adjacent lines are not joined together
article_txt = re.sub(r"\n"," ",article_txt)

# replace all punctuations with space 
article_txt = re.sub(r"\!|\"|\#|\$|\%|\&|\'|\(|\)|\*|\+|\,|\-|\.|\/|\:|\;|\<|\=|\>|\?|\@|\[|\\|\]|\^|\_|\`|\{|\||\}|\~"," ",article_txt)

# replace consecutive multiple space with single space
article_txt = re.sub(r" +"," ",article_txt)

# replace non-breaking space with regular space 
article_txt = article_txt.replace(u'\xa0', u' ')

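# Writing the clean text in file
# is_ascii() is a helper that is not shown in this post; it is expected to
# return True when the text contains only ASCII characters
if article_txt and len(article_txt) > 150 and is_ascii(article_txt):
    outfile = path + str(i+1) + "_article.txt"
    with codecs.open(outfile, "w", "utf-8") as f:
        f.write(article_txt)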
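One thing to keep in mind about patterns such as {{.*}} and \(.*\) in the filters above: the .* part is greedy, so on a single line holding two templates it also removes whatever lies between them. Whether this matters depends on how often such lines occur in the dump, which is not measured here. The sketch below compares the greedy pattern with a non-greedy variant on a made-up line.

```python
import re

# Made-up one-line sample with two templates around ordinary text.
line = "{{lead}} Treason is a crime. {{stub}}"

# Greedy: matches from the first "{{" to the last "}}" on the line.
greedy = re.sub(r"\{\{.*\}\}", "", line)

# Non-greedy: each "{{...}}" is matched separately.
non_greedy = re.sub(r"\{\{.*?\}\}", "", line)

print(repr(greedy))
print(repr(non_greedy))
```

The greedy version leaves nothing of the line, while the non-greedy version keeps the sentence between the two templates.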

The above snippet of text filters can be applied to the text extracted from the text tag (Figure 1). Finally, we keep only those articles which are longer than 150 characters. We also check for, and write, only those articles which contain only ASCII characters (English characters only).
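The is_ascii() helper used in the writing step is not listed in this post. A minimal sketch of what it might look like, together with the length check, is given below; the implementation and the name should_keep are assumptions for illustration, and the full code linked at the end may do this differently.

```python
def is_ascii(s):
    """Return True when every character in `s` is in the 7-bit ASCII range."""
    return all(ord(ch) < 128 for ch in s)


def should_keep(article_txt, min_length=150):
    """Apply the checks described above: non-empty, long enough, ASCII only."""
    return (
        bool(article_txt)
        and len(article_txt) > min_length
        and is_ascii(article_txt)
    )


print(should_keep("a" * 151))
print(should_keep("caf\u00e9 " * 50))
```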

This completes the first step towards topic modelling, i.e. creating the corpus of articles from Simple Wikipedia. If you have followed this blog up to here, you will be able to create a corpus of around 70,000 articles in the directory “articles-corpus” used in the Python program. I will write about discovering the hidden topics in this corpus in the next blog-post soon. So stay tuned till then!

You can get the full Python code for parsing, cleaning and creating an article corpus (from the simple wiki XML dump file) from the GitHub link here.

If you liked the post, follow this blog to get updates about upcoming articles. Also, share it so that it can reach out to the readers who can actually gain from this. Please feel free to discuss anything regarding the post. I would love to hear feedback from you.

Happy machine learning 🙂