A huge number of text articles are generated every day by publishing houses, blogs, media outlets, and so on. This leads to one of the major tasks in natural language processing: effectively managing, searching and categorizing articles by subject or theme. Typically, these text mining tasks include text clustering, document similarity and text categorization. To do any of this, we need a way to extract the theme of an article. In text analytics, this is known as “Topic Modelling”. Also, given a topic, our software should be able to find the articles that are similar to it. This is known as “Document Similarity”.
Deriving such meaningful information from text documents is the main objective of this blog-post series. I will be covering the whole application of topic modelling in 3 blog-posts. The purpose of the series is to build the system from scratch and give readers an insight into its implementation. This particular post focuses on creating a corpus of Simple Wikipedia articles from the Simple Wikipedia XML dump file. Once the text data (articles) has been retrieved, it can be used by machine learning techniques for model training in order to discover topics from the text corpus.
Retrieving the text data from the Simple Wikipedia dump involves two main steps:
1. XML parsing of the wiki dump
2. Cleaning of the articles’ text
Simple Wikipedia is an edition of the online encyclopedia Wikipedia, primarily written in Basic English. Its articles are usually shorter than their English Wikipedia counterparts, presenting only the basic information. It contains over 127,000 content pages for people to search, explore or even edit. We downloaded the free backup XML file into which all the articles are dumped, and then randomly selected a sample of 60,000 Simple Wikipedia articles for building the application. You can download the same backup XML file (used in this blog) from here, or from the index of the Simple Wiki website.
1. XML Parsing of Wiki Dump
All the information about an article, such as its title, id, timestamp, contributor and text content, lies in the page tag of the XML file. There are more than 100,000 such legitimate pages. A typical article in the wiki XML dump looks like this.
The Document Object Model (tree view) represents this XML snippet like this:
Seeing all this, one can observe that we have to get the article text from the text tag in the XML file, which is one of the children of the revision tag (revision itself being a child of the page tag). We will use the ElementTree XML API for parsing the XML file and extracting the text portion of the article. The Python code below traverses down the tree to get the content of the text tag. The contents of each article are extracted from the text tag of the corresponding page, one page per iteration, and can be written to separate text files.
import xml.etree.ElementTree as ET
import codecs
import re

tree = ET.parse('simplewiki-20170201-pages-articles-multistream.xml')
root = tree.getroot()
path = 'articles-corpus//'
url = '{http://www.mediawiki.org/xml/export-0.10/}page'

for i, page in enumerate(root.findall(url)):
    for p in page:
        r_tag = "{http://www.mediawiki.org/xml/export-0.10/}revision"
        if p.tag == r_tag:
            for x in p:
                tag = "{http://www.mediawiki.org/xml/export-0.10/}text"
                if x.tag == tag:
                    text = x.text
                    if text is not None:
                        # Extracting the text portion from the article
                        text = text[:text.find("==")]

                        # Cleaning of Text (described in Section 2)

                        # Printing the article
                        print(text)
                        print('\n====================================\n')
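The nested loops above can also be written as a single namespaced path lookup. The sketch below is only an illustration under stated assumptions, not the code used to build the corpus: it parses a tiny hand-written XML string (the sample content is made up) with the same page > revision > text nesting, and the names NS, SAMPLE and page_texts are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by the dump in this post (export-0.10)
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

# Tiny made-up document with the same page > revision > text nesting
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Treason</title>
    <revision><text>Treason is a crime.\n== History ==\nDetails.</text></revision>
  </page>
  <page>
    <title>Empty</title>
    <revision><text /></revision>
  </page>
</mediawiki>"""


def page_texts(root):
    """Yield (title, text) for every page that has non-empty text."""
    for page in root.findall(NS + "page"):
        title = page.findtext(NS + "title")
        # A path lookup replaces the two inner loops over child tags
        text = page.findtext(NS + "revision/" + NS + "text")
        if text:
            yield title, text


root = ET.fromstring(SAMPLE)
for title, text in page_texts(root):
    print(title, "->", len(text), "characters")
```

Pages whose text tag is missing or empty are skipped, which plays the same role as the None check in the loop version.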
Also, we are only interested in the introductory text about the title (in the sample above, the title is “Treason”), not its subheadings or other contents such as Responsibilities to Protect and References. To do this, we extract the substring from the starting index up to the index just before the start of the first subheading. It is implemented by the Python statement given below:
text = text[:text.find("==")]
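One caveat with this slicing: str.find returns -1 when the article has no subheading at all, so text[:-1] silently drops the last character instead of keeping the whole text. A minimal sketch of a guarded version follows; the function name intro_text is hypothetical and not part of the original program.

```python
def intro_text(text):
    """Return the part of `text` before the first "==" heading marker.

    Hypothetical helper (not from the original post): if no "==" is
    present, the whole text is returned unchanged.
    """
    cut = text.find("==")
    # str.find returns -1 when the marker is absent
    return text if cut == -1 else text[:cut]


sample = "Treason is a crime against one's country.\n== History ==\nMore text."
print(intro_text(sample))
print(intro_text("A short page without any heading."))
```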
The created text article for the above sample page looks like this:
2. Cleaning of Article Text
Data pre-processing (a.k.a. data cleaning) is one of the most significant steps in text analytics. The purpose is to remove unwanted words or characters that are written for human readability but won’t contribute to topic modelling in any way.
There are mainly two steps that need to be done at the word level:
a) Removal of stop words – Stop words like “and”, “if”, “the”, etc. are very common in all English sentences and are not very meaningful in deciding the theme of the article, so these words are removed from the articles.
b) Lemmatization – It is the process of grouping together the different inflected forms of a word so they can be analysed as a single item. For example, “include”, “includes” and “included” would all be represented as “include”. Lemmatization takes the meaning of the word into account, as opposed to stemming (another buzzword in text mining), which only strips word endings.
The following Python code defines a function clean() for cleaning the text article passed as an argument to it:
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string

stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()

# pass the article text as string "doc"
def clean(doc):
    # remove stop words & punctuation, and lemmatize words
    s_free = " ".join([i for i in doc.lower().split() if i not in stop])
    p_free = ''.join(ch for ch in s_free if ch not in exclude)
    lemm = " ".join(lemma.lemmatize(word) for word in p_free.split())
    words = lemm.split()
    # only take words which are greater than 2 characters
    cleaned = [word for word in words if len(word) > 2]
    return cleaned
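One detail worth noting in clean(): stop words are filtered before punctuation is stripped, so a token that still carries punctuation, such as a quoted “the”, is not recognised as a stop word and survives. The sketch below illustrates the effect using only the standard library. TOY_STOP is a tiny made-up stop list standing in for NLTK's, lemmatization is left out, and all function names are hypothetical.

```python
import string

# Tiny stand-in for NLTK's English stop-word list (assumption for the demo)
TOY_STOP = {"the", "and", "if", "is", "a", "of"}
PUNCT = set(string.punctuation)


def strip_punct(token):
    """Drop every punctuation character from a token."""
    return "".join(ch for ch in token if ch not in PUNCT)


def stop_then_punct(doc):
    """Same order as clean(): stop words first, punctuation second."""
    kept = [t for t in doc.lower().split() if t not in TOY_STOP]
    words = [strip_punct(t) for t in kept]
    return [w for w in words if len(w) > 2]


def punct_then_stop(doc):
    """Alternative order: punctuation first, stop words second."""
    words = [strip_punct(t) for t in doc.lower().split()]
    return [w for w in words if w not in TOY_STOP and len(w) > 2]


doc = 'He said "the law" and left.'
print(stop_then_punct(doc))   # "the" slips through because of its quote mark
print(punct_then_stop(doc))
```

Whether this matters depends on the corpus: the wiki filters in this post already replace punctuation with spaces before the text is written to file, so the effect should be small for text that has gone through them.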
We will plug the above cleaning code into the training code of the Latent Dirichlet Allocation (LDA) model in the next blog-post, where it will be used to discover hidden topics from the corpus. As of now, we are focusing only on creating the wiki corpus of articles.
Especially for Wikipedia articles, several extra steps are needed to clean the article text, including removal of file attachments, image attachments, URLs, infoboxes, XML labels, etc. The following Python code applies regular expressions to match such patterns and remove them. These 30 filters are based on my analysis of the wiki text. There may be several other patterns that have been missed here.
# remove text written between double curly braces
article_txt = re.sub(r"{{.*}}", "", article_txt)

# remove file attachments
article_txt = re.sub(r"\[\[File:.*\]\]", "", article_txt)

# remove Image attachments
article_txt = re.sub(r"\[\[Image:.*\]\]", "", article_txt)

# remove unwanted lines starting from special characters
article_txt = re.sub(r"\n: \'\'.*", "", article_txt)
article_txt = re.sub(r"\n!.*", "", article_txt)
article_txt = re.sub(r"^:\'\'.*", "", article_txt)

# remove non-breaking space symbols
article_txt = re.sub(r"&nbsp;", "", article_txt)

# remove URLs link
article_txt = re.sub(r"http\S+", "", article_txt)

# remove digits from text
article_txt = re.sub(r"\d+", "", article_txt)

# remove text written between small braces
article_txt = re.sub(r"\(.*\)", "", article_txt)

# remove sentence which tells category of article
article_txt = re.sub(r"Category:.*", "", article_txt)

# remove the sentences inside infobox or taxobox
article_txt = re.sub(r"\| .*", "", article_txt)
article_txt = re.sub(r"\n\|.*", "", article_txt)
article_txt = re.sub(r"\n \|.*", "", article_txt)
article_txt = re.sub(r".* \|\n", "", article_txt)
article_txt = re.sub(r".*\|\n", "", article_txt)

# remove infobox or taxobox
article_txt = re.sub(r"{{Infobox.*", "", article_txt)
article_txt = re.sub(r"{{infobox.*", "", article_txt)
article_txt = re.sub(r"{{taxobox.*", "", article_txt)
article_txt = re.sub(r"{{Taxobox.*", "", article_txt)
article_txt = re.sub(r"{{ Infobox.*", "", article_txt)
article_txt = re.sub(r"{{ infobox.*", "", article_txt)
article_txt = re.sub(r"{{ taxobox.*", "", article_txt)
article_txt = re.sub(r"{{ Taxobox.*", "", article_txt)

# remove lines starting from *
article_txt = re.sub(r"\* .*", "", article_txt)

# remove text written between angle bracket
# (the pattern is missing from this listing; "<[^>]*>" is an assumed stand-in)
article_txt = re.sub(r"<[^>]*>", "", article_txt)

# remove new line character
article_txt = re.sub(r"\n", "", article_txt)

# replace all punctuations with space
# (the "<", "=" and ">" entries are missing from this listing and are assumed)
article_txt = re.sub(r"[!\"#$%&'()*+,\-./:;<=>?@\[\\\]^_`{|}~]", " ", article_txt)

# replace consecutive multiple space with single space
article_txt = re.sub(r" +", " ", article_txt)

# replace non-breaking space with regular space
article_txt = article_txt.replace(u'\xa0', u' ')

# Writing the clean text in file
# is_ascii() is a helper that is not shown in this snippet
if len(article_txt) > 150 and is_ascii(article_txt):
    outfile = path + str(i+1) + "_article.txt"
    with codecs.open(outfile, "w", "utf-8") as f:
        f.write(article_txt)
The above code snippet of text filters can be applied to the text extracted from the text tag (Figure 1). Finally, we keep only those articles that are longer than 150 characters. We also check that the text contains only ASCII characters (English characters only) before writing it to a file.
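The is_ascii() check is called in the filter code but its definition is not shown in this post. A minimal sketch of what such a helper might look like is given below; the body is an assumption, and the version in the full program may differ.

```python
def is_ascii(text):
    """Return True when every character in `text` is a 7-bit ASCII character.

    Assumed implementation: the helper is called in the filter code, but
    its body is not shown in the post.
    """
    return all(ord(ch) < 128 for ch in text)


print(is_ascii("Treason is a crime"))
print(is_ascii("na\u00efve caf\u00e9"))
```

With a check like this, a single accented letter or typographic quote is enough to exclude an otherwise English article, so it is a fairly strict filter.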
This completes the first step towards topic modelling, i.e. creating the corpus of articles from Simple Wikipedia. If you have followed this post up to here, you will be able to create a corpus of around 70,000 articles in the directory “articles-corpus” used in the Python program. I will soon write about discovering the hidden topics in this corpus in the next blog-post. So stay tuned till then!
You can get the full Python code for parsing, cleaning and creating an article corpus (from the Simple Wiki XML dump file) from the GitHub link here.
If you liked the post, follow this blog to get updates about upcoming articles. Also, share it so that it can reach out to the readers who can actually gain from this. Please feel free to discuss anything regarding the post. I would love to hear feedback from you.
Happy machine learning 🙂
Thank you for the article, great help.
Glad you liked.
THANK YOU FOR THE DETAIL INFO…CAN YOU HELP ME IN OBTAINING BIOMEDICAL ARTICLES FROM PUBMED
Hi Abhijeet, sorry to disturb you again.
I tried to extract the whole Wikipedia dump, which is around 63 GB after extracting, but my system could not take the load. Even after upgrading my system (RAM: 16 GB, 120 GB SSD, i5 processor), I still can’t run my program. It gives a MemoryError.
Please help me.
Hi Astha,
Are you using the Simple Wikipedia dump or the full Wikipedia dump?
It seems like you are using the full Wikipedia dump.
Is this a single XML file of 63 GB?
Can you elaborate on your situation?
I hope I’m able to explain myself well.
enwiki-latest-pages-articles.xml.bz2 (14.4 GB) (it’s a Wikipedia dump)
After unzipping, its size is 64.7 GB.
So when I extract it with your code, it throws a memory error.
Please suggest something that works for the above Wikipedia dump.
Thanks.
Oh, it’s huge.
So basically you may need to use a package that can read/parse it serially. That means it should load only a part of the file into RAM at a time.
There are ways to do that, like this: http://boscoh.com/programming/reading-xml-serially.html
But if you are doing this for the first time, you can go for the Simple Wikipedia dump, which is small.
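A minimal sketch of such serial parsing, using iterparse from the same ElementTree API: it handles one page at a time and clears each page afterwards, which keeps memory use far lower than loading the whole tree with ET.parse. It is demonstrated on a small in-memory sample rather than a real dump. The names local_name and iter_page_texts are hypothetical, and whether the full English dump uses the same tag layout is an assumption.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix so any export version matches."""
    return tag.rsplit("}", 1)[-1]


def iter_page_texts(source):
    """Yield article text page by page from a file path or file object."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        for node in elem.iter():
            if local_name(node.tag) == "text" and node.text:
                yield node.text
        # Free the finished page's subtree before reading the next one
        elem.clear()


SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>A</title><revision><text>First article.</text></revision></page>
  <page><title>B</title><revision><text>Second article.</text></revision></page>
</mediawiki>"""

for text in iter_page_texts(io.BytesIO(SAMPLE)):
    print(text)
```

For a real dump, the path of the XML file would be passed instead of the BytesIO object, and each yielded text could go through the same filters as in the post.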
Hi Abhijeet, thanks for replying.
I tried it on the Simple Wikipedia dump and it worked fine, so I wanted to try it on the large Wikipedia dump. I followed that link and tried to extract it. It has been more than 24 hours and it is still running.
Please tell me if it should take this much time.
Hi,
Firstly,
I do not know the tree structure of the large wiki dump. The XML parser written in this post is for Simple Wiki, so it may not work.
Secondly,
If it is working fine, you should be able to see the output in the directory where the program writes the text articles as separate files.
If it really has been running for 24 hours, it would have generated lakhs of files.