Text classification is the problem of assigning a given text to one of a fixed set of classes (categories). In contrast, text clustering is the task of grouping a set of unlabeled texts so that texts in the same group (called a cluster) are more similar to each other than to those in other clusters.

This blog post implements both tasks using two well-known machine learning algorithms: K-NN for classification and K-Means for clustering. We will walk through the following steps:

1. A simple text cleaning definition.
2. Feature Extraction.
3. Training the models.
4. Testing the models.

For training the K-NN and K-Means models, the following 30 sentences were collected from 3 categories (10 sentences each): Cricket, Artificial Intelligence and Chemistry.

1. Cricket is a bat and ball game played between two teams of eleven players each on a cricket field.
2. Each phase of play is called an innings during which one team bats, attempting to score as many runs as possible.
3. The teams have one or two innings apiece and, when the first innings ends, the teams swap roles for the next innings
4. Before a match begins, the two team captains meet on the pitch for the toss of a coin to determine which team will bat first.
5. Two batsmen and eleven fielders then enter the field and play begins when a member of the fielding team, known as the bowler, delivers the ball.
6. The most common dismissal in cricket match are bowled, when the bowler hits the stumps directly with the ball and dislodges the bails. Batsman gets out.
7. Runs are scored by two main methods: either by hitting the ball hard enough for it to cross the boundary, or by the two batsmen swapping ends.
8. The main objective of each team is to score more runs than their opponents.
9. If the team batting last is all out having scored fewer runs than their opponents, they are said to have "lost by n runs".
10. The role of striker batsman is to prevent the ball from hitting the stumps by using his bat and, simultaneously, to strike it well enough to score runs
11. Artificial intelligence is intelligence exhibited by machines, rather than humans or other animals. 
12. the field of AI research defines itself as the study of "intelligent agents": any device that perceives its environment and takes actions that maximize its chance of success at some goal
13. The overall research goal of artificial intelligence is to create technology that allows computers and machines to function in an intelligent manner.
14. Natural language processing[77] gives machines the ability to read and understand human language and extract intelligence from it.
15. AI researchers developed sophisticated mathematical tools to solve specific subproblems. These tools are truly scientific, in the sense that their results are both measurable and verifiable.
16. An intelligent agent is a system that perceives its environment and takes actions which maximize its chances of success.
17. AI techniques have become an essential part of the technology industry, helping to solve many challenging problems in computer science.
18. Recent advancements in AI, and specifically in machine learning, have contributed to the growth of Autonomous Things such as drones and self-driving cars.
19. AI research was revived by the commercial success of expert systems,[28] a form of AI program that simulated the knowledge and analytical skills of human experts.
20. Advanced statistical techniques (loosely known as deep learning), access to large amounts of data and faster computers enabled advances in machine learning and perception.
21. A compound is a pure chemical substance composed of more than one element and the properties of a compound bear little similarity to those of its elements.
22. Since the properties of an element are mostly determined by its electron configuration, the properties of the elements likewise show recurring patterns or periodic behaviour.
23. The property of inertness of noble gases makes them very suitable in chemicals where reactions are not wanted.
24. The atom is also the smallest entity that can be envisaged to retain the chemical properties of the element, such as electronegativity, ionization potential and preferred oxidation state.
25. The nucleus is made up of positively charged protons and uncharged neutrons (together called nucleons), while the electron cloud consists of negatively charged electrons which orbit the nucleus
26. The atom is the basic unit of chemistry. It consists of a dense core called the atomic nucleus surrounded by a space called the electron cloud.
27. A chemical reaction is a transformation of some substances into one or more different substances.
28. Chemistry is sometimes called the central science because it bridges other natural sciences, including physics, geology and biology.
29. Chemistry includes topics such as the properties of individual atoms and how atoms form chemical bonds to create chemical compounds.
30. Chemistry is a branch of physical science that studies the composition, structure of atoms, properties and change of matter.

A simple step-by-step pipeline for training a machine learning model for text analytics applications looks like this:

Texts ==> Stop words removal ==> Punctuation removal ==> Word Lemmatization ==> Digit removal ==> Feature Extraction (Tf-Idf) ==> Model training

Let's start building the pipeline in Python.

1. A Text cleaning definition.

Data pre-processing (a.k.a. data cleaning) is one of the most significant steps in text analytics. The purpose is to remove any unwanted words or characters which are written for human readability, but won’t contribute to the classification or clustering task in any way.
In general, there are 4 main cleaning steps that need to be done on text sentences:

a) Removal of stop words – Stop words like “and”, “if”, “the”, etc are very common in all English sentences and are not very meaningful in deciding the theme of the article, so these words can be removed from the articles.

b) Removal of punctuation characters – Remove every character that appears in Python's string.punctuation, i.e. !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

c) Lemmatization – It is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item. For example, “include”, “includes” and “included” would all be represented as “include”. Unlike stemming, which simply chops off word endings by rule, lemmatization maps each word to a valid dictionary form (its lemma) and can take the word's part of speech into account. Note that NLTK's WordNetLemmatizer treats every word as a noun unless a part-of-speech tag is passed, so the clean() function below mainly normalizes plurals.

d) Removal of digits from the text sentence.

The following Python code defines a function clean() for cleaning the text article passed as an argument to it:

# Importing libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string
import re
import numpy as np
from collections import Counter

stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()

# Cleaning the text sentences so that punctuation marks, stop words & digits are removed
def clean(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    processed = re.sub(r"\d+","",normalized)
    y = processed.split()
    return y
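
One detail of clean() worth seeing in isolation is the order of its steps: stop words are filtered before punctuation is stripped, so a token such as "and," (with a trailing comma) is not recognized as a stop word and survives as "and". The sketch below reproduces that order using only the standard library. It is a simplified illustration rather than the function above: TINY_STOP is a hypothetical stand-in for NLTK's stop word list, clean_sketch is an illustrative name, and lemmatization is skipped.

```python
import re
import string

# Hypothetical, tiny stand-in for NLTK's English stop word list.
TINY_STOP = {"the", "and", "is", "a", "of", "for", "to"}
PUNCT = set(string.punctuation)


def clean_sketch(doc):
    """Apply the same step order as clean(), minus lemmatization."""
    # 1. Drop stop words (tokens are compared with punctuation still attached).
    stop_free = " ".join(w for w in doc.lower().split() if w not in TINY_STOP)
    # 2. Drop punctuation characters.
    punc_free = "".join(ch for ch in stop_free if ch not in PUNCT)
    # 3. Drop digits.
    digit_free = re.sub(r"\d+", "", punc_free)
    return digit_free.split()


tokens = clean_sketch(
    "The teams have 2 innings apiece and, when the first ends, swap roles."
)
print(tokens)
# ['teams', 'have', 'innings', 'apiece', 'and', 'when', 'first', 'ends', 'swap', 'roles']
```

In this post the leftover words do little harm, because TfidfVectorizer(stop_words='english') in Section 2 filters stop words a second time.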

2. Tf-Idf Feature Extraction.

Term frequency–inverse document frequency (tf-idf) is the most popular and widely used word weighting scheme in text mining. It is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. The tf-idf value increases proportionally to the number of times a word appears in the document (tf), but is offset by the frequency of the word in the whole corpus (idf), which adjusts for the fact that some words appear more frequently in general. I have explained the tf-idf weighting scheme in detail in this blog-post of sentiment analysis application. The following Python code cleans the text sentences using the definition provided in Section 1 and then extracts tf-idf features using the scikit-learn library.

print("There are 10 sentences from each of the following three classes on which K-NN classification and K-means clustering"
      " is performed : \n1. Cricket \n2. Artificial Intelligence \n3. Chemistry")
path = "Sentences.txt"

# Clean every training sentence (one sentence per line in the file)
train_clean_sentences = []
with open(path, 'r') as fp:
    for line in fp:
        line = line.strip()
        cleaned = clean(line)
        cleaned = ' '.join(cleaned)
        train_clean_sentences.append(cleaned)

# Learn the vocabulary and idf weights, and build the tf-idf matrix
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(train_clean_sentences)

# Creating true labels for 30 training sentences
y_train = np.zeros(30)
y_train[10:20] = 1
y_train[20:30] = 2
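
To make the tf-idf weighting described above concrete, here is a minimal pure-Python sketch on three toy documents. It assumes the smoothed formula that scikit-learn documents as its default, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row. The function name tfidf and the toy documents are illustrative only, and the sketch is not a replacement for TfidfVectorizer.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, L2-normalized tf-idf rows) for tokenized documents."""
    n_docs = len(docs)
    vocab = sorted({w for doc in docs for w in doc})
    # Document frequency: number of documents in which each word occurs.
    df = Counter(w for doc in docs for w in set(doc))
    # Smoothed idf (assumed to match scikit-learn's default smooth_idf=True).
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts
        raw = [tf[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows


# Toy, already-cleaned documents (illustrative only).
docs = [
    "cricket bat ball game".split(),
    "cricket team score run".split(),
    "atom electron nucleus".split(),
]
vocab, rows = tfidf(docs)
i_cricket, i_bat = vocab.index("cricket"), vocab.index("bat")
print("cricket:", rows[0][i_cricket], "bat:", rows[0][i_bat])
```

In the first document, "bat" receives a higher weight than "cricket" even though both occur once, because "cricket" also appears in the second document and therefore has a lower idf.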

3. Training the Classification (K-NN) & Clustering (K-Means) models.

As noted earlier, text classification is a supervised learning task, whereas text clustering is an unsupervised one. We investigate two machine learning algorithms here: the K-NN classifier and K-Means clustering.
In k-NN classification, the output is a category membership. A text is classified by a majority vote of its neighbors, with the text being assigned to the class most common among its k nearest neighbors.
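
The majority vote can be sketched in a few lines of plain Python. This is an illustration under simple assumptions (Euclidean distance, unweighted votes, toy 2-D vectors instead of tf-idf vectors), and knn_predict is a hypothetical helper name rather than part of scikit-learn.

```python
import math
from collections import Counter


def knn_predict(train_vectors, train_labels, query, k=3):
    """Label a query vector by majority vote among its k nearest neighbours."""
    distances = [
        (math.dist(vec, query), label)
        for vec, label in zip(train_vectors, train_labels)
    ]
    # Keep the k training points closest to the query.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy 2-D feature vectors (illustrative only).
train_vectors = [(1.0, 0.0), (0.9, 0.1), (0.8, 0.2),
                 (0.0, 1.0), (0.1, 0.9), (0.2, 0.8)]
train_labels = ["Cricket"] * 3 + ["Chemistry"] * 3

print(knn_predict(train_vectors, train_labels, (0.7, 0.3)))  # Cricket
print(knn_predict(train_vectors, train_labels, (0.1, 0.8)))  # Chemistry
```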
The goal of clustering is to determine the intrinsic grouping in a set of unlabeled data (feature vectors). K-Means clustering discovers ‘K’ cluster centers, each of which is the centroid of the data points belonging to that cluster. A test data point (feature vector) is assigned to the cluster whose centroid is at the minimum Euclidean distance from it.
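
The two alternating steps of K-Means (assign each point to its nearest centroid, then move each centroid to the mean of its points) can be sketched as follows. This is a bare-bones illustration that assumes fixed starting centroids and a fixed number of iterations; it leaves out the k-means++ initialization and the repeated restarts that a library implementation offers, and the function names are hypothetical.

```python
import math


def nearest_centroid(point, centroids):
    """Index of the centroid at minimum Euclidean distance from point."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, centroids, n_iter=10):
    """Run plain assignment/update iterations from the given centroids."""
    for _ in range(n_iter):
        # Assignment step: group points by their nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest_centroid(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else old
            for members, old in zip(clusters, centroids)
        ]
    return centroids


# Toy 2-D points (illustrative only).
points = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
centroids = kmeans(points, centroids=[(1.0, 0.0), (0.0, 1.0)])
print(centroids)
print(nearest_centroid((0.8, 0.3), centroids))  # 0
```

Predicting the cluster of an unseen point is just the assignment step applied to the final centroids.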

The following Python code snippet trains both models with the scikit-learn library on the tf-idf features extracted in Section 2.

# Classifying the documents with the K-NN classifier
modelknn = KNeighborsClassifier(n_neighbors=5)
modelknn.fit(X, y_train)

# Clustering the 30 training sentences with the K-Means technique
modelkmeans = KMeans(n_clusters=3, init='k-means++', max_iter=200, n_init=100)
modelkmeans.fit(X)

4. Testing on Unseen Texts.

Once the models have been trained, we demonstrate classification and clustering with these conventional methods by testing them on the following unseen text sentences.

Chemical compounds are used for preparing bombs based on some reactions.
Cricket is a boring game where the batsman only enjoys the game.
Machine learning is a area of Artificial intelligence.

# Predicting it on test data : Testing Phase
test_sentences = ["Chemical compounds are used for preparing bombs based on some reactions",
                  "Cricket is a boring game where the batsman only enjoys the game",
                  "Machine learning is a area of Artificial intelligence"]

# Clean the test sentences exactly as the training sentences were cleaned
test_clean_sentence = []
for test in test_sentences:
    cleaned_test = clean(test)
    cleaned = ' '.join(cleaned_test)
    cleaned = re.sub(r"\d+","",cleaned)
    test_clean_sentence.append(cleaned)

# Reuse the fitted vectorizer so the test features share the training vocabulary
Test = vectorizer.transform(test_clean_sentence)

# Category names indexed by the class ids used in y_train (0, 1, 2)
true_test_labels = ['Cricket','AI','Chemistry']
predicted_labels_knn = modelknn.predict(Test)
predicted_labels_kmeans = modelkmeans.predict(Test)

print("\nBelow 3 sentences will be predicted against the learned neighbourhood and learned clusters:\n1. ",
      test_sentences[0],"\n2. ",test_sentences[1],"\n3. ",test_sentences[2])
print("\n-------------------------------PREDICTIONS BY KNN------------------------------------------")
for sentence, label in zip(test_sentences, predicted_labels_knn):
    print("\n", sentence, ":", true_test_labels[int(label)])

print("\n-------------------------------PREDICTIONS BY K-Means--------------------------------------")
print("\nIndex of Cricket cluster : ",Counter(modelkmeans.labels_[0:10]).most_common(1)[0][0])
print("Index of Artificial Intelligence cluster : ",Counter(modelkmeans.labels_[10:20]).most_common(1)[0][0])
print("Index of Chemistry cluster : ",Counter(modelkmeans.labels_[20:30]).most_common(1)[0][0])

for sentence, cluster in zip(test_sentences, predicted_labels_kmeans):
    print("\n", sentence, ":", cluster)

While testing, we simply follow the same steps as in training. Once the sentences have been cleaned, the tf-idf weighted features are extracted from them using the vocabulary learned during training. The feature vectors are then assigned (predicted) to a category (in the case of classification) or to a group (in the case of clustering).
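
K-Means cluster indices are arbitrary numbers, which is why the code above prints the most common cluster index for each block of ten training sentences. A small sketch of turning those indices into readable names is given below. The cluster ids are made-up values standing in for modelkmeans.labels_, and name_clusters is a hypothetical helper; it names each cluster after the most common known category among its members.

```python
from collections import Counter


def name_clusters(train_cluster_ids, train_names):
    """Map each cluster index to the most common category among its members."""
    members = {}
    for cluster_id, name in zip(train_cluster_ids, train_names):
        members.setdefault(cluster_id, Counter())[name] += 1
    return {cid: counts.most_common(1)[0][0] for cid, counts in members.items()}


# Made-up cluster assignments for nine training sentences (illustrative only).
train_cluster_ids = [2, 2, 2, 0, 0, 2, 1, 1, 1]
train_names = ["Cricket"] * 3 + ["AI"] * 3 + ["Chemistry"] * 3

mapping = name_clusters(train_cluster_ids, train_names)
print(mapping)

# Translate a predicted cluster index into a category name.
print(mapping[0])  # AI
```

Note that this only works here because the training sentences happen to come with known categories; in a purely unsupervised setting the clusters would have to be inspected and named by hand.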

The output of the sequentially combined code from Sections 1, 2, 3 and 4 is given below:

[Image: Text classification and clustering]

One can easily observe that all three sentences are classified as well as clustered into the correct categories.


I hope the implementation of the text classification and clustering tasks was easy to follow. If you have gone through it carefully, you should now understand the basic difference between supervised (K-NN) and unsupervised (K-Means) learning. In supervised learning, the category of the input text can be identified, whereas an unsupervised model can only find the group of similar texts.

The methods presented above are conventional approaches to text classification and clustering. More recently, many methods and text representations based on deep learning have been proposed and have provided state-of-the-art results for the same tasks. A few of the approaches that one can explore after gaining a basic understanding from this blog post are:

1. Word2Vec text representation
2. GloVe vectors text representation
3. FastText n-gram representation
4. Deep learning techniques for classification (Fully Connected, 1-D CNN, LSTM etc.)

You can get the full Python implementation of this blog post from the GitHub link here.

If you liked the post, follow this blog to get updates about upcoming articles. Also, share it so that it can reach out to the readers who can actually gain from this. Please feel free to discuss anything regarding the post. I would love to hear feedback from you.

Happy machine learning 🙂