One of the most frequently encountered tasks in ML applications is classification. From the plethora of classifiers at our disposal, Support Vector Machines (SVMs) are arguably among the most widely used. Yet the intuition behind how they work and their key concepts are seldom understood, and "understanding" is often reduced to learning a few buzzwords such as hyperplanes, kernels, support vectors, etc. This post is an attempt to unravel the mystery that SVMs tend to become for beginners in ML.

### Introduction – Hyperplane and Margin

An SVM performs two-class classification by building a classifier from a training set, making it a supervised algorithm. The training points have features and a class label. An SVM aims to find a separating boundary from which the distances of all the training points are as large as possible. This boundary is what is called the optimal hyperplane, while the distances give an idea of the margin. Notice the word optimal used with hyperplane – it's easy to imagine that there are many boundaries that can separate the data points of the 2 classes, but the fact that the SVM finds the one located as far as possible from the data points of either class makes it optimal. The figure below illustrates this important concept about SVMs.

The points plotted in the figure are feature vectors for 40 speech files belonging to 2 emotion classes – happy and sad (20 each). The feature vectors have been projected from their high dimensional space to a 2D space by applying principal component analysis (PCA) for visualization purposes. You may have many questions after reading this – where these data points come from, why PCA was used, what the 2 axes represent, etc. Don't worry too much. Just understand that we have a dataset of points from 2 classes (happy and sad) which we want to separate. In the figure above, while the pink, cyan and brown lines also separate the data points of these 2 classes, it's the blue line which is the optimal separating hyperplane.

We move ahead by explaining the concept of margin. The margin is the distance between the optimal hyperplane and the training data point closest to it. This distance, when taken on both sides of the hyperplane, forms a region where no data point will ever lie. The SVM intends to make this region as wide as possible, and the optimal hyperplane is selected so as to fulfil this target.

If it were to choose a hyperplane quite close to any of the data points, the margin would be small.

Thus, an SVM builds a classifier by searching for a separating hyperplane which is optimal and maximizes the margin.

It's time to delve into the mathematics behind the above keywords. Firstly, let's clear the air about what a hyperplane is. In the figures above, we saw our separating hyperplane to be a line. Then why did we call it a plane at all, moreover a hyperplane? In essence, a hyperplane is just a fancy word for a generalized plane. It is a point for 1-dimensional data points, a line when data points are of 2 dimensions, a plane in 3 dimensions, and is called a hyperplane when the dimensionality exceeds 3.

The equation of a hyperplane is

$\textbf{w}^T\textbf{x} + b = 0$        (1)

where $\mathbf{w}$ and $\mathbf{x}$ are both vectors. $\textbf{w}^T\textbf{x}$ is actually the dot product of $\mathbf{w}$ and $\mathbf{x}$; thus $\textbf{w}^T\textbf{x}$ and $\textbf{w}\cdot\textbf{x}$ are interchangeable.

Considering the case of 2-dimensional data points, implying every data point $\mathbf{x}$ is of the form $\begin{bmatrix}x\\y \end{bmatrix}$, putting $\mathbf{w}$ = $\begin{bmatrix}-m\\1 \end{bmatrix}$ gives us

$\textbf{w.x} = -mx + y$

$\textbf{w.x} = y - mx$

The right hand side of the above equation is the equation of a line with slope ‘m’. The above formulation is to convince the reader that in 2 dimensions, a hyperplane is essentially a line.

Adding ‘b’ to both sides of the equation

$\textbf{w.x} + b = y - mx + b$

Again, the RHS has the familiar slope-intercept terms of a line with slope 'm'; setting $\textbf{w.x} + b = 0$ recovers the line $y = mx - b$. The equation of a hyperplane written as in (1) (vectorized form) should give you the intuition that points with more than 2 dimensions can also be easily dealt with.
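As a quick sanity check of this correspondence between $\mathbf{w}$ and a line, here is a minimal NumPy sketch with made-up numbers for the slope and intercept:

```python
import numpy as np

# A minimal sketch (hypothetical numbers): in 2D, w = [-m, 1] turns
# w.x + b into y - m*x + b, so w.x + b = 0 is the line y = m*x - b.
m, b = 2.0, 1.0
w = np.array([-m, 1.0])

def evaluate(point):
    # The sign of w.x + b tells us which side of the hyperplane the point lies on.
    return np.dot(w, point) + b

print(evaluate(np.array([1.0, 1.0])))   # on the line y = 2x - 1 -> 0.0
print(evaluate(np.array([0.0, 3.0])))   # above the line -> positive
print(evaluate(np.array([0.0, -3.0])))  # below the line -> negative
```

The sign of $\textbf{w.x} + b$ is exactly what a linear classifier uses to assign a class.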

The vector $\mathbf{w}$ is an important one. From the definition of a hyperplane, $\mathbf{w}$ is the normal to the hyperplane. Keeping this in mind, we will now determine the margin $m$ of a hyperplane (not to be confused with the slope 'm' written earlier). That will give us a very nice understanding of the optimization that an SVM performs.

### Problem Formulation

We consider a dataset containing $n$ points, denoted as $\mathbf{x_i}$, where each $\mathbf{x_i}$ has a label $y_i$ which is either +1 or -1. The labels could be anything, but let's use +1 and -1 for understanding the concept. Consider 3 hyperplanes HP0, HP1 and HP2 as follows

$\textbf{w.x} + b = 0$

$\textbf{w.x} + b = -1$

$\textbf{w.x} + b = +1$

HP1 and HP2 are no ordinary hyperplanes; as per our objective of having a margin devoid of any data points lying inside it, HP1 and HP2 are selected such that for each data point $\mathbf{x_i}$

if $y_i = +1$, then $\mathbf{w}\cdot\mathbf{x_i} + b\geq +1$
or
if $y_i = -1$, then $\mathbf{w}\cdot\mathbf{x_i} + b\leq -1$

By introducing the above constraints, we have ensured that the region between HP1 and HP2 will have no data points in it. These 2 constraints can be concisely written as one as follows

$y_i(\mathbf{w}\cdot\mathbf{x_i} + b ) \geq 1$
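The combined constraint is easy to check numerically. Here is a toy sketch (the values of $\mathbf{w}$, $b$ and the points are made up for illustration):

```python
import numpy as np

# Hypothetical numbers: verify y_i * (w.x_i + b) >= 1 for two labelled 2D points.
w, b = np.array([1.0, 1.0]), -3.0

X = np.array([[3.0, 2.0],    # w.x + b = +2, labelled +1
              [0.5, 0.5]])   # w.x + b = -2, labelled -1
y = np.array([1, -1])

margins = y * (X @ w + b)
print(margins)               # both values are >= 1
print(np.all(margins >= 1))  # True: both points satisfy the constraint
```

Multiplying by the label $y_i$ is what folds the two separate inequalities into one.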

Having selected the above constraint-obeying hyperplanes, next in line is the task of maximizing the distance $m$ (i.e., the margin) between HP1 and HP2. Look at the figure shown below. Let's consider a point $\mathbf{x_1}$ lying on the plane HP1. Notice that HP0 is equidistant from HP1 and HP2. If we had a point $\mathbf{z_1}$ lying on HP2, we could simply calculate the distance between $\mathbf{x_1}$ and $\mathbf{z_1}$ and that would be all. You might think that $\mathbf{z_1}$ can be obtained by simply adding $m$ to $\mathbf{x_1}$, but remember that while $\mathbf{x_1}$ is a vector, $m$ is a scalar, and thus the two can't be added. By looking at the figure below, let's think about what we need in order to get $\mathbf{z_1}$.

Point $\mathbf{x_1}$ can also be seen as a vector from the origin to $\mathbf{x_1}$ (as shown in the figure below). Similarly, we can have a vector from the origin to $\mathbf{z_1}$. From the discussion above, we know that we require a new vector with magnitude $m$, the distance between HP1 and HP2. Let's call this vector $\mathbf{k}$. Being a vector, $\mathbf{k}$ also requires a direction, and we make $\mathbf{k}$ have the same direction as $\mathbf{w}$, the normal to the hyperplanes. Then $\mathbf{k}$, on addition to $\mathbf{x_1}$, will give us a point that lies on HP2, as shown below (triangle law of vector addition).

Direction of $\mathbf{w} = \mathbf{ \frac{w}{||w||}}$. Thus $\mathbf{k} = m \mathbf{\frac{w}{||w||}}$

Now, $\mathbf{z_1} = \mathbf{x_1} + \mathbf{k}$, and we have obtained a point that lies on HP2. Since $\mathbf{z_1}$ lies on HP2, it satisfies the equation of HP2 ($\mathbf{w}\cdot\mathbf{x} + b = 1$). Thus

$\mathbf{w}\cdot\mathbf{z_1} + b = 1$

$\mathbf{w}\cdot(\mathbf{x_1} + \mathbf{k}) + b = 1$

$\mathbf{w}\cdot\left(\mathbf{x_1} + m\frac{\mathbf{w}}{||\mathbf{w}||}\right) + b = 1$

$\mathbf{w}\cdot\mathbf{x_1} + m\frac{\mathbf{w}\cdot\mathbf{w}}{||\mathbf{w}||} + b = 1$

$\mathbf{w}\cdot\mathbf{x_1} + m\frac{||\mathbf{w}||^2}{||\mathbf{w}||} + b = 1$

$\mathbf{w}\cdot\mathbf{x_1} + m||\mathbf{w}|| + b = 1$

$\mathbf{w}\cdot\mathbf{x_1} + b = 1 - m||\mathbf{w}||$

Since $\mathbf{x_1}$ lies on HP1, we have $\mathbf{w}\cdot\mathbf{x_1} + b = -1$. Substituting this on the left hand side gives

$-1 = 1 - m||\mathbf{w}||$

$m = \frac{2}{||\mathbf{w}||}$
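The derivation can be checked numerically. This sketch uses made-up values of $\mathbf{w}$ and $b$, picks a point on HP1, steps a distance $m$ along the unit normal, and confirms it lands on HP2:

```python
import numpy as np

# Numerical check with hypothetical numbers: w = [3, 4] so ||w|| = 5.
w, b = np.array([3.0, 4.0]), -2.0
m = 2.0 / np.linalg.norm(w)            # margin m = 2 / ||w|| = 0.4

x1 = np.array([0.0, 0.25])             # w.x1 + b = 1 - 2 = -1, so x1 lies on HP1
k = m * w / np.linalg.norm(w)          # vector of length m along the normal w
z1 = x1 + k

print(np.dot(w, x1) + b)               # -1.0 (on HP1)
print(np.dot(w, z1) + b)               # 1.0  (on HP2)
```

Stepping exactly $m = 2/||\mathbf{w}||$ along the normal carries a point from HP1 to HP2, which is the claim the algebra above makes.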

You must realize we have landed at a very important result here: maximizing the margin $m$ is the same as minimizing $||\mathbf{w}||$, the norm of the vector $\mathbf{w}$ (in practice we minimize $||\mathbf{w}||^2$, which is equivalent and easier to optimize). Thus, we now have a formulation for the constrained optimization problem that an SVM solves.

minimize $||w||^2$

subject to $y_i(\mathbf{w}\cdot\mathbf{x_i} + b ) \geq 1$ for all i
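In practice, this optimization is left to a library. As a sketch (assuming scikit-learn is available), we can fit a linear SVM on a small made-up separable dataset and recover the margin $2/||\mathbf{w}||$ from the learned weights:

```python
import numpy as np
from sklearn.svm import SVC

# Toy separable data: two clusters of three points each.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates the hard-margin problem formulated above.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]                      # learned normal vector w
margin = 2.0 / np.linalg.norm(w)      # the margin m = 2 / ||w||
print(margin)
```

The smaller $||\mathbf{w}||$ the optimizer can get away with while respecting the constraints, the wider the recovered margin.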

### Primal and Dual

The above formulation is known as the primal optimization problem for an SVM. What is actually solved when determining the maximum margin classifier is known as the dual of the primal problem. The dual is solved by using the Lagrange multipliers method. The motivation behind solving the dual in place of the primal is that by solving the dual, the algorithm can be easily written in terms of the dot products between the points in the input data set, a property which becomes very useful for the kernel trick (introduced later). Lets see what the dual formulation looks like.

$\max_\alpha W(\alpha) = \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j=1}^n y^{(i)}y^{(j)}\alpha_i \alpha_j \langle x^{(i)}, x^{(j)}\rangle$

subject to $\alpha_i \geq 0, i = 1,2,..., n$

$\sum_{i=1}^n \alpha_i y^{(i)} = 0$

Seems overwhelming? Don't worry about its complexity and just take away the crucial points explained next. Here, the $\alpha_i$ are the Lagrange multipliers. Notice the term $\langle x^{(i)}, x^{(j)}\rangle$, which is the dot product of 2 input data points, $x^{(i)}$ and $x^{(j)}$. Also notice that while the primal was a minimization problem, the dual is a maximization one. Thus we have obtained a maximization problem in which the parameters are the $\alpha_i$ and we are dealing entirely with dot products between the data points. The benefit comes from the fact that dot products are very fast to evaluate, and this dot-product formulation comes in handy again when kernels (introduced later) are used. Another interesting point is that $\alpha_i > 0$ turns out to hold only for a small number of points, the ones which lie closest to the decision boundary. These points are known as the support vectors, and to predict the class of a new test data point $\mathbf{X}$, we require only the inner products between $\mathbf{X}$ and these few support vectors rather than the entire training set. The points surrounded by a blue rectangle (nearest to the decision boundary) are the support vectors in our 2-class case.
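This sparsity is easy to observe in practice. The sketch below (assuming scikit-learn; the two Gaussian clusters are made-up stand-ins for the happy/sad feature vectors) shows that only a handful of training points survive as support vectors:

```python
import numpy as np
from sklearn.svm import SVC

# Two synthetic clusters standing in for the two emotion classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)),    # class -1 cluster around (0, 0)
               rng.normal(5, 1, (20, 2))])   # class +1 cluster around (5, 5)
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear").fit(X, y)

print(len(X), "training points")
print(len(clf.support_vectors_), "support vectors")  # typically far fewer
print(clf.predict([[4.5, 4.5]]))                     # new point near the +1 cluster
```

Prediction only touches the stored support vectors, which is why a trained SVM can discard the rest of the training set.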

### Linearly Non-Separable Data – Soft Margin & Kernels

Till now, we have very conveniently assumed our training set to be linearly separable. If the 2 classes are not linearly separable, or are separable only with a very narrow margin, SVMs offer the concept of a soft margin. It often happens that by allowing the constraints to be violated for a few points, we get a better margin for the separating hyperplane than by fitting a strictly separating one, as in the figure shown below. If the constraints are strictly enforced for all the points in this case, we can see that the resulting separating hyperplane has a very small margin.

Thus, sometimes by allowing a few mistakes while partitioning the training data, we can get a better margin.

At the same time, the SVM applies a penalty $\zeta_i$ to each point for which the constraints are violated.

$\zeta_i = \max(0, 1 - y_i (\mathbf{w}\cdot\mathbf{x_i} + b))$

Looking at the equation, we can see that $\zeta_i$ will be 0 for data points on the correct side of the margin, and a positive number for data points on the wrong side. This alters our minimization problem from before to

minimize $||w||^2+\lambda\sum_{i=1}^n\zeta_i$

subject to $y_i(\mathbf{w}\cdot\mathbf{x_i} + b) \geq 1 - \zeta_i, \ \zeta_i \geq 0$ for all i
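The slack values are cheap to compute by hand. A toy sketch (all numbers made up) covering the three possible cases:

```python
import numpy as np

# zeta_i = max(0, 1 - y_i * (w.x_i + b)) for three illustrative points.
w, b = np.array([1.0, 1.0]), -3.0

X = np.array([[3.0, 2.0],     # comfortably on the correct side: zeta = 0
              [2.0, 1.5],     # correct side but inside the margin: 0 < zeta < 1
              [2.5, 1.5]])    # on the wrong side of the hyperplane: zeta > 1
y = np.array([1, 1, -1])

zeta = np.maximum(0.0, 1.0 - y * (X @ w + b))
print(zeta)                   # [0.0, 0.5, 2.0]
```

This quantity is the hinge loss: zero for safely classified points, and growing linearly the further a point strays past the margin.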

The variable $\lambda$ is the regularization parameter and determines the trade-off between maximizing the margin and ensuring that all training data points lie on the correct side of the margin. If $\lambda$ is small, then we are effectively making the total regularization term $\lambda\sum_{i=1}^n\zeta_i$ small in the optimization function, violations become cheap, and we get a large margin. The opposite holds if $\lambda$ is large.
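This trade-off can be observed directly. In the sketch below (assuming scikit-learn, whose parameter `C` multiplies the slack penalty and so plays the same role as $\lambda$ here; the data is made up, with one borderline point), a small penalty yields a visibly wider margin than a large one:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: class -1 near the origin, class +1 further out,
# with one +1 point (1.4, 1.4) sitting very close to the -1 cluster.
X = np.array([[0.0, 0.0], [1.0, 1.0], [1.2, 1.2],
              [3.0, 3.0], [4.0, 4.0], [1.4, 1.4]])
y = np.array([-1, -1, -1, 1, 1, 1])

margins = {}
for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margins[C] = 2.0 / np.linalg.norm(clf.coef_[0])
    print(f"C = {C}: margin = {margins[C]:.3f}")
```

With a small penalty the optimizer sacrifices the borderline point to keep $||\mathbf{w}||$ small; with a large penalty it respects every point and the margin collapses.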

Another way in which SVMs deal with non-linearly separable data points is the kernel trick. The idea stems from the fact that if the data cannot be partitioned by a linear boundary in its current dimensionality, then projecting the data into a higher dimensional space may make it linearly separable, as in the figure shown below (taken from the internet).

When such is the case, there will be a function $\phi(\mathbf{x})$ that projects the data points from their current space to the higher dimensional space, and $\phi(\mathbf{x})$ replaces $\mathbf{x}$ in all the above equations. In the figure above, we can see that when the 1-dimensional points were projected to a 2-dimensional space (the higher dimensional space), they became linearly separable.

As mentioned before, we keep the benefit of solving the optimization problem purely in terms of dot products here too, since the kernel function $k(\mathbf{x},\mathbf{z})$ can simply be written as the dot product of the transformed points $\mathbf{x}$ and $\mathbf{z}$.

$k(\mathbf{x},\mathbf{z}) = \phi(\mathbf{x})\cdot\phi(\mathbf{z})$

Now what's interesting to note is that the kernel $k(\mathbf{x},\mathbf{z})$ is often much easier to calculate than first computing $\phi(\mathbf{x})$ and $\phi(\mathbf{z})$ and then their dot product, since the vectors become high dimensional after their projection. This is why the method is called the kernel trick: an efficient implementation of the kernel allows us to learn in the higher dimensional space without ever explicitly performing the projection and determining the vectors $\phi(\mathbf{x})$ and $\phi(\mathbf{z})$.
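The classic small example of this is the polynomial kernel $k(\mathbf{x},\mathbf{z}) = (\mathbf{x}\cdot\mathbf{z})^2$ on 2D points, whose explicit feature map $\phi$ is 3-dimensional. The sketch below verifies numerically that the kernel computed in the original space equals the dot product in the projected space:

```python
import numpy as np

def phi(x):
    # Explicit feature map for the degree-2 polynomial kernel on 2D points:
    # phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2), a 3-dimensional vector.
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def k(x, z):
    # The kernel trick: same value, computed entirely in the original 2D space.
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

print(np.dot(phi(x), phi(z)))   # dot product after projecting to 3D -> 121.0
print(k(x, z))                  # (x.z)^2 = 11^2 -> 121.0, without projecting
```

For higher polynomial degrees or the RBF kernel the explicit $\phi$ grows enormous (or infinite-dimensional), while the kernel itself stays a one-line computation, which is exactly the point of the trick.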

### Conclusion

This blog was an effort to give a comprehensive idea of the working of SVMs, from introducing the key concepts and developing intuition about the keywords to the actual problem formulation, albeit with much of the mathematics intentionally skipped. We aimed to cover as many keywords related to SVMs as possible; nevertheless, it's likely that some important content was missed. Readers are encouraged to delve deeper into the mathematical formulation, to discuss it with us, and to let us know in the comments about any more SVM concepts they would like elaborated upon. It will motivate us to cover those concepts in future posts.

### Acknowledgement

1. Alexandre KOWALCZYK’s blog – “SVM Tutorial”. (link)