In a binary classification problem, given a linearly separable data set, the optimal separating hyperplane is the one that correctly classifies all the data while being farthest away from the data points. In this respect, it is said to be the hyperplane that maximizes the **margin**, defined as the distance from the hyperplane to the closest data point.

The idea behind the optimality of this classifier can be illustrated as follows. New test points are drawn from the same distribution as the training data, so they tend to fall close to the training points. Thus, if the separating hyperplane is far away from the training points, previously unseen test points will most likely also fall far away from it, or at worst inside the margin. As a consequence, the larger the margin is, the less likely a test point is to fall on the wrong side of the hyperplane.

Finding the optimal separating hyperplane can be formulated as a convex quadratic programming problem, which can be solved with well-known techniques.

The optimal separating hyperplane should not be confused with the optimal classifier known as the Bayes classifier: the Bayes classifier is the best classifier for a given problem, independently of the available data but *unattainable in practice*, whereas the optimal separating hyperplane is only the best *linear* classifier one can produce *given a particular data set*.

The optimal separating hyperplane is one of the core ideas behind support vector machines.
In particular, it gives rise to the so-called **support vectors** which are the data points lying on the margin boundary of the hyperplane. These points support the hyperplane in the sense that they contain all the required information to compute the hyperplane: removing other points does not change the optimal separating hyperplane. Elaborating on this fact, one can actually add points to the data set without influencing the hyperplane, as long as these points lie outside of the margin.


Given a linearly separable data set $\{(\g x_i, y_i)\}_{i=1}^N$ with input vectors $\g x_i \in\X\subseteq\R^d$ and labels $y_i \in \{-1, +1\}$, the **margin** of a classifier $f(\g x) = \sign(g(\g x))$ with separating surface $H$ is
$$
\gamma = \min_{i\in\{1,\dots,N\} } \operatorname{dist}(\g x_i, H)
$$
For a linear classifier, with $g(\g x) = \inner{\g w}{\g x} + b $, $H$ is a separating hyperplane:
$$
H = \{ \g x \in \X \ :\ \inner{\g w}{\g x} + b = 0\}
$$
and the distance can be computed as
$$
\operatorname{dist}(\g x_i, H) = \frac{|\inner{\g w}{\g x_i} + b |}{\|\g w\|_2}
$$
(the use of the $\ell_2$-norm is implied by the classical inner product in $\R^d$).
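As a quick sanity check, the distance formula can be implemented in a few lines (the hyperplane and points below are arbitrary choices for illustration):

```python
import numpy as np

def dist_to_hyperplane(x, w, b):
    """Distance from x to H = {x : <w, x> + b = 0}, i.e. |<w, x> + b| / ||w||_2."""
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

# Hypothetical hyperplane in R^2 with ||w||_2 = 5
w = np.array([3.0, 4.0])
b = -5.0

print(dist_to_hyperplane(np.array([3.0, 4.0]), w, b))  # |9 + 16 - 5| / 5 = 4.0

# The margin of the classifier is the smallest such distance over the data set
X = np.array([[1.0, 2.0], [0.0, 3.0], [2.0, 0.0]])
gamma = min(dist_to_hyperplane(x, w, b) for x in X)
print(gamma)  # |6 - 5| / 5 = 0.2
```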

The optimal separating hyperplane $H^*$ is the one that maximizes the margin: $$ (\g w^*, b^*) = \arg\max_{\g w, b} \ \min_{i\in\{1,\dots,N\} }\frac{|\inner{\g w}{\g x_i} + b |}{\|\g w\|_2} $$ Solving this optimization problem is difficult, in particular due to the scaling by the inverse of $\|\g w\|_2$, which makes the problem nonconvex. In addition, the problem is ill-posed, since it has infinitely many solutions: multiplying $(\g w, b)$ by any positive constant leaves the hyperplane (and thus the margin) unchanged.

A possible alternative would be to fix $\|\g w\|_2$, but again this leads to a nonconvex optimization problem (with a quadratic equality constraint). A better idea is to consider the so-called **canonical hyperplane** by fixing the numerator. That is, we impose the constraint
$$
\min_{i\in\{1,\dots,N\} } |\inner{\g w}{\g x_i} + b| = 1
$$
when searching for the parameters $\g w^*$ and $b^*$. Given that we search for a classifier that correctly classifies all the data, we have, for all $i\in\{1,\dots,N\}$, $\sign (\inner{\g w}{\g x_i} + b) = y_i$, and $y_i (\inner{\g w}{\g x_i} + b) = |\inner{\g w}{\g x_i} + b|$. Thus, the constraint becomes
$$
\min_{i\in\{1,\dots,N\} } y_i(\inner{\g w}{\g x_i} + b) = 1
$$

Assuming the canonical form of the hyperplane, the margin is now given by
$$
\gamma = \frac{1}{\|\g w\|_2} \min_{i\in\{1,\dots,N\} } |\inner{\g w}{\g x_i} + b| = \frac{1}{\|\g w\|_2}
$$
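To make the canonical form concrete, here is a small numerical illustration (the hyperplane and data below are hypothetical): dividing $(\g w, b)$ by a positive constant does not change the hyperplane, so we can rescale to enforce the canonical constraint, after which the margin is read off as $1/\|\g w\|_2$.

```python
import numpy as np

# Hypothetical separating hyperplane and separable data in R^2
w = np.array([2.0, 0.0])
b = 0.0
X = np.array([[1.5, 0.0], [2.0, 1.0], [-1.0, 0.5], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

# Rescale so that min_i y_i(<w, x_i> + b) = 1 (canonical form)
scale = (y * (X @ w + b)).min()       # here: 2.0
w_c, b_c = w / scale, b / scale       # canonical parameters: w_c = (1, 0), b_c = 0

# In canonical form, the margin is simply 1 / ||w||_2 ...
margin = 1.0 / np.linalg.norm(w_c)

# ... and it equals the distance from the closest point to the hyperplane
distances = np.abs(X @ w_c + b_c) / np.linalg.norm(w_c)
print(margin, distances.min())  # 1.0 1.0
```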
In this case, **maximizing the margin is equivalent to minimizing the norm of $\g w$**; and $H^*$ can be found by solving
$$
\begin{align*}
(\g w^*, b^*) = \arg\min_{\g w, b} \ & \frac{1}{2} \|\g w\|_2^2 \\
\mbox{s.t. }\ & \min_{i\in\{1,\dots,N\} } y_i(\inner{\g w}{\g x_i} + b) = 1
\end{align*}
$$
(minimizing $\frac{1}{2} \|\g w\|_2^2$ amounts to minimizing $\|\g w\|_2$ but is more easily handled from an optimization viewpoint). And since we minimize $\|\g w\|_2$, we can relax the equality constraint to an inequality: at the optimum, at least one constraint must be active (hold with equality), for otherwise $\g w$ could be scaled down to further decrease the objective. This yields
$$
\begin{align*}
(\g w^*, b^*) = \arg\min_{\g w, b} \ & \frac{1}{2} \|\g w\|_2^2 \\
\mbox{s.t. }\ & y_i(\inner{\g w}{\g x_i} + b) \geq 1 , \ i=1,\dots,N .
\end{align*}
$$
This formulates the determination of the optimal separating hyperplane as a convex quadratic program, for which efficient solvers exist.
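A minimal sketch of solving this quadratic program numerically, here with scipy's generic SLSQP solver as a stand-in for a dedicated QP solver (the toy data set is an illustrative assumption):

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data set in R^2
X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 1.5],
              [-1.0, -1.0], [-2.0, -2.5], [-3.0, -1.0]])
y = np.array([1, 1, 1, -1, -1, -1])

# Variables theta = (w_1, w_2, b); objective (1/2)||w||_2^2
def objective(theta):
    return 0.5 * np.dot(theta[:2], theta[:2])

# One inequality constraint per point: y_i(<w, x_i> + b) - 1 >= 0
constraints = [
    {"type": "ineq",
     "fun": lambda theta, xi=xi, yi=yi: yi * (np.dot(theta[:2], xi) + theta[2]) - 1.0}
    for xi, yi in zip(X, y)
]

res = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=constraints)
w, b = res.x[:2], res.x[2]
print(w, b, 1.0 / np.linalg.norm(w))  # maximal margin for these points: sqrt(2)
```

For this data set the closest opposite-class pair is $(1,1)$ and $(-1,-1)$, so the optimal margin is $\sqrt 2$, attained by $\g w^* = (0.5, 0.5)$, $b^* = 0$.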

The support vectors are the data points $\g x_i$ that lie exactly on the margin boundary and thus that satisfy
$$
y_i (\inner{\g w}{\g x_i} + b) = 1 .
$$
These points (together with their labels) provide sufficient information to compute the optimal separating hyperplane. To see this, note that removing any other point from the data set amounts to removing a *non-active* constraint from the quadratic program above, which does not change its solution.

In addition, one can add as many points as one wishes to the data set without affecting $H^*$, as long as these points are correctly classified by $H^*$ and lie outside the margin. Again, this amounts to adding constraints to the optimization problem and shrinking its feasible set. But as long as the previous solution $H^*$ remains feasible, the solution of the optimization problem does not change.
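Both invariances can be checked numerically. Below, a hypothetical data set is solved (again with scipy's SLSQP as a stand-in for a QP solver) before and after removing a non-support point and adding a correctly classified point outside the margin; the optimal $(\g w^*, b^*)$ is the same in all three cases.

```python
import numpy as np
from scipy.optimize import minimize

def optimal_hyperplane(X, y):
    """Solve min (1/2)||w||^2 s.t. y_i(<w, x_i> + b) >= 1 for (w, b)."""
    cons = [{"type": "ineq",
             "fun": lambda t, xi=xi, yi=yi: yi * (np.dot(t[:-1], xi) + t[-1]) - 1.0}
            for xi, yi in zip(X, y)]
    res = minimize(lambda t: 0.5 * np.dot(t[:-1], t[:-1]),
                   x0=np.zeros(X.shape[1] + 1), method="SLSQP", constraints=cons)
    return res.x[:-1], res.x[-1]

X = np.array([[1.0, 1.0], [2.0, 2.5], [-1.0, -1.0], [-2.0, -2.5]])
y = np.array([1, 1, -1, -1])
w, b = optimal_hyperplane(X, y)        # support vectors: (1, 1) and (-1, -1)

# Removing the non-support point (2, 2.5) ...
w2, b2 = optimal_hyperplane(X[[0, 2, 3]], y[[0, 2, 3]])

# ... or adding a correctly classified point outside the margin ...
X3 = np.vstack([X, [4.0, 3.0]])
y3 = np.append(y, 1)
w3, b3 = optimal_hyperplane(X3, y3)

# ... leaves the optimal hyperplane unchanged (up to solver tolerance)
print(np.round(w, 3), np.round(w2, 3), np.round(w3, 3))
```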