Geometrically, the concept is simple. Let’s look at the image below. The target variable is color coded: if a point is blue, the person likes movies; if it’s red, the person does not. The idea is to find the support vectors near the black line, which represents the boundary.
Notice that there are many ways to split the data in the image below. However, the algorithm chose the black line because it produces the largest margin between the two classes.
- The yellow line marks the decision boundary; its distance from each class is the width of the margin divided by two.
If you’re asking how the algorithm maximizes the distance between the classes, then you’re thinking along the right lines (no pun intended).
Before I explain how the model finds the support vectors, we need a quick overview of some linear algebra.
1. Vector arithmetic: Subtraction.
We need to calculate the difference between the vectors that lie on the black lines. For example, in the image above, we have the data points on the black lines (the red and blue dots). You should get comfortable viewing (x, y) as a vector rather than as a pair of coordinates; it will help you understand computations in higher dimensions.
Calculate the difference between the two vectors. Luckily, this is a calculation you have probably been doing since middle school.
- u = [3, 4]
- v = [7, 2]
- u minus v = [-4, 2]
- v minus u = [4, -2]
Note that the order of the subtraction changes the geometric representation, but if you were to pick up the calculated vector (either u - v or v - u), you could drop it so that it connects vector_u and vector_v.
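As a quick illustration, here is a minimal sketch in NumPy using the same u and v from the list above:

```python
import numpy as np

# The two vectors from the example above.
u = np.array([3, 4])
v = np.array([7, 2])

# The order of subtraction flips the direction of the resulting vector,
# but both results connect the same two points.
print(u - v)  # [-4  2]
print(v - u)  # [ 4 -2]
```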
2. Why do we care?
Fair point. Let’s look at the image with the blue and red data points and draw a line between the data points that sit on the black lines (the support vectors): the green line.
However, this is not the distance between the black lines that the support vectors sit on; it is only the difference between the blue and red dots. What we actually need is the distance between the black lines.
That is why we need a line through the blue or red dot that is orthogonal (perpendicular) to the black line. To figure this out, we need to compute the dot product.
3. Vector arithmetic: Dot Product.
Another useful operation is the dot product. The dot product multiplies each component of one vector by the corresponding component of the second vector and then sums those products. For example:
- Vector_a = [1, 4]
- Vector_b = [9, 23]
- Result = (1*9) + (4*23) = 101
- Geometrically, 101 is the length of the projection of Vector_a onto Vector_b multiplied by the length of Vector_b (and vice versa).
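Here is a small sketch in NumPy that checks the arithmetic and the projection interpretation (the names a and b stand for Vector_a and Vector_b from the list above):

```python
import numpy as np

a = np.array([1, 4])
b = np.array([9, 23])

# Component-wise multiply, then sum: (1*9) + (4*23) = 101
print(np.dot(a, b))  # 101

# The scalar projection of a onto b is (a . b) / |b|,
# so the dot product equals that projection times the length of b.
scalar_projection = np.dot(a, b) / np.linalg.norm(b)
print(scalar_projection * np.linalg.norm(b))  # 101 (up to floating point)
```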
The dot product allows us to project one vector onto another. For example, in the image below, we can project the green line onto the light purple line. Since the purple line is perpendicular to the black lines, when we compute the dot product of the green vector with a unit vector along the purple line, we are calculating the distance between the black lines!
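To make that concrete, here is a minimal sketch. The points and the perpendicular direction below are made up for illustration; w plays the role of the purple line and the two points stand in for the red and blue support vectors:

```python
import numpy as np

# Hypothetical support vectors, one on each black margin line.
x_red = np.array([2.0, 1.0])
x_blue = np.array([4.0, 5.0])

# Hypothetical direction perpendicular to the black lines (the purple line).
w = np.array([1.0, 2.0])
w_unit = w / np.linalg.norm(w)

# The green line is the difference vector between the two support vectors.
green = x_blue - x_red

# Projecting the green vector onto the unit perpendicular direction
# gives the distance between the two black lines (the margin width).
margin_width = np.dot(green, w_unit)
print(margin_width)
```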
4. Maximizing the distance between the Support Vectors (points on the black lines).
That requires a lot of math! I suggest you watch Professor Wilson’s YouTube tutorial!
Well, the first part of the equation represents what we are trying to maximize: the distance between the two support vectors, i.e., the width of the margin.
The second part of the equation represents what we are trying to minimize: how wrong our predictions are. The SVM minimizes the training error in an interesting manner. There are only two possible labels: 1 or -1. For example, 1 could mean that the image is a dog while -1 could signify that the image is a cat. If your model’s score for the dog class is 0.7, then the loss is 0.3 (1 - 0.7). However, if your model outputs a score of 1.2 for the dog class, then the loss is 0.
In other words, the second part of the equation pushes the model toward support vectors that yield confident and accurate predictions.
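The original equation appears as an image, so as an assumption I will illustrate this part with the standard hinge loss, which matches the behavior described above (a loss of 0.3 for a score of 0.7, and a loss of 0 once the score reaches 1):

```python
def hinge_loss(score, label):
    """Hinge loss for a single example.

    label is +1 or -1; score is the raw model output (not a probability).
    The loss is zero once the prediction is both correct and confident
    (label * score >= 1), and grows linearly otherwise.
    """
    return max(0.0, 1.0 - label * score)

print(hinge_loss(0.7, 1))   # 0.3 -> correct, but not confident enough
print(hinge_loss(1.2, 1))   # 0.0 -> correct and confident
print(hinge_loss(-0.5, 1))  # 1.5 -> wrong side of the boundary
```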
5. Issues With Assumptions
One drawback of SVMs is that they assume a hyperplane can separate the data points. However, what if your data looks like the left image below? We would no longer have labels that can be cleanly separated by a line. In that case, we can transform the data into a higher dimension in which a hyperplane does separate the two classes. SVMs do this with a kernel. Think of a kernel as a function that is applied to all the data points.
Once we transform the data into that higher dimension, we can find a hyperplane that separates the classes, and then project that separator back into the original dimension, similar to the image below.
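As a minimal sketch of this idea (assuming scikit-learn is available; the dataset below is a made-up “circles” example that no straight line can separate):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# A toy dataset where one class forms a ring around the other,
# so no straight line in 2D can separate the labels.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# A linear SVM struggles here, but an RBF kernel implicitly maps the
# points into a higher-dimensional space where a hyperplane separates them.
linear_svm = SVC(kernel="linear").fit(X, y)
kernel_svm = SVC(kernel="rbf").fit(X, y)

print("linear kernel accuracy:", linear_svm.score(X, y))
print("rbf kernel accuracy:   ", kernel_svm.score(X, y))
```

The choice of the RBF kernel here is only one option; polynomial and other kernels apply the same trick of computing the separation in a higher-dimensional space without ever building that space explicitly.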