The human brain is an ocean inside our heads. It is one of the most complex biological structures in the universe, and it still holds secrets we struggle to uncover.
Today, I will talk about one of the precious capabilities we all have: classification. It is the ability to tell apart the objects our eyes capture.
As humans, when we see an object, our brain does an absolutely elegant job. It first detects the prominent features of the object, then dives into detail and classifies what it is. But how does it achieve this? More puzzling still, how does it process the mass of data coming from the outside world at such tremendous speed?
I think all these questions are enough to make one curious about the topic. Since I am not a neurobiologist, I cannot go into the biological details, but let me talk about how we imitate the brain with algorithms that make it possible for computers to detect objects in images.
Object detection is a very popular topic in Machine Learning. Computer scientists design algorithms that extract features from images and classify objects using those features. In this article, I will give a brief introduction to classification by explaining the binary classification method.
So, first question: What do we mean by feature? I will answer this question by using a famous example: Cats & Dogs.
In the graph above, you see cats and dogs separated into two groups on a 2D coordinate system. This is done using two of their prominent features: nose size and ear shape. As the graph shows, cats generally have pointy ears and small noses, while dogs have round ears and big noses. A line between these two classes separates them perfectly: the points above the line are clearly cats, and the points below it are clearly dogs. Note that there are only two kinds of animals in this example.
Binary Classification -as can be understood from its name- is the method used to classify items into one of two classes. From the math perspective, the approach is to find the line (mentioned in the previous paragraph) that best separates these two classes. If you have not read my previous blog about Linear Regression, I strongly suggest reading it before continuing.
In Linear Regression, we tried to find the line that best fits spread data points, i.e., the line with the minimum total distance to the data points (remember that we minimized our error function, the linear least squares cost, which measures the distance between the actual points and our line's predictions). In Binary Classification, we will instead try to find the line that best separates the classes, keeping the maximum distance to the points of each class. Now, it is time for math:
We know that the general line equation is: xw + b = y. Now, we have two classes. Suppose we assign +1 for cats and −1 for dogs.
xw + b > 0 if y = +1 => above the line: cats
xw + b < 0 if y = −1 => below the line: dogs
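The prediction rule above can be sketched in a few lines of NumPy. The weights and bias here are purely illustrative placeholders, not values fitted to any data:

```python
import numpy as np

# Hypothetical 2D feature vector: [ear pointiness, nose size]
w = np.array([1.5, -2.0])  # illustrative weights
b = 0.3                    # illustrative bias

def predict(x):
    """Return +1 (cat) if the point lies above the line, -1 (dog) otherwise."""
    return 1 if x @ w + b > 0 else -1

print(predict(np.array([1.0, 0.0])))  # 1  (above the line: cat)
print(predict(np.array([0.0, 1.0])))  # -1 (below the line: dog)
```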
These are the predictions we want our calculations to produce. If we multiply each inequality by −y (that is, by −1 when y = +1 and by +1 when y = −1), both collapse into a single condition:
−y(xw + b) < 0. If we take the maximum of zero and this expression, we get
max(0, −y(xw + b)), which returns zero if a point is classified correctly and a positive value if it is not. Now, it is time to build our cost function. If this quantity sums to zero over all points, our line perfectly separates the two classes. So, we should look for the parameter values that minimize the sum over all predictions. Here is our cost function (do not be confused by the matrix notation in the formula; it is the same logic generalized to matrix form):
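The per-point expression max(0, −y(xw + b)), summed over the data, can be written directly as a small NumPy sketch (the sample points and parameters below are made up for illustration):

```python
import numpy as np

def perceptron_cost(w, b, X, y):
    """Sum of max(0, -y_i (x_i . w + b)) over all points.
    Equals zero exactly when every point is on its correct side of the line."""
    margins = -y * (X @ w + b)
    return np.maximum(0, margins).sum()

# Two toy points, one per class
X = np.array([[1.0, 2.0], [2.0, -1.0]])
y = np.array([1, -1])
w = np.array([0.0, 1.0])
b = 0.0
# Point 1: y=+1 and xw+b = 2.0 > 0  -> correct, contributes 0
# Point 2: y=-1 and xw+b = -1.0 < 0 -> correct, contributes 0
print(perceptron_cost(w, b, X, y))  # 0.0
```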
Now, everything seems correct, but there is a catch: Gradient Descent and Newton's Method cannot be applied directly here, because this function is not differentiable everywhere, and it also admits the trivial solution w = 0, b = 0. Therefore, we use a mathematical technique called smooth approximation. Many different functions could be used in such a situation, but I will continue with the most famous one, the Softmax cost function, defined as:
What this function does is smooth away the non-differentiable corner of the max, as can be seen below:
Now, our error function becomes:
All that remains is to use Gradient Descent or Newton's Method to find the optimal parameters of the classifier line that will discriminate between our classes.
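Putting it all together, here is a minimal end-to-end sketch: synthetic data in the spirit of the cats & dogs example (the feature values and learning rate are assumptions for illustration), the summed log(1 + e^(−y(xw + b))) cost, and plain gradient descent on w and b:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2D data: hypothetical [ear pointiness, nose size] features
X_cats = rng.normal([2.0, -2.0], 0.5, size=(20, 2))   # label +1
X_dogs = rng.normal([-2.0, 2.0], 0.5, size=(20, 2))   # label -1
X = np.vstack([X_cats, X_dogs])
y = np.concatenate([np.ones(20), -np.ones(20)])

def softmax_cost(w, b):
    # sum_i log(1 + exp(-y_i (x_i . w + b)))
    return np.log(1 + np.exp(-y * (X @ w + b))).sum()

def gradients(w, b):
    s = -y * (X @ w + b)
    sigma = 1 / (1 + np.exp(-s))   # d/ds log(1 + e^s)
    grad_w = X.T @ (-y * sigma)    # chain rule: ds/dw = -y x
    grad_b = (-y * sigma).sum()
    return grad_w, grad_b

w, b = np.zeros(2), 0.0
lr = 0.05  # illustrative step size
for _ in range(200):
    gw, gb = gradients(w, b)
    w -= lr * gw
    b -= lr * gb

preds = np.sign(X @ w + b)
print("training accuracy:", (preds == y).mean())
```

Since the two clusds are well separated, the learned line classifies essentially every training point correctly; Newton's Method would reach the same solution in fewer, more expensive steps.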
Since not all classes can be separated by a line (for example, some small dogs may have the same nose size as cats), there are techniques that strengthen this method, such as adding a buffer zone (margin) or performing feature selection. I will dive into those topics later in this series.
For further information, you can see:
- Machine Learning Refined: Foundations, Algorithms, and Applications by Jeremy Watt, Reza Borhani, and Aggelos K. Katsaggelos
Thank you for your time!