- Supervised/Unsupervised: Supervised
- Regression/Classification: Classification (despite its name, logistic regression is used for classification)
Notice that the values in the graph to the right will never reach 0 or 1. Also, the midpoint is 0.5. Recall that these values are probabilities!
To understand the importance of the graph to the right, you must understand probabilities.
Probabilities can never be below 0 or above 1. Why? Well, how can you express more certainty than a probability of 1, or less certainty than a probability of 0?
The graph above will never “touch” 0 or 1. It can get arbitrarily close to 0 or 1, but never reaches them exactly.
Also, please realize that the graph above is a transformation of the natural log graph. The natural log graph never touches the y-axis (it only approaches it asymptotically) and crosses the x-axis at x = 1.
When making predictions, we do not report raw probabilities; we use them to make decisions. First, we set a threshold of 0.5. Then, we calculate the probability. We must also establish the target variable: 1 indicates building a start-up, while 0 indicates finishing college. The closer the probability is to 1, the more confident we are that building a start-up is the way to get rich.
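For instance, here is a minimal sketch of this decision rule in Python (the function name and example probabilities are mine, purely for illustration):

```python
# A minimal sketch of turning a predicted probability into a decision.
# The 0.5 threshold matches the text; the probabilities are made up.
def predict_label(probability, threshold=0.5):
    """Return 1 (start-up) if probability >= threshold, else 0 (college)."""
    return 1 if probability >= threshold else 0

print(predict_label(0.82))  # 1 -> bet on the start-up
print(predict_label(0.31))  # 0 -> finish college
```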
You can skip this part of the article. But if you want to appreciate the underlying mechanics of how things work, I suggest you continue reading.
Probabilities can be expressed in many formats. You are probably familiar with fractions; that’s one method of communicating probabilities. However, there are other methods as well.
Probability: how likely an event is to occur.
Odds: the ratio of the probability of success to the probability of failure (P(success)/P(failure))
- 0.75/0.25 = 3
- It is three times more likely that I will win than lose.
- The higher the odds, the more likely ‘success’ is to occur.
Log Odds: the natural log of the odds
- ln(3) ≈ 1.0986
- This calculation maps the result onto the range from negative infinity to infinity.
- The formula can be denoted as log(P(A)/P(¬A)), where ¬A means “not A”.
- Since P(¬A) = 1 - P(A), this becomes log(P(A)/(1 - P(A)))!
- This method does not have an intuitive interpretation like the previous two; the short sketch after this list ties all three together.
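Here is a minimal Python sketch of the three representations, using the 0.75 example above:

```python
import math

# One chance of success, expressed three ways.
p_success = 0.75                      # probability
odds = p_success / (1 - p_success)    # P(success) / P(failure) = 3.0
log_odds = math.log(odds)             # natural log of the odds, ~1.0986

print(odds, log_odds)
```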
Equation (will be modified):
- logit(p) = ln(p/(1 - p)) = b + w1x1 + w2x2 + … + wmxm
- The x’s are the inputs, b is the bias term, and the w’s are the weights.
Notice how the left side of the equation is the logit. Let’s change that by rearranging the formula.
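Concretely, writing z = b + w1x1 + … + wmxm and solving for p (this algebra step is mine, not from the original figure):
- ln(p/(1 - p)) = z
- p/(1 - p) = e^z
- p = e^z/(1 + e^z) = 1/(1 + e^(-z))
The right-hand side is exactly the sigmoid function S(z) described next.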
Key legend for the formula on the left (a code sketch follows):
- S(z) = output between 0 and 1
- z = input to the function (the weighted sum of the inputs)
- e = base of the natural log
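A direct translation into code might look like this (a minimal sketch; the point is the behavior at 0 and near the asymptotes):

```python
import math

# The sigmoid squashes any real-valued z into an output between 0 and 1.
def sigmoid(z):
    return 1 / (1 + math.exp(-z))

print(sigmoid(0))   # 0.5   -> the midpoint
print(sigmoid(6))   # ~0.998, close to but never exactly 1
print(sigmoid(-6))  # ~0.002, close to but never exactly 0
```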
If we are trying to predict whether an email is spam or non-spam, we need variables/measures. Let’s use three variables: the email’s timestamp, the total number of words in the email, and the number of images in the email.
The function would be
- z = W0 + W1*email_timestamp + W2*email_words + W3*email_images.
The function is identical to linear regression’s. The only difference is that we plug its output into the sigmoid function, which converts it into a probability between 0 and 1.
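Putting the two pieces together might look like this (a hedged sketch: the weights and feature values are made up purely for illustration):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Hypothetical learned weights and feature values for one email.
w0, w1, w2, w3 = -1.0, 0.0002, 0.01, 0.5
email_timestamp, email_words, email_images = 3600, 120, 2

# The same linear combination as linear regression...
z = w0 + w1 * email_timestamp + w2 * email_words + w3 * email_images
# ...squashed into a probability by the sigmoid.
p_spam = sigmoid(z)
print(p_spam)  # ~0.87 -> likely spam under a 0.5 threshold
```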
At this point, you might be asking how we find the weights. Good question. Recall that we used the MSE (mean squared error) for linear regression; it helped us find the appropriate weights for our equation.
For logistic regression, it’s a bit more complicated. The cost function we use is called Cross-Entropy, also known as Log Loss. When dealing with a binary classification problem, cross-entropy loss is split into two separate cost functions: one for y = 1 and one for y = 0.
If you have any difficulties understanding this, I suggest watching Andrew Ng’s YouTube video on Logistic Regression. Notice that there are two functions, so there will be two graphs to provide us with an intuitive understanding.
Left graph: the ground truth is 1. As our prediction approaches 0, notice how the cost grows toward infinity. The more incorrect we are, the larger the cost.
Right graph: the right graph mirrors the left one. The ground truth is 0. The closer our prediction is to 1, the larger the cost.
The key thing to note is that the cost function penalizes confident and wrong predictions far more than it rewards confident and right predictions!
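You can see this asymmetry in a small sketch (the function name and example predictions are mine):

```python
import math

# Log loss for a single example; y_true is 0 or 1.
def cost(y_true, y_pred):
    if y_true == 1:
        return -math.log(y_pred)      # explodes as y_pred -> 0
    else:
        return -math.log(1 - y_pred)  # explodes as y_pred -> 1

print(cost(1, 0.99))  # ~0.01: confident and right, tiny reward
print(cost(1, 0.01))  # ~4.61: confident and wrong, huge penalty
```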
Lastly, the equation in the image above is the final, combined form. Note that it merges the two cost functions described above into a single formula. Don’t believe me? Well, take note of the “y(i)” and “(1 - y(i))” terms in the formula, and see the sketch after the list below.
- If the ground truth is 0, then y(i) = 0, which zeroes out the left part of the equation. The remaining expression is (1 - 0)log(1 - h(x(i))), the equation for the right graph above.
- If the ground truth is 1, then y(i) = 1, which zeroes out the right part of the equation. The remaining expression is (1)log(h(x(i))), the equation for the left graph above.
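A vectorized sketch of the combined formula (the function name is mine; the labels and predictions are made up):

```python
import numpy as np

# Mean cross-entropy (log loss) over all examples.
def cross_entropy(y_true, y_pred):
    return -np.mean(y_true * np.log(y_pred)
                    + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 0])
y_pred = np.array([0.9, 0.1, 0.8, 0.3])
print(cross_entropy(y_true, y_pred))  # ~0.20
```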
Great. But didn’t we use gradient descent for linear regression?
Yes, and we will use it here as well.
We discussed gradient descent with linear regression. Logistic regression is also optimized with gradient descent; the derivation is more complicated, but oddly enough, it arrives at the same update equation.
Deriving the gradient of the Cross-Entropy (Log Loss):
Using the learning rate, we update our weights to minimize the cost function.
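One full update step might look like this (a sketch under assumed data; the gradient takes the same form as in linear regression):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# One gradient-descent update for logistic regression.
# X is an (m, n) feature matrix, y an (m,) vector of 0/1 labels.
def gradient_step(weights, X, y, learning_rate=0.1):
    predictions = sigmoid(X @ weights)           # h(x) for every example
    gradient = X.T @ (predictions - y) / len(y)  # gradient of the log loss
    return weights - learning_rate * gradient    # step against the gradient

# Tiny made-up dataset: the first column is the bias term's constant 1.
X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])
y = np.array([1.0, 0.0, 1.0])
w = np.zeros(2)
for _ in range(100):
    w = gradient_step(w, X, y)
print(w)
```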
After training the model, we would like to understand it better. There are several ways to evaluate our model. Check out my article! The most popular method is accuracy, which we compute from a confusion matrix.
- True positives: correct positive predictions.
- False positives: the prediction was positive, but the actual value was negative.
- True negatives: correct negative predictions.
- False negatives: the prediction was negative, but the actual value was positive.
- Accuracy = the total number of correct predictions / the total number of predictions.
- The higher the number, the better our model performed (a short sketch of the calculation follows).
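For completeness, here is the accuracy calculation with made-up counts:

```python
# Accuracy from confusion-matrix counts; the counts are illustrative.
true_positives, false_positives = 40, 5
true_negatives, false_negatives = 45, 10

correct = true_positives + true_negatives
total = correct + false_positives + false_negatives
print(correct / total)  # 0.85
```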