If you have worked on logistic regression or neural network problems, you must have heard about the sigmoid function. It takes input values between −∞ and ∞ and maps them to values between 0 and 1. It is very handy when we are predicting a probability: for example, whether an email is spam or not, or whether a tumor is malignant or benign. More detail about why to use the sigmoid function in logistic regression is here.
We calculate the derivative of the sigmoid to minimize the loss function. Let's say we have one example with attributes x₁, x₂ and corresponding label y. Our hypothesis is

z = w₁x₁ + w₂x₂ + b

where w₁, w₂ are weights and b is the bias.
Then we put our hypothesis through the sigmoid function to get the predicted probability, i.e. a value between 0 and 1:

y_hat = sigma(z) = 1 / (1 + e^(-z))
where y_hat is the predicted probability of y being 1. And the loss function is

L(y_hat, y) = −(y · log(y_hat) + (1 − y) · log(1 − y_hat))
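The pipeline so far can be sketched in a few lines of Python. This is a minimal illustration, not the article's original code; the weights, bias, and training point below are made-up values for demonstration.

```python
import math

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def bce_loss(y_hat, y):
    """Binary cross-entropy loss for a single example."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# Hypothetical weights, bias, and one training example
w1, w2, b = 0.5, -0.3, 0.1
x1, x2, y = 2.0, 1.0, 1

z = w1 * x1 + w2 * x2 + b    # hypothesis: z = w1*x1 + w2*x2 + b
y_hat = sigmoid(z)           # predicted probability of y being 1
loss = bce_loss(y_hat, y)
print(y_hat, loss)
```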
To minimize the loss function we perform gradient descent: we calculate the derivative of the loss function with respect to the weights and bias, multiply it by the learning rate alpha (α), and subtract the result from the current values of the weights and bias. In the next iteration we use these new values of weights and bias. The iterations go on until we reach the minimum. Here is a great article about gradient descent.
To calculate the derivative we have to backpropagate, because the loss function depends on the sigmoid, the sigmoid depends on the hypothesis, and the hypothesis depends on the weights and bias.
w₁ → z → sigma(z) → L(y_hat, y)
By the chain rule, the derivative of the loss function with respect to w₁ is

dL/dw₁ = (dL/dy_hat) · (dy_hat/dz) · (dz/dw₁)
In this article we will talk only about the middle term, the derivative of the sigma function. Let's substitute the value of y_hat:

dy_hat/dz = d/dz [1 / (1 + e^(-z))]
Now we will solve the derivative of the sigmoid. We will treat it as a total derivative (not a partial derivative) for simplicity.
Before going further, I recommend going through the first seven rules of derivatives from here.
Take the derivative on both sides:

d/dz sigma(z) = d/dz (1 + e^(-z))^(-1)

Applying the power rule and the chain rule:

d/dz sigma(z) = −(1 + e^(-z))^(-2) · d/dz (1 + e^(-z))

Again by the chain rule, d/dz e^(-z) = −e^(-z), so:

d/dz sigma(z) = e^(-z) / (1 + e^(-z))²

Add and subtract 1 in the numerator:

d/dz sigma(z) = (1 + e^(-z) − 1) / (1 + e^(-z))²

Let's take the common factor outside the bracket:

d/dz sigma(z) = [1 / (1 + e^(-z))] · [1 − 1 / (1 + e^(-z))]

This is the derivative of the sigma function:

sigma'(z) = sigma(z) · (1 − sigma(z))
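We can sanity-check the closed form against a numerical derivative. This is a quick verification sketch, not part of the original article: a central finite difference should agree with sigma(z)·(1 − sigma(z)) at any point.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    # The closed form derived above: sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1 - s)

# Compare against a central finite difference at a few points
h = 1e-6
for z in (-2.0, 0.0, 3.0):
    numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
    print(z, sigmoid_prime(z), numeric)
```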
Let's take 50 numbers equally spaced between −10 and 10, calculate the sigmoid and the derivative of the sigmoid for every number, and plot them.
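A short numpy/matplotlib snippet along these lines (the article's original plotting code is not shown, so this is a reconstruction of the described plot):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-10, 10, 50)     # 50 equally spaced points in [-10, 10]
sig = 1 / (1 + np.exp(-x))       # sigmoid(x)
dsig = sig * (1 - sig)           # derivative: sigmoid(x) * (1 - sigmoid(x))

plt.plot(x, sig, label="sigmoid(x)")
plt.plot(x, dsig, label="derivative of sigmoid", color="orange")
plt.legend()
plt.show()
```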
We know that the derivative is actually the slope, and the slope is defined as the ratio of the change in Y to a unit change in X.
We can see in the plot that at the left, where X = −10, changing X produces very little change in sigmoid(X); that's why the slope, or derivative, of the sigmoid is nearly 0 there.
But at the center of the plot, a small change in X produces a large change in sigmoid(X); moreover, the slope is highest at X = 0.
As we go further right, the change in Y per unit change in X shrinks again, so the slope returns to nearly zero.
These three conditions are depicted accurately by the derivative of the sigmoid function (orange line) in the plot, so we can say our calculation is correct.
Thanks for reading. Feel free to refer to the links below for more details.
- Logistic Regression by Andrew Ng
- Cost Function by Andrew Ng
- List of Derivative Rules