Because Y can take any value in (-inf, +inf), the activation function is the deciding factor in whether a neuron should fire or not (be on or off, 1 or 0, etc.).

Why use an activation function? Well, there are a few good reasons, but the most important is that activation functions introduce non-linearity into the network. A neural network without activation functions is basically just a linear regression model and cannot handle more complicated tasks such as language translation and image classification. Also, a linear function's derivative is a constant with no relation to the input, which undermines backpropagation, the process by which neural networks "learn".
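The "just a linear regression model" claim can be checked directly: stacking layers without an activation in between collapses to a single linear map. A minimal sketch with NumPy (the weight shapes here are arbitrary, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 4))   # weights of "layer 1"
W2 = rng.normal(size=(4, 2))   # weights of "layer 2"
x = rng.normal(size=3)         # an input vector

# Two layers with no activation between them...
two_layer = (x @ W1) @ W2

# ...are mathematically identical to ONE layer whose weights are W1 @ W2
one_layer = x @ (W1 @ W2)

print(np.allclose(two_layer, one_layer))  # True: depth bought us nothing
```

No matter how many layers you stack, without a non-linearity between them the whole network stays a single linear transformation.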

This is a function that says:

Activated = 1 if Y > some threshold (usually 0), otherwise 0
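This threshold (step) function can be sketched in a few lines of NumPy; the threshold of 0 is the usual convention, as noted above:

```python
import numpy as np

def binary_step(y, threshold=0.0):
    """Fire (1) if the neuron's output Y exceeds the threshold, else stay off (0)."""
    return np.where(y > threshold, 1, 0)

print(binary_step(np.array([-2.0, -0.5, 0.0, 0.3, 5.0])))  # [0 0 0 1 1]
```

Every output is forced to exactly 0 or 1, which is precisely what causes the trouble described next.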

But what happens when there are many neurons and they all output 1, or all output 0, or some mix of the two? How do you decide which one is the "most right"? This is where a graded activation function would help.

What if we had a function that could tell us *how* right each neuron is: 20% right, 87% right, 99% right, and so on?

The first thing that comes to mind is a linear function, where the activation is simply proportional to the input (A = cx):

This gives a range of activations, and given many neurons we can pick the max (or min, or something else).
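A quick sketch of this idea (the slope c = 0.5 is a hypothetical choice, not anything standard):

```python
def linear_activation(y, c=0.5):
    # c is a hypothetical slope; the activation is simply proportional to the input
    return c * y

# Unlike the step function, outputs are graded, so neurons can be compared:
outputs = [linear_activation(y) for y in (0.2, 1.8, 0.9)]
print(outputs)       # [0.1, 0.9, 0.45]
print(max(outputs))  # 0.9 -- the "most activated" neuron wins
```

Now we can rank neurons instead of only seeing an undifferentiated wall of 1s and 0s.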

The problem is that the derivative of a linear function is a constant. During back-propagation the gradient therefore carries no information about the input: the updates are the same everywhere, so the descent can never slow down and settle into that "minimum". It just keeps creeping toward either +inf or -inf.
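The constant-derivative problem is easy to demonstrate numerically: using a central-difference approximation, the slope of A = cx comes out the same everywhere, no matter what the input is. (Both `linear_activation` and `numerical_derivative` are illustrative helpers, not library functions.)

```python
def linear_activation(y, c=0.5):
    # hypothetical slope c, as in the earlier sketch
    return c * y

def numerical_derivative(f, x, h=1e-6):
    """Central-difference estimate of df/dx at x."""
    return (f(x + h) - f(x - h)) / (2 * h)

# The gradient is 0.5 at every point -- it never depends on the input,
# so back-propagation gets no signal about where the minimum lies.
for x in (-10.0, 0.0, 10.0):
    print(round(numerical_derivative(linear_activation, x), 4))  # 0.5 each time
```

Compare this with a non-linear activation, where the derivative changes with the input and so can shrink as the network approaches a minimum.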

Let’s now look at a function that can give us a range of results but is non-linear.