A human neuron takes some input (stimuli), performs a simple computation and returns an output (activation). In modern neural networks, multiple perceptrons make up a single layer of the network, and multiple layers form the network (similar to what we saw in the visual cortex) (Gallistel & King, 2009). This is, of course, a very high-level view of the structure of a neural network.
Basically, a neuron takes an input signal through its dendrites, processes it in the soma (much like a CPU), and passes the output through a cable-like structure to other connected neurons (axon to synapse to the next neuron’s dendrite). This picture is biologically oversimplified, as there is a lot more going on, but at a high level, once a neuron takes an input, it processes it and delivers an output.
Our sense organs interact with the outside world and send visual and auditory information to the neurons. Say you are watching Friends. The information your brain receives is taken in by the “laugh or not” set of neurons that help you decide whether to laugh. Each neuron fires (is activated) only when its respective criterion is met.
Although artificial neurons and perceptrons were inspired by the biological processes scientists were able to observe in the brain back in the 1950s, they differ from their biological counterparts in several ways, beyond just the materials they are made of. Birds inspired flight and horses inspired locomotives and cars, yet none of today’s transportation vehicles resemble metal skeletons of living, breathing, self-replicating animals. Still, our limited machines are even more powerful in their own domains (and thus more useful to us humans) than their animal “ancestors” could ever be.
The idea behind perceptrons (the predecessors of artificial neurons) is that it is possible to mimic certain parts of neurons, such as dendrites, cell bodies and axons, using simplified mathematical models built on the limited knowledge we have of their inner workings: signals can be received from the dendrites and sent down the axon once enough signals have been received. The outgoing signal can then serve as input to other neurons, repeating the process. Some signals are more important than others and can trigger some neurons to fire more easily. Connections can grow stronger or weaker, and new connections can appear while others cease to exist. We can mimic most of this process with a function that receives a list of weighted input signals and outputs a signal if the sum of these weighted inputs reaches a certain bias. Note that this simplified model mimics neither the creation nor the destruction of connections (dendrites or axons) between neurons, and ignores signal timing. Even so, it suffices only for simple classification tasks.
A weight is a connection between neurons that carries a value. The higher the value, the more significance is attached to the connection. Mathematically, all these weights can be represented as a matrix. For example, if a layer L has N neurons and is followed by M neurons in the next layer L+1, the weight matrix is an N-by-M matrix (N rows and M columns).
The bias is also a weight. Imagine you’re thinking through a situation, trying to make a decision. You have to weigh all the possible (or observable) factors. But what about the parameters you haven’t come across, or the factors you haven’t considered? In a neural network, all of these unforeseen or non-observable factors are rolled into the bias. Like the weights, the biases can also be viewed as vectors.
The process by which a neuron decides its output is referred to as the activation. We represent it as f(z), where z is the aggregation of all the inputs. There are two broad categories of activation, linear and non-linear. If f(z) = z, we say f is a linear activation (i.e., nothing happens to z).
Now that you know the basics, it’s time to do the math. Remember this?

f(w1·x1 + w2·x2 + … + wn·xn + b)

See everything in the parentheses? Call that your z. Then:
· b = bias
· x = input to neuron
· w = weights
· n = the number of inputs from the incoming layer
· i = a counter over the n inputs (from 1 to n)
The only neurons that have values attached to them at the start are the input neurons of the input layer (their values come from the data we are using to train the network). So, how does this work?
1. Multiply every incoming neuron by its corresponding weight.
2. Add the values up.
3. Add the bias term for the neuron in question.
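These three steps can be sketched in a few lines of Python (the input values, weights and bias below are made up for illustration):

```python
# Evaluate z for a single neuron: the weighted sum of its inputs plus a bias.
def neuron_z(inputs, weights, bias):
    # Steps 1 and 2: multiply every incoming value by its weight and add them up.
    z = sum(w * x for w, x in zip(weights, inputs))
    # Step 3: add the bias term for the neuron in question.
    return z + bias

z = neuron_z(inputs=[1.0, 0.5], weights=[0.2, -0.4], bias=0.1)
print(z)  # 0.2*1.0 + (-0.4)*0.5 + 0.1 = 0.1
```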
That’s all for evaluating z for one neuron. But imagine doing this for every neuron (of which you may have thousands) in every layer (of which you might have hundreds); it would take forever. So here’s the trick we use:
Remember the matrices (and vectors) we talked about? Here is when we get to use them. Follow these steps:
1. Create the weight matrix from one layer to the next as described earlier; e.g. an N-by-M matrix.
2. Create an M-by-1 matrix from the biases.
3. View your input layer as an N-by-1 matrix (a column vector of size N, just as the bias is a column vector of size M).
4. Transpose the weight matrix; now we have an M-by-N matrix.
5. Find the dot product of the transposed weights and the input. The dot product of an M-by-N matrix and an N-by-1 matrix is an M-by-1 matrix.
6. Add the output of step 5 to the bias matrix (they will have the same size if you did everything right).
7. Finally, you have the values of the next layer’s neurons: an M-by-1 matrix (a vector of size M).
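The seven steps above can be sketched with NumPy (the layer sizes and random values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 4, 3                      # N neurons in layer L, M neurons in layer L+1

W = rng.standard_normal((N, M))  # step 1: the N-by-M weight matrix
b = rng.standard_normal((M, 1))  # step 2: the M-by-1 bias matrix
x = rng.standard_normal((N, 1))  # step 3: the input layer as an N-by-1 vector

z = W.T @ x + b                  # steps 4-6: transpose, dot product, add bias
print(z.shape)                   # step 7: (3, 1) -- an M-by-1 matrix
```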
Invented by Frank Rosenblatt, the perceptron was originally intended to be custom-built hardware rather than a software function. The Mark 1 Perceptron, funded by the US Navy, was aimed at image recognition tasks. However, its shortcomings were quickly realized: a single layer of perceptrons alone is unable to solve non-linear classification problems. This can only be overcome (more complex relationships in the data can only be modeled) by using multiple layers (hidden layers). However, there was no simple, cheap way to train multiple layers of perceptrons other than randomly nudging all their weights, because there is no way to tell which small set of changes would end up largely affecting other neurons’ outputs down the line. This deficiency caused artificial neural network research to stagnate for years. Then a new kind of artificial neuron managed to solve this issue by slightly changing certain aspects of the model, which allowed multiple layers to be connected without losing the ability to train them. Instead of working as a switch that could only receive and output binary signals (a perceptron gets either 0 or 1 depending on the absence or presence of a signal, and likewise outputs either 0 or 1 depending on whether a certain threshold of combined, weighted inputs is reached), artificial neurons utilize continuous (floating-point) values with continuous activation functions (more on these functions later).
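The original binary perceptron described above can be sketched as a step function over weighted binary inputs; the weights and threshold here are chosen, for illustration, to make it compute a logical AND:

```python
# A Rosenblatt-style perceptron: binary inputs, weighted sum, hard threshold.
def perceptron(inputs, weights, threshold):
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0  # fire only if the threshold is reached

# With these weights and this threshold, it computes logical AND:
for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", perceptron([a, b], weights=[1, 1], threshold=2))
```

No choice of weights and threshold makes a single perceptron compute XOR, which is exactly the kind of non-linear classification problem mentioned above.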
This might not look like much of a difference, but thanks to this slight change in the model, layers of artificial neurons could be used in mathematical formulas as separate, continuous functions for which an optimal set of weights (estimated by calculating the partial derivatives of the error one by one) could be computed. This tiny change made it possible to teach multiple layers of artificial neurons using the backpropagation algorithm. In other words, artificial neurons don’t just “fire”: they send continuous values rather than binary signals. Depending on their activation functions, they might fire somewhat all the time, but the strength of these signals varies. Note that the term “multilayer perceptron” is actually inaccurate, as these networks utilize layers of artificial neurons rather than layers of perceptrons. Yet teaching these networks was so computationally expensive that people rarely used them for machine learning tasks until recently (when large amounts of example data became easier to come by and computers got orders of magnitude faster). Since artificial neural networks were hard to teach and aren’t faithful models of what actually goes on inside our heads, most scientists long regarded them as dead ends in machine learning. The hype returned in 2012, when the deep neural network architecture AlexNet solved the ImageNet challenge (a large visual dataset with over 14 million hand-annotated images) without relying on the handcrafted, minutely extracted features that had been the norm in computer vision up to that point. AlexNet beat its competition by miles, paving the way for neural networks to become relevant once again.
Now that we have a rudimentary understanding of the general processing mechanism within the brain, let us move on to how visual processing is done within the brain.
The human eye has always fascinated scientists, as it has long been recognized as a remarkable visual instrument. Yet, while the field of optics may have explained its basics, scientists remained fascinated by how the brain makes sense of the world by means of the eyes. To probe that mystery, in the late 1950s two professors in the field, David Hubel and Torsten Wiesel, conducted an experiment by inserting electrodes into a cat’s visual cortex to observe individual neurons (Hubel & Wiesel, 1959).
During these recordings of neurons in the visual cortex of a cat, made while moving a bright line across its retina, Hubel and Wiesel noticed the following (Hubel & Wiesel, 1959):
(1) The neurons fired only when a line was present on the retina.
(2) The neurons’ activity changed depending on the orientation of the line.
(3) The neurons fired only when the line was moving in a particular direction.
These classic experiments by Hubel and Wiesel provided evidence for the following:
– Existence of a topographical map in the visual cortex that represents the visual field, where nearby cells process information from nearby visual fields.
– The arrangement of neurons in the visual cortex in a precise architecture.
– Cells with similar functions being organized into columns similar to computational machines that relay information to a higher region of the brain to encode image features.
Hubel and Wiesel discovered that each neuron was tuned to respond to only one kind of stimulus (say, a straight vertical line) moving (or stationary) in one direction. They further observed that the firing of the neurons changed as the orientation of the stimulus was changed. This understanding of individual neurons as looking out for different stimuli in different orientations was later used in building convolutional neural networks (Hubel & Wiesel, 1959).
In light of this information, some main differences should be noted between biological neurons and perceptrons:
1. Size: Our brain contains about 86 billion neurons and more than 100 trillion (or, by some estimates, 1,000 trillion) synapses (connections). The number of “neurons” in artificial networks is far smaller (usually in the ballpark of 10–1000), but comparing the numbers this way is misleading. Perceptrons just take inputs on their “dendrites” and generate outputs on their “axon branches”. A single-layer perceptron network consists of several perceptrons that are not interconnected: they all just perform this very same task at once. Deep Neural Networks usually consist of input neurons (as many as the number of features in the data), output neurons (as many as the number of classes, if they are built to solve a classification problem) and neurons in the hidden layers in between. The layers are usually (but not necessarily) fully connected to the next layer, meaning that artificial neurons usually have as many connections as there are artificial neurons in the preceding and following layers combined. Convolutional Neural Networks also use different techniques to extract features from the data that are more sophisticated than what a few interconnected neurons can do alone. Manual feature extraction (altering the data so that it can be fed to machine learning algorithms) requires human brain power, which is also not taken into account when summing up the number of “neurons” required for Deep Learning tasks. The limitation in size isn’t just computational: simply increasing the number of layers and artificial neurons does not always yield better results in machine learning tasks.
2. Topology: All artificial layers compute one by one, instead of being part of a network whose nodes compute asynchronously. Feedforward networks compute the state of one layer of artificial neurons and their weights, then use the results to compute the following layer the same way. During backpropagation, the algorithm works in the opposite direction, computing changes to the weights that reduce the difference between the feedforward results at the output layer and the expected output values.
Layers aren’t connected to non-neighboring layers, though loops can be somewhat mimicked with recurrent and LSTM networks. In biological networks, neurons can fire asynchronously and in parallel, and the network has a small-world nature, with a small portion of highly connected neurons (hubs) and a large number of less connected ones (the degree distribution at least partly follows a power law). Since artificial neuron layers are usually fully connected, this small-world nature of biological neurons can only be simulated by introducing zero weights to mimic the lack of a connection between two neurons.
3. Speed: Certain biological neurons can fire around 200 times a second on average. Signals travel at different speeds depending on the type of nerve impulse, ranging from 0.61 m/s up to 119 m/s. Signal travel speeds also vary from person to person depending on sex, age, height, temperature, medical condition, lack of sleep, etc. Action potential frequency carries information in biological neural networks: information is carried by the firing frequency or the firing mode (tonic or burst firing) of the output neuron, and by the amplitude of the incoming signal at the input neuron. Information in artificial neurons is instead carried by the continuous, floating-point values of the synaptic weights. How quickly the feedforward or backpropagation algorithms are calculated carries no information, other than making the execution and training of the model faster. Artificial neurons do not experience “fatigue”: they are functions that can be calculated as many times and as fast as the computer architecture allows. Since artificial neural network models boil down to a bunch of matrix operations and derivative calculations, running such calculations can be highly optimized for vector processors (doing the very same calculations on large amounts of data points over and over again) and sped up by orders of magnitude using GPUs or dedicated hardware (like the AI chips in recent smartphones).
4. Fault-tolerance: Biological neuron networks, due to their topology, are also fault-tolerant. Information is stored redundantly, so minor failures will not result in memory loss. They don’t have one “central” part. The brain can also recover and heal to an extent. Artificial neural networks are not modeled for fault tolerance or self-regeneration (like fatigue, these ideas are not applicable to matrix operations), though recovery is possible by saving the current state (the weight values) of the model and continuing training from that saved state. Dropout turns random neurons in a layer on and off during training, mimicking unavailable paths for signals and forcing some redundancy (dropout is actually used to reduce the chance of overfitting).
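Dropout can be sketched as randomly zeroing activations during training; the keep probability and the scaling of the survivors (the common “inverted dropout” formulation) are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

def dropout(activations, keep_prob=0.8):
    # Keep each activation with probability keep_prob; scale the survivors
    # by 1/keep_prob so the expected total signal stays the same.
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

dropped = dropout(np.ones(10))
print(dropped)  # some entries zeroed out, the rest scaled up to 1.25
```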
Trained models can be exported and used on any device that supports the framework, meaning that the same artificial neural network model will yield the same outputs for the same input data on every device it runs on. Training artificial neural networks for longer periods of time will not affect the efficiency of the artificial neurons, although the hardware used for training can wear out quickly if used regularly and will need to be replaced. Another difference is that all processes (states and values) inside an artificial neural network can be closely monitored.
5. Power consumption: The brain consumes about 20% of all the human body’s energy; despite its large cut, an adult brain operates on about 20 watts (barely enough to dimly light a bulb), making it extremely efficient. Computers, by contrast, generate a lot of heat when used, with consumer GPUs operating safely between 50 and 80 °C, compared to the body’s 36.5–37.5 °C.
6. Signals: An action potential is either triggered or not: biological synapses either carry a signal or they don’t. Perceptrons work somewhat similarly, accepting binary inputs, applying weights to them and generating binary outputs depending on whether the sum of these weighted inputs has reached a certain threshold (a step function). Artificial neurons instead accept continuous values as inputs and apply a simple, non-linear, easily differentiable function (an activation function) to the sum of their weighted inputs to restrict the outputs’ range of values. The activation functions are non-linear so that multiple layers could, in theory, approximate any function. Formerly, sigmoid and hyperbolic tangent functions were used as activation functions, but these networks suffered from the vanishing gradient problem, meaning that the more layers a network has, the less changes in the first layers affect the output, because these functions squash their inputs into a very small output range. These problems were overcome by the introduction of different activation functions such as ReLU.
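A sketch of the activation functions mentioned here, together with their derivatives, shows where the vanishing gradient comes from: the sigmoid’s derivative never exceeds 0.25, so stacking many sigmoid layers multiplies many small numbers together, while ReLU’s derivative is 1 for any positive input:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # peaks at 0.25 when z = 0

def relu(z):
    return max(0.0, z)

def relu_deriv(z):
    return 1.0 if z > 0 else 0.0  # no squashing for positive inputs

print(sigmoid_deriv(0.0))  # 0.25 -- the largest value it ever takes
print(relu_deriv(5.0))     # 1.0
```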
The final outputs of these networks are usually also squashed between 0 and 1 (representing probabilities for classification tasks) instead of being binary signals. As mentioned earlier, neither the frequency/speed of the signals nor the firing rates carry any information for artificial neural networks (that information is carried by the input weights instead). The timing of the signals is synchronous: artificial neurons in the same layer receive their input signals and then send their output signals all at once. Loops and time deltas can only be partly simulated with recurrent (RNN) layers (which suffer greatly from the aforementioned vanishing gradient problem) or with long short-term memory (LSTM) layers, which act more like state machines or latch circuits than neurons. These are all considerable differences between biological and artificial neurons.
7. Learning: We still do not understand how brains learn, or how redundant connections store and recall information. Brain fibers grow and reach out to connect to other neurons, neuroplasticity allows new connections to be created or areas to move and change function, and synapses may strengthen or weaken based on their importance. Neurons that fire together wire together (although this is a very simplified theory and should not be taken too literally). By learning, we build on information that is already stored in the brain. Our knowledge deepens by repetition and during sleep, and tasks that once required focus can be executed automatically once mastered. Artificial neural networks, on the other hand, have a predefined model, where no further neurons or connections can be added or removed.
During training, only the weights of the connections and the biases are subject to change. The network starts with random weight values and slowly tries to reach a point where further changes to the weights would no longer improve performance. Just as real-life problems have many solutions, there is no guarantee that the network’s weights will be the best possible arrangement for a problem; they will only represent one of infinitely many approximations to infinitely many solutions.
Learning can be understood as the process of finding optimal weights to minimize the difference between the network’s expected and generated output: changing the weights one way would increase this error, changing them the other way would decrease it. Imagine a foggy mountain top, where all we can tell is that stepping in a certain direction would take us downhill. By repeating this process, we eventually reach a valley where any further step would only take us higher. Once this valley is found, we can say we have reached a local minimum. Note that there may be other, better valleys that lie even lower (the global minimum) that we missed, since we could not see them. Doing this in (usually far more than three) dimensions is called gradient descent. To speed up this “learning process”, instead of going through each and every example every time, random samples (batches) are taken from the data set and used for training iterations. This only gives an approximation of how to adjust the weights to reach a local minimum (finding which direction to take downhill without carefully looking in all directions all the time), but it is still a pretty good approximation.
We can also take larger steps when setting off from the top and smaller ones as we approach a valley, where even small nudges could take us the wrong way. Walking downhill like this, going faster rather than carefully planning each and every step, is called stochastic gradient descent. So the rate at which artificial neural networks learn can change over time (it decreases to ensure better performance), but there are no periods similar to human sleep phases when the networks learn better. There is no neural fatigue either, although GPUs overheating during training can reduce performance. Once trained, an artificial neural network’s weights can be exported and used to solve problems similar to the ones found in the training set. Training (backpropagation using an optimization method like stochastic gradient descent, over many layers and examples) is extremely expensive, but using a trained network (a simple feedforward calculation) is ridiculously cheap.
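The downhill walk described above can be sketched for a single weight; the toy error surface E(w) = (w - 3)^2 and the learning rate are made up for illustration:

```python
# Gradient descent on a toy error surface E(w) = (w - 3)^2,
# whose single valley (minimum) sits at w = 3.
def grad(w):
    return 2.0 * (w - 3.0)  # derivative of (w - 3)^2

w = 0.0   # start from a "random" initial weight
lr = 0.1  # learning rate: how large a downhill step we take
for _ in range(100):
    w -= lr * grad(w)  # step against the gradient, i.e. downhill

print(round(w, 4))  # 3.0 -- we have reached the valley
```

In higher dimensions the "direction downhill" is the vector of partial derivatives of the error with respect to every weight, but the loop looks the same.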
Unlike the brain, artificial neural networks don’t learn by recalling information: they only learn during training, but will always “recall” the same learned answers afterwards, without making a mistake. The great thing about this is that “recalling” can be done on much weaker hardware, as many times as we want. It is also possible to use previously pretrained models (to save time and resources by not having to start from a totally random set of weights) and improve them by training on additional examples that share the same input features. This is somewhat similar to how it’s easier for the brain to learn certain things (like faces) by having dedicated areas for processing certain kinds of information.
So artificial and biological neurons do differ in more ways than the materials of their environment — biological neurons have only provided an inspiration to their artificial counterparts, but they are in no way direct copies with similar potential. If someone calls another human being smart or intelligent, we automatically assume that they are also capable of handling a large variety of problems, and are probably polite, kind and diligent as well. Calling a software intelligent only means that it is able to find an optimal solution to a set of problems.