Computer vision tasks primarily involve processing static images (or sequences of them, such as the frames of a video), whereas biological vision has been shown to process and emit far fewer signals, mainly about changes occurring in the environment at a given point in time. In simple words, the cells in your eye only convey information to the brain when they detect a change in the scene (an event) and report nothing at all when no change is detected. This key characteristic of biological vision systems allows attention to be focused selectively on the salient portions of the scene, drastically reducing the amount of information that needs to be processed. Take the frames captured from the video below as an example.
In conventional sensors, data is conveyed in frames, which means that everything present in the image is processed, including the sky, trees, and grass, while the only important information is actually the movement of the person, the swing of the golf club, and the movement of the ball. To avoid this overprocessing of irrelevant information, event-based sensors were introduced. Rather than reading every single pixel and sending out frames at a constant rate, event-based sensors send out data packets, or events, from each pixel asynchronously whenever a local brightness change is detected at that pixel. Such event-based sensing allows us to perform some vision tasks extremely efficiently, reducing the amount of required computation, transmitted data, and power consumption. Researchers have also shown that collecting statistics from event-based sensors could pave the way to full visual reconstruction. This is also where spiking neural networks step in.
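To make the difference between frames and events concrete, here is a minimal Python sketch that imitates an event-based sensor using ordinary frames. It is only an illustration, not how a real sensor is implemented: the function name `frames_to_events`, the `threshold` value, and the tiny 4x4 "video" are all assumptions made up for this example.

```python
import numpy as np

def frames_to_events(frames, threshold=0.2):
    """Return (t, y, x, polarity) events for pixels whose brightness changed.

    Hypothetical helper: a pixel emits an event only when its brightness
    differs from the value at its last event by at least `threshold`.
    """
    frames = np.asarray(frames, dtype=float)
    reference = frames[0].copy()          # brightness at each pixel's last event
    events = []
    for t in range(1, len(frames)):
        diff = frames[t] - reference
        ys, xs = np.nonzero(np.abs(diff) >= threshold)
        for y, x in zip(ys, xs):
            polarity = 1 if diff[y, x] > 0 else -1   # brighter or darker
            events.append((t, int(y), int(x), polarity))
            reference[y, x] = frames[t, y, x]        # update only this pixel
    return events

# A static grey background with one bright dot moving one pixel to the right.
frames = np.full((3, 4, 4), 0.5)
frames[1, 1, 1] = 1.0
frames[2, 1, 2] = 1.0

events = frames_to_events(frames)
print(events)       # [(1, 1, 1, 1), (2, 1, 1, -1), (2, 1, 2, 1)]
print(frames.size)  # 48 pixel values in the frame-based representation
```

In this toy scene, the frame-based representation carries 48 pixel values, while the event-based one carries only three events, all of them about the moving dot.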
As mentioned in the last article, the picture above depicts a biological neuron and how neurons communicate with one another via action potentials (which produce what are known as ‘spikes’). A collection of spikes through time is known as a spike train, as shown in the image below. A spike train can be thought of as a collection of data that is, in this case, a function of time.
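One simple way to picture a spike train in code is as an array of zeros and ones over time, where a one means "the neuron spiked in this time bin". The sketch below assumes a 1 ms time bin; the helper name `spike_train` and the spike times are invented for illustration.

```python
import numpy as np

def spike_train(spike_times_ms, duration_ms, dt_ms=1.0):
    """Turn a list of spike times into a binary array with one entry per time bin."""
    n_bins = int(round(duration_ms / dt_ms))
    train = np.zeros(n_bins, dtype=np.uint8)
    bins = (np.asarray(spike_times_ms) / dt_ms).astype(int)
    train[bins] = 1
    return train

duration_ms = 10
train = spike_train([2, 3, 7], duration_ms)
print(train)  # [0 0 1 1 0 0 0 1 0 0]

# Average firing rate in spikes per second (Hz) over the whole window.
rate_hz = train.sum() * 1000.0 / duration_ms
```

Both the exact timing of the ones and the average rate can carry information, which is why a spike train is naturally treated as a function of time.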
In traditional ANNs, the non-spiking neurons (see Fig 1.1) use differentiable, non-linear activation functions to propagate information between units, which allows units to be stacked into multiple layers.
The differentiability of these neurons is also what makes learning through backpropagation with gradient-based optimization possible. The main difference between traditional ANNs and SNNs is that SNNs adopt “spiking neurons”, which use pulses, or “spikes”, as the means of communication, propagating information between units over time in a brain-like manner instead of using continuous activation values (see Fig 1.2 and 1.3). This spatio-temporal property (that is, involving both space and time) of spiking neurons is also what makes SNNs one of the most promising candidates for processing the temporally dynamic visual data that event-based sensors capture as a function of time. SNNs have also been applied to classical frame-based machine vision tasks such as object recognition and detection, where they have been shown to be accurate, fast, and efficient, especially when run on neuromorphic hardware.
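To see what "communicating with spikes over time" looks like in practice, here is a rough sketch of a leaky integrate-and-fire (LIF) neuron, one commonly used simplified spiking neuron model. The article does not commit to a particular neuron model, so treat this choice, along with the parameter values (`threshold`, `leak`, `v_reset`), as assumptions for illustration only.

```python
import numpy as np

def lif_neuron(input_current, threshold=1.0, leak=0.9, v_reset=0.0):
    """Simulate a discrete-time leaky integrate-and-fire neuron.

    The membrane potential decays by `leak` each step, accumulates the input,
    and emits a spike (then resets) whenever it reaches `threshold`.
    """
    v = 0.0
    spikes = np.zeros(len(input_current), dtype=np.uint8)
    for t, i_t in enumerate(input_current):
        v = leak * v + i_t           # leak, then integrate the input
        if v >= threshold:
            spikes[t] = 1            # all-or-nothing output
            v = v_reset
    return spikes

weak = lif_neuron(np.full(20, 0.05))    # input too weak to ever reach threshold
strong = lif_neuron(np.full(20, 0.4))   # stronger input produces regular spikes

print(weak.sum())    # 0
print(strong.sum())  # 6
```

Unlike a conventional unit, which would output a continuous number at every step, this neuron stays completely silent for weak input and only signals when enough input has built up over time.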
SNNs were initially developed to shed some light on the computing dynamics of the brain. Interestingly, from an engineering standpoint, SNNs also hold apparent advantages over traditional neural networks in speed and power consumption when implemented on neuromorphic hardware platforms, which could resolve the power-consumption issue faced by CNNs. This is due to the unique nature of these networks, in which output spike trains can be made sparse in time. Since each spike consumes energy, using few spikes that each carry high information content can effectively lower the total energy consumption. Neuromorphic systems and hardware designs are also based on this spiking property, and together with SNNs they could play a key role in the progression of next-generation artificial intelligence.
Since SNNs have been shown to be effective at processing sensor information in real time, they could become extremely beneficial in dynamic visual systems such as autonomous vehicles. There, they could improve emergency brake assistants, for which challenging weather conditions and suddenly appearing vehicles or pedestrians are the main risk factors during high-speed maneuvering.
Aside from selective attention models, various SNN models have been proposed in recent years to solve object recognition tasks. These include hybrid types such as the Convolution Spiking Neural Network, which adopts conversion algorithms on a conventional CNN, converting its weights for use with spike-signal inputs that have leaks and refractory periods. The main idea behind this hybrid architecture is to replace the CNN classifier unit with a spiking neuron whose firing rate is correlated with the output of that unit (shown in the image above).
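The idea of a spiking neuron whose firing rate tracks the output of a conventional unit can be sketched in a few lines. The example below is a deliberately simplified assumption, not the conversion algorithm of any specific paper: it uses a plain integrate-and-fire neuron with no leak and no refractory period, and it only works as intended for activations between 0 and 1, since the neuron can fire at most once per step. The function name `if_firing_rate` is hypothetical.

```python
def if_firing_rate(activation, n_steps=100, threshold=1.0):
    """Drive an integrate-and-fire neuron with a constant input and return its firing rate.

    The input is the (non-negative part of the) activation of a conventional unit.
    After each spike the threshold is subtracted from the membrane potential,
    so no accumulated input is lost.
    """
    v = 0.0
    n_spikes = 0
    for _ in range(n_steps):
        v += max(activation, 0.0)    # negative activations contribute nothing, like ReLU
        if v >= threshold:
            n_spikes += 1
            v -= threshold
    return n_spikes / n_steps

for a in (-0.3, 0.25, 0.5, 0.75):
    print(a, if_firing_rate(a))
# -0.3 0.0
# 0.25 0.25
# 0.5 0.5
# 0.75 0.75
```

The fraction of time steps in which the neuron fires matches the activation value, so a larger output from the original unit simply becomes a busier spike train.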
Despite this promising potential, SNNs have a very challenging drawback in practice: they have proven difficult to train, especially once the network becomes multi-layer. One of the reasons is the lack of effective training and learning algorithms, since the spike function adopted by the neurons is non-differentiable. Backpropagation, by contrast, relies on the differentiability of the neurons to train ANNs in a supervised manner, and it is what made the CNN one of the most powerful object classification and recognition tools to date, if not the most powerful. Many researchers believe that the performance of SNNs can be improved to catch up with that of ANNs by embedding deep architectures into the network (Machado, Cosma, and McGinnity, 2019; Tavanaei et al., 2019; Xu, 2019). To close this gap between continuous-valued ANNs and SNNs, there is a crucial need to develop learning methods that can support deep (multi-layer) SNNs with error rates as low as those of their conventional counterparts. Successful approaches have been demonstrated, including direct training of SNNs using backpropagation and applying stochastic gradient descent to the SNN classifier layers (Stromatias et al., 2017). Spike-Timing Dependent Plasticity (STDP), a learning rule inspired by the plasticity mechanisms of the brain that can be applied in both supervised and unsupervised settings, has also been studied extensively due to its biologically plausible nature and the possibility of low-power, on-chip local learning.
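As a rough illustration of how a local, timing-based rule like STDP can work, here is a sketch of one common textbook form, the pair-based rule with exponential windows. This is an assumption chosen for simplicity rather than the rule used in the cited works, and the function name `stdp_delta_w` and the constants (`a_plus`, `a_minus`, `tau_ms`) are made-up illustrative values.

```python
import math

def stdp_delta_w(t_pre_ms, t_post_ms, a_plus=0.01, a_minus=0.012, tau_ms=20.0):
    """Weight change for a single pre/post spike pair under pair-based STDP.

    If the presynaptic spike comes before the postsynaptic spike, the synapse
    is strengthened; if it comes after, the synapse is weakened. The effect
    fades exponentially as the two spikes get further apart in time.
    """
    dt = t_post_ms - t_pre_ms
    if dt > 0:
        return a_plus * math.exp(-dt / tau_ms)    # pre before post: potentiation
    if dt < 0:
        return -a_minus * math.exp(dt / tau_ms)   # post before pre: depression
    return 0.0

causal = stdp_delta_w(t_pre_ms=10, t_post_ms=15)        # positive change
anti_causal = stdp_delta_w(t_pre_ms=15, t_post_ms=10)   # negative change
distant = stdp_delta_w(t_pre_ms=10, t_post_ms=60)       # positive but much smaller
```

Notice that the update only needs the spike times of the two neurons on either side of the synapse and no derivative of the spike function, which is part of why this kind of rule is attractive for on-chip local learning.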
As one can see, we are now one step closer to achieving biological-like vision. We have come a long way from a simple neural network to the CNN and, eventually, to the SNN. While the CNN is well suited to object recognition in static images, it lacks the dynamic nature needed to process real-time, time-dependent data from newly developed event-based sensors, which makes the SNN a more promising candidate for real-time object recognition and processing tasks. Moreover, many studies have shown that SNNs have the potential to replace power-hungry CNNs, as spiking algorithms can be implemented on neuromorphic systems.
I hope everyone who has read through the series now has a better idea of the progression of computer vision and how neuroscience has greatly contributed to the breakthroughs in such a fascinating field (and will continue to do so). In the next article, I will dive deeper into the technical properties of Spiking Neural Networks, including the encoding and learning rules. I have left a list of references that I used in this article, which could serve as additional reading for those who are interested. Do share my article if you find it useful. See you next time!