Many people are familiar with the term, Deep Learning, as it has gained widespread attention as a reliable way to tackle difficult and computationally expensive problems. However, especially among newcomers to the field, there is little concern for how these systems were originally developed. Without this context, it is sometimes difficult to decide which specific framework or architecture is required for a particular application. In this post will learn the difference between a deep learning RNN vs CNN.
Modern-day deep learning systems are based on the Artificial Neural Network (ANN), which is a system of computing that is loosely modeled on the structure of the brain.
Neural networks (NN) are not stand-alone computing algorithms. Rather, they represent a structure or framework, that is used to combine machine learning algorithms for the purpose of solving specific tasks.
Any neural network is basically a collection of neurons and connections between them. Neuron is a function with a bunch of inputs and one output. Its task is to take all numbers from its input, perform a function on them and send the result to the output.
Here is an example of a simple but useful in real life neuron: sum up all numbers from the inputs and if that sum is bigger than N — give 1 as a result. Otherwise — zero.
Connections are like channels between neurons. They connect outputs of one neuron with the inputs of another so they can send digits to each other. Each connection has only one parameter — weight. It’s like a connection strength for a signal. When the number 10 passes through a connection with a weight 0.5 it turns into 5.
These weights tell the neuron to respond more to one input and less to another. Weights are adjusted when training — that’s how the network learns. Basically, that’s all there is to it.
To prevent the network from falling into anarchy, the neurons are linked by layers, not randomly. Within a layer, neurons are not connected, but they are connected to neurons of the next and previous layers. Data in the network goes strictly in one direction — from the inputs of the first layer to the outputs of the last.
If you throw in a sufficient number of layers and put the weights correctly, you will get the following: by applying to the input, say, the image of handwritten digit 4, black pixels activate the associated neurons, they activate the next layers, and so on and on, until it finally lights up the exit in charge of the four. The result is achieved.
When doing real-life programming nobody is writing neurons and connections. Instead, everything is represented as matrices and computation based on matrix multiplication for better performance.
While the traditional NN proved successful in many tasks, recognition of its true strength began with the introduction of very large amounts of data and the computing power required to process it. Essentially, deep learning systems are very large neural networks that are trained using considerable volumes of data. With the realization that these systems had vast and untapped potential, the composition of the underlying structure became an important research topic.
An important milestone in the history of deep learning was the introduction of the Recurrent Neural Network (RNN), which constituted a significant change in the makeup of the framework.
The RNN uses an architecture that is not dissimilar to the traditional NN. The difference is that the RNN introduces the concept of memory, and it exists in the form of a different type of link. Unlike a feedforward NN, the outputs of some layers are fed back into the inputs of a previous layer. This addition allows for the analysis of sequential data — music, text or voice, which is something that the traditional NN is incapable of. Also, traditional NNs are limited to a fixed-length input, whereas the RNN has no such restriction.
The inclusion of links between layers in the reverse direction allows for feedback loops, which are used to help learn concepts based on context. In terms of what this can do for a deep learning application, it depends very much on the requirements (or goal) and the data.
For time-series data that contains repeated patterns, the RNN is able to recognize and take advantage of the time-related context. This is accomplished by applying more weight to patterns where the previous and following tokens are recognized, as opposed to being evaluated in isolation. For example, consider a system that is learning to recognize spoken language. Such a system would benefit greatly by taking into account recently spoken words to predict the next sentence.
On this topic, a popular framework for learning sequence data is called the Long Short-Term Memory Network (LSTM). This is a type of RNN that is capable of learning long-term relationships. If, for example, the prediction of the next word in an autocomplete task is dependent on context from much earlier in the sentence, or paragraph, then the LSTM is designed to assist with this. LSTMs have also achieved success in acoustic modeling and part-of-speech tasks.
RNNs come in different varieties that are also typically dependent on the task. The type of RNN is described by the number of inputs in relation to the number of outputs. The four different types are:
- One-to-one, which is informally known as the Vanilla RNN. This variety has one input, such as a word or an image, and outputs a single token, such as a word or a Boolean value.
- One-to-many, where one input is used to create several outputs.
- Many-to-one, where several inputs are used to create a single output.
- Many-to-many, where several inputs are analyzed to generate several outputs.
To understand which of these is best suited for a particular job, it is worthwhile to review some of the applications for which RNN vs CNN is effective.
RNN is applied successfully in many types of tasks. Some of these include:
- Image classification, where an image is examined and a single determination is made, such as “Daytime picture” versus “Nighttime picture”.
This is an example of a one-to-one mapping.
- Image captioning, where an image is automatically given a caption based on what is being shown. This can include complex actions, such as: “Fox jumping over dog”.
This task requires a one-to-many RNN, where the input is a single image and the output is a phrase consisting of several words.
- Time series prediction, such as the forecasting of a stock price given a history of values.
This task involves using a many-to-one RNN, where many previous stock prices are used to predict a single, future price.
- Automatic language translation, where the written text or spoken words of one language serves as input, and a different language representing the same text is output.
This is an example of a many-to-many RNN, where several words are analyzed as input, and the output is also several words in length.
- Natural language processing, such as sentiment analysis in social media posts.
Depending on the complexity of the sentiment, this RNN may be of type many-to-one or many-to-many. If, for example, the sentiment is “positive” or “negative”, then there is only a single output. However, if the sentiment is more complex and combined with additional modifiers such as “positive and funny”, “negative and angry”, or “negative and funny” then it would be suitable for a many-to-many implementation.
- Text classification and sentence completion. RNN is used broadly in text classification, outperforming other well-known algorithms such as the Support Vector Machine (SVM).
- Segmented handwriting recognition and speech recognition systems have also been successfully implemented using RNNs.
When comparing RNN vs. CNN, the next important innovation in NN frameworks is the CNN — used to search for objects on photos and in videos, face recognition, style transfer, generating and enhancing images, creating effects like slow-mo and improving image quality. CNN’s are used in all cases that involve pictures and videos.
The defining feature of the CNN is that it performs the convolution operation in certain layers — hence, the name Convolutional Neural Network. The architecture varies slightly from the traditional NN, starting with the makeup of the individual layers. Convolution can be represented as a layer of a neural network because each neuron can act as any function.
In CNN’s, the first layer is always a Convolutional layer. These are defined using the three spatial dimensions: length, width, and depth. These layers are not fully connected — meaning that the neurons from one layer do not connect to each and every neuron in the following layer. The output of the final convolution layer is the input to the first fully connected layer.
Interactive Demo of CNN recognizing handwritten digits
Mathematically speaking, a convolution is a grouping function that takes place between two matrices. More generally it combines two functions to make a third, thereby merging information. In practice, they can be thought of as a filter, or a mechanism for feature selection. A single layer may be responsible for pinpointing very bright pixels in an image, and a subsequent layer recognizes that these highlights, taken together, represent the edge of an object in the image.
In addition to the convolution layers, it is common to add pooling layers in between them. A pooling layer is responsible for simplifying the data by reducing its dimensionality. This effectively shortens the time required for training and helps to curb the problem of overfitting. After the convolution and pooling, layers come to the fully connected layers.
In the final, fully connected layers, every neuron in the first is connected to every neuron in the next. The process during this stage looks at what features most accurately describe the specific classes, and the result is a single vector of probabilities that are organized according to depth. For example, in a vehicle recognition system, there are numerous features to consider.
A side-view picture of a vehicle may only show two wheels. Taken in isolation, this incomplete description will potentially match a motorcycle. As such, there will be a non-zero probability, albeit small, that a vehicle will be classified as a motorcycle or vice-versa. Importantly, additional features such as the presence of windows and/or doors will help to more accurately determine the vehicle type. The output layer generates the probabilities that correspond to each class. That with the highest probability is assumed to be the best choice. In this example of identifying a vehicle, the motorcycle would have a lower probability because, among other things, there are no visible doors.
The most common application for CNNs is in the general field of computer vision. Examples of this are medical image analysis, image recognition, face recognition, generating and enhancing images, and full-motion video analysis.
One such system is AlexNet, which is a CNN that gained attention when it won the 2012 ImageNet Large Scale Visual Recognition Challenge. It is a CNN that consists of eight layers, where the first five are convolutional, and the final three are fully connected layers.
CNN has been the subject of research and testing for other tasks, and it has been effective in solving traditional Natural Language Processing (NLP) tasks. Specifically, it has achieved very impressive results in semantic parsing, sentence modeling, and search query retrieval.
CNN’s have been employed in the field of drug discovery. AtomNet is a deep learning NN that trains on 3D representations of chemical interactions. It discovers chemical features and has been used to predict novel biomolecules for combating disease.
Finally, it is worth noting that CNNs have been applied to more traditional machine learning problems, such as game playing. Both Checkers and Go are games for which CNN has learned to play at the professional level.
A comparison of RNN vs CNN would not be complete without mention that these two approaches are not mutually exclusive of each other. At first glance, it may seem that they are used to handle different problems, but it is important to note that some types of data can be processed by either architecture. Examples of this are image classification and text classification, where both systems have been effective. Moreover, some deep learning applications may benefit from the combination of the two architectures.
Suppose that the data being modeled, whether representative of an image or otherwise, has temporal properties. This is an ideal situation for the merging of these techniques. One such hybrid approach is known as the DanQ architecture. It takes a fixed-length DNA sequence as input and predicts properties of the DNA. The convolutional layer discovers sequence motifs, which are short recurring patterns that are presumed to have a biological function. The recurrent layer is responsible for capturing long-term relationships, or dependencies between motifs. This allows the system to learn the DNA’s grammar and consequently, improve predictions.
DanQ was built upon the DeepSEA model, which performs the same function as DanQ, but does not include the RNN component. The DanQ model performs superiorly to its predecessor, thereby highlighting the significance of adding memory and feedback loops to the architecture. Similarly, the RNN component benefits by considering only the more abstract data that has been filtered by the CNN, making the long-term relationships easier to discover.
It goes without question when comparing RNN vs CNN, both are commonplace in the field of Deep Learning. Each architecture has advantages and disadvantages that are dependent upon the type of data that is being modeled. When choosing one framework over the other, or alternatively creating a hybrid approach, the type of data and the job at hand are the most important points to consider.
An RNN is used for cases where the data contains temporal properties, such as a time series. Similarly, where the data is context-sensitive, as in the case of sentence completion, the function of memory provided by the feedback loops is critical for adequate performance.
A CNN is the top choice for image classification and more generally, computer vision. In addition, CNN’s have been used for myriad tasks, and outperform other machine learning algorithms in some domains. CNN’s are not, however, capable of handling a variable-length input.
Finally, a hybrid RNN and CNN approach may be superior when the data is suitable for a CNN but has temporal characteristics that can be identified and exploited by an RNN component.