We use a ResNet-ish architecture, one of the most widely adopted designs for CNNs, that consists of Residual Blocks (ResBlocks). The idea behind a ResBlock is simple yet effective: add the input of the block to its output. This allows the neural network to “remember” each intermediate result and take it into account in the final layers.
The CNN tries to predict the next value using some number of previous values. In our case this number equals 10, but of course it can be configured.
Before coding the CNN itself we should make a few additional preparations of the data. Since we use PyTorch, the data should be wrapped into something compatible with PyTorch’s Dataset. This may be a class inherited from the Dataset class, a generator, or even a simple iterator.
We will use the class option because of its readability and because PyTorch’s DataLoader machinery works with such classes out of the box.
Once again, there are some necessary imports for Dataset creation:
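For example, assuming the raw data lives in NumPy arrays:

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader
```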
Let’s recall the task for the CNN. We want the model to predict the next value using some number of previous values. This means that our Dataset class should store each item in a specific format, divided into two parts:
- n values in sequential order that the model uses for prediction (to easily change the amount of these values, we make n a configurable parameter)
- 1 value that goes right after those n values from the first part of the item
Coming back to the implementation: it is actually very easy to wrap the data with our custom class that just inherits from PyTorch’s Dataset. It should implement only the __len__ and __getitem__ methods.
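Here is a minimal sketch of such a wrapper; the names TimeSeriesDataset and n_prev are ours, and we assume the raw data is a 1-D array of floats:

```python
class TimeSeriesDataset(Dataset):
    """Wraps a 1-D series into (previous n values, next value) pairs."""

    def __init__(self, series, n_prev):
        self.series = torch.as_tensor(series, dtype=torch.float32)
        self.n_prev = n_prev

    def __len__(self):
        # Each item needs n_prev inputs plus one target value.
        return len(self.series) - self.n_prev

    def __getitem__(self, idx):
        x = self.series[idx : idx + self.n_prev]
        y = self.series[idx + self.n_prev]
        # Add a channel dimension: Conv1d expects (channels, length).
        return x.unsqueeze(0), y.unsqueeze(0)
```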
Now it is the right time to define the number of previous values to be used for predicting the next one and, of course, to wrap the data with our Dataset class.
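For example, with train_series and valid_series as stand-ins for the splits prepared earlier:

```python
n_prev = 10  # how many previous values the model sees

train_ds = TimeSeriesDataset(train_series, n_prev)
valid_ds = TimeSeriesDataset(valid_series, n_prev)
```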
Here comes the best part — the definition of the neural network.
Let’s recall what a Residual Block looks like in regular ResNets. It consists of:
- a Convolution + Batch Norm + ReLU combination;
- a second Convolution + Batch Norm;
- a skip connection that adds the block’s input to its output;
- a final ReLU activation.
The only difference between the regular ResBlock and ours is that we removed the last ReLU activation: it turned out that in our case the CNN generalizes better without the final ReLU in each ResBlock.
Each Residual Block brings two Convolution + Batch Norm (+ ReLU) combinations, so such a combination is a good building block to define first.
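A possible helper along these lines (conv_block is our name for it), assuming 1-D convolutions with padding that preserves the sequence length:

```python
import torch.nn as nn

def conv_block(in_feat, out_feat, kernel_size=3, relu=True):
    # Convolution + Batch Norm, optionally followed by ReLU.
    layers = [
        nn.Conv1d(in_feat, out_feat, kernel_size, padding=kernel_size // 2),
        nn.BatchNorm1d(out_feat),
    ]
    if relu:
        layers.append(nn.ReLU(inplace=True))
    return nn.Sequential(*layers)
```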
In each Residual Block, we should remember about the case of changing the number of output channels (when in_feat != out_feat). One possible way to synchronize the number of channels is to duplicate or cut them. However, there is a better way: we can handle this with a 1×1 convolution without padding on the skip connection. This trick not only fits the layer input to the layer output but also adds more useful computation to the neural network.
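Putting the pieces together, a ResBlock sketch could look like this; note the removed final ReLU and the 1×1 convolution on the shortcut:

```python
class ResBlock(nn.Module):
    def __init__(self, in_feat, out_feat):
        super().__init__()
        self.convs = nn.Sequential(
            conv_block(in_feat, out_feat, relu=True),
            conv_block(out_feat, out_feat, relu=False),  # no final ReLU
        )
        # A 1x1 convolution matches the channel count of the skip connection.
        self.idconv = (
            nn.Identity() if in_feat == out_feat
            else nn.Conv1d(in_feat, out_feat, kernel_size=1)
        )

    def forward(self, x):
        return self.convs(x) + self.idconv(x)
```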
It is common to finish the base block of a convolutional net with Max Pooling or Average Pooling, depending on the task. Here comes another useful trick for Convolutional Neural Nets (thanks to Jeremy Howard and his fantastic fast.ai library): concatenate Average Pooling and Max Pooling. It allows our neural net to decide which approach is better for the current task and how to combine them to get better results:
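One way to implement this trick is with adaptive pooling down to a single value per channel, concatenating the two results along the channel axis (so the number of channels doubles):

```python
class AdaptiveConcatPool1d(nn.Module):
    """Concatenates the results of Average Pooling and Max Pooling."""

    def __init__(self, size=1):
        super().__init__()
        self.avg = nn.AdaptiveAvgPool1d(size)
        self.max = nn.AdaptiveMaxPool1d(size)

    def forward(self, x):
        # Output shape: (batch, 2 * channels, size).
        return torch.cat([self.avg(x), self.max(x)], dim=1)
```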
And here is our resulting CNN class, built from the blocks we implemented above:
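A sketch of such a class; the depth and channel widths below are illustrative choices, not a prescribed configuration:

```python
class CNN(nn.Module):
    def __init__(self, n_feat=16):
        super().__init__()
        self.body = nn.Sequential(
            ResBlock(1, n_feat),
            ResBlock(n_feat, n_feat * 2),
            ResBlock(n_feat * 2, n_feat * 4),
        )
        self.head = nn.Sequential(
            AdaptiveConcatPool1d(1),   # -> (batch, n_feat * 8, 1)
            nn.Flatten(),              # -> (batch, n_feat * 8)
            nn.Linear(n_feat * 8, 1),  # predict the single next value
        )

    def forward(self, x):
        return self.head(self.body(x))
```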
Now that we have the definition of the CNN, we can create a randomly initialized model, pass items from our Datasets through it, and check that the data shapes are correct and that no exceptions are raised.
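A quick smoke test along these lines:

```python
model = CNN()
x, y = train_ds[0]
out = model(x.unsqueeze(0))  # add a batch dimension: (1, 1, n_prev)
print(out.shape)             # expected: torch.Size([1, 1])
```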
After the model is defined, we can move to the training loop. This loop is quite general and may be used with the majority of neural nets (a sketch follows the list):
- Loop over the epochs
- Loop over the training part of the dataset making optimization steps
- Loop over the validation part of the dataset
- Calculate losses for each epoch
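A generic sketch of such a loop; the scheduler steps once per batch, since One Cycle (used below) is an iteration-level policy:

```python
def train(model, criterion, optimizer, scheduler, train_dl, valid_dl, n_epochs):
    for epoch in range(n_epochs):
        # Training pass: optimization steps over the training set.
        model.train()
        train_loss = 0.0
        for x, y in train_dl:
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
            scheduler.step()
            train_loss += loss.item() * len(x)
        # Validation pass: no gradients needed.
        model.eval()
        valid_loss = 0.0
        with torch.no_grad():
            for x, y in valid_dl:
                valid_loss += criterion(model(x), y).item() * len(x)
        print(
            f"epoch {epoch + 1}: "
            f"train {train_loss / len(train_dl.dataset):.4f}, "
            f"valid {valid_loss / len(valid_dl.dataset):.4f}"
        )
```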
Alright, we are almost ready to start fitting our CNN model. The only thing left is to initialize the model, the dataloaders, and the training parameters (a sketch of this setup follows the list below).
Here we use:
- Adam optimizer, one of the best general-purpose optimizers.
If you have no idea which optimizer to pick, use Adam. Other optimizers such as SGD, RMSProp, etc. may converge better in some specific situations, but that is not our case.
- Learning rate scheduler with One Cycle policy to speed up the convergence of the neural net.
Instead of keeping the learning rate at the same value across all iterations, we can change it. There are a lot of policies, such as Cosine, Factor, Multi-Factor, Warmup, etc. We choose the One Cycle policy because it seems logical to us: warm the learning rate up for roughly the first 1/3 of the iterations and then gradually decrease it.
- Mean Squared Error loss criterion as we described in Part I.
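Under these choices, the setup might look as follows; the batch size, learning rates, and epoch count are illustrative:

```python
batch_size = 64
n_epochs = 30

train_dl = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
valid_dl = DataLoader(valid_ds, batch_size=batch_size)

model = CNN()
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=1e-2,
    steps_per_epoch=len(train_dl),
    epochs=n_epochs,  # default pct_start=0.3 gives the ~1/3 warm-up
)
```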
Eventually, we can run the most anticipated line of code and execute the training process:
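With everything above in place, that line could be:

```python
train(model, criterion, optimizer, scheduler, train_dl, valid_dl, n_epochs)
```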
And see the losses of our model.
We always want both training and validation losses to move down, because this behavior means that the model is learning something useful about our data. If either loss eventually moves up, the model can’t figure out how to solve the task, and you should change the model or the training setup.
The losses in the picture above look pretty nice because they keep moving down, but calling them nice is an early assumption, since we haven’t checked the predictions yet.
At this very moment, we can easily compute the predictions with our model:
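For instance, computing predictions over the whole validation set in one pass:

```python
model.eval()
with torch.no_grad():
    preds = torch.cat([model(x) for x, _ in valid_dl]).squeeze(1)
```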
And take a look:
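A simple way to compare the predictions against the actual next values, e.g. with matplotlib:

```python
import matplotlib.pyplot as plt

actual = valid_ds.series[n_prev:]  # the targets the model was asked to predict

plt.plot(actual, label="actual")
plt.plot(preds, label="predicted")
plt.legend()
plt.show()
```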
Just to be clear, we don’t really want our models to predict values perfectly, as such perfection would defeat the whole idea of the anomaly detection process (described in “Saying what we want from our models out loud” in Part I). That’s why, when we look at the plots, we want to see our model catch the main trend, not the particular values.