Assuming we have already gone through the README, we can move on to loading the data into our local filesystem.
This dataset lives in a git repository, but that doesn’t mean every dataset should be version controlled. All we really need is a csv with the data and a json with the anomaly labels, and they could just as well sit on the local hard drive. Git here is only for convenience.
Let’s grab the data into our local environment so we can manipulate it any way we want:
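Here is a minimal sketch of that step, assuming we simply clone the whole NAB repository from GitHub (in a Jupyter notebook you could run `!git clone ...` instead):

```python
# Clone the NAB repository into the current working directory.
# Assumes git is installed and available on the PATH.
import subprocess

subprocess.run(["git", "clone", "https://github.com/numenta/NAB.git"], check=True)
```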
After some initial inspection in NAB’s GitHub repo, we see that both training and validation files consist of timestamps with corresponding values.
Each file contains 4032 ordered rows sampled at a 5-minute rate.
Let’s find these files in the labels (once more, the json with exact timestamps of anomalies).
We can see that each file contains 2 anomalies.
Now that we know what we are dealing with, we can start loading it into the notebook, beginning with imports of some useful packages:
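A typical set of imports for the steps below might look like this (a sketch; the exact list depends on what you use later):

```python
import json

import numpy as np
import pandas as pd
import plotly.graph_objects as go
from sklearn.preprocessing import StandardScaler
```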
We have already downloaded the training and validation files and the labels file via git. But we also downloaded every other file in the repository, so we need to specify the paths to the data we actually care about:
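For example, something like the following, where the CSV file names are placeholders; substitute the actual train/validation traces you picked from the NAB repository:

```python
# Hypothetical file names: replace them with the traces you chose from NAB.
TRAIN_FILE = "realAWSCloudwatch/ec2_cpu_utilization_train.csv"
VAL_FILE = "realAWSCloudwatch/ec2_cpu_utilization_validation.csv"

TRAIN_PATH = "NAB/data/" + TRAIN_FILE
VAL_PATH = "NAB/data/" + VAL_FILE
LABELS_PATH = "NAB/labels/combined_labels.json"  # json with exact anomaly timestamps
```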
Finally, we can load the timestamps of anomalies from the json into our local environment:
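Continuing the sketch above (the labels file is a JSON dictionary keyed by the relative path of each data file, so the keys below are the same placeholders as before):

```python
# Load the labels dictionary from the json file
with open(LABELS_PATH) as f:
    labels = json.load(f)

# Lists of anomaly timestamps for our two files of interest
train_anomaly_timestamps = labels[TRAIN_FILE]
val_anomaly_timestamps = labels[VAL_FILE]
```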
And read our data with pandas into DataFrame objects:
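Using the paths defined above, that could look like this:

```python
# Read both CSV files; each has a timestamp column and a value column
train_df = pd.read_csv(TRAIN_PATH)
val_df = pd.read_csv(VAL_PATH)

train_df.head()  # show the first 5 rows of the training data
```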
As you can see, we’ve managed to load the data and take a look at the first 5 rows of the training file. That is enough to confirm we succeeded: these are exactly the first 5 rows we saw on GitHub.
Great, we’ve loaded data into our notebook and inspected it a bit, but are we good to go with it and move on to the models part? Unfortunately, the answer is no.
We have to define exactly which part of the loaded data will be used by our models. We have to think about the problem we want to solve and what data can be used for it.
In our case, we want to detect anomalies in CPU usage. Well, here is the answer: the clearest choice is the value column, because it represents CPU usage.
We can also consider the timestamp column: a timestamp encodes a lot of information, e.g. what day of the week it is, what month of the year it is, etc. This information can be extracted and used by the models. We won’t do it in this article, but if you want, you can try it. Maybe you’ll achieve even better results!
So, we are going to use the value column from the training and validation DataFrames. The next step is to transform the values from this column into an appropriate format. The appropriate format, as you might guess, is numbers: computers use numbers for calculations (and ML models do too), so you have to turn everything into numbers; it’s just the way it works.
The value column already consists of numbers, so we could use them in our models as they are. But it is often a good idea to standardize the numbers before feeding them to our models. It helps them generalize better and avoids problems with values of different scales and meanings (yes, somebody tried it in ML and it worked). Quite often standardization simply means rescaling the numbers to mean = 0 and standard deviation = 1.
That’s why the next thing we have to do is parse the datetime from the timestamps (just for convenient visualization) and standardize the values.
We will follow the usual rescaling policy (as we said, mean = 0 and standard deviation = 1) and use StandardScaler from Scikit-learn:
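A sketch of both steps, assuming the DataFrames from the snippets above; note that the scaler is fitted on the training data only and then reused on the validation data:

```python
# Parse timestamps into datetimes (handy for plotting)
train_df["timestamp"] = pd.to_datetime(train_df["timestamp"])
val_df["timestamp"] = pd.to_datetime(val_df["timestamp"])

# Rescale values to mean = 0 and standard deviation = 1
scaler = StandardScaler()
train_df[["value"]] = scaler.fit_transform(train_df[["value"]])
val_df[["value"]] = scaler.transform(val_df[["value"]])
```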
Then we can extract the anomalies from the DataFrames into dedicated variables for both training and validation data:
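For instance, by selecting the rows whose timestamps appear in the label lists we loaded earlier (variable names come from the previous sketches):

```python
# Rows of the DataFrames that correspond to labeled anomalies
train_anomalies = train_df[train_df["timestamp"].isin(pd.to_datetime(train_anomaly_timestamps))]
val_anomalies = val_df[val_df["timestamp"].isin(pd.to_datetime(val_anomaly_timestamps))]
```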
And plot all of the data with the help of the Plotly library to visualize the whole set of data points and gain a better understanding of what it represents.
Firstly, we plot the training data.
(We are going to use this code for plotting everything we need)
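Something along these lines, using Plotly’s graph_objects API (a sketch; the actual styling in the article may differ):

```python
def plot_data(df, anomalies, title):
    """Plot the values over time and highlight the labeled anomalies."""
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=df["timestamp"], y=df["value"],
                             mode="lines+markers", name="value"))
    fig.add_trace(go.Scatter(x=anomalies["timestamp"], y=anomalies["value"],
                             mode="markers", name="anomaly",
                             marker=dict(color="green", size=10)))
    fig.update_layout(title=title)
    fig.show()

plot_data(train_df, train_anomalies, "Training data")
```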
Secondly, we plot the validation data.
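Reusing the same helper:

```python
plot_data(val_df, val_anomalies, "Validation data")
```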
I suppose this annoying blue dot between 60 and 70 needs an explanation: why isn’t it marked as an anomaly? The thing is that this dot comes right after the green anomaly dot between 70 and 80, and after ~77% of CPU load, ~67% doesn’t seem suspicious at all. You should also keep in mind that we can’t look ahead, because real data arrives in real time, so what looks like an anomaly at one moment may not look like an anomaly in the full picture.
You may notice that at this point you feel much more comfortable with the data. That usually happens thanks to visual inspection of the data values, so it is strongly advised to visualize your data and examine it with your own eyes.
Saying what we want from our models out loud
Now we know what our data looks like and what kind of data we have. We don’t have much information, actually: just timestamps and CPU loads. But it is still enough to build quite good models.
So, here comes the best moment to ask the question: “How are our models going to detect anomalies?”
To answer it we need to figure out 3 things (keeping in mind the general idea of anomaly detection and that our data is a time series):
- What is normal?
This question calls for some creativity, because there is no strict definition of normality.
Here is what we came up with (as many other people have). We will use 2 almost identical tasks to teach our models “normality”:
1. Given the value of CPU usage, try to reconstruct it. This task will be given to the LSTM model.
2. Given the values of CPU usage, try to predict the next value. This task will be given to the ARIMA and CNN models.
If the model reconstructs or predicts easily (meaning, with little deviation), then the data is normal.
You can change this assignment of tasks to models. We tried different combinations, and this one worked best for our dataset.
- How to measure deviation?
The most common way to measure deviation is some kind of error.
We will use the squared error between the predicted value x* and the true one x: squared error = (x* − x)².
For ARIMA we will use the absolute error = |x* − x|, since it performed better. The absolute error may also be used in other models instead of the squared error.
- And what is “too much” deviation?
Or, in other words, how do we pick the threshold? You can simply take some number or figure out a rule to calculate it.
We are going to use the three-sigma statistical rule: we measure the mean and the standard deviation of the errors on all training data (and only training data, because we use the validation data only to see how our model performs), and then we calculate the threshold above which deviation is “too much”.
And, looking ahead, we will also use a slightly modified version where the statistics are measured over a sliding window behind the current position, which improves accuracy considerably (a small sketch of the plain version follows below).
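As a small preview, here is a sketch of the plain three-sigma threshold; the error arrays are hypothetical placeholders, and the real ones will come from the models in the next parts:

```python
import numpy as np

# Placeholder errors: in practice these are the reconstruction/prediction errors
train_errors = np.abs(np.random.randn(1000))
val_errors = np.abs(np.random.randn(1000))

# Threshold from the training errors only: mean + 3 * standard deviation
threshold = train_errors.mean() + 3 * train_errors.std()

# Points whose error exceeds the threshold are flagged as anomalies
anomaly_mask = val_errors > threshold
```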
Don’t worry if it seems too complicated right now; it will become much clearer when you see it in the full code of the models.
Great, most of the theory part is done, and we have loaded, inspected, and standardized our dataset!
As practice shows, data preparation is one of the most important parts. We will prove this in the further parts of the series, because the code we implemented here is a strong foundation that we will use in all our models.
And very soon you will be able to move forward (these items will become links like the ones in the header):