Anomaly detection is identifying something that could not be stated as “normal”; the definition of “normal” depends on the phenomenon that is being observed and the properties it bears. In this article, we dive deep into an unsupervised anomaly detection algorithm called Isolation Forest. This algorithm beautifully exploits the characteristics of anomalies, keeping it independent of data distributions making the approach novel.
Since anomalies deviate from normal, they are few in numbers (minority) and/or have attribute values that are very different from those of normal. The paper nicely puts it as few and different. These characteristics of anomalies make them more susceptible to isolation than normal points and form the guiding principle of the Isolation Forest algorithm.
The existing models train to see what constitutes “normal” and then label everything that does not conform to this definition as anomalies. Almost every single algorithm has its own way of defining a normal point/instance; some do it through statistical methods, some use classification or clustering but in the end, the process remains the same — define normal and filter out everything else.
The usual methods are not optimized to detect anomalies, instead, they are optimized to find normal instances, because of which the result of anomaly detection either contains too many false positives or might detect too few anomalies. Many of these methods are computationally complex and hence suit low dimensional and/or small-sized data.
Isolation Forest algorithm addresses both of the above concerns and provides an efficient and accurate way to detect anomalies.
Now we take a go through the algorithm, and dissect it stage by stage and in the process understand the math behind it. Fasten your seat belts, it’s going to be a bumpy ride.
The core of the algorithm is to “isolate” anomalies by creating decision trees over random attributes. The random partitioning produces noticeable shorter paths for anomalies since
- fewer instances (of anomalies) result in smaller partitions
- distinguishable attribute values are more likely to be separated in early partitioning
Hence, when a forest of random trees collectively produces shorter path lengths for some particular points, then they are highly likely to be anomalies.
The diagram above shows the number of splits required to isolate a normal point and an anomaly. Splits, represented through blue lines, happens at random on a random attribute and in the process building a decision tree. The number of splits determines the level at which the isolation happened and will be used to generate the anomaly score.
The process is repeated multiple times and we note the isolation level for each point/instance. Once the iterations are over, we generate an anomaly score for each point/instance, suggesting its likeliness to be an anomaly. The score is a function of the average level at which the point was isolated. The top
m gathered on the basis of the score, are labeled as anomalies.
The decision tree is constructed by splitting the sub-sample points/instances over a split value of a randomly selected attribute such that the instances whose corresponding attribute value is smaller than the split value goes left and the others go right, and the process is continued recursively until the tree is fully constructed. The split value is selected at random between the minimum and maximum values of the selected attribute.
There are two types of node in the decision tree
Internal nodes are non-leaf and contain the split value, split attribute and pointers to two child sub-trees. An internal node is always a parent to two child sub-trees making the entire decision tree a proper binary tree.
External nodes are leaf nodes that could not be split further and reside at the bottom of the tree. Each external node will hold the size of the un-built subtree which is used to calculate the anomaly score.
The Isolation Forest algorithm works well when the trees are created, not from the entire dataset, but from a sub-sampled data set. This is very different from almost all other techniques where they thrive on data and demands more of it for greater accuracy. Sub-sampling works wonder in this algorithm because normal instances can interfere with the isolation process by being a little closer to the anomalies.
The image above shows how sub-sampling actually makes a clear separation between normal points and anomalies. In the original dataset, we see that normal points and very close to anomalies making detection tougher and inaccurate (with a lot of false negatives). Because of sub-sampling, we could see a clear separation of anomalies and normal instances. This makes the entire process of anomaly detection efficient and accurate.
Since anomalies are susceptible to isolation and have a tendency to reside closer to the root of the decision tree, we construct the decision tree till it reaches a certain height
max_height and not split points further. This height is the height post which we are (almost) sure that there could not be any anomalies.
The process of tree construction is repeated multiple times and each time we pick a random sub-sample and construct the tree. There are no strict rules to determine the number of iterations, but in general, we could say the more the merrier. The sub-sampling count is also a parameter and could change depending on the data set.
The pseudocode for forest construction is as follows
While constructing the tree we pass
log2(nodes_count) as that is the average height of a proper binary tree that could be constructed from
nodes_count number of nodes. Since anomalies reside closer to the root node it is highly unlikely that any anomaly will isolate after the tree has reached height
max_height. This helps us save a lot of computation and tree construction making it computationally and memory efficient.
Every anomaly detection algorithm has to score its data points/instances and quantify the confidence the algorithm has on its potential anomalies. The generated anomaly score has to be bounded and comparable. In Isolation Forest, that fact that anomalies always stay closer to the root, becomes our guiding and defining insight that will help us build a scoring function. The anomaly score will a function of path length which is defined as
h(x)of a point
xis the number of edges
xtraverses from the root node.
As the maximum possible height of the tree grows by order of
n, the average height grows by
log(n) – this makes normalizing of the scoring function a little tricky. To remedy this we use the insights from the structure of the decision tree. The decision tree has two types of nodes internal and external such that external has no child while internal is a parent to exactly two nodes – which means the decision tree is a proper binary tree and hence we conclude
The average path length
h(x)for external node termination is the same as the average path length of unsuccessful search in BST.
In a BST, an unsuccessful search always terminates at a
NULL pointer and if we treat external node of the decision tree as
NULL of BST, then we could say that average path length of external node termination is same as average path length of unsuccessful search in BST (constructed only from internal nodes of the decision tree), and it is given by
To understand the derivation in detail refer to the references at the end of this article.
The anomaly score of an instance
x is defined as
E(h(x)) is the average path length (average of
h(x)) from a collection of isolation trees. From the scoring function defined above, we could deduce that if
- the score is very close to 1, then they are definitely anomalies
- the score is much smaller than 0.5, then they are quite safe to be regarded as normal instances, and
- all the instances return around 0.5, then the entire sample does not really have any distinct anomaly
In the evaluation stage, an anomaly score is derived from the expected path length
E(h(x)) for each test instance. Using
get_path_length function (pseudocode below), a single path length
h(x) is calculated by traversing through the decision tree.
If iteration terminates at an external node where
size > 1 then the return value is
e (number of edges traversed till current node) plus an adjustment
c(size), estimated from the formula above. This adjustment is for the unbuilt decision tree (for efficiency) beyond the max height. When
h(x) is obtained for each node of each tree, an anomaly score is produced by computing
s(x, sample_size). Sorting instances by the score
s in descending order and getting top
m will yield us
The isolation forest algorithm thrives on sub-sampled data and does not need to build the tree from the entire data set; it works well with sub-sampled data. While constructing the tree, we need not build tree taller than
max_height (very cheap to compute), making it low on memory footprint. Since the algorithm does not depend on computationally expensive operations like distance or density calculation, it executes really fast. The training stage has a linear time complexity with a low constant and hence could be used in a real-time online system.
I hope this article helped you to understand Isolation Forest, an unsupervised anomaly detection algorithm. I stumbled upon this through an engineering blog of Grofers. This algorithm was very interesting to me because of its novel approach and hence I dived deep into it. FYI: In 2018, Isolation Forest was extended by Sahand Hariri, Matias Carrasco Kind, Robert J. Brunner. I have not read the Extended Isolation Forest algorithm but have definitely added it to my reading list. I recommend that if you liked this algorithm you should definitely give the extended version a skim.