Distributed LOF: Density-Sensitive Anomaly Detection With MapReduce

Lei
Published in The Startup
Jun 24, 2020

Overview

Outlier detection is a fundamental step in data quality, data management, and data analysis tasks. Many applications rely on it, such as fraud detection, network intrusion detection, and data cleaning. Frequently, outliers are removed to improve the accuracy of estimators.

One popular technique in outlier detection, the Local Outlier Factor (LOF), addresses the challenges that arise when data is skewed and outliers may have very different characteristics across data regions. As datasets grow dramatically in size, highly scalable LOF algorithms that leverage modern distributed infrastructures are required.

Inspired by the work of Yizhou Yan et al. [1], this project focuses on designing an LOF algorithm in the distributed computing paradigm to satisfy the stringent response time requirements of modern applications. Here, a distributed LOF with its related MapReduce jobs and an optimized Data-Driven LOF (DDLOF) framework are delivered to achieve this goal.

All the work is based on MapReduce and written in Java. The code is available on GitHub.

Distributed LOF (DLOF)

According to the work of Yizhou Yan et al. [1], DLOF is implemented by first assigning each point, together with its supporting points, to a machine, and then following a three-step pipeline:

  • Step 1: compute the k-distance of each point and materialize it, together with the point's kNN list, as intermediate values.
  • Step 2: compute the local reachability density (LRD) of each point p, i.e., the inverse of the average reachability distance from p to its k nearest neighbors.
  • Step 3: compute the LOF of each point from the intermediate values saved in the previous steps (the exact formulas are given below).
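
For reference, the quantities computed by the three steps are the standard LOF definitions, where N_k(p) denotes the set of p's k nearest neighbors and d(p, o) the distance between p and o:

```latex
% reachability distance of p w.r.t. a neighbor o (uses o's k-distance)
\mathrm{reach\text{-}dist}_k(p, o) = \max\{\, k\text{-}\mathrm{distance}(o),\; d(p, o) \,\}

% Step 2: local reachability density of p
\mathrm{lrd}_k(p) = \left( \frac{\sum_{o \in N_k(p)} \mathrm{reach\text{-}dist}_k(p, o)}{|N_k(p)|} \right)^{-1}

% Step 3: local outlier factor of p (near 1 for inliers, large for outliers)
\mathrm{LOF}_k(p) = \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \frac{\mathrm{lrd}_k(o)}{\mathrm{lrd}_k(p)}
```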

Notice that although each step requires different types of intermediate values, these values only involve the direct kNN of p. Given a point p, as long as the intermediate values associated with the kNN of p are correctly updated and passed to the machine that p is assigned to in the next step, the LOF score of p can be computed correctly without having to be aware of its indirect kNN (the neighbors of its neighbors).

The related MapReduce jobs are as follows:

Fig 1. MapReduce jobs for DLOF
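
To make Step 1 concrete, here is a minimal sketch of what a k-distance reducer might look like in Hadoop. The partitioning scheme, the "id,x,y" record format, and all class and field names are illustrative assumptions, not the original repository's code:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of a Step-1 reducer: each reducer sees the points of one partition
// (core points plus duplicated support points) and computes every point's
// k-distance and kNN list by brute force, materializing them for Step 2.
public class KDistanceReducer extends Reducer<IntWritable, Text, Text, Text> {

    private static final int K = 2; // the k used in the small-dataset experiment

    @Override
    protected void reduce(IntWritable partitionId, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
        List<String> ids = new ArrayList<>();
        List<double[]> pts = new ArrayList<>();
        for (Text v : values) {                         // record format: "id,x,y"
            String[] f = v.toString().split(",");
            ids.add(f[0]);
            pts.add(new double[]{Double.parseDouble(f[1]), Double.parseDouble(f[2])});
        }
        for (int i = 0; i < pts.size(); i++) {          // assumes pts.size() > K
            double[] p = pts.get(i);
            final double[] d = new double[pts.size()];  // distances from point i
            Integer[] order = new Integer[pts.size()];
            for (int j = 0; j < pts.size(); j++) {
                order[j] = j;
                double dx = p[0] - pts.get(j)[0], dy = p[1] - pts.get(j)[1];
                d[j] = Math.sqrt(dx * dx + dy * dy);
            }
            Arrays.sort(order, Comparator.comparingDouble(j -> d[j]));
            double kDistance = d[order[K]];             // order[0] is the point itself
            StringBuilder knn = new StringBuilder();    // "neighborId:dist;..."
            for (int r = 1; r <= K; r++) {
                if (r > 1) knn.append(';');
                knn.append(ids.get(order[r])).append(':').append(d[order[r]]);
            }
            ctx.write(new Text(ids.get(i)), new Text(kDistance + "\t" + knn));
        }
    }
}
```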

Data-Driven LOF (DDLOF) framework

To optimize DLOF, DDLOF is implemented to bound the support points: it decomposes the kNN search into multiple rounds to minimize the number of points that introduce data duplication. The following figures show the general schema of the framework.

Fig 2. DDLOF: general schema of the framework
Fig 3. DDLOF: Supporting Area

In the preprocessing step, the whole dataset is divided into grid cells, and each point is assigned to a cell. A flag is then set for each point based on where its neighbors can lie: Flag 1 means the point can find all of its k nearest neighbors within its own cell, while Flag 2 means its locally computed k-distance reaches beyond the cell boundary, so points in neighboring cells may be closer and the k-distance must be updated. After this step, each point emits a record containing its flag, its grid cell, and its current neighbor list.
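
As an illustration of the flag test, here is a minimal sketch assuming square grid cells of side `cellSize` aligned to the origin and non-negative coordinates; all names are hypothetical, not from the original repository:

```java
// Sketch of the preprocessing flag test under the assumptions stated above.
public final class GridFlags {

    /** Distance from point (x, y) to the nearest boundary of its own cell. */
    static double distToCellBoundary(double x, double y, double cellSize) {
        double lx = x % cellSize, ly = y % cellSize;  // offsets inside the cell
        return Math.min(Math.min(lx, cellSize - lx), Math.min(ly, cellSize - ly));
    }

    /**
     * Flag 1: the kNN ball (radius = locally computed k-distance) stays inside
     * the cell, so the k-distance is already final.
     * Flag 2: the ball crosses a cell boundary, so neighboring cells may hold
     * closer points and the k-distance must be updated in a later round.
     */
    static int flag(double x, double y, double kDistance, double cellSize) {
        return kDistance < distToCellBoundary(x, y, cellSize) ? 1 : 2;
    }

    public static void main(String[] args) {
        System.out.println(flag(5.0, 5.0, 2.0, 10.0)); // 1: ball stays inside
        System.out.println(flag(9.5, 5.0, 2.0, 10.0)); // 2: ball crosses a boundary
    }
}
```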

Fig 4. Different conditions for supporting areas

The next step is updating the k-distance. Depending on how many neighboring grid cells the circle of radius r around a point touches (see Fig 4), there are eight conditions; the most complex case is when the radius reaches the grid diagonal, so that all eight surrounding cells must be searched. A sketch of this overlap test is given below.
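
Under the same square-grid assumptions as above (all names hypothetical), the kNN ball of radius r reaches a neighboring cell exactly when the distance from the point to the cell's closest boundary point is at most r:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the supporting-area test: which of the up-to-8 neighboring cells
// does the kNN ball of radius r around (x, y) overlap?
public final class SupportingArea {

    /** Grid coordinates (col, row) of the neighboring cells the ball overlaps. */
    static List<int[]> touchedCells(double x, double y, double r, double cellSize) {
        int col = (int) (x / cellSize), row = (int) (y / cellSize);
        List<int[]> cells = new ArrayList<>();
        for (int dc = -1; dc <= 1; dc++) {
            for (int dr = -1; dr <= 1; dr++) {
                if (dc == 0 && dr == 0) continue;  // skip the point's own cell
                // closest point of the neighboring cell to (x, y)
                double nx = clamp(x, (col + dc) * cellSize, (col + dc + 1) * cellSize);
                double ny = clamp(y, (row + dr) * cellSize, (row + dr + 1) * cellSize);
                double dx = x - nx, dy = y - ny;
                if (dx * dx + dy * dy <= r * r) {
                    cells.add(new int[]{col + dc, row + dr});
                }
            }
        }
        return cells;  // 0..8 cells; all 8 only when r reaches past the corners
    }

    static double clamp(double v, double lo, double hi) {
        return Math.max(lo, Math.min(v, hi));
    }
}
```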

After the k-distance is updated, the neighbor information also needs to be updated. Once both the k-distance and the neighbor list are up to date, the LRD, and finally the LOF, can be computed.
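
Once every point carries a final k-distance and neighbor list, LRD and LOF reduce to simple aggregations. A minimal in-memory sketch follows; the map-based layout (point id to values) is an assumption for illustration, whereas in the actual jobs these arrive as materialized intermediate values:

```java
import java.util.List;
import java.util.Map;

// Sketch of the final scoring step, using the standard LOF definitions.
public final class LofScorer {

    /** lrd_k(p) = |N_k(p)| / sum over o in N_k(p) of max(k-distance(o), d(p, o)) */
    static double lrd(String p, Map<String, List<String>> knn,
                      Map<String, Double> kDist, Map<String, Map<String, Double>> dist) {
        double sum = 0.0;
        for (String o : knn.get(p)) {
            sum += Math.max(kDist.get(o), dist.get(p).get(o)); // reachability distance
        }
        return knn.get(p).size() / sum;
    }

    /** LOF_k(p): average of lrd(o) / lrd(p) over p's k nearest neighbors. */
    static double lof(String p, Map<String, List<String>> knn,
                      Map<String, Double> kDist, Map<String, Map<String, Double>> dist) {
        double lrdP = lrd(p, knn, kDist, dist);
        double sum = 0.0;
        for (String o : knn.get(p)) {
            sum += lrd(o, knn, kDist, dist) / lrdP;
        }
        return sum / knn.get(p).size();
    }
}
```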

DDLOF’s six MapReduce jobs are summarized as follows:

Experiments and Results

For data generation, random points are first generated within a defined region. Then, clusters are simulated using the make_blobs function in Scikit-learn by defining the cluster parameters. Lastly, to serve as outliers, some points are randomly scattered across the whole region.
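
The clusters were generated in Python with Scikit-learn's make_blobs; the following is a rough, self-contained Java equivalent for illustration only. The counts, centers, spread, and region size below are assumptions, not the actual experiment parameters:

```java
import java.util.Random;

// Sketch: Gaussian blobs plus uniformly scattered outliers, printed as "id,x,y".
public final class DataGen {
    public static void main(String[] args) {
        Random rnd = new Random(42);
        double[][] centers = {{20, 20}, {60, 70}};      // hypothetical cluster centers
        int perCluster = 50, outliers = 10;
        double stdDev = 3.0, region = 100.0;
        int id = 0;
        for (double[] c : centers) {                    // blob points ~ N(center, stdDev^2)
            for (int i = 0; i < perCluster; i++) {
                System.out.printf("%d,%.3f,%.3f%n", id++,
                        c[0] + rnd.nextGaussian() * stdDev,
                        c[1] + rnd.nextGaussian() * stdDev);
            }
        }
        for (int i = 0; i < outliers; i++) {            // outliers uniform over the region
            System.out.printf("%d,%.3f,%.3f%n", id++,
                    rnd.nextDouble() * region, rnd.nextDouble() * region);
        }
    }
}
```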

Two datasets are generated for testing: a small one with only 110 points, 10 of which are outliers, and a large one with 1,050 points, 50 of which are outliers.

Small dataset
The dataset is visualized in a scatter plot below.

Fig 5. DLOF on the small dataset

By filtering for LOF values greater than 1.5, the points with large LOF values (13.86 and 7.3) are identified as outliers. However, the algorithm fails to capture the rightmost points. The reason may be that when k is small (2 is chosen in this case), these points share a similarly sparse density with their neighbors, which leads to an LOF near 1.

Fig 6. DDLOF on the small dataset

Compared to DLOF, DDLOF can capture the few rightmost points, but it still misses the farthest points, likely for the same reason as DLOF. This indicates that choosing the k-value is critical to outlier detection: a small k may classify the farthest points as non-outliers, while a larger k may give points belonging to a cluster a relatively large LOF.

Large dataset
The following figure marks the points whose LOF value is greater than 1.5. DLOF successfully captures a few outliers, but fails to detect all of them. Notably, some points belonging to the clusters also have a relatively high LOF. This may be because, for a point located on the edge of a high-density cluster, the average ratio of the local reachability densities of its k-nearest neighbors to its own can still be large, even though the point is close to its neighbors.

Fig 7. DLOF on the large dataset

Conclusion

In this project, DLOF and DDLOF are applied to detect outliers. Points that lie at the center of a cluster mostly have an LOF close to 1, while points far from the cluster center get a higher LOF. Both methods are able to detect outliers, although some misclassification remains. More work needs to be done on choosing the value of k and on handling the points at the edges of clusters.

References:
[1] Y. Yan, L. Cao, C. Kuhlman, and E. A. Rundensteiner, "Distributed local outlier detection in big data," in SIGKDD, 2017, pp. 1225–1234.

[2] https://towardsdatascience.com/density-based-algorithm-for-outlier-detection-8f278d2f7983
