
How Can ML-Based Anomaly Detection Improve the Performance of 4G Networks?

Sharing our practical experience in solving the broadband distribution problem for mobile internet providers

Oleksandr Stefanovskyi
Intelliarts AI


Photo by Frederik Lipfert on Unsplash

Applying Anomaly Detection to Improve 4G Network Utilization

There is a problem with the older generations of cellular networks: they waste resources by distributing traffic evenly. Imagine a large area covered by a cellular network, including a city, small towns, and large forests. Obviously, the traffic load will be highest in the city, lower in the towns, and little to none in the forests, yet all areas receive the same coverage. This issue is called ineffective traffic utilization.

Modern 4G networks carry even more traffic, so optimizing the utilization of frequency resources is beneficial for cellular providers and leads to significant energy savings and a better customer experience.

Machine Learning-based anomaly detection methods make it possible for operators to anticipate traffic demand in different parts of the network. In this research, we took information from the public domain, analyzed it, and leveraged Machine Learning algorithms to show 4G network operators one approach to improving traffic utilization that has proven effective in our case.

The most interesting of the existing solutions, in our opinion, include:

  • Anomaly detection and classification in cellular networks using an automatic labeling technique for applying supervised learning, suitable for 2G/3G/4G/5G networks.
  • CellPAD, a unified performance anomaly detection framework for detecting performance anomalies in cellular networks via regression analysis.

Data Overview

The dataset we use was extracted from the activity of a real LTE network. It has 14 features, 12 of them numerical and 2 categorical, and 36,904 rows of recorded data with no missing values. Our data was labeled by the Data Analytics team into two classes:

  • 0 (normal) — no redistribution or reconfiguration needed
  • 1 (unusual) — different activity compared to the usual behavior, reconfiguration is required

The collected data was labeled manually based on the load in certain parts of the network. There is also an option for automated data labeling using neural networks; Amazon SageMaker Ground Truth, for example, offers this functionality.

The Insights From Data Analysis

The labeled data was analyzed, and we found that the dataset was imbalanced: 26,721 samples in class 0 (normal) and 10,183 in class 1 (unusual).
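This is easy to verify with pandas. A minimal sketch (the file name lte_activity.csv and the label column are placeholders for our internal dataset):

```python
import pandas as pd

# Load the labeled LTE activity data (file and column names are placeholders)
df = pd.read_csv("lte_activity.csv")

print(df.shape)                    # expected: (36904, 15), 14 features + label
print(df.isna().sum().sum())       # expected: 0 missing values
print(df["label"].value_counts())  # expected: 0 -> 26721, 1 -> 10183
```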

Based on the dataset we built a Pearson correlation matrix:

Figure: 4G network utilization features correlation plot
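A heatmap like this takes only a few lines with pandas and seaborn; the sketch below recomputes the Pearson correlation over the numerical features (df and the file name are the placeholders from the snippet above):

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("lte_activity.csv")  # placeholder file name, as above

# Pearson correlation over the 12 numerical features
corr = df.select_dtypes(include="number").corr(method="pearson")

plt.figure(figsize=(10, 8))
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("4G network utilization features correlation")
plt.tight_layout()
plt.show()
```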

From the illustration above, we can conclude that many of the features are highly correlated. Correlation helps us understand the relationships between attributes in the dataset: it serves as a basic quantity for various modeling techniques, can in some cases indicate a causal relationship, and can help predict one attribute from another. With perfectly positively or negatively correlated attributes, which we have in this case, there is a chance of facing the multicollinearity problem and getting lower model performance as a result. This problem occurs when one predictor variable in a multiple regression model can be predicted linearly from the others with a high degree of accuracy.

The good news is that decision tree and boosted tree algorithms handle this scenario well because they deal with multicollinearity by choosing only one of the perfectly correlated features at the moment of splitting. Linear Regression and Logistic Regression, on the other hand, are not immune to this problem and need additional adjustment before training. There are other ways to deal with multicollinearity: we could delete one of the perfectly correlated features or use Principal Component Analysis (PCA). We decided to use tree-based algorithms because they deal with this problem out of the box.
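For a linear model, the simplest adjustment is to drop one feature from every (near-)perfectly correlated pair before training. A minimal sketch of that alternative (the 0.95 threshold is an assumption, not a value from our pipeline):

```python
import numpy as np
import pandas as pd

df = pd.read_csv("lte_activity.csv")  # placeholder file name, as above
num = df.select_dtypes(include="number")

# Keep only the upper triangle of the absolute correlation matrix
# so each feature pair is inspected once
corr = num.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop one feature from every pair correlated above the threshold
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
print("Dropped:", to_drop)
reduced = df.drop(columns=to_drop)
```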

One of the most popular metrics for classification is the ratio of correct predictions to the total number of samples in the dataset — basic accuracy. In our case the classes are imbalanced, which means basic accuracy can be misleading: a high overall score says nothing about prediction quality on the minority class. Even with 99% accuracy, we could still have weak predictions on the class that interests us, since anomalies are the rarest class in the dataset. To better understand how our models perform, we use the F1 metric, the harmonic mean of precision and recall, which is a good choice for imbalanced classification. F1 ranges over [0, 1], where 1 is perfect classification and 0 is total failure.

The samples can be classified in four possible ways:

TP (True Positive) — the sample's true label is positive and it is classified as positive

TN (True Negative) — the sample's true label is negative and it is classified as negative

FP (False Positive) — the sample's true label is negative but it is classified as positive

FN (False Negative) — the sample's true label is positive but it is classified as negative

The metrics for the imbalanced classes look like this:

True Positive Rate, Recall, or Sensitivity: TPR = TP / (TP + FN)

False Positive Rate or Fall-out: FPR = FP / (FP + TN)

Precision: TP / (TP + FP)

True Negative Rate or Specificity: TNR = TN / (TN + FP)

So, the formula for the F1-score metric that we used is:

F1 = 2 × Precision × Recall / (Precision + Recall)
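In practice, none of these needs to be implemented by hand; scikit-learn provides them all. A quick illustration on toy labels:

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

y_true = [0, 0, 1, 1, 0, 1, 0, 1]  # toy ground-truth labels
y_pred = [0, 1, 1, 1, 0, 0, 0, 1]  # toy predictions

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, TN, FP, FN:", tp, tn, fp, fn)
print("precision:", precision_score(y_true, y_pred))
print("recall:", recall_score(y_true, y_pred))
print("f1:", f1_score(y_true, y_pred))
```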

Algorithms That We Used

We started with DecisionTreeClassifier, which gave us a 94% F1 score on the test set without any parameter adjustment. This is a great result, but we wanted better and decided to try another tree-based algorithm, BaggingClassifier. Based on the F1 score metric, BaggingClassifier was even better at 96%. The other two algorithms, RandomForestClassifier and GradientBoostingClassifier, scored 91% and 93% respectively.
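A sketch of how such a comparison can be run; it assumes the two categorical features are already numerically encoded and reuses the placeholder file and label column from the earlier snippets (the exact preprocessing in our pipeline differed):

```python
import pandas as pd
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("lte_activity.csv")  # placeholder, categoricals pre-encoded
X, y = df.drop(columns=["label"]), df["label"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

models = {
    "DecisionTree": DecisionTreeClassifier(random_state=42),
    "Bagging": BaggingClassifier(random_state=42),
    "RandomForest": RandomForestClassifier(random_state=42),
    "GradientBoosting": GradientBoostingClassifier(random_state=42),
}

# Train each model out of the box and compare F1 on the held-out test set
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, f1_score(y_test, model.predict(X_test)))
```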

Feature Engineering Step

After seeing the initial performance of the tree-based algorithms, we processed the data to push the scores even higher. Adding time features (hours and minutes), extracting the part of the day from the "time" parameter, and adding time-lag features didn't help much. Feature transformation and balancing the data with upsampling techniques did improve the model results.
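A minimal sketch of the upsampling idea, using sklearn.utils.resample on the training split only so that no duplicated rows leak into the test set (continuing from the split above):

```python
import pandas as pd
from sklearn.utils import resample

train = pd.concat([X_train, y_train], axis=1)  # from the split above
normal = train[train["label"] == 0]
unusual = train[train["label"] == 1]

# Upsample the minority class with replacement to match the majority size
unusual_up = resample(unusual, replace=True, n_samples=len(normal),
                      random_state=42)
balanced = pd.concat([normal, unusual_up])

X_bal, y_bal = balanced.drop(columns=["label"]), balanced["label"]
```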

Parameters Tuning Step

Scores above 90% were already quite impressive for out-of-the-box algorithms. We then tuned all four algorithms with the grid search technique and found that GradientBoostingClassifier was the best for this case with a 99% score, which is more than enough for our goal.
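In scikit-learn this is GridSearchCV; here is a sketch of tuning the winning model (the grid below is illustrative, not the exact one we searched):

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid; the real search space was project-specific
param_grid = {
    "n_estimators": [100, 200, 500],
    "learning_rate": [0.05, 0.1, 0.2],
    "max_depth": [3, 5, 7],
}

# Optimize for F1, consistent with the metric used throughout
search = GridSearchCV(GradientBoostingClassifier(random_state=42),
                      param_grid, scoring="f1", cv=5, n_jobs=-1)
search.fit(X_bal, y_bal)  # balanced training data from the previous step

print(search.best_params_)
print(search.best_score_)
```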

Conclusion

The problem described in this article is very common, and any mobile company offering 3G or 4G can use this approach to deliver internet traffic to users more effectively. An "anomaly" in this use case means an inefficient distribution of mobile internet traffic. Based on input data, the Machine Learning model can decide whether resources are distributed effectively or not. With the 99% score of GradientBoostingClassifier tuned via grid search, a mobile company can determine the effectiveness of its internet traffic distribution at any particular moment, change the parameters, and improve the user experience.

We at intelliarts.ai love helping companies solve challenges in data strategy design and implementation, so if you have any questions related to ML pipelines in particular or other areas of Data Science — feel free to reach out.


Oleksandr Stefanovskyi
Intelliarts AI

Head of R&D department, experienced Java Developer, passionate about technologies.