Anomaly detection in LTE networks with Decision Trees and XGBoost

Brayden Chang and Edwin Ang sought to predict normal and anomalous cellular network behavior to facilitate dynamic adjustments, by pitting Decision Trees against eXtreme Gradient Boosting (XGBoost) to discern operational states and network behaviours. Their models demonstrated efficacy with F1-scores of 0.966 and 0.980, and AUC values of 0.974 and 0.999, respectively. Their work helps fellow engineers evaluate approaches for managing radio network resources more effectively. The team was mentored by Robin Ong Boon Ping, Senior Principal Engineer from the Infocomm Infrastructure Programme Centre.

d*classified
16 min read · Apr 20, 2024


Challenge

Mobile Network Operators (MNOs) are facing unprecedented challenges as they gear up to support over 100 billion connections by the next decade, a feat that underscores the pivotal role of Long-Term Evolution (LTE) cellular networks. These networks are integral to the global communications infrastructure, providing enhanced data transmission speeds and reduced latency compared to their third-generation (3G) counterparts.

Despite their capabilities, the operation of LTE networks is fraught with difficulties, particularly in the area of radio resource management (RRM). Key challenges stem from the high costs and limited availability of crucial resources like electrical power and frequency spectrum. The proliferation of wireless devices has further compounded these issues, leading to severe spectrum congestion and soaring operational costs. Historically, MNOs have managed peak traffic by deploying more macro cells and overprovisioning network capacity — a strategy that is no longer viable due to its heavy environmental and economic impacts.

In light of these challenges, the advancement of LTE networks, especially with the adoption of fourth-generation (4G) and emerging fifth-generation (5G) technologies, has turned towards leveraging machine learning (ML) for more dynamic and efficient RRM.


Solution

We developed supervised classification ML models that utilize historical 4G LTE telemetry data to discern and manage cell behaviors. These models are crafted to differentiate normal operational states from anomalies triggered by external events — such as sporting events or large-scale social gatherings — that may disrupt network functioning and demand rapid resource reallocation.


To facilitate the practical application of these ML models, the project also plans to introduce a Graphical User Interface (GUI). This interface will empower MNOs to harness these models for real-time predictive insights and refine their network management strategies using live data, significantly enhancing the efficacy and responsiveness of RRM in complex network environments.

Approach

Materials & Methods

Dataset

Telemetry data was obtained from a 4G LTE deployment spanning two weeks. The dependent (outcome) feature “Behaviour”, along with the 13 predictor features listed in Table 1, was gathered every 15 minutes from a set of 10 base stations spanning 33 cells in total. The total sample size of the dataset is 36,904.

Table 1: Feature Names and Details

Exploratory Data Analysis

Before building the ML models, Exploratory Data Analysis (EDA) provides insights into the traits of the data and the fundamental relationships between features [5]. Here, it encompasses assessing data quality, identifying distinct outliers, and observing data distributions, correlations and relationships between features. Data quality was first assessed by checking for missing values in the dataset; there were none. The dataset was, however, observed to be imbalanced, with the large majority of cell behaviour being normal (72.4%) rather than anomalous (27.6%).

Figure 1: Normal and Anomalous box plots of mean UE devices (Uplink)

Data visualisation was subsequently done by plotting boxplots of normal and anomalous behaviour for all 12 predictor features. (The exception, “Time”, cannot be analysed until it is converted into alternate forms and is therefore omitted from this section.) Any values determined to be outliers by Tukey’s method [6] were plotted as diamond dots. Using this method, the only distinct global outlier, with a value of 2.668, was identified in the mean active UE devices (uplink) feature, shown in Figure 1, and was removed from all subsequent analysis and model training.
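Tukey’s method flags any point outside the fences [Q1 − 1.5·IQR, Q3 + 1.5·IQR]. A minimal sketch in plain Python, using illustrative values rather than the actual telemetry data:

```python
def tukey_outliers(values, k=1.5):
    """Flag values outside Tukey's fences: [Q1 - k*IQR, Q3 + k*IQR]."""
    xs = sorted(values)
    n = len(xs)

    def quantile(q):
        # Linear interpolation between the closest ranks.
        pos = q * (n - 1)
        lo, hi = int(pos), min(int(pos) + 1, n - 1)
        frac = pos - lo
        return xs[lo] * (1 - frac) + xs[hi] * frac

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# A cluster of typical readings plus one extreme value, mimicking the
# global outlier found in "Mean active UE devices (uplink)".
readings = [0.05, 0.07, 0.06, 0.08, 0.04, 0.05, 0.06, 2.668]
print(tukey_outliers(readings))  # [2.668]
```

The same fences are what most plotting libraries use to decide which points to draw as individual markers on a boxplot.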

Afterwards, feature importance was examined to determine which features would benefit model performance and should be utilised. This was done through hypothesis tests that determine whether a feature has a statistically significant effect on the outcome feature “Behaviour”. To ascertain the type of statistical analysis to perform, a Shapiro-Wilk (SW) test for normality [7] was conducted; it revealed that none of the features were normally distributed (p < 0.05). Accordingly, non-parametric hypothesis tests were selected: the Chi-Square test for homogeneity [8], the Point-Biserial correlation (PB) [9] and the Mann-Whitney U-test (MW) [10].
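All three tests are available in `scipy.stats`. A sketch on synthetic stand-in data (the feature values and labels below are illustrative, not the study’s telemetry, so unlike the study the PB and MW tests will generally not find significance here):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(53)

# Synthetic stand-ins: an exponentially distributed metric feature and a
# binary behaviour label (0 = normal, 1 = anomalous).
metric = rng.exponential(scale=1.0, size=500)
behaviour = (rng.random(500) < 0.276).astype(int)

# Shapiro-Wilk: a small p-value (< 0.05) rejects normality, motivating
# the switch to non-parametric tests.
sw_stat, sw_p = stats.shapiro(metric)

# Point-biserial correlation between the binary outcome and the metric.
r_pb, pb_p = stats.pointbiserialr(behaviour, metric)

# Mann-Whitney U-test comparing the metric across the two behaviour groups.
u_stat, mw_p = stats.mannwhitneyu(metric[behaviour == 0],
                                  metric[behaviour == 1])

print(f"SW p={sw_p:.3g}, r_pb={r_pb:.3f} (p={pb_p:.3g}), MW p={mw_p:.3g}")
```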

Figure 2: Scatter plot of percentage of PRB usage (Uplink)

As depicted in Figure 2, one example out of the 12 predictor features, the scatter plot shows a difference between the means, supported by a Point-Biserial correlation coefficient, rpb, of -0.105 that is statistically significant (p < 0.001). This indicates a weak negative correlation: cells with lower percentages of PRB usage (uplink) tend to exhibit anomalous behaviour. The finding is reinforced by a statistically significant difference between the two groups under the Mann-Whitney U-test (p < 0.001). The feature is therefore deemed to have a statistically significant impact on the outcome feature and is utilised for model training. All metric features underwent this analysis; for the nominal feature “Cell ID”, a separate hypothesis test, the Chi-Square test for homogeneity (CS), was conducted instead.

Data distribution was subsequently assessed through histograms, and three types of distribution were observed among the 11 metric features: (a) exponential, as seen in Figure 2, with a peak at 0 and a long right tail; (b) bimodal-lognormal, as seen in Figure 3, with a sudden peak at 0 and a second, lognormal peak around 1.0; (c) normal, as seen in Figure 4, where all features in this category took integer values only. All features were hence categorised by distribution type. The table below summarises the EDA findings, which inform the subsequent processes.

Table 2: Summary of Exploratory Data Analysis findings on 12 predictor features (excluded “Time”) (3dp)

Experiment Setup

All data manipulation, model training and GUI creation were done in Python 3.11.4, where all stochastic functions use a “random_state” parameter of 53 (random seed = 53). The dataset was shuffled and split into training (80%) and testing (20%) sets using scikit-learn’s [11] “train_test_split” function with the “shuffle” parameter set to “True”.
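The split above can be reproduced as follows; the arrays here are placeholders for the real feature matrix and labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

SEED = 53  # used as random_state for every stochastic step, per the setup

# Placeholder arrays standing in for the predictor features and the
# binary "Behaviour" label of the real dataset.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 72 + [1] * 28)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=SEED
)
print(X_train.shape, X_test.shape)  # (80, 2) (20, 2)
```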

Feature Engineering

Feature engineering aims to improve predictive performance by transforming features or creating new ones that extract the most important information from the data [12]. Three new features were created. The “Hour of the day” categorical feature was derived from the “Time” feature using ordinal encoding, returning integers from 0 to 23. Two further features, “Mean UE devices encoded (uplink)” and “Mean UE devices encoded (downlink)”, were created from “Mean UE devices (uplink)” and “Mean UE devices (downlink)” respectively using one-hot encoding, returning 1 if the value of the original feature is 0 and 0 otherwise, capturing the bimodal trait observed during EDA. These features were then statistically analysed to determine feature importance, as shown in Table 3.

Table 3: Summary of Chi-Square test results on three new created features (3dp)
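A sketch of the three engineered features, assuming the “Time” column holds ISO-formatted timestamp strings; the column names follow the text but the row values are illustrative:

```python
from datetime import datetime

def engineer_features(row):
    """Derive the three engineered features from one telemetry row."""
    # Ordinal encoding of the timestamp's hour: an integer in 0..23.
    hour_of_day = datetime.fromisoformat(row["Time"]).hour
    # Binary encoding of the "is exactly zero" spike seen during EDA.
    ue_ul_encoded = 1 if row["Mean UE devices (uplink)"] == 0 else 0
    ue_dl_encoded = 1 if row["Mean UE devices (downlink)"] == 0 else 0
    return {
        "Hour of the day": hour_of_day,
        "Mean UE devices encoded (uplink)": ue_ul_encoded,
        "Mean UE devices encoded (downlink)": ue_dl_encoded,
    }

sample = {
    "Time": "2020-01-07 18:15:00",
    "Mean UE devices (uplink)": 0.0,
    "Mean UE devices (downlink)": 1.3,
}
print(engineer_features(sample))
```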

Only predictor features identified during EDA that had a statistically significant effect on the outcome feature subsequently underwent pre-processing as illustrated in Figure 6, where these features were transformed into numeric forms with the most important information captured. The pre-processor was fit on the training data and used to transform all data.

Figure 6: Flowchart of pre-processor process

For nominal features, in addition to the one-hot encoding already applied when creating the “Mean UE devices encoded (uplink)” and “Mean UE devices encoded (downlink)” features, the “Cell ID” feature also underwent one-hot encoding. This produced 33 new features, one per cell, each returning 1 if the Cell ID matches the assigned cell and 0 otherwise.

The metric features were further categorised into normally distributed, and exponential or lognormal distributed features based on observations of the data distributions made during EDA. Exponential or Lognormal features would undergo a log transformation in order to normalise the data, improving model performance and training stability. These features would thereafter undergo scaling along with the normally distributed data, where the feature is scaled to have a mean of 0 and a standard deviation of 1. Scaling ensures that the magnitude of influence each feature has on the ML model is comparable and reduces excessive bias.
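The pre-processing steps above could be assembled with scikit-learn’s ColumnTransformer roughly as follows. The column names and groupings are illustrative, and log1p is used here to sidestep log(0); the study may have handled zeros differently:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Illustrative column groupings; the real study assigned each telemetry
# feature to a group based on the distributions observed during EDA.
lognormal_cols = ["prb_uplink"]   # exponential/lognormal: log then scale
normal_cols = ["carrier_count"]   # already roughly normal: scale only
nominal_cols = ["cell_id"]        # one-hot encoded

preprocessor = ColumnTransformer([
    ("log_scale", Pipeline([
        ("log", FunctionTransformer(np.log1p)),  # normalise skewed data
        ("scale", StandardScaler()),             # mean 0, std 1
    ]), lognormal_cols),
    ("scale", StandardScaler(), normal_cols),
    ("onehot", OneHotEncoder(handle_unknown="ignore"), nominal_cols),
])

df = pd.DataFrame({
    "prb_uplink": [0.0, 0.2, 0.5, 1.4, 3.0, 0.1],
    "carrier_count": [1, 2, 2, 3, 1, 2],
    "cell_id": ["A", "B", "A", "C", "B", "A"],
})

# Fit on the training portion only, then transform everything.
X = preprocessor.fit_transform(df)
print(X.shape)  # 2 scaled columns + 3 one-hot columns = (6, 5)
```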

Model Training and Hyperparameter Tuning

The two ML models selected for training are the Decision Tree (DT) model [13], using scikit-learn’s “DecisionTreeClassifier()”, and the eXtreme Gradient Boosting (XGBoost) model [14], using xgboost’s “XGBClassifier()”. XGBoost is an ensemble learning technique in which multiple decision trees build upon the residuals of the previous tree. The chosen objective function for hyperparameter tuning is the F1-score:

F1 = 2 × (precision × recall) / (precision + recall)

where precision is the fraction of true positives over predicted positives and recall is the fraction of true positives over actual positives. The F1 score ranges from 1 (perfect) to 0 (worst). It is the selected objective function because it combines precision and recall in a way that rewards models whose precision and recall are balanced, making it better suited to the imbalanced classification dataset being used.
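As a quick numeric check of why F1 favours balance, compare two models whose precision and recall have the same arithmetic mean:

```python
def f1_score(tp, fp, fn):
    """F1 = 2 * precision * recall / (precision + recall)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Both models below have a precision/recall arithmetic mean of 0.9, yet
# the balanced model scores higher - the harmonic mean penalises lopsided
# precision/recall pairs.
print(round(f1_score(tp=90, fp=10, fn=10), 3))  # 0.9   (precision = recall = 0.9)
print(round(f1_score(tp=80, fp=0, fn=20), 3))   # 0.889 (precision 1.0, recall 0.8)
```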

Hyperparameter tuning begins with the creation of a study in the Optuna library [15]. The first trial then performs stratified 10-fold validation [16] for both models, splitting the processed training dataset into validation (10%) and training (90%) subsets over 10 folds. Stratified k-fold validation was selected because it addresses the imbalanced dataset by ensuring each fold has the same proportion of anomalous to normal cell behaviour as the original training dataset. It is performed using scikit-learn’s “StratifiedKFold()” function with the “shuffle” parameter set to “True”.

Table 4: Table of selected hyperparameters and ranges for Decision Trees and XGBoost models.

A random set of hyperparameters is then chosen from the ranges shown in Table 4; all unspecified hyperparameters use the default values described in their respective library documentation. For each fold, the models are fitted with the chosen hyperparameters on the new training data and tested on the validation data to obtain an F1 score; the mean F1 score over all folds indicates the performance of the model under that set of hyperparameters. In each following trial, Optuna’s optimisation algorithm proposes new hyperparameters based on the mean F1 scores of previous trials, aiming to maximise the F1 score. The DT model ran 100 trials while the XGBoost model ran only 5 trials due to its greater computational intensity. After all trials are complete, the trial with the highest mean F1 score yields the best-performing hyperparameters, which are then used to fit the final model on the entirety of the training data.
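The inner cross-validation loop can be sketched with scikit-learn alone. Here a tiny manual search over `max_depth` stands in for Optuna’s sampler, and the dataset is a synthetic imbalanced stand-in for the processed telemetry data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

SEED = 53

# Synthetic imbalanced dataset (~72% / 28%) standing in for the real data.
X, y = make_classification(n_samples=600, weights=[0.72, 0.28],
                           random_state=SEED)

def mean_cv_f1(params):
    """Objective evaluated once per trial: mean F1 over stratified 10-fold CV.
    In the study Optuna proposes `params`; here we pass them in directly."""
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=SEED)
    scores = []
    for train_idx, val_idx in skf.split(X, y):
        model = DecisionTreeClassifier(random_state=SEED, **params)
        model.fit(X[train_idx], y[train_idx])
        scores.append(f1_score(y[val_idx], model.predict(X[val_idx])))
    return np.mean(scores)

# Crude stand-in for Optuna's search: evaluate a few candidate depths and
# keep the best-scoring hyperparameters.
candidates = [{"max_depth": d} for d in (2, 4, 8, None)]
best = max(candidates, key=mean_cv_f1)
print(best, round(mean_cv_f1(best), 3))
```

With Optuna, `mean_cv_f1` would become the study’s objective function and `study.optimize` would replace the manual loop.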

Results

It is observed from the confusion matrix that the XGBoost model particularly excels at minimising false positives, having a perfect precision of 1.00 for anomalous behaviour and producing zero false-positive predictions. However, its recall for anomalous behaviour is lower at 0.962, indicating that the model prioritises accurate anomalous predictions over catching every true anomaly. The DT model similarly prioritises accurate anomalous predictions, but to a larger extent: its precision for anomalous behaviour (0.973) exceeds its recall (0.900).

A high F1 score was also calculated for both models, with the XGBoost model being slightly higher at 0.980 compared to the DT model’s F1 score of 0.966. Hence, not only does the XGBoost model have a greater overall accuracy and score than the DT model, it also has a better balance between precision and recall values.

Both models were subsequently evaluated using a ROC curve [17], shown in Figure 7, which plots the true positive rate against the false positive rate at different thresholds; a higher true positive rate across all thresholds indicates better model performance. Both models sit well above the “Baseline”, which represents a random-guess model that obtains no information from the predictor features.

It was observed that the XGBoost model performed better than the DT model at all possible thresholds, coming exceedingly close to a perfect model with a true positive rate of 1.0 at a false positive rate of 0.0. This is further reinforced by the Area Under the Curve (AUC) metric, where a greater AUC value indicates higher true positive rates across all thresholds. The XGBoost model had an AUC of 0.99996, greater than the DT model’s 0.98596.

The precision-recall curve [17], shown in Figure 8, can evaluate model performance on imbalanced datasets better than ROC curves, as it plots the trade-off between precision and recall across different thresholds. Here too the XGBoost model outperforms the DT model, with greater or equal precision and recall at all thresholds, and a higher AUC of 0.99989 compared with the DT model’s 0.97420.
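Both AUC metrics can be computed with scikit-learn; the labels and scores below are synthetic stand-ins for the models’ predicted probabilities:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(53)

# Synthetic labels (72 normal, 28 anomalous) and scores from a model that
# assigns anomalies higher probabilities most of the time.
y_true = np.array([0] * 72 + [1] * 28)
y_score = np.clip(y_true * 0.6 + rng.normal(0.2, 0.15, size=100), 0, 1)

roc_auc = roc_auc_score(y_true, y_score)
pr_auc = average_precision_score(y_true, y_score)  # area under the PR curve
print(round(roc_auc, 3), round(pr_auc, 3))

# A random-guess baseline has ROC AUC around 0.5, while its PR-curve
# baseline equals the positive-class prevalence (0.28 here) - one reason
# PR curves are more informative on imbalanced data.
```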

Under the hood

To identify the features with the largest magnitude of influence on the models and the greatest feature importance, SHAP [18] values for each feature were computed for all data points in the testing dataset, and the means of the absolute SHAP values were plotted. SHAP values are grounded in cooperative game theory; they quantify how much changing a feature affects the outcome of the predictive model, effectively measuring feature importance. This was done using the “TreeExplainer()” and “summary_plot()” functions in the “shap” library.

As shown in Figure 9, the DT model is influenced mainly by two features: “Mean UE devices (downlink)”, with a mean absolute SHAP value around 0.56, and “Percentage of PRB Usage (uplink)”, around 0.17. The XGBoost model is most influenced by the same two features but with the communication directions reversed: a mean absolute SHAP value around 0.58 for “Mean UE devices (uplink)” and around 0.43 for “Percentage of PRB Usage (downlink)”.

These four features were generally observed to have some of the strongest point-biserial correlation coefficients during EDA, which is consistent with their high mean absolute SHAP values. Unlike the DT model, the XGBoost model also shows appreciable mean absolute SHAP values for features beyond the two most influential ones, which could explain the difference in predictive performance: the DT model is prone to over-reliance on its two most influential features and is thus more likely to change its predictions due to minor fluctuations in them, whereas the XGBoost model is consistently influenced by a larger set of features and is hence more reliable.

Real-world implementation & drawbacks

This study focuses primarily on the application of ML to cell behaviour prediction and does not develop the dynamic RRM system itself. Future studies could delve into creating a dynamic RRM system capable of responding to identified anomalies and into evaluating the effectiveness of the entire system.

Additionally, this study is limited to a dataset from a single fixed LTE deployment. The ML models may not perform to their expected standard in new, unfamiliar deployments, because they use location-specific features such as “Cell ID” and because telemetry data and traffic patterns fluctuate from region to region.

One of the largest drawbacks of the models is their reliance on complete data: values for all features are required to make predictions, which may not be feasible during real-life implementation and data collection. Although methods such as imputation could resolve the issue, they may come at a significant cost in accuracy.

Graphical User Interface

A Graphical User Interface (GUI) was created using the “streamlit” library, providing an interface for MNOs and other users to generate predictions from raw testing-data input or to evaluate the ML models with the aforementioned evaluation metrics. The GUI web application can be accessed at https://anomalydetectionlte.streamlit.app.

Game for a similar challenge? Step into your future

Excited by what you’ve read? There could be a thinker and tinkerer in you that seeks a greater challenge. Learning never stops — chart your next adventure, and push the envelope in defence tech with us through the Young Defence Scientist Programme.

Slide into our DMs here to fuel your passion for science and technology and be mentored by Singapore’s top engineers and developers.


References

Data Availability — all raw telemetry data utilised in this project is openly available at the Kaggle database and can be accessed at https://www.kaggle.com/competitions/anomaly-detection-in-4g-cellular-networks/overview from Vidal, J. 2020. Anomaly detection in 4G cellular networks.

[1] Gandhi, J., Narmawala, Z. 2022. Capacity and Cost Analysis of 4G-LTE and 5G Networks. In: Signh, P.K., Weirzchon, S.T. Chhabra, J.K., Tanwar, S. (eds) Futuristic Trends in Networks and Computing Technologies, Lecture Notes in Electrical Engineering, vol 936

[2] Ezhilarasan, E. and Dinakaran, M. 2017. A Review on Mobile Technologies: 3G, 4G and 5G. 2017 Second International Conference on Recent Trends and Challenges in Computational Models (ICRTCCM) pp. 369–373.

[3] Luong, N. C., Lu, X., Hoang, D. T., Nivato, D., and Kim, D. I. 2021. Radio Resource Management in Joint Radar and Communication: A Comprehensive Survey. IEEE Communications Surveys & Tutorials, vol. 23, no. 2, pp. 780–814, Secondquarter 2021

[4] Arbi, A. 2017. Spectral and Energy Efficiency in Cellular Mobile Radio Access Networks. Department of Electronic and Electrical Engineering.

[5] Jebb, A., Parrigon, S. and Woo, S. 2017. Exploratory data analysis as a foundation of inductive research. Human Resource Management Review, Volume 27, Issue 2, 2017, pp. 265–276, ISSN 1053-

[6] Saleem, S., Aslam, M. and Rukh, S. 2021. A review and empirical comparison of univariate outlier detection methods. Pak. J. Statist. 2021 Vol. 37(4), 447–462.

[7] Shapiro, S and Wilk, M. 1965. An analysis of variance test for normality (complete samples). Biometrika 52, 3 and 4, p. 591

[8] David, C. n.d. Chi square test — analysis of contingency tables. Professor Emeritus, University of Vermont.

[9] Kornbrot, D. 2014. Point Biserial Correlation. Encyclopedia of Statistics in Behavioral Science. John Wiley & Sons, Ltd.

[10] Emerson, R. 2023. Mann-Whitney U test and t-test. Journal of Visual Impairment & Blindness, 117(1), 99–100.

[11] Buitinck, L., Louppe, G., Blondel, M., Pedregosa, F., Mueller, A., Grisel, O. and Varoquaux, G. 2013. API design for machine learning software: experiences from the scikit-learn project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning, pp. 108–122.

[12] Fatemeh, N., Horst, S., Udayan, K., Elias, B., and Deepak, T. 2017. Learning Feature Engineering for Classification. Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17) 2529.

[13] A. Navada, A. N. Ansari, S. Patil and B. A. Sonkamble. 2011. Overview of use of decision tree algorithms in machine learning. 2011 IEEE Control and System Graduate Research Colloquium, pp. 37–42.

[14] Ramraj, S., Nishant, U., Sunil, R. and Shatadeep, B. 2016. Experimenting XGBoost Algorithm for Prediction and Classification of Different Datasets. International Journal of Control Theory and Applications. ISSN : 0974–5572 © International Science Press Volume 9 Number 40.

[15] Akiba, T., Sano, S., Yanase, T., Ohta, T. and Koyama M. 2019. Optuna: A Next-generation Hyperparameter Optimization Framework. KDD ’19: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2623–2631.

[16] Berrar, D. 2018. Cross-Validation. Encyclopedia of Bioinformatics and Computational Biology, Volume 1, Elsevier, pp. 542–545.

[17] Fawcett, T. 2006. An Introduction to ROC Analysis. Pattern Recognition Letters 27 (8): 861–874.

[18] Lundberg, S. M. and Lee, S.-I. 2017. A Unified Approach to Interpreting Model Predictions. Advances in Neural Information Processing Systems 30: 4765–4774. arXiv:1705.07874.
