Episode-XXIII TrueAI4Telco

Fatih Nar
Published in Open 5G HyperCore · 11 min read · Aug 2, 2024

Authors: Azhar Sayeed, Fatih E. NAR, Ian Hood

TME-AIX

Introduction:

Artificial Intelligence (AI) is revolutionizing the Telco, Media, and Entertainment (TME) industry, offering innovative solutions for managing complex networking systems, enhancing customer experiences, ensuring robust security and compliance, and more. The rapid convergence of telecommunications, media, and entertainment demands the integration of AI to address challenges, optimize operations, and possibly create new revenue channels through new product and service offerings.

“In this article we share our experiences with true AI use (no PowerPoint gimmicks) for telco, fueled by our predecessors’ knowledge in data mining and data science.”

The holy grail of leveraging AI effectively lies in good data, sound data-science practices, and the fundamentals of successful project and change management.

Figure-1 Data->Wisdom Pyramid

A good data practice can be summarized with a pyramid approach (bottom up):

  • Multiverse is the environment in which we can observe and measure whatever is worth measuring, provided we know how and with what.
  • Data is created through abstractions or measurements taken from the world.
  • Information is data that has been processed, structured, or contextualized so that it is meaningful to humans.
  • Knowledge is information that has been interpreted and understood so that it can be acted on when needed.
  • Wisdom is acting on knowledge in appropriate ways.

The right datasets with the right features fuel AI models, enabling them to learn, adapt, and improve continuously. However, it’s not just about having data; it’s about having the right, useful data with associable features/attributes. A dataset must be accurate, comprehensive, and contextual to be actionable. The synergy between data science and telco domain expertise is crucial in crafting working, beneficial AI solutions that are both technically robust and business-aligned.

Skilled telco data scientists are needed to:

  • frame the problem,
  • design and prepare the data,
  • select which AI model(s) are most appropriate (considering price/performance for training and the lifecycle management of the resulting product),
  • critically interpret the results,
  • plan the appropriate actions to take based on the insights provided by the AI model(s).

Without skilled human oversight, AI projects will fail to meet their business objectives (See Episode-XXI for Avoiding AI Blindness: Link).

Figure-2 AI Happiness Chamber

In contexts where there is a well-understood business problem or need, and the appropriate data and human expertise are available, applied AI can (often) provide actionable insights that give an organization the competitive advantage it needs to survive and/or thrive.

Approach:

Many people and organizations regularly offer suggestions on the best process to follow on the AI journey; most of these align with the data pyramid (Figure-1) and are geared up with the proper goods and skill sets described in the AI happiness chamber (Figure-2).

However, in our humble view, the prior work on the Cross-Industry Standard Process for Data Mining (CRISP-DM, Ref: Link) remains the number-one reference and the groundwork for any successful data science and AI project.

Figure-3 Adopted CRISP-DM Stages & Tasks for AI Projects

Business understanding is crucial for aligning AI projects with business objectives. It involves:

  1. Determine Business Objectives: Clearly define what the business aims to achieve and establish success criteria.
  2. Assess the Situation: Evaluate resources, project requirements, risks, and perform a cost-benefit analysis.
  3. Determine AI Goals: Set specific technical goals that support achieving business objectives.
  4. Produce Project Plan: Develop a detailed plan, selecting appropriate technologies and outlining the project’s phases.

The Data Exploration phase builds on the Business Understanding phase by focusing on the data needed to achieve project goals. It involves three key tasks:

  1. Initial Data Harvesting: Acquire the necessary data and load it into your analysis tool if required.
  2. Explore & Describe Data: Examine and document the data’s surface properties, such as format, number of records, and field identities. Delve deeper into the data through querying, visualization, and identifying relationships among data points (see the sketch below).
  3. Volume, Velocity, and Variety: Assess the sufficiency and usability of the data and document any issues. Our article on Data Formulation dives into the details (Link).

Figure-4 AI Project Life-Cycle
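
As a quick illustration of the Explore & Describe task, here is a minimal sketch in pandas, assuming a hypothetical CSV export of network KPIs (the file name and columns are illustrative, not from the TME-AIX repository):

```python
import pandas as pd

# Initial data harvesting: load the raw export
df = pd.read_csv("network_kpis.csv")

# Surface properties: number of records, field identities, and formats
print(df.shape)
print(df.dtypes)

# Per-column statistics and missing-value ratios
print(df.describe(include="all"))
print(df.isna().mean().sort_values(ascending=False))

# Relationships among numeric data points (candidate features/targets)
print(df.corr(numeric_only=True))
```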

The Information Engineering phase, often known as “data munging,” is essential for preparing the final dataset(s) for modeling. It involves three main tasks:

  1. Select Data: Identify and choose the relevant datasets to use, documenting the rationale behind the selection and exclusion of data. In our previous articles, Episode-XV (Link) addressed data collection and the ETL (Extract, Transform, Load) engine together with Data Mesh, and Episode-VIII (Link) showcased the use of OpenTelemetry (OTel) for proper data harvesting at the application layer.
  2. Cleansing: Address data quality issues by correcting errors, imputing missing values, and removing inaccuracies to ensure clean data. In Episode-XVI (Link) we addressed the different data backends for metrics, events, logs, and traces (MELT) as original data sources, together with exhaust data types (such as utilization metrics, packet loss, jitter, etc.) from platform observability capabilities empowered by platform engineering (Link) toolboxes.
  3. Dataset Construction: Develop new attributes and integrate data from multiple sources to build a comprehensive dataset ready for analysis (see the sketch below). At this stage, if you are lucky, you may have a proper, scalable, well-organized data warehouse that shortens your data exploration and information engineering timelines; if you are not, you will get to enjoy data aggregation, association, cleaning, and N layers of feature engineering for your final dataset.

Figure-5 Time Spent Per AI Project Task
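
As a rough sketch of the Cleansing and Dataset Construction tasks, the snippet below deduplicates, imputes, and derives new attributes; the column names (timestamp, packet_loss, jitter) are hypothetical stand-ins, not TME-AIX fields:

```python
import pandas as pd

df = pd.read_csv("network_kpis.csv")  # hypothetical raw export

# Cleansing: drop exact duplicates and impute missing numeric values
df = df.drop_duplicates()
num_cols = df.select_dtypes("number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Dataset construction: derive new attributes from raw measurements
df["timestamp"] = pd.to_datetime(df["timestamp"])
df["hour"] = df["timestamp"].dt.hour  # time-of-day feature
df["loss_jitter_ratio"] = df["packet_loss"] / (df["jitter"] + 1e-9)

df.to_parquet("final_dataset.parquet")  # ready for the modeling phase
```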

The AI Modeling phase focuses on developing and evaluating useful/beneficial models. It consists of two key tasks:

  1. Model Set Build-Up: Select and implement various modeling techniques, such as regression, classification, or neural networks (RNNs, CNNs, transformers), to construct a range of models.
  2. Model Tests and Assessments: Evaluate the models by testing them against validation criteria and assessing their performance based on domain knowledge and predefined success metrics (see the sketch below). Final model selection can come down to a single model instance or to chaining multiple models in a mixture-of-experts approach (as in multimodal systems).
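
To make these two tasks concrete, here is a minimal sketch of building a small model set and assessing it against a validation criterion (F1 score here); X and y are assumed to be the feature matrix and labels produced by the previous phase:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Model set build-up: a range of candidate techniques
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200),
}

# Model tests and assessments: cross-validate against the success metric
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```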

The QA and Tuning phase focuses on assessing the model’s overall effectiveness in meeting business objectives and determining the next steps. It includes three main tasks:

  1. Evaluate Results: Assess whether the models meet the defined business success criteria and decide which models are suitable for approval and use.
  2. Review Process: Conduct a thorough review of the project’s work, identifying any overlooked steps or necessary corrections. Summarize the findings to ensure the process was properly executed.
  3. Determine Next Steps: Based on the evaluation and review, decide whether to proceed to deployment, make further iterations, or start new projects.

Figure-6 Anatomy of an AI App-Stack

The Production phase ensures the model is deployed and utilized effectively. It includes the following tasks:

  1. MLOps: Plan and execute the deployment of the model, including the necessary infrastructure, automation, and integration with existing systems.
  2. Observe & Tune: Implement monitoring to observe the model’s performance in a live environment and tune it as necessary to maintain accuracy and reliability (see the drift-monitoring sketch after this list).
  3. Result Report: Compile a comprehensive report detailing the project outcomes, including the data mining results and any key insights gained.
  4. Business Impact Final Assessment: Conduct a final assessment to evaluate the business impact of the deployed model, ensuring it meets the intended objectives and identifying any areas for further improvement.
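
For the Observe & Tune task, one simple, model-agnostic drift signal is the population stability index (PSI) between the training and live distributions of a feature (or of the model’s scores). The sketch below is illustrative; the bin count and the 0.2 rule-of-thumb threshold are assumptions, not TME-AIX settings:

```python
import numpy as np

def psi(train_values, live_values, bins=10):
    """Population stability index; values above ~0.2 commonly signal drift."""
    edges = np.histogram_bin_edges(train_values, bins=bins)
    expected, _ = np.histogram(train_values, bins=edges)
    actual, _ = np.histogram(live_values, bins=edges)
    expected = np.clip(expected / expected.sum(), 1e-6, None)
    actual = np.clip(actual / actual.sum(), 1e-6, None)
    return float(np.sum((actual - expected) * np.log(actual / expected)))
```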

We understand that the approach presented above may seem comprehensive and demanding (in time, resources, and, most importantly, dedication). However, to ensure a positive business outcome, especially given the significant capital investment AI requires, it is crucial to execute these steps rigorously. This thorough process maximizes the likelihood of achieving valuable, impactful results for the organization.

Selecting the Right AI Path:

Figure-7 AI Technique Suitability vs Use-Case Family (Ref:Link)

Different AI techniques are better suited to different telco use cases due to their unique strengths (Figure-7):

[A] Traditional machine learning models are favored for prediction and forecasting tasks like demand forecasting and customer churn prediction, as they offer greater accuracy and reliability compared to Generative AI (GenAI).

[B] Optimization and simulation techniques are ideal for planning and optimization scenarios, such as inventory management and route optimization, where precise calculations are essential.

[C] Rule-based systems and specialized AI technologies are preferred for decision intelligence and autonomous systems due to their reliability and explainability.

[D] GenAI excels in content generation, including text and image creation, and in developing conversational user interfaces like virtual assistants and chatbots, which rely heavily on natural language processing.

[E] For segmentation and classification tasks, traditional methods often deliver more consistent performance, as we have verified in our own AI projects.

[F] In anomaly detection, techniques such as isolation forests and support vector machines provide robust solutions, as sketched below.
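
For instance, a minimal isolation-forest sketch with scikit-learn, assuming an illustrative numeric feature matrix X of traffic measurements:

```python
from sklearn.ensemble import IsolationForest

# contamination is the assumed share of anomalies; tune it to your data
detector = IsolationForest(contamination=0.01, random_state=42)
labels = detector.fit_predict(X)  # -1 = anomaly, 1 = normal
anomalies = X[labels == -1]
```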

It is crucial for AI project leaders to choose the most appropriate AI technique, considering factors like accuracy, transparency, and alignment with business goals. In our telco AI projects we also tried combining multiple AI techniques (multimodal, mixture of experts (MoE)), marrying classic AI with GenAI for better impact on business needs; the reasons are explained later.

TME-AIX:

In our TME-AIX projects we closely follow these stages and tasks with the associated stakeholders to ensure the successful delivery of an AI project. Each use case we have implemented so far in the TME-AIX repository includes:

  • Defined business objectives, such as detecting and preventing fraud, predicting and preventing customer churn, lowering MTTR, etc.
  • Various data sources that serve as relevant and useful raw material, where we heavily utilized OTel data from CNFs (Episode XVIII: Link), Prometheus exports from platform observability data, and eBPF for producing exhaust data types. We have also leveraged the FCC’s open datasets (Link), Open Weather datasets (Link), and UC Irvine datasets (Link).
  • Feature-engineered, consolidated final datasets built from the available sources (mentioned above), with sufficient features to train our set of AI models. Our final datasets are shared on Hugging Face (Link).
  • Trained and evaluated AI models that address our business problems. Our trained AI models are shared on Hugging Face (Link).

Figure-8 The relationship between artificial intelligence, data science, and big data, along with common machine learning and deep learning algorithms. (Ref: Link)

Sandbox Environment:

We have used Red Hat OpenShift 4.16 with the Red Hat OpenShift AI Operator 2.11 as the base AI platform, and the Red Hat Advanced Cluster Management and NetObserv operators for harvesting observability data from workloads and platform exhaust data.

For accelerated computing we have tried/used various options, such as NVIDIA GPUs (2x RTX 4070 Super) with the CUDA device type and AMD GPUs (2x Radeon Pro Vega II Duo) with the MPS device type, on the PyTorch and TensorFlow frameworks. The accelerators we leveraged are reasonably priced desktop GPUs, which did the job for the dataset sizes and AI model complexities we settled on. However, fine-tuning a pre-trained LLM with a parameter count above the billion mark is another story; therefore the GenAI model versions we used (T5, BERT, GPT-2, etc.) are below that parameter-size limit.

You may ask:

  • Would it make a difference if we had bigger, better GPUs for our use cases?

Our Answer: Honestly speaking, we tried a few of our use cases on an A100 GPU cluster (on public cloud) with larger pre-trained model fine-tuning, and the gains in accuracy and reliability did not justify the much higher spend (they were better, but not at an eye-popping level).

Using the A100 also did not take us much further on training loss and evaluation loss for the AI models we worked on; in the end the dataset is the same, and we did not build bigger, richer (i.e., more features inside) datasets for the sake of utilizing the A100 better (maybe in the future we will, given time, funds, and further motivation).

Where we stand as of this writing: in our experience, the combination of classic AI and GenAI as a middle ground is the best shot for telco use cases. We give more details in the sample use-case implementations below.

  • Why those particular GPU models?

Our Answer: It is all about AAA (Availability, Accessibility, and Affordability) versus the performance we needed. The GPUs we used are (still) directly purchasable from the vendor without paying a gray/black-market premium (<$1K). They also have a pretty good level of parallel-processing capability (>7K cores); the only caveat is that the memory size is not that impressive (12GB), so we had to tune our batch sizes down a bit and leverage multi-GPU usage with the Hugging Face Accelerate backend, as sketched below.
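
A minimal sketch of what that multi-GPU setup can look like with Hugging Face Accelerate; the model, dataset, and hyperparameters below are placeholders, not our exact training code:

```python
import torch
from accelerate import Accelerator
from torch.utils.data import DataLoader

accelerator = Accelerator()  # detects and uses the available GPUs

# Small batch size to fit 12GB cards; Accelerate shards batches across GPUs
loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

model.train()
for batch in loader:
    optimizer.zero_grad()
    loss = model(**batch).loss  # e.g., a Hugging Face transformer model
    accelerator.backward(loss)  # replaces the usual loss.backward()
    optimizer.step()
```

Launched with `accelerate launch train.py`, the same script runs on one GPU or several without code changes.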

Another important aspect is power consumption: the GPUs we used have TDPs under 250 W, and in our heaviest workloads they barely passed the 200 W level. If you are going to use better and/or multiple GPUs, be sure to calculate your power and cooling needs. See our article on the sustainability and power efficiency of AI accelerators, Episode-XXII (Link).

Sample Use Case Implementations:

Classification Use Case: In the TME industry, the need for transaction categorization, DDoS attack filtering, and the like has existed for some time. Such problems can be addressed with classification, which is pivotal in supervised learning, where the goal is to identify which category an object belongs to. Our projects include a range of implementations, such as Revenue Assurance and Fraud Management, where we have utilized Balanced Random Forests and neural-network transformers (two different approaches implemented in separate Jupyter notebooks); a minimal sketch of the former follows.
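
Here is that sketch with imbalanced-learn; the CSV and column names are illustrative placeholders, not the exact notebook code:

```python
import pandas as pd
from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("transactions.csv")  # hypothetical labeled transactions
X, y = df.drop(columns=["is_fraud"]), df["is_fraud"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42
)

# Each tree trains on a class-balanced bootstrap sample, which helps with
# the heavy class imbalance typical of fraud data
clf = BalancedRandomForestClassifier(n_estimators=300, random_state=42)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```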

Figure-9 RAFM Test UI

Prediction Use Case: Telco service assurance, predictive maintenance, and similar telco OSS needs can be addressed with regression, which focuses on predicting continuous outcomes and is essential for tasks like dynamic forecasting and risk estimation.

Figure-10 5G NetOps with Mixture of Experts (XGBoost+BERT Fine Tuning)

In our repository we have implementations for Service Assurance, utilizing neural networks to predict latency insights and Net Promoter Scores, and for Sustainability, predicting energy consumption with a regression model (a sketch follows). The 5G RAN operational predictions use a blend of T5 models and a Mixture of Experts (MoE) with XGBoost, balancing interpretability and accuracy.
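
As an illustration of the regression side, here is a minimal XGBoost sketch for energy-consumption prediction; the file and column names are hypothetical placeholders:

```python
import pandas as pd
import xgboost as xgb
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

df = pd.read_csv("energy_metrics.csv")  # hypothetical telemetry export
X, y = df.drop(columns=["energy_kwh"]), df["energy_kwh"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

reg = xgb.XGBRegressor(n_estimators=500, max_depth=6, learning_rate=0.05)
reg.fit(X_train, y_train)
print("MAE:", mean_absolute_error(y_test, reg.predict(X_test)))
```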

Figure-11 Actual vs Predicted Net Promoter Score for Telecom Service Provider
Figure-12 Energy Efficiency Predictions

Segmentation Use Case: Traffic/transaction segmentation, identifying performance hotspots, and outlier-detection problems can be solved with clustering, which groups similar objects without prior knowledge of the categories. We use this unsupervised learning technique for IoT security time-and-date segmentation with density-based clustering, as sketched below.
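
A minimal density-based clustering sketch with DBSCAN; the feature columns (hour_of_day, bytes_sent) are illustrative stand-ins for IoT event attributes:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("iot_events.csv")  # hypothetical IoT event export
X = StandardScaler().fit_transform(df[["hour_of_day", "bytes_sent"]])

# eps and min_samples control cluster density; label -1 marks noise/outliers
clusters = DBSCAN(eps=0.5, min_samples=10).fit_predict(X)
print("segments found:", np.unique(clusters))
```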

Figure-13 IoT Perimeter Security Analysis Visualizations

Anomaly Use Case: Anomaly detection identifies unusual patterns that deviate from the norm and is crucial for detecting fraud, security breaches, and system faults. In our projects, we have implemented solutions like Smart Grid Anomaly Detection using sequential models; a minimal sketch follows.
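
A minimal sketch of sequence-based anomaly detection with an LSTM autoencoder in PyTorch; the window shape, layer sizes, and threshold are illustrative assumptions, not the exact smart-grid model:

```python
import torch
import torch.nn as nn

class LSTMAutoencoder(nn.Module):
    def __init__(self, n_features=1, hidden=32):
        super().__init__()
        self.encoder = nn.LSTM(n_features, hidden, batch_first=True)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_features)

    def forward(self, x):  # x: (batch, seq_len, n_features)
        _, (h, _) = self.encoder(x)
        # Repeat the final hidden state as decoder input for every time step
        rep = h[-1].unsqueeze(1).repeat(1, x.size(1), 1)
        dec, _ = self.decoder(rep)
        return self.out(dec)

model = LSTMAutoencoder()
# After training on normal windows only, flag windows whose reconstruction
# error exceeds a threshold picked on a validation set (both placeholders):
with torch.no_grad():
    err = ((model(windows) - windows) ** 2).mean(dim=(1, 2))
anomalies = err > threshold
```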

Figure-14 Smart Grid Anomaly Detections

Closure:

By leveraging the adopted CRISP-DM methodology, high-quality data, domain expertise, the right selection of AI techniques, and cutting-edge, price-performant accelerators, we can tackle complex challenges and drive usable innovation.

We kindly invite you to explore TME-AIX in detail and welcome your contributions, so that we create a positive impact in telecom. Together, we can build bigger and better solutions.
