Managing and Assessing Network as a Service (NaaS) Capacity for the Telecommunication Sector
I heard someone say “With great power comes great responsibility”, and knowing how crucial it is to assess a telecommunication company’s capacity, I decided to generously share my experience and knowledge from assessing NaaS for my past clients, telecommunication companies in several countries (Australia, UAE, Sri Lanka, Malaysia, and South Africa), with all of you here. Oh, how adorable and kind I am. Let’s get started, so keep on reading.
NaaS Capacity Assessment Objectives
The objectives of capacity assessment cover the following:
- Understanding future business requirements for the required service delivery
- Understanding the current operation of the required service delivery
- Understanding the infrastructure of the required service delivery
- Ensuring that all current and future capacity and performance needs of the business requirements are delivered in a cost-effective manner
NaaS Capacity Assessment Scope and Timeline
As the name suggests, NaaS provides network services to connected network domains and customers.
The demand for NaaS traffic is highly likely to grow over time. To accommodate this growth, the capacity assessment should be planned as follows:
- NaaS capacity should be evaluated every 3 months
- NaaS capacity should be able to accommodate 200 simultaneous requests per second for the first implementation. This is a usual safe minimum threshold when a company serves fewer than 7 million subscribers. However, take this with a grain of salt (and probably pepper too), since customer behavior may vary over time and is subject to lots of factors (out of scope of this article)
- The NaaS network ideally should be able to sustain any size of traffic load of Application Programming Interface (API) requests
NaaS Multi-Tenant Environment Management Stack
NaaS may serve multiple tenants, so it may reside in multiple environments, with containers used to keep it portable.
To orchestrate these multiple environments, a Management Stack and an Application Stack are needed for every environment.
The Management Stack comprises the deployment, management, and operation assets that are necessary to deploy and execute the NaaS. The respective assets are listed below.
- Bastion Host: A bastion server that authorized personnel can log on to, preferably from a private network, e.g., a Virtual Private Network (VPN). It would be a separate, self-contained virtual machine to remove any dependency on external environments.
- Continuous Integration/Continuous Delivery (CI/CD): To include each environment in the CI/CD pipeline, CI/CD software (e.g., a Jenkins or Travis CI server) and configuration management software (e.g., Ansible) would need to be installed in all environments.
- Container Repository: Every environment would have its own container repository to provide roll-forward and rollback of the NaaS components without any external dependencies. This may also ease future firewall changes.
- Key Store: This would contain the NaaS platform secrets of each environment, e.g., connection strings, passwords, and account details. Such a store could be provided by Ansible Vault.
- Service Discovery: This provides discovery and resolution of every management stack component above without depending on external Domain Name System (DNS) servers.
NaaS Application Stack
The NaaS platform’s application stack has the application components and provides the necessary integration endpoints to the Resource Discovery Network (RDN). This stack consists of the following assets:
- Microservices including authentication, authorization, and proxy services
- Load Balancer that provides Mutually Authenticated Secure Socket Layer (MA-SSL) offloading for all users connecting to the NaaS platform. This functionality may be provided by a web server e.g., Nginx
- Master Service Catalogue: It is a database that stores data about network-related services, management, and governance e.g., routing information
- Message Queuing Service: It is a middleware that mediates the asynchronous delivery of messages between consumers and providers. Some popular examples are RabbitMQ and Apache Kafka
- API Directory: A list of available APIs
- Admin GUI: Management interface for administrators using graphical elements
- DevOps Dashboard: A graphical dashboard for NaaS platform development and operation monitoring
Input Capacity Forecasting
The number of transactions per service varies widely depending on the service type and the number of service instances. The typical kinds of service provided by a telecommunication company are, for example: voice (of course, either mobile or landline), data (known to commoners as the Internet), TV (yes, it’s television, what else could it be?), and many more.
Assume the current cumulative average monthly number of input transactions is 3 million (no worries, no complex math is involved), spread across different telco services. Whatever this number is, a ready-to-use baseline for predicting future input growth is the historical trend. Suppose there is an increasing trend and we use this increase to derive the multiplying factor that forecasts the future number of input transactions we plan to handle. Be informed that deriving this multiplying factor is a separate task, preferably done by a data scientist. For brevity, let’s assume the multiplying factors for each service have been generated and compiled in Table 1 below.
Table 1. Input Capacity Prediction based on Multiplying Factors
Table 1 shows that the total number of forecasted future input transactions is 14,500,000. Let’s assume there are 5 network domains that we will have to connect to our NaaS platform. That means the forecasted future input transactions must be multiplied by 5, which gives 72,500,000 forecasted input transactions. But don’t worry about the magnitude of this number: each domain would be served by a different or dedicated service instance (services can be replicated), so let’s stick to 14,500,000 as our baseline of future capacity.
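For readers who prefer to see the arithmetic spelled out, here is a minimal Python sketch of the multiplying-factor forecast. The per-service counts and factors are hypothetical; they are not the values from Table 1 and were picked only so that the totals land on the 3 million and 14,500,000 quoted in the text. The function and variable names are mine as well.

```python
# Hypothetical per-service figures: NOT the values from Table 1.
# They are chosen only so the totals match the 3 million (current)
# and 14,500,000 (forecast) quoted in the text.
current_monthly = {"voice": 1_000_000, "data": 1_500_000, "tv": 500_000}
multiplying_factor = {"voice": 3.0, "data": 6.0, "tv": 5.0}

NETWORK_DOMAINS = 5  # number of domains connected to the NaaS platform (from the text)


def forecast_per_service(current, factors):
    """Multiply each service's current monthly transactions by its factor."""
    return {name: round(current[name] * factors[name]) for name in current}


forecast = forecast_per_service(current_monthly, multiplying_factor)
baseline_total = sum(forecast.values())                 # per-domain baseline
cross_domain_total = baseline_total * NETWORK_DOMAINS   # all domains combined

print(forecast)
print(f"Baseline per domain: {baseline_total:,}")
print(f"Across {NETWORK_DOMAINS} domains: {cross_domain_total:,}")
```

Swap in your own per-service counts and factors; the rest of the calculation stays the same.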
All in all, the future network input transactions may be estimated as follows:
Number of monthly transactions from IT components: 14,500,000
Number of monthly transactions between network domains: 72,500,000
Number of hourly peak transactions: 715,342
Number of peak transactions every second: 199
Note that the number of hourly peak transactions is taken from the value recorded in the monitoring log file and not from any formula, while the number of peak transactions every second is calculated by dividing the number of hourly peak transactions by 3,600 seconds and rounding up (715,342 / 3,600 ≈ 198.7, hence 199).
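The conversion from the hourly peak to a per-second figure can be sketched in a few lines of Python. Rounding up is my assumption (plain rounding gives the same 199 here), and the comparison against 200 uses the first-implementation target from the scope section. Keep in mind that this is an average over the peak hour, so short bursts within that hour may be higher.

```python
import math

SECONDS_PER_HOUR = 3_600
hourly_peak = 715_342            # recorded in the monitoring log, per the text
design_target_per_second = 200   # first-implementation target from the scope section


def peak_per_second(hourly_peak_transactions):
    """Average per-second rate within the peak hour, rounded up."""
    return math.ceil(hourly_peak_transactions / SECONDS_PER_HOUR)


peak_tps = peak_per_second(hourly_peak)
within_target = peak_tps <= design_target_per_second

print(f"Peak transactions per second: {peak_tps}")
print(f"Within the design target: {within_target}")
```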
Input Capacity Testing
In order to ensure the NaaS platform is fit for purpose, it should be load-tested and stress-tested to confirm that the above estimations can be handled. Basically, we need to configure our testing tool, e.g., JMeter, to send requests at the pace and volume given by the above estimations, and afterwards run another test with a pace and volume higher than those estimations. This is why it is called a stress test (hitting the platform beyond its expected load till it requires anti-depressants; pun intended).
The load and stress tests would consist of sending API requests to the API server, e.g., requests to create a service, get a service, delete a service, and whatnot. The API server may be built on Node.js, while MongoDB is a plausible choice for the database since the API requests would be in JSON format.
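To make the "pace and volume" idea concrete, below is a rough Python sketch of a paced request generator. It is not JMeter and not a replacement for it; it only illustrates the principle of a load test at the expected rate followed by a stress test at a higher rate. The `send` callable is a hypothetical stand-in for one API call (create, get, or delete a service), and `fake_send` is a stub I made up so the sketch runs without any server.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_paced_test(send, requests_per_second, duration_seconds, workers=8):
    """Submit calls to `send` at a fixed pace and report the outcome.

    `send` is a hypothetical stand-in for one API call; it should raise
    an exception when the request fails.
    """
    total = int(requests_per_second * duration_seconds)
    interval = 1.0 / requests_per_second
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = []
        for i in range(total):
            # Wait until this request's scheduled send time.
            delay = start + i * interval - time.monotonic()
            if delay > 0:
                time.sleep(delay)
            futures.append(pool.submit(send, i))
        # exception() blocks until the call has finished.
        errors = sum(1 for f in futures if f.exception() is not None)
    error_rate = errors / total if total else 0.0
    return {"sent": total, "errors": errors, "error_rate": error_rate}


def fake_send(i):
    """Stub that fails every 10th request, so the sketch runs offline."""
    if i % 10 == 0:
        raise RuntimeError("simulated failure")


# Load test at roughly the forecast pace, then a stress test at double that pace.
load_result = run_paced_test(fake_send, requests_per_second=200, duration_seconds=0.5)
stress_result = run_paced_test(fake_send, requests_per_second=400, duration_seconds=0.5)
print(load_result)
print(stress_result)
```

In a real run, `send` would issue the HTTP request against the API server and the duration would be hours rather than half a second.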
In case you need a reminder about performance tests (load test and stress test), you may refer to this curated article: https://medium.com/software-qa-testing/overview-of-software-performance-testing-activities-a5c81ea32952?sk=29f7b8c2ade2289dcf19658df2784c59. Surprise! I wrote that article too! Gotcha!
At the end of the last stress test (the one with the heaviest load), get a summary of the result like the one shown below. The last stress test’s result is taken as the conclusion, assuming that the NaaS platform has passed the previous tests with flying colors.
Peak API hits per hour: 3,000,000
Peak API hits per minute: 50,000
Peak API hits per second: 833
Error Rate: 0.05%
Based on the above summary, the NaaS platform is in general capable of catering to the forecasted future input transactions and beyond, since 833 hits per second comfortably exceeds the forecasted peak of 199 per second. But what if the result is negative? Tune the Node.js and MongoDB server configurations; theoretically they can process thousands of requests per second.
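The summary figures can be cross-checked with a short Python sketch that derives the per-minute and per-second rates from the hourly peak, then compares them with the forecast. The "headroom" ratio and the count of failed requests are my own additions for illustration; whether 0.05% of failed requests is acceptable depends on your service-level targets, which this article does not define.

```python
peak_hits_per_hour = 3_000_000   # from the stress-test summary
error_rate_percent = 0.05        # from the stress-test summary
forecast_peak_per_second = 199   # from the forecasting section

hits_per_minute = peak_hits_per_hour / 60
hits_per_second = hits_per_minute / 60

# How many times the forecast peak the platform sustained during the test.
headroom = hits_per_second / forecast_peak_per_second

# Number of requests that failed during the peak hour at the reported error rate.
failed_per_hour = peak_hits_per_hour * error_rate_percent / 100

print(f"{hits_per_minute:,.0f} hits per minute")
print(f"{hits_per_second:,.0f} hits per second")
print(f"{headroom:.1f}x the forecast peak")
print(f"{failed_per_hour:,.0f} failed requests in the peak hour")
```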
But don’t get too excited yet! We must also monitor our hardware utilization. What if our CPUs and/or memory are overwhelmed and overheated while servicing the API requests? Or worse, fried? Poor things.
The threshold for safe CPU utilization is below 60% (sounds like a very low threshold, but remember we are talking about mission-critical telecommunication services), while the threshold for safe memory utilization is below 80%. A CPU under sustained heavy load runs hotter, and excess heat may “disturb” the electrical signals inside the chip (hope nobody is allergic to physics). This in turn could force the CPU to slow down, or disrupt or halt its cycles altogether (again, hope nobody is allergic to physics). That’s it! Enough physics for today!
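As a small illustration of how these two thresholds could be applied to test results, here is a Python sketch that flags any sample reaching or exceeding a limit. The sample names and numbers are hypothetical and are not measurements from this article; only the 60% and 80% limits come from the text.

```python
CPU_LIMIT_PERCENT = 60.0     # safe CPU utilization is below this (from the text)
MEMORY_LIMIT_PERCENT = 80.0  # safe memory utilization is below this (from the text)

# Hypothetical samples for illustration only; not measurements from the article.
samples = [
    {"test": "load-2h", "cpu": 41.5, "memory": 63.0},
    {"test": "stress-2h", "cpu": 72.0, "memory": 78.5},
    {"test": "stress-4h", "cpu": 58.0, "memory": 84.0},
]


def breaches(sample):
    """Return the names of the metrics that reach or exceed their limit."""
    found = []
    if sample["cpu"] >= CPU_LIMIT_PERCENT:
        found.append("cpu")
    if sample["memory"] >= MEMORY_LIMIT_PERCENT:
        found.append("memory")
    return found


# Keep only the tests with at least one breached limit.
flagged = {s["test"]: breaches(s) for s in samples if breaches(s)}
print(flagged)
```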
The hardware’s health (e.g., whether it is approaching its breaking point or catching a fever) can be monitored with an app such as Grafana, which sounds like a tribal group (thanks, Wakanda), but really it’s an app. Grafana Forever!
Figures 1 and 2 below are sample Grafana graphs that depict CPU and memory utilization percentages over time.
Grafana also provides a graph of the number of requests over time, including the percentiles for successful requests and their corresponding maximum and minimum response times. See the graph below, which displays such stats during a performance test by JMeter.
After all the performance tests have been completed, the result of every test can be summarized in a table as shown below.
Table 2. The Effect of API Request Processing on Hardware Utilization
The above table contains eight (8) performance test activities. Thanks to my limited monitor size, I could only take a screenshot of eight activities, but in a real situation there would be more activities covering more variety in the number of requests and duration. For example, instead of just including 2 hours of test duration, we may also include durations of 4 and 8 hours. The more the merrier, but you may want to check your test deadline.
Observe the red-colored stats in Table 2. They indicate the situations where the hardware was in a dire state. What to do about this trouble? Easy! Write a resignation letter! (Pun intended!) Actually, we may optimize the number of containers in the virtual machines (VMs) that host the services or, if budget is no issue, upgrade the hardware or even spawn more VMs.
Final Words
NaaS capacity assessment depends on the types of service provided by the telecommunication provider. As the number of customers is projected to grow for certain services, the future network input size for each service can be predicted by multiplying its current size by the multiplying factor, which corresponds to the predicted increase in input transactions.
Such prediction could be quite complex, considering future uncertainties such as a global economic crisis, a global oil crisis, a pandemic (e.g., Covid-20 may come), etc.
Future network input transactions may exceed the prediction if, for example, subscription fees become significantly lower, the population grows, Internet of Things (IoT) devices become popular, basic income climbs, or for other reasons that are better explained by a seasoned economist. Anyhow, a reasonable starting point for the assessment is to use a historical dataset of network input transaction counts to predict the multiplying factors.
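For a rough feel of how a multiplying factor might be pulled from a historical dataset, here is a deliberately naive Python sketch that compounds the average month-over-month growth over a planning horizon. This is only one simple approach that I am assuming for illustration; it is not the method used for my clients, and a data scientist would likely account for seasonality and the uncertainties mentioned above. The history values and the 12-month horizon are hypothetical.

```python
# Hypothetical monthly transaction counts for one service, oldest first.
history = [2_000_000, 2_200_000, 2_420_000, 2_662_000]
HORIZON_MONTHS = 12  # hypothetical planning horizon


def multiplying_factor(series, horizon_months):
    """Compound the average month-over-month growth over the horizon."""
    steps = len(series) - 1
    # Geometric mean of the monthly growth between the first and last value.
    monthly_growth = (series[-1] / series[0]) ** (1 / steps)
    return monthly_growth ** horizon_months


factor = multiplying_factor(history, HORIZON_MONTHS)
forecast_next_year = round(history[-1] * factor)

print(f"Multiplying factor over {HORIZON_MONTHS} months: {factor:.2f}")
print(f"Forecast monthly transactions: {forecast_next_year:,}")
```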
Am I missing something? Nope?