Improving Wi-Fi digital experience by applying throttling in Distributed Architecture

DHRUV BANSAL
Deutsche Telekom Digital Labs
7 min read · Jun 30, 2020

The use case we solved

At Deutsche Telekom Digital Labs (DTDL) we are working on a product called HomeGateway, which allows our customers to manage and operate their smart routers through a mobile app. The whole idea is to give control back to our customers over their digital lives and provide a next-generation customer experience. Along with giving customers control of their Wi-Fi routers, we also help them detox digitally through many features of our product; for example, customers can schedule rules governing internet access on their devices.

Yes, digital detox — Wikipedia says — “Digital detox refers to a period when a person voluntarily refrains from using digital devices such as smartphones, computers, and social media platforms”.

While building this next-generation product, we solved many problems along the way to make sure our customers get the best experience out of it. One of them we solved by introducing throttling in our cloud-native distributed architecture, which is explained in this article.

What is Throttling

In software, throttling is the process of regulating the rate at which application processing is conducted, either statically or dynamically.

In any application, there can be a scenario where your system is integrated with a legacy or third-party system downstream. You can control the performance of your own system by various means, e.g. scaling, caching, etc., but you may not have any control over the downstream system. In such a scenario, you need to apply throttling to scale your system safely.

To establish the context via the diagram below: System A is a cloud-native application that can handle 500 transactions per second (TPS), but the downstream system can only handle around 300 TPS.

Now, without any additional handling, the legacy system would break due to the overload.

Typically, the solution to such a problem is throttling.

Throttling layer on a High Level

To handle this case, we need a throttling layer in between, which essentially reduces the traffic to the downstream system. When requests are throttled, depending on the use case:

  1. Either the excessive requests are rejected, or
  2. We queue the requests and process them when the downstream system is free to process them.
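As a rough illustration of these two options, here is a minimal in-memory sketch (the class, names, and limits are hypothetical, chosen only to mirror the 500 TPS vs. 300 TPS example above; a real throttling layer sits between the services, not inside one process):

```python
from collections import deque

class ThrottlingLayer:
    """Hypothetical sketch: admit up to `limit` concurrent downstream
    requests; queue (or reject) the excess."""

    def __init__(self, limit, reject_excess=False):
        self.limit = limit
        self.in_progress = 0
        self.waiting = deque()
        self.reject_excess = reject_excess

    def submit(self, request):
        if self.in_progress < self.limit:
            self.in_progress += 1
            return "forwarded"           # within downstream capacity
        if self.reject_excess:
            return "rejected"            # option 1: reject excess requests
        self.waiting.append(request)     # option 2: queue for later
        return "queued"

    def on_response(self):
        """Downstream finished one request; drain the queue if possible."""
        self.in_progress -= 1
        if self.waiting:
            self.waiting.popleft()
            self.in_progress += 1

# 500 incoming requests against a downstream capacity of 300:
layer = ThrottlingLayer(limit=300)
results = [layer.submit(i) for i in range(500)]
print(results.count("forwarded"), results.count("queued"))  # 300 200
```

Every completed downstream response frees one slot, so the queued requests drain at exactly the rate the downstream system can sustain.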

High-Level Flow

In HomeGateway, a typical request for any router operation, whether a simple one like changing the router's password or a relatively complex one like blocking internet access for a device connected to the router:

  • Originates from a mobile app.
  • Travels through microservices in AWS, where the operation is interpreted and converted into a command that the router can understand.
  • Travels to one of the core telecom products of the broadband world, the ACS (Auto Configuration Server).
  • The ACS takes responsibility for executing the command on the customer's router and getting the response back from it.
High-Level Request flow from Homegateway to Customer’s router

Technical Challenge

Legacy System

In every business problem, there are challenges to solve. Tying the cool product we are building back to the problem explained at the start of the article: in the above architecture (a simplistic view of the complete architecture), the ACS is the legacy system, and it cannot be scaled beyond a limit, unlike the HomeGateway product on the cloud. One simple reason for this is that HomeGateway is based on a modern cloud-native architecture, while the ACS is a legacy monolithic application.

HTTP Callback

Another interesting behaviour of the ACS worth mentioning (before we jump to the solution): in the above-mentioned architecture, when a command is pushed from AWS to the ACS for execution on the router, it is pushed via a REST call. Between the ACS and the router, at a high level, the communication is socket-based (TR-069).

When the REST call is made with the command for a router, that command is not executed synchronously with respect to that REST call; rather:

  1. The ACS immediately responds to the HTTP REST request, stating that the command you intended to execute on the router is accepted, and returns a “unique identifier” for the submitted request.
  2. The ACS then executes the command on the router via the TR-069 protocol.
  3. Once the command succeeds, the ACS provides feedback to the microservices in AWS via a new HTTP REST call carrying the above-mentioned “unique identifier”, so the request and its response can be correlated. This pattern is typically known as the HTTP callback pattern, as depicted in the image below.
HTTP Callback Pattern
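To make the correlation step concrete, here is a minimal sketch of the callback pattern. `submit_command` and `on_callback` are hypothetical stand-ins for the REST call to the ACS and the callback endpoint it invokes later; they are not the actual HomeGateway API:

```python
import uuid

# unique_id -> original request context, kept until the callback arrives
pending = {}

def submit_command(device_id, command):
    """Stand-in for the REST call to the ACS: the command is accepted
    immediately and only a unique identifier is returned."""
    unique_id = str(uuid.uuid4())
    pending[unique_id] = {"device": device_id, "command": command}
    return unique_id  # "request accepted"; no actual result yet

def on_callback(unique_id, result):
    """Stand-in for the callback endpoint the ACS calls once the
    command has actually run on the router."""
    context = pending.pop(unique_id)  # correlate response to request
    return {"request": context, "result": result}

uid = submit_command("router-42", "change_wifi_password")
outcome = on_callback(uid, "SUCCESS")
print(outcome["result"])  # SUCCESS
```

The `pending` map is the whole trick: the synchronous REST response only acknowledges submission, and the unique identifier lets the later callback be matched to the request that caused it.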

A few commonly used solutions for throttling

By quick googling, we found possible solutions such as https://github.com/shlomokoren/spring-boot-throttling or, in a similar vein, https://github.com/google/guava/blob/master/guava/src/com/google/common/util/concurrent/RateLimiter.java.

But there is a fundamental problem in using these solutions — The distributed nature of the overall architecture.

In microservices, you expect your system to scale when the load increases, and since our product is directly consumer-facing, imagine the amount of possible traction. So the HomeGateway product follows a cloud-native architecture: we use Kubernetes as the orchestrator for our microservices and leverage its dynamic scaling.

Now, coming back to the solutions googled above: as you might have guessed, they work only as long as we are not in a dynamically scaling world, i.e. only in a monolithic world, because they provide throttling for a particular instance of a microservice.

The second problem is that even if we somehow made these work in a distributed environment, by introducing some distributed locking etc., the overall solution would still have to account for the HTTP callback (through which the actual response comes from the ACS): it should ideally keep throttling incoming requests until the system receives the actual response (via the HTTP callback) for the requested operation.
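The first problem can be seen with a few lines of arithmetic (the numbers are illustrative, assuming each instance enforces the same local limit): an instance-local rate limiter multiplies with the replica count, so the downstream system is overloaded as soon as the orchestrator scales out.

```python
# Why an instance-local rate limiter breaks under dynamic scaling:
# each replica enforces its own limit independently, so the effective
# downstream load is limit * replicas.

downstream_capacity_tps = 300
per_instance_limit_tps = 300   # limit configured per microservice instance

for replicas in (1, 2, 4):
    effective_tps = per_instance_limit_tps * replicas
    overload = effective_tps > downstream_capacity_tps
    print(replicas, effective_tps, "overload" if overload else "ok")
# 1 300 ok
# 2 600 overload
# 4 1200 overload
```

Dividing the limit by the replica count does not fix this either, since with dynamic scaling the replica count changes at runtime; the limit has to live in shared state.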

The solution that worked for us

Considerations

As explained above, the following are the basics of the solution we designed:

  1. The solution should work in a distributed architecture, i.e. it should account for dynamic scaling of the microservices when the incoming load increases.
  2. The solution should also consider the actual response of each request, i.e. the response coming via the HTTP callback.

Solution Architecture

Solution Architecture

As already stated, customers can request various actions on their routers via the mobile app, e.g. changing the password of a router or blocking a device connected to the router. The request lands on microservices deployed on AWS.

  1. We process the request and convert it into a command that the router can understand (this is quite a simplistic view of the overall solution; in reality, a lot of components come into play). The processed request is forwarded for execution via a persisted queue, in our case RabbitMQ.
  2. The throttling engine is responsible for throttling excessive requests in the system and queuing them, either for later processing or to give feedback to our customers to try again.

The “Auto Configuration Server” (ACS) can handle only a limited number of requests at a given point in time. Let’s assume this value is “T”, the total number of requests the ACS can handle.

To control the number of requests to the ACS, we take advantage of the atomic operations supported by Redis. We maintain an atomic counter in Redis, let’s say “X”, which represents the current number of requests in progress on the ACS.

Now, the throttling engine, being an independent microservice, runs as multiple instances. Any instance can consume a message from the queue and forward it to the ACS to execute a command on the router.

When any instance of the throttling engine consumes a message from the queue:

  • It checks whether X (the current number of in-progress requests) is less than T (the total number of requests the ACS can process). If TRUE:
  • It increments the atomic counter.

3. It forwards the request to the request executor component.

4. The request executor is the component that actually initiates the HTTP request to the ACS to execute the command on the router. Once the HTTP request is made, the ACS responds stating that the request is accepted.

5. The ACS executes the command on the router and gets the response back. This communication is socket-based, using the protocol called TR-069.

6. Now the ACS (after getting the response from the router) initiates an HTTP REST call towards the product on AWS to provide feedback that the operation requested by the user was successful.

7. The response handler receives this HTTP request, processes it, and decrements the atomic counter in Redis, indicating that the request has been completed by the ACS.

8. The response handler then forwards the feedback to the customer-facing microservice so that it can be passed on to the customer.
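The counter lifecycle in these steps can be sketched as follows. The counter here is simulated in-process with a lock; in the real system, X lives in Redis and is shared by all throttling-engine instances, and the check-and-increment would need to be atomic there as well (for example via a Lua script), which is an assumption about the implementation rather than a detail from this article:

```python
import threading

T = 3            # max requests the ACS can handle concurrently
X = 0            # current in-progress requests (stands in for the Redis counter)
_lock = threading.Lock()

def try_acquire():
    """Throttling engine: increment X only if X < T, atomically."""
    global X
    with _lock:
        if X < T:
            X += 1
            return True
        return False     # over capacity: re-queue or reject the request

def release():
    """Response handler: decrement X when the ACS callback arrives."""
    global X
    with _lock:
        X -= 1

# Five messages consumed while the ACS can take only three:
admitted = [try_acquire() for _ in range(5)]
print(admitted)          # [True, True, True, False, False]
release()                # one callback arrives -> one slot freed
print(try_acquire())     # True
```

Note that the slot is released only when the HTTP callback arrives, not when the initial REST call to the ACS returns; that is what makes the throttle track the ACS's real in-progress load.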

This is how we solved the problem of API throttling in a distributed architecture, enabling a seamless experience for our customers even when some downstream systems cannot catch up with modern architecture.

Teaser for another problem we solved!

You might have noticed that when a user requests a router operation via the mobile app and it goes to the ACS, the ACS does not return the response immediately; the response is asynchronous. But for a good customer experience, we converted this asynchronous response into a synchronous one, to avoid long polling from the client or an explicit user action to fetch the feedback. That is another very interesting problem we solved in HomeGateway, and it will follow in the next article.


DHRUV BANSAL
Deutsche Telekom Digital Labs

Loves to architect, build & scale disruptive technology consumer products