DIY DoS: Uncontrolled table growth

Billing systems depend heavily on databases, and their tables often grow enormous. PostgreSQL, Oracle, and other database management systems support partitioning, although Oracle Standard Edition doesn't include this feature.
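Where partitioning is available, one common approach is to split a large table into monthly range partitions. Here's a minimal sketch that generates the DDL for one such partition using PostgreSQL 10+ declarative partitioning syntax; the `sessions` table name and `started_at` partition key are hypothetical, not Hydra's actual schema:

```python
from datetime import date

def monthly_partition_ddl(table: str, month_start: date) -> str:
    """Build DDL for one monthly range partition (PostgreSQL 10+ syntax)."""
    # Compute the first day of the following month (the exclusive upper bound).
    nxt = date(month_start.year + month_start.month // 12,
               month_start.month % 12 + 1, 1)
    name = f"{table}_{month_start:%Y_%m}"
    return (f"CREATE TABLE {name} PARTITION OF {table} "
            f"FOR VALUES FROM ('{month_start}') TO ('{nxt}');")

# Assumes the parent table was declared partitioned, e.g.:
#   CREATE TABLE sessions (...) PARTITION BY RANGE (started_at);
print(monthly_partition_ddl("sessions", date(2016, 3, 1)))
```

With a scheduled job creating the next partition ahead of time, old months can be detached and archived without touching the hot data.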

Our Hydra billing system used to address this issue with how-to guides meant to pinpoint potential problems and offer preventive measures. Here's how we learned that wasn't enough.

Background

Hydra is mostly used by telecoms, for whom PPP sessions and call detail records (CDRs) are a big deal. The more clients you have, the faster this data grows, and the law obliges you to store it as-is for several years in case the authorities need it.

We had taken this into account and shipped tools designed for tables of up to 15 million rows; beyond that, data was supposed to be exported and archived. Even so, clients had to monitor table size themselves to keep it stable, and it was easy to miss the point of no return.
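The monitoring our clients were expected to do boils down to a simple threshold check. Here's a minimal sketch of that logic; the 15 million limit comes from the text above, while the early-warning ratio and function name are hypothetical:

```python
ROW_LIMIT = 15_000_000  # the table size our tooling was designed for
WARN_RATIO = 0.8        # hypothetical early-warning threshold

def check_table(name: str, row_count: int) -> str:
    """Classify a table by how close it is to the archiving limit."""
    if row_count >= ROW_LIMIT:
        return f"{name}: {row_count} rows - archive now"
    if row_count >= ROW_LIMIT * WARN_RATIO:
        return f"{name}: {row_count} rows - schedule archiving"
    return f"{name}: {row_count} rows - ok"

print(check_table("sessions", 36_000_000))  # the state our client ended up in
```

The check itself is trivial, which is exactly why it was easy to skip — nothing breaks until the day everything does.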

The problem

One of our clients skipped the monitoring part, as well as our own monitoring service. The company was unaware of the situation as their sessions table grew to 36 million rows.

The communication service provider's (CSP) authorization system was built on VPN and the RADIUS protocol. With more than double the recommended number of rows in the table, the failure of one of the main switches overloaded the RADIUS server: slow queries made every authorization take longer.

Things got even worse as reconnection requests snowballed, loading the VPN server and turning the incident into a real DoS.

How we managed it

The CSP went straight to our support team, and the problem didn't take long to find: the sessions table was enormous, and the top IOWAIT queries showed it was the weakest link. All that was left was to get rid of millions of rows, but we started by exporting them into CSV files.
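The export step can be sketched with the standard library's csv module. This is a simplified illustration, not our actual tooling: the column names are hypothetical, and in production you would stream rows from the database in batches rather than hold them in memory:

```python
import csv
import io

def export_rows(rows, fieldnames, out):
    """Dump rows to CSV before they are deleted from the live table."""
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)

# Example with an in-memory buffer; a real export would write to a file.
buf = io.StringIO()
export_rows([{"session_id": 1, "started_at": "2016-03-01 10:00:00"}],
            ["session_id", "started_at"], buf)
print(buf.getvalue())
```

Exporting first means the deleted rows stay recoverable, which matters when the law requires the data to be kept for years.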

We switched the RADIUS server into an auto-authorization mode that used the latest available session data and didn't need to look it up in the sessions table. This let us cut the table down to 30 million rows, and regular authorization, along with the RADIUS server, returned to normal.

Takeaways

Relying on how-to guides and client-side monitoring is the easy path, but it turns into a real pain when something goes wrong. We decided to implement our own monitoring solution that pinpoints potential problems and sends support tickets straight to our team.

In addition, we improved the RADIUS caching system and introduced archiving in one of our billing system updates. It moves data from oversized tables into Oracle archives in the background, and clients can build reports as usual without having to work around separate archives.

Our RADIUS server became autonomous, with a local authorization database that is not just a cache but a replica of the main database. This way we solve potential problems before they pop up and disrupt the client's business.

Here's a brief overview of our authorization server's components:

  • User Profiles and Services DB
  • Our app called HARD that handles HTTP requests
  • FreeRADIUS server that implements the standard AAA protocol and translates binary requests into HTTP+JSON for the HARD app

The AAA servers (each running MongoDB) are grouped with one master and two slaves. Requests are handled by a single AAA server without touching the main DB. So if something goes wrong and one of the components fails, the client's services stay up as if nothing happened.