Lessons from ingesting 500M database records at once

Ryan Baker
Published in Singularity · 6 min read · Jan 3, 2024


At Singularity Energy, we started building one of our products to handle much larger scale than we had previously supported. Our primary product’s data store has about 2 billion data points; however, that was ingested over a few years. For this new product, we wanted to see where our system would fail under higher-than-expected strain. To do that, we tried ingesting 500 million data points at once.

As we’ve mentioned previously, we queue all of our processing so it can be done asynchronously. We added about 500,000,000 messages to that queue. We watched closely to see which parts succeeded at their jobs and which failed.

In the rest of this post, I’ll go over what already worked well, what needed to change, and what changes we will make ahead of our next large scale data ingestion.

What worked already

Before starting this workload, we already had a system where work was pulled off a queue and handled by processes we call “workers”. As the workers picked up work, they started inserting lots of rows into the database. It quickly became obvious that we had not given the database enough storage space for all the data being created. However, we had already set up something to protect against that:

Auto-scaling RDS storage space

Whenever our RDS instance ran low on storage, it automatically added more (without downtime) and allowed work to continue. Luckily, this feature in AWS is as easy as checking a box on the RDS configuration page.
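
For reference, enabling it programmatically is a one-parameter change. Below is a minimal sketch using boto3; the instance identifier and storage ceiling are placeholders, not our actual values.

```python
import boto3

rds = boto3.client("rds")

# Storage autoscaling is enabled by giving RDS a storage ceiling
# (MaxAllocatedStorage). When free space runs low, RDS grows the
# allocated storage toward that ceiling without downtime.
rds.modify_db_instance(
    DBInstanceIdentifier="ingest-db",   # placeholder identifier
    MaxAllocatedStorage=2000,           # ceiling in GiB, well above current usage
    ApplyImmediately=True,
)
```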

Modeling large workloads as “jobs” in our database

In our system, we use Celery to handle asynchronous work. We like to keep our “tasks” as small as possible (for multiple reasons that I won’t get into here). For that reason, we decided to model larger workloads as rows in a “jobs” table in our database, each with a status, configuration, cursor, and percentage complete. This allowed us to see how the jobs were doing, as well as check the logs for any specific or problematic jobs we observed.
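
For illustration, a jobs table along these lines could be modeled roughly like this (a sketch with SQLAlchemy; the column names and types are my shorthand for the fields described above, not our production schema):

```python
from sqlalchemy import JSON, Column, Float, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Job(Base):
    """One row per large workload; individual Celery tasks stay small
    and advance the job by updating its cursor and percent complete."""
    __tablename__ = "jobs"

    id = Column(Integer, primary_key=True)
    status = Column(String, nullable=False, default="pending")  # e.g. pending / running / paused / complete
    configuration = Column(JSON, nullable=False)                # parameters for the workload
    cursor = Column(String)                                     # where to resume from
    percent_complete = Column(Float, default=0.0)
```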

A screenshot of an internal dashboard for Singularity Energy that shows jobs in a table format, including their status and percentage complete.
What we see in our internal dashboard

A separate “slow-lane” worker queue

As you can imagine, we have a lot of work that our workers do in the background. For this reason, we configured two sets of workers to read from two different queues: normal workers and latency-insensitive workers.

The “latency insensitive” workers will do work that typically takes a longer time to complete and/or work that is resilient to system delays.

This allowed our “jobs” to take up most of the workers’ throughput without interrupting the normal workers. Without that separation, normal system functions would have been severely delayed. For example, our “reset your password” email is sent via a worker. If we have 500,000,000 messages on the queue and we add a single password-reset message to the end of it, the user would likely not receive that email until after the other 500,000,000 messages were taken off the queue. That would be a significant delay for the user.
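
Conceptually, the split looks something like this in Celery. The queue and task names below are illustrative, not our real ones:

```python
from celery import Celery

app = Celery("singularity", broker="sqs://")

# Route latency-insensitive work (like large ingest jobs) to a separate
# "slow lane" queue so it never sits in front of time-sensitive tasks
# such as password-reset emails.
app.conf.task_routes = {
    "tasks.ingest_chunk": {"queue": "latency_insensitive"},
    "tasks.send_password_reset_email": {"queue": "default"},
}

# Each set of workers then consumes only its own queue, e.g.:
#   celery -A singularity worker -Q default
#   celery -A singularity worker -Q latency_insensitive
```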

What we changed to work better

During the data ingest, we updated a few things about our system in order to get through the 500 million messages in a timely manner. We prioritized these based on the issues and errors that we saw in our system.

Provide workers direct database access

Typically, we have our workers send a request to the API to access objects in the database. This is roundabout, but it helped us by limiting which resources could access the database and guaranteeing a log line for every data request (since the API logs every incoming request).

However, we quickly found that the workers were choking the API:

An image showing the CPU and memory utilization of the API. The memory hovers around 25%, but CPU is almost always at 100% for a 12 hour window.
Resource utilization for our API showed it was consistently pinned at 100% CPU

We identified the requests causing high resource utilization and found that they were doing lots of DB operations and regularly exceeded our “healthy” threshold for response time. Instead of fixing the issue by making the database faster (the database’s metrics looked healthy at this point), we moved the DB access for these high-latency operations from the API to the workers.

This relieved pressure from the API and removed what probably should never have been an operation the API was doing in the first place.
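
The change itself was conceptually simple: instead of the worker asking the API over HTTP for each batch of objects, it queries the database directly. A rough sketch (the connection URL, table, and column names are made up for illustration):

```python
import os

import sqlalchemy as sa

# Before: each lookup went through the API, e.g.
#   requests.get(f"{API_URL}/records?source_id={source_id}")
# which kept the API pinned at 100% CPU during the ingest.

# After: the worker talks to the database directly for these
# high-latency operations.
engine = sa.create_engine(os.environ["DATABASE_URL"])
metadata = sa.MetaData()
records = sa.Table("records", metadata, autoload_with=engine)

def fetch_records(source_id: int):
    with engine.connect() as conn:
        result = conn.execute(
            sa.select(records).where(records.c.source_id == source_id)
        )
        return result.fetchall()
```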

Upsize the database based on the workload

After fixing the previous point, the system was humming along nicely. However, it was going very slowly and looked like it would take a couple of weeks to finish the data ingestion. We decided to increase the number of workers in order to get higher throughput. Naturally, we made sure to watch the database when we did this.

Sure enough, the database CPU jumped from ~20% to about 80%. This was well past the allowance for our RDS instance size, and the CPU credits for our instance started depleting. In response, we decided to upsize the database during the period that we would be doing this work.

It worked like a charm! CPU dropped back to healthy levels, allowing high throughput for the jobs while still supporting the application’s regular usage of the database.
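
The upsize itself is a single configuration change, in the console or via the API. Something like the following, where the instance identifier and class are examples rather than our actual setup:

```python
import boto3

rds = boto3.client("rds")

# Temporarily move to a larger instance class for the duration of the
# ingest, then scale back down once the job is done.
rds.modify_db_instance(
    DBInstanceIdentifier="ingest-db",    # placeholder identifier
    DBInstanceClass="db.m5.2xlarge",     # example class sized for the workload
    ApplyImmediately=True,
)
```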

Add the ability to pause and restart a worker job

In certain cases, we had a problematic job that needed to be fixed. We decided to implement a way for a Singularity employee to pause and restart a worker job in order to effectively re-run the job upon deploying a bug fix.

This worked out very nicely for us: instead of restarting the entirety of the 500M record ingest, we only had to restart a smaller portion of it.
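
In broad strokes, the pattern is: process the job in chunks, persist the cursor after each chunk, and check the job’s status between chunks. A paused job simply stops, and restarting it re-enqueues the task, which picks up from the stored cursor. The sketch below uses hypothetical helpers (get_job, next_batch, process, estimate_progress) standing in for our internal code:

```python
from celery import shared_task

@shared_task
def run_job(job_id: int):
    job = get_job(job_id)                 # hypothetical: load the job row
    while True:
        job.refresh()                     # re-read status set from the internal dashboard
        if job.status == "paused":
            return                        # stop cleanly; a restart re-enqueues this task
        batch = next_batch(job.cursor)    # hypothetical: fetch the next chunk after the cursor
        if not batch:
            job.update(status="complete", percent_complete=100.0)
            return
        process(batch)                    # hypothetical: do the actual ingest work
        job.update(cursor=batch.last_key,
                   percent_complete=estimate_progress(batch))
```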

Things we’ll do for next time

Overall, the process was relatively smooth. The least smooth part was that we had to watch and monitor it the whole time. With a few small changes, though, we could comfortably kick off another job of this size and rest assured that the system would take care of itself:

Auto-scale database resources

As part of the work we did during the ingest, we upsized the database. We could, instead, apply an auto-scaling policy to the database based on its CPU utilization and allow it to upsize itself when needed.

This would save us from manually changing the database’s configuration and allow our system to respond to changes in its workload on its own.
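
One way this could be wired up (a sketch, not something we run today) is a CloudWatch alarm on the instance’s CPU that notifies an SNS topic, with automation behind the topic calling modify_db_instance. The names and thresholds below are placeholders:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when the RDS instance's average CPU stays above 70% for 15 minutes.
# The SNS topic (and whatever automation listens to it) is a placeholder.
cloudwatch.put_metric_alarm(
    AlarmName="ingest-db-cpu-high",
    Namespace="AWS/RDS",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "ingest-db"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=70.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:db-scale-up"],
)
```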

Minimize payloads sent over SQS

We learned that AWS SQS has a built-in message size limit (256 KB). We knew this limit existed, but felt we would not hit it.

An error in our Sentry dashboard that we saw in our QA environment

We were wrong.

Next time, instead of sending large payloads over SQS, we will store them somewhere first, and provide the Celery worker a reference to the large payload. That way, we can provide arbitrarily large jobs or configurations without having to stretch the limit of SQS.
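
A sketch of that approach, assuming S3 as the out-of-band store (it could just as easily be the jobs table’s configuration column); the bucket and task names are made up:

```python
import json
import uuid

import boto3

s3 = boto3.client("s3")
BUCKET = "singularity-job-payloads"  # placeholder bucket name

def enqueue_large_job(payload: dict):
    # Store the full payload out of band...
    key = f"job-payloads/{uuid.uuid4()}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(payload).encode())

    # ...and send only a small reference over SQS, safely under the
    # 256 KB message size limit. process_payload is an illustrative Celery task.
    process_payload.delay(payload_key=key)
```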

Auto-scale logging cluster disk space

Unfortunately, we ran out of space in our logging cluster during our ingest. We had so many more logs than normal that our OpenSearch cluster ran out of space to store them! In the console, we manually added 2 more data nodes to the logging cluster and bumped their disk space by 20%.
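
The same change can be scripted rather than clicked through in the console. Roughly, with the domain name and sizes as placeholders:

```python
import boto3

opensearch = boto3.client("opensearch")

# Equivalent to the console change: more data nodes and bigger EBS
# volumes for the logging cluster.
opensearch.update_domain_config(
    DomainName="logging-cluster",                        # placeholder domain name
    ClusterConfig={"InstanceCount": 5},                  # e.g. 3 -> 5 data nodes
    EBSOptions={"EBSEnabled": True, "VolumeSize": 120},  # GiB per node, ~20% bump
)
```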

This added more cost, but solved a critical observability issue for us. In the future, we will add an auto-scaling policy to the logging cluster and allow it to grow and shrink to fit our needs.

In conclusion, we learned a lot from this ingestion and have a good list of projects to keep accelerating the scale of data we can handle.

Comments? Thoughts? See anything we missed or did wrong? Send it over to jobs@singularity.energy; we’d love to chat.
