How to delete 100 million rows in a MySQL table within 30 minutes?

Sajeban Antonyrex
Sysco LABS Sri Lanka
4 min read · Sep 13, 2022

Hook

Recently, I ran into a scenario where I had to delete 104 million rows, based on a condition, from a table that contained 293 million rows.

I was able to delete them in under 30 minutes using a Java application that made parallel calls to a stored procedure. Let's see how to do this.

Stack I used: Java and AWS RDS on an r6g.4xlarge instance.

You can apply the same approach to your choice of tech stack.

Beginning

Deleting a row in a MySQL table is slow compared to other operations. The cost depends on the primary key, the indexes, the size of the table, and the computational power of the database service.

If we want to delete all the rows in a table, it's best to use TRUNCATE. A single DELETE statement causes locking and other potential issues, so when dealing with millions of rows it can consume a huge amount of time.

If we want to delete a large number of rows based on a condition, the best way to do it is by calling a stored procedure, as shown below.

Let's say our business is going to shut down all operations in Columbia, so we are going to delete all the customers in that area.

DELIMITER $$
CREATE PROCEDURE delete_customer()
BEGIN
    -- Relax checks so the bulk delete is not slowed down or blocked.
    SET foreign_key_checks = 0;
    SET sql_safe_updates = 0;
    SET innodb_lock_wait_timeout = 1000;

    deletion_loop: LOOP
        -- Collect the next batch of matching IDs as a comma-separated string.
        SELECT GROUP_CONCAT(customer_id)
        FROM (SELECT customer_id FROM customer
              WHERE state = 'Columbia' LIMIT 100000) AS t INTO @data;

        -- No matching rows left, so we are done.
        IF @data IS NULL THEN
            LEAVE deletion_loop;
        END IF;

        -- Delete the collected batch using a dynamically prepared statement.
        SET @gQuery = CONCAT('DELETE FROM customer WHERE customer_id IN (', @data, ')');
        PREPARE stmt1 FROM @gQuery;
        EXECUTE stmt1;
    END LOOP;

    SET foreign_key_checks = 1;
    SET sql_safe_updates = 1;
END$$
DELIMITER ;

This caused multiple issues. Why? Read below.

Pinch Point

This stored procedure deletes 100,000 rows at a time. Depending on the number of rows we have, this will consume time (a lot of time). For me, it took 6–7 hours to complete the deletion, and I noticed that it utilized less than 10% of the database CPU. Yes, I tweaked the values, increased the limit to millions and decreased it to thousands, but things didn't work in my favor: the changes didn't make much of a difference in the time consumed, or at times caused timeout issues.

What's happening in the background? The MySQL engine goes for a full table scan rather than an index lookup every time, which consumes a huge amount of time, and since this runs single-threaded, the database service's resources are underutilized.

What can we do now?

The Big Reveal

Rather than calling one stored procedure to handle the whole table, we can call multiple stored procedures in parallel, feeding each of them a part of the table.

Are you with me? OK! Let me use an example.

Let's say our table has 5 million rows. We'll call one stored procedure to handle the first million rows, another to handle the second million, and so on.

Let's make some changes to the stored procedure so that it receives the start pointer and end pointer of its chunk, as well as the number of rows to be deleted at once.

DELIMITER $$
CREATE PROCEDURE delete_customer_by_chunks(IN start_pointer INT, IN end_pointer INT, IN limit_amt INT)
BEGIN
    -- Relax checks so the bulk delete is not slowed down or blocked.
    SET foreign_key_checks = 0;
    SET sql_safe_updates = 0;

    deletion_loop: LOOP
        -- Collect the next batch of matching IDs within this chunk's key range.
        SELECT GROUP_CONCAT(customer_id)
        FROM (SELECT customer_id FROM customer
              WHERE customer_id BETWEEN start_pointer AND end_pointer
                AND state = 'Columbia' LIMIT limit_amt) AS t INTO @data;

        -- No matching rows left in this chunk, so we are done.
        IF @data IS NULL THEN
            LEAVE deletion_loop;
        END IF;

        -- Delete the collected batch using a dynamically prepared statement.
        SET @gQuery = CONCAT('DELETE FROM customer WHERE customer_id IN (', @data, ')');
        PREPARE stmt1 FROM @gQuery;
        EXECUTE stmt1;
    END LOOP;

    SET foreign_key_checks = 1;
    SET sql_safe_updates = 1;
END$$
DELIMITER ;

Are we finished? Not yet!

We need something to execute this.

Here we go! We have to create a Java application that makes parallel calls to the stored procedure and feeds it values.

First, let's get the minimum and maximum primary IDs of customers from Columbia so we can avoid the overhead of looking into unnecessary rows. This applies only if you have an incremental primary ID or some other unique ID, since we are using a seek approach here. (Check out this article to learn more about seek.)
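A minimal JDBC sketch of this step, assuming the customer table from the example above, could look like the following (the class name IdRangeFetcher, the JDBC URL, and the credentials are placeholders, not the original code):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class IdRangeFetcher {

    // Placeholder connection settings -- replace with your RDS endpoint and credentials.
    private static final String JDBC_URL = "jdbc:mysql://localhost:3306/sales";

    /** Returns {minId, maxId} for the rows we intend to delete. */
    public static long[] fetchIdRange() throws Exception {
        try (Connection con = DriverManager.getConnection(JDBC_URL, "user", "password");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT MIN(customer_id), MAX(customer_id) FROM customer WHERE state = 'Columbia'")) {
            rs.next();
            return new long[]{rs.getLong(1), rs.getLong(2)};
        }
    }
}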

Now let's divide the table equally into chunks of whatever size we want. As I said earlier, I will be feeding the stored procedure 1 million rows at a time (CHUNK_SIZE). The chunks (the starting primary key and ending primary key of each) go into a queue (manipulationRebateIdPeriodQueue in my code).

Now, let's create the 15 threads (THREAD_SIZE) that will work through this queue, as sketched below.
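The following is a rough sketch of how this chunking and thread creation could look, assuming a simple queue of key ranges that the worker threads consume (ChunkRange, chunkQueue, and ParallelDeletion are placeholder names of mine; IdRangeFetcher and CustomerDeletionRunnable refer to the sketches before and after this one):

import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelDeletion {

    public static final int CHUNK_SIZE = 1_000_000;
    public static final int THREAD_SIZE = 15;
    public static final int LIMIT = 8000;

    /** Simple holder for the first and last primary key of a chunk. */
    record ChunkRange(long start, long end) {}

    public static void main(String[] args) throws Exception {
        // Min and max customer_id from the previous step.
        long[] range = IdRangeFetcher.fetchIdRange();

        // Split [min, max] into CHUNK_SIZE-wide ranges and queue them up.
        ConcurrentLinkedQueue<ChunkRange> chunkQueue = new ConcurrentLinkedQueue<>();
        for (long start = range[0]; start <= range[1]; start += CHUNK_SIZE) {
            long end = Math.min(start + CHUNK_SIZE - 1, range[1]);
            chunkQueue.add(new ChunkRange(start, end));
        }

        // One worker per thread; each worker keeps pulling chunks until the queue is empty.
        ExecutorService pool = Executors.newFixedThreadPool(THREAD_SIZE);
        for (int i = 0; i < THREAD_SIZE; i++) {
            pool.submit(new CustomerDeletionRunnable(chunkQueue, LIMIT));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
    }
}

Using a shared queue rather than pinning fixed chunks to each thread keeps all 15 threads busy even when some chunks contain far more matching rows than others.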

Let's look into what the CustomerDeletionRunnable does.

Here we call the stored procedure with the start and end IDs of the rows that it has to look into. We also pass the limit, which is the number of rows it deletes at a time. We could set the limit in the stored procedure itself, but passing it from the application gives us more control and makes it easier to tweak the values and test.
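A sketch of what such a runnable might contain, using JDBC's CallableStatement together with the queue from the sketch above (the JDBC URL and credentials are again placeholders):

import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.Queue;

public class CustomerDeletionRunnable implements Runnable {

    // Placeholder connection settings -- replace with your RDS endpoint and credentials.
    private static final String JDBC_URL = "jdbc:mysql://localhost:3306/sales";

    private final Queue<ParallelDeletion.ChunkRange> chunkQueue;
    private final int limit;

    public CustomerDeletionRunnable(Queue<ParallelDeletion.ChunkRange> chunkQueue, int limit) {
        this.chunkQueue = chunkQueue;
        this.limit = limit;
    }

    @Override
    public void run() {
        try (Connection con = DriverManager.getConnection(JDBC_URL, "user", "password")) {
            // GROUP_CONCAT of thousands of IDs easily exceeds the default group_concat_max_len,
            // so raise it for this session before calling the procedure.
            try (Statement stmt = con.createStatement()) {
                stmt.execute("SET SESSION group_concat_max_len = 4294967295");
            }

            // Keep pulling chunks until the queue is drained.
            ParallelDeletion.ChunkRange chunk;
            while ((chunk = chunkQueue.poll()) != null) {
                try (CallableStatement cs = con.prepareCall("{CALL delete_customer_by_chunks(?, ?, ?)}")) {
                    cs.setLong(1, chunk.start());
                    cs.setLong(2, chunk.end());
                    cs.setInt(3, limit);
                    cs.execute();
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}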

You must have noticed by now that we are also setting the GROUP_CONCAT max length (group_concat_max_len). Since the primary keys can be long, the concatenated string built in the stored procedure can get very long; it was throwing an error when I tested it, and I had to set this to the maximum value it can take.

Resolution

If we have a powerful database service, we can make parallel deletion calls to the stored procedure and utilize the database's performance to the fullest.

You may have to increase or decrease the thread count, chunk size, and limit depending on the computational power of the database service and the number of rows in the table. These are the values I used:

public static final int CHUNK_SIZE = 1000000;  
public static final int THREAD_SIZE = 15;
public static final int LIMIT = 8000;

I too tested a lot of times (I say it again, a lot of times) to find the threshold values for these attributes. If we go beyond these thresholds, there is a high chance of getting a timeout error.

Go ahead and try this approach (At your own risk 😉)

Happy deletion Folks!

Final Credits

Find the complete code in this repository.

Thanks to Pasindu Senanayake.
