How We Scaled Our User Import Task Using Bulk Operations

Shubham Najardhane · Published in The Startup · Aug 4, 2020 · 5 min read

We have functionality in our loyalty product to allow our clients to import their customers into the program. Our team wrote this as a Celery task which would read each row from a CSV file and call our existing user enrolment method. This method would insert a user into the database and send out emails to the user.

When we released this feature, it worked perfectly fine for small customers who would import around 20k-30k customers into the program. Then one fine day, horror struck when one of our customers wanted to import 700k users. All of our nodes (Celery workers in our case) were tied up with just this one task. Our RabbitMQ queue started filling up and we received continuous memory alerts. Our database connections shot up to astonishingly high numbers, making the system super slow. The import took almost a day to complete (during which we faced a lot of customer support queries), and after this nightmare we decided to improve the functionality.

We went through the user import task once again and identified a few performance issues.

  • We were making a database call for each user to fetch configuration data such as email notification details, merchant configuration settings, etc.
  • Each user was inserted into the database with its own individual query.

We realized that we were doing a lot of repetitive work. Let’s compare this with our public transport system. Suppose 50 employees work in an organization and also live in the same society. Everyone uses their own vehicle to commute to the office. For every one of them, the organization’s security personnel have to authorize the vehicle before it enters the office premises, make parking space available for each vehicle, and so on. This increases traffic on the road, causing congestion for others as well. And not to mention, everybody is spending money on fuel.

public vs private transport system

So the organization decides to provide a bus service for those employees. Now everyone boards the bus from the same location. The road carries only one bus instead of 50 individual vehicles. The security personnel at the campus have to authorize only one vehicle and need parking space for only one. The fuel cost for each employee drops significantly. This benefits everyone.

Mapping this analogy to our problem, we were doing somewhat the same thing. We listed down the enhancements that needed to be made.

  • Fetch all the required metadata, like merchant configuration and email notification details, at the start of the process.
  • Stop doing the same repetitive work over and over again.
  • Reduce the round trips to the database while inserting users. (Fit everyone into one bus instead of individual vehicles.)

We came up with the architecture below to handle any CSV (small or huge) without affecting our system. Since we were implementing this functionality from scratch, we decided to try out a new technology. We had heard about serverless architecture and its ease of use, so we decided to give it a try and implemented this functionality with AWS Lambda. We implemented two Lambda functions and an SQS queue to bridge them.

I will not go into much detail about the Lambda implementation part. Let’s focus on the DB bulk operations.

Architecture diagram to import bulk users into the database
  • The first Lambda function takes a large CSV as input and splits it into smaller chunks. The smaller chunks are then uploaded to an S3 bucket, and the file location information is pushed into an AWS SQS FIFO queue.
  • Another Lambda function, which listens to the SQS queue, processes each of the smaller CSV chunks. Each invocation of this function downloads one chunk from the S3 bucket, reads the users’ information from that chunk, and prepares a MongoDB bulk insert query. (A rough sketch of the splitting step follows.)
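The post does not include the Lambda code itself, but the sketch below shows roughly what the chunk-splitting step could look like. The bucket name, queue URL, chunk size, and helper names are placeholders for illustration, not our actual implementation:

```python
import csv
import io
import json
import uuid

import boto3  # available by default in the AWS Lambda Python runtime

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

# Placeholder resource names for illustration only.
CHUNK_BUCKET = "user-import-chunks"
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/user-import.fifo"
CHUNK_SIZE = 5000  # rows per chunk file


def split_and_enqueue(bucket, key):
    """Split a large CSV from S3 into smaller chunk files and queue their locations."""
    # For simplicity this sketch reads the whole file into memory;
    # a streaming read would be preferable for very large files.
    raw = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    reader = csv.reader(io.StringIO(raw))
    header = next(reader)

    chunk, index = [], 0
    for row in reader:
        chunk.append(row)
        if len(chunk) == CHUNK_SIZE:
            _upload_chunk(header, chunk, key, index)
            chunk, index = [], index + 1
    if chunk:
        _upload_chunk(header, chunk, key, index)


def _upload_chunk(header, rows, source_key, index):
    # Write the chunk back to S3 as its own CSV file.
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(header)
    writer.writerows(rows)

    chunk_key = f"chunks/{source_key}/{index}.csv"
    s3.put_object(Bucket=CHUNK_BUCKET, Key=chunk_key, Body=out.getvalue().encode("utf-8"))

    # A single MessageGroupId keeps the chunks in order on the FIFO queue.
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"bucket": CHUNK_BUCKET, "key": chunk_key}),
        MessageGroupId=source_key,
        MessageDeduplicationId=str(uuid.uuid4()),
    )
```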

What are bulk operations? Why are they important?
Most databases provide a mechanism for bulk operations. In general, bulk operations should be used to reduce the number of round trips to your DB, resulting in less traffic to your DB.

In our case, we were using MongoDB. Consider the following two pieces of code in PyMongo. Both result in the same output but have a huge performance difference.
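The original article embedded these snippets as gists; the sketch below reconstructs the comparison, assuming a users collection and a list of user documents (the connection string and field names are placeholders):

```python
from pymongo import MongoClient, InsertOne

# Placeholder connection details and sample documents for illustration only.
client = MongoClient("mongodb://localhost:27017")
users = client["loyalty"]["users"]
user_docs = [{"email": f"user{i}@example.com", "points": 0} for i in range(5000)]

# Piece 1: one network round trip per document.
for doc in user_docs:
    users.insert_one(doc)

# Piece 2: a single bulk operation; the driver batches the documents
# into as few write commands as possible.
users.bulk_write([InsertOne(doc) for doc in user_docs], ordered=False)
```

With ordered=False, the server keeps inserting the remaining documents even if one of them fails, which suits an import job where a few bad rows should not abort the whole chunk.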

What should be the chunk size?
We have seen the advantages of bulk operations. But there is a limit to the chunk size as well. Just as it is not possible to fit an infinite number of people into a single bus, beyond a certain limit you have to add another bus to the service.

Similarly, bulk operations have some restrictions on the amount of data you can pass in a single bulk operation. The limit depends on several factors, like:

  • What is the average size of your single document?
  • How much data can be transferred in a single network call?
  • What is the bulk operations limit supported by your DB? (A quick way to check this for MongoDB is shown below.)
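For MongoDB, the server advertises its per-command write batch limit as maxWriteBatchSize (100,000 operations on recent versions). A quick way to check it from PyMongo (the URI below is a placeholder) is:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder URI

# "hello" is the modern command name; older servers answer to "ismaster".
print(client.admin.command("hello")["maxWriteBatchSize"])
```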

Initially, we were not sure what to choose as the bulk operation limit. Along with the database insertion, we were also performing some other business logic. With some trial and error, and by considering the average document size, we decided to keep it at 5000 documents per bulk operation.
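As a minimal sketch of what that looks like on the consumer side (the helper name is illustrative, not our production code):

```python
from pymongo import InsertOne

BULK_LIMIT = 5000  # the limit we settled on through trial and error


def insert_in_batches(collection, documents, batch_size=BULK_LIMIT):
    """Issue bulk writes with at most `batch_size` inserts per call."""
    for start in range(0, len(documents), batch_size):
        batch = documents[start:start + batch_size]
        collection.bulk_write([InsertOne(doc) for doc in batch], ordered=False)
```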

Performing all the operations in a sequential manner.
We decided to use an (AWS SQS) FIFO queue as it lines up the processing of the smaller chunks. We could also have processed all the chunks in parallel, but running them in parallel might have increased the write overhead on the database.

Take-Aways

In our application, most of the time we perform an operation on a single document in the DB as part of the normal flow. But sometimes we need to handle a batch of documents. Treating those two flows separately, even though they produce the same output, and using the specific kinds of operations your database provides for each purpose, can make a huge performance difference.

I would be very glad if you find this blog post useful and informative. This is the first time I am writing something. Special thanks to my colleague, Yahya Poonawala, who helped me a lot with writing this blog post. Please do share the blog and leave your comments. Suggestions and feedback are most welcome.
