Campaign Management System Design
Description:
Fake Company is going to build a campaign management system, part of which is sending campaign messages to users via email. The Customer model of Fake Company holds the user details (such as email, phone number, etc.) of all their customers. The campaign will be launched on June 24, 2020. Therefore we’ll have to design a Campaign Management System through which we can launch various types of campaigns targeting many customers and let them know about the ongoing campaigns through email.
System V1
Current System
In this system, we fetch the desired users’ information from the user database, take the email ids, and start sending them emails.
- A certain DB contains the information of all the users.
- We fetch all the User records from the DB.
- A User model has the email and phone number.
- Among all the fetched records, we filter the records that have an email and pass them to a separate process, the Email Sending Process. This process takes all the email ids and sends an email to each of them one by one.
This whole process is triggered by a cron job that runs at 00:00. So we will write a routine in the cron which will trigger our `fetch-item-from-db` process. We can put the `fetch-item-from-db` process and the cron routine on the same server to lessen the delay.
Modifications
- Rather than fetching all the user parameters, we only need `user_id` and `user_email`. There is no need to load the whole user record. In this way, we save space in our server’s memory. Why? If you load all the information and the information for one user is 1 KB, then for 20 million users that is about 20 GB (roughly 19 GiB) of data. We can avoid occupying this space by loading only the id and email.
System V2
In this version, we have two separate processes, P1 and P2, to handle the email work and the SMS work. And we are also fetching only the email and phone number of the user.
Looks good?
Well, if you want to send 100~300 emails, it seems good. But what if you want to send 20,000 emails? Then we have a problem.
Problems of V2
If the email process (the triangle in the upper portion of the image) is an HTTP/HTTPS process, then we can send the email ids one by one from our `db-fetching-process` to our `email process`. So what is the problem?
If our `db_processing_system` goes down, we are back to square one. Even if our `email_sending_process` is working fine, we are stuck as long as our `db_processing_system` is down.
So it seems this connection between the `db-fetching-process` and the `email process` is a bottleneck of our system. How can we mitigate this?
We can use a Message Oriented Middleware (MoM) here. Let’s use a FIFO Queue between these two processes.
Let’s look at the diagram first.
System V3
So there’s a Queue in between the `db-fetching-process` and the `email process`. What are we achieving with this?
We first collect all the desired users from the `database` and push their email ids one by one to the Queue. Our `email process` will pick them up and send the emails.
Here, our `db-fetching-process` is the publisher, publishing messages to the Queue, and our `email process` is the consumer, consuming messages from the Queue.
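The publisher/consumer decoupling can be sketched in-process with Python’s thread-safe FIFO queue; in production the Queue would be a broker such as RabbitMQ, and the email ids below are made up for illustration.

```python
import queue
import threading

email_queue = queue.Queue()  # FIFO queue between the two processes
sent = []

def db_fetching_process(emails):
    """Publisher: push each email id onto the queue."""
    for email in emails:
        email_queue.put(email)
    email_queue.put(None)  # sentinel: no more messages

def email_process():
    """Consumer: pull email ids and 'send' them one by one."""
    while True:
        email = email_queue.get()
        if email is None:
            break
        sent.append(email)  # stand-in for the real email vendor call

consumer = threading.Thread(target=email_process)
consumer.start()
db_fetching_process(["a@example.com", "b@example.com", "c@example.com"])
consumer.join()
print(sent)
```

Note that the consumer only depends on the queue, not on the publisher: everything already enqueued can still be delivered even if the publisher stops.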
So how does this mitigate the above bottleneck?
Say we have to send 20,000 emails. We get all 20,000 email ids from the `database` and start publishing them to the Queue. Now say that after publishing 10,000 email ids, our `db-fetching-process` goes down. But now, we are not stuck. Because the email ids are already inside the Queue, the `email process` won’t have to pause. It will carry on with its task: it will consume email ids from the Queue and send the mails. And just like that, our issue is solved.
If you look at our V3 image carefully, you can see that we have a new component here: a Content Management System (CMS). System V1 and System V2 did not contain it. What is it used for?
We have to send an email, right? So we have an email message body, and it can vary from campaign to campaign. Certainly we can’t hardcode it in our `email process`. That’s why we need a service where we can create, update, and delete content, and from which our other services can fetch that content.
So in our `email process` we will have the campaign id (inside the message we consume from the Queue), and with the campaign id we can fetch our desired content, the email message body, from our CMS.
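The lookup described above can be sketched as follows; the CMS is modeled as a plain dict, and the campaign ids, message shape, and bodies are all assumptions for illustration (a real CMS would sit behind an HTTP API).

```python
# Dict standing in for the CMS: campaign id -> email message body.
CMS = {
    "summer-sale-2020": "Hello! Our summer sale is live until June 30.",
    "welcome-series": "Welcome aboard! Here is how to get started.",
}

def build_email(message):
    """Take a queue message {campaign_id, user_email} and build the email."""
    body = CMS[message["campaign_id"]]  # fetch the content from the CMS
    return {"to": message["user_email"], "body": body}

email = build_email({"campaign_id": "summer-sale-2020",
                     "user_email": "a@example.com"})
print(email["body"])
```

Because the body lives in the CMS and the queue message only carries the campaign id, launching a new campaign means adding content, not redeploying the `email process`.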
Looks good, right? OK, so you developed this system, deployed it in production, and you are now chilling. The very next day you start getting complaints from your customers that you are flooding their inboxes with lots of emails advertising a certain campaign. And now you are sweating. What went wrong? The system looked good.
Let’s find out.
Problems of V3
- Say this whole system crashed; it will reboot after some time. Let’s explain the problem in this scenario. Before crashing, the system had fetched 10,000 email ids from the `database` and pushed them to the Queue, and 7,000 of those emails had already been sent. At this point, if the `db-fetching-process` crashes and goes down, your `email process` will still be able to send emails to the email ids already inside the Queue. But when the `db-fetching-process` reboots, it will fetch all the email ids of the users who are part of this campaign from the `database` again and push them into the Queue again, which the `email process` will pick up and send to again. So we are sending emails multiple times to the same users and flooding their inboxes. Very bad user experience, isn’t it? Not to mention we are burning resources.
- Until now, how are we fetching our users from the database? We make the decision inside our `db-fetching-process` about which users to fetch for a specific campaign. What if we want to run multiple campaigns at the same time? Will we keep this configuration inside our `db-fetching-process`? If so, we’ll have to keep changing it each time we run a new campaign. It doesn’t sound good to keep this kind of dependency, does it? No, it does not. We should not keep this logic here.
- At one point in time, we are reading all the rows of the user table at once. If we have 100~200 rows, that’s OK. But if we have millions of rows, will our current way of reading the table perform well? Of course not.
- Up until now, we have only one consumer of our Queue, our `email process`. What does it do? It sends emails. Now if we have a spike in our Queue, meaning many email ids are coming into the Queue, we still have only one worker to process all of them. So we are giving all the load of our Queue to the only worker, the `email process`. If this load is more than it can handle, it will crash.
- Our vendor party will ultimately send the mails. But all vendor API endpoints have limits, right? You can’t just bombard their API; there is a certain rate limit at which you can hit it. Also, vendors will usually want us to send bulk requests, so we can’t hit their API one by one. We can hit it once with a bulk payload: maybe one API call whose payload contains the email ids of 1,000 users.
So we’ve identified the problems. Let’s see how we can fix them.
- For problem 1, we need to keep state. The state records which users have already been sent the email. Then, if our `db-fetching-process` goes down and reboots, it will again fetch all the desired users from the database, check which users have already received the email, and not push the same email ids to the Queue again.
- We need a dashboard or portal from which an admin can create a campaign. We can have intermediate storage from which our `db-fetching-process` can get all the active campaigns.
- We need a paginated way of reading to solve the table-scan problem. You’ve heard about paginated APIs, right? When you have many records, you divide them into pages and send the responses page by page, with perhaps 20 documents per page. That’s the mechanism we’ll use here. For millions of rows, we will read the rows chunk by chunk, where each chunk may contain 5,000 of our desired users. So if we have 20 million rows in our table, and among them we need 1 million users for our current campaign, we can get these users chunk by chunk. With 5,000 users per chunk, we will need 1000000/5000 = 200 chunks.
- We need to horizontally scale our worker, meaning we need multiple workers consuming from the Queue. That way we can distribute the load of the Queue among them.
Let’s look at the diagram below:
System V4
- We are using Redis to maintain the state.
- We use a campaign management portal to create campaigns. The currently active campaigns are in the Queue. The `db-fetching-process` takes the currently active campaign and, based on it, gets the desired users from the `database` chunk by chunk.
- We have horizontally scaled our `email-sending process` and created several copies of it. All of them together will handle the load of the Queue. (OK, in the picture it looks vertical, but it’s not vertical scaling. We just created new servers for the email process, and creating new servers is, simply put, horizontal scaling.)
- Rather than hitting our vendor API frequently, we hit it once with a bulk payload. Our bulk payload contains the email ids of 1,000 users. Therefore we won’t face the throttling issue.
Notes: To read more about QoS you can check this out: https://www.rabbitmq.com/confirms.html