Campaign Management System Design
Description:
Fake Company is going to build a campaign management system, part of which is sending campaign messages to users via email. The Customer model of Fake Company holds the user details (such as email, phone number, etc.) of all their customers. The campaign will be launched on June 24, 2020. Therefore we’ll have to design a Campaign Management System through which we can launch various types of campaigns targeting many customers and let them know about the ongoing campaigns through email.
System V1
Current System
In this system, we fetch the desired users’ information from the user database, take the email ids, and start sending them emails.
- A certain DB contains the information of all the users.
- We fetch all the User records from the DB.
- A User model has the email and phone number.
- Among all the fetched records, we filter the records that have an email and pass them to a separate process, the Email Sending Process. This process takes all the email ids and sends an email to each of them one by one.
This whole process is triggered by a cron job that runs at 00:00. So we will write a routine in the cron which will trigger our `fetch-item-from-db` process. We can put the `fetch-item-from-db` process and the cron routine on the same server to lessen the delay.
Modifications
- Rather than fetching all the user parameters, we only need `user_id` and `user_email`. There is no need to load the whole user record. In this way, we save space in our server’s memory. Why? If you load all the information and the information for one user is 1 KB, then for 20 million users that is about 20 GB (roughly 19 GiB) of data. We can avoid occupying this space by loading only the id and email.
System V2
In this version, we have two separate processes, P1 and P2, to handle the email work and the SMS work. And we are also fetching only the email and phone number of the user.
Looks good?
Well, if you want to send 100~300 emails, it seems good. But what if you want to send 20,000 emails? Then we have a problem.
Problems of V2
If the email process (the triangle in the upper portion of the image) is an HTTP/HTTPS process, then we can send the email ids one by one from our `db-fetching-process` to our `email process`. So what is the problem?
If our `db_processing_system` goes down, we are back to square one. Even if our `email_sending_process` is working fine, we are stuck as long as our `db_processing_system` is down.
So it seems this connection between the `db-fetching-process` and the `email process` is a bottleneck of our system. How can we mitigate this?
We can use a Message Oriented Middleware (MoM) here. Let’s use a FIFO Queue between these two processes.
Let’s look at the diagram first.
System V3
So there’s a Queue in between the `db-fetching-process` and the `email process`. What are we achieving with this?
We first collect all the desired users from the `database` and push their email ids one by one to the Queue. Our `email process` will pick them up and send the emails.
Here, our `db-fetching-process` is the publisher, publishing messages to the Queue, and our `email process` is the consumer, consuming messages from the Queue.
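The publisher/consumer decoupling can be sketched in-process with Python’s thread-safe FIFO queue; in production the Queue would be a broker such as RabbitMQ, and the email ids below are made up for illustration.

```python
import queue
import threading

email_queue = queue.Queue()  # FIFO queue between the two processes
sent = []

def db_fetching_process(emails):
    """Publisher: push each email id onto the queue."""
    for email in emails:
        email_queue.put(email)
    email_queue.put(None)  # sentinel: no more messages

def email_process():
    """Consumer: pull email ids and 'send' them one by one."""
    while True:
        email = email_queue.get()
        if email is None:
            break
        sent.append(email)  # stand-in for the real email vendor call

consumer = threading.Thread(target=email_process)
consumer.start()
db_fetching_process(["a@example.com", "b@example.com", "c@example.com"])
consumer.join()
print(sent)
```

Note that the consumer only depends on the queue, not on the publisher: everything already enqueued can still be delivered even if the publisher stops.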
So how does this mitigate the above bottleneck?
Say we have to send 20,000 emails. We get all 20,000 email ids from the `database` and start publishing them to the Queue. Now say that after publishing 10,000 email ids, our `db-fetching-process` goes down. But now, we are not stuck. Because the email ids are already inside the Queue, the `email process` won’t have to pause. It will carry on with its task: it will consume email ids from the Queue and send the mails. And just like that, our issue is solved.
If you look at our V3 image carefully, you can see that we have a new component here: a Content Management System (CMS). System V1 and System V2 did not contain it. What is it used for?
We have to send an email, right? So we have an email message body, and it can vary from campaign to campaign. Certainly we can’t hardcode it in our `email process`. That’s why we need a service where we can create, update, and delete content, and from which our other services can fetch that content.
So in our `email process` we will have the campaign id (inside the message we consume from the Queue), and with the campaign id we can fetch our desired content, the email message body, from our CMS.
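The lookup described above can be sketched as follows; the CMS is modeled as a plain dict, and the campaign ids, message shape, and bodies are all assumptions for illustration (a real CMS would sit behind an HTTP API).

```python
# Dict standing in for the CMS: campaign id -> email message body.
CMS = {
    "summer-sale-2020": "Hello! Our summer sale is live until June 30.",
    "welcome-series": "Welcome aboard! Here is how to get started.",
}

def build_email(message):
    """Take a queue message {campaign_id, user_email} and build the email."""
    body = CMS[message["campaign_id"]]  # fetch the content from the CMS
    return {"to": message["user_email"], "body": body}

email = build_email({"campaign_id": "summer-sale-2020",
                     "user_email": "a@example.com"})
print(email["body"])
```

Because the body lives in the CMS and the queue message only carries the campaign id, launching a new campaign means adding content, not redeploying the `email process`.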
Looks good, right? OK, so you developed this system, deployed it in production, and you are now chilling. The very next day you start getting complaints from your customers that you are flooding their inboxes with lots of emails advertising a certain campaign. And now you are sweating. What went wrong? The system looked good.
Let’s find out.
Problems of V3
- Say this whole system crashed; it will reboot after some time. Let’s explain the problem in this scenario. Before crashing, the system had fetched 10,000 email ids from the `database` and pushed them to the Queue, and 7,000 of those emails had already been sent. At this point, if the `db-fetching-process` crashes and goes down, your `email process` will still be able to send emails to the email ids already inside the Queue. But when the `db-fetching-process` reboots, it will fetch all the email ids of the users who are part of this campaign from the `database` again and push them into the Queue again, which the `email process` will pick up and send to again. So we are sending emails multiple times to the same users and flooding their inboxes. Very bad user experience, isn’t it? Not to mention we are burning resources.
- Until now, how are we fetching our users from the database? We make the decision inside our `db-fetching-process` about which users to fetch for a specific campaign. What if we want to run multiple campaigns at the same time? Will we keep this configuration inside our `db-fetching-process`? If so, we’ll have to keep changing it each time we run a new campaign. It doesn’t sound good to keep this kind of dependency, does it? No, it does not. We should not keep this logic here.
- At one point in time, we are reading all the rows of the user table at once. If we have 100~200 rows, that’s OK. But if we have millions of rows, will our current way of reading the table perform well? Of course not.
- Up until now, we have only one consumer of our Queue, our `email process`. What does it do? It sends emails. Now if we have a spike in our Queue, meaning many email ids are coming into the Queue, we still have only one worker to process all of them. So we are giving all the load of our Queue to the only worker, the `email process`. If this load is more than it can handle, it will crash.
- Our vendor party will ultimately send the mails. But all vendor API endpoints have limits, right? You can’t just bombard their API; there is a certain rate limit at which you can hit it. Also, vendors will usually want us to send bulk requests, so we can’t hit their API one by one. We can hit it once with a bulk payload: maybe one API call whose payload contains the email ids of 1,000 users.
So we’ve identified the problems. Let’s see how we can fix them.
- For problem 1, we need to keep state. The state records which users have already been sent the email. Then, if our `db-fetching-process` goes down and reboots, it will again fetch all the desired users from the database, check which users have already received the email, and not push the same email ids to the Queue again.
- We need a dashboard or portal from which an admin can create a campaign. We can have intermediate storage from which our `db-fetching-process` can get all the active campaigns.
- We need a paginated way of reading to solve the table-scan problem. You’ve heard about paginated APIs, right? When you have many records, you divide them into pages and send the responses page by page, with perhaps 20 documents per page. That’s the mechanism we’ll use here. For millions of rows, we will read the rows chunk by chunk, where each chunk may contain 5,000 of our desired users. So if we have 20 million rows in our table, and among them we need 1 million users for our current campaign, we can get these users chunk by chunk. With 5,000 users per chunk, we will need 1000000/5000 = 200 chunks.
- We need to horizontally scale our worker, meaning we need multiple workers consuming from the Queue. That way we can distribute the load of the Queue among them.
Let’s look at the diagram below:
System V4
- We are using Redis to maintain the state.
- We use a campaign management portal to create campaigns. The currently active campaigns are in the Queue. The `db-fetching-process` takes the currently active campaign and, based on it, gets the desired users from the `database` chunk by chunk.
- We have horizontally scaled our `email-sending process` and created several copies of it. All of them together will handle the load of the Queue. (OK, in the picture it looks vertical, but it’s not vertical scaling. We just created new servers for the email process, and creating new servers is, simply put, horizontal scaling.)
- Rather than hitting our vendor API frequently, we hit it once with a bulk payload. Our bulk payload contains the email ids of 1,000 users. Therefore we won’t face the throttling issue.
Notes: To read more about QoS you can check this out: https://www.rabbitmq.com/confirms.html