In this blog we will try to build a miniature version of AWS Simple Notification Service (SNS).
Simple Notification Service lets a publisher publish messages to a topic, and users subscribe to one or more topics. Whenever a publisher publishes a message to a topic, every subscriber of that topic receives it. Publisher and consumer are unaware of each other; they never communicate directly.
We are required to design an SNS-like service that clients all over the world can use to read and write messages.
- User should be able to publish messages
  - Message order — message order must be maintained.
  - Grouping by topic — messages must be grouped by topic.
- User should be able to receive messages
  - Message order — a subscriber should receive messages in the order they were published to the topic.
  - Grouping by topic — a subscriber should receive only messages from the topics it subscribed to.
For retry purposes, a message should be retained for 21 days if it has not been consumed. (AWS SNS keeps retrying for ~20 days if the consumer [Lambda or SQS] is unavailable.)
Scale & Performance
- The application should be able to sustain uneven traffic distribution.
- Peak load is 50M messages per day.
- Average message size is 4 kB.
- It should be able to process a considerably high volume of messages.
The solution should be available across multiple locations (Mumbai and Singapore), and the application should keep processing messages even if one DC fails.
Building An MVP
The above diagram shows the basic flow of the system. The publisher publishes message(s) to topic(s) and consumer(s) read and process them.
- Publisher — one who publishes messages.
- Consumer — one who reads messages from a topic (identified by CONSUMER_ID).
- createTopic(topic_name): → TOPIC_ID
- createSubscription(topic_id, consumer_id): → SUBSCRIPTION_ID
- publish(TOPIC_ID, Message): → sends back an ack on successful write
- generateMessageId(TOPIC_ID): → MESSAGE_ID
- distribute(TOPIC_ID, CONSUMER_ID)
- remove(TOPIC_ID, MESSAGE_ID): → sends an ack on successful removal
All APIs are self-explanatory. The Message Position Marker Service helps delete the correct message id from the topic: instead of deleting the newest message, it ensures that only the message id that was last read is deleted. This preserves message ordering.
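The APIs and the position-marker behaviour above can be sketched as a minimal in-memory class. This is an illustrative sketch, not the actual service: the class name, the shared ID counter, and the dictionary-based stores are all assumptions made for brevity.

```python
import itertools
from collections import OrderedDict

class MiniSNS:
    """Illustrative in-memory sketch of the topic / subscription / marker APIs."""

    def __init__(self):
        self._ids = itertools.count(1)   # assumed: one monotonic counter for all ids
        self.topics = {}                 # topic_id -> OrderedDict(message_id -> message)
        self.subscriptions = {}          # topic_id -> set of consumer_ids
        self.markers = {}                # (topic_id, consumer_id) -> last message_id read

    def create_topic(self, topic_name):
        topic_id = next(self._ids)
        self.topics[topic_id] = OrderedDict()
        self.subscriptions[topic_id] = set()
        return topic_id

    def create_subscription(self, topic_id, consumer_id):
        self.subscriptions[topic_id].add(consumer_id)
        return (topic_id, consumer_id)   # SUBSCRIPTION_ID

    def publish(self, topic_id, message):
        message_id = next(self._ids)     # stands in for generateMessageId(TOPIC_ID)
        self.topics[topic_id][message_id] = message
        return message_id                # ack on successful write

    def distribute(self, topic_id, consumer_id):
        # Deliver messages after this consumer's position marker, in publish order.
        last = self.markers.get((topic_id, consumer_id), 0)
        for message_id, message in self.topics[topic_id].items():
            if message_id > last:
                self.markers[(topic_id, consumer_id)] = message_id
                yield message_id, message

    def remove(self, topic_id, message_id):
        # Delete exactly the message id that was last read, not the newest one,
        # so ordering is preserved for slower consumers.
        self.topics[topic_id].pop(message_id, None)
        return True                      # ack on successful removal
```

Because monotonic ids double as position markers, each consumer tracks its own progress independently, which is what lets one topic fan out to many consumers at different speeds.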
Adding resiliency and Availability
Not all APIs/services will have the same usage pattern, so we need to be smart about the number of instances each service needs. The diagram below presents an indicative service replication. Since one message is generally consumed by multiple applications, we should run more instances of the distribute service. Similarly, the message id service is used both while writing and while reading messages, so it should have higher replication than the publish and distribute services.
In the above diagram we created multiple instances of the services in a single location; we will now replicate the whole setup in a different location. This adds resiliency and reduces latency: a publisher in Mumbai hits the DC nearest to it and publishes the message, while a user in Singapore hits its nearest DC to read it. This requires synchronization between the two DCs. We can run distributed services with a consensus framework (Raft, Paxos) to maintain a common state across DCs.
Data replication between DCs needs to be monitored closely. Inter-region write latency will be high, and that will impact the overall time between write initiation and acknowledgement of a successful write. Based on latency requirements, we can choose asynchronous, synchronous, or hybrid replication.
In case of asynchronous replication, the ack is returned as soon as the write completes in the nearest DC. This gives the lowest publish latency, but it also lowers data durability. Before the data can be read from a remote location it has to be replicated to that DC, so distribution will have high latency.
In case of synchronous replication we get high latency while publishing and low latency on reads. It gives us high data durability, since data must be replicated to the remote DC before a successful publish is acknowledged.
The third approach is a middle path: replicate synchronously to, say, one remote location, while the remaining locations replicate lazily. Combined with the access pattern this can work very well: we pick the DC that is used most heavily for consumption and make it part of the synchronous ack, while the less frequently used DCs replicate lazily.
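The hybrid policy can be sketched as follows. This is a simplified single-process model under stated assumptions: the DC names are hypothetical, the "stores" are plain lists, and the lazy queue would be drained by a background worker in a real deployment.

```python
from queue import Queue

class HybridReplicator:
    """Sketch of hybrid replication: sync to one chosen remote DC, lazy to the rest."""

    def __init__(self, local_dc, sync_dc, lazy_dcs):
        self.local_dc = local_dc
        self.sync_dc = sync_dc       # the heaviest-read remote DC, per the text above
        self.lazy_dcs = lazy_dcs
        self.stores = {dc: [] for dc in [local_dc, sync_dc] + lazy_dcs}
        self.lazy_queue = Queue()    # drained asynchronously in a real system

    def publish(self, message):
        # Synchronous part: local DC and the chosen remote DC must both succeed
        # before the publisher sees an ack (two durable copies).
        self.stores[self.local_dc].append(message)
        self.stores[self.sync_dc].append(message)
        # Asynchronous part: the remaining DCs receive the message lazily.
        for dc in self.lazy_dcs:
            self.lazy_queue.put((dc, message))
        return "ACK"

    def drain(self):
        # One background replication pass over the pending lazy writes.
        while not self.lazy_queue.empty():
            dc, message = self.lazy_queue.get()
            self.stores[dc].append(message)
```

The design choice is the trade-off described above: publish latency pays for one inter-region round trip instead of all of them, while durability is still two copies at ack time.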
peak messages/day × (topic id + message id + metadata + message) × retention days ≈ memory
50M × (8 B + 8 B + 256 B + 4 kB) × 21 ≈ 4.5 TB, round up to ~5 TB
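A quick back-of-the-envelope check of that estimate, using the stated 50M messages/day peak and 21-day retention (the 256 B metadata and 8-byte ids are the assumptions from the formula above):

```python
# Storage estimate: messages/day * bytes/message * retention days.
peak_msgs_per_day = 50_000_000            # 50M messages/day (stated peak)
per_msg_bytes = 8 + 8 + 256 + 4_000       # topic id + message id + metadata + 4 kB body
retention_days = 21

total_bytes = peak_msgs_per_day * per_msg_bytes * retention_days
print(total_bytes / 1e12)                 # ≈ 4.5 TB
```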
Based on the available hardware we can work out the number of machines needed to implement this architecture. If latency is an issue we can go for SSD- or RAM-intensive instances; if we can tolerate some latency we can go with HDDs.
Network bandwidth calculation:
Since at peak we are writing ~250 GB per day, we need to find the maximum instantaneous write rate during the day. To simplify the calculation, I assume the peak is 1.5 times the average. This gives us ~35 Mbps.
READ (consumption) bandwidth can be obtained similarly. For simplicity I have assumed READ is 5 times WRITE; actual numbers can be gathered in the requirements phase.
WRITE → 250 GB/day × 1.5 × 8 bits/byte ÷ 86,400 s ≈ 35 Mbps
READ → WRITE × 5 ≈ 175 Mbps
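The same arithmetic, spelled out (the 1.5 peak factor and the READ = 5 × WRITE ratio are the assumptions stated above):

```python
# Bandwidth estimate: daily write volume converted to a peak bit rate.
bytes_per_day = 250e9                    # ~250 GB written per day
peak_factor = 1.5                        # assumed peak-to-average ratio
seconds_per_day = 24 * 3600

write_mbps = bytes_per_day * peak_factor * 8 / seconds_per_day / 1e6
read_mbps = write_mbps * 5               # READ assumed to be 5x WRITE
print(round(write_mbps), round(read_mbps))   # ≈ 35 Mbps write, ≈ 175 Mbps read
```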
ID generation + write to the DC with the highest latency + write to disk → publish latency
ID selection + max(remote DC read) + message position marker update → read latency
Based on the above bandwidth requirements, we need to select instances with the required network capability.
With this exercise we have designed a notification system.