Cloud Messaging at Scale

Published in Feedzai Techblog · 6 min read · Dec 12, 2019

In the scope of two Feedzai summer internships, we added support for Amazon Web Services (AWS) and Google Cloud Platform (GCP) messaging services to our messaging library, since we already rely on many of their other services. This post is a summary of our findings.
Article written by Henrique Ferrer and Tiago Martins.

Every second, thousands of real-time events are processed by Feedzai's fraud prevention solutions. For each event that enters the system, multiple services get notified and react accordingly. This constant load demands a highly available and fault-tolerant communication infrastructure that keeps up with the volume while ensuring low end-to-end latency.

Background

Like most distributed systems, we use synchronous RPC-style calls and asynchronous messaging to communicate between services. Messaging is supported by a custom in-house library providing a thin abstraction layer that wraps the drivers for each supported messaging broker. This library implements a publish-subscribe API that hides the peculiarities of the underlying driver and broker from the developers.

Most of our production deployments use either RabbitMQ or Kafka, but we also support ActiveMQ in data science environments.

Why use the cloud?

With cloud deployments representing a relevant and growing slice of the business, it’s important to start considering cloud-native solutions to replace the standalone messaging brokers.

Our current messaging brokers require a lot of maintenance. Our teams spend valuable hours tuning the service configuration to ensure reliable operation and diagnosing problems when they occur. Managed cloud services should alleviate this problem, since the servers are operated by the providers.

The next sections present our evaluation of AWS and GCP against the factors that matter most to Feedzai.

How it works

The messaging library implements a publisher/subscriber paradigm in which every published message is delivered to all subscribers.

This is a simplified version of the messaging API that we implemented:

Subscriber subscribe(String topicName)
    void startConsumption(Consumer<Message> consumer)
    void unsubscribe()
Publisher publisherFor(String topicName)
    void publish(Message message)
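To make the contract concrete, here is a minimal in-memory sketch of the same semantics. It is illustrative only, not the library's actual implementation, and `Message` is simplified to a `String`:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// Minimal in-memory sketch of the publish-subscribe contract above:
// every message published to a topic is delivered to all of the
// topic's subscribers.
class InMemoryBus {

    private final Map<String, List<Consumer<String>>> topics = new ConcurrentHashMap<>();

    void subscribe(String topicName, Consumer<String> consumer) {
        topics.computeIfAbsent(topicName, t -> new CopyOnWriteArrayList<>()).add(consumer);
    }

    void publish(String topicName, String message) {
        // Deliver to every subscriber of the topic; with no subscribers,
        // the message is simply dropped.
        topics.getOrDefault(topicName, List.of()).forEach(c -> c.accept(message));
    }
}
```

The real library delegates these calls to the broker driver; the point here is only the fan-out semantics.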

AWS and GCP messaging services take very different approaches, and each works in its own way.

Pub/Sub

Google Cloud Pub/Sub takes the more conventional approach of packaging everything into a single service. The publisher sends a message containing the data to a topic, and the subscriber receives it through a subscription. This model is easy to understand and manage, which makes Google Cloud Pub/Sub very straightforward.
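A toy model of that flow, assuming nothing about the real service's internals (all names here are ours, not GCP's API):

```java
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

// Toy model of the Pub/Sub flow: publishers write to the topic, every
// subscription gets its own copy of the stream, and subscribers pull
// from a subscription rather than from the topic directly.
class TopicWithSubscriptions {

    private final Map<String, Queue<String>> subscriptions = new ConcurrentHashMap<>();

    void createSubscription(String name) {
        subscriptions.putIfAbsent(name, new ConcurrentLinkedQueue<>());
    }

    void publish(String message) {
        // Fan out: each subscription receives its own copy of the message.
        subscriptions.values().forEach(queue -> queue.add(message));
    }

    String pull(String subscription) {
        return subscriptions.get(subscription).poll();
    }
}
```

The subscription, not the topic, is what tracks which messages a given consumer has already seen.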

SNS and SQS

On AWS we needed to combine two services to get a fully featured messaging system: Simple Notification Service (SNS) and Simple Queue Service (SQS).

SNS deals with all the routing logic. It works by fanning out messages to a large number of subscribers, which can be webhooks, emails, SMS messages and, most importantly for us, SQS queues.

SQS's job is to deliver a message to a subscriber and store it until it is acknowledged. If the acknowledgement never arrives, the message becomes visible again and is redelivered, possibly to another subscriber.
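That redelivery behaviour can be sketched as a queue with a visibility timeout. This is our simplification of the SQS semantics, with time passed in explicitly to keep the sketch deterministic:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Sketch of SQS-style at-least-once delivery: a received message stays
// hidden ("in flight") until acknowledged; once its deadline passes it
// becomes receivable again.
class VisibilityQueue {

    private final Deque<String> visible = new ArrayDeque<>();
    private final Map<String, Long> inFlight = new HashMap<>(); // message -> ack deadline
    private final long visibilityTimeoutMs;

    VisibilityQueue(long visibilityTimeoutMs) {
        this.visibilityTimeoutMs = visibilityTimeoutMs;
    }

    void send(String message) {
        visible.addLast(message);
    }

    String receive(long nowMs) {
        // First, return any expired in-flight messages to the visible queue.
        inFlight.entrySet().removeIf(e -> {
            if (e.getValue() <= nowMs) {
                visible.addLast(e.getKey());
                return true;
            }
            return false;
        });
        String message = visible.pollFirst();
        if (message != null) {
            inFlight.put(message, nowMs + visibilityTimeoutMs);
        }
        return message;
    }

    void acknowledge(String message) {
        inFlight.remove(message);
    }
}
```

An unacknowledged message is delivered again after the timeout, which is why consumers must tolerate duplicates.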

SQS has two types of queues:

  • Standard queues offer high throughput and at-least-once delivery.
  • First-in-first-out queues (FIFO) are designed to guarantee messages are delivered exactly once, in the exact order they are sent.

Unfortunately, SNS is only compatible with standard queues.

Limitations

Both services have limitations, some of which we cannot work around. Where we can, the workarounds are sub-optimal, because we can only change how we use the service, not how it is built. These services are also not fully featured and lack some basic capabilities that other brokers offer out of the box.

Order in message delivery

GCP Pub/Sub doesn’t guarantee that messages arrive in the same order they were sent.

AWS doesn’t guarantee it either: SNS doesn’t support ordering and cannot be linked to a FIFO queue, the only SQS queue type that preserves order.

Message latency

These services are still young, so they lack the years of performance tuning that RabbitMQ and Kafka have behind them. Hopefully they will get faster as time goes on; for now, we just have to wait.

Subscriber routing keys

Routing keys allow subscribers to cherry-pick messages from a topic. Unfortunately, this feature is not available on GCP Pub/Sub, and implementing it ourselves resulted in significant overhead, since subscribers must receive and discard every non-matching message.
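One way to emulate routing keys client-side can be sketched as follows. This is our illustration of the workaround, not GCP functionality, and all names are ours:

```java
import java.util.Map;
import java.util.function.Consumer;

// Client-side emulation of routing keys on a broker that lacks them:
// the subscriber still receives every message and discards the ones
// whose routing-key attribute doesn't match, which is exactly where
// the extra network and processing overhead comes from.
final class RoutingFilter {

    static Consumer<Map<String, String>> matching(String routingKey,
                                                  Consumer<Map<String, String>> delegate) {
        return message -> {
            if (routingKey.equals(message.get("routingKey"))) {
                delegate.accept(message);
            }
        };
    }
}
```

With broker-side routing keys (as in RabbitMQ), the non-matching messages would never reach the subscriber at all.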

Slow subscription

SNS topics are connected to SQS queues through subscriptions, and nothing prevents subscribing the same queue to a topic more than once. The subscribe call is the only way to obtain the subscription's Amazon Resource Name (ARN) without iterating over all of the topic's subscriptions, so checking for an existing subscription makes the subscribing action slow when a topic has many subscriptions.

Federation

We use the RabbitMQ federation plugin to propagate messages across data centers and also for blue-green deployments; Kafka provides similar support through MirrorMaker. This feature is useful because it lets two or more clusters connect without any changes to your applications. In cloud architectures, cross-data-center communication may no longer be as relevant, but blue-green deployment is still a very useful technique for deploying without downtime. Unfortunately, neither AWS nor GCP provides out-of-the-box support for mirroring topics. It's not hard to implement, but it is still a limitation to consider.

Performance

The performance test used two machines: one publisher and one subscriber.

On the first machine we ran two threads: one creating messages with a timestamp and queuing them, and another fetching those messages and sending them. This way, message creation is never delayed by a slow broker, which prevents the coordinated omission problem.

The second machine ran several consumer threads to keep up with the publisher. When a message arrived, its latency was computed from the embedded timestamp.
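The publisher side can be sketched like this (class and field names are ours, not the real test harness):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of the benchmark's publisher side. One thread stamps each
// message with the time it was *supposed* to be sent and queues it;
// a second thread drains the queue and publishes. A slow broker then
// shows up as inflated measured latency instead of silently throttling
// the generator (the coordinated omission problem).
class LoadGenerator {

    static class TimedMessage {
        final long intendedSendNanos;
        final String payload;

        TimedMessage(long intendedSendNanos, String payload) {
            this.intendedSendNanos = intendedSendNanos;
            this.payload = payload;
        }
    }

    private final BlockingQueue<TimedMessage> pending = new LinkedBlockingQueue<>();

    // Generator thread: enqueue on a fixed schedule, never waiting on the broker.
    void generate(int count, long startNanos, long intervalNanos) {
        for (int i = 0; i < count; i++) {
            pending.add(new TimedMessage(startNanos + i * intervalNanos, "msg-" + i));
        }
    }

    // Sender thread: drain and publish. The subscriber later computes
    // latency as receiveTime - intendedSendNanos.
    TimedMessage next() {
        return pending.poll();
    }
}
```

Stamping the intended send time, rather than the actual one, is what keeps the measurement honest under backpressure.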

Table 1: Test Configuration

Results

GCP showed low latencies for the majority of messages, with only a few taking more than a second. AWS latencies were far worse and couldn't compete with GCP.

Unfortunately, neither service comes close to the performance that RabbitMQ or Kafka offer. This may be because they replicate data across many nodes, or simply because there is still a lot of room for improvement; hopefully they will evolve over time.

Pricing

Currently, the majority of our clients run on RabbitMQ, and in this comparison scenario the servers alone cost around $700 per month. On top of that, we pay for the manpower needed to keep it running at all times.

AWS

Pricing is per operation, to be exact $0.40 per 1 million API requests, and when SNS and SQS are used together we don't have to pay for deliveries. For our current traffic, we estimate $1,400 per month, which is double what it costs to use RabbitMQ.

Google Cloud Pub/Sub

Pricing is based on the amount of data transferred through the service. Publishing or delivering messages costs $40 per TiB, and for our current worst-case traffic the price would be about $1,800 per month.
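As a sanity check, the arithmetic behind these estimates can be written down directly. The class name is ours, and the traffic volumes in the usage note below are back-derived from the quoted monthly figures, not measured values:

```java
// Back-of-the-envelope check of the quoted cloud prices.
class PricingSketch {

    // AWS: $0.40 per 1 million API requests.
    static double awsMonthlyCost(double millionsOfRequests) {
        return 0.40 * millionsOfRequests;
    }

    // GCP Pub/Sub: $40 per TiB published or delivered.
    static double gcpMonthlyCost(double tebibytes) {
        return 40.0 * tebibytes;
    }
}
```

A $1,400 AWS bill at $0.40 per million requests implies roughly 3,500 million requests per month, and a $1,800 GCP bill at $40 per TiB implies roughly 45 TiB of monthly traffic.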

Conclusion

At Feedzai, we like to experiment. We learned a lot about how these cloud systems work and why we should use them. Unfortunately, in this case both services are too expensive. Maybe in the future we will be able to adopt them, but for now RabbitMQ remains the right choice for the majority of our projects.

On the other hand, we are now a step closer to the cloud. If pricing drops in the future, we are ready to move our applications to these messaging services.
