MessageBus — Loft’s standard for communicating asynchronously

Henrique Carvalho Silva
Published in Loft · 8 min read · Feb 28, 2020
Photo: Washington, D.C. Miss Helen Ringwald works with the pneumatic tubes through which messages are sent to branches in other parts of the city for delivery, by Esther Bubley

To revolutionize the real estate market and bring freedom living to the people, Loft is creating a multitude of unique digital products. The challenges involve not only finding, visiting, buying, and making the home of your dreams but also many others, including financial, legal, and architectural ones. Each of them is addressed by one of the many squads that form our ever-expanding tech product team, Vila Nova Tech (VNT).

This great diversity of products and squads calls for decentralized approaches to the challenges we face. To this end, VNT employs distributed system architectures such as microservices. Not only does this give each team autonomy, but it also enables our products to scale in the long run.

To take full advantage of microservices, we also aim to employ an event-driven architecture. In a few words, this architecture makes the components in a distributed system collaborate by exchanging messages asynchronously: each component publishes its data to known topics of interest, and other components subscribe to those topics to consume the data they care about.

The nature of such collaboration allows us to leverage a series of benefits that include resilience to failure, loose coupling, ease of scaling, and easier collection of data — this last one is especially important to fuel the studies conducted by our data science and business analysis teams. To achieve this, however, we need an asynchronous infrastructure for communicating events between the components of our distributed systems.

Requirements for an asynchronous communication infrastructure

This infrastructure must provide applications with the means to publish and subscribe to topics of interest. Along with these capabilities comes a series of requirements inherent to this sort of asynchronous communication.

The first is an exactly-once message delivery guarantee. Our infrastructure must ensure that an application subscribed to a given topic receives every message published to it: messages cannot be lost. It must also ensure that the application does not receive the same message again afterwards; otherwise, reprocessing it could alter the application's state in unexpected ways and cause undesirable side effects.
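In practice, an exactly-once guarantee is usually built from at-least-once delivery plus deduplication on the consumer side. A minimal sketch of that idea (all names here are illustrative, not MessageBus's actual API):

```python
# Sketch of "effectively exactly-once" processing: the infrastructure
# may deliver a message more than once, so the consumer remembers which
# message IDs it has already processed and skips duplicates.

class DedupConsumer:
    def __init__(self, processor):
        self.processor = processor
        self.seen = set()  # in production this set would live in durable storage

    def handle(self, message_id, payload):
        """Process `payload` unless `message_id` was already processed."""
        if message_id in self.seen:
            return False  # duplicate delivery: skip to avoid side effects
        self.processor(payload)
        self.seen.add(message_id)
        return True
```

The `seen` set kept in memory is the simplification here; surviving restarts and multiple consumer instances is exactly the hard part that makes this guarantee expensive to provide.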

Another requirement is the ability to deal with a common issue in distributed systems: backpressure. Since each component in a distributed system has its own processing capacity, different parts of the system offer different resistance to the flow of data. To tackle this, our infrastructure may adopt strategies such as throttling.
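Throttling is often implemented with a token bucket: messages are admitted only as fast as a configured rate allows, so a slow consumer is not flooded. A rough sketch under that assumption (the class and its parameters are illustrative, not part of MessageBus):

```python
import time

class Throttle:
    """Token-bucket throttle: admits at most `rate` messages per second."""

    def __init__(self, rate: float, clock=time.monotonic):
        self.rate = rate
        self.clock = clock        # injectable for testing
        self.tokens = rate        # start with a full bucket
        self.last = clock()

    def allow(self) -> bool:
        """Return True if one message may pass right now."""
        now = self.clock()
        # Refill tokens proportionally to the time elapsed, capped at `rate`.
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over the rate limit: caller should wait or buffer
```

A consumer would call `allow()` before pulling the next message, pausing (or letting messages accumulate in the stream) whenever it returns False.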

Applications will also need to cope with the communication failures typical of distributed systems when transmitting and receiving messages. To this end, our infrastructure should implement mechanisms such as circuit breaking and retry policies with configurable exponential backoff. These are essential to improve fault tolerance in the applications that use it.
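A retry policy with exponential backoff can be sketched as follows; the function name, defaults, and jitter strategy are illustrative choices, not MessageBus's real configuration:

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, cap=5.0,
                       sleep=time.sleep):
    """Call `operation`, retrying on failure with exponentially growing,
    jittered delays (capped at `cap` seconds)."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retries exhausted: surface the failure
            # Double the window each attempt, then pick a random point in it
            # ("full jitter") so many clients don't retry in lockstep.
            delay = min(cap, base_delay * (2 ** attempt))
            sleep(random.uniform(0, delay))
```

A circuit breaker would sit one level above this: after enough consecutive failures it stops calling `operation` altogether for a cooldown period instead of retrying.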

Observability is another common concern for this infrastructure. With the inherent complexity of testing and debugging distributed applications, it is crucial that exchanged messages can be traced in cases of both failure and success. On failure, operators must be able to identify the problematic data and schedule messages for reprocessing. It is also important to keep track of message processing latencies.
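For the latency side of this, a consumer client can wrap each processor function so that every message's processing time is measured and handed to a metrics sink. A hypothetical sketch (none of these names come from MessageBus itself):

```python
import time
from functools import wraps

def timed(record, clock=time.monotonic):
    """Return a decorator that reports each call's duration to `record`."""
    def decorator(processor):
        @wraps(processor)
        def wrapper(message):
            start = clock()
            try:
                return processor(message)
            finally:
                # Runs on success and on failure alike, so failed
                # messages still contribute to latency metrics.
                record(clock() - start)
        return wrapper
    return decorator
```

In a real setup `record` would feed a metrics aggregator rather than a list, and a message/trace ID would be attached so individual messages can be followed across services.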

Quality assurance is also a desirable feature for this infrastructure. Applications should not waste processing time on messages that do not meet their minimum criteria for data quality. This may include not only schema validation but also programmable enforcement of business rules on the exchanged data.
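At its simplest, such validation checks each message against a declared schema before the processor ever sees it. A toy sketch of the idea (the schema format and field names are made up for illustration):

```python
def validate(message: dict, schema: dict) -> list:
    """Return a list of violations; an empty list means the message passes.
    `schema` maps each required field name to the Python type it must have."""
    errors = []
    for field, expected_type in schema.items():
        if field not in message:
            errors.append("missing field: %s" % field)
        elif not isinstance(message[field], expected_type):
            errors.append("wrong type for field: %s" % field)
    return errors
```

Business-rule checks (e.g. "price must be positive") would be registered the same way: as programmable predicates run before a message is handed to the application, with rejected messages routed aside for inspection.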

With larger amounts of data may come the need to horizontally scale the applications that process messages subscribed through our infrastructure. This brings its own concurrency challenges as message consumers grow to span several threads, processes, and machines.

Finally, we need to make it possible for our data scientists and business analysts to experiment with all the data exchanged between applications. This means it is indispensable to have a way to offload all the messages flowing through our communication infrastructure to an analytical repository. In other words, all of this data should be collected and made available at Loft’s internal Data Lake.

Choosing an infrastructure solution for asynchronous communication

Browsing through the different asynchronous communication infrastructures available on the market (e.g. Amazon Kinesis, Apache Kafka, RabbitMQ, Redis Streams), it is very hard to find one that meets all the requirements listed above. This is expected: many of these requirements are desirable when communicating asynchronously but do not necessarily belong to the communication infrastructure itself. Usually, applications are left to build the mechanisms needed to meet them on top of such infrastructures.

Leaving each team at Loft responsible for building such mechanisms into their applications would be counterproductive and error-prone, to say the least. Each team would end up building a different solution to the same asynchronous communication problems over and over, instead of focusing its energy on the real problem: reinventing the real estate market.

Moreover, which of the aforementioned infrastructure solutions would be best? Should the entirety of Loft adopt Kafka or RabbitMQ? Should we use streams or queues? Should each team adopt its own solution according to its needs? What would be the impact of each of these approaches for collaboration between applications of different teams?

Furthermore, each of these infrastructure solutions has its own particular interface. Should teams couple their products to a particular infrastructure? What if it needs to be replaced because it no longer meets the requirements of our products? Should we then rewrite and patch all of our applications? What would that cost Loft?

MessageBus

MessageBus is the answer of Loft's DataX team to all of these questions. It is our own infrastructure for asynchronous communication — more of a standard than an infrastructure per se, since applications use it transparently, without knowing what runs underneath. This gives us several advantages, the most prominent being that our applications are coupled not to a specific communication infrastructure but to the standardized interface that MessageBus provides.

As a consequence, we can run whatever suits our needs under it. When choosing between Kafka, Kinesis, Redis Streams, and RabbitMQ, we first ditched RabbitMQ. Unlike the others, RabbitMQ is a message queue service, not a data stream. There are several differences between basing asynchronous communication on queues and on data streams, but in short, the nature of queues makes our infrastructure more complex and makes exactly-once delivery guarantees harder to achieve.

Data streams are simple, and their storage characteristics allow the publish/subscribe mechanism we want, as well as other capabilities such as replaying messages when needed. Still, this choice leaves us with a few alternatives. In terms of scalability and cost, Kafka seems to be the best one in the long run: we can run a cluster and scale it horizontally as our traffic grows. But is Loft there yet?

Currently, Loft operates in parts of São Paulo and Rio de Janeiro and is still preparing its expansion to other cities and countries. The daily volume of sales and acquisitions is not yet close to the huge numbers we expect in a few years. Does this small amount of data justify the cost and effort of running a Kafka cluster? Amazon Kinesis offers a managed data stream with shards that support up to 1 MB/s of input and 2 MB/s of output at a fairly reasonable price. Today, the throughput of a single shard is enough for our infrastructure needs in the near future.

Despite Kafka being what we aim for in the long term, Kinesis is what Loft needs now. So we have started our infrastructure by running a single Kinesis shard under MessageBus, and we will probably scale it to multiple shards before replacing it with Kafka once the volume of data exchanged by our applications demands it. No application will be impacted by this infrastructure change, because they all use MessageBus, not Kinesis. How is this possible? Our interface consists of two simple clients: a producer and a consumer. Through them, applications publish and subscribe to data in our infrastructure.

MessageBus client interfaces

Regardless of the solution employed under the hood, an application only needs to know the data it wants to publish and the topic to publish it to. In our case, the data will always be an array of bytes. For consumption, on the other hand, an application need only subscribe to a topic of interest and provide a processor function that will process each binary message that arrives.
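That two-operation contract can be illustrated with a toy in-process stand-in (purely illustrative — the real clients talk to Kinesis under the hood and are asynchronous, but applications see only this shape):

```python
from collections import defaultdict
from typing import Callable

class InMemoryBus:
    """Toy stand-in for the MessageBus contract: publish bytes to a topic,
    subscribe a processor to a topic. Everything here runs synchronously
    in one process, which the real infrastructure of course does not."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def publish(self, topic: str, data: bytes) -> None:
        """Producer side: send an array of bytes to a topic."""
        for processor in self._subscribers[topic]:
            processor(data)

    def subscribe(self, topic: str, processor: Callable[[bytes], None]) -> None:
        """Consumer side: register a processor for a topic of interest."""
        self._subscribers[topic].append(processor)
```

Because applications program against this narrow interface and nothing else, swapping the implementation underneath — a single Kinesis shard today, a Kafka cluster later — leaves application code untouched.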

This allowed us to start humbly with one Kinesis shard and leave complex message routing mechanisms for the future, when we scale to multiple shards. We can also begin with a single-threaded consumer client for the single language we support and eventually scale it horizontally as we see fit, with all the complex mutual exclusion logic that will demand. And we can ship initial versions of features like retry policies and metrics aggregation for message observability in a centralized fashion, lifting this weight from the shoulders of our application developers.

At first, we may not provide all the guarantees and requirements we have listed at the beginning of this article. To do so would require a costly and complex distributed system that would take a long time to be delivered. Nevertheless, we can still meet the current needs of our applications by providing these two simple clients and by making them implement all the functionalities we believe to be indispensable now.

These clients’ interfaces, however, will remain unchanged as we scale our infrastructure and build more advanced features such as: message quality assurance tools, routing and sharding mechanisms, horizontal scaling with concurrency safety for multiple consumers, and exactly-once delivery guarantees for multiple consumer groups. Our detailed plan for these features, including our high-level architecture and details on how to implement them, will be disclosed in future articles.
