Originally published at rockthecode.io on August 14, 2017 by Eli Hamburger, a Director in the Core System Infrastructure team within Aladdin Product Group. This posting also contains a December 2018 update from Eli at the end describing some of the current work going on.
This article will give you a glimpse into how we’ve implemented and scaled the architecture that powers Aladdin — the BlackRock Messaging System. Please let us know if you’re interested in hearing more!
Since its early days, Aladdin was comprised of services. Aladdin’s architecture design is driven by a few simple rules:
- “One BlackRock” — There should be one way to access a piece of data, or for example, book a trade. Don’t reinvent the wheel.
- Front-end applications do not have access to databases or filesystems. They must utilize backend services to accomplish their needs.
- Services can run on any host, be moved whenever needed, and can be scaled up by simply starting additional instances (on any host). The corollary is that applications using the services should never need to know the host a service is running on, or be responsible for things like load-balancing.
To meet these needs, BlackRock developed the BlackRock Messaging System (“BMS”) many years ago. Since then, BMS has evolved and grown. Today, it connects thousands of services across multiple data centers and regions. It provides several messaging paradigms used by applications and services to meet their needs. It is resilient to outage, recovering automatically within two minutes, guaranteed. It provides security — every message is tagged with unspoofable user and client identifiers. It’s even hot-deployable, reducing the risk of changes to BMS itself (this was one of our most interesting engineering challenges).
BMS supports several methods for applications to communication with each other. The messaging paradigms that are most commonly used are Request/Response, Conversation, and Broadcasts (Publish/Subscribe).
This paradigm is simple and the most common. It’s used for stateless requests to services. An application can send a request to a service and the service will reply with at least one response. BMS will load balance the request across all instances of the service, delivering it to a single instance. BMS will also optionally retry a request if the service crashed or disconnects during processing. Finally, BMS will queue requests when a service can’t keep up with a burst of requests, or is temporarily down. Request/Response The queuing provides resilience for Aladdin — even a full restart of a service does not result in errors to users.
Over the years, developers have relied on these messages for send-and-forget type messages as well. They depend on the fact that BMS rarely goes down and that it’s therefore safe enough in many cases.
Similar to Request/Response, an application can initiate a back-and-forth conversation with a service. This is useful for stateful interactions. A conversation starts just like a Request/Response: an application sends a request, which BMS load balances across all running instances of the target service. Unlike Request/Response, once the service accepts the request, a bidirectional ‘channel’ is established. Both the requesting application and service instance can send messages to each other at will. Additionally, each side is notified by BMS in the event of an issue with that channel. If the instance of downstream service crashes or disconnects for example, the application is notified and can attempt to reconnect to another healthy instance if it chooses to. Conversely, if an application disconnects, the service is notified, allowing it to clean up any state.
Conversations can be used for “get and subscribe” type requests. For example, an application may want to fetch the current price of BlackRock’s stock and subscribe for all changes, thereby ensure the price the end-user sees is always up-to-date. Should the client disconnect, BMS will inform then service that it should stop sending updates. My team uses conversations as the communication backbone for our “compute farms”. They allow our schedulers to efficiently communicate will the executors running on our compute nodes, with BMS dealing with complexities such as node loss and network partitions.
Broadcasts — Publish/Subscribe
Broadcasts are a mechanism within BMS to, well, broadcast data. BMS transmits broadcast messages to any and every application or service that has registered interest for messages of a specified type or topic.
The broadcasts are quite lightweight.
Our BMS networks delivers on the order of 7–10 billion broadcast (10–20TB) a day to applications. They are also WAN optimized — BMS is careful to only deliver a message over each of our WAN links once. When a broadcast message reaches its destination office, data center, or client site, a local BMS repeater will fan the message out to all local targets. This significantly reduces our bandwidth needs, keeping costs down.
The BMS protocol is fully asynchronous. An application can (and should) send many requests over a single connection to BMS. Each request message sent is tagged by the application with an id number, which is unique only to the application and connection. Corresponding responses and errors are tagged by BMS with the same id number. This allows a program to send many requests and receive responses in any order — avoiding head of line blocking. Our implementation of BMS is also asynchronous, allowing it to scale without the bottlenecks, complexity, and overhead of thousands of threads.
Architecture Basics & The “BMS Network”
BMS is implemented as a distributed system. Every Unix host that runs Aladdin services runs a BmsServer instance. BmsServer is a multi-process C++ application. Applications, both clients and servers, connect to a BmsServer instance over TCP or Unix Sockets (we don’t use multicast, SCTP, or other protocols…yet). Application Servers connect to the BmsServer on the same host they are running on. Desktop applications connect to a pool of BmsServers designated for desktop use. All BmsServers then connect to the local datacenter and from there to global hubs. Discovery for developers is simple: connect to localhost. If the app is running on a desktop, connect to a server from the desktop pool instead. They don’t need to know where each service is, they just need to connect to BMS.
BmsServers connect via hubs to each other and form a BMS Network. From a resiliency standpoint, we have a feature called “2-Minute Non-Stop”. This feature provides a guarantee that the BMS network will self-heal after host or network outages in less than two minutes. In practice, recovery often takes much less time.
The BMS network allows us to keep application and firewalls for that matter, quite simple. We don’t worry about ports. Applications just connect to BMS and are instantly ready to send and receive messages.
We opted to keep the rest of BMS light in terms of configuration. There are no routes to configure or hard code. When a service connects to a BmsServer, it will tell BMS the message types and topics it’s interested in receiving. The BmsServer then advertise to other BmsServers the list of services connected. This will propagate until all connected BmsServers are updated. Each BmsServer remembers the connections (to services and to other BmsServers) that it needs to send messages to for a given topic.
Conversely, when sending a message, the application includes a topic in the header of the message. If the corresponding service is connected anywhere in the BMS network, the message will be sent through. If not, the message will be buffered by the local BmsServer until the service connects. As a result, applications have no need to know where services are physically running.
Broadcasts work the same way. A service can send a broadcast on a topic, which one or more applications (or other services) have registered on. For example, applications interested in market updates would register for broadcasts on the “MARKET_UPDATE” topic. The service responsible for sending updates will send each update once to BMS with “MARKET_UPDATE” as the topic. BMS knows where all the receivers are running and delivers each message to them.
I mentioned above that BMS Repeaters enable up to optimize our WAN usage. Instead of sending broadcasts to each user over the WAN, BMS will send the broadcast once to the repeater at the other end of the WAN link, which will then forward the message to interested users over the local area network. The repeaters, like the hubs, are just regular BmsServer instances. They aren’t special in any way — they are just running in a different location. The appropriate broadcasts flow to a repeater as soon as applications connect to it and register to receive them.
I’ll tackle the simplest part of BMS security first: Encryption. BMS supports TLS and will require it in many cases.
Next up: identification and authentication. Our goal with security is to ensure every single message is tagged with the correct user identity and client code. To accomplish this, we authenticate every connection when it’s established. BMS supports several authentication mechanisms, from simple username/password authentication to “process inspection”, whereby BMS discovers the user from the OS itself. The latter enabled us to hot-deploy major authentication enhancements without changing every application — allowing us to nimbly enhance BMS at scale.
Each application’s user and client code is included in the header of every message BMS sends. This information cannot be spoofed (attempting to do so triggers BMS to kill your connection). This provides downstream applications a bulletproof source of truth to use for authorization checks.
BMS itself provides authorization in places that services cannot. While services can authorize requests they receive (using the identity information in message headers), BMS controls who is allowed to register to receive requests. By default, if a normal user attempts to register to receive requests, BMS will terminate the connection. Any user can register to receive broadcasts. However, BMS does permit a service’s owners to restrict access to specified message topics. Anyone can send requests and broadcasts on any topic. This has a nice performance property — we only need to perform a certain authorization checks when applications first connect to BMS — not on each request.
Special control messages, which allow us to operate BMS, are authorized as well and are limited to select super users.
Two principles have allowed all our applications and services to communicate with each other. First, all services are connected to the same BMS network. Applications just connect to BMS and can immediately begin communicating. However, this was not enough. We needed a way to communicate effectively across programming languages. To BMS, the payload of a message is just bytes. BMS can therefore be used with any encoding format you choose. However, if each language and team in BlackRock used their own serialization format, we would have created boundaries that would have limited the use of each service.
We therefore created the “BRMap” many years ago. A BRMap is a binary and ASCII serialization format that can represent every data type BlackRock needs. It supports chars, shorts, longs, doubles, strings, byte arrays, maps, arrays, tabular data sets, dates, date-times, and timestamps. Like JSON, it’s self-describing — keys and data types are encoded with the data. This makes it simple to display and debug. Like BMS, BRMap support is available in many languages and is one of the first libraries we develop when adopting a new language in Aladdin’s technology stack.
BMS continues to enable us to scale Aladdin incredibly effectively in a variety of ways. It enables us to create services that solve business needs quickly. It allows us to solve hard problems — like authentication — centrally and only once. This saves time and money, both when rolling out new functionality and for ongoing maintenance. BMS as our primary communications platform enables our support teams to scale and support our rapidly growing developer community. Finally, it has enabled us to scale our business by supporting the ever-increasing size and scope of Aladdin.
**Update for December 2018**
Since this article was posted in August 2017, we’ve continued to invest in BMS and plan for the future of Aladdin. From a scale standpoint, the number of services continues to grow and BMS is routing on the order of 50 billion messages per day. We’ve focused on adding routing capacity, and providing a richer set of service level metrics to our service owners and operations teams.
We’re also investing in how messaging in Aladdin will work in a hybrid, multi-cloud deployment model, and how BMS will evolve to meet those needs. The open-source Kubernetes (K8s) project is of particular interest for orchestrating applications in this space. One interesting question that we’ll need to answer is: how integrated should BMS be with K8s? Controlling the network and flow of data outside of our application K8s clusters provides a nice separation of duties. It will allow for continued cross region communication and will allow us to deploy services across K8s clusters without worrying about communication. This could, for example, allow us to shift some instances or a service to a beta K8s cluster without any impact or compatibility concerns. On the other hand, tighter integration may offer benefits and the simplicity of a single set of controls for orchestration and communication.
As a first step, we created a cloud native K8s sidecar to broker communication and authentication between applications running inside K8s pods and BMS. As our plans for Aladdin crystalize, look out for more changes to meet our new needs.