Part III: memcached as a peer to peer tracker in OpenChange Asynchronous Notifications
In previous articles we have provided a general overview of OpenChange asynchronous notifications framework and detailed the reasons why memcached was chosen for inter-process communications between internal OpenChange RPC services.
We concluded the last article on a note indicating that a messaging queue service was used to handle external communications between OpenChange and remote services such as Dovecot. We however did not get into further details.
This article details this part of the architecture and how we designed it to make it scalable and resilient to a high number of concurrent users.
Single point of failure
A single point of failure (SPOF) is a part of a system that, if it fails, will stop the entire system from working (Designing Large-scale LANs — Page 31, K. Dooley, O’Reilly, 2002). While software architects try to avoid or mitigate SPOF components in their architecture, what is really provided in the end is at best a globally accepted consensus and at worst an illusion.
The semantics of computing lies on binary data. It is either 0 or 1, true or false, there is no maybe (yet). There is therefore always a single point of failure when it comes to communication and process in general. Hosts themselves are single points of failure (SPOF). We — humans — are sorting SPOF based on criteria we define, generally aiming at qualifying the severity of a failure. This helps us take decision and move forward. If we suffer an outage we’ll preferably opt for an architecture where existing communications keep working rather than a solution where everything fail at once. This is risk mitigation, nothing more.
In fact, single point of failures are everywhere. For every non atomic operation that has a dependency, this kind of problem arises. The critical path method in project management is for example used to reduce its impact.
When it comes to message queue, we have different choices of architecture available — each of them with their own SPOF — which fall into two categories: a hub and centralized approach or a peer to peer architecture.
If you are looking for a great article with an overview on broker vs brokerless models, the article from Martin Sustrik available on ZeroMQ blog is really worth reading:
Architecture of most messaging systems is distinctive by the messaging server ("broker") in the middle. You can think…zeromq.org
With a hub, the approach is centralized. Publishers and subscribers only need to know about the hub in order to be able to communicate together. The hub is responsible for routing the messages, eventually for proxy purposes. This is very convenient, but also very constraining as it becomes de facto a single and central point of failure, a bottleneck under some circumstances.
This could obviously be addressed by load-balancing the router but you see how wrong the approach is? We have not yet really started that we are already bringing out the big guns.
Network congestion is the main weakness
In this scenario, we need to choose the solution depending on several requirements: How many messages do we need to handle per second? How many client connections do we simultaneously need to manage? Do we need authentication? What is the average size of the messages to route?
Above all, what is the maximum capacity we can safely rely on before we observe a significant degradation of performances? What are the hardware requirements?
Peer to Peer approach
In a peer to peer approach, communications are going directly from the publisher to the subscriber without any intermediary. If you are in a one o one scenario, it is pretty straight forward because you just have to know the address of the subscriber. If you however have to handle 100k simultaneous clients, this is a complete different story and this can be addressed with a service, similar to a BitTorrent tracker.
The “tracker” server keeps track of where file copies reside on peer machines, which ones are available at time of the client request, and helps coordinate efficient transmission and reassembly of the copied file. (source Wikipedia).
In our case, the external service such as dovecot only needs to know if the client is available and where to send the notification for the online user. Once a match for a given username is found, the address of the remote host is returned and the peer to peer communication can be initiated:
- the communication starts with an OpenChange node associating an ip address and port to a common name in the directory service with a key/value similar to username = tcp://127.0.0.1:4242.
- When the dovecot plugin needs to dispatch a newmail notification, it queries the service and lookup the address matching a given common name. The address/port of the OpenChange node where the user is connected is returned.
- The dovecot plugin, then contact directly the OpenChange node and push its notification content.
The weakness is the directory service
If the directory service stops or crashes, then we also have an outage with similar consequences than what would have happened with a centralized hub approach. The only difference would be that communications already initiated or clients that would have cached tracker data would continue to operate properly.
The directory service is therefore nothing more than a key/value store with much more get than add operations. If the service had to deal with a very high number of concurrent users, and even if performances were to decrease, the direct communication between external services such as dovecot and OpenChange would mitigate the risk and absorb the load for the duration of the peak.
The high availability is obviously a question to ask, but the approach has fundamentally changed. We do not worry upfront about load-balancing and congestion issue as with the hub approach but instead focuses on the kind of key/value store that would be good at handling high number of simultaneous get operations.
One one hand we had a centralized hub that would potentially have to manage either n number of message queues or n * 2 number of connections for every client (1 for publisher / 1 for subscriber) for the duration of the transaction; on other hand we have a key/store value primarily used for get operations and direct peer to peer connections for the data transfer of notifications.
We chose the peer to peer approach.
memcached as a directory service
The technology we chose to implement this directory service, was once again memcached.
In our deployment scenarios, external services such as dovecot and OpenChange are either running on the same host or deployed within a trusted environment. Authentication is therefore not a strong requirement.
The lifetime of the information is also limited and does not require persistent storage. It lasts for the lifetime of an Outlook session and in worst case scenario, if the key associating the common name with the hosts was flushed, a restart of the Outlook client would fix the problem.
The average amount of memory required to manage this service with memcached for 100k concurrent clients is also very low. If we consider the maximum key length to be 250 characters and the key value to be a multiple of 27 characters (tcp://xxx.xxx.xxx.xxx:yyyyy) — when we have multiple instances of Outlook connected for a given user — then we would need between 27Mb and 40Mb respectively for 1 host record and 5 host records associated to a given name.
Finally, this service was already deployed for other parts of OpenChange (indexing, IPC communications between internal OpenChange endpoint service, OpenChange mapistore backends) and was therefore not impacting the complexity of the architecture.
We now start to get a pretty good understanding of the architecture, both between OpenChange internal services and between OpenChange and external services. We only have few remaining technical aspects to cover:
- Which network library was chosen for peer to peer communications? (spoiler here)
- Edge triggered notifications vs level triggered ones
- How to manage and divide traffic to avoid network congestion issues at the OpenChange node level?
These are the questions to be answered in the next and last part of this series of articles.