A Common Part in Today’s Cloud Software

Jeff Pei
6 min read · Nov 24, 2016


A similar piece of code appears in many infrastructure software systems. Photo: Pexels

What do Kubernetes, OpenStack, SaltStack, Logstash/Elastic, and Zabbix have in common? Why do we still need a polling engine when we already have push and a message bus?

TL;DR: This post shows the master/agent model’s prevalence in today’s cloud by citing 17 notable infrastructure software projects across 5 categories that adopt this model. While they were built for different purposes, a shared and impactful aspect is how fast and scalably the master interacts with and orchestrates the agents. We discuss poll and push as the interaction options, and argue that a generic polling/aggregation engine remains a necessary part of such systems and may greatly reduce the duplicated work across them.

Most infrastructure software is distributed, and manageability is key to its success. Today such software relies heavily on a master/agent (daemon) model to ensure “full control” over many operations, such as provisioning, software deployment, cluster management, monitoring, network configuration/SDN, asset discovery, security policy management, and configuration management. Commonly there is a master service that interacts with the agents to execute commands and run health checks, and an agent (a daemon process) deployed on each end server to execute the actual tasks at scale and speed.

This master/agent model (see the 17 cloud software projects illustrated below) is critical and prevalent in almost all aspects of today’s cloud systems, except for a few simpler tasks such as discovery/config management where the scalability and speed requirements are low (an agent-less solution such as parallel SSH may be sufficient). To illustrate, we summarized the 17 cloud software projects in 5 categories, in the format of agent/master:

These are just a few examples. In a “distributed system” with thousands or millions of endpoints doing their tasks in parallel, it is important to “always have control” over them. Managers often ask: “If a subset or all of these 30,000 endpoints behave abnormally, can we stop them within 20 seconds and fix them within 1 minute?” More often than not, we resort to the master/agent model for such problems. Although each agent may be built for a different purpose, a common and important aspect of this paradigm is to interact with and orchestrate the agents via a generic, flexible, and scalable polling/pushing and aggregation engine on the master side. This engine is key to providing manageability and visibility, by efficiently managing and monitoring the agents and then aggregating the results:

  • To interact with the agents by executing commands, receiving events, and running health checks on them in a scalable and efficient way (controllable concurrency and workflow/orchestration, scaling to 100,000s or more targets), like an API gateway, but not limited to HTTP APIs and working in both directions.
  • To conduct generic and flexible processing of agent responses, with aggregation and persistence to data storage. (A minimal sketch of such an engine follows this list.)
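
To make this concrete, here is a minimal, hypothetical sketch of such an engine in plain JDK Java, with no Akka and no real transport; every class and method name here is my own placeholder. A production engine would add retries, timeouts, backpressure, and streaming aggregation on top of this skeleton.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: poll N agents with a bounded thread pool
// (controllable concurrency) and aggregate their responses.
public class MiniPollingEngine {

    private final ExecutorService pool;

    public MiniPollingEngine(int concurrency) {
        // Never hit more than `concurrency` agents at once,
        // no matter how many targets are in the fleet.
        this.pool = Executors.newFixedThreadPool(concurrency);
    }

    /** Polls every agent and returns an aggregated host -> response map. */
    public Map<String, String> pollAll(List<String> hosts, long timeoutSec)
            throws InterruptedException {
        Map<String, String> results = new ConcurrentHashMap<>();
        CountDownLatch done = new CountDownLatch(hosts.size());
        for (String host : hosts) {
            pool.submit(() -> {
                try {
                    results.put(host, checkHealth(host));
                } catch (Exception e) {
                    results.put(host, "ERROR: " + e.getMessage());
                } finally {
                    done.countDown();
                }
            });
        }
        done.await(timeoutSec, TimeUnit.SECONDS); // bounded wait for stragglers
        return results;
    }

    private String checkHealth(String host) {
        // Placeholder for the real agent call (HTTP/SSH/TCP/UDP/Ping).
        return "OK";
    }
}
```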

Now let’s talk about poll and push, the eternal topic when designing a distributed system. We believe we often need both (for more time-sensitive requirements) or just poll (for simplicity). Yes, we still need to poll even when we push. While push-based communication provides more timely results, its complexity and error-prone nature still make the simpler polling-based communication a necessary mechanism for many tasks.

When the master cannot talk to the agents well, everything breaks down.

The robustness and scalability of the master’s interaction with the agents is critical for the whole system’s reliability. The two examples below show the sensitivity and challenges of relying on a push/heartbeat mechanism alone. Polling-based interaction is more reliable because we have full control over when and how (retries, concurrency level) the master polls, or does not poll. By contrast, with 10,000s of endpoints spontaneously reporting to a message bus or the master every k seconds or upon events, we lose that control and cannot easily retry, reproduce, or even stop the flow.

  • OpenStack: A message bus (RabbitMQ) is critical for OpenStack hypervisor nodes to talk to the master in a decoupled way. Many times, however, a slowdown of this message bus has directly impacted the whole control plane. For example, we observed a top-of-rack switch soft failure on the RabbitMQ cluster that caused RabbitMQ messages to queue up, leaving us blind to all hypervisors’ events.
  • Kubernetes: We (and many others) have observed occasional failures of kubelet agent heartbeats to the master, which lead the master to consider the agent/node unhealthy and mark it down, substantially reducing the cluster’s capacity when this happens on many nodes.

When such events happen, we have to resort to polling to catch the missing data or to tell the agents to stop their actions. To sum up, even in systems with heartbeats/push/message buses, polling-based communication is more often than not a necessary complement to the push mechanism to catch misses and failures. The sketch below illustrates this reconciliation pattern.
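
In this hypothetical sketch (all class and method names are my own placeholders), the push path records heartbeats, while a periodic poll double-checks silent agents before marking them down, which would avoid the kind of false mark-downs described in the Kubernetes example above.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: push (heartbeats) is the fast path; a periodic
// reconciliation poll catches whatever the push path missed.
public class HeartbeatReconciler {

    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    /** Push path: called whenever an agent heartbeat arrives. */
    public void onHeartbeat(String host) {
        lastHeartbeat.put(host, System.currentTimeMillis());
    }

    /** Poll path: every `periodSec` seconds, actively re-check silent agents. */
    public void start(Set<String> allHosts, long staleMillis, long periodSec) {
        scheduler.scheduleAtFixedRate(() -> {
            long now = System.currentTimeMillis();
            for (String host : allHosts) {
                long lastSeen = lastHeartbeat.getOrDefault(host, 0L);
                if (now - lastSeen > staleMillis) {
                    // Do not trust a missing heartbeat alone (the bus may
                    // simply be backed up): confirm with a direct poll
                    // before marking the node down.
                    if (!pollAgentDirectly(host)) {
                        markDown(host);
                    }
                }
            }
        }, periodSec, periodSec, TimeUnit.SECONDS);
    }

    private boolean pollAgentDirectly(String host) {
        return true; // placeholder for a real health-check call
    }

    private void markDown(String host) {
        System.out.println(host + " marked down");
    }
}
```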

Aiming at a generic, scalable, and flexible polling and aggregation engine, we open-sourced Parallec.io at eBay Cloud around a year ago. It is simply a generic parallel async HTTP(S)/SSH/TCP/UDP/Ping client Java library based on Akka. It can serve as the core of the master to effectively interact with 100,000s of agents. Compared with using a plain Apache HTTP client, it mostly helps with the concurrency control and response aggregation needed to manage and monitor the agents. Although the push part remains up to you to design for your specific application, Parallec helps you handle the polling part with ease and at scale.
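
Here is roughly what that looks like; this snippet is adapted from the Parallec.io getting-started example as I remember it (the target hosts and path are just samples), so treat it as an approximation of the published API rather than authoritative documentation.

```java
import io.parallec.core.ParallecResponseHandler;
import io.parallec.core.ParallelClient;
import io.parallec.core.ResponseOnSingleTask;

import java.util.Map;

public class ParallecHttpExample {
    public static void main(String[] args) {
        ParallelClient pc = new ParallelClient();
        // Fire the same HTTP GET at many targets with bounded concurrency.
        pc.prepareHttpGet("/validateInternals.html")
          .setConcurrency(1000)
          .setTargetHostsFromString(
              "www.parallec.io www.jeffpei.com www.restcommander.com")
          .execute(new ParallecResponseHandler() {
              public void onCompleted(ResponseOnSingleTask res,
                      Map<String, Object> responseContext) {
                  // Each target's response arrives here as it completes;
                  // aggregate or persist it as needed.
                  System.out.println(res.toString());
              }
          });
        pc.releaseExternalResources(); // shut down the client when fully done
    }
}
```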

Parallec.io Workflow

Recently, thanks to many of my colleagues, we enabled several new production use cases of Parallec.io at eBay. A brief summary of them follows:

  1. Application Deployment / PaaS: Parallec has been integrated into eBay’s main production application deployment system (PaaS). Parallec orchestrates 10+ API tasks, with each task targeting 10s to 1,000s of servers across 1,000+ application pools in production.
  2. Data Extraction / ETL: Parallec has been used by eBay Israel’s web intelligence team to execute 10k-100k parallel API calls against a single third-party server, with dramatically improved performance and reduced resource usage.
  3. Network Troubleshooting via Probing: In eBay’s network/cloud team, Parallec is instrumental in keeping the false-alert rate extremely low while accurately detecting switch soft failures. Parallec.io serves as the core polling engine in the master component to check agent health and mark down agents, effectively and promptly eliminating noise. (See the probing sketch after this list.)
  4. Agent Management / Master: In eBay’s site operations/tools team, Parallec serves as the core engine to manage and monitor a Puppet-agent/Salt-minion/kubelet-like agent on 100,000+ production servers to ensure scalable operations.
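
As a flavor of use case 3, the sketch below probes a fleet in parallel. It assumes Parallec’s ping builder mirrors its HTTP builder (preparePing() with the same setters); the host names are placeholders, and a real prober would feed persistent failures into mark-down logic rather than just printing them.

```java
import io.parallec.core.ParallecResponseHandler;
import io.parallec.core.ParallelClient;
import io.parallec.core.ResponseOnSingleTask;

import java.util.Map;

public class ProbeFleetSketch {
    public static void main(String[] args) {
        ParallelClient pc = new ParallelClient();
        // Ping many hosts at once; bounded concurrency keeps the prober
        // itself from becoming a source of network noise.
        pc.preparePing()
          .setConcurrency(2000)
          .setTargetHostsFromString("host-001.example.com host-002.example.com")
          .execute(new ParallecResponseHandler() {
              public void onCompleted(ResponseOnSingleTask res,
                      Map<String, Object> responseContext) {
                  // res.toString() includes error details on failure; a real
                  // prober would aggregate failures before marking hosts down.
                  System.out.println(res.toString());
              }
          });
        pc.releaseExternalResources();
    }
}
```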

At the same time, an external Fortune 500 company with a large private cloud is currently using Parallec.io for a large-scale configuration discovery project, and we will give an update on its progress. Please stay tuned.

The above use cases exemplify how widely such an engine can be used and how much work we can save by not reinventing the wheel. I hope this post draws attention to the common ground among many “distributed system” and “cloud” projects, and that more open-source libraries such as Parallec.io emerge to simplify and de-duplicate our work on the master side of many cloud projects.

Finally, I would like to thank @Tim Spann for also introducing Parallec.io in his post on using it in Spark as a parallel HTTP client. This is my first Medium post and I appreciate any feedback so that I can do better next time :-)


Jeff Pei

Software Deployment @LinkedIn, previously @eBay. API, author of parallec.io & REST Commander. Views my own. jeffpei.com