Elements of Service Discovery
I’ve noticed that one of the most difficult aspects of configuration management is dealing with service discovery. Oftentimes DNS becomes the de facto solution here, but that simply moves the problem from something like Chef search to tooling that automatically updates DNS correctly. As a consumer, service discovery makes your life simple. You might be able to assume DNS was set up properly and use a known endpoint URL. You might use something like Consul or etcd, plus a bit of code to look up endpoints. No one said you have to use a key/value store, either; querying a database amounts to the same thing. In all these cases, finding the service is easy for the basic use case. To handle things when they get complicated, you need to understand some fundamental concepts of service discovery.
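As a sketch of that consumer side, here is what a lookup might look like against a simple in-memory stand-in for the service database. The service names and the `SERVICE_DB` layout are purely illustrative, not any particular tool's API:

```python
# Hypothetical in-memory stand-in for a service database such as
# Consul's catalog or etcd. In real life this data would live in
# that external store and be queried over its API.
SERVICE_DB = {
    "billing": [{"host": "10.0.1.12", "port": 8080}],
    "search":  [{"host": "10.0.2.7",  "port": 9200},
                {"host": "10.0.2.8",  "port": 9200}],
}

def lookup(service_name):
    """Return (host, port) tuples for every registered instance."""
    entries = SERVICE_DB.get(service_name, [])
    return [(e["host"], e["port"]) for e in entries]
```

The basic use case really is this small: a name goes in, endpoints come out, and the client picks one.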
At the core of any service discovery system is a database. You need to store those services somewhere. What is more, you need to update that database when nodes show up and leave. It is also important that this database is available and kept up to date across your fleet. Consul, for example, uses a gossip protocol based on SWIM to keep data current across all instances by constantly “gossiping” about changes between nodes. Consul even supports federation, where a cluster in one region of a cloud gossips amongst itself and designated server nodes relay those answers between data centers, enabling fancy things like regional load balancing.
While tools like Consul are awesome, it is a good idea to keep a simple layer of indirection between your applications and the service database. Service discovery code should generally be orthogonal to your applications, so maintaining some division between the underlying database and the code that makes service discovery easy for clients helps hide that complexity.
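One way to keep that layer of indirection is to have applications code against a tiny interface while the backend stays swappable. This is a minimal sketch under my own naming; `ServiceCatalog` and `StaticCatalog` are hypothetical, and a real backend would talk to Consul, etcd, or a database behind the same interface:

```python
from abc import ABC, abstractmethod

class ServiceCatalog(ABC):
    """The only surface application code sees. The backend
    (Consul, etcd, a SQL table...) hides behind it."""

    @abstractmethod
    def endpoints(self, name):
        """Return a list of (host, port) tuples for a service."""

class StaticCatalog(ServiceCatalog):
    # Simplest possible backend: a hard-coded mapping.
    # Handy in tests and as a fallback.
    def __init__(self, mapping):
        self._mapping = mapping

    def endpoints(self, name):
        return list(self._mapping.get(name, []))

def connect_url(catalog, name):
    # Application helper: pick the first instance and build a URL.
    host, port = catalog.endpoints(name)[0]
    return f"http://{host}:{port}"
```

Swapping Consul for etcd later then means writing one new `ServiceCatalog` subclass, not touching every application.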
Registration or Join / Leave
I mentioned how consuming service discovery is easy. What is slightly harder is dealing with registration. Registration is where your services start up and let the service discovery system know about them. This sounds really simple, but it is, in fact, much more subtle. It is not terribly hard, but it does require some thought to understand what it means to be a service in your organization.
Let’s hop in our time machine to before containers and the kubes were all the rage. Back in the day, you had actual servers and you’d work to run many services on the same machine. You might have a web app and a few supporting daemons to maintain. You needed to understand that a service lives at an IP address and a port. You also needed to ensure that ports didn’t clash on the same machine, keeping in mind that the web app is actually a Python program with one process per CPU running alongside an Nginx process.
The thing to realize is that, even if you use the cloud, k8s, and containers, your service endpoints still depend on your deployment targets. Not to mention that if you’ve migrated your apps to the cloud or that container orchestrator in the sky, you are most likely keeping your database on VMs or bare metal, alongside legacy software and that on-premises system you can’t run in k8s because it expects a persistent volume. These are the services you need most of all in service discovery!
For service discovery to be really powerful, it needs to be global and include everything! You need your old kludgy processes joining service discovery just like everything else. This isn’t hard technically, but it does require some thought, because oftentimes when something is “available” in service discovery, that is treated as synonymous with being functionally available. Put another way, joining or leaving service discovery should be simple and easy, as long as you consider health separately.
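To show how small the join/leave surface can be when health is kept separate, here is a toy in-memory registry. The API is hypothetical and deliberately knows nothing about health; even a kludgy legacy process can call something this simple from a wrapper script:

```python
import threading
import time

class Registry:
    """Toy join/leave registry. Health checking lives elsewhere,
    on purpose: membership and health are separate concerns."""

    def __init__(self):
        self._lock = threading.Lock()
        # name -> {(host, port): time the instance joined}
        self._services = {}

    def join(self, name, host, port):
        with self._lock:
            self._services.setdefault(name, {})[(host, port)] = time.time()

    def leave(self, name, host, port):
        with self._lock:
            self._services.get(name, {}).pop((host, port), None)

    def members(self, name):
        with self._lock:
            return sorted(self._services.get(name, {}))
```

The hard part, as the next section argues, is not this code; it is deciding when a member should count as an answer.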
Writing a program to join/leave service discovery is really simple. Where things get more difficult is defining whether a service is healthy enough to be considered an answer when a client queries for it. It doesn’t make things any easier that every program is different in how it exposes its health. Many modern systems have started adopting health endpoints that make it reasonably easy to tell when a service is up and available. Unfortunately, the future is not now and won’t be here for quite a while, so until then, we’ll need to consider tried-and-true applications like MySQL, Apache, BIND, and a whole host of excellent software that the internet is built on top of.
When checking health, it is a good idea to implement checks that act like a normal client. For example, you could write a health check that watches a log for a “listening” message to signal when the app is available for traffic. The problem is that the log message says nothing about whether the program has actually bound to a socket yet. A better tactic is to write a program that connects like a client would and performs some reasonably safe operation that confirms the service is up and successfully handling traffic.
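A minimal version of that tactic, assuming a plain TCP service: actually open a connection instead of grepping logs. A real check for something like MySQL would go one step further and issue a safe query after connecting:

```python
import socket

def tcp_check(host, port, timeout=2.0):
    """Act like a real client: try to open a TCP connection.

    Returns True only if the service has actually bound to the
    socket and accepts connections, which is exactly what a
    'listening' log message cannot tell you.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

This is the cheapest client-like check; richer ones (an HTTP GET, a `SELECT 1`) follow the same pattern of exercising the path a client would.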
This brings up another essential consideration for health checks: context. A health check needs to be considered within a context that defines when a passing result is valid. For example, a passing health check based on a request from a process on the same box as the service provides only a limited perspective on health. That sort of check is narrow in scope but valuable for validating that a service is running. The same health check, when run from across the network, might fail, at which point the larger context reveals the problem likely lies outside the application itself, such as a firewall change or network issue.
To summarize, when considering health checks, it is important to:
- Create a health check that acts like a client
- Be explicit about the context of the health check
- Use different health check contexts according to your definition of health for the service
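Putting those contexts together, an illustrative helper can interpret a local check result alongside a network check result. The verdict strings here are my own wording, not output from any tool:

```python
def diagnose(local_ok, network_ok):
    """Combine two health-check contexts into one rough verdict.

    A local pass plus a network fail is the interesting case: it
    points at firewalls or the network, not the application.
    """
    if not local_ok:
        return "service itself is down"
    if not network_ok:
        return "process healthy locally; suspect firewall or network"
    return "healthy in both contexts"
```

Being explicit about context like this turns a binary pass/fail into a first clue about where to look.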
I’m a huge fan of service discovery because, when implemented correctly, it can be a huge benefit. Still, it is important to understand that service discovery is not free. It takes a lot of operational work to maintain the service database and the service discovery code that provides join/leave behavior and implements health checks. Service discovery needs constant attention as well. As your organization and its services change, service discovery will likely need to change to meet the organization’s needs. The cost is worth it, because the complexity it encapsulates makes a lot of code related to connecting to services much simpler. As you grow more services, that complexity compounds, especially when you start to work with regional constraints, disaster recovery, and all the other issues you need to consider when running global systems.