Networking for the Software Engineer: Service Discovery
As the token operator amongst developers with a little bit of networking experience, I get a lot of requests from developers that sound a bit like this:
Hi! You know a little bit about networking, right? What does _____ mean? Can you read through this and help me understand?
Networking is an intriguing field, with fantastic acronyms that rival our modern texting lexicon and machinery that works like magic. However, for the software engineer, it’s not always clear why the network matters and how it affects an application. We take it for granted that when our application sends a GET request to another application, we just receive a response. But under the hood, why does it matter? What simple terms should we know to communicate about our application networking?
This is not an overview of the OSI reference model or networking layers. It is not intended to train anyone for a network certification, help configure a switch, or outline every detail about software-defined or container networking. It will not help answer the omnipresent interview question, “What happens when you type a URL into the browser and press enter?” (see link for the answer).
What I will try to cover are:
- Networking concepts we run into when we deploy an application
- How that affects our application
- The physical, software-defined, and container technology equivalents.
When I started this story, I realized that each domain of networking needs to be split into its own story. What I consider the three domains of networking are:
- Connectivity (click link for the first part): How we connect applications
- Network Policy (click link for the second part): How we stop applications from communicating
- Service Discovery: How we allow applications to be easily resolved
In this story, we will learn about the common approaches to register human-readable identities to services and resolve to them. This is called service discovery, the process of discovering services within a network.
Why should I care?
Accessing https://medium.com seems so simple. However, the network has discovered this service and is able to direct us to it. When we can’t access a URL, we get an error message telling us the site cannot be reached.
The detailed error tells us the “Name is not resolved”. When service discovery doesn’t work the way we intend, we can’t get to our service for fun or profit. When I first worked in enterprise applications, I was really confused by the idea of service discovery. I should easily be able to obtain some URL for my application and be done, right? In the enterprise, this want tends to be more difficult to address. Rather than cover the general case of typing a URL into a browser, let’s address the nuances of service discovery in the enterprise.
What is Service Discovery?
Service discovery encompasses the set of processes needed to register and resolve to a service. When we refer to registration, we’re assigning a human-readable domain name (like helloworld.com) for our service. A Domain Name System (DNS) server can help us register an intent-based alias for our service. Resolution is the process of getting from that human-readable domain name to the CPU and memory of its corresponding service. We need some way of reaching and using that service as a user.
Registration reserves the human-readable alias for our use. Resolution isn’t just “direct me to my application”; it is “direct me to the next available instance of my application.”
For example, think of the public restroom. When a stall is in use, no one else can use it. If there are more people than stalls, then there exists a queue of angry people waiting to use the restroom. Similarly, when users connect to an instance of an application, that instance has only so much capacity. In the simplest case, an instance that is busy with one request cannot serve anyone else, and other users have to wait for their turn until the application has completed the intended process. It can be pretty frustrating as a user, especially when we are buying a phone on sale during Cyber Monday or investing in cryptocurrency. In order to handle the increased load (number of users) on the application, we need to have multiple application instances to help process all of these users’ requests. When we have multiple application instances, which one do we resolve to during service discovery?!
Even worse, most application instances aren’t actually identified in a nice, human-readable way. Instead, they’re identified by their IP addresses. The worst case scenario is when IP addresses are statically defined in applications. When IP addresses change, everything breaks!
We can’t make sense of which instances actually belong to our application, short of remembering the IP address sequence or adding it to some inventory database. With multiple instances, we need some kind of technology to:
- Remember which IP addresses belong to our application.
- Create a pool of application instances that we can use.
- Distribute the requests to available instances.
This is called load balancing, the act of balancing requests to different application instances.
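To make those three responsibilities concrete, here is a minimal sketch of a round-robin pool. Round-robin is just one simple balancing strategy that I picked for illustration. The class name, its methods, and the IP addresses are all made up for this example, and a real load balancer does far more (health checks, connection tracking, and so on).

```python
from itertools import count


class RoundRobinPool:
    """Remembers which instances belong to an application and hands them out in turn."""

    def __init__(self):
        self._instances = []     # IP addresses that belong to our application
        self._counter = count()  # increases by one for every request we distribute

    def register(self, ip):
        # Add an instance to the pool (ignore duplicates).
        if ip not in self._instances:
            self._instances.append(ip)

    def deregister(self, ip):
        # Remove an instance, e.g. when its server fails.
        if ip in self._instances:
            self._instances.remove(ip)

    def next_instance(self):
        # Distribute the request to the next instance in the rotation.
        if not self._instances:
            raise LookupError("no instances available")
        return self._instances[next(self._counter) % len(self._instances)]


pool = RoundRobinPool()
for ip in ("10.0.0.11", "10.0.0.12", "10.0.0.13"):  # hypothetical addresses
    pool.register(ip)

# Four requests cycle through the three instances and wrap around.
picks = [pool.next_instance() for _ in range(4)]
```

The caller never needs to know an individual IP address; it only asks the pool for the next instance.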
In essence:
- Registration can be achieved using a DNS alias, something with human-readable intent that identifies the application.
- Resolution can be helped with load balancing, a process that proxies the user’s request to an available application instance and tracks all of the instances.
Phew! It doesn’t seem like such a difficult concept, so why does it take us so long to get a DNS alias for our application?
A DNS Tangent
Being a global distributed system, DNS is a rather interesting case of architecture designed for self-service. After all, if one person continued to track all of the domain names registered on the Internet, it would take years for us to get a new domain name! For a deeper dive on DNS, see this reference.
Here, we’ll think about the concepts that answer the question, “Can you give me a DNS alias for my application?”
Recall that when we register a DNS alias, we are adding a lookup between the human-readable alias and some backend endpoint related to the application. When I started working in IT, I realized that registering something like helloworld.mycompany.com wasn’t quite as simple as going to a user interface and making it myself.
An enterprise focused on security will maintain and catalog the subdomains that are registered under its company domain. We cannot allow willy-nilly registration of subdomains since a bad actor can inject a malicious subdomain that may compromise internal applications or affect our users. Furthermore, an enterprise might want to standardize on the subdomain naming to include specific environments, separating non-production requests from production requests.
Even after registering a subdomain, we have to be conscious of whether it is privately or publicly registered. For example, I can look up medium.com because it is a hostname registered with a public DNS name server.
However, an enterprise often has its own DNS name servers (usually with the .net suffix) for security reasons. We cannot resolve an enterprise subdomain unless we can connect to the enterprise DNS name servers. This configuration is often referred to as “private”. If our laptop is off the company network and cannot access the private DNS name server, we would not be able to get to helloworld.mycompany.com.
When working with DNS, keep in mind where the address is registered. Depending on our configuration, we may not be able to resolve a private DNS subdomain unless we have the correct name server.
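The private versus public distinction can be sketched as two lookup tables, where the private one is only reachable from the company network. This is a toy model and not how a real resolver is implemented. The zone contents, the addresses, and the `on_company_network` flag are assumptions I made up for illustration.

```python
# Records held by a public DNS name server (placeholder address from a documentation range).
PUBLIC_ZONE = {"medium.com": "203.0.113.10"}

# Records held only by the enterprise's private DNS name server (hypothetical address).
PRIVATE_ZONE = {"helloworld.mycompany.com": "10.20.0.5"}


def resolve(hostname, on_company_network):
    """Return an address for hostname, or None if no reachable name server knows it."""
    # The private name server can only be queried from inside the company network.
    if on_company_network and hostname in PRIVATE_ZONE:
        return PRIVATE_ZONE[hostname]
    # Everyone can query the public name server.
    return PUBLIC_ZONE.get(hostname)


at_the_office = resolve("helloworld.mycompany.com", on_company_network=True)
at_the_cafe = resolve("helloworld.mycompany.com", on_company_network=False)
```

From the office the alias resolves; from the cafe the same lookup returns nothing, which is the “Name is not resolved” situation.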
Service Discovery Technologies
When delving into service discovery technologies, we find there are many variants that can help us achieve either registration or resolution, and sometimes both. We’ll talk about two types of technologies, namely:
- Load Balancers
- DNS
Physical Devices
Physical appliances that assist with service discovery include load balancers or DNS-related registration appliances. To handle major load, there may be dedicated physical load balancers that process requests and direct them to the correct pool of application instances. Most of the time, the pool consists of server hostnames or IP addresses. Similarly, some enterprise DNS solutions require a physical appliance to support processing.
Sometimes, it takes a bit of time to change physical appliances. When I started in the enterprise, I had to submit a ticket with a list of an application’s IP addresses to register to an alias. I was pretty frustrated because it could take weeks for the ticket to be completed. It never occurred to me that someone had to physically go to the device and enter in the information! If one of the servers that housed my application failed and I had to get a new one, I would submit a new ticket to change my pool of addresses and wait another week for it to work again.
Software-Defined Service Discovery
After I started doing more work on virtualized service discovery solutions and public cloud, I could self-service a DNS alias and a virtual load balancer in minutes! It was magical. When we use a public cloud like Amazon Web Services, Google Cloud Platform, or Microsoft Azure, we can provision a DNS alias for our application that points to a virtual load balancer, which then points to our application’s instances.
Software-defined load balancers can be very powerful, in that they can be configured to automatically register instances based on metadata or networking configuration. For more information on public cloud load balancers, check these references for GCP, AWS, or Azure. If we are not using the public cloud, there are vendor tools that can offer the same automation and functionality. With a load balancer available on demand, service discovery can be optimized for cost and elasticity. Instances can be provisioned and registered, or de-registered and deleted, in quick response to load. As a result, we can horizontally scale with smaller application instances in greater quantity rather than vertically scale with larger application instances in smaller quantity.
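As a rough illustration of registering instances based on metadata, the sketch below builds a pool from whichever instances carry matching tags. The `Instance` type, the tag names, and the `healthy` flag are my own assumptions and do not come from any cloud provider’s API.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    ip: str
    tags: dict = field(default_factory=dict)  # metadata attached to the instance
    healthy: bool = True                      # assumed health flag, for illustration only


def build_pool(instances, selector):
    """Return the IPs of healthy instances whose tags match every selector entry."""
    return [
        inst.ip
        for inst in instances
        if inst.healthy and all(inst.tags.get(k) == v for k, v in selector.items())
    ]


fleet = [  # hypothetical instances
    Instance("10.0.1.4", {"app": "helloworld", "env": "prod"}),
    Instance("10.0.1.5", {"app": "helloworld", "env": "prod"}, healthy=False),
    Instance("10.0.1.6", {"app": "helloworld", "env": "dev"}),
    Instance("10.0.1.7", {"app": "another", "env": "prod"}),
]

prod_pool = build_pool(fleet, {"app": "helloworld", "env": "prod"})

# Scaling out: a new tagged instance joins the pool without anyone filing a ticket.
fleet.append(Instance("10.0.1.8", {"app": "helloworld", "env": "prod"}))
scaled_pool = build_pool(fleet, {"app": "helloworld", "env": "prod"})
```

Nobody maintains the list of addresses by hand; membership follows from the tags.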
Leveraging software-defined DNS tooling can provide some ease-of-use. When using public cloud, we still need to register a domain like mycompany.com with some registrar. That can take a little bit of time to configure, since we need to add the public cloud’s name servers to the domain’s lookup. After that, registering a subdomain like helloworld.mycompany.com and adding the load balancer reference can be done in a few minutes. Despite registration only taking a few minutes, keep in mind it will take time for the new DNS entries to propagate before our laptops can resolve the alias. The amount of time our laptops or applications cache the DNS entry (statically holding onto the previous entry for some time) can affect how long it takes before we can resolve to the new backend application instance or alias. For a case study of how DNS caching might affect an application, see my investigation into NGINX reverse proxy configuration and how its DNS cache creates some interesting side effects.
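The caching effect is easier to see in a small sketch. Below, a resolver keeps each answer for a fixed time-to-live (TTL) and only asks the upstream records again once the entry expires. The class, the 300-second TTL, and the addresses are hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
class CachingResolver:
    """Holds on to a DNS answer until its time-to-live runs out."""

    def __init__(self, lookup, ttl_seconds):
        self._lookup = lookup    # function: hostname -> address (the upstream name server)
        self._ttl = ttl_seconds
        self._cache = {}         # hostname -> (address, expires_at)

    def resolve(self, hostname, now):
        cached = self._cache.get(hostname)
        if cached and now < cached[1]:
            return cached[0]     # entry has not expired, even if upstream has changed
        address = self._lookup(hostname)
        self._cache[hostname] = (address, now + self._ttl)
        return address


records = {"helloworld.mycompany.com": "10.0.0.11"}  # hypothetical upstream records
resolver = CachingResolver(lambda name: records[name], ttl_seconds=300)

first = resolver.resolve("helloworld.mycompany.com", now=0)
records["helloworld.mycompany.com"] = "10.0.0.99"    # the alias is pointed somewhere new
stale = resolver.resolve("helloworld.mycompany.com", now=120)  # still the old address
fresh = resolver.resolve("helloworld.mycompany.com", now=301)  # cache expired, new address
```

Until the cached entry expires, the client keeps talking to the old address even though the record was updated.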
Container Service Discovery
The first time I heard of service discovery was actually in reference to container technology. The problem of attaching application instances to a load balancer and registering that to a DNS alias became a big pain point for container technology because containers are ephemeral. Their IP addresses can change at any moment. As a result, we needed a more dynamic way of addressing the process of service registration and resolution. Container orchestrators have various tools and techniques to accomplish highly dynamic service discovery, such as path-based routing and cluster-level DNS name servers. Both approaches combine the idea of DNS resolution and load balancing into a single tool.
While path-based routing is often referenced with the implementation of REST APIs, it is actually a form of service discovery. When an application instance gets created, its IP address gets added to a list curated by a reverse proxy technology, such as NGINX or Traefik. A reverse proxy acts as a single entry point for all requests into the cluster and routes each request to the correct instance. Sound familiar? It’s like a load balancer, except that even requests between instances in the same cluster route through the reverse proxy.
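Here is a minimal sketch of the idea behind path-based routing: the proxy keeps a list of backends per path prefix and picks the most specific prefix that matches the request. It is not how NGINX or Traefik are implemented or configured. The class, the paths, and the backend addresses are invented for illustration.

```python
class PathRouter:
    """Maps path prefixes to the backends that registered under them."""

    def __init__(self):
        self._routes = {}  # path prefix -> list of backend addresses

    def add_backend(self, prefix, address):
        # Called when a new application instance is created.
        self._routes.setdefault(prefix, []).append(address)

    def route(self, path):
        # A prefix matches if it equals the path or is a parent path segment of it.
        matches = [
            prefix
            for prefix in self._routes
            if path == prefix or path.startswith(prefix.rstrip("/") + "/")
        ]
        if not matches:
            return None
        # The longest matching prefix is the most specific one.
        return self._routes[max(matches, key=len)]


router = PathRouter()
router.add_backend("/hello", "10.1.0.2:8080")        # hypothetical instances
router.add_backend("/hello", "10.1.0.3:8080")
router.add_backend("/hello/admin", "10.1.0.9:8080")

hello_backends = router.route("/hello/world")
admin_backends = router.route("/hello/admin/users")
unknown = router.route("/helloworld")  # not under /hello, so nothing matches
```

Choosing which backend in the returned list handles the request is the load balancing half of the job.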
A second technique for container service discovery is to create a small DNS name server scoped to the container orchestration cluster. This enables applications to resolve to each other with local DNS aliases. It is similar to creating a private DNS name server within an enterprise. For example, Kubernetes creates a small DNS name server that assigns a hostname to a set of application instances. Requests between applications in the cluster reference the name server as the resolution authority. If we want an application named another to access helloworld, we simply configure another to call helloworld, which has a full DNS hostname of helloworld.default.svc.cluster.local. This is only available within the cluster itself, so we won’t be able to get to it from outside of the cluster!
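The full hostname follows a predictable pattern: the service name, then the namespace, then svc, then the cluster domain. A small sketch of that pattern is below. The helper function is my own, and it assumes the default cluster domain of cluster.local, which a cluster operator can change.

```python
def service_fqdn(service, namespace="default", cluster_domain="cluster.local"):
    """Build the cluster-internal DNS name for a Kubernetes service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


# The service from the story, living in the default namespace.
helloworld_host = service_fqdn("helloworld")

# A hypothetical service in another namespace needs the namespace spelled out.
payments_host = service_fqdn("payments", namespace="billing")
```

Here `helloworld_host` is `helloworld.default.svc.cluster.local`, matching the hostname above.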
In Summary
After writing and re-reading all of this, I marvel at how complex the process of service discovery can be. I was always frustrated at why I couldn’t get a DNS alias in minutes. Service discovery involves registration and resolution, accomplished by DNS and load balancing, respectively. There are many ways we can balance load, from physical appliances to dynamic container technologies, and many other ways we can resolve to a DNS alias.
I had not intended for this series to be so extensive but I think it is a testament to the nuances of networking. We tend to expect a lot of our network — it must always be available, secure, and resolvable. For many years, we’ve used networking-specific domain language around connectivity, network policy, and service discovery. As we create new technologies to move faster and account for rapid changes in our systems, we’ve started to better describe our network based on the intent of our business capability.