Service Discovery in Distributed Systems
What is Service Discovery? It’s a special service in a distributed system that keeps and makes available information about other services in the system.
Working for a couple of years in the data science field while pursuing a degree in Software Engineering is like dividing yourself into two halves. One half needs to study hard math principles, all types of neural network architectures, data storage, and Business Analysis, while the other makes you learn how information is encoded and used by computers, how algorithms work, and how to build all sorts of applications. These two fields complement each other: the second helps me become a better data scientist and better understand the world in which my models live.
In the last year of my Bachelor, I had a course that I absolutely loved: Distributed Systems. I liked it so much that I decided to go ahead of the curriculum and started learning from one of the greatest books I have read, “Designing Data-Intensive Applications” by Martin Kleppmann. I then combined Distributed Systems and Natural Language Processing in my bachelor thesis to create a chatbot.
Now, after developing a couple of distributed systems and gaining more experience with them, I decided to start writing a series of articles explaining and implementing services, patterns, and principles from this field, starting with one of the most important services in any distributed system.
The problem statement.
In distributed systems, the whole application is represented by services: web servers running on different machines, each responsible for different features of the application. Take Spotify’s functionality as an example. Spotify not only allows users to play music on different devices but also lets artists upload their songs, generates music recommendations, provides song lyrics, lets users follow artists, handles financial management, and more. All these features are implemented by different services or service replicas, so there may be hundreds or thousands of services that need to communicate with each other. However, in this kind of architecture the machines on which the services run come and go, so the locations of the services are always changing. A mechanism for storing, updating, and deleting the services’ information is required, and Service Discovery is usually used for this purpose.
What is Service Discovery?
Service Discovery is usually designed as a service registry: it keeps all information about the services and is the place where services announce that they are still alive. Usually, it is one or more web services that provide an API for registering, updating, and deleting service information, plus some endpoints for receiving heartbeat requests. The API provides mostly CRUD functionality. If you are not familiar with CRUD operations, CRUD stands for:
- Create — The creation of entities in a data storage.
- Read — Reading the entity values from the data storage.
- Update — Updating entity values in the data storage.
- Delete — Deleting entities from the data storage.
When a service in a distributed system starts, it sends a Create request to the Service Discovery to register, so that the Service Discovery knows about its existence. Then the service makes a second request, a Read, to get the information about the services it needs to communicate with. The Update operation is used when a service is redeployed to another location or when load is redirected to another service. Finally, the Delete operation removes a service, either automatically when its server goes down or manually.
Now let’s take a closer look at heartbeat requests.
Heartbeat requests.
A natural question may have come to mind: how does Service Discovery know that a service is still running after it has registered? This is where heartbeat requests come in. After registering and getting the information about the services it needs, the service starts sending a request to Service Discovery once every t seconds or minutes, announcing that it is still alive. If the service sends no heartbeat for 2t or 3t, then, depending on the implementation, the Service Discovery either deletes the service or triggers a health check of it. Usually, heartbeat requests don’t have any payload, but again, depending on the implementation, they may carry some data.
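As a rough illustration of the client side, the sketch below sends a heartbeat once every interval seconds until it is asked to stop. The function names, the /heartbeat/<name> URL, and the use of POST are assumptions made for this example; they are not taken from the implementation described in this article.

```python
import threading
import urllib.request


def send_heartbeat(base_url, name):
    """Send one empty heartbeat request (hypothetical endpoint and method)."""
    request = urllib.request.Request(f"{base_url}/heartbeat/{name}", method="POST")
    with urllib.request.urlopen(request, timeout=2) as response:
        return response.status


def heartbeat_loop(send, interval, stop):
    """Call send() once every `interval` seconds until `stop` is set."""
    while not stop.is_set():
        send()
        # Event.wait() returns early when stop is set, so shutdown is not delayed.
        stop.wait(interval)
```

In a real service the loop would typically run in a background thread, with send bound to something like send_heartbeat for the service's own name, and stop being a threading.Event that is set on shutdown.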
The Python implementation.
Next, I’m going to show how a simple Service Discovery can be implemented in Python as a Flask web app. Following the CRUD API, each endpoint should accept a specific HTTP method:
- Create: POST;
- Read: GET;
- Update: PUT;
- Delete: DELETE
Usually, depending on the amount of data stored, the Read endpoint can be split in two: read_all and read_some or read_one. This choice depends on the developer and the architecture. For this example, we are going to implement two Read endpoints:
- read all;
- read some;
Also, instead of connecting a database, we are going to implement a ServiceRegistery class that stores the services’ information. Each service’s information will be a simple dictionary with the following fields:
- name — the name of the service.
- host — the host of the service.
- port — the service port.
The service information will look like this.
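For instance, a record with the three fields above could look like the following. The concrete values are made up for illustration.

```python
# Hypothetical record for one registered service; the values are invented.
service_info = {
    "name": "lyrics-service",  # the name of the service
    "host": "127.0.0.1",       # the host of the service
    "port": 8001,              # the service port
}
```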
For the service CRUD operations, the ServiceRegistery has one field and five functions (remember that we implement two forms of read requests). The field is a simple dictionary that stores the services’ information and is set up in the constructor.
The heartbeats, heartbeats_lock, and time_threashold are used for heartbeat requests management and will be covered later.
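A minimal sketch of such a constructor is given here. The field names heartbeats, heartbeats_lock, and time_threashold come from the text; the name services, the use of threading.Lock, and the default of 10 seconds are assumptions.

```python
import threading


class ServiceRegistery:
    def __init__(self, time_threashold=10):
        # Service name -> {"name": ..., "host": ..., "port": ...}
        self.services = {}
        # Service name -> timestamp of the last received heartbeat.
        self.heartbeats = {}
        # Protects `heartbeats`, which is shared with the checking thread.
        self.heartbeats_lock = threading.Lock()
        # Seconds of silence after which a service is suspected to be down
        # (the default value here is an assumption).
        self.time_threashold = time_threashold


registry = ServiceRegistery()
```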
The create operation takes the request body and, if the service it describes doesn’t exist yet, adds it; otherwise an error message is returned.
The status code returned in case of this error is 208, named Already Reported. This status was defined for WebDAV (RFC 5842) to avoid listing the same resource several times in one response; for a duplicate registration, 409 Conflict is a more common choice.
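The create operation described above can be sketched as follows. The method name create, the message text, and the (payload, status) return shape are assumptions, and so is the 201 status on success, since the text only specifies the 208 error case.

```python
class ServiceRegistery:
    def __init__(self):
        self.services = {}  # service name -> service information

    def create(self, body):
        """Register a service; return (payload, HTTP status code)."""
        name = body["name"]
        if name in self.services:
            # Duplicate registration: the status used in this article.
            return {"message": f"Service {name} is already registered"}, 208
        self.services[name] = {
            "name": name,
            "host": body["host"],
            "port": body["port"],
        }
        # 201 Created is an assumption; the article does not name this code.
        return self.services[name], 201
```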
The read-all operation doesn’t require any parameters, so it just returns the dictionary with the services’ information and the status code 200. The read-some function takes the list of requested services: if all of them are present, they are returned as a dictionary; otherwise a list of the missing services’ names is returned with the error status code 404.
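Both read operations can be sketched as below, assuming the registry stores services in a dictionary keyed by service name. The method names and the (payload, status) return shape are assumptions.

```python
class ServiceRegistery:
    def __init__(self):
        self.services = {}  # service name -> service information

    def read_all(self):
        """Return every registered service."""
        return self.services, 200

    def read_some(self, names):
        """Return the requested services, or the names that are missing."""
        missing = [name for name in names if name not in self.services]
        if missing:
            return missing, 404
        return {name: self.services[name] for name in names}, 200
```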
When updating a service’s information, the update function reads the request body and updates the stored information of the service with it. If the service exists, its updated information is returned with the status code 200; otherwise an error message with the 404 status code is returned.
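A possible shape for the update operation is sketched here. The method name, the message text, and the assumption that the request body carries the service name are all illustrative.

```python
class ServiceRegistery:
    def __init__(self):
        self.services = {}  # service name -> service information

    def update(self, body):
        """Overwrite the stored fields of an existing service."""
        name = body["name"]
        if name not in self.services:
            return {"message": f"Service {name} not found"}, 404
        # Only the fields present in the body are replaced.
        self.services[name].update(body)
        return self.services[name], 200
```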
Finally, the ServiceRegistery has a function for deleting a service.
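The text does not describe how deletion behaves, so the following is only one plausible sketch: it removes the service and its last heartbeat, and it returns 404 for an unknown name. The method name, messages, and status codes are assumptions.

```python
class ServiceRegistery:
    def __init__(self):
        self.services = {}    # service name -> service information
        self.heartbeats = {}  # service name -> last heartbeat timestamp

    def delete(self, name):
        """Remove a service and its heartbeat record."""
        if name not in self.services:
            return {"message": f"Service {name} not found"}, 404
        del self.services[name]
        # Forget the last heartbeat too, so the checker stops tracking it.
        self.heartbeats.pop(name, None)
        return {"message": f"Service {name} deleted"}, 200
```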
All these functions are called in the Service Discovery endpoints listed below:
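To show how the HTTP methods map onto the CRUD operations, here is a compact Flask sketch. To stay self-contained it uses a plain dictionary in place of the ServiceRegistery class; the route paths, function names, and message texts are assumptions, not the article's original endpoints.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)
services = {}  # stand-in for the registry storage: name -> information


@app.route("/service", methods=["POST"])
def create_service():
    body = request.get_json()
    if body["name"] in services:
        return jsonify({"message": "already registered"}), 208
    services[body["name"]] = body
    return jsonify(body), 201


@app.route("/services", methods=["GET"])
def read_all_services():
    return jsonify(services), 200


@app.route("/services/some", methods=["GET"])
def read_some_services():
    # Requested names arrive as repeated query parameters: ?name=a&name=b
    names = request.args.getlist("name")
    missing = [name for name in names if name not in services]
    if missing:
        return jsonify(missing), 404
    return jsonify({name: services[name] for name in names}), 200


@app.route("/service", methods=["PUT"])
def update_service():
    body = request.get_json()
    if body["name"] not in services:
        return jsonify({"message": "not found"}), 404
    services[body["name"]].update(body)
    return jsonify(services[body["name"]]), 200


@app.route("/service/<name>", methods=["DELETE"])
def delete_service(name):
    if services.pop(name, None) is None:
        return jsonify({"message": "not found"}), 404
    return jsonify({"message": "deleted"}), 200
```

The app can be exercised without a network through app.test_client(), or served locally with the flask run command.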
Heartbeat endpoint implementation.
To let the Service Discovery find services that aren’t working anymore, each service, as said before, starts sending heartbeat requests to the Service Discovery after registering and getting the other services’ information. After each received request, the Service Discovery updates the last heartbeat timestamp of that service. In addition, the Service Discovery runs a separate thread that, once every t, checks the time passed since each service’s last heartbeat request. If the time passed is greater than the threshold, the service is declared dead, and depending on the implementation different actions are taken.
First, the ServiceRegistery needs a function for adding a heartbeat.
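A minimal sketch of such a function follows. The method name add_heartbeat, the check that the service is registered, and the choice of time.monotonic() as the clock are assumptions.

```python
import threading
import time


class ServiceRegistery:
    def __init__(self):
        self.services = {}    # service name -> service information
        self.heartbeats = {}  # service name -> last heartbeat timestamp
        self.heartbeats_lock = threading.Lock()

    def add_heartbeat(self, name):
        """Record that `name` has just reported itself alive."""
        if name not in self.services:
            return {"message": f"Service {name} is not registered"}, 404
        # The lock is needed because the checking thread reads this dict.
        with self.heartbeats_lock:
            # A monotonic clock is not affected by system clock changes.
            self.heartbeats[name] = time.monotonic()
        return {"message": "heartbeat received"}, 200
```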
This function is called when the heartbeat endpoint (listed below) is called:
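A heartbeat endpoint could look roughly like this in Flask. The route /heartbeat/<name>, the POST method, and the dictionary used in place of the registry are assumptions made to keep the sketch self-contained.

```python
import time

from flask import Flask, jsonify

app = Flask(__name__)
last_heartbeat = {}  # stand-in for the registry: name -> timestamp


@app.route("/heartbeat/<name>", methods=["POST"])
def heartbeat(name):
    # The service name comes from the URL; the request has no payload.
    last_heartbeat[name] = time.monotonic()
    return jsonify({"message": f"heartbeat received from {name}"}), 200
```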
When a service wants to send a heartbeat request, it should send its name as a parameter, so that its last heartbeat timestamp is updated. Finally, the function that checks the heartbeats runs as a separate thread and just prints the name of each service that it considers inactive.
This function is running on a separate thread in the main module.
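The checking loop and its thread can be sketched as follows. The function names, the stop event used for a clean shutdown, and the 5 second threshold are assumptions; only the idea of printing inactive services from a separate thread comes from the text.

```python
import threading
import time


def find_silent_services(heartbeats, time_threashold, now):
    """Return the names whose last heartbeat is older than the threshold."""
    return [
        name for name, last in heartbeats.items()
        if now - last > time_threashold
    ]


def check_heartbeats(heartbeats, lock, time_threashold, stop):
    """Periodically print the services that stopped sending heartbeats."""
    while not stop.is_set():
        with lock:
            silent = find_silent_services(
                heartbeats, time_threashold, time.monotonic()
            )
        for name in silent:
            print(f"Service {name} seems to be inactive")
        # Sleep until the next check, waking up early on shutdown.
        stop.wait(time_threashold)


heartbeats = {}  # service name -> last heartbeat timestamp
lock = threading.Lock()
stop = threading.Event()
checker = threading.Thread(
    target=check_heartbeats, args=(heartbeats, lock, 5, stop), daemon=True
)
checker.start()

# ... the web app would be serving requests here ...

stop.set()      # ask the checker to finish
checker.join()  # and wait until it has
```

Keeping the comparison in a separate pure function makes the timing rule easy to test without waiting for real time to pass.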
Conclusion.
Distributed systems are a vast field in Computer Science and a technology worth knowing and understanding. Service Discovery is a special service in such a system that keeps and makes available information about the other services in the system. It also keeps track of the availability of those services. It is also the service to start with when developing a distributed system, because each service begins its activity by registering with it.
The full commented code of the service can be found here: GitHub link
Written by Păpăluță Vasile
LinkedIn: https://www.linkedin.com/in/vasile-p%C4%83p%C4%83lu%C8%9B%C4%83/
Instagram: https://www.instagram.com/science_kot/
GitHub: https://github.com/ScienceKot