DIMON — A distributed and decentralised monitoring system based on AntidoteDB for Guifi.net
The Guifi.net network has grown over the past 15 years as a technological, social and economic project that now provides Internet access to more than 80,000 people. The infrastructure was initially built with WiFi radio links and today also employs fibre optic links to reach thousands of households. The monitoring system currently in place lags behind the evolution of the network: it requires manual intervention and exposes a number of single points of failure. The technology provided by the LightKone project is helping UPC to develop a novel network monitoring system, built with distribution, decentralisation and automation in mind, that will replace the legacy one.
Community networks such as Guifi.net are bottom-up, citizenship-driven technological, social and economic projects with the objective of creating an open, free and neutral telecommunications network based on a commons model. The whole infrastructure can be understood as a crowd-sourced, multi-tenant collection of heterogeneous network devices (wired and wireless), interconnected and with an organised IP addressing scheme. In particular, Guifi.net was born 15 years ago in Gurb, a rural village 70 km north of Barcelona, to overcome the lack of Internet access provision by the incumbent ISPs at the time. Today the network consists of more than 35,000 operational nodes, including thousands of kilometres of fibre optic links. It covers most areas of Catalonia and is also present in several other regions of Spain. Figure 1 captures Guifi.net around Barcelona and surrounding towns. Providing Internet access to more than 80,000 people, it is considered to be the largest community network worldwide.
Besides the network equipment such as routers, in Guifi.net there is a computing infrastructure located at the network edge. It is formed by a collection of heterogeneous computing devices ranging from single-board computers (SBCs), such as the popular Raspberry Pi, to mini-computers and desktop PCs. Most devices are located at users’ homes or in the premises of municipalities. These devices usually have the Cloudy platform installed on them to provide both network-related and user-targeted services.
The current monitoring system in Guifi.net is built around a centralised database that contains a list of all the network devices, all the monitoring servers, and the assignments between them (i.e., which monitor is in charge of each device). Monitors are geographically spread over the network, and the assignments are decided manually through the Guifi.net website. During normal operation, each monitoring server fetches the list of devices assigned to it and periodically checks their status (response time, traffic on their different network interfaces, etc.). The collected data are stored locally at each server and are available for visual inspection by means of specific API-like calls performed through the main Guifi.net website. While this monitoring system works in general terms and accomplishes its duties, it has important limitations regarding its robustness and its resilience to varying conditions of the network and its infrastructure. In particular, it does not deal properly with network partitions; when a monitor ceases to operate (e.g., due to a hardware failure), the devices assigned to it remain unmonitored; and the collected data are stored at a single location without redundancy.
Within the LightKone project, we have designed and implemented a new monitoring system for Guifi.net that overcomes the current one’s limitations by making use of the technologies the project has developed. This article describes its most remarkable aspects.
Prior to designing the new monitoring system, we analysed the legacy one and identified its shortcomings. From these, we derived the following five main requirements for the new monitoring system:
• Redundancy: every network device shall be monitored by a minimum number of servers greater than one. This means that the monitoring servers should check which network devices have fewer monitors assigned than the minimum number, and autonomously decide to become a monitor for any of these devices.
• Automated assignment: the system shall make the assignment between network devices and monitoring servers automatically. In steady-state operation, the service should run autonomously without manual intervention.
• Automated reconfiguration: the system as a whole shall be able to automatically detect faulty monitors (e.g., due to hardware failures) and reassign their network devices to functional monitoring servers. This process should be carried out without manual intervention.
• Data replication: the collected data shall be replicated and distributed to different parts of the system. In the event of a network partition or churn of some of the storage nodes, the data should still be retrievable by the monitoring service from other parts of the network.
• Load balancing: the monitoring workload is to be balanced over the network and the available monitors rather than being concentrated on a few devices.
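The redundancy, automated assignment and load balancing requirements can be illustrated with a small Go sketch. It is not the project's implementation: the `Assignments` type, the `underMonitored` and `claim` functions, and the idea of a fixed per-monitor capacity are all hypothetical names and assumptions introduced here, and the assignment table is a plain in-memory map rather than shared storage.

```go
package main

import (
	"fmt"
	"sort"
)

// Assignments maps a device ID to the IDs of the monitors watching it.
// (Hypothetical in-memory stand-in for the shared assignment table.)
type Assignments map[string][]string

// underMonitored returns the devices watched by fewer than minMonitors
// monitors, least-covered first (ties broken by device ID).
func underMonitored(a Assignments, minMonitors int) []string {
	var out []string
	for dev, mons := range a {
		if len(mons) < minMonitors {
			out = append(out, dev)
		}
	}
	sort.Slice(out, func(i, j int) bool {
		if len(a[out[i]]) != len(a[out[j]]) {
			return len(a[out[i]]) < len(a[out[j]])
		}
		return out[i] < out[j]
	})
	return out
}

// watches reports whether self appears in the monitor list.
func watches(monitors []string, self string) bool {
	for _, m := range monitors {
		if m == self {
			return true
		}
	}
	return false
}

// claim lets monitor self take under-monitored devices until it reaches
// its capacity (an assumed per-server limit), skipping devices it already
// watches. It returns the devices claimed in this round.
func claim(a Assignments, self string, minMonitors, capacity int) []string {
	load := 0
	for _, mons := range a {
		if watches(mons, self) {
			load++
		}
	}
	var claimed []string
	for _, dev := range underMonitored(a, minMonitors) {
		if load >= capacity {
			break
		}
		if watches(a[dev], self) {
			continue
		}
		a[dev] = append(a[dev], self)
		claimed = append(claimed, dev)
		load++
	}
	return claimed
}

func main() {
	a := Assignments{"r1": {"m1", "m2"}, "r2": {"m1"}, "r3": {}}
	fmt.Println(claim(a, "m3", 2, 5)) // m3 picks up the least-covered devices first
}
```

Serving the least-covered devices first means that a monitor with little spare capacity spends it where redundancy is weakest.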
This scenario and its requirements, especially those related to information distribution and replication, are very well aligned with some of the objectives of the LightKone project and the tools that it provides.
A conceptual depiction of the system is shown in Figure 2. There, a group of routers representing the actual Guifi.net network devices is depicted in the centre, interconnected and forming a meshed network. Around them, several monitoring servers are shown, with dotted arrows indicating that they exchange information and coordinate with each other. In line with the requirements above, each of the devices is assigned to more than one monitoring server, as indicated by the coloured lines linking them. Since each server may have different capacity or available resources, some of them are in charge of watching more network devices than others.
Getting into more detail, monitoring servers consist of various software pieces, as depicted in Figure 3. A typical monitoring server, i.e., a full-blown monitor, runs a local AntidoteDB database instance that connects with other AntidoteDB instances running elsewhere in Guifi.net. Together, these instances provide a highly available, geo-replicated, distributed and decentralised storage for the other monitoring server components to interact with. There are three such components: assign, ping and snmp. As also shown in the figure, lightweight monitoring servers consist of the same three components but lack a local AntidoteDB instance; instead, they rely on a remote AntidoteDB instance running on another monitoring server to assist in the coordination process and provide the required storage.
The assign component runs on each monitoring server and takes care of keeping the local list of monitoring servers ↔ network devices up to date. For instance, when a network device is not being monitored by the minimum required number of servers, one of them eventually starts watching it so that the requirement is met. This new assignment is immediately written to AntidoteDB and propagated across the network, eventually reaching the rest of the monitoring servers. The assignment between monitoring servers and network devices is dynamic and evolves over time, as network devices are added to the network or removed from it, or as workload balancing requires devices to be reassigned from one monitoring server to another. The different assign components do not talk to each other directly; instead, they coordinate indirectly by means of shared data structures on AntidoteDB. Last, but not least, the assign components can detect when a monitoring server has failed, unassign the network devices it was watching and take over its monitoring duties.
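The failure handling described above can be sketched in Go as follows. The text does not say how a failed monitoring server is detected, so the heartbeat timestamps and the timeout used here are assumptions, and `failedMonitors` and `unassign` are hypothetical helper names; the maps stand in for data that would live in the shared storage.

```go
package main

import (
	"fmt"
	"sort"
)

// failedMonitors returns the monitors whose last heartbeat (Unix seconds)
// is more than timeout seconds older than now. Heartbeats are an assumed
// detection mechanism, not one confirmed by the text.
func failedMonitors(lastSeen map[string]int64, now, timeout int64) []string {
	var failed []string
	for m, t := range lastSeen {
		if now-t > timeout {
			failed = append(failed, m)
		}
	}
	sort.Strings(failed)
	return failed
}

// unassign removes the failed monitors from every device's monitor list
// and returns the devices that lost at least one monitor, sorted by ID.
func unassign(assignments map[string][]string, failed []string) []string {
	dead := make(map[string]bool, len(failed))
	for _, m := range failed {
		dead[m] = true
	}
	var affected []string
	for dev, mons := range assignments {
		var kept []string
		for _, m := range mons {
			if !dead[m] {
				kept = append(kept, m)
			}
		}
		if len(kept) != len(mons) {
			assignments[dev] = kept
			affected = append(affected, dev)
		}
	}
	sort.Strings(affected)
	return affected
}

func main() {
	lastSeen := map[string]int64{"m1": 100, "m2": 40}
	assignments := map[string][]string{"r1": {"m1", "m2"}, "r2": {"m1"}}

	failed := failedMonitors(lastSeen, 105, 30)
	fmt.Println("failed:", failed)
	fmt.Println("need a new monitor:", unassign(assignments, failed))
}
```

The devices returned by `unassign` are exactly those that may have dropped below the minimum number of monitors, so they are the natural candidates for the surviving servers to take over.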
The ping and snmp components are also found on every monitoring server and take care of the actual probing of network devices. These pieces of software periodically test a list of network devices for their responsiveness and uptime (by measuring the round-trip time of ping packets) and gather information about the traffic on their interfaces (by querying the routers via SNMP). All the collected data are stored on AntidoteDB, which ensures they are automatically replicated and distributed to all the servers.
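As an illustration of the traffic measurements, the Go sketch below turns two successive readings of an interface octet counter into an average rate. The text does not specify which SNMP objects are queried; the sketch assumes a 32-bit counter such as `ifInOctets` from the standard IF-MIB, which wraps around to zero after reaching 2^32 - 1. The function names are hypothetical.

```go
package main

import "fmt"

// counterDelta returns the number of octets counted between two readings
// of a 32-bit SNMP counter. Unsigned subtraction in Go wraps modulo 2^32,
// so a single counter rollover between the readings is handled implicitly.
func counterDelta(prev, cur uint32) uint32 {
	return cur - prev
}

// bitsPerSecond converts two readings taken intervalSec seconds apart
// into an average rate. A non-positive interval yields 0.
func bitsPerSecond(prev, cur uint32, intervalSec float64) float64 {
	if intervalSec <= 0 {
		return 0
	}
	return float64(counterDelta(prev, cur)) * 8 / intervalSec
}

func main() {
	// The counter wrapped between the two polls: 296 octets were counted
	// before the rollover and 704 after it, 1000 octets in total.
	fmt.Println(bitsPerSecond(4294967000, 704, 10), "bit/s")
}
```

The approach is only valid if the polling interval is short enough that the counter cannot wrap more than once between two readings, which on fast links argues for frequent polling or for 64-bit counters where the device offers them.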
As can be seen, the new monitoring system differs from the legacy one in several ways that fulfil the design requirements listed above: each network device is redundantly monitored by more than one server (and the servers make sure that this remains true), the assignment list is dynamically updated without manual intervention, and the collected data are replicated at different places, avoiding single points of failure.
The new monitoring system relies on AntidoteDB to implement a geo-replicated and distributed storage. On top of it, the actual monitoring components use shared data structures for concurrent reading and writing of the assignments between network devices and monitoring servers. To ensure data integrity, the shared data structures are implemented with conflict-free replicated data types (CRDTs). Using AntidoteDB relieves developers of the complex task of implementing a synchronisation protocol between the monitoring servers that keeps data coherent throughout the system, regardless of possible failures or network partitions. Additionally, it provides an automatic mechanism to replicate and distribute the monitoring data all over the network, avoiding the burden of having to request monitoring data from different servers and assemble them in order to obtain detailed information about a specific device.
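To give an intuition of why CRDTs let monitoring servers update the assignments concurrently without a synchronisation protocol, the following Go sketch implements a simple state-based, add-wins set of monitor IDs for a single device. It is a teaching example only: it does not use the AntidoteDB client or reproduce AntidoteDB's internals, the text does not state which CRDT types the system uses, and the `ORSet` type and its methods are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// ORSet is a state-based observed-remove (add-wins) set. Every add gets a
// unique tag; a remove cancels only the tags it has observed, so an add
// that happens concurrently with a remove survives the merge.
type ORSet struct {
	replica string
	counter int
	adds    map[string]map[string]bool // element -> tags that added it
	removed map[string]bool            // tombstoned tags
}

func NewORSet(replica string) *ORSet {
	return &ORSet{replica: replica, adds: map[string]map[string]bool{}, removed: map[string]bool{}}
}

// Add inserts elem under a fresh tag unique to this replica.
func (s *ORSet) Add(elem string) {
	s.counter++
	tag := fmt.Sprintf("%s:%d", s.replica, s.counter)
	if s.adds[elem] == nil {
		s.adds[elem] = map[string]bool{}
	}
	s.adds[elem][tag] = true
}

// Remove tombstones every tag currently observed for elem.
func (s *ORSet) Remove(elem string) {
	for tag := range s.adds[elem] {
		s.removed[tag] = true
	}
}

// Merge folds the state of other into s: a union of adds and tombstones,
// which is commutative, associative and idempotent.
func (s *ORSet) Merge(other *ORSet) {
	for elem, tags := range other.adds {
		if s.adds[elem] == nil {
			s.adds[elem] = map[string]bool{}
		}
		for tag := range tags {
			s.adds[elem][tag] = true
		}
	}
	for tag := range other.removed {
		s.removed[tag] = true
	}
}

// Elements returns the elements with at least one live tag, sorted.
func (s *ORSet) Elements() []string {
	var out []string
	for elem, tags := range s.adds {
		for tag := range tags {
			if !s.removed[tag] {
				out = append(out, elem)
				break
			}
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	a, b := NewORSet("A"), NewORSet("B")
	a.Add("monitorA") // server A starts watching the device
	b.Merge(a)        // server B learns about it

	// During a partition, B drops A and adds itself; A re-asserts itself.
	b.Remove("monitorA")
	b.Add("monitorB")
	a.Add("monitorA")

	// Once connectivity returns, both replicas converge to the same set.
	a.Merge(b)
	b.Merge(a)
	fmt.Println(a.Elements(), b.Elements())
}
```

Because merging is a plain union, replicas can exchange state in any order and any number of times and still end up identical, which is the property that makes explicit coordination between servers unnecessary.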
The three monitoring components (assign, ping and snmp) have been developed in a proof-of-concept implementation that uses the Go language and interacts with AntidoteDB through its Go client. The source code is currently available at our GitLab repository. As the LightKone project comes to an end, the software is being rewritten from scratch and will soon be packaged and made available together with the rest of the Guifi.net codebase.
Additionally, you can learn more about the new monitoring system in our full paper published at IEEE SOCA 2019.
By Roger Pueyo Centelles, Mennan Selimi, Felix Freitag, and Leandro Navarro, Universitat Politècnica de Catalunya