OpenWISP Monitoring — GSoC 2020 Project Report
Over the last few months I have been working on OpenWISP, an open source project selected for Google Summer of Code 2020, under the mentorship of Federico Capoano and Pablo Castellano.
The end product of my entire summer, and of the work of the many other contributors to the module, is:
We aimed to create a monitoring module for the devices (e.g. routers) used in a network, so that they can easily be monitored live with the help of multiple charts (you can easily add one as per your use case). If a problem arises with any of the devices (caused by, say, configuration issues, CPU overload, disk space outside safe limits or many other such issues), the user is notified in real time (you can set a tolerance for the alerts so you don’t get mail bombed 😅).
There are many more features too, which are explained briefly in The Work section. The code is available in the openwisp-monitoring repository.
Seems interesting, exciting? Want to know the journey? Here you go.
What is OpenWISP?
OpenWISP is a network management system that allows managing and automating several aspects of a network:
- dynamic auto-configuration of new nodes
- creation of VPN tunnels
- initialization of WiFi access points
- configuration of mesh networks
- configuration of any other network configuration supported by OpenWRT
The beginning:
I introduced myself in the community IM channel and started making contributions wherever possible. During this phase I learnt a lot about the general approach, patterns and practices adopted by the community. I also developed a deeper understanding of open source projects and their importance in society. Simply put, this was an enlightening experience, as I was in awe of the humble nature and dedication of the maintainers of an open source project.
Coming to the project selection, there were 4 projects announced by OpenWISP. I wasn’t quite sure which one of them would bring the most learning (this is what I aimed for in GSoC: to learn and explore). At that time Federico (one of the core maintainers of the project and also my mentor) suggested in the public channel that openwisp-monitoring was a very exciting project to work on, given the technologies involved. So I read through the measurable outcomes and was determined to apply for it.
While working on the proposal I ran into a lot of questions, since I had never worked on most of these things before. I asked them on the community’s mailing list and on IM to get clarification on most of them. Do ask questions, a lot of them, until you are sure you understand what the problem is. You will end up saving your own time and your reviewer’s (this also happens to be my most important learning this summer 😄).
I submitted my draft proposal and got critical feedback at first 😬. This pushed me to try harder: I continued reading the documentation of the technologies involved, diligently went through existing examples and best practices, and kept iterating on my proposal until I was sure that this is it 😤. I believe one’s proposal best shows whether an individual understood the problem.
The Journey
After getting selected we didn’t waste much time and started off with setting up the Kanban board and opening up the issues on GitHub.
A few general guidelines were posted for all the selected students, and the mentors organized calls to kick-start the project and give us a smooth start.
The Work
I have noted all the cool Pull requests we worked upon this summer below. The title for each of them has been linked to the relevant Pull request.
Resources metrics
We added resources metrics. This was a very worthwhile experience as I got to work with Lua, OpenWRT and Plotly (I had very little front-end knowledge before the summer). Noumbissi Valere helped me to get OpenWRT set up with OpenWISP, and Pablo Castellano helped me a lot to get used to it and to Lua. The end product is as above, but getting there wasn’t so simple: I had to familiarize myself with InfluxQL, update the existing Lua script so that the relevant resources data is passed to the monitoring module in NetJSON format, and understand exactly how Metrics, Charts and AlertSettings are related, all while ensuring that I didn’t break anything :)
Swappable Models
One of the core principles at OpenWISP is the reusability and extensibility of the modules being developed, so that developers who need to customize or extend OpenWISP don’t waste time reinventing the wheel, but rather work on real improvements and, if possible, always contribute back to the project :)
Swappable models help developers who want to extend or modify some fields of an existing Model. They can easily do so by making use of this feature of OpenWISP Monitoring. (If you are interested, this has been nicely documented in the module’s docs.)
Pushing up coverage and speeding up the tests
We were successful in pushing up the coverage, trying to cover all possible lines by adding new tests. The most important thing I learnt is that writing a test is no big deal, but writing an effective test that considers multiple possible cases, especially an acid test for some bug fix 💭, surely is an art. Currently coverage is above 99%. The next thing we did was speed up the tests. We mocked the data wherever it was safe and possible, reduced the number of requests being made and tried out a few more optimizations. As a result, the tests now run almost 1.5 times faster on Travis CI.
Abstraction Layer
This PR was a great learning experience for me: I learnt why abstraction is an important concept for simplifying things for the end user, and realized that most of the things I took for granted were easy to use only because they were abstracted well. The module’s code was earlier highly specific to InfluxDB. We abstracted away the code responsible for communicating with the timeseries server and created a separate layer between the InfluxDB API and the code using it. We called this layer the Timeseries Client. Users of any database supported by the module need not know how to communicate with it. It should be as easy as switching over the following setting, and that’s all.
InfluxDB setting (Fully supported)
Elasticsearch setting (Currently in development)
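To illustrate the idea behind such a layer, here is a minimal sketch in plain Python. It is not the module's actual code: the class names, the `BACKENDS` registry, the `get_client` helper and the `'memory'` backend are all hypothetical, chosen only to show how the calling code can stay the same while a setting selects the backend.

```python
from abc import ABC, abstractmethod


class TimeseriesClient(ABC):
    """Hypothetical common interface; the real module's API may differ."""

    @abstractmethod
    def write(self, name, values, timestamp=None):
        """Store one data point for the metric called ``name``."""

    @abstractmethod
    def read(self, name):
        """Return all stored points for the metric called ``name``."""


class InMemoryClient(TimeseriesClient):
    """Toy backend standing in for a real database client."""

    def __init__(self):
        self._points = {}

    def write(self, name, values, timestamp=None):
        self._points.setdefault(name, []).append((timestamp, dict(values)))

    def read(self, name):
        return list(self._points.get(name, []))


# Hypothetical registry mapping a setting value to a backend class.
BACKENDS = {'memory': InMemoryClient}


def get_client(settings):
    """Instantiate the backend named in the (assumed) settings dict."""
    backend = settings['BACKEND']
    if backend not in BACKENDS:
        raise ValueError(f'Unsupported timeseries backend: {backend}')
    return BACKENDS[backend]()


client = get_client({'BACKEND': 'memory'})
client.write('ping', {'rtt': 1.5}, timestamp=10)
print(client.read('ping'))
```

The calling code only ever touches `write` and `read`, so supporting another database would mean adding one more class to the registry rather than changing every caller.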
AlertSettings and Check Inline
We wanted two tabs: one to make it easier for users to manage the thresholds of the various AlertSettings related to a device, so that they don’t get mail bombed, and another where they can easily view all the checks that are being run periodically in the background. There were a few hiccups initially with the AlertSettings inline, as there wasn’t any direct relation between the AlertSettings model and the Device model, so we used nested admin to make up for this shortfall.
Global configuration
Without requirements or design, programming is the art of adding bugs to an empty “text” file.
— Louis Srygley
It was Federico’s idea that since we had too many things in a single module (Charts, Metrics, AlertSettings, Notifications, etc.), it made sense to have a single place where we can easily configure default values for all of them. In production there will be hundreds of devices that a network operator has to manage. Changing values for all of them individually would be a nightmare, so we designed one global configuration which can be overridden easily; if only a single value in the configuration needs to be changed, a setting lets the user change just that value. If you are interested in reading more about this feature, it has been nicely documented in the module’s docs.
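The "override a single value" idea can be sketched as a recursive merge of user overrides onto a default configuration. This is only an illustration under assumptions: `DEFAULT_CHARTS`, its keys and the `merge_config` helper are hypothetical and do not reflect the module's real settings or defaults.

```python
import copy

# Hypothetical defaults; the real module ships its own configuration.
DEFAULT_CHARTS = {
    'uptime': {'title': 'Uptime', 'unit': '%', 'order': 100},
    'cpu': {'title': 'CPU load', 'unit': '%', 'order': 200},
}


def merge_config(defaults, overrides):
    """Return a new dict where ``overrides`` is layered over ``defaults``.

    Nested dicts are merged key by key, so overriding one value
    leaves its sibling values untouched. Neither input is modified.
    """
    result = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge_config(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


# The operator changes only the title of one chart.
charts = merge_config(DEFAULT_CHARTS, {'cpu': {'title': 'CPU usage'}})
print(charts['cpu'])
```

Because the merge works on a copy, the defaults stay intact and can be reused for every device that does not need an override.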
I learnt an important concept here that good tools are good by their very design 😇.
The second Timeseries database
- The need to provide users with a second timeseries database arose from the fact that InfluxDB doesn’t offer open source clustering support, so some users might want an alternative.
- Hence we wanted to select a timeseries database that is horizontally scalable and open source, with an active community and good documentation. We looked at and compared multiple options, including Prometheus, VictoriaMetrics and a few others.
- Finally we went for Elasticsearch, after I developed a basic prototype which could be used along with the above Abstraction Layer. Now, Elasticsearch is more of a search index than a timeseries database, so it was a slightly risky decision, but owing to its very good documentation we were able to safely couple it with the existing code without any major rework. During this phase I learnt PromQL and Elasticsearch-dsl, did exhaustive reading and in the end created a draft PR for it.
- This happens to be the most exciting part of the entire GSoC and one of the most challenging ones too, with a steep learning curve. Currently there is one issue that still needs to be fixed for this to be fully useful. I will try to fix it, because I think this is a very big add-on for the module and one that will really be helpful to users.
Making tasks resilient to failures to prevent metric data loss
Whenever the timeseries database was overloaded, write operations would sometimes fail, and this caused loss of metric data in a few deployed systems. To solve this problem we adopted the following workaround: if an exception related to the timeseries database is raised, we catch it, log it and retry the write in the background, using Celery to perform those tasks asynchronously and ensuring that the timestamp remains the same (since the timestamp is the characteristic property of timeseries data).
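The core of this workaround, fixing the timestamp once and reusing it on every retry, can be sketched without Celery. The example below uses a plain loop instead of a background task queue, and `TimeseriesWriteError`, `write_with_retry` and its parameters are hypothetical names used only for illustration.

```python
import logging
import time
from datetime import datetime, timezone

logger = logging.getLogger(__name__)


class TimeseriesWriteError(Exception):
    """Hypothetical error raised when the database rejects a write."""


def write_with_retry(write, name, values, timestamp=None,
                     max_retries=3, delay=1.0):
    """Call ``write(name, values, timestamp)``, retrying on failure.

    The timestamp is decided once, before the first attempt, so a
    point written on the third try is still recorded at the moment
    the measurement was taken.
    """
    timestamp = timestamp or datetime.now(timezone.utc)
    for attempt in range(1, max_retries + 1):
        try:
            return write(name, values, timestamp)
        except TimeseriesWriteError as error:
            logger.warning('write of %s failed (attempt %d/%d): %s',
                           name, attempt, max_retries, error)
            if attempt == max_retries:
                raise
            time.sleep(delay)
```

In a deployment the retry would be scheduled by the task queue rather than by sleeping in the caller, but the important detail is the same: the timestamp travels with the task arguments instead of being regenerated on each attempt.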
Documentation and CI
We adopted the same styling practices as other OpenWISP modules: flake8 and isort for linting, and black for formatting. The module has been deployed on many instances, so the feedback from them was helpful in noticing bugs. We then worked to fix those bugs and added regression tests for them. Whenever we added a new feature we made sure to document it. In the end we ensured that the docs are properly structured, ordered and updated. We also documented a list of all available API endpoints, added illustrative pictures to help in visualizing the module and created an animated GIF that lets new users see the features of the module at a glance :)
Besides these there were multiple minor Pull requests that we worked upon. If you are interested in knowing more about them, you may refer to the Done column for a list of all the completed tasks.
Tech worked upon
OpenWRT | Lua | Python | Django | Django REST framework | InfluxDB | Elasticsearch | Docker | JS | CSS | HTML | Celery | and not to forget OpenWISP 🙌
What next …
My freshman summer was a bit hectic, since alongside this amazing GSoC project I took part in two exciting nationwide competitions and completed some great courses on Coursera. I plan to continue contributing to open source and to help improve OpenWISP Monitoring so that more users can use it with less trouble 😅. The aim, as usual, will always be to keep learning and exploring new domains :)
I would like to thank my mentors, without whom I am not sure I would have been able to make it this far, as well as the community as a whole and the other GSoCers, from whom I learnt quite a few things. Most importantly, I would like to thank Google for this gem of a program, where students familiarize themselves with open source and become better programmers 😄
If you made it this far, thanks a lot. I hope this gives you some insight into my GSoC experience, and I wish you the best for your GSoC proposal 😇