My GSoC 2019 with Developers Italia
My experience with Developers Italia under Google Summer of Code is nearing its end: it is time to draw conclusions and outline the results of this enriching 4-month adventure, which engaged me in unexpected ways and gave me new tools to face the challenges of my professional future.
Where it all started
Google Summer of Code is a global summer program meant to encourage students to take part in open source software projects by putting them in contact with mentoring organizations. The process starts in April, when students send their applications to their organization(s) of choice. That is when I stumbled upon Developers Italia, a community dedicated to the development of open source software in support of Italy’s digital public services. The community exists thanks to the efforts of the Team per la Trasformazione Digitale, a team of experts under the Italian government that coordinates the “digital transformation” of the Italian Public Administration by building the new “operating system” of the country.
Developers Italia recommended some project ideas, specifically in the field of data visualization, that I found compelling: this, and the awareness that my work would be, even in small part, a contribution to my country, persuaded me to send a project proposal. My initial plan expressed, in broad terms, my intention to work on a data visualization tool that would interface with the Piattaforma Digitale Nazionale Dati (PDND), a portal that aggregates Italy’s Open Data at the national level in a way that is both standardized and easily accessible. My proposal was accepted and, after further discussion with my assigned mentors, we decided to focus on PDND-nteract, a fork of nteract created by the Team Digitale to ease the retrieval of datasets from PDND for further analysis (a demo video is available).
Let the coding begin
During the 3-month work period, most of my efforts were focused on data science analyses using the aforementioned tool: with help from my mentor Alessandro Ercolani, I got to know the tools and their functionalities. I worked on a total of three data science notebooks, in increasing order of complexity.
The aim of this work was to create a foundation of analyses that could show the capabilities of the tool and provide a baseline of code that could be reused and adapted to different users’ needs. Throughout this process, I inserted explanatory paragraphs between blocks of code, so that the reader could follow the steps whenever the code wasn’t sufficiently self-explanatory. The content was made with variety in mind, providing tables, different types of charts and several thematic maps, so that users could find their preferred representation and learn how to obtain it. My expectation was that readers would gain the means to produce data science analyses of their own, tapping into the vast resources of the Italian Open Data; they could later share their analyses on the pdnd-open-notebooks repository for the benefit of the community, which could, in turn, build upon or adapt them.
The technical aspects
The notebooks are written in Python and the datasets are manipulated using the pandas library: PDND-nteract is able to parse pandas dataframes to automatically create interactive visualizations, and offers tools for high-level parameter tuning. The more fine-tuned charts were created manually with the matplotlib library, which allows for in-depth customization at the expense of being somewhat complex. To create choropleth maps, which are needed to represent the data from a geographical standpoint, I used the geopandas library: it imports geospatial data into pandas-compatible dataframes, which can then be merged with pre-existing datasets to augment them with map geometries. The merged datasets are then used to make the final visualization with matplotlib.
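To make the merge-then-plot workflow concrete, here is a minimal sketch. It is not taken from the notebooks: the two square "regions", the column names `region` and `value`, and the numbers are all made up for illustration, and real boundaries would normally be loaded from a file with `geopandas.read_file`.

```python
import geopandas as gpd
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
from shapely.geometry import Polygon

# Hypothetical geometries: two unit squares standing in for real boundaries.
shapes = gpd.GeoDataFrame(
    {"region": ["A", "B"]},
    geometry=[
        Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
        Polygon([(1, 0), (2, 0), (2, 1), (1, 1)]),
    ],
)

# Hypothetical statistics, keyed by the same region name.
stats = pd.DataFrame({"region": ["A", "B"], "value": [10.0, 25.0]})

# Merging from the GeoDataFrame side keeps the geometry column,
# so the result can still be drawn as a map.
merged = shapes.merge(stats, on="region")

# Choropleth: each shape is coloured according to its "value".
fig, ax = plt.subplots(figsize=(6, 3))
merged.plot(column="value", cmap="OrRd", legend=True, edgecolor="black", ax=ax)
ax.set_title("Illustrative choropleth")
ax.set_axis_off()
```

Once the axes object exists, any further tweak (titles, colour maps, legends) is ordinary matplotlib, which is where most of the customization effort goes.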
I started with an “introductory” notebook, meant to showcase a typical use case of PDND-nteract: for this purpose, I analyzed the “ISTAT Comuni Italiani” dataset, which contains the complete list of Italian towns along with information such as population, province and region. I manipulated the dataframe and used it to make some bar charts; I also got to use the geopandas library for the first time, to make a thematic map showing the population density by region.
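As a rough idea of the aggregation behind a density-by-region map, the sketch below sums town-level figures per region and divides population by area. The rows, the column names and the presence of an area column are assumptions made for the example, not the actual schema of the ISTAT dataset.

```python
import pandas as pd

# Hypothetical town-level rows; names and figures are invented.
comuni = pd.DataFrame({
    "comune": ["Alfa", "Beta", "Gamma", "Delta"],
    "regione": ["Nord", "Nord", "Sud", "Sud"],
    "popolazione": [1000, 3000, 500, 1500],
    "superficie_kmq": [10.0, 30.0, 20.0, 20.0],
})

# Sum population and area per region, then derive inhabitants per square km.
by_region = comuni.groupby("regione")[["popolazione", "superficie_kmq"]].sum()
by_region["densita"] = by_region["popolazione"] / by_region["superficie_kmq"]

# Most densely populated regions first.
by_region = by_region.sort_values("densita", ascending=False)
```

Note that the density is computed from the regional totals rather than by averaging the densities of the individual towns, which would give small towns too much weight.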
The aim of the second analysis was to present some data regarding the polls of the 2019 EU elections in Italy, in chart and map form, at the national and provincial level. I was able to apply my previous experience to create a varied set of visualizations on one of the most topical subjects at the time.
For this notebook, I was able to produce more complex thematic maps using different subsets of the data, as shown in the images. This was made possible by an extensive study of the documentation of matplotlib, the library geopandas relies on to tweak the parameters of the output graphics.
The third analysis focused on the causes of death in Italy, covering the current state of things (updated to 2016) and its evolution in recent years. This was by far the most complex work, since it deals with an extremely rich dataset that encompasses multiple years, territorial subdivisions and diseases, and represents one of the main sources for evaluating the health status of the population.
This last notebook has been co-authored with my mentor Alessandro Ercolani: as a disclaimer, neither of us is a physician, so the results that we obtained only serve a demonstrative purpose.
I started with an analysis of the macro-causes of death, followed by a chart showing the top causes broken down in more detail.
With malignant tumors being among the most common causes of death, I thought it appropriate to analyze them separately, plotting a chart of the incidence by type and sex.
I also made some thematic maps showing, for different diseases, which provinces are most affected; for this visualization I needed to merge in each province’s population from a different dataset, so that provinces of different sizes could be compared.
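The reason for bringing in population is that raw death counts mostly reflect how many people live in a province. A minimal sketch of the normalization step follows; the provinces, column names and figures are hypothetical and only illustrate the pandas calls involved.

```python
import pandas as pd

# Hypothetical death counts for one cause, by province.
deaths = pd.DataFrame({
    "provincia": ["P1", "P2", "P3"],
    "decessi": [50, 120, 30],
})

# Hypothetical population table from a separate source; P3 is missing on purpose.
population = pd.DataFrame({
    "provincia": ["P1", "P2"],
    "popolazione": [100_000, 400_000],
})

# Left merge keeps every province from the deaths table;
# validate= raises an error if a province appears twice on either side.
merged = deaths.merge(population, on="provincia", how="left", validate="one_to_one")

# Deaths per 100,000 inhabitants, comparable across provinces of any size.
merged["tasso_per_100k"] = merged["decessi"] * 100_000 / merged["popolazione"]

# Provinces with no population match end up with NaN and deserve a manual check.
missing = merged.loc[merged["popolazione"].isna(), "provincia"].tolist()
```

In this toy data P2 has more deaths than P1 in absolute terms but a lower rate, which is exactly the kind of difference a map of raw counts would hide.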
The second part of the notebook focuses on the trend for the period 2003–2016: I started with some line charts regarding AIDS, lung tumors and heart attacks. This is followed by a more detailed breakdown of the trend in the incidence of tumors in Italy’s provinces, represented through line charts and a thematic map.
The list of my contributions to these data analyses can be found in the pdnd-open-notebooks repository on GitHub, complete with instructions to set up the environment and explore the notebooks. The readme of the repository details the steps that new users should follow to create their own data analysis and contribute to the community.
Towards the end of the coding period I also made a contribution to the pdnd-openapi-server repository: even though it was a small addition, it required some thorough research that brought me into contact with the OpenAPI 3.0 specification and offered an interesting insight into the APIs that enable PDND-nteract to fetch datasets.
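Trend charts of this kind usually start from a "long" table with one row per year and cause, reshaped so that each cause becomes its own column. The sketch below shows one way to do that in pandas; the causes `X` and `Y`, the years and the counts are invented and do not come from the real dataset.

```python
import pandas as pd

# Hypothetical long-format records: one row per (year, cause).
records = pd.DataFrame({
    "anno": [2003, 2004, 2005, 2003, 2004, 2005],
    "causa": ["X", "X", "X", "Y", "Y", "Y"],
    "decessi": [100, 90, 80, 200, 210, 220],
})

# Wide format: years as the index, one column per cause.
trend = records.pivot(index="anno", columns="causa", values="decessi")

# Rebase every series to 100 in the first year, so that causes with very
# different absolute counts can share the same vertical axis.
indexed = trend * 100 / trend.iloc[0]
```

Calling `indexed.plot()` on the result draws one line per cause through matplotlib, with the years on the horizontal axis.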
My GSoC adventure may be ending, but my participation in the open source software space certainly isn’t! This experience pushed me to learn many new things and to take better advantage of my tools; it taught me to organize my priorities around deadlines and to communicate with the people in charge. It was challenging at times, but the stimulating and fun side of it always prevailed. I am very grateful for this rewarding experience and I thank Developers Italia and my mentors for making it possible.