Data & Analytics Framework

From “this data is mine and I manage it” to “these data are ours and we manage them together”

Raffaele Lillo

Published in Team per la Trasformazione Digitale, Feb 20, 2017

This article is also available in Italian

“Submersion” is a serious phenomenon that affects the whole of Italy. We are not, in this case, referring to the all-too-common practice of tax evasion. There is another, equally large source of capital to bring to the surface, and its mechanism of “recovery” is technological, organizational and legislative. There is a great resource that isn’t able to emerge and make itself useful to citizens, businesses and all public administrations because it is fragmented, scattered across different places, imprisoned by rules and practices that impede movement, sharing and optimal use.

Public information is the enormous data set that describes the realities of citizens and businesses — where and how we live, what we do — and, like the first public investments in telecommunication, represents a strategic asset to take advantage of, even with state intervention.

Public information pertains to all of us and is necessary to launch businesses, conduct activities and access public services. It contains data the PA needs to offer services; data that can help the State identify problems more efficiently and develop better solutions; data through which citizens might get to know the actions of the State and assess its results. Gaining access to public data and benefitting from it are essential components of the New Operating System of the Country. We would like to achieve these goals during our mandate, using this channel to provide you with step-by-step updates on our progress (and, if you want to help us write the chapter, Data, contribute to our discussion group or use #DataGovernment).

This task is within our reach

In Italy, there are many entities already at work on these issues who are capable of systematizing and contributing their own experiences. The in-house development teams of the different public organizations we are working with to build the technological foundations of this project have many valuable skills. They are a great resource, but more importantly, they share our vision: to finally make public data available and accessible to the different public administrations and to all of civil society, in the full knowledge that access to public information will not only make life easier for enterprises, but can offer new opportunities for business.

Imagine not having to fill out hundreds of forms with the same information, or being able, with a single click, to manage your demographic and residential information, view the daily updated status of your business, and communicate with the PA; imagine smarter cities, with buildings that alert the system to structural damage and road networks that interact with traffic control tools and monitoring centers; imagine being in full possession of public information, able to see how your taxes are being spent and what type of services you are receiving; and, finally, imagine all of these capabilities placed (with the proper privacy and security precautions) in the service of civil society, which will be able to use public data to develop new services, generate value and employment opportunities and finally give life to the much-invoked API economy. (1)

Overcoming this technological and processual challenge will require a change in mentality.

To “exit the silos,” we will have to overcome the traditional mindset of “this data is mine and I manage it” and embrace the realization that public data is an asset of the State (and therefore of all public administrations) and of civil society in general. The task of collecting, managing and making data available will instead be entrusted to managers working within their own areas of expertise. We need an alert and comprehensive command center to reorganize and unite existing resources and to ensure that new projects are born already open to data interchange. It will be crucial to systematize available talents and positive forces: the Team for Digital Transformation has begun this process by working, on an experimental basis, with a select pool of State and Municipal software houses to build, as quickly as possible, the first prototype of a Data & Analytics Framework. The framework combines a big data platform (which acquires, processes and provides information for analysis and machine-to-machine interchange) with two teams: a team of data scientists who work on the data and propose new prototypes and solutions, and a team of data visualization experts who use visual storytelling to convey the information present in the data effectively and make it more accessible. At the same time, we are working with CONSIP to acquire cloud computing resources and elastic storage to complement the specific needs of the private cloud.

This is a process of opening and rationalizing information resources, which would ideally be developed on a dual track: on the one hand, public data will be incorporated into a single central framework that will guarantee standardization, coherent interconnectedness (it will become possible to connect data from different sources and vertices that relate to a single phenomenon or entity) and consistency of use (API and dashboard themes).

In other words, it will be possible to use multiple sources to look at phenomena from different angles and to look directly into the framework, without having to worry about reconciling heterogeneous data.

Furthermore, the public administrations and municipalities that do not have the skills and resources to invest in the design and management of yet another infrastructure will be able to use the resources made centrally available by the Data & Analytics Framework (in the aforementioned public/private hybrid cloud). The public administrations will be able to “persist” available data, collect new data (that previously could not have been handled), use it as input in proprietary applications that interact (via API) with the central framework, give it to a team of data scientists (if they have one), and eventually, get support from a central team of experts.

From here on out, this document gets even more “geeky” and difficult to understand for non-technicians.

It… could… WORK!

Data & Analytics Framework (DAF) is the centerpiece of the project. It consists of:

A big data architecture component, tasked with centralizing and storing (data lake), analyzing and synthesizing (data engine), and distributing (layers of communication) data in batch mode or real-time streaming, prioritizing open source software wherever possible.
A team of data scientists, data architects and domain experts charged with the design and conceptual development of the framework, the construction of interconnection models for different data sources, data analysis and development of machine learning models, and working with the software development team to produce data applications.

The function and infrastructure components of the platform can be better explained by using the data life cycle as a template, from information creation and ingestion to its final point of consumption. Below, we propose a constantly evolving high-level architectural design, accompanied by a description of the phases of the cycle.

1. Data Ingestion

Data can come from many different sources: management software (e.g. the Anagrafe Nazionale della Popolazione Residente, ANPR), application software designed to perform specific tasks (e.g. the Apache server logs of a website), data streams (e.g. Twitter feeds, currency converters), text documents (e.g. laws, contracts) and public catalogs. These data are sent to Kafka, where they are treated as distributed streams and redirected to the components that will consume them. Besides being fault tolerant, Kafka is also able to handle the growing complexity that comes with the gradual increase in the scope and operation of the DAF. As soon as they are ingested, the data are saved as-is in HDFS (the Hadoop Distributed File System) to guarantee a layer of persistent raw data. Wherever possible (that is, for structured or semi-structured data), the same data are also saved in a columnar format such as Parquet to improve analytical performance with MPP (massively parallel processing) query engines like Impala and/or Hive, mature technologies that enable SQL-like queries on distributed systems.
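The flow described above (one stream, several consumers, a raw layer and a columnar layer) can be sketched in a few lines. This is only an illustration under stated assumptions: the `Topic` class is a hypothetical in-memory stand-in and does not use Kafka's API, the two Python containers stand in for files on HDFS and for a columnar file, and the sample records are invented.

```python
from collections import defaultdict


class Topic:
    """Tiny in-memory stand-in for a stream topic (hypothetical, not Kafka's API)."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, record):
        # Every subscriber receives every record, in arrival order.
        for callback in self._subscribers:
            callback(record)


raw_layer = []                      # as-is copies, standing in for raw files on HDFS
columnar_layer = defaultdict(list)  # one list per field, standing in for a columnar file


def save_raw(record):
    # Keep an untouched copy of the record exactly as it arrived.
    raw_layer.append(dict(record))


def save_columnar(record):
    # Store each field in its own column so a query can read only what it needs.
    for field, value in record.items():
        columnar_layer[field].append(value)


topic = Topic()
topic.subscribe(save_raw)
topic.subscribe(save_columnar)

# Invented sample events.
for event in [
    {"municipality": "Roma", "service": "residency", "requests": 12},
    {"municipality": "Milano", "service": "residency", "requests": 7},
]:
    topic.publish(event)

# An analytical query touches a single column instead of every full record.
total_requests = sum(columnar_layer["requests"])
```

The point of the two layers is visible in the last line: the sum reads one column, while the raw layer still holds the complete original records for any later reprocessing.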

1*. Data Ingestion: Private Cloud (2)

To accommodate the many different needs of the various public administrations we will be working with, we have planned for the coexistence of clusters installed within data centers managed by a group of software houses and institutions of the PA (we are working on an experimental project to test the viability of this idea with Sogei, ACI Informatica, InfoCamere and ISTAT). Within these environments, the data are anonymized (3) and subjected to the manipulations necessary for consumption. We are working with the Italian Data Protection Authority to better define the methods of anonymization and handling of information relevant to the project.
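Since the actual anonymization methods are still being defined, the following is only a sketch of one possible masking step, not the project's method. It drops direct identifiers and replaces a personal code with a keyed hash (HMAC-SHA256). Strictly speaking this is pseudonymization rather than full anonymization, because whoever holds the key can recompute the mapping. The field names, the secret key and the sample record are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret, assumed to stay inside the data center that runs the cluster.
SECRET_KEY = b"replace-with-a-secret-kept-inside-the-data-center"

# Hypothetical field names; a real schema would differ.
DIRECT_IDENTIFIERS = {"name", "surname", "address"}


def pseudonym(value: str) -> str:
    """Replace an identifier with a keyed hash so records can still be linked."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_record(record: dict) -> dict:
    """Drop direct identifiers and swap the fiscal code for a pseudonym."""
    masked = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    if "fiscal_code" in masked:
        masked["person_id"] = pseudonym(masked.pop("fiscal_code"))
    return masked


# Invented sample record.
record = {
    "fiscal_code": "RSSMRA80A01H501U",
    "name": "Mario",
    "surname": "Rossi",
    "address": "Via Roma 1",
    "municipality": "Roma",
    "birth_year": 1980,
}
masked = mask_record(record)
```

Because the same input always yields the same pseudonym, records about one person can still be connected across datasets without exposing the original code. The remaining fields (here municipality and birth year) may still allow re-identification in small populations, which is one reason the choice of method needs the kind of review described above.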

2. Data Processing

Once on the platform, ingested data can be sent as input, either in parallel with ingestion or in separate batches, to applications (based, for example, on Apache Spark) that manipulate, integrate and/or aggregate them, and save the results in other text or columnar files (e.g. Parquet) or in so-called operational databases (e.g. HBase, Cassandra, MongoDB, Neo4j and the like, depending on the circumstances).
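To make "integrate and/or aggregate" concrete, here is a minimal batch sketch in plain Python. It is not Spark code: it only shows the shape of the operation (aggregate one dataset, then join it onto another by a shared key). The datasets, the `municipality_code` key and all the values are invented for illustration.

```python
from collections import defaultdict

# Two hypothetical datasets from different administrations, sharing a key.
residents = [
    {"municipality_code": "M001", "residents": 2800},
    {"municipality_code": "M002", "residents": 1400},
]
businesses = [
    {"municipality_code": "M001", "sector": "retail", "employees": 40},
    {"municipality_code": "M001", "sector": "tourism", "employees": 25},
    {"municipality_code": "M002", "sector": "retail", "employees": 30},
]

# Aggregate: total employees per municipality.
employees_by_municipality = defaultdict(int)
for row in businesses:
    employees_by_municipality[row["municipality_code"]] += row["employees"]

# Integrate: join the aggregate onto the residents dataset by the shared key.
integrated = [
    {**row, "employees": employees_by_municipality.get(row["municipality_code"], 0)}
    for row in residents
]
```

The `integrated` list is the kind of derived dataset that would then be written to a columnar file or an operational database for later consumption.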

3. Analytics & Prototyping

“Raw” data, as previously explained, can be joined, aggregated and analyzed by a team of data scientists (with direct access via queries on MPP engines like Impala or Hive, and interactive analyses using notebooks integrated with Apache Spark and R) to extract information hidden within the enormous volume and variety of data. At this point, it can be used not only to create new applications and improve existing ones, but also to provide guidance to policy makers. The analysis phase includes the construction of predictive models, machine learning and logistics that will represent the core of the data applications.
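The kind of SQL-like query a data scientist might run can be shown with a small stand-in. The sketch below uses SQLite from the Python standard library purely as a substitute for an engine such as Impala or Hive: the SQL has the same general shape, but nothing here is distributed. The table name, columns and rows are invented.

```python
import sqlite3

# In-memory database standing in for a distributed query engine.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE service_requests (municipality TEXT, service TEXT, requests INTEGER)"
)
conn.executemany(
    "INSERT INTO service_requests VALUES (?, ?, ?)",
    [("Roma", "residency", 12), ("Roma", "business", 5), ("Milano", "residency", 7)],
)

# Aggregate requests per municipality, largest first.
rows = conn.execute(
    """
    SELECT municipality, SUM(requests) AS total
    FROM service_requests
    GROUP BY municipality
    ORDER BY total DESC
    """
).fetchall()
conn.close()
```

With the invented rows above, `rows` holds one `(municipality, total)` tuple per municipality, ready to be charted in a notebook or fed into a model.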

4. API and Data Applications

Data that has been processed and inserted into the operational databases, as well as the predictive models / machine learning / logistics, will be made available to third-party applications through an API ecosystem. The API will be developed according to the pattern of microservice architecture and will represent the core of the model of interoperability between data and services within the PA and between the PA and citizens. For example, those who want to offer it will be able to develop, within their own app, a geolocation service that locates crimes committed over the last six months by calling the specific API and displaying the requested (and continuously updated) data. It will also be possible to send, through the digital citizenship App, notifications to citizens on issues suggested by the recommendation engine (4) (e.g. public debates on topics of interest, laws that affect one’s own profession or geographical area, etc.).

Our imagination is the only limit to what we can build: the citizen’s dashboard will provide infographics that relate to the individual’s life, family and surroundings; a business dashboard will make it possible for entrepreneurs to stay updated on matters related to their activities, the economic sector in which they operate and its geographical context; an NLP (Natural Language Processing) engine will help people navigate the ocean of laws, rules, regulations and decrees; a recommendation engine will use the available data to anticipate the needs of citizens and suggest actions to take within the panorama of available apps; Open Data will no longer be just a container of CSV files, but a system of APIs that provides constantly updated and correctly catalogued information that can easily be searched and understood through data visualization techniques.

These are just some examples of what can be done. To boost our creativity further, we need your help! We are sure that the subject of data is one that will excite the entire scientific community as well as the community of civic hackers, with whom we are happy to collaborate. Follow us on our website and contribute your ideas by participating in the discussion group on this topic, which may be found here, and by writing to us on Twitter with the hashtag #DataGovernment.

(1) API Economy refers to economic activities generated by the construction and cross-use of services exposed via the API (Application Programming Interface).

(2) In the terminology used here, private cloud means the private cloud used by the PA itself, while the public cloud is the cloud that may be accessed by groups outside of the PA.

(3) The anonymizing process refers to the procedures dedicated to the masking of information that could be used to uniquely identify a specific person.

(4) Recommendation engine refers to a model of machine learning that tries to predict the preference of a user relative to a list of items.
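To give note (4) a concrete shape, the sketch below implements one very simple form of recommendation: item co-occurrence counting. It is a toy stand-in rather than the model the project will use, and the user identifiers, service names and usage history are all invented.

```python
from collections import Counter
from itertools import combinations

# Invented usage history: which services each (pseudonymous) user has used.
history = {
    "u1": {"residency", "school_enrollment", "waste_tax"},
    "u2": {"residency", "school_enrollment"},
    "u3": {"residency", "waste_tax"},
    "u4": {"business_registry"},
}

# Count how often two items appear together in the same user's history.
co_occurrence = Counter()
for items in history.values():
    for a, b in combinations(sorted(items), 2):
        co_occurrence[(a, b)] += 1
        co_occurrence[(b, a)] += 1


def recommend(user: str, top_n: int = 3) -> list:
    """Rank items the user has not used by co-occurrence with items they have used."""
    seen = history[user]
    scores = Counter()
    for (a, b), count in co_occurrence.items():
        if a in seen and b not in seen:
            scores[b] += count
    return [item for item, _ in scores.most_common(top_n)]
```

Calling `recommend("u2")` suggests the service that other users with a similar history have also used, while a user whose history overlaps with nobody else's gets an empty list.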