‘Big data’ — a whole new world for statistics, also in Europe

European Court of Auditors · Published in #ECAjournal
Mar 16, 2020 · 15 min read

Interview with Fabio Ricciato and Kostas Giannakouris, Eurostat, the EU’s statistical office.

Big data is a big buzzword, and many people see it either as the solution to, or the cause of, many future problems. One of the many questions is how governments will (be able to) use these data, or what the ever-increasing ‘datafication’ of our society will mean for public institutions that need to distil reliable and useful information from increasingly vast amounts of data. To find out what Eurostat is doing to ‘tame’ the big data beast, we interviewed Fabio Ricciato and Kostas Giannakouris. Fabio, with a background in telecommunications and broad academic experience with data, and Kostas, who has had a long professional career in Eurostat, were both members of Eurostat’s Big Data Task Force. They are both members of Eurostat’s Methodology and Innovation in Official Statistics team, and they were speakers at the ECA Big and Open Data conference, held on 27 and 28 November 2019.

From left to right: Fabio Ricciato and Kostas Giannakouris

By Derek Meijers and Gaston Moonen

Big what?

In 2013, the European Statistical System, the network of Eurostat, national statistical institutes (NSIs), and other national statistical authorities of the EU, published the Scheveningen Memorandum on big data and official statistics. Kostas: ‘At that time, everyone was talking about big data, sometimes without even knowing what it really meant. It was a term à la mode and used by everybody who was trying to create value out of all the new data that became available.’

The phenomenon of big data attracted a lot of attention in the tech world. Eurostat, together with Statistics Netherlands, launched an initiative on big data in official statistics: ‘Following the Scheveningen Memorandum, Eurostat developed an action plan and a road map to understand the phenomenon “big data” and position it within the activities of the European Statistical System’ (see also Figure 1). Fabio adds: ‘Even scoping the work of the Task Force was no easy task, given that the term “big data” means different things in different contexts. I actually find it a bit misleading in the context of official statistics, because it focuses attention on the size of the source data, which is not the most important aspect for us.’

“[…] big data means different things in different contexts. It is actually a bit misleading in the context of official statistics, because it draws attention to the size of the source data, which is one of the least important aspects”.

Figure 1 — Key elements of Eurostat’s Big Data project

Source: Eurostat

A step back is necessary to understand the meaning of big data in the statistical context: it is essential to look at official statistics and how they were produced before the ‘arrival’ of big data, he explains. Fabio: ‘Statistical systems were developed to produce information in a world where data was a scarce resource. Think of life in the 1980s, times in which most people did not create much data at all during their entire lives.’ To illustrate this, he points out that, in those days, you would generally only create data when certain types of events occurred, say when you changed your residence or registered a car. ‘You only created data when you had to register something somewhere, but you could live for weeks and months without producing a single bit of data. There was no GPS, there were hardly any electronic devices, and most purchases were paid for in cash. Sure, there were some digital technologies, but no data technologies. In other words, digital traces were scarce and, anyway, barely stored, hence they did not transform into data!’

“…digital traces were scarce and, anyway, barely stored, hence they did not transform into data!”

Big data or big data

Fabio explains that scarcity of available data implied that statistical offices were the only go-to place for policy makers, journalists, and anybody else who needed to get quantitative facts about society, the economy or the environment. ‘The main task of statistical offices was to produce statistics. In a world of scarce data that implies collecting the source data in the first place, and then processing them to transform them into statistics. The collection process was especially costly.’ He raises the question of data collection methods: ‘Traditionally — by asking all or a selection of people, in censuses or surveys, or, a bit later, by relying on what people have declared for administrative purposes, for example when changing residence or buying a car. These are the traditional data sources.’

Fabio points out the prominent role of survey data for statisticians. ‘Surveys enable us to design the data we will retrieve by deciding whom to ask, what to ask, and when to ask. Because we can control and engineer the data collection, statisticians save costs by not asking all people but only a statistically representative sample, and the subsequent processing of such data is much simpler. This is less true for administrative data, which were generated for administrative purposes and then re-used for statistics. Still, both administrative and survey data require an active declaration by the person concerned or by an administration.’

Fabio notes that the effort to collect data through such traditional channels represented a big share of the total cost of statistics production, to the point that ‘producing statistics’ and ‘collecting source data’ could be seen as almost overlapping concepts. He points out, however, that data collection is merely instrumental to producing final statistics, and that the relation between collection and the other process components, most prominently processing, depends on the nature of the data at hand. ‘At the turn of the millennium, with the introduction of the internet, the World Wide Web, smartphones, GPS, the Internet of Things and online social networks, within a few years our lives became digital. Nowadays, we have even digitalised our physical life. Our smartphone may record our steps, heart rate and sleep rhythm.’ Fabio: ‘If I were to ask you now, would you know how much data you produce as an individual, or as an organisation? You produce data every millisecond! Quite a difference compared to the few times per year thirty years ago!’ Then, laughing: ‘We use the analogy of an individual as a data fountain that is continuously producing, or better, spouting data, and leaving a comprehensive data trace behind. And there are many companies that, like buckets, are continuously collecting your precious data!’

“…You produce data every millisecond! Quite a difference compared to the few times per year thirty years ago!”

It’s not (only) about size

The two Eurostat experts explain that, in the world of official statistics, the term ‘big data’ came to refer basically to any other data source apart from survey data and administrative records.

In this context, Kostas mentions the issue of ‘datafication’. ‘This means that we produce data with everything we do. By expressing our opinion on social media, tracking our fitness, navigating to places, or storing our photos, routes or favourite music online, we live a great part of our lives in the data cloud and through our smartphones.’ He explains that for a statistician, big data is a rather specific concept. ‘If I had to give a definition, I would say that anything left behind when people use information and communication technologies — from cameras to sensors — can be considered big data.’

According to Fabio the term ‘big data’ is actually quite misleading in this context, for two reasons: ‘First, the size of the data is not the most important dimension in our field.’ He explains that some new data types are even smaller than, for example, traditional survey data, and that, even if you have a big amount of data, aspects other than size have more important implications in the context of official statistics. ‘Actually, it is the nature of the data, its characteristics, the way it was produced, by whom and for what purposes, that matters for the way you access and interpret the data. It’s much more about the quality of the data than about their quantity!’ For this reason, Fabio prefers to use the term ‘non-traditional data’ instead of ‘big data’ when talking with colleagues from official statistics offices. He acknowledges that the size of data is a key dimension in other fields, e.g. in computer science. His second argument is that big data is used as an umbrella term for very diverse classes of data. ‘Any meaningful statement referring to the whole “big data” universe can only be on an abstract level. When the discussion gets operational, you must start referring to particular classes of data, e.g. satellite data, data from the internet, or mobile network operator data, and so forth.’

“…It’s much more about the quality of the data than about their quantity!”

Another common mistake, according to Kostas, is the assumption that big data is a uniform concept. ‘My experience has shown that there is not that kind of uniformity where you have a single data source, data class and a related statistical process with statistical indicators. On the contrary, you have to look at various data classes, types, and sources in order to be able to use them in a statistical process with meaningful statistical indicators.’ He adds that at Eurostat they want to produce statistics based on multiple sources. ‘So big data is going to be just another source next to surveys, census, or administrative information.’

When Eurostat first started exploring big data for official statistics, the focus was on raising awareness among its staff and trying to understand why and how big data could be used to produce valuable statistics. Kostas: ‘We should keep in mind that the goal is not to use new data sources for the sake of doing so, but in order to produce better statistics. Using new data makes sense only if they lead to better statistics.’

Fabio underlines that the goal is, and has indeed always been, to produce useful information. ‘In a world where we need to invest 80% of our effort into data collection to be able to produce valuable information with the remaining 20%, we obviously run the risk of mixing the collection and production phases. However, at the end of the day, the goal of statistical offices is not to collect data, but to produce information.’

“…Using new data makes sense only if they lead to better statistics.”

Difficulties when obtaining data

Regarding data access, Kostas explains: ‘The current system — in which national statistical institutes collect and process national data, after which some of this processed data is shared at the European level — is not always a good fit for new data sources, which are often collected by the private sector.’

There is another aspect to this. Fabio: ‘Clearly, we see a need to close the legislative gap between the question of who holds the data — either the individual citizen, a private company, or a public body — and official statistics systems that need to extract statistical information from these data.’ Adding to this, he notes that there is no need to move data away from the place where they are collected in order to process them: with new computation technologies, we can move the processing towards the data, and this is also true if the data are scattered across multiple data holders. He explains that this model is very appealing for new classes of data, especially when data are confidential or privacy-sensitive. ‘Sure, all kinds of public data, such as data about tenders or contracts, must and will remain publicly available. But fine-grained personal data, such as my individual financial transactions, the places I have visited, etc., should not be moved around. Bringing together personal data from the entire population in a single place is not a good idea, regardless of how secure that place would be, because data concentration causes risk concentration.’

“Bringing together personal data from the entire population to a single place is not a good idea, regardless of how secure that place would be…”

According to Fabio this is unwise not only from a technical point of view — as it is inefficient to move large amounts of data — but also because it would mean concentrating the risk of misuse in one central and publicly known honeypot. ‘That would not be the right approach for comprehensive and sensitive personal data. And bear in mind that we are not merely talking about your marital status or your health status, but about your every single encounter with other people and your every heartbeat!’
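To make the idea of ‘moving the processing towards the data’ more concrete, here is a minimal illustrative sketch in Python. It is not Eurostat’s implementation; the data holders, the values and the statistic computed (a simple mean) are invented for illustration. The pattern it shows is only the data flow the interviewees describe: the statistician’s code travels to each data holder, the raw records stay where they are, and only small aggregates come back.

```python
# Illustrative sketch: the computation travels to the data holders;
# only aggregates (a sum and a count), never individual records, leave them.
from dataclasses import dataclass


@dataclass
class DataHolder:
    """A party (e.g. a company or an administration) keeping raw records locally."""
    name: str
    records: list[float]  # confidential values; never shared directly

    def run_local_aggregation(self) -> tuple[float, int]:
        """Run the shared algorithm locally and return only an aggregate."""
        return sum(self.records), len(self.records)


def federated_mean(holders: list[DataHolder]) -> float:
    """Combine per-holder aggregates into one overall statistic."""
    total, count = 0.0, 0
    for holder in holders:
        local_sum, local_count = holder.run_local_aggregation()
        total += local_sum
        count += local_count
    return total / count


# Hypothetical example with three holders and invented numbers.
holders = [
    DataHolder("holder_A", [1200.0, 1350.0]),
    DataHolder("holder_B", [990.0]),
    DataHolder("holder_C", [1500.0, 1100.0, 1250.0]),
]
print(f"Mean over all holders: {federated_mean(holders):.2f}")
```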

Using data without sharing data

Fabio explains that an organisation such as Eurostat is not interested in collecting and saving individual personal data, but only in producing aggregate statistics based on such data. ‘This means that, instead of bringing your data to me, I can bring the computation methods, the analytics, to you. By using certain cryptographic techniques, from the family of what are known as Privacy Enhancing Technologies, the algorithm that runs on your data extracts only the component of such data that is needed to build the statistics. It encrypts this data component in a special way, so that I am not able to decrypt it — it remains protected even from myself. But it still allows me to compute the final statistics.’ He refers to a simple application of such technologies that is now being evaluated, along with others, in Eurostat. ‘These technical solutions enable statisticians to use personal data for the production of statistics, but not to see the individual data as such. We can, for instance, compute the average salary of a large population without being able to find out anyone’s individual salary. Similarly, I can compute how many people vote for red or blue without knowing which individuals voted for red or blue, or how many people are in a certain district without seeing the individual positions, etc. In order to extract global statistics, we no longer need to see the individual data points.’

“In order to extract global statistics, I no longer need to see the individual data points.”
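The quote above can be illustrated with a toy sketch of one technique from the Privacy Enhancing Technologies family: additive secret sharing. This is only a simplified illustration under assumptions of our own, not the specific cryptographic method Eurostat is evaluating; the salaries, the number of aggregators and the modulus are invented. Each citizen’s value is split into random shares handed to different aggregators, so any single aggregator sees only meaningless numbers, yet the shares still add up to the correct total.

```python
# Toy sketch of additive secret sharing: each salary is split into random
# shares sent to different aggregators. No single aggregator can reconstruct
# an individual salary, yet the overall average can still be computed exactly.
import secrets

PRIME = 2_147_483_647  # all arithmetic is done modulo a public prime


def split_into_shares(value: int, n_aggregators: int) -> list[int]:
    """Split one confidential value into random shares that sum to it (mod PRIME)."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_aggregators - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares


# Hypothetical confidential inputs: one salary per citizen.
salaries = [2800, 3100, 2650, 4200]
n_aggregators = 3

# Each aggregator receives one share per citizen and only ever sums what it sees.
partial_totals = [0] * n_aggregators
for salary in salaries:
    for i, share in enumerate(split_into_shares(salary, n_aggregators)):
        partial_totals[i] = (partial_totals[i] + share) % PRIME

# Only the combination of all partial totals reveals the true sum.
total = sum(partial_totals) % PRIME
print("Average salary:", total / len(salaries))    # 3187.5
print("One aggregator's view:", partial_totals[0])  # looks like random noise
```

In this toy version an individual salary could only be recovered if every aggregator colluded and pooled its shares, which is the flavour of guarantee the interviewees describe as using personal data without seeing it.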

Kostas underlines that this approach of ‘getting statistics out of the data, but not the data’ is an important way to build trust. ‘As a citizen, I will more easily allow the statistical office to apply a certain algorithm to my data, along with the data of many other people, when I know it is technically impossible for them to retrieve my individual data.’ Adding with a smile: ‘Making something technically impossible is always stronger than making it legally forbidden! Blockchain is another example of a secure technology that cannot be tampered with and helps build trust into the process.’ He believes that in the future we will be working in some kind of network where aggregate information will flow, but not raw data as such. ‘We must share computation, algorithms, logs of what algorithm was run on what data by whom, we must share everything … except the raw data! I think this is part of a paradigm shift. And these new possibilities mean we have to come up with new questions and new solutions for data and knowledge sharing. And although we are still in a learning phase, the future looks very promising!’

“Making something technically impossible is always stronger than making it legally forbidden!”.

The new system augmenting, not replacing, the legacy system

According to Fabio and Kostas, the big data work at Eurostat is still in a pioneering phase, and statistical offices around the world are likewise still in a phase where their staff are familiarising themselves with what this new world is about, and with the new technologies that enable, and at the same time motivate, a profound paradigm change in the way official statistics will be produced tomorrow. Fabio continues: ‘Big data is about data, technology and people. We have to re-engineer a socio-technical system.’ He then refers to three layers of the socio-technical system: the hardware, the software and … the humanware. ‘The humanware refers to the regulatory framework, the organisational process, the corporate culture and all the human side of the system. You may see it as another level of coding, above the software and the hardware. Just a bit more abstract than software for the time being.’ He explains that we have to upgrade the humanware level as well. ‘As with new software updates, and in general with any other technological system, we need to maintain backward compatibility with the previous version. Fortunately, engineers in different fields know how to develop new systems that augment but remain compatible with legacy systems.’ As examples, he refers to black & white TV, which remained usable when colour TV was introduced, or stereo radio, which stayed compatible with the older mono system. ‘Trusted smart statistics will be compatible with the legacy statistical production process for traditional data: it will augment it, not replace it, with new data, new technologies and new statistics.’

“The humanware refers to the regulatory framework, the organisational process, corporate culture and all the human side of the system”.

Fabio and Kostas say that the first exploratory activities on big data within the European Statistical System led to framing the problem in the right way, which is halfway towards finding the right solution. ‘As for every scientific or research challenge, asking the right question is the most difficult part. If the question is clear, it is not so difficult to find the solution. And vice versa, when the solution cannot be found, it is very likely because the question was wrongly formulated. Asking the right question and convincing other institutions to see things in a different way, this is what takes time and effort. And the solution must be sought collaboratively. With such complex problems, the solution is not “found” but “built” together. In methodological research, for instance, it is not about choosing between your methodology or mine, but about collaboratively developing our new methodology.’ Fabio gives an example related to algorithms. ‘You have to document your analytics and your algorithms to ensure they are executable by machines and understandable by the humans who will eventually improve them in the future. This tight interplay between the software and the humanware must be kept in mind when we re-engineer each of the two levels. For instance, what is known as the literate programming paradigm means that code must be written and documented in a way that can be understood by both machines and human experts.’ For him this is just one example of where statisticians have to learn from other communities and fields.

“… asking the right question is the most difficult part. If the question is clear, it is not so difficult to find the solution”.

“… code must be written and documented in a way that can be understood by both machines and human experts”.
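As a loose illustration of the literate programming idea mentioned above, where documentation for the human expert is interleaved with the instructions the machine executes, here is a short, invented Python sketch. The statistic and the numbers are hypothetical; the point is only that the ‘why’ and the ‘how’ live next to the code and travel with it.

```python
def trimmed_mean(values: list[float], trim_fraction: float = 0.05) -> float:
    """Average after discarding the most extreme observations.

    Why (for the human reader): non-traditional data sources often contain
    outliers such as sensor glitches or duplicated records, so trimming a
    small fraction from each tail makes the estimate more robust.

    How (for the machine and the reviewer): sort the values, drop
    `trim_fraction` of the observations from each end, and average the rest.
    """
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)
    kept = ordered[k:len(ordered) - k] if k > 0 else ordered
    return sum(kept) / len(kept)


# Invented example: one glitchy reading would badly distort a plain mean.
print(trimmed_mean([10.1, 9.8, 10.3, 10.0, 950.0], trim_fraction=0.2))  # ~10.13
```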

Trusted smart statistics

The two Eurostat experts explain that Eurostat has coined a new term to refer to such a deep paradigm shift: Trusted Smart Statistics. Kostas adds that they saw the need for a systemic approach, identifying statistics as the output of smart systems that produce the new data types. ‘And our computation or algorithms tap into that data stream to process it and produce the statistics. We added the word ‘trusted’ — which could be interpreted in different ways — to highlight that the statistics we produce are trustworthy, which we can guarantee because they are produced by using proper data that were developed in proper ways.’

But there is more, Kostas specifies: ‘Another more profound reason to mention trust is that it means that the statistics system is trustworthy as well!’ Both Eurostat experts consider this important as it means that, as a statistician, one might be able to access very sensitive data from citizens, but because of all this technology and process referred to earlier, it is technically impossible to link the data to an individual or to misuse it. Fabio adds: ‘And this means the trust works both ways. Because, if we as citizens do not fully trust the statistical office about how our data are used, we will not make our data available to them, or we will lie. And this will reduce the quality of the final statistics, and people’s trust will suffer.’ Laughing: ‘Trust is a bi-directional path, a closed loop. Either we mutually trust each other, or we mutually distrust each other. If I do not trust you to use my data correctly and in a trustworthy manner, you cannot trust my data, because I will tell you a lie or nothing at all.’

“Trust is a bi-directional path, a closed loop. Either we mutually trust each other, or we mutually distrust each other”.

Kostas underlines that this is an important understanding: ’By using humanware, software and hardware properly, we can make it impossible to misuse data for any other purpose than the one written in the code that everyone can check. And by making this type of misuse technically impossible, on top of legally forbidden, we can build real and solid trust into our system.’

“By using humanware, software and hardware properly, we can make it impossible to misuse data for any other purpose than the one written in the code that everyone can check”.

Both Kostas and Fabio underline that the key to the whole approach is trust. Fabio adds: ‘One important question in the future will be to decide whether we are able to access data directly from the fountain, i.e. from the citizens, rather than from the buckets, i.e. from private business companies. We have these two channels, Citizen-to-Government (C2G) and Business-to-Government (B2G).’ He explains that with trusted smart statistics, such as trusted smart surveys or even a trusted smart app that citizens can install and that guarantees an individual’s privacy, an important step has been taken towards developing statistics that citizens can trust, and data that they can share with the statistical office without being worried about how their information is treated. ‘Especially when we have made it technically impossible for malicious parties to misuse algorithms or the data they produce about citizens.’

Kostas concludes that one of the main challenges is that, nowadays, everybody has become a data provider and data are very easily accessible through various channels. ‘That is where we would like to make a difference as an official statistical office: being the go-to, trustworthy source of reliable official statistics of high quality to help other institutions, both at EU level — such as the ECA — and at national level, to continue to build trust in the years to come.’

This article was first published in the 1/2020 issue of the ECA Journal. The contents of the interviews and the articles are the sole responsibility of the interviewees and authors and do not necessarily reflect the opinion of the European Court of Auditors.
