NTTS2017 Live Blog: 22B — Dissemination: innovation in the dissemination of official statistics
This is a quick live blog from the session at the NTTS conference at Eurostat. I have tried to quote the presenters as much as possible, but given the speed, I will probably have missed some parts. Errors are mine, not the speakers’! I’ll be refining this article during the conference; if you spot errors or typos, please let me know.
Philippe Bautier opens the session. He is very interested in this topic as he leads Eurostat’s dissemination efforts. The presentation by Brugt Kazemier and Kees Zeelenberg is unfortunately canceled. But as Bautier says: more time, so more quality time for the two coming presentations.
Here we go: I hope I do an okay job disseminating these two talks on dissemination! ;-)
Disseminating statistical data by short quantified sentences of natural language
presented by Miroslav Hudec
Hudec is a professor of Applied Mathematics at the university in Bratislava.
Hudec states that summary statistics are a good and comprehensive way to communicate information, but not for people who lack a good understanding of statistics. People therefore often turn to graphs as an alternative, but not all data can be represented in graphs.
In this session Prof Hudec presented a different alternative: linguistic summaries (abbreviated as LS).
He quotes a reference saying that such summaries are especially useful when they are not as terse as the mean.
What are linguistic summaries?
Instead of reporting: mean value of records: 300, standard deviation: 50
One can also report:
- Most records are around (near) 300.
- About half are between 250 and 350.
This has the nice benefit that ordinary people can read it, including those who have no background in statistics. Hudec mentions more benefits:
- Explanation of relational knowledge in the data. Example: “about half of young respondents have rather positive opinion about the population census”
- Nice way to write queries: “Find districts where most of municipalities have high ratio of arable land”. I can definitely see how writing queries in normal language would help people find what they want. Websites
An example. This contains just a simple set of data, but you can use this with all kinds of data. When we want to present data to journalists or small-business owners, the usual overview of summary statistics (in the middle) might not be helpful, as not everyone knows what a standard deviation means.
What you can also do is present summary statements (bottom table). The statements you want to show are the ones with high validity. So here, we would just show the bold statements: simple sentences that everyone can interpret.
Hudec explained how this works conceptually. You define a number of bins with variable slopes, so that a value can belong to a bin fully, partly or not at all. The computer can then use that information to judge or generate linguistic statements.
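My own attempt at making this concrete: a minimal sketch of how the validity of a statement like "most records are around 300" could be computed. It assumes a formulation that is common in the fuzzy-logic literature (a sloped membership function for "around" and a sloped quantifier for "most"); the talk did not spell out the exact formulas. The function names, slope widths, thresholds and records are all made up for illustration.

```python
from statistics import mean


def around(x, center, core, slope):
    """Degree (0..1) to which x is 'around' center.

    Fully true within center +/- core, then falling linearly
    to 0 over a further distance of `slope` (hypothetical widths).
    """
    distance = abs(x - center)
    if distance <= core:
        return 1.0
    if distance >= core + slope:
        return 0.0
    return (core + slope - distance) / slope


def most(proportion, lower=0.5, upper=0.85):
    """Degree (0..1) to which a proportion counts as 'most'.

    The thresholds 0.5 and 0.85 are assumptions, not values from the talk.
    """
    if proportion <= lower:
        return 0.0
    if proportion >= upper:
        return 1.0
    return (proportion - lower) / (upper - lower)


def validity_most_around(records, center, core, slope):
    """Validity of the sentence 'most records are around <center>'."""
    proportion = mean(around(x, center, core, slope) for x in records)
    return most(proportion)


# Illustrative records only, not data from the presentation.
records = [290, 295, 300, 305, 310, 250, 350, 300, 298, 302]
validity = validity_most_around(records, center=300, core=10, slope=40)
print(f"Most records are around 300: validity {validity:.2f}")  # 0.86
```

A sentence would then be shown to readers only if its validity is high enough, which matches the idea of only showing the bold statements.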
- Linguistic summaries are less sensitive to imprecision in data
- Easily understandable for a majority of data users
- Data disclosure is not a problem. You can still provide the raw data for those who want to dive into it
- Preparation takes more time / calculations take longer. This is true, but that effort might just be needed if people otherwise do not interpret the data (correctly)
- LS are meant for dissemination but might influence data collection
Q: LS for query generation was mentioned. Is the idea that users could use these queries in natural language in the future on the websites of statistical bureaus instead of the usual (quite challenging) user interfaces for filtering/selecting data?
A: Yes, it can indeed be used for queries, SQL queries to be specific. The computer translates the language to particular queries. Because of the slopes, you can then make different queries, which are ranked by how useful they are, and then the most useful ones are returned to the user.
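To illustrate the ranking idea, here is a minimal sketch for the earlier example query, "find districts where most of municipalities have high ratio of arable land". It is written in plain Python rather than SQL, and it assumes the same kind of sloped definitions for "high" and "most"; the thresholds, district names and ratios are hypothetical and not taken from the talk.

```python
def ramp(x, start, full):
    """0 at or below `start`, 1 at or above `full`, linear in between."""
    if x <= start:
        return 0.0
    if x >= full:
        return 1.0
    return (x - start) / (full - start)


def high_ratio(ratio):
    """Degree to which an arable-land ratio is 'high' (assumed thresholds)."""
    return ramp(ratio, 0.4, 0.6)


def most(proportion):
    """Degree to which a proportion counts as 'most' (assumed thresholds)."""
    return ramp(proportion, 0.5, 0.85)


def rank_districts(districts, min_truth=0.0):
    """Score each district and return matches, best first."""
    scored = []
    for name, ratios in districts.items():
        proportion = sum(high_ratio(r) for r in ratios) / len(ratios)
        truth = most(proportion)
        if truth > min_truth:  # drop districts that do not match at all
            scored.append((name, truth))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical arable-land ratios per municipality, grouped by district.
districts = {
    "District A": [0.70, 0.65, 0.80, 0.62],
    "District B": [0.70, 0.55, 0.30, 0.65],
    "District C": [0.10, 0.20, 0.45, 0.30],
}

ranking = rank_districts(districts)
for name, truth in ranking:
    print(f"{name}: {truth:.2f}")
```

With these made-up numbers, District A matches fully, District B matches partly and District C is dropped. A hard cut-off would instead either include or exclude District B outright.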
Q: Are there already pilot projects?
A: This is ongoing research. We have done a first project with municipalities, but not yet something that has been shown publicly.
Contestina: A visibly understandable path toward more effective data dissemination
presented by Mariagrazia Zottoli
The goal of good statistics communication is to transform statistics into knowledge.
Data and charts are published, but most people do not have the skills to transform them into knowledge.
Solution: digital storytelling (my note: I would probably call it automatically generated reporting, but I agree that does not sound as good ;-))
That is why Zottoli and her organisation created Contestina: a tool for more effective data dissemination.
What is Contestina?
A software agent that facilitates comprehension of statistical information, guiding users through the thinking process.
For now, it helps readers read data from economic and competitive data domains.
How does it work?
Contestina (C for short) is built in layers. The first layer is rule definition.
In order to execute the rules, C needs access to
The highest level is the visualisation layer, which uses the data to generate automatic reports (called digital stories).
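To make the layering a bit more tangible, here is a minimal sketch of what a single "easy rule" could look like: it takes an indicator and two years and returns a plain sentence. This is purely my own illustration and not Contestina's actual code; the function name, the indicator and the values are hypothetical.

```python
def growth_rule(indicator, values, year, base_year):
    """Hypothetical rule: compare `year` with `base_year` in one sentence.

    `values` maps a year to the indicator value for that year.
    """
    current = values[year]
    previous = values[base_year]
    change = (current - previous) / previous * 100  # percentage change

    if change > 0:
        trend = "higher than"
    elif change < 0:
        trend = "lower than"
    else:
        trend = "the same as"

    return f"{indicator} in {year} was {trend} in {base_year} ({change:+.1f}%)."


# Made-up values, only to show how the sentence changes with the year.
export_value = {2013: 200.0, 2014: 210.0, 2015: 189.0}

print(growth_rule("Export value", export_value, 2014, 2013))
print(growth_rule("Export value", export_value, 2015, 2014))
```

Because the years are parameters, the same rule yields a different sentence when the reader picks a different year.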
Students look for info.
An overview where all data is presented. The Contestina storytelling part is obviously the interesting bit here.
Zottoli then presented a worked example. Contestina gives a first objective understanding of the data.
To get more information on a statement, you can just click to view more. The program also shows descriptions of the definitions that are used.
Contestina tries to put all the useful info together. No more need to surf the whole web to find the useful info.
- Facilitates comprehension
- Contextual reading
- Contestina tries to address the lack of capabilities. Not only accessible and transparent, but also understandable and comprehensible.
Q: The query is in English. Is Contestina multilingual? Can you query in multiple languages?
A: Contestina is a beta version, so for now it is only in Italian. It was translated into English here. The next goal is to make it available in multiple languages.
Q: Are you working on the request part using natural language?
A: Yes, we are going to work on non-traditional, unstructured data. We will change how it works to facilitate this, to offer good basic information.
Q: about the rules, I assume this is automated. Is this just for didactic purposes?
A: We work with relational parameters to query from SQL. So yes, the rules change as a function of context. You can put in 2013, but also 2015, and then everything changes.
Q: What is the license? Do you sell it or is it MIT? What are reuse conditions?
A: We are still trying to find an answer to this question. For now it is for private purposes. To begin with, we will sell it as a product and/or offer consulting services for it.
Q: I work on a report builder. It reports higher or lower, but we have a very hard time with significantly higher or lower. Do you offer this?
A: In this first stage, we have just set easy rules. We just wanted to give users a first reading, maybe pointing out that there is growth. Let’s try to brainstorm after the talk to see if together we can think of anything.