How to evaluate your public service?
A story from the Paris Summer Innovation Fellowship 2016 @ Five by Five ;)
Viz available here
For the last two weeks, I’ve been working at the Five by Five headquarters, Rue Martel in Paris, for the first Paris Summer Innovation Fellowship. The program brought together 20 young people from all around the world (Canada, USA, Germany, Israel, Colombia, Slovakia, France) with a shared ambition to help make Paris a better place and to work on projects with social impact.
Data Scientists, Open Street Mappers and DIY Urban Designers collaborated to improve mobility using beacons in public transport, improve public services, help people build communities or save Paris citizens in case of a zombie attack.
Here is a map of the projects worked on during these two weeks:
How to evaluate your public service? An answer given by the Cour des Comptes
Is it possible to assign a score to an administration, so that we can compare how well they are managing the public money?
I decided to work with the French Cour des Comptes to improve access to the information in the reports the Court writes about public services.
The mission of the Cour des Comptes is to put into action Article 15 of the Declaration of the Rights of Man and of the Citizen: “Society has the right to ask a public official for an accounting of his administration”. The Court is divided into 7 Chambres, each of which is responsible for a set of administrations. The Chambres audit the administrations to check how well they are using public money. At the end of each audit, a report is written and made public by the Cour des Comptes.
I’ve decided to work on the “Rapports d’observations définitives des chambres régionales et territoriales des comptes”, available on data.gouv.fr for 2013 to 2015. A team had already worked on these reports during the data session at the Court, so I was able to build on their work to pursue the project.
How does it work?
The goal is to detect whether a text is positive or negative regarding the administration, and to quantify that sentiment.
When it comes to sentiment analysis, we think of Twitter and all the analyses that have been done in this field. We think of all the open-source models already trained on huge amounts of text data. So it seems there is nothing left to do: just apply these models and get very decent accuracy!
But in reality, when I tested a few open-source models on the data from the Cour des Comptes, I got very bad results even on simple sentences. The reason is that most open-source sentiment analysis packages are trained on text from social networks, so they are really good at distinguishing sentences like “Ooooh, this is awwwwesome, I love it!” from “It was disgusting, I won’t come back to this restaurant”. But when it comes to a specific language like that of the Cour des Comptes, using these models leads to a high bias. First, these models are mostly trained on English, though some models trained on French with smaller datasets do exist. But above all, the type of language used is completely different, and the way positive or negative things are expressed is totally different.
So the model had to be adapted to this specific case. I first tried to check which words were used most and least often, but this was not enough to distinguish the texts.
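To make that frequency check concrete, here is a minimal sketch of how word counts can be compared across texts. The report snippets below are invented for illustration, not taken from the actual corpus:

```python
from collections import Counter
import re

def word_frequencies(texts):
    """Count word occurrences across a list of (French) texts."""
    counts = Counter()
    for text in texts:
        # \w+ matches accented letters in Python 3, so French tokens are kept whole
        counts.update(re.findall(r"\w+", text.lower()))
    return counts

# Toy snippets standing in for real report sentences:
reports = [
    "La gestion de la commune est satisfaisante",
    "La chambre constate des irrégularités dans la gestion",
]
freqs = word_frequencies(reports)
print(freqs.most_common(3))
```

Comparing such counts between reports with good and bad conclusions is a natural first step, but as noted above, raw word frequencies alone did not separate the texts.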
Building a model
Because there was no existing model trained on judicial texts, it was necessary to build one.
What is positive, what is negative?
The first thing I wanted to do was to make a list of all the positive and negative expressions used by writers at the Cour, with a measure of importance for each of them. Fortunately, the team who had worked on the subject during the data session had made such a list, and they allowed me to use their work for this project!
With this list in hand, I was able to build a function that assigns a score to a text! Basically, the program reads through the text, and each time it sees an expression from the list, it adds that expression’s weight to the final score.
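A minimal sketch of such a lexicon scorer, with invented expressions and weights (the real list built during the data session is not reproduced here):

```python
# Hypothetical lexicon: expression -> signed weight (positive or negative).
LEXICON = {
    "gestion rigoureuse": 2.0,            # invented positive expression
    "situation financière saine": 1.5,
    "irrégularités": -2.0,                # invented negative expression
    "absence de contrôle": -1.5,
}

def score_text(text, lexicon=LEXICON):
    """Add an expression's weight to the score each time it appears in the text."""
    text = text.lower()
    return sum(text.count(expr) * weight for expr, weight in lexicon.items())

print(score_text("La chambre relève des irrégularités et une absence de contrôle"))
# → -3.5
```

The sign of the total score then says whether the report reads as positive or negative, and its magnitude gives the measure mentioned above.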
How can we know if the algorithm works well?
To answer this question, I checked the sentences where the algorithm had matched an expression. I realized that expressions I had in mind as positive were sometimes used negatively, and vice versa! This was a serious problem which proved that:
- this first algorithm could be improved
- an evaluation of the algorithm was necessary
So I decided to take 1,000 sentences, randomly chosen among all the sentences the algorithm had flagged with the list of expressions, and to label them by hand.
This was a laborious task, but it gave me an idea of the algorithm’s accuracy.
With these labels, I was then able to manually identify the false positives produced by the list of expressions.
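Computing the accuracy from such a hand-labelled sample is straightforward. A sketch with toy data (the `pos`/`neg` encoding and the labels are invented for illustration):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that agree with the hand labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

labels      = ["pos", "neg", "neg", "pos", "neg"]  # hand labels (toy data)
predictions = ["pos", "neg", "pos", "pos", "neg"]  # lexicon scorer output (toy data)
print(accuracy(labels, predictions))
# → 0.8
```

The disagreements between the two lists are exactly the mislabelled matches worth inspecting by hand.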
Improve the model
I decided to build a simple model whose goal is to predict whether a sentence will be misread by the list of expressions, that is, whether an expression matched as positive or negative actually carries the opposite sentiment because of its context. According to the labels I had produced by hand, this model improves on the first one by several percentage points of accuracy.
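The post does not detail the model, but one simple way such a context check could work is a negation heuristic: flip the lexicon sign when a negation word appears just before the matched expression. Everything here (the negation list, the window size, the whitespace tokenization) is an illustrative assumption, not the actual model trained on the hand-labelled sentences:

```python
# Invented French negation cues for illustration.
NEGATIONS = {"ne", "pas", "sans", "aucun", "aucune"}

def context_flips(sentence, expression, window=3):
    """Return True if a negation word occurs within `window` tokens before
    the matched expression, suggesting the lexicon sign should be inverted."""
    tokens = sentence.lower().split()
    for i in range(len(tokens)):
        # Find the token index where the matched expression starts.
        if " ".join(tokens[i:]).startswith(expression.lower()):
            return bool(NEGATIONS & set(tokens[max(0, i - window):i]))
    return False

print(context_flips("la gestion n'est pas rigoureuse", "rigoureuse"))  # True
print(context_flips("la gestion est rigoureuse", "rigoureuse"))        # False
```

A trained classifier on the labelled sentences can learn many such context patterns at once, which is what yields the accuracy gain mentioned above.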
Using Tableau Public to make it readable
The scoring is visualized as a map of France on which every audited administration is marked with a color depending on its score, along with information about its location, the content of the report, and the expressions used to score it.
The visualization can be accessed here.
Two weeks is really short to see a project through to the end, which is why I’d like to thank Five by Five for the help they provided to all of us, with special thanks to Adnène and Vincent from the Cour des Comptes, Joseph from Snips and Martin from Airbnb, who helped me with the public service evaluation project.
See you next year!