Data says hello to world
TL;DR: any data about things in the world that are not directly countable or quantifiable should be taken with a pinch of salt, because it is shaped by human preconceptions of how it should be quantified, and any “insights” gained from it will likely propagate those preconceptions and have undesirable consequences. Data is a picture of the world, not the world itself, and it is very easy to lose sight of that when being pressed to deliver “value” to your organisation or project.
I’ve been called an “over-quantifier” before. This might be true, and completing a degree in Physics plus a bootcamp in Data Science has only reinforced that trait. I’m sure there are many people similar to me, who think that more information means a greater ability to do good: if only we had more things measured, then we could do more good!
You’d expect me to be quite pleased then, given that society seems to be increasingly data-driven and influenced by the practices of measurability, predictability and sociometrics. The creation and maintenance of stable systems of digital measurement and record-keeping have been extremely beneficial in so many respects and the development of the tech sector in the past decade has revolutionised the way we leverage data for good.
I’m not entirely happy though. The ascendant data-world has many foundations built on sand.
Steffen Mau’s The Metric Society is a long analysis of the mounting cases where attempts to measure society are either highly inadequate or downright pernicious. While I don’t agree with every argument made, and I diverge from the author’s political views and academic leanings, he makes valid points about the obsession (noble or not) with measuring every aspect of the world, especially the social one. Some of the problems illustrated are ultimately the product of agents or institutions that you cannot control. However, if you are a member of an organisation that has an impact in these fields, or you are in a position where you have to act on such metrics, then critiquing your raw data should be an ethical imperative.
Here are three problems identified in The Metric Society, based on my reading and interpretation of it:
- The mismatch between metric and implicit real-world value: the distance between the metric and what we presume it indicates (e.g. academic h-index; GDP/capita and other economic indicators).
- The unreliability of the measuring agent / rater / customer to properly assess quality (a patient reviewing a GP clinic; a 1–5 star rating of a university lecture).
- Placing too much trust in data-driven systems that cannot account for causal relations in the world, or for qualitative information that humans are more sensitive to (e.g. the COMPAS system used for assessing reoffending risk in many US courts).
Problem 1:
In any endeavour to quantify and measure x in order to deliver insights into y, we’re assuming that x and y point to something in the real world: x could be rainfall in centimetres and y the number of new tree saplings this year; x could be mentions of a trending celebrity and y the number of likes; x could be the average number of bedrooms per flat on a B&B platform, while y is the average number of customers. The list of examples is only as limited as the willingness to quantify.
We might only deal with data that is clearly numerical and lends itself to quantification (like the rainfall and trees). I’d be quite confident to assume that the difference between the number of trees I have recorded in the table on my screen and the actual forest is either negligible or mostly due to random errors. The conceptual leap from trees to n_trees is small.
Let’s suppose that you’re a data scientist at a grant-making body for high-impact research. You have a huge amount of data on different academics and their respective departments, and there is a lot of pressure on you to make the right call. Which variable in your table of institutions do you jump to? The h-index.
The h-index (proposed in 2005) is an indicator of academic publication output and influence. It is defined as the largest number h such that an author has h publications that have each been cited at least h times (Mau, p75). This might seem like a valuable indicator, since it tries to account for both the number of publications and their influence amongst peers. However, it’s inadequate for comparing across academic fields, since different fields have wildly varying ranges of h-indexes. It also favours more senior academics, since their papers have been around longer and have had more time to accrue citations. The metric also relies on the somewhat faulty assumption that the number of papers and the number of citations tend to be proportional, so it fails to account for a researcher with a small number of papers that have a really high impact! The thing is, what you really wanted to capture was the researcher’s influence and/or capacity to help the world through their skills. What you got was a measurement of their esteem in a not-quite-perfect publishing system.
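The definition above can be sketched in a few lines of Python (a toy illustration of my own; the citation counts are invented):

```python
def h_index(citations):
    """Largest h such that the author has h papers with >= h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h

# The blind spot described above: a steady publisher outscores a
# researcher with a couple of enormously influential papers.
steady = [10, 9, 8, 7, 6, 5]    # many moderately cited papers
breakthrough = [900, 850, 2]    # two landmark papers, little else

print(h_index(steady))        # 5
print(h_index(breakthrough))  # 2
```

Both careers collapse into a single integer, and the comparison says nothing about the citation norms of the fields the two researchers work in.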
Ok, you might think that this kind of problem is resolved when dealing with more concrete measurements. Surely economic data provides more concrete, numeric variables that aren’t as divorced from reality as performance indexes. Go back again to our list of problems: there’s an implicit, real-world value that is subjective and tacit (let’s say something as deceptively basic as how wealthy a nation is) and the measurable proxy for that value (GDP, GDP/capita, GNP, PPP). The metric is still useful, in that it can provide some form of comparison and can be used to evaluate progress over time or as a result of policy, but it is still distant from that intangible value in the world we try to condense. I recommend Economics: The User’s Guide, by Ha-Joon Chang, for a really entertaining and well-written layperson’s exposé of economic metrics. Whenever your analysis or model depends on a social or economic metric, keep all your statements about it close to the actual thing it measures, or openly and repeatedly state that the metric rests on contestable assumptions.
Problem 2:
You’re assessing data for a health service provider. Future spending decisions and hiring practices will be informed by your analysis of what’s been recorded so far. Among the myriad factors you take into consideration are waiting time, average appointment length and the 1–5 star review of the clinic (which is publicly accessible). The organisation’s bottom line is: get more patients through every day and improve the ratings! There might even be a rewards-and-penalties system whereby GPs who see fewer than a certain number of patients fare less well, a case of perverse incentivisation. This phenomenon has already made itself known in the media, especially regarding health services. Yet patient and customer ratings and reviews of services are only becoming more prevalent, the exact opposite of the public trend you’d expect.
When I last moved house and had to register with a GP clinic, the number of stars the clinic had was the most salient thing in my search results. It is just tacitly lumped in there, alongside customer reviews of my nearest curry house, and although I can choose to ignore it, it does colour my perception of said clinic. When did it become common knowledge that the public could adequately rate care from doctors? Customer-service quality and a doctor’s or nurse’s care are not commensurable. Yes, the other metrics, average appointment length (some patients DO need more care than others) and waiting time (a doctor has to respond adequately to all patients; they can’t just boot you out once your time is up), are imperfect, as they don’t capture quality of care. A patient’s review (a single integer chosen in a flash, with no accountability for the decision) is even more distant from reality.
I say this as someone who has recently been asked to supply a review of a GP video appointment. I cannot bring myself to do it because, frankly, I’m not qualified. I rated the quality of the video conferencing itself and left it at that.
Now imagine that you’re delivering insights to the management board of a university, and you have a dashboard of data regarding lecture attendance, average results in different courses and, saliently, student reviews of lectures. You’ve scraped these from a teaching review platform that purportedly promotes transparency and accountability. You notice that some modules have lower retention rates and that the ratings lecturers receive vary wildly. Action: use the available institutional mechanisms to incentivise lecturers to improve their content and/or delivery. That sounds simple, except that it won’t improve the next batch of data; it will mostly cause further stress to the academics involved. The original data did not reflect the quality of the lecturer: it was an amalgamation of competing factors, including the difficulty of the course and module (some are, let’s face it, harder than others) and, more importantly, the bias of the reviewer. Student reviews of lecturers suffer from selection bias (the most vociferous students are more likely to express their views) and the halo effect (whereby a person who is regarded favourably has their negative characteristics glossed over, or vice versa).
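To see how far selection bias alone can drag a published rating away from actual opinion, here is a toy simulation (my own sketch, not from the book; all the numbers are invented): every student holds a private 1–5 opinion of a lecture, but dissatisfied students are far more likely to actually submit a review.

```python
import random
import statistics

random.seed(42)

# Hypothetical cohort: each student's private 1-5 opinion of a lecture
# (mostly mildly positive, with a few strong views either way).
cohort = [random.choice([1, 2, 3, 3, 3, 4, 4, 4, 5]) for _ in range(2000)]

def submits_review(rating):
    # Selection bias: disgruntled students (rating <= 2) are assumed to
    # be much more likely to bother leaving a review than everyone else.
    return random.random() < (0.6 if rating <= 2 else 0.1)

observed = [r for r in cohort if submits_review(r)]

print(round(statistics.mean(cohort), 2))    # true average opinion
print(round(statistics.mean(observed), 2))  # average of submitted reviews
```

With these (invented) propensities, the published average sits well below the cohort’s true average, even though not a single review was dishonest.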
Problem 3:
The last profoundly worrying feature of a data-driven world (though not a necessary one) is overconfidence in our numerical models of the world. Models are generally built on past data to make predictions about future data. Models are also tested and given a score for their accuracy (e.g. R squared or the area under a ROC curve). The best models are the ones that have tested best on the data so far, and with further testing our confidence in them only grows. Yet even the most robust models suffer from these key limitations:
- i. the data used is an incomplete representation of the world and
- ii. the model has no way of understanding causal relationships in the world. General statistical models are cause-agnostic.
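Limitation (ii) is easy to demonstrate with a toy confounder (my own illustration, not an example from the book): two variables that never influence each other will correlate strongly if a hidden third variable drives both, and a model trained on that correlation can score well while being useless as a guide to intervention.

```python
import random
import statistics

random.seed(0)

# Hidden confounder: hypothetical daily temperature over a year.
temperature = [random.uniform(0, 30) for _ in range(365)]

# Both observed variables are driven by temperature, not by each other.
ice_cream_sales = [10 * t + random.gauss(0, 20) for t in temperature]
swimming_accidents = [0.5 * t + random.gauss(0, 2) for t in temperature]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Strong positive correlation: a model predicting accidents from ice
# cream sales would score well, yet banning ice cream prevents nothing.
print(round(pearson(ice_cream_sales, swimming_accidents), 2))
```

A cause-agnostic model sees only the correlation; it has no way of knowing that intervening on one variable would leave the other untouched.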
An excellent example of this is COMPAS (Correctional Offender Management Profiling for Alternative Sanctions), an evaluation software widely used in the US (Mau, p79) that uses a data-based model to assess the risk of a person reoffending. COMPAS was criticised and accused of being significantly biased in its assessments of African American citizens; in effect, an algorithm circumvented more than half a century of socio-economic theory and understanding. What is most worrying about COMPAS is that such systems are given immense trust and authority to influence important decisions. Even though we should resist alarmist imagery of a numerical dystopia, I think such examples need to be examined in the public light as data science malpractice.
I admit that, throughout the book, Mau picks particularly disturbing examples of data and statistics being misused (he starts by mentioning China’s citizen-rating system). Nevertheless, such instances can crop up in any data scientist’s line of work. This is why I think data scientists should act as responsible, proactive citizens and always consider the ethical implications of their work.
Mau, S. (2019). The Metric Society: on the quantification of the social. Polity Press, Cambridge. https://www.amazon.co.uk/Metric-Society-Quantification-Social/dp/1509530401