A basis for an ethical AI framework for humanitarian response
AI (and its sub-field of machine learning) has become a hot topic in humanitarian tech, and many promises are being made about potential futures. In the race to become pioneers, we have to ensure we protect the affected populations we wish to help. A tangible, actionable ethics framework is needed, and a potential starting point is laid out in this article.
I have recently finished a piece of work looking into the use of bleeding-edge technology with health data. Many of the concerns about negative outcomes of using AI in health apply across industries. We have started to see early applications of the technology in the humanitarian sector, and now is an appropriate time to think through the implications and consequences of its use.
Current Examples of the use of AI in humanitarian response
- 510 Global are using machine learning to predict damage after typhoons to better plan humanitarian response.
- Many researchers are exploring the use of AI to trace buildings and roads for OpenStreetMap.
- The IFRC and the UN are interested in forecast-based financing to pre-emptively fund responses.
- I have used machine learning to predict migration flow to better resource responses.
There is a great story from Microsoft’s Dr Rich Caruana which highlights why we have to be cautious. He created a model that performed very well at predicting outcomes for pneumonia patients, and the question was asked: should we use this on real patients? Thankfully, they chose not to. At the same time, a friend investigating the data with a rule-based system found that having a history of asthma appeared to lower your chance of dying from pneumonia. Asthma is in fact a major risk factor for pneumonia, and the model was wrong. What had happened was that people with asthma pay closer attention to their breathing and are already plugged into the healthcare system, so they get treatment sooner, leading to better outcomes. This nuance wasn’t captured in the data, and so the model reached an incorrect, discriminatory conclusion.
“What really scares me is, what else did the neural net learn that’s similarly risky? But the rule-based system didn’t learn it, and therefore I don’t have a warning that I have this other problem in the neural net.” — Dr Rich Caruana
Returning to our earlier examples and thinking about them at a high level, we can quickly hypothesise ways in which the models could be discriminatory.
Predicting Damage after Typhoons
The damage numbers are self-reported by municipalities, and some exaggerate to gain more funds. If particular features are prevalent in those municipalities, that bias might become ingrained in the model. The model might also learn that areas with wooden buildings are most affected; when applied to a new area, it could then discriminate against a population that builds with a different material but is equally affected.
Tracing building types
The model might not detect buildings of certain shapes (round, for example) that belong to a particular population group, who would then be left out when populations are identified.
I was recently at the fantastic GeOnG conference in France, which inspired this article. When discussing models, many data scientists give a single figure to communicate performance — “It works at 93% accuracy” — especially when comparing performance against a human counterpart or talking to non-technical users. The problem is that this loses a lot of nuance. Working better on average is not good enough; it has to work better for everyone. We need tools to communicate performance clearly, especially for those most likely to be discriminated against.
I had a look around at a few frameworks for AI, and most of them seem to be a set of principles to adhere to. As illustrated above, though, it is easy to end up with a bad model and not notice. As the saying goes, the road to hell is paved with good intentions. I think it is important that the framework includes concrete steps, tools and outputs used to test and communicate the impact of AI models. Below I set out a possible starting point for a tangible framework for assessing the ethical use of AI in humanitarian situations. It is not meant to be complete or well researched, but rather a launching point for discussion.
Framework Basis
Explainability — As illustrated above, one of the key elements to consider is explainability: understanding how conclusions are reached. While advances are being made in this area, the algorithms that give the best results are generally the most complex and the least explainable (black box). Black box algorithms should not be written off straight away, however. We need to consider explainability on a spectrum of distance from, and impact on, the beneficiary. If an algorithm is going to decide whether a person receives a service, the process needs to be wholly transparent and defensible. In other use cases that have little to no direct impact on a person, such as summarising documents, the requirement can be less strict. If there is only a small gain to be had from a black box algorithm, a white box model should always be favoured.
Distance from the beneficiary should be defined in segments, and for each segment the required level of explanation should be specified, ranging from no explanation, to the major factors and influences being understood, to a complete understanding of the model.
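To make the white box/black box trade-off concrete, here is a minimal sketch using scikit-learn on synthetic data (the feature names are hypothetical and purely illustrative, not drawn from any of the datasets mentioned above): a logistic regression exposes coefficients that can be inspected and defended, while a gradient-boosted model may score slightly higher but offers far less insight into how it reaches its conclusions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, purely illustrative data; the feature names are hypothetical.
X, y = make_classification(n_samples=1000, n_features=4, random_state=0)
feature_names = ["wall_material", "roof_type", "distance_to_coast", "building_age"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# White box: the learned coefficients can be read, questioned and defended.
white_box = LogisticRegression().fit(X_train, y_train)
for name, coef in zip(feature_names, white_box.coef_[0]):
    print(f"{name}: {coef:+.2f}")

# Black box: often a touch more accurate, but far harder to explain.
black_box = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
print("white box accuracy:", white_box.score(X_test, y_test))
print("black box accuracy:", black_box.score(X_test, y_test))
```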
A standard way of communicating performance — A single statistic is not enough to communicate the performance of a model across the whole population being considered. When communicating a model’s performance, it should be stated for each population segment. Special attention should be paid to those historically discriminated against, with these segments specifically highlighted.
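As a sketch of what segmented reporting could look like (the segments and numbers below are made up for illustration), the idea is simply to report the same metrics per population segment rather than quoting a single overall figure:

```python
import pandas as pd
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical evaluation set: true labels, model predictions, and the
# population segment each record belongs to.
results = pd.DataFrame({
    "y_true": [1, 0, 1, 1, 0, 1, 0, 0],
    "y_pred": [1, 0, 0, 1, 0, 1, 1, 0],
    "segment": ["urban", "urban", "rural", "rural",
                "urban", "rural", "rural", "urban"],
})

# The single overall figure that hides the nuance.
print("Overall accuracy:", accuracy_score(results.y_true, results.y_pred))

# Performance broken down by segment, which is what should be reported.
for segment, group in results.groupby("segment"):
    print(
        segment,
        "accuracy:", accuracy_score(group.y_true, group.y_pred),
        "recall:", recall_score(group.y_true, group.y_pred, zero_division=0),
    )
```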
Accuracy of the training data — A peer review of the training data should always be distributed with the model’s implementation. This review should highlight the biases and weaknesses contained within the data and should be updated as models trained on it reveal further problems. This will allow humanitarian responders to make informed decisions about whether to use a particular implementation of a model and will facilitate discussion amongst peers as to its appropriateness.
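A data review does not have to be elaborate to be useful. As a hypothetical sketch (the file name and column names are assumptions, not a real dataset), a first pass might check how well each group is represented, whether reporting sources diverge, and where data is missing:

```python
import pandas as pd

# Hypothetical training data; the path and columns are illustrative only.
training_data = pd.read_csv("typhoon_damage_reports.csv")

# 1. How well is each municipality / population group represented?
print(training_data["municipality"].value_counts(normalize=True))

# 2. Does self-reported damage differ wildly between reporting sources?
print(training_data.groupby("reporting_source")["damage_pct"].describe())

# 3. How much data is simply missing, and where?
print(training_data.isna().mean().sort_values(ascending=False))
```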
Human in the loop — Unless a model is 100% accurate or very far removed from the beneficiary, a human expert should always be kept in the loop of any decision made by an AI implementation. We need human-expert augmentation rather than automation.
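One lightweight pattern for keeping the expert in the loop is a confidence gate: act automatically only on high-confidence predictions and route everything else to a person. The sketch below assumes a scikit-learn-style model and an arbitrary threshold, both of which are illustrative rather than prescriptive.

```python
import numpy as np

CONFIDENCE_THRESHOLD = 0.9  # illustrative cut-off, not a recommendation

def triage(model, records: np.ndarray):
    """Split predictions into those safe to act on and those needing an expert."""
    probabilities = model.predict_proba(records)   # shape: (n_records, n_classes)
    confidence = probabilities.max(axis=1)
    predictions = probabilities.argmax(axis=1)

    needs_review = confidence < CONFIDENCE_THRESHOLD
    # High-confidence predictions can be acted on directly; the rest go to a
    # human expert along with the model's suggestion and its confidence.
    automatic = [(i, int(predictions[i])) for i in np.where(~needs_review)[0]]
    for_expert = [(i, int(predictions[i]), float(confidence[i]))
                  for i in np.where(needs_review)[0]]
    return automatic, for_expert
```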
Clear distinction to humanitarians between research and applied work — The framework is designed to protect people. If research is being carried out with no plans for implementation, the model doesn’t need to adhere as strictly, but this should be communicated clearly in the work to avoid confusion and false expectations.
Questions to answer
There are also some questions on which we, as a community, need to decide where we stand. If a current implementation is 90% effective for everyone, should we deploy a model that is 94% effective for one community and 96% effective for another? By introducing a better outcome for everyone, we also introduce a bias into the system. Should we deny improved outcomes to everyone if they are not equal?
These are just some initial thoughts, and I am sure the suggestions can be changed and strengthened in many ways, as I am in no way an expert in the field. I would love to hear what you think is missing.