Risky behaviour — three predictable problems with the Australian Centre for Evaluation

[Image: Hoping to guide navigation with reports. Generated with Gencraft.]

After advocating for better evaluation all my working life, I was keen to see the launch of the new Australian Centre for Evaluation (ACE), with its stated aims of improving the volume, quality, and impact of evaluations across the Australian Public Service.

One of the two business units in ACE — the Evaluation Leadership, Policy and Capability Unit — has the commendable objectives of supporting effective implementation of the new Commonwealth Evaluation Policy, and strengthening evaluation capacity, including recognising evaluation as a professional stream for recruitment and professional development.

But I am disappointed by the way the second business unit — the Impact Evaluation Unit — has been framed, and its prominence in descriptions of ACE. Despite the range of proposed activities for the new centre, its central focus seems to be promoting the use of a particular set of impact evaluation designs.

In May 2023, the responsible Minister, Andrew Leigh, made this very clear when he announced,

“We’re creating an Australian Centre for Evaluation, to conduct rigorous impact evaluations, including randomised trials.”

ACE is intended not only to conduct some useful evaluations but also to shape perceptions and evaluation practice across government. As the media release stated, it will

“demonstrate that better evaluation is possible everywhere, and that all policies and programs benefit from evaluation plans that use the most rigorous methods feasible”.

As a former Professor of Public Sector Evaluation, who has watched evaluation debates and practices play out internationally for the past few decades, I am concerned that this emphasis fails to draw on what has been learned from the evidence about evidence-based policy and practice and how evaluation can best support this.

ACE seems to be promoting an outdated “gold standard” view of evaluation methods and “what works” view of evidence-informed policy.

I’m disappointed at the missed opportunity to support much needed evaluation innovation. But more than that, I’m concerned by the risks in this approach — risks which are evident from the history of evidence-based policy, and which should be anticipated and mitigated.

Three predictable problems

Firstly, the emphasis on impact evaluations risks displacing attention from other types of evaluation that are needed for accountable and effective government, including good process evaluation, which examines accessibility, coverage and quality, and aims to improve them during implementation.

Unless specific attention is given to other types of evaluation, the implicit message will be that impact evaluation is the “A game” — and government departments will be encouraged to frame evaluation plans around answering impact evaluation questions rather than on providing the sorts of evidence that are most needed.

Secondly, the emphasis on a narrow range of approaches to impact evaluation risks producing erroneous or misleading findings. ACE will focus on impact evaluations that use a counterfactual design — an estimate of what would have happened in the absence of the intervention. This is most commonly constructed through a control group (as in an RCT, where participants are randomly allocated to either an intervention group or a control group) or a comparison group (as in quasi-experimental designs, where participants are matched in various ways to similar non-participants).
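To make the counterfactual logic concrete, here is a minimal sketch in Python (all numbers invented for illustration): because allocation is random, the control group's average outcome stands in for what would have happened to the treated group without the programme, so the difference in group means estimates the average effect.

```python
import random
import statistics

random.seed(0)

# Hypothetical illustration only: simulate an RCT where the true effect
# of a programme is +2.0 on some outcome scale.
TRUE_EFFECT = 2.0

# Randomly allocate 1,000 participants: 500 to intervention, 500 to control.
population = range(1000)
treated = set(random.sample(population, 500))

outcomes_treated = []
outcomes_control = []
for person in population:
    baseline = random.gauss(10.0, 3.0)   # outcome without the programme
    if person in treated:
        outcomes_treated.append(baseline + TRUE_EFFECT)
    else:
        outcomes_control.append(baseline)

# The counterfactual estimate: the control group's mean outcome stands in
# for what would have happened to the treated group without the programme.
ate = statistics.mean(outcomes_treated) - statistics.mean(outcomes_control)
print(f"Estimated average treatment effect: {ate:.2f}")
```

The estimate lands close to the true +2.0 here precisely because random allocation makes the two groups comparable — the assumption that fails when no credible control or comparison group can be formed.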

Counterfactual approaches can work well when an intervention operates at the individual level, but they are not possible or appropriate for many of the important policy issues governments face, especially interventions that operate at a community level, such as public health measures, or that aim to change complex systems, such as improving water quality. For such interventions it is not possible to create a credible counterfactual.

For these types of interventions, causal inference needs strategies more like those used in historical or criminal investigations — identifying evidence that does and does not fit with different causal explanations and adjudicating between them. There has been considerable development in these approaches over the past ten years or so, but these do not appear to be included in the scope of the new Impact Evaluation Unit of ACE.

Privileging counterfactual impact evaluation, as a strategy to influence evaluation design more widely in the Australian government, creates the real risk that evaluators and evaluation managers will try to use these approaches when they are not appropriate. It also means that non-counterfactual evidence from individual evaluations or from multiple studies will be disregarded, even where it has been systematically and validly created. Systems-level interventions, which are not possible to evaluate using a counterfactual approach, will be seen as not having evidence to support them even when they are effective.

All of these consequences can lead to erroneous conclusions and incorrect recommendations for policy and practice.

We saw during the COVID pandemic what this can look like. One of the big questions was whether masks reduce the risk of transmission. A narrow interpretation of what counts as credible evidence led some people to claim there was “no evidence” that masks worked because no RCT had been done. And when the DANMASK-19 RCT in Denmark failed to find statistically significant differences between those encouraged to wear masks and the control group, it was widely reported as showing that “masks don’t work”, despite the acknowledged limitations of the study, including low statistical power, short follow-up, low compliance, and questionable measurement. (A recent article in The Conversation and expert comments on the study discuss these limitations in detail.)
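The “low statistical power” criticism can be made concrete with a back-of-envelope calculation. The sketch below (using hypothetical numbers, not DANMASK-19’s actual figures) approximates the power of a two-proportion z-test: even if masks genuinely cut infection risk by a quarter, a trial of roughly this size would detect the effect only about a quarter of the time — so a non-significant result says little about whether masks work.

```python
from math import sqrt, erf

def normal_cdf(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def power_two_proportions(p1, p2, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test
    (normal approximation)."""
    se = sqrt(p1 * (1 - p1) / n_per_arm + p2 * (1 - p2) / n_per_arm)
    z_alpha = 1.959964  # critical z for two-sided alpha = 0.05
    z_effect = abs(p1 - p2) / se
    return normal_cdf(z_effect - z_alpha)

# Hypothetical numbers: 2% infection risk in the control arm, a genuine
# 25% relative reduction in the mask arm, 2,500 participants per arm.
pw = power_two_proportions(0.02, 0.015, 2500)
print(f"Power to detect the effect: {pw:.0%}")
```

Under these assumptions the power comes out around 27% — far below the conventional 80% target — and “no statistically significant difference” is the expected result even when the intervention works.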

Thirdly, the focus on ‘measuring what works’ creates risks in terms of how evidence is used to inform policy and practice, especially in terms of equity. These approaches are designed to answer the question “what works” on average, which is a blunt and often inappropriate guide to what should be done in a particular situation. “What works” on average can be ineffective or even harmful for certain groups; “what doesn’t work” on average might be effective in certain circumstances.
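A toy calculation (with invented numbers) shows how easily a positive average effect can coexist with harm to a subgroup:

```python
# Hypothetical illustration: a programme's effect on two groups.
# Numbers are invented, not drawn from any real evaluation.
groups = {
    "moderately disadvantaged": {"n": 800, "effect": +3.0},
    "highly disadvantaged":     {"n": 200, "effect": -4.0},
}

total_n = sum(g["n"] for g in groups.values())

# The headline "what works" number: a weighted average across everyone.
average_effect = sum(g["n"] * g["effect"] for g in groups.values()) / total_n

print(f"Average effect across all participants: {average_effect:+.1f}")
for name, g in groups.items():
    print(f"  {name}: {g['effect']:+.1f}")
```

The weighted average here is +1.6 — “it works” — while the most disadvantaged fifth of participants are made substantially worse off, which is exactly what a summary framed only around the average conceals.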

For example, the Early Head Start program in the USA aimed to support better attachment between parents and children and better physical and mental health outcomes for both. On average it was effective. But for the families with the highest levels of disadvantage it was not only ineffective, it was actually harmful — producing worse cognitive and socio-emotional outcomes for children than those in the control group. However, although the report identified this result, it was not highlighted in the short summary of findings, and the program is often included in lists of ‘evidence-based programs’ without any caveats — for example, the Victorian Government’s Menu of Evidence for Children and Family Services, which reports that it was found to be effective and that negative effects were “not found”.

These differential effects should not be explained away as a “nuance”, as they so often are, but treated as a central finding. Responsible public policy demands protecting the most vulnerable from further harm from government interventions. Summaries of findings framed around “what works” ignore this.

This simplistic focus on “what works” risks presenting evidence-informed policy as being about applying an algorithm where the average effect is turned into a policy prescription for all.

For example, the findings from the high-profile Women’s Health Initiative RCT of hormone replacement therapy (HRT) have often been reported as if they showed no protective effect and only increased health problems for all women taking HRT. In fact, the study showed benefits in terms of cardiovascular health and reduced risk of fracture for younger women who took HRT close to the time of menopause.

Current evidence-informed practice involves individualised decisions about HRT, taking into account each person’s context, including age, time since menopause and other risk factors, rather than a blanket recommendation not to take HRT based on the average effect.

How to mitigate these three problems

Firstly, make it clear that impact evaluation is not necessarily the most important type of evaluation, nor necessarily what government departments should focus on. Say it in communications and demonstrate it in action. Celebrate and showcase evaluative work that informs implementation and adaptive management as well as impact evaluation. There is much to be learned from approaches such as real-time and rapid evaluation, which aim to inform and improve delivery and outcomes during implementation, and from approaches that engage community members as co-evaluators and co-designers of the interventions intended to benefit them.

Secondly, make it clear that impact evaluation can draw on a range of designs and approaches which can be systematic, scientific and credible without the use of a counterfactual.

ACE’s Impact Evaluation Unit would be much more useful if it demonstrated the appropriate use of a range of impact evaluation approaches and designs.

The Impact Evaluation Unit could draw on the useful guidance and examples of non-counterfactual impact evaluation that have been developed over the past decade. These would include:

The 2012 report published by the then UK Department for International Development, “Broadening the Range of Designs and Methods for Impact Evaluations”;

The quality standards for realist evaluations, meta-narrative reviews and realist syntheses produced by the RAMESES projects;

The 2020 guide to Handling Complexity in Policy Evaluation from the UK’s CECAN centre — the Centre for Evaluating Complexity Across the Nexus (of the food, energy, water and environmental domains);

The way diverse evidence was brought together to draw conclusions about the effectiveness of masks for reducing COVID transmission, including evidence from particular incidents — a choir practice where there was mass transmission even though everyone washed their hands and did not share utensils, and a restaurant where only diners downwind of the infected person caught COVID;

The recent publication from the Independent Evaluation Group of the World Bank on “The Rigor of Case-Based Causal Analysis”, which demonstrates how case studies can provide more than nice illustrations but also be used to generate credible and generalisable evidence to inform policy;

The new ‘Methods Menu’ for evaluating policy and institutional reform developed by the International Initiative for Impact Evaluation and the Millennium Challenge Corporation;

The work of the new Causal Pathways network, which is building awareness, will, and skills to use non-experimental evaluation approaches for evaluating strategy and systems change;

The extensive collection of resources and guidance on evaluation methods, designs, approaches and processes on the open access knowledge platform BetterEvaluation, including (alphabetically) causal link monitoring, collaborative outcomes reporting, contribution analysis, narrative assessment, process tracing, qualitative comparative analysis, qualitative impact assessment protocol, rapid evaluations, realist evaluation and realist synthesis — in addition to counterfactual designs such as RCTs.

It should not be a matter of advocating for any particular evaluation approach, but encouraging an informed choice of what is most appropriate, taking into account the nature of what is being evaluated, the nature of the evaluation, and the resources available.

Thirdly, I would urge ACE to move away from its focus on ‘what works’ and ‘measuring what works’ to the more useful questions of “what works for whom in what ways and under what circumstances”, aiming to support differentiated policy advice depending on context.

The COVID pandemic showed how useful it is to bring together diverse, credible evidence that identifies, explores and tries to explain variation across different sites and groups of people, and then to use this to inform differentiated policy and practice. This is more valid and useful than expecting a single study, or even a meta-analysis of multiple studies, to produce The Answer about “what works”.

Current government policy, for example, appropriately provides different clinical guidelines for COVID vaccination for different age groups, and for the use of antiviral medicines for different risk groups, based on evidence about “what works for whom in what circumstances”, not a blanket finding about “what works”. Nor is this simply about reporting outcomes for sub-groups. Evidence needs to be created, shared and used in ways that are appropriate for a VUCA (volatile, uncertain, complex and ambiguous) world, for wicked problems, and for interventions in complex systems.

Evidence-informed policy needs a process for systematically generalising evidence to new contexts, not an assumption that it involves looking up “what works” and then applying it. [Edit: There are different ways of doing this, through understanding causal mechanisms and the contexts in which they work.]

The risks in the approach being taken to ACE are not merely academic concerns. They point to significant risks of both missed opportunities and real harm. At a time when Australia and the world confront multiple interlocking environmental, economic and social crises, we need evaluation to step up and help governments do better in complex systems, rather than misdirect effort and resources.

The Australian Centre for Evaluation needs to do better.