What is ‘good’ systems change and how do we measure it?

UNDP Strategic Innovation
16 min read · Dec 6, 2023

By Søren Vester Haldrup

UNDP has set up an M&E Sandbox to nurture and learn from new ways of doing monitoring and evaluation (M&E) that are coherent with the complex nature of the challenges facing the world today. Read more about the Sandbox here.

We convene a series of participatory sessions as part of the M&E Sandbox. In each session we collectively explore a theme in depth, inviting practitioners to speak about their experience testing new ways of doing M&E that help them navigate complexity. You can read digests and watch recordings of our previous Sandbox sessions here: using M&E as a vehicle for learning, measurement and complexity, progress tracking and reporting, and how to measure systems change. Do also consult our overview piece on innovative M&E initiatives and resources. You can now also join our M&E Sandbox LinkedIn group to connect, learn and share insights with like-minded Sandboxers.

In our most recent Sandbox session we explored the question: What is ‘good’ systems change and how do we measure it? This blog post provides a summary of the discussion and includes the recording, an overview of questions and answers from the discussion, as well as the resources shared during the session.

If this post has sparked your interest, I recommend that you watch the full recording right here:

Evaluating systems change is one of the big challenges many of us face, and we began to discuss it in our previous Sandbox webinar. But measuring whether change is happening in a system is one thing; evaluating whether that change is desirable is another. So how do we assess if change is good or bad? How can we introduce rigor into such a (subjective) assessment? And who defines what is good and bad change? Furthermore, if change takes a very long time to materialize, what do we look at in the meantime to get a sense of whether we are doing the right things?

We explored these questions with a deep dive into Laudes Foundation’s rubrics-based methodology for measuring systems change (a rubric is a framework that sets out criteria and standards for different levels of performance and describes what performance would look like at each level). We also heard more about UNDP’s own experience working with these questions. The panelists were: Lee Alexander Risby from Laudes Foundation, Katherine Haugh and Morgane Allaire-Rousse from The Convive Collective, Farzana Rahman and myself (Søren Vester Haldrup) from UNDP.

The session brought out a number of important themes. Three stand out to me:

  1. Measuring change is not the same as evaluating change: the latter requires criteria for what good and bad change look like. Rubrics can be a useful tool, and they can function as conversation starters for change agents about what is going on, what role they are playing, and what they should do next.
  2. We need to measure what matters rather than what is easy to measure. This requires broadening what data we rely on, and how we collect, analyze and present it.
  3. Critically reflect on who defines what ‘good’ looks like. Inclusion and participation of the people we are trying to help are essential.

Measuring ‘good’ systems change using rubrics

Lee, Katherine and Morgane kicked us off with an explanation of why the Foundation set out to develop and implement a rubrics-based methodology. Laudes was launched in 2020, responding to climate and inequality crises with a focus on green and fair industry transitions (read more about Laudes). As part of this work, the foundation was keen to help itself, its partners, and the wider field of philanthropy to understand their contribution to change, while learning and adapting to new and unforeseen circumstances.

Learning from past experience: Laudes Foundation (and its predecessor, C&A Foundation) had previously used a measurement system based on 25+ quantitative KPIs to measure progress and change, but this had limited utility: it did not generate discussion, nor were the KPIs used much beyond feeding into some beautiful dashboards. This experience taught the foundation three valuable lessons: 1) numbers alone don’t capture systems change (they tend to focus on what can easily be counted rather than what is most important), 2) KPIs have limited utility for facilitating decision-making without context, and 3) if you limit measurement to numbers, you limit learning.

Focus on what matters most: In this connection, Laudes looked to rubrics as a way to introduce a more nuanced approach to measuring change when dealing with a really complex problem — one focused on capturing what matters most rather than what is easiest to measure. This entailed developing a set of criteria to describe what good (desirable) and bad (undesirable) change looks like (i.e. how good is good) and the evidence required to assess whether such change is indeed happening.

Mixed methods: Measurement happens through a mix of methods: mixed methods are used to collect the evidence required to diagnose what is going on in the system, followed by a step focused on synthesis and interpretation of the data (what is it telling us?). In this work Laudes is primarily interested in the contribution question (are early and later changes happening and how did we contribute to them? Is the system changing as we expected and how did we contribute to it?) rather than attribution (are we doing the right things right?).

The rubrics and how they’re used: Laudes Foundation has developed 21 rubrics that work across different levels, from processes to long-term impact. When measuring a specific initiative, a smaller set of relevant rubrics is chosen and assessed on a rating scale from ‘harmful’ to ‘thrivable’. The 21 rubrics are categorised into four groups, with some natural overlap between categories B, C and D.
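To make this structure concrete, here is a minimal, hypothetical sketch (in Python) of how an evaluative rubric with a rating scale from ‘harmful’ to ‘thrivable’ could be represented. The intermediate rating labels, the category name and the level descriptors below are illustrative assumptions, not Laudes Foundation’s actual rubrics:

```python
from dataclasses import dataclass

# Illustrative rating scale only: 'harmful' and 'thrivable' are the endpoints
# described above; the intermediate labels are hypothetical.
RATING_SCALE = ["harmful", "unconducive", "conducive", "thrivable"]

@dataclass
class Rubric:
    name: str
    category: str                # e.g. one of the four rubric groups
    descriptors: dict[str, str]  # rating level -> what change at that level looks like

    def rate(self, level: str) -> int:
        """Return the position of a rating on the scale, for simple comparisons over time."""
        return RATING_SCALE.index(level)

# A hypothetical rubric, not one of the actual 21
policy_change = Rubric(
    name="Quality of policy change",
    category="Earlier changes in the system",
    descriptors={
        "harmful": "Policy developments actively work against a green and fair transition.",
        "unconducive": "Debate exists, but with little ownership or stakeholder involvement.",
        "conducive": "Broad stakeholder involvement and a positive trajectory of change.",
        "thrivable": "Change is widely owned and reinforces further shifts in the system.",
    },
)

print(policy_change.rate("conducive"))  # 2
```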

Laudes Foundation’s rubrics-based framework

The assessment of change against rubrics is intended to inform regular learning (rather than serving an accountability function). To do this assessment the Foundation synthesizes partner/grantee reports and draws on insights from programme teams as well as external contextual data and evidence. This happens during an annual sensemaking exercise. The sensemaking involves workshops at various levels (intervention level, cross-programme level, and management team level) and addresses questions such as: what is changing in the system or not, and why? What is working and not working? What are the challenges and blockers? What mission-critical issues should we monitor going forward? What are the options to adapt the strategy (and the ToC)? This way of working has generated a range of interesting insights and dynamics around new ways of working. Read more about Laudes’ learning.

Psychological safety and a balance between the short and long term: The team at Laudes have faced (and are still grappling with) a number of challenges. Two stand out. First, balancing the long and short view on change: systems change does not happen overnight, and the team at times sees very little change in the system from one year to the next, so what do they look at in the meantime to get a sense of whether they are making progress? Second, psychological safety: this is essential, but people may not feel comfortable sharing what they actually think, or speaking up and disagreeing. This requires a certain organizational culture, with people feeling comfortable with uncomfortable conversations and being willing to be open and vulnerable.

Tracking longer term change while monitoring interim progress

Next up, I (Søren) gave an overview of how UNDP is approaching the question of measuring systems change and how we seek to deal with some of the challenges that Laudes has been facing, such as how to balance the long and short view on change. I outlined an M&E Framework for a portfolio approach that is focused on serving three interrelated functions: 1) help us and partners to continuously learn and adapt from day one as we launch a portfolio focused on systems transformation, 2) enable us to capture system-level change (medium to long-term), 3) allow us to track and report on interim progress (in the short to medium term).

To enable us to serve these three functions, we have been working to rethink a range of M&E practices and tools, including the results framework. As a result we distinguish between documenting long-term systems change (using a mix of data to track shifts in the system) and interim progress, where we focus on momentum and learning as indications of progress. During the presentation I also shared a principles-based monitoring framework that we are piloting with the Bill and Melinda Gates Foundation. The principles function as interim progress measures that keep our focus on our overall objective while preserving our flexibility to iterate and adapt. You can read more about this in our recent blog here.

People-centric systems change: who defines what good is?

Last up, Farzana Rahman from UNDP Bangladesh shared her team’s experience adopting a more people-centric focus when discussing what good change looks like and how to measure it, not least putting marginalized groups at the centre of this discussion.

As part of this process, the team has sought to broaden their concept of measurement to include a mix of qualitative and quantitative sources of data and a wider range of data collection methods (such as ethnographic research and storytelling) to arrive at a more holistic assessment of what is going on and the felt experience of marginalized groups in the country. This work has required changes in the types of capabilities that the team needs (such as deep listening, qualitative research, and crafting stories) as well as in the mindsets of team members and funders in the country (shifting from a top-down accountability lens to one focused on empowering and supporting people out in the real world).

UNDP Bangladesh rethinking how they measure systems change

Questions and Answers

This section provides an overview of some of the many questions raised by the Sandbox community during the webinar and answers from Laudes and UNDP.

Questions and Answers — Laudes Foundation

Question from Sandbox Community: We have a set of rubrics for the DAC criteria, which are very general. So, should we tailor them according to the kind of program we evaluate?

Answer (Lee): Thanks. DAC criteria and rubrics are of course general, which is part of their attractiveness, particularly in the international development context. On the flip side, being general means they lack the specificity to deal with the intricacies of systemic change, particularly when this involves the corporate sector. So, I would agree that a bespoke approach is sometimes needed to understand how well change has happened / is happening (stitched specifically to a system-wide theory of change). In such situations rating relevance, effectiveness, efficiency etc. just does not provide a sufficient framework to inform decision-making. Lastly, and as a provocation — M&E needs to put use and utility front and center — it needs to be wanted by the audience (user) and spark joy. There is a very large graveyard of evaluations and monitoring frameworks which died because they were not used and / or useful (lots with DAC criteria as well). We need to be asking the users of M&E (governments, NGOs, corporates, foundation managers): what decisions do you need to take, and what kinds of evidence do you need to make those decisions better? Putting ourselves in a DAC criteria ‘box’ is not the starting point for an answer.

Answer (Katherine): Thanks! I would agree with Lee with the caveat that the joy of rubrics is that they provide consistency and structure. Making them too bespoke could make them have high utility for one program, but be completely inapplicable to another. So my suggestion is to do exactly as Lee suggests in his provocation, use the rubrics tools available to you, and focus more on bespoke work with communicating the “so what” of the rubric ratings to multiple audiences. Your rubric data shouldn’t be your only source of data, so the other sources will come in to help bring the data to life and have it be tailored for your audiences and the key decisions they are making.

— — —

Question from Sandbox Community: Many times we work with organizations where the most important thing they can do for the health of the system is to preempt negative change, rather than hope for positive change. And keeping change at bay may be the “point of light” one needs to celebrate. So how do we use the rubric to help us judge and tell that story, and not underestimate the value of that effort of fighting back against powerful negative trends?

Answer (Lee): Good question. In the Laudes use of rubric-based measurement, we have eight rubrics which cover ‘earlier changes’ in the system. For example, when we and our partners use the B1 rubric to assess the quality of policy changes related to climate and transitions, we are interested in looking at stakeholder involvement (who), ownership (depth and breadth), inter alia, alongside the trajectory of change. Hence, if a rating is ‘unconducive’ overall, we still get an indication of the trajectory of movement (positive, neutral or negative), and thus where it is important to continue investing resources and time. We are also currently working with our programme teams to identify those short-term indicators of change for each rubric that can be used to code those all-important trajectories and ‘points of light’, despite an overall system rating which is harmful or unconducive.

Answer (Katherine): Couldn’t agree more with Lee! I would add that we are supplementing the rubric data collection with context monitoring so we can see that certain contexts may be very conducive, conducive, not conducive or very unconducive. That helps us know: if we aren’t seeing change on the rubrics it doesn’t mean we are necessarily doing the wrong thing. It means that the challenges we are up against (in Laudes’s context — tackling the global climate crisis and equity) are extremely complex and require years (generations even) of continuous effort to see change. While that’s true, we also want to see points of light as Lee mentioned above for interim sensemaking and decision-making.

— — —

Question from Sandbox Community: You seem to assume that systems change happens quickly in your area of work, but that is not an assumption that holds for many types of systems/problems. So perhaps it is critical to think about how to adapt the rubric to the underlying theory of change(s) / causal pathways at work and to the organization, so they contribute to evidence and storytelling that make sense?

Answer (Lee): Thanks. Actually, no, we do not assume that change across the fashion, finance and building industries will happen quickly. Indeed, our theories of change operate on multiple timescales over a decade — and even that is a short time span. We use the rubric-based measurement to inform (alongside other factors) our annual strategic adaptation processes. Evidence and stories only make sense if they are linked to specific strategic decisions that need to be taken. Our job as evaluators is to work with programme teams and strategists to identify what decisions are on the horizon and what evidence is needed to make them effective.

Answer (Katherine): In addition to what Lee has shared, Laudes’s rubrics are built on the underlying theory of change and each individual causal pathway that teams use. You can learn more about that here.

— — —

Question from Sandbox Community: I am curious about how the process laid out (synthesis of rubrics and sensemaking on pathways) includes partners and is made useful to them, raising the collective system awareness of the group of partners and enabling sensemaking and coherence that they otherwise couldn’t achieve on their own?

Answer (Lee): At present we only involve Laudes programme staff, the management team and the board in sensemaking as part of the developmental evaluation. One has to start with gradual steps to embed sensemaking internally. Of course our partners’ learning reports (which are submitted once per year) are a vital ingredient for sensemaking. Outside of sensemaking we convene partners so they can discuss their progress and challenges together, and we have a Partner Fund that provides resources to enable collective learning. In the future, I would see involving partners more actively in sensemaking as important, but we want to train that sensemaking muscle first with the programme teams.

Answer (Katherine): Lee said it perfectly. I would add that Laudes is very thoughtful with the requests they make of partners’ time and in that vein has decided to focus first on getting the sensemaking muscle and the process down right internally before bringing them into something that is still coming together.

— — —

Question from Sandbox Community: I would be quite interested to hear a bit more about how you concretely developed the rubrics. How did you ensure buy-in within your organization (and indeed draw on the wisdom and knowledge of the broader field/partners)?

Answer (Lee): The basis for developing the rubrics was our theories of change (there is never one systems change theory), and underlying those we have 14 or so causal pathways across the industries. From the causal pathways we synthesized common early / later changes etc. (see the ToC) and from that distilled the rubric measures. During this process we actively involved our programme managers, and some partners. The lead experts who worked with us on all of this were Dr. Jane Davidson and Dr. Thomaz Chianca, two well-known rubric experts. They continue to work closely with us (see: Home | Real Evaluation).

— — —

Question from Sandbox Community: What software are you using to create/visualise the theory of change with dropdown menus for seeing different actors and approaches?

Answer (Lee): Thank you for asking. Pimcore was the platform used for the interactive Theory of Change (developed by Schuttelaar and Partners). We are now moving to another CMS called Umbraco, and it should function in the same way.

Answer (Katherine): I’d also recommend trying out Sankey charts for visual theories of change like this website.
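As an illustration of this suggestion, here is a minimal Sankey sketch using plotly; the stages, links and weights below are purely hypothetical placeholders, not an actual theory of change:

```python
import plotly.graph_objects as go

# Hypothetical theory-of-change stages; labels and flow weights are illustrative only.
labels = ["Grantmaking", "Advocacy", "Policy debate shifts",
          "Industry practice shifts", "Policy adopted", "Green & fair transition"]

fig = go.Figure(go.Sankey(
    node=dict(label=labels, pad=20, thickness=15),
    link=dict(
        source=[0, 0, 1, 2, 3, 4],  # indices into `labels` (flow origins)
        target=[2, 3, 2, 4, 5, 5],  # indices into `labels` (flow destinations)
        value=[2, 1, 2, 3, 1, 3],   # illustrative flow weights
    ),
))
fig.show()
```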

— — —

Question from Sandbox Community: Do you use any AI tools for report synthesis or other tasks? If you do, which tools are these? Thank you!

Answer (Lee): Thanks. We are just developing that. Of course, using AI to summarize vast amounts of partner learning reports and contextual data quickly and accurately, in response to questioning, will save time and money in the future. But it does not remove the human element needed to check the outputs.

Answer (Katherine): Agree with Lee. You can use something as simple as NVivo or something more sophisticated like ChatGPT or other natural language processing software. Of course, you’ll need to be mindful of personal information.

— — —

Question from Sandbox Community: I currently work as a MEL specialist at Dalberg. What is your advice for bringing a systems lens to consulting projects? With tight timeframes it can sometimes be tricky.

Answer (Lee): Thank you for your interest! I would say — visit published sources of systems thinking and think about what applies to you! Adopting a systems change approach might not be for every organization as not all have a clearly stated systems change goal and approach.

— — —

Question from Sandbox Community: Which tools can be used to measure changes at the project level?

Answer (Lee): Thank you for your interest. There are many different tools being used to capture the most important changes at project level. If you want to escape the usual logframe and want to capture unintended changes, you could turn to outcome harvesting techniques and approaches. If you want to keep track of changes in a comparable and standardized way that describes what “good” would look like and that allows project staff to look at both quantitative and qualitative data, I would say: why not use evaluative rubrics? :)

— — —

Question from Sandbox Community: How do you deal with the ‘dark matter’? — the fact that some changes just aren’t visible until something changes, e.g., a policy-maker’s opinion might be shifting without them doing anything different… until suddenly they do do something different. Somebody described this to me as like gardening — bulbs underground where you have no idea which ones are growing (or even still alive) until they pop up. That is, how do you deal with the unobservables / missing data which indicates progress?

Answer (Lee): Great question, Caroline. One we have also been debating for the past year. When embracing the idea of systems change, progress might actually be about changes in debates or narratives, or changes in how actors are collaborating around something. In your example, you might not immediately see policy change, but you might see increased and improved policy debates on a certain subject, or you might see new stakeholders becoming a part of these debates… you might see usually marginalised voices being heard and taken into account, which is significant change in the right direction (of a new policy being approved/implemented).

Answer (Katherine): I would agree with this with the addition that, in my experience, you can almost always pick up on smaller changes ahead of bigger ones if you are really paying attention to the right thing.

— — —

Question from Sandbox Community: Have you played with trend data? … so quant and narrative data plotted on a timeline. Working with ecological and human systems, it is always hard to assess progress / momentum with static snapshots.

Answer (Katherine): Great question and idea! Yes and no. Even with trend data, we’d still be looking at a snapshot in time. Which is why we are creating shorter-term outcomes to track and a continuous context monitoring process to put those outcomes in context.

— — —

Question from Sandbox Community: As development is normative (we are spending money in pursuit of an outcome that systems are not delivering), I am interested to hear your perspectives on how we evaluate options comparatively and look at value for money without transparently comparable indicators and relative contribution to system changes. If we can’t answer the question, then the intervention (and not only the outcome) becomes normative — we just do what we want to do.

Answer (Katherine): Wonderful question. This is exactly what we are working on now. We’re creating a visual architecture to see the progress against short-term outcomes identified (expect to see, like to see, love to see) and context (not conducive to further outcomes, conducive, very conducive). We are indeed looking at value for money but with comparable indicators and perceived contribution to systems change. It’s quite in the details, so we’d need to explain it more fully. Maybe for a next Sandbox!

Questions and Answers — UNDP Bangladesh

Question from Sandbox Community: how do you capture the stories and how do you make sense of them?

Answer (Farzana): Thanks for your curiosity. We engage with the community ourselves and sometimes rely on civil society organizations and community-based organizations to capture the stories. We then try to break down the stories into our thematic work areas to see whether the changes these stories indicate are well aligned with the changes we wanted to make. However, we are new to this approach.

Additional resources

Here’s a list of resources shared during the session:

If you would like to join the M&E Sandbox and receive invites for upcoming events, please reach out to contact.sandbox@undp.org.

A bit more about the speakers:


UNDP Strategic Innovation

We are pioneering new ways of doing development that build countries’ capacity to deliver change at scale.