Mapping early years practice: building a robust clustering pipeline

Rachel Wilcock
Data science at Nesta
Nov 22, 2022

This article has been co-written by Rachel Wilcock and Juan Mateos-Garcia.

Introduction

Nesta’s fairer start mission has one goal driving the work it does: to narrow the disadvantage gap between children on free school meals (FSM) and their peers. The work is focused on children aged 0–5, as we know that the trajectory of one’s life is shaped most significantly in those early years. Children born into disadvantage are far more likely to experience poorer health, lower earnings, a shorter life expectancy and lower levels of happiness than their peers.

As such, we would expect Local Authorities (LAs) with higher levels of deprivation to see poorer outcomes at age 5. Here, we use the Early Years Foundation Stage Profile (EYFSP), and in particular its Good Level of Development (GLD) measure, as our outcome.

The EYFSP is an assessment conducted at age 5 in England, in the final term of Reception. It is organised into seven broad categories:

  • Communication and Language
  • Physical Development
  • Personal, Social and Emotional Development
  • Literacy
  • Maths
  • Understanding of the world
  • Expressive arts and design

Within these there are further breakdowns called Early Learning Goals (ELGs). For example, Physical Development is separated into fine and gross motor skills, and Expressive arts and design is separated into Creating with Materials and Being Imaginative and Expressive. If a child meets the expected level of development on all of these ELGs, then they are meeting a “Good Level of Development”. For more information on the EYFSP, see the EYFSP handbook.

We can look at trends in the EYFSP and the GLD across England, as every LA reports its results. These are also broken down for children on free school meals (FSM), children with special educational needs and disability (SEND), and different ethnicities. By linking LA data on the Index of Multiple Deprivation (IMD) to the percentage of children reaching a GLD, we can explore the hypothesis that higher deprivation levels lead to poorer outcomes at age 5.
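
As a minimal sketch of this linking step, assume each LA has a mean IMD decile (higher decile means less deprived) and a GLD percentage for all children and for FSM children. The direction of each trend can then be summarised with a Pearson correlation. The records below are made-up illustrative values, not EYFSP or IMD data, and the names `la_records` and `pearson` are hypothetical.

```python
from math import sqrt

# Illustrative, made-up records (NOT real EYFSP/IMD figures):
# (LA name, mean IMD decile, % all children at GLD, % FSM children at GLD)
la_records = [
    ("LA A", 2.0, 68.0, 58.0),
    ("LA B", 4.0, 70.0, 56.0),
    ("LA C", 6.0, 72.0, 54.0),
    ("LA D", 8.0, 75.0, 51.0),
]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


imd = [r[1] for r in la_records]
r_all = pearson(imd, [r[2] for r in la_records])
r_fsm = pearson(imd, [r[3] for r in la_records])
print(f"all children: r = {r_all:.2f}; FSM children: r = {r_fsm:.2f}")
```

A positive coefficient means the GLD percentage rises as areas become less deprived, and a negative one means it falls. A correlation only summarises direction and strength, so a scatter plot per group remains the more informative view.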

Two graphs showing two trends. 1) If you’re not on free school meals, you are more likely to reach a good level of development if you live in less deprived areas and 2) if you are on free school meals, you are less likely to reach a good level of development if you live in less deprived areas.
Figure 1: Pink dots are London boroughs, blue dots are all other LAs in England. The yellow line shows the national average percentage of children reaching a Good Level of Development. (Left) The mean percentage of all children reaching a Good Level of Development in the Early Years Foundation Stage Profile plotted against the mean Index of Multiple Deprivation (IMD) decile for the LA. (Right) The mean percentage of children on free school meals (FSM) reaching a Good Level of Development in the EYFSP plotted against the mean IMD decile for the LA.

Figure 1 shows that the trend of less deprived LAs having a greater percentage of children reaching a GLD holds true for all children (left plot). When we look only at children on free school meals (FSM), however, the trend is reversed: FSM children in less deprived LAs have poorer outcomes than FSM children in more deprived LAs. There are some exceptions, particularly in the London boroughs, where children are doing better than in comparable LAs of similar deprivation.

So what are the LAs which are positively deviating from this trend doing that other LAs could learn from? Each LA will face different challenges, and will have different policies and practices in place. A rural county will not provide services in the same way as an inner city urban area. However, if we can group LAs who share similar circumstances, perhaps we can learn from the LAs who have more children reaching a Good Level of Development (GLD) than their counterparts. If we can do this, then we can disseminate suitable practices to LAs in the same groups who have fewer children reaching a GLD.

There are many ways to group LAs, the local authority interactive tool (LAIT) being one of them, but we wanted to create an alternative, open-source clustering using publicly available datasets. LAIT includes data on child protection, children’s health, pupil attainment and judgements from Ofsted, but we wanted to take into account the environment in which a child grows up and include more data on parental circumstances. This means that LAs clustered together are more likely to have similar circumstances, and thus more likely to share challenges with each other than with LAs outside their cluster. This broader, environmental consideration is a key reason why we often can’t directly compare urban and rural areas (Rural-Urban Classification, Government Statistical Services).

Methodology

We have developed a robust clustering pipeline in order to group the LAs together, based on similar demographics and environments. Feeding into this pipeline are two datasets: the Fingertips Public Health England dataset and the Consumer Data Research Centre (CDRC) Access to Healthy Assets and Hazards (AHAH) data. These were chosen due to their comprehensive coverage of environmental and demographic factors which will define an LA.

  • Fingertips contains a wide variety of health data, including data on the prevalence of mental ill health, substance misuse and domestic abuse. Known as the “trio of vulnerabilities”, these are key risk factors considered in safeguarding, which may have an effect on a child’s ability to develop at the same rate as their peers. It also includes demographic information and deprivation levels.
  • The AHAH contains data on access to public services, such as GPs and pharmacies, but also pollution levels, and the amount of green space in an LA.

With around 400 variables, the combined Fingertips and AHAH datasets provide an incredibly well-rounded description of the LAs.

With the EYFSP as our outcome measure, and Fingertips and AHAH as our input datasets, we feed these into the robust clustering pipeline, which consists of five key stages, outlined in Figure 2.

Graphic setting out the five steps in the pipeline. 1) Principal component analysis, 2) UMAP, 3) running three clustering algorithms, 4) creating a network graph and 5) examining the clusters.
Figure 2: The steps in the pipeline to generate the clusters.

This methodology involves dimensionality reduction techniques that compress our wide dataset into a smaller number of informative dimensions, and then uses these to cluster LAs based on their similarities and differences.
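
To make the first compression step concrete, here is a minimal NumPy sketch of principal component analysis on a standardised LA-by-variable table. It is an illustration under stated assumptions, not the pipeline's own code: the function name `pca_reduce`, the table size and the number of components are all hypothetical, and the input is random stand-in data. The UMAP stage is not shown, since it needs a separate library.

```python
import numpy as np


def pca_reduce(X, n_components):
    """Project rows of X onto the top principal components.

    X is an (n_LAs, n_variables) array; columns are standardised first so
    that variables on different scales contribute comparably.
    """
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # constant columns become all zeros instead of dividing by 0
    Z = (X - X.mean(axis=0)) / std
    # SVD of the standardised data: rows of Vt are the principal axes.
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)  # share of variance per component
    return Z @ Vt[:n_components].T, explained[:n_components]


# Random stand-in for the LA-by-variable table (NOT the real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 40))
reduced, explained = pca_reduce(X, n_components=5)
print(reduced.shape, explained.round(3))
```

The returned `explained` array gives the share of variance captured by each kept component, which is one common way to decide how many components to carry forward.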

Why do we call this pipeline robust? There are many different clustering algorithms, each of which can be run with different parameters, and the results can vary subtly between runs, for example depending on their random initialisation values. We seek to ensure that our results are not driven by our choice of clustering algorithm or parameters by implementing an ensemble of clustering algorithms (K-means clustering, Affinity propagation, Gaussian mixture) over a grid of parameters. The ensemble yields a network in which LAs that are frequently clustered together are more strongly connected.
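
The network-building idea can be sketched as follows, assuming the ensemble has already produced one set of cluster labels per run. Edge weights here are the fraction of runs in which two LAs share a cluster. The function name `co_clustering_weights` and the three example runs are hypothetical, and the pipeline itself may weight edges differently.

```python
from collections import Counter
from itertools import combinations


def co_clustering_weights(assignments):
    """Count how often each pair of LAs lands in the same cluster.

    `assignments` is a list of dicts, one per ensemble run, mapping
    LA name -> cluster label. Returns {(la_i, la_j): fraction of runs together}.
    """
    counts = Counter()
    for labels in assignments:
        for a, b in combinations(sorted(labels), 2):
            if labels[a] == labels[b]:
                counts[(a, b)] += 1
    n_runs = len(assignments)
    return {pair: c / n_runs for pair, c in counts.items()}


# Hypothetical label outputs from three runs (e.g. different algorithms or
# parameters); label values are arbitrary within each run.
runs = [
    {"LA A": 0, "LA B": 0, "LA C": 1, "LA D": 1},
    {"LA A": 1, "LA B": 1, "LA C": 0, "LA D": 1},
    {"LA A": 0, "LA B": 0, "LA C": 0, "LA D": 1},
]
weights = co_clustering_weights(runs)
for pair, w in sorted(weights.items()):
    print(pair, round(w, 2))
```

Because only co-membership is counted, it does not matter that cluster labels are numbered differently from one run to the next.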

We then decompose the network into clusters of LAs using the Louvain community detection algorithm, which looks for partitions of the network that maximise its modularity. We tune some of the parameters in this pipeline based on their ability to generate clustering assignments that maximise differences in early years outcomes across clusters; this is the consequential outcome that we are interested in understanding.
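
To illustrate what "maximising modularity" means, the sketch below computes weighted modularity for candidate partitions of a tiny graph. It does not implement Louvain itself, only the quantity Louvain tries to maximise; the function name, the toy edges and the partitions are hypothetical.

```python
def modularity(edges, partition):
    """Weighted modularity of a partition of an undirected graph.

    `edges` maps (u, v) -> weight (each undirected edge listed once, no
    self-loops); `partition` maps node -> community label.
    """
    two_m = 2.0 * sum(edges.values())
    # Strength of a node = total weight of the edges attached to it.
    strength = {}
    for (u, v), w in edges.items():
        strength[u] = strength.get(u, 0.0) + w
        strength[v] = strength.get(v, 0.0) + w
    # Edge weight falling inside each community (counted in both directions).
    internal = {}
    for (u, v), w in edges.items():
        if partition[u] == partition[v]:
            c = partition[u]
            internal[c] = internal.get(c, 0.0) + 2.0 * w
    # Total strength of the nodes in each community.
    total = {}
    for node, k in strength.items():
        c = partition[node]
        total[c] = total.get(c, 0.0) + k
    return sum(internal.get(c, 0.0) / two_m - (total[c] / two_m) ** 2 for c in total)


# Toy co-clustering network: two tightly linked pairs joined by a weak edge.
edges = {("A", "B"): 1.0, ("C", "D"): 1.0, ("B", "C"): 0.2}
q_pairs = modularity(edges, {"A": 0, "B": 0, "C": 1, "D": 1})
q_mixed = modularity(edges, {"A": 0, "C": 0, "B": 1, "D": 1})
q_single = modularity(edges, {"A": 0, "B": 0, "C": 0, "D": 0})
print(round(q_pairs, 3), round(q_mixed, 3), round(q_single, 3))
```

The partition that keeps the strongly connected pairs together scores highest, which is the behaviour a modularity-maximising algorithm exploits.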

Results

The final clustering is shown in Figure 3 below, with seven clusters (numbered 0–6) created. The clusters are spread across England and do not follow the “typical” pattern of geographical north/south divides.

We can also see the difference between the LAIT clusters and our own. For example, the Children’s Services Statistical Neighbour Benchmarking tool would place Birmingham close to Luton, Sandwell and Wolverhampton. By contrast, in our clustering Birmingham (the largest LA in Cluster 1) is grouped with Sandwell and Wolverhampton but not Luton, and the cluster also includes Manchester, Bradford, Middlesbrough and Leicester (amongst others).

Each cluster is characterised by the variables in the final, dimensionality-reduced dataset on which it differs most from the other clusters. The clusters can therefore be described as follows:

  • Cluster 0 — LAs with high deprivation and lower than average life expectancies.
  • Cluster 1 — Bigger cities outside of London, characterised by a higher cardiovascular disease prevalence and more children in low income families than average.
  • Cluster 2 — LAs with low deprivation and a higher than average life expectancy.
  • Cluster 3 — London boroughs (excepting Richmond-upon-Thames, Bromley and Greenwich), with significantly higher than average pollution.
  • Cluster 4 — More rural LAs with an ageing population.
  • Cluster 5 — Metropolitan areas in the North of England with higher than average hospital admissions and a higher crime rate
  • Cluster 6 — LAs containing a high proportion of commuter towns with a higher than average working age population.
Map showing the clusters across England, descriptions in the main text.
Figure 3: Map of the clusters generated.
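
One simple way to find the variables on which a cluster differs most from the rest is to compare the cluster's mean with the mean of all other LAs, scaled by each variable's overall standard deviation. The sketch below does this on made-up values. It is not necessarily how the pipeline ranks variables, and the function name, variable names and numbers are all hypothetical.

```python
from statistics import mean, pstdev


def distinctive_variables(rows, cluster_of, cluster, top_n=2):
    """Rank variables by how far a cluster's mean sits from the other LAs' mean.

    `rows` maps LA -> {variable: value}; `cluster_of` maps LA -> cluster label.
    Differences are divided by each variable's overall standard deviation so
    that variables on different scales are comparable.
    """
    variables = list(next(iter(rows.values())).keys())
    scores = {}
    for var in variables:
        inside = [r[var] for la, r in rows.items() if cluster_of[la] == cluster]
        outside = [r[var] for la, r in rows.items() if cluster_of[la] != cluster]
        spread = pstdev(inside + outside)
        scores[var] = (mean(inside) - mean(outside)) / spread
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)[:top_n]


# Made-up values for illustration only.
rows = {
    "LA A": {"pollution": 9, "pct_over_65": 12, "green_space": 20},
    "LA B": {"pollution": 8, "pct_over_65": 13, "green_space": 22},
    "LA C": {"pollution": 3, "pct_over_65": 14, "green_space": 21},
    "LA D": {"pollution": 2, "pct_over_65": 13, "green_space": 19},
}
cluster_of = {"LA A": 3, "LA B": 3, "LA C": 4, "LA D": 4}
top = distinctive_variables(rows, cluster_of, cluster=3)
print(top)
```

The sign of each score shows whether the cluster sits above or below the other LAs on that variable, which is what a description such as "higher than average pollution" summarises.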

Now that we have the clusters, we can look at how the percentage of children on FSMs reaching a Good Level of Development varies between them. We’ve normalised this using the z-score, which subtracts the mean from each value and divides the result by the standard deviation. We compare different clusters in Figure 4.
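
The z-score calculation can be sketched in a few lines. The percentages below are made up for illustration, and the use of the population standard deviation (rather than the sample one) is an assumption.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Subtract the mean from each value and divide by the standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Made-up percentages for illustration only (not EYFSP results).
fsm_gld_pct = [50.0, 54.0, 58.0, 62.0]
z = z_scores(fsm_gld_pct)
print([round(v, 2) for v in z])
```

After this transformation the values have mean 0 and standard deviation 1, so a z-score of 1 reads as "one standard deviation above the average LA".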

Box plot showing the z-score and how it varies between and within clusters for the percentage of children on free school meals reaching a good level of development. Cluster 3 which contains London boroughs does considerably better.
Figure 4: Box plot showing the variation in z-score for each cluster for the percentage of children on free school meals reaching a Good Level of Development.

Most noticeable is that Cluster 3 (the London boroughs) has a higher percentage of children on FSMs reaching a GLD than the other clusters, and Cluster 2 (the LAs with lower than average deprivation) has the worst results. However, it is unlikely that the situation of a child on FSMs in a London borough is the same as a child on FSMs in Hampshire, for example. This is why we can’t compare between the clusters, and instead need to focus within the clusters.

Figure 5 illustrates this variation within the clusters. Each individual dot is an LA, and dots towards the red end of the spectrum have a greater percentage of children on FSMs reaching a GLD, whereas those towards the blue end have fewer children on FSMs reaching a GLD. For this figure, we have normalised performance inside each cluster (by calculating z-scores within the cluster), with the goal of identifying outliers in a consistent way across clusters.

A downside is that, in that chart, we can’t compare performance between the clusters because each distribution has been normalised by a different mean and standard deviation. This is consistent with our goal of comparing LAs that face similar challenges to one another.
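
A minimal sketch of within-cluster normalisation, assuming each LA comes with a cluster label and a single outcome value. The records are made up, the function name is hypothetical, and every cluster is assumed to contain some variation (otherwise its standard deviation would be zero).

```python
from collections import defaultdict
from statistics import mean, pstdev


def within_cluster_z(records):
    """z-score each LA's value against the mean and std of its own cluster.

    `records` is a list of (la_name, cluster_label, value) tuples.
    """
    by_cluster = defaultdict(list)
    for _, cluster, value in records:
        by_cluster[cluster].append(value)
    stats = {c: (mean(v), pstdev(v)) for c, v in by_cluster.items()}
    return {
        la: (value - stats[cluster][0]) / stats[cluster][1]
        for la, cluster, value in records
    }


# Made-up values for illustration only (not EYFSP results).
records = [
    ("LA A", 0, 50.0), ("LA B", 0, 54.0), ("LA C", 0, 58.0),
    ("LA D", 3, 60.0), ("LA E", 3, 64.0), ("LA F", 3, 68.0),
]
z_within = within_cluster_z(records)
print({la: round(z, 2) for la, z in z_within.items()})
```

In this toy example "LA C" and "LA F" receive the same z-score even though their raw percentages differ by ten points, which shows why within-cluster scores cannot be compared across clusters.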

Graph showing individual dots for local authorities and how they vary within clusters. Y axis is the z-score for the percentage of children on free school meals reaching a Good Level of Development. Wide spread for each cluster from +3 to -2.2 showing variations within clusters.
Figure 5: The z-score of each LA in each cluster for the percentage of children on free school meals reaching a Good Level of Development.

So where do we go next? We know that the data alone isn’t enough to unpick what the LAs are doing on the ground, and therefore we need to combine this clustering with qualitative information that we intend to collect in the next phase of the project.

Conclusions and next steps

Our next step is to conduct an England-wide survey of LAs and their policies and practices in the early years. It will be the largest, most in-depth piece of research into Early Years service delivery ever conducted in the UK, and will hopefully help us to discover what the LAs that have a high percentage of FSM children reaching a GLD are doing differently. We want to identify best practice in these areas and support LAs with similar circumstances to implement promising, suitable practices in their own Early Years services and programmes.

However, this does not mean that the clustering work will remain static. It will be updated with the latest EYFSP data once it has been released, and we are always looking for new, open datasets to include in this work, for example the new 2021 Census data. The code for the robust clustering pipeline can be found on the Nesta GitHub. The clustering pipeline can also be adapted and used in other projects, and other Nesta projects using this methodology are already in the works.

This project highlights the importance of many aspects of data science, from the championing of open data and code, to the combining of the quantitative results with qualitative work to provide a well-rounded, more informed view and generate extra insights into the findings. All of these are core values of the Data Analytics Practice at Nesta, and we hope these values are shared by many others working in the field.

If you’re interested in more of DAP and Nesta’s work, visit the Nesta website or have a browse through the Data analytics at Nesta Medium page.

Useful glossary of terms

AHAH

Access to Healthy Assets and Hazards

A dataset containing information on, amongst others, access to GPs, pharmacies, fast food outlets, green space, blue space and air quality.

CDRC

Consumer Data Research Centre

A collaboration between the University of Leeds, University College London, the University of Liverpool and the University of Oxford which gathers public and private consumer data to produce novel datasets, research and insights.

EYFSP

Early Years Foundation Stage Profile

The Early Years Foundation Stage Profile is conducted at age 5 in England and is made up of a number of different measures intended to capture a child’s stage of development.

FSM

Free School Meals

You are eligible for Free School Meals if you receive Universal Credit, and your household earns less than £14,000 per year.

GLD

Good Level of Development

A Good Level of Development is defined as having reached the Expected Level of Development in all the Early Learning Goals in the EYFSP as well as within literacy and maths.

IMD

Index of Multiple Deprivation

The IMD combines a number of deprivation measures to give an overall index for areas in England.

LA

Local Authority

A governing district in England and Wales.

LAIT

Local Authority Interactive Tool

LAIT is a way for LAs to compare themselves against other LAs. The measure is based upon a number of different variables, including pupil attainment, child protection, children looked after by LAs and judgements from Ofsted.

Rachel Wilcock
Data science at Nesta

Senior Data Science Lead in the A Fairer Start Mission @ Nesta. Interested in using data in a fair and equitable way to improve outcomes for children.