Hierarchical Models for Data and Policy, and a Walk-through Tutorial!

By Maharshi Dhada and Jazmin Labra Montes

Data & Policy Blog
Sep 23, 2024

Introduction

Modern societies are complex and interconnected. Data and policy play a key role in maintaining a high quality of life, which relies on analysing data acquired from a variety of sources such as carbon emissions, engineering telemetry, and individual health records. The resulting insights may involve understanding individual behaviours while also evaluating general societal trends to drive policy planning.

This article discusses a practical data analysis technique, statistical multilevel models (also called statistical hierarchical models), which is an indispensable tool for policymakers, researchers, and data analysts alike. Hierarchical models are a popular choice for representing and analysing data that are grouped or nested within various levels of context. For example, individuals are often nested within households, households within neighbourhoods, neighbourhoods within cities, and so on. Such nested data are encountered frequently in human society, engineering fleets, manufacturing processes, disease spread, etc.

Analysing nested data using conventional techniques is often challenging. The data from a single individual often carry too little information for a statistical model to make confident predictions, a situation of high variance due to sparse data. On the other hand, pooling together the data from all individuals may erase individuality, so that the model does not in fact represent any individual in the population. This is the case of high bias due to complete pooling. Hierarchical models systematically balance these two extremes, and allow for the examination of both micro-level (individual or local) and macro-level (regional or national) influences on outcomes of interest.
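To make the trade-off concrete, here is a minimal sketch on simulated data, using only the Python standard library. The group sizes, means, and variances are made up for illustration, and the variances are treated as known, which a full hierarchical model would instead estimate. It compares the estimate from each group alone, the estimate from pooling everything, and a precision-weighted compromise between the two.

```python
import random
import statistics

random.seed(0)

# Assumed toy setting: 5 groups share a population mean of 10, each group has
# its own true mean, and group 0 has only 2 observations (sparse data).
POP_MEAN, BETWEEN_SD, WITHIN_SD = 10.0, 1.0, 3.0
sizes = [2, 40, 40, 40, 40]
true_means = [random.gauss(POP_MEAN, BETWEEN_SD) for _ in sizes]
data = [[random.gauss(m, WITHIN_SD) for _ in range(n)]
        for m, n in zip(true_means, sizes)]

# Complete pooling: one estimate for everyone (low variance, high bias).
grand_mean = statistics.fmean([x for group in data for x in group])


def partial_pool(group):
    """Precision-weighted compromise between the group mean and the grand mean."""
    n = len(group)
    weight = (n / WITHIN_SD**2) / (n / WITHIN_SD**2 + 1 / BETWEEN_SD**2)
    return weight * statistics.fmean(group) + (1 - weight) * grand_mean


for j, group in enumerate(data):
    no_pool = statistics.fmean(group)  # no pooling: noisy when n is small
    print(f"group {j} (n={len(group)}): true={true_means[j]:.2f} "
          f"no-pool={no_pool:.2f} pooled={grand_mean:.2f} "
          f"partial={partial_pool(group):.2f}")
```

The sparse group is pulled strongly towards the population estimate, whereas the groups with 40 observations stay close to their own means. This is the behaviour a hierarchical model produces automatically.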

A generalised hierarchical model represents lower level data (such as an individual or a neighbourhood) using corresponding individual models, whose parameters are in turn sampled from a higher level model shared among similar individuals. The individuals who share a higher level distribution can be identified or clustered based on various similarity evaluations, for example statistical measures or expert knowledge. As a result, the higher level model represents the general behaviour of the population, whereas individual behaviours are represented by the lower level distributions. At the same time, the higher level model provides prior information when an individual's data are too sparse for the individual model to make confident predictions on its own.
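In symbols, this two-level structure can be sketched as follows. The notation is our own shorthand and not taken from a specific reference: y_ij is the i-th observation of individual j, theta_j collects the parameters of that individual's model, and phi collects the parameters of the shared higher level model.

```latex
\begin{aligned}
y_{ij} &\sim p\left(y \mid \theta_j\right), & i &= 1, \dots, n_j \\
\theta_j &\sim p\left(\theta \mid \phi\right), & j &= 1, \dots, J \\
\phi &\sim p\left(\phi\right)
\end{aligned}
```

When n_j is small, the posterior for theta_j is dominated by the shared distribution p(theta | phi). When n_j is large, the individual's own data dominate.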

By accounting for the variance at each level of this hierarchy, multilevel models offer a more nuanced understanding of the data, modelling the dynamics that single-level or completely pooled models might overlook. Figure 1 shows a schematic diagram for a basic hierarchical model with individual models and a single higher level model.

Figure 1. Schematic diagram of a basic hierarchical model [1]

Hierarchical Models Use Cases

Hierarchical models are indispensable for modern Data and Policy, since:

  1. As discussed above, these models provide a more accurate estimation of effects by acknowledging the nested structure of data, thus avoiding the ecological fallacy and other pitfalls associated with aggregate data analysis.
  2. They enable the exploration of cross-level interactions through the higher level models, allowing researchers to examine how contextual factors modify individual-level relationships. This is particularly critical in policy evaluation, where interventions may have differing effects across different population segments and geographies.
  3. Multilevel models enhance predictive accuracy by borrowing strength across the levels, leading to more reliable forecasts and simulations.

Apart from data and policy, multilevel statistical models have found diverse applications across various disciplines. In engineering, hierarchical models are used for predictive maintenance planning, enabling knowledge transfer within fleets so that failures can be predicted effectively even when individual machines have sparse data. In education, these models are used to assess the impact of school-level variables, such as class size or teaching quality, on student performance. Similarly, in public health, multilevel models help in understanding how individual health outcomes are influenced by neighbourhood characteristics, such as access to healthcare facilities or exposure to pollutants. In environmental science, they are used to analyse the effects of regional policies on climate change mitigation at both local and global scales. In the social sciences, these models assist in unravelling complex social dynamics by examining how group memberships and societal structures affect individual behaviours and attitudes. Hierarchical models have also been used to predict burglaries in metropolitan neighbourhoods and to forecast the outbreak and spread of diseases in cities!

This versatility highlights the capability of multilevel models to provide deeper insights into complex phenomena by considering multiple levels of influence simultaneously. Thus, the integration of multilevel models across different fields underscores their significance as a robust analytical tool for addressing a wide array of research questions and practical challenges.

A Hierarchical Modelling Exercise

Thanks to open source tools such as Stan, a probabilistic programming language with Python interfaces, deploying a probabilistic hierarchical model is straightforward. With the right understanding of data, statistical distributions, and preprocessing, one can readily formulate and deploy a basic hierarchical model from Python using Stan! In this section, we walk step by step through deploying a basic hierarchical model for forecasting Covid-19 occurrences in the NHS regions of England.

England is divided into seven NHS regions. Exhaustive Covid-19 data for the population of each NHS region are publicly available through the UK government website, with the goal of enabling open source research and development. The data include the rates of Covid-19 infections, ICU admissions, and deaths.

One of the key insights required for Covid-19 policies was a model of the infection rate (i.e. the number of infections), so that policies such as the use of hospital facilities or social distancing rules could be formulated to mitigate the situation. With each region characterised by its own demographics, population density, age profile, etc., it seems reasonable to model the Covid-19 infection rates independently. However, an individual region will often lack sufficient health data to describe its population with good certainty. While this is not the case for Covid-19 data, such situations may arise for less widespread diseases or in the non-medical applications mentioned earlier. In this exercise, we therefore simulate a case where an NHS region has sparse data for modelling the rate of Covid-19 infections. The data used for this exercise can be found at [2].

We shall see that a hierarchical model enables policymakers to model the rate for NHS regions with sparse data with better accuracy and confidence. The hierarchical model presented here shows how the problem of sparse data for a region can be addressed by systematically borrowing information from other NHS regions with sufficient data.

To that end, a simple third degree polynomial function is used to model the segments of time series data corresponding to various phases of the Covid-19 pandemic for each of the NHS regions. We compare the following two cases:

  1. Independent Model: The independent model treats the NHS regions in isolation, such that the coefficients of each polynomial are inferred using only the data from the corresponding NHS region.
  2. Hierarchical Model: The hierarchical model involves sampling the coefficients of the polynomial functions, corresponding to each of the NHS regions, from a higher level Gaussian distribution with a non-informative prior. The higher level Gaussian distribution is shared by all the NHS regions, on the expectation that the regions have similar infection rates since they all constitute England.
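As a rough sketch of what the Hierarchical case could look like, the block below holds a Stan program as a Python string, together with a helper that arranges the data in the long format the program declares. This is our own illustration and not the code from the authors' Colab notebook: the variable names, the weak priors, and the `make_stan_data` and `fit` helpers are all assumptions. The `fit` function relies on CmdStanPy and a working CmdStan installation, so it is defined but not called here.

```python
# Stan program (as a Python string) for a hierarchical cubic model.
# Assumptions: priors, names and data layout are illustrative only.
HIERARCHICAL_STAN = """
data {
  int<lower=1> R;                          // number of regions
  int<lower=1> N;                          // total number of observations
  array[N] int<lower=1, upper=R> region;   // region index of each observation
  vector[N] t;                             // time, rescaled to roughly [0, 1]
  vector[N] y;                             // population-normalised rate
}
parameters {
  vector[4] mu;                            // shared (higher level) coefficient means
  vector<lower=0>[4] tau;                  // between-region spread per coefficient
  array[R] vector[4] beta;                 // cubic coefficients for each region
  real<lower=0> sigma;                     // observation noise
}
model {
  mu ~ normal(0, 10);                      // weak priors, assumed for this sketch
  tau ~ normal(0, 5);
  sigma ~ normal(0, 5);
  for (r in 1:R)
    beta[r] ~ normal(mu, tau);             // regions drawn from the shared Gaussian
  for (n in 1:N) {
    vector[4] b = beta[region[n]];
    y[n] ~ normal(b[1] + b[2] * t[n] + b[3] * t[n]^2 + b[4] * t[n]^3, sigma);
  }
}
"""


def make_stan_data(series):
    """Flatten {region_name: [(t, y), ...]} into the long format declared above."""
    names = sorted(series)
    region, t, y = [], [], []
    for idx, name in enumerate(names, start=1):  # Stan indices start at 1
        for ti, yi in series[name]:
            region.append(idx)
            t.append(float(ti))
            y.append(float(yi))
    return {"R": len(names), "N": len(y), "region": region, "t": t, "y": y}


def fit(series, stan_path="hierarchical_cubic.stan"):
    """Compile and sample with CmdStanPy (needs CmdStan installed)."""
    from cmdstanpy import CmdStanModel  # lazy import: helpers above run without it
    with open(stan_path, "w") as fh:
        fh.write(HIERARCHICAL_STAN)
    model = CmdStanModel(stan_file=stan_path)
    # 4 chains x 500 draws = 2000 post-warmup samples; the notebook may differ.
    return model.sample(data=make_stan_data(series), chains=4,
                        iter_sampling=500, seed=1)


# Made-up numbers, purely to show the expected shape of the input.
toy = {"Region A": [(0.0, 1.2), (0.5, 2.9), (1.0, 2.1)],
       "Region B": [(0.0, 1.0), (1.0, 2.4)]}
stan_data = make_stan_data(toy)
print(stan_data["R"], stan_data["N"], stan_data["region"])
```

An Independent version of the same sketch would drop `mu` and `tau` and give each `beta[r]` its own fixed prior, for example `beta[r] ~ normal(0, 10)`, so that no information flows between regions.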

Markov chain Monte Carlo (MCMC) sampling, facilitated by Stan in the backend, is used to estimate the parameters. The Google Colab notebook in this link presents a basic hands-on implementation of this technique using publicly available Stan and Python libraries. The Colab file includes code blocks showing the aforementioned Independent and Hierarchical polynomial models in Stan, as well as the preprocessing steps for the data used in this exercise. The results presented here use a snippet of data corresponding to the Delta variant; other Covid-19 variants (selected as date ranges of interest) or the ICU and death rates can also be chosen via the Colab file. The admissions, normalised by population, are presented in Figure 2, with each colour representing a separate NHS region. The x-axis shows the number of weeks over which the data are plotted.
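The normalisation step can be sketched in a few lines. The region names, counts, and populations below are placeholders and not real NHS figures, and scaling to a rate per 100,000 residents is our assumption; the Colab notebook may scale differently.

```python
def per_100k(admissions, population):
    """Convert raw weekly admission counts to a rate per 100,000 residents."""
    return [100_000 * a / population for a in admissions]


# Placeholder figures for illustration only; not real NHS data.
weekly_admissions = {"Region A": [120, 180, 260], "Region B": [60, 95, 140]}
population = {"Region A": 6_000_000, "Region B": 3_000_000}

rates = {name: per_100k(counts, population[name])
         for name, counts in weekly_admissions.items()}
for name, values in rates.items():
    print(name, [round(v, 2) for v in values])
```

Normalising in this way puts regions of very different sizes on a comparable scale, which is what makes it plausible for them to share a higher level distribution.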

Figure 2. Admission rate of each NHS region, normalised by population. Each colour represents one region

The results of this exercise are presented in Figures 3 and 4. In Figure 3, the infection rates modelled using an Independent model for a single NHS region are plotted for the cases of sparse and sufficient data. That is, only the data points shown in each plot are used to fit the model for that case.

The blue dots represent the data used to infer the polynomial model coefficients. The thick red line shows the maximum a posteriori (most likely) estimate of the polynomial function, obtained from the most likely estimate of each polynomial coefficient. The surrounding thin orange lines show the 2000 coefficient combinations sampled by MCMC, which together represent the confidence of the model. The tighter the orange lines, the more confident the model is in the coefficient values.
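One way to turn "how tight the lines are" into a number is to measure the width of the band spanned by the sampled curves. The sketch below does this with the Python standard library. The `fake_draws` function is a stand-in that scatters coefficients around assumed values and does not produce real MCMC output; with Stan, the draws would come from the fitted model instead.

```python
import random

random.seed(1)


def cubic(coeffs, t):
    """Evaluate b0 + b1*t + b2*t^2 + b3*t^3."""
    b0, b1, b2, b3 = coeffs
    return b0 + b1 * t + b2 * t**2 + b3 * t**3


def band_width(draws, grid, lo=0.05, hi=0.95):
    """Average width of the pointwise lo-to-hi interval across the time grid."""
    widths = []
    for t in grid:
        values = sorted(cubic(c, t) for c in draws)
        lower = values[int(lo * (len(values) - 1))]
        upper = values[int(hi * (len(values) - 1))]
        widths.append(upper - lower)
    return sum(widths) / len(widths)


def fake_draws(spread, n=2000, centre=(1.0, 2.0, -1.0, 0.5)):
    """Stand-in for MCMC output: Gaussian scatter around assumed coefficients."""
    return [tuple(random.gauss(c, spread) for c in centre) for _ in range(n)]


grid = [i / 20 for i in range(21)]
tight = band_width(fake_draws(0.05), grid)
loose = band_width(fake_draws(0.5), grid)
print(f"tight band: {tight:.3f}, loose band: {loose:.3f}")
```

A narrower band corresponds to tighter lines in the plots, so the same function can be used to compare the sparse and sufficient data cases, or the Independent and Hierarchical models.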

Figure 3. Comparing the Independent model coefficients inferred using sparse and complete data

Figure 4 compares the Hierarchical model with the previously used Independent model for the same NHS region and the same sparse data. Data from two other NHS regions, which have sufficient data and trends similar to those of the sparse region, are also shown as thin grey lines in the plot corresponding to the Hierarchical model.

Figure 4. Comparing the Independent vs Hierarchical model coefficients inferred using sparse data

It is observed, in Figures 3 and 4, that a Hierarchical model enables the modelling of sparse individual data with greater accuracy and confidence, by systematically borrowing information from other similar individuals in the population. Incorporating multilevel models into data analysis therefore facilitates the development of policies that are context-sensitive, ensuring that interventions are targeted to the specific needs of various communities and subpopulations. The adoption of multilevel statistical models is crucial for developing effective, equitable, and sustainable solutions. It is left for the readers to try their hand at the Colab file and calculate the exact improvements in accuracy across various use cases!
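For readers taking up that exercise, one simple way to quantify the improvement is to compare each model's predictions against the data points that were withheld to simulate sparsity. The sketch below uses root-mean-square error. The choice of metric and the numbers are ours, for illustration only, and are not results from the article.

```python
import math


def rmse(predicted, observed):
    """Root-mean-square error between two equally long sequences."""
    if len(predicted) != len(observed):
        raise ValueError("sequences must have the same length")
    squared = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.sqrt(squared / len(observed))


def relative_improvement(rmse_independent, rmse_hierarchical):
    """Fraction by which the hierarchical error is below the independent one."""
    return (rmse_independent - rmse_hierarchical) / rmse_independent


# Placeholder values: withheld observations and each model's predictions.
held_out = [2.0, 3.0, 4.0]
independent_pred = [1.0, 3.0, 6.0]
hierarchical_pred = [2.0, 3.5, 4.5]

err_ind = rmse(independent_pred, held_out)
err_hier = rmse(hierarchical_pred, held_out)
print(f"independent RMSE:  {err_ind:.3f}")
print(f"hierarchical RMSE: {err_hier:.3f}")
print(f"relative improvement: {relative_improvement(err_ind, err_hier):.1%}")
```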

References

[1] Dhada, M., 2022. Statistical hierarchical modelling for industrial collaborative prognosis (Doctoral dissertation).

[2] https://www.england.nhs.uk/statistics/statistical-work-areas/covid-19-hospital-activity/

About the authors

Maharshi (Rishi) Dhada is the SF Express Research Fellow in Industrial Logistics, jointly based at Darwin College and the Institute for Manufacturing at the University of Cambridge. He developed statistical hierarchical models for prognosis and anomaly detection for his PhD thesis. Maharshi collaborates with several industrial partners and is now working on Logistics research problems at the intersection of industry and academia, including but not limited to the applications of statistical multilevel models in Logistics.

Jazmin Labra Montes has been actively involved with the Innovation and Intellectual Property Management (IIPM) Laboratory at the University of Cambridge’s Institute for Manufacturing. She currently works as a Senior Teaching Operations Lead at the Judge Business School. Her research focuses on intellectual property management, sustainability, and data protection policies.

***

This is the blog for Data & Policy (cambridge.org/dap), a peer-reviewed open access journal published by Cambridge University Press in association with the Data for Policy Community Interest Company.

This is the blog for Data & Policy (cambridge.org/dap), an open access journal for the impact of data science on governance. Editors-in-Chief: Zeynep Engin (UCL, Data for Policy), Jon Crowcroft (Cambridge, Turing Institute), Stefaan Verhulst (GovLab, NYU). Published by CUP.
