Respecting privacy and reflecting diversity: using data for societal good

Centre for Net Zero
Jun 21

Centre for Net Zero is an open research lab, helping to realise faster, fairer and more affordable paths to net zero. One of the primary ways we do this is by creating detailed simulations of the energy transition based on real-world data. This data allows us to move beyond averages and instead consider the full variety of people and businesses involved in the energy transition across the globe, and how they may engage in, or be impacted by, different policies. By providing key stakeholders with this information, we can enable faster, bolder and more just decision-making.

Our association with Octopus Energy means we can draw upon the experience of serving more than two million UK households. With this comes an incredible opportunity to understand how energy needs are currently influenced by factors such as seasonality, property attributes, local weather conditions and, increasingly, time-of-use prices.

Whatever the area of research, all of our work is governed by two core principles: respecting the privacy of Octopus Energy customers and ensuring that the data we use and the conclusions we draw reflect the full diversity of people and places within society.

At the start of every project we go through a structured impact assessment closely aligned with the Open Data Institute’s Data Ethics Canvas. We apply a series of challenges to ensure we only process the data we need and consider how that processing might negatively impact the customer. We then apply measures to minimise any risks. For example, understanding location is important to our research for many reasons, such as considering weather, network connections and the availability of public EV charging, but most of these factors can be considered without using specific addresses. So, once we bring data together, we remove specific addresses so that data and conclusions can’t be linked back to particular households. Where data still needs to be processed at an individual record level, we have additional safeguards in place including training, encryption and other forms of security.
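As a rough illustration of the address-removal step described above, the sketch below drops specific-address fields from a record once coarser location attributes have been joined on. The field names (`address_line_1`, `postcode`, `local_authority_district` and so on) are hypothetical; the actual schema and pipeline are not described in this article.

```python
# Hypothetical field names; the real schema is not described in the article.
ADDRESS_FIELDS = {"address_line_1", "address_line_2", "postcode"}


def strip_addresses(record):
    """Return a copy of the record without the specific-address fields."""
    return {key: value for key, value in record.items() if key not in ADDRESS_FIELDS}


# Illustrative record: location-dependent attributes have already been joined,
# so the specific address is no longer needed for analysis.
household = {
    "address_line_1": "1 Example Street",
    "address_line_2": "",
    "postcode": "AB1 2CD",
    "local_authority_district": "Example District",
    "daily_kwh": 9.4,
}

print(strip_addresses(household))
```

The idea is simply that anything derived from location (weather, network area, nearby charging) is attached first, and the identifying fields are removed before analysis.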

Aggregation is another important tool we use to reduce the sensitivity of the data we use. In the words of the Information Commissioner’s Office: “In general, the more detailed, linkable and individual-level the anonymised data is, the stronger the argument for ensuring only limited access to it… the more aggregated and non-linkable the anonymised data is, the more possible it is to publish it”. Where we identify needs in the pursuit of net zero to share or publish data outside the Centre, we will use aggregation and similar methods to further anonymise and minimise any risks.
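To make the idea of aggregation concrete, here is a minimal sketch that rolls individual readings up to area-level averages and withholds any area with too few households. The minimum group size of 10 and the `(area, kwh)` input shape are assumptions made for illustration; they are not thresholds stated by the Centre or the ICO.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_area(readings, min_households=10):
    """Mean consumption per area, omitting areas with too few households.

    `readings` is an iterable of (area, kwh) pairs, one per household.
    `min_households` is an illustrative threshold, not an official one.
    """
    by_area = defaultdict(list)
    for area, kwh in readings:
        by_area[area].append(kwh)

    return {
        area: {"households": len(values), "mean_kwh": round(mean(values), 2)}
        for area, values in by_area.items()
        if len(values) >= min_households  # small groups are withheld entirely
    }
```

Withholding small groups is one common way to keep published aggregates from pointing back at a handful of households.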

We are also mindful of the sample we are analysing and take important steps to ensure that all segments of society are represented as accurately as possible in our work. Clearly, our sample is characterised by people who chose to join Octopus Energy over alternatives. The subset of households with smart meters is further characterised by practical aspects of their rollout and installation, and people on smart price tariffs are again a self-selected group who have particular circumstances that can skew them differently from the overall sample.

These groups are characterised by factors such as an increased likelihood to live in a house instead of a flat, to own their property and to over-index in age groups such as 45–55. As you might expect, customers on smart tariffs such as Octopus Go or Outgoing Octopus are more likely to have particular forms of low carbon technology, such as electric vehicles.

Snapshot of the proportional representation of Octopus Energy (OE) properties in the UK, by Local Authority District (LAD). Cooler colours indicate districts where the proportion of OE properties is lower than the proportion of UK properties.

So how do we go about creating proportionally representative samples? One approach we use is stratified sampling. This ensures that each meaningfully distinct segment under consideration (such as a combination of property type and EPC rating) appears in our sample in the same proportion as it does in a reference population (such as the UK).
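Stratified sampling of this kind can be sketched in a few lines. The example below is a simplified illustration rather than our production code: the record fields, the `property_and_epc` helper and the reference shares are all hypothetical, and the shares are made-up numbers rather than real UK statistics.

```python
import random
from collections import defaultdict


def property_and_epc(record):
    """Stratum key: (property type, EPC band). Field names are hypothetical."""
    return (record["property_type"], record["epc_band"])


def stratified_sample(records, stratum_of, reference_shares, sample_size, seed=0):
    """Draw a sample whose strata appear in the reference proportions."""
    rng = random.Random(seed)

    # Group the available records by stratum.
    by_stratum = defaultdict(list)
    for record in records:
        by_stratum[stratum_of(record)].append(record)

    sample = []
    for stratum, share in reference_shares.items():
        target = round(share * sample_size)
        available = by_stratum.get(stratum, [])
        if len(available) < target:
            # Not enough records to represent this stratum at the target size.
            raise ValueError(
                f"stratum {stratum!r} has {len(available)} records, needs {target}"
            )
        # Sample without replacement within the stratum.
        sample.extend(rng.sample(available, target))
    return sample


# Illustrative shares only; a real reference would come from public statistics.
reference_shares = {("house", "C"): 0.4, ("house", "D"): 0.3, ("flat", "C"): 0.3}
```

When a stratum has too few records, the sketch raises an error instead of silently under-representing it, which is the point at which a smaller sample, a coarser segmentation or a caveat would be needed.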

Distribution of daily gas consumption on a very cold day in 2021, for UK houses in different EPC bands

In some cases, we may report outputs at a coarser geographic level where a representative sample cannot be drawn at a finer one. In other cases, we might be interested in highly specific sub-segments whose representativeness in the broader population is uncertain. Here, we may need to tightly caveat conclusions, make use of supplementary public datasets, or even create the appropriate data via new, experimental initiatives.

Beyond privacy and proportionality, there are other short- and longer-term factors to consider. Octopus Energy has previously shared analysis of how Covid-19 has influenced at-home energy behaviours. We remain conscious of the impact of the pandemic on smart meter data and, as the new normal begins to emerge, we are monitoring our models for concept drift. This monitoring flags any relationship between the input data and the target that no longer holds, so that a model can be retrained on newer data. What does this look like in real life? As lockdown restrictions are gradually lifted, there’s a lot of discussion around a hybrid model emerging in which people split their work between home and the office. This will result in certain households exhibiting patterns of energy consumption that differ from both pre-lockdown and lockdown behaviours.
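One simple way to monitor for concept drift is to compare a model's recent prediction error with the error it achieved when it was last validated. The sketch below assumes that approach; the function names and the tolerance of 1.5 are illustrative choices, not a description of the monitoring we actually run.

```python
from statistics import mean


def mean_absolute_error(actual, predicted):
    """Average absolute difference between observed and predicted values."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def drift_flagged(baseline_mae, recent_actual, recent_predicted, tolerance=1.5):
    """Flag drift when recent error exceeds the baseline error by `tolerance` times.

    `baseline_mae` is the error measured when the model was last validated.
    `tolerance` is an illustrative threshold, not a recommended value.
    """
    recent_mae = mean_absolute_error(recent_actual, recent_predicted)
    return recent_mae > tolerance * baseline_mae
```

A household that shifts to hybrid working would show up here as predictions that keep missing daytime consumption, pushing the recent error above the tolerance and prompting a retrain.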

Looking at more systemic issues, missing data, or ‘missingness’, is a widespread characteristic of smart meter data as a result of the diversity in device models, network failures, and the multiple steps in data transmission. These problems are expected to subside as technology improves but, in the meantime, we carefully curate our training data. For example, we might choose to impute values where missingness is not at random. Doing so avoids the ongoing bias that would be introduced by simply dropping records with missing values. Certain model types, such as hierarchical generalised linear models (GLMs), can also help by sharing knowledge across key segments. They allow us to use informed baselines for a segment, based on higher-level population behaviours, even when the data for that segment is sparse or of uncertain quality.
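The 'shared knowledge' idea can be illustrated with a simple shrinkage estimate, in which a segment's average is pulled towards the population average when the segment has little data. This is a deliberately simplified stand-in for what a hierarchical model does, not a hierarchical GLM itself, and `prior_strength` is a hypothetical tuning parameter.

```python
def pooled_estimate(segment_values, population_mean, prior_strength=20.0):
    """Blend a segment mean with the population mean, weighted by sample size.

    With no data the estimate is the population mean; as the segment grows,
    the estimate moves towards the segment's own mean. `prior_strength` acts
    like a number of pseudo-observations at the population mean (an assumption
    made for this sketch).
    """
    n = len(segment_values)
    if n == 0:
        return population_mean
    segment_mean = sum(segment_values) / n
    weight = n / (n + prior_strength)
    return weight * segment_mean + (1 - weight) * population_mean
```

For example, a segment with only two households barely moves away from the population baseline, while a segment with hundreds of households is described almost entirely by its own data.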

By confronting these challenges head-on and building important strategies into our data processing, we can leverage the experiences and data from today’s energy markets in an ethical and equitable way. In doing so, we will be able to identify significant opportunities that can move society more quickly towards our end-goal of net zero.

Find out more about our current research on the decarbonisation of heat and transport.