Creating a Nationally Representative Training Dataset

Published in

The Centre for Net Zero Tech Blog

Application: smart meter data

Centre for Net Zero is spearheading the generation of open energy demand data, starting with our own synthetic smart meter data generator: Faraday. Trained on Octopus Energy (OE) customers’ smart meter data, the model generates demand profiles at half-hourly resolution for a given set of user-specified inputs, such as low carbon technology (LCT), property type and season. Faraday produces realistic distributions of load profiles across a population made up of different consumer archetypes in the relative proportions specified by the user. Synthetic data makes realistic demand data accessible whilst keeping real smart meter data private.

This blog post maps out the steps we’ve taken to ensure that Faraday is trained on a Nationally Representative Dataset (NRD) for smart meters. This involves creating a dataset comprising half-hourly demand profiles which, on average, represents the true population and also accurately represents the variation of profiles among the population.

Step 1: Defining an evaluation metric

How will we know if NRD is Nationally Representative?

We need some sort of ground truth to measure against. The Department for Energy Security and Net Zero (DESNZ) releases average annual consumption statistics aggregated over individual geographical areas making up the UK, down to the postcode level.

We will be confident that NRD is nationally representative when the difference between the average annual energy consumption in NRD and the corresponding DESNZ figure is close to 0 kWh.

To take NRD one step further, we want the dataset to be not just representative of the UK overall, but of all subregions too. We want the dataset to be sensitive to the fact that the typical electricity consumption of a more affluent village may not be the same as a deprived area of a city, for example. NRD is therefore constructed to be representative at the Middle Super Output Area (MSOA) level, where each MSOA comprises between 2,000 and 6,000 households.

The latest release of DESNZ electricity statistics is the 2022 dataset, so we can only evaluate an NRD up to 2022. Given the rapid growth of OE as an energy supplier*, we do not assume that characteristics existing in 2022 persist into 2023 and 2024. Going forward, we will therefore measure the drift of our training population from the DESNZ statistics as they are released, and re-create NRD year-on-year as necessary.

*OE’s acquisitions of Bulb and Shell Energy occurred at the end of 2022 and 2023, respectively.

Step 2: Address differences with clustering, representing & resampling

Ultimately we want to understand what factors influence energy consumption and correct for those factors in the OE database. We achieve this using the following steps:

Clustering: grouping properties with similar attributes together into ‘archetypes’ using features constructed from household and locational data.
Representing: each area in the UK can be represented by a histogram showing the relative number of households within each cluster.
Resampling: sampling from the full OE database to recreate this histogram in each MSOA. We then measure the annual consumption of households in this resampled dataset to evaluate NRD.

We use k-means to cluster UK homes into 20 archetypes. We chose this unsupervised learning method to avoid making too many assumptions about the influence of different household attributes on energy consumption, given that it is extremely difficult to determine how and why different households use energy differently over time.

If energy consumption depends on household archetype (i.e., cluster), then by sampling OE data to recreate the true distribution of household archetypes in each MSOA, we would expect to recover a representative energy sample.
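To make this reasoning concrete, here is a toy calculation with invented numbers. The three archetypes, their mean consumption and both mixes are assumptions for illustration only, not values from our data. If each archetype has its own typical consumption, the sample mean is a mix-weighted average, so correcting the mix corrects the mean.

```python
# Hypothetical mean annual consumption (kWh) for three made-up archetypes.
cluster_mean_kwh = {"A": 2000.0, "B": 3000.0, "C": 5000.0}

# Share of homes in each archetype: in a biased sample vs. the true population.
sample_mix = {"A": 0.2, "B": 0.3, "C": 0.5}
true_mix = {"A": 0.5, "B": 0.3, "C": 0.2}


def expected_mean(mix, means):
    """Mix-weighted average consumption across archetypes."""
    return sum(share * means[cluster] for cluster, share in mix.items())


biased_mean = expected_mean(sample_mix, cluster_mean_kwh)
corrected_mean = expected_mean(true_mix, cluster_mean_kwh)
print(f"biased sample: {biased_mean:.0f} kWh, corrected mix: {corrected_mean:.0f} kWh")
```

The per-archetype means are identical in both cases; only the proportions change, which is exactly what the resampling step adjusts.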

Clustering

Data cleaning
Missing feature data is imputed with the median value for homes in the surrounding postcode. If the data is missing for the whole postcode, we take the median within the wider output area, and so on through progressively larger areas, until the dataset is complete.
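The fallback logic can be sketched with pandas as follows. This is a minimal illustration rather than our production code: the column names (`postcode`, `output_area`, `floor_area_m2`) and the example values are hypothetical, and the final dataset-wide median is an assumed last resort.

```python
import numpy as np
import pandas as pd


def impute_hierarchical_median(df, column, levels):
    """Fill gaps with the median of the smallest area that has observed data.

    `levels` lists area columns ordered from smallest to largest.
    Medians are computed from observed values only, never from imputed ones.
    """
    filled = df[column].copy()
    for level in levels:
        area_median = df.groupby(level)[column].transform("median")
        filled = filled.fillna(area_median)
    # Assumed last resort: the median over the whole dataset.
    return filled.fillna(df[column].median())


# Hypothetical example data.
homes = pd.DataFrame({
    "postcode": ["AB1 1AA", "AB1 1AA", "AB1 1AB", "AB1 1AB", "CD2 2AA"],
    "output_area": ["OA1", "OA1", "OA1", "OA1", "OA2"],
    "floor_area_m2": [80.0, np.nan, np.nan, np.nan, 120.0],
})
homes["floor_area_m2"] = impute_hierarchical_median(
    homes, "floor_area_m2", ["postcode", "output_area"]
)
print(homes)
```

In this example the second home is filled from its own postcode, while the two homes in the postcode with no data at all fall back to the median of the wider output area.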

Feature engineering & learning clusters
For cluster training, we construct the features listed in the table below using combinations of the property data, DESNZ and OE annual consumption data. We attempt to construct features which we expect to relate to energy consumption. Data for the 27M households in England & Wales is used for training.

Given the large dataset and limited compute, we use scikit-learn’s MiniBatchKMeans implementation. Features are normalised by their L2 norms before clustering. Because k-means relies on a Euclidean distance metric, it does not handle nonlinear data well, so we construct non-linear features using mean-encoding.
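A minimal sketch of this pipeline is shown below on synthetic data. Several details are assumptions for illustration: the column names, the choice of annual consumption as the mean-encoding target, per-feature (column-wise) L2 normalisation, and the `batch_size` and `n_init` values.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import MiniBatchKMeans
from sklearn.preprocessing import normalize

# Synthetic stand-in for the property dataset (hypothetical columns).
rng = np.random.default_rng(0)
n_homes = 1_000
homes = pd.DataFrame({
    "property_type": rng.choice(["flat", "terraced", "detached"], size=n_homes),
    "floor_area_m2": rng.uniform(30, 250, size=n_homes),
    "annual_kwh": rng.uniform(1_000, 8_000, size=n_homes),
})

# Mean-encoding: replace each category with the mean annual consumption
# of the homes in that category (assumed encoding target).
homes["property_type_enc"] = (
    homes.groupby("property_type")["annual_kwh"].transform("mean")
)

features = homes[["property_type_enc", "floor_area_m2"]].to_numpy()

# Scale each feature column to unit L2 norm (assumed to be per-feature).
X = normalize(features, norm="l2", axis=0)

# Mini-batch k-means fits on small random batches instead of the full dataset.
model = MiniBatchKMeans(n_clusters=20, batch_size=256, n_init=3, random_state=0)
homes["cluster"] = model.fit_predict(X)
print(homes["cluster"].value_counts().head())
```

For data that does not fit in memory, `MiniBatchKMeans.partial_fit` can be called on successive chunks instead of `fit_predict`.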

Features used for clustering. *AA refers to annualised advance: an estimate of annual consumption calculated by interpolating between two meter readings.
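As a rough illustration of the annualised advance, the sketch below assumes the simplest possible reading of "interpolating between two meter readings": the consumption between two cumulative readings is scaled linearly to a 365-day year. The function name and the linear scaling are assumptions; the calculation behind the AA values we use may differ, for example by accounting for seasonality.

```python
from datetime import date


def annualised_advance(first_date, first_kwh, second_date, second_kwh):
    """Scale the consumption between two cumulative readings to 365 days."""
    days = (second_date - first_date).days
    if days <= 0:
        raise ValueError("second reading must be later than the first")
    return (second_kwh - first_kwh) / days * 365


# Hypothetical readings taken 100 days apart, 1,000 kWh used in between.
aa = annualised_advance(date(2022, 1, 1), 10_000.0, date(2022, 4, 11), 11_000.0)
print(f"{aa:.0f} kWh per year")
```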

Note that we also try reducing the dimensionality with PCA, and learning clusters using batch-learning Gaussian Mixture Models to capture nonlinear relationships. We do not find significant improvements to the results using these methods.

Rules-based clusters
Once the clusters were fit, we merged any ‘outlier clusters’, i.e., clusters comprising only a very small proportion of homes, into other clusters with similar features. Cluster similarities were determined by inspecting the bar plot shown below. We also constructed two additional clusters using a rules-based approach, because k-means alone was unable to distinguish properties with very high electricity consumption from properties with very low electricity consumption when they had similar property values. As a result, very different energy consumption profiles were appearing in the same clusters. To address this, we created separate clusters for households in high consumption postcodes and low consumption postcodes:

Rules-based cluster 1: homes valued at > £500k within postcodes with average annual consumption > 6,000 kWh according to DESNZ.
Rules-based cluster 2: homes valued at < £400k within postcodes with average annual consumption < 2,000 kWh according to DESNZ.
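The two rules can be expressed as boolean masks that override the k-means label. In this sketch the column names, the example homes and the cluster ids 20 and 21 are hypothetical; only the thresholds come from the rules above.

```python
import pandas as pd

# Hypothetical homes with their k-means cluster labels.
homes = pd.DataFrame({
    "property_value_gbp": [650_000, 350_000, 450_000, 700_000],
    "postcode_mean_kwh": [6_500, 1_800, 7_000, 3_000],  # DESNZ postcode average
    "cluster": [3, 11, 5, 3],
})

HIGH_CLUSTER, LOW_CLUSTER = 20, 21  # hypothetical ids for the two new clusters

# Rule 1: high-value homes in high-consumption postcodes.
high = (homes["property_value_gbp"] > 500_000) & (homes["postcode_mean_kwh"] > 6_000)
# Rule 2: lower-value homes in low-consumption postcodes.
low = (homes["property_value_gbp"] < 400_000) & (homes["postcode_mean_kwh"] < 2_000)

# Homes matching neither rule keep their k-means label.
homes.loc[high, "cluster"] = HIGH_CLUSTER
homes.loc[low, "cluster"] = LOW_CLUSTER
print(homes)
```

The two masks cannot overlap because the value and consumption thresholds are disjoint, so the order of the two assignments does not matter.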

Investigating clustering results
We visualise how the dataset has been clustered using the bar plots below. For each property feature, we show how property attributes are distributed among the clusters. For visualisation purposes, we bin continuous variables such as property value and floor area. As an example, we find large, high-value detached households with a medium energy efficiency are grouped in one cluster, and small energy-efficient flats in urban areas are grouped in another.

Distribution of household attributes among clusters in our Nationally Representative Dataset.

To understand the energy profiles of clusters, we can look at the annual consumption of homes within each cluster. Looking at the plot below, we find that our clusters differ in their energy consumption as well as in their property features (as shown above). In other words, our clustering has done a good job of grouping homes into different ‘energy types’ using the features we constructed.

Distribution of household annual consumption (kWh) among clusters in NRD.

Visualising the representing step

Distribution of households among clusters in a particular MSOA region.

Each MSOA can be represented by the proportion of homes in each cluster. For this particular MSOA, we find ~20% of homes fall in cluster 4.
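Computing this representation amounts to a normalised cross-tabulation of MSOA against cluster. The sketch below uses a handful of made-up homes; the MSOA names and cluster assignments are hypothetical.

```python
import pandas as pd

# Hypothetical homes, each with an MSOA and a cluster label.
homes = pd.DataFrame({
    "msoa": ["MSOA_A"] * 5 + ["MSOA_B"] * 4,
    "cluster": [4, 4, 7, 17, 19, 7, 7, 4, 19],
})

# One row per MSOA, one column per cluster, each row summing to 1.
proportions = pd.crosstab(homes["msoa"], homes["cluster"], normalize="index")
print(proportions)
```

Each row of `proportions` is the histogram for one MSOA, and clusters with no homes in an MSOA appear as zeros.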

Visualising the resampling step

To create each representative MSOA (making up NRD), we sample from each cluster in the correct proportions. In other words, we recreate the above bar plot using OE data. We do the sampling with replacement such that the same home can appear in multiple MSOAs within NRD.

Comparison between the distribution of households among clusters before (green) and after (light blue) resampling, where the aim of the resampling was to match the true distribution of households among clusters (dark blue) in the UK. This is how we create NRD.

For this particular MSOA, the OE sample underrepresented households in clusters 4 and 17, and overrepresented households in clusters 7 and 19. We resample from the OE dataset to correct the proportions of these clusters.
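The resampling for a single MSOA can be sketched as follows. The function name, column names, target shares and example homes are hypothetical, and rounding each cluster's quota to the nearest whole home is an assumption of this sketch.

```python
import pandas as pd


def resample_msoa(oe_homes, target_shares, n_homes, seed=0):
    """Draw homes from the OE pool so cluster shares match target_shares."""
    draws = []
    for i, (cluster, share) in enumerate(target_shares.items()):
        pool = oe_homes[oe_homes["cluster"] == cluster]
        n_draw = round(share * n_homes)
        if n_draw == 0 or pool.empty:
            continue  # nothing to draw, or no OE homes in this cluster
        # replace=True lets a small pool fill a large quota.
        draws.append(pool.sample(n=n_draw, replace=True, random_state=seed + i))
    return pd.concat(draws, ignore_index=True)


# Hypothetical OE pool and target cluster shares for one MSOA.
oe_homes = pd.DataFrame({
    "home_id": [1, 2, 3, 4, 5, 6],
    "cluster": [4, 4, 7, 7, 7, 19],
    "annual_kwh": [2100.0, 2300.0, 3400.0, 3600.0, 3500.0, 5200.0],
})
target_shares = {4: 0.50, 7: 0.25, 19: 0.25}

msoa_sample = resample_msoa(oe_homes, target_shares, n_homes=8)
print(msoa_sample["cluster"].value_counts(normalize=True).sort_index())
print(msoa_sample["annual_kwh"].mean())
```

Because the pool is the full OE dataset rather than only the OE homes located in that MSOA, the same home can be drawn for several MSOAs, and more than once within one MSOA when a cluster's pool is small.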

Step 3: Evaluating the results

To evaluate the results, we compare the average household annual consumption within each MSOA in our newly created NRD to the values published by DESNZ. The histogram below presents the % offset between NRD and DESNZ (in pink) and the % offset between the full 2022 OE sample and DESNZ (in blue). We find that the average MSOA AA (annualised advance, our estimate of annual electricity consumption) values in NRD are consistent with DESNZ to within ±15% for 95% of households in NRD. The methods described in this blog post allowed us to correct the OE dataset and create a new representative dataset from which we can extract smart meter profiles to train our models.
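The evaluation metric can be sketched as below. All numbers are invented for illustration and are not our results. The column names and the sign convention (NRD minus DESNZ, relative to DESNZ) are assumptions, as is weighting each MSOA by its household count to obtain a share of households.

```python
import pandas as pd

# Hypothetical per-MSOA averages (kWh) and household counts.
msoa = pd.DataFrame({
    "msoa": ["A", "B", "C", "D"],
    "nrd_mean_kwh": [3100.0, 2900.0, 3600.0, 2500.0],
    "desnz_mean_kwh": [3000.0, 3000.0, 3000.0, 2500.0],
    "n_households": [3000, 2500, 1000, 4000],
})

# Percentage offset of the NRD average from the DESNZ average in each MSOA.
msoa["pct_offset"] = (
    100 * (msoa["nrd_mean_kwh"] - msoa["desnz_mean_kwh"]) / msoa["desnz_mean_kwh"]
)

# Share of households living in MSOAs whose offset is within +/-15%.
within = msoa["pct_offset"].abs() <= 15
share_within = msoa.loc[within, "n_households"].sum() / msoa["n_households"].sum()
print(msoa[["msoa", "pct_offset"]])
print(f"{share_within:.1%} of households are in MSOAs within +/-15%")
```

Running the same calculation on the original OE sample and on NRD gives the two distributions compared in the histogram.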

Histogram showing the percentage offset of the average annual electricity consumption values (AA) in each MSOA region, between our sample and DESNZ. Using the methods outlined in this blog post, we were able to transform our original sample (blue histogram) into a more representative dataset (pink histogram), where the energy consumption data better represents the true consumption, on average, in different regions of the UK.

Conclusions

To conclude, our method was successful in creating a ‘Nationally Representative Energy Dataset’. At present, Faraday excels in its ability to generate realistic smart meter profiles in terms of the timing, shape and magnitude of peaks and troughs, but on average the consumption profiles are above what we expect from nationally representative data. This is unsurprising given the skewness of the blue histogram above.

Using this new representative dataset (pink histogram above), we are re-training Faraday to generate synthetic smart meter data which is more representative of the general population.

Faraday will contribute to the OpenSynth community, making open synthetic smart meter data available globally for industry and research 🌏 💫

If you are interested in reaching out to learn more about our work, please email info@centrefornetzero.org