What I Learned about Data Analysis, Preparation and Imputation as a Top Finisher in the Global Challenge on Energy Consumption Prediction using Smart Meter Data

Syed Abdul Syed Allaudeen
Published in CognizantAI · Jun 28, 2021

As Applied Data Scientists and Machine Learning (ML) Engineers, our focus is mostly on the algorithms and models used to solve a problem. In my experience, we often overlook what is highly critical and quite often the game changer in almost every ML problem: data analysis, preparation and imputation.

In this article, I share the data preparation techniques that helped me secure a leading position in a global smart meter energy forecasting competition conducted by EON and IEEE (link). My submission was among the Top 15 finalists.

Image Courtesy: www.DeZyre.com

Global competition overview: The goal of this competition was to predict the monthly electricity consumption for 3,248 households in an upcoming calendar year (January to December). Participants were provided with historical half-hourly energy readings for the 3,248 smart meters. To simulate a realistic use case, we were asked to take the 1st of January of a given year as the day from which to commence the predictions. This meant that the consumption data per smart meter differed based on when the smart meter was on-boarded by the utility.

Sample dataset from the competition containing 3,248 rows and 17,521 columns

Solution summary: To predict the monthly electricity consumption for the 3,248 households in the coming year (January to December), we chose an imputation approach based on customized nearest neighbors. This involved exploratory data analysis to find the missing values, computing daily average consumption, segregating the dataset, imputing the missing values using nearest neighbors, and forecasting 2018 from the most recent trends.

Data complexity: While slicing and dicing the data with various Python libraries in a Jupyter notebook, it became evident that IEEE and EON had, in my opinion, framed the problem in the most challenging fashion possible by providing as much poor-quality data as they could. The 3,248 meters were effectively split into groups of roughly 270 (3,248/12) by coverage window: about 270 meters had data only for January 2017, another 270 had data only for January and February, and so on, meaning that only about 270 meters had data for the entire 12 months of 2017. In addition, there were numerous gaps in the half-hourly consumption readings, some stretching up to 22 days, which would skew the daily consumption average if it were derived without proper statistical treatment. No supporting data (e.g. weather data) was provided, which added further complexity. Finally, the household profile data that was provided had ~95% missing values for the majority of its attributes, which in my opinion rendered it unusable.

In summary, both the 2018 monthly forecast and the yearly forecast had to be derived from this poor-quality 2017 data.

Data preparation: The following data preparation steps were used to bring the dataset to a monthly granularity, which helps reduce the noise seen at the half-hourly and daily levels. Daily average consumption values were derived for each month and used as the basis for further calculations.

  1. Identifying the missing consumption values: Exploratory data analysis showed that many meters had missing consumption values at the half-hour interval. A complete, gap-free observation should have 17,520 readings per year (48 readings/day * 365 days). Upon analysis, only 789 meters had more than 75% of the total readings, while the remaining 2,459 meters had less than 75%.
  2. Computation of daily average consumption for all meters: The data was grouped to a monthly level to be consistent with the final prediction granularity. Daily average consumption was computed per month for each meter by dividing the total monthly consumption by the corresponding number of non-null monthly readings. A simple average would not account for the missing reads and hence would have underestimated the actual consumption.
  3. Making meager values null based on domain knowledge: There were instances where daily average consumption values were present but negligibly small. These can occur when a household is unoccupied and only the smart meter and other small devices are drawing electricity. Such values were set to null (and later imputed through proper techniques) to obtain a better forecast. A minimal sketch of these three steps follows the figure below.
The resultant dataframe post data preparation showing daily average consumption at monthly granularity
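To make these three steps concrete, here is a minimal pandas sketch. The file name, the column layout (one row per meter, one column per half-hourly timestamp of 2017), the x48 scaling of the per-reading mean into a daily figure, and the 0.01 threshold for "negligible" values are illustrative assumptions, not details taken from the competition code.

```python
import pandas as pd

# Assumed layout: one row per meter_id, one column per half-hourly timestamp of 2017.
readings = pd.read_csv("smart_meter_2017.csv", index_col="meter_id")  # hypothetical file name

# Step 1: identify missing values. A full year has 48 * 365 = 17,520 half-hourly readings.
expected = 48 * 365
coverage = readings.notna().sum(axis=1) / expected
print((coverage > 0.75).sum(), "meters with >75% of readings")    # 789 in the competition data
print((coverage <= 0.75).sum(), "meters with <=75% of readings")  # 2,459 in the competition data

# Step 2: daily average consumption at monthly granularity.
# Reshape to long format (stack drops the missing readings automatically).
long = readings.stack().rename("consumption").reset_index()
long.columns = ["meter_id", "timestamp", "consumption"]
long["month"] = pd.to_datetime(long["timestamp"]).dt.to_period("M")

monthly = long.groupby(["meter_id", "month"])["consumption"].agg(["sum", "count"])
# Monthly total / number of non-null readings, scaled to a daily figure (48 half-hours/day).
daily_avg = (monthly["sum"] / monthly["count"] * 48).unstack("month")

# Step 3: null out negligible values (unoccupied household, baseline device load)
# so they can be re-imputed later. The 0.01 threshold is illustrative only.
daily_avg = daily_avg.mask(daily_avg < 0.01)
```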

Segregation of meters based on average consumption: Based on their average consumption, the meters were segregated into three datasets so that different imputation techniques could be applied to each (a sketch of this segregation follows the list below):

1) Meters with >0.5 MWh daily average consumption for at least one month (total count: 3,117 meters)

Meters with >0.5 MWh daily average consumption for at least one month

2) Meters with at least one instance of a consumption spike/drop, i.e. a >200% increase/decrease in daily average consumption between successive months (total count: 117 meters)

Meters with consumption spikes

3) Meters with <0.5 MWh daily average consumption for all months (total count: 14 meters)

Meters with <0.5 MWh daily average consumption for all months
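Continuing from the daily_avg frame above, the segregation could look like the sketch below. The article does not state how a meter that both exceeds 0.5 MWh and shows a spike is assigned, so giving the spike rule precedence is an assumption.

```python
# daily_avg: meters x 12 monthly columns of daily average consumption (MWh assumed).

# Month-over-month relative change in daily average consumption.
mom_change = (daily_avg / daily_avg.shift(1, axis=1) - 1).abs()

low_all_months = (daily_avg < 0.5).all(axis=1)      # below 0.5 MWh in every month
has_spike = (mom_change > 2.0).any(axis=1)          # at least one >200% increase/decrease

dataset3 = daily_avg[low_all_months]                # 14 meters in the competition data
dataset2 = daily_avg[has_spike & ~low_all_months]   # 117 meters
dataset1 = daily_avg[~has_spike & ~low_all_months]  # 3,117 meters: >0.5 MWh in at least one month
```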

Data imputation using nearest neighbors' average: This was the most critical step of the solution, where different techniques were applied to the datasets above in order to get a good approximation of the missing values. The missing consumption for each month of every meter was imputed using the customized nearest-neighbor logic depicted below.

Customized Nearest Neighbors based imputation approach

An off-the-shelf KNN algorithm was not used because of the need to eliminate the outliers seen in specific months, which could otherwise skew the approximation.

Example: Effect of outliers in data imputation using neighbor’s average

The same neighbor logic was executed in sequence, first for dataset1 and then for dataset2: the dataset2 population was excluded while finding the neighbors of dataset1, but the dataset1 population was included while finding the neighbors of dataset2. This prevented the sudden spikes and drops seen in dataset2 from creeping into the imputation of dataset1.

For the meters in dataset3 (which lack meaningful data for almost all of their non-missing months), a simple average-based imputation was used: each missing value was filled with the average of that meter's own non-missing months.
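Since the exact neighbor rules are shown only in the figures above, the sketch below is a hedged approximation: for each missing month it looks for the k meters in an allowed pool with the most similar consumption over the months both have in common, drops outlier neighbor values with an IQR rule, and averages the rest. The distance metric, k = 10 and the 1.5 x IQR trimming are my assumptions, not the competition implementation.

```python
import pandas as pd

def impute_with_neighbors(target: pd.DataFrame, pool: pd.DataFrame,
                          k: int = 10, trim: float = 1.5) -> pd.DataFrame:
    """Fill missing monthly values of `target` using similar meters from `pool`.
    The similarity metric, k and the IQR-based trimming are illustrative assumptions."""
    out = target.copy()
    for meter, row in target.iterrows():
        missing = row[row.isna()].index
        if len(missing) == 0:
            continue
        common = row[row.notna()].index
        # Distance over the months this meter actually has (mean absolute difference).
        dist = (pool[common] - row[common]).abs().mean(axis=1)
        neighbors = dist.drop(meter, errors="ignore").nsmallest(k).index
        for month in missing:
            vals = pool.loc[neighbors, month].dropna()
            if vals.empty:
                continue
            # Trim outlier neighbor values so a single spiky month cannot skew the estimate.
            q1, q3 = vals.quantile([0.25, 0.75])
            iqr = q3 - q1
            kept = vals[(vals >= q1 - trim * iqr) & (vals <= q3 + trim * iqr)]
            out.loc[meter, month] = kept.mean() if not kept.empty else vals.median()
    return out

# Sequential execution: dataset1 meters only borrow from dataset1,
# dataset2 meters may also borrow from the already imputed dataset1.
dataset1_imp = impute_with_neighbors(dataset1, pool=dataset1)
dataset2_imp = impute_with_neighbors(dataset2, pool=pd.concat([dataset1_imp, dataset2]))

# Dataset3: simple average of the meter's own non-missing months.
dataset3_imp = dataset3.apply(lambda r: r.fillna(r.mean()), axis=1)
```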

Forecasting 2018 monthly consumption using trend-adjusted 2017 values: The last quarter of 2017 had to be given a higher weight than the older data in order to get a reasonable forecast for the next year. This trend adjustment was accomplished using the following steps (a sketch of these steps follows the list).

  1. The complete dataset of 2017 was projected forward as-is to obtain the 2018 base forecast.
  2. For each meter, the average consumption of the last three months of 2017 was compared with the median consumption across all twelve months of the 2018 base forecast.
  3. If the last three months' average of 2017 was greater than the 2018 median, all consumption values below the median were replaced by null and then recalculated using the neighbor logic explained above.
  4. Otherwise, if the last three months' average of 2017 was less than the 2018 median, all consumption values above the median were replaced by null and then recalculated using the neighbor logic explained above.
  5. Based on domain knowledge, a flat 5% year-on-year reduction was applied to the resultant 2018 forecast to account for the energy efficiency gained through the use of smart energy devices.
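A sketch of the trend adjustment, reusing impute_with_neighbors from above. The calendar ordering of the month columns, the pool used for the re-imputation, and the masking mechanics are assumptions made for illustration.

```python
# Fully imputed 2017 daily averages (meters x 12 months, columns in calendar order).
imputed_2017 = pd.concat([dataset1_imp, dataset2_imp, dataset3_imp])

# Step 1: project 2017 forward as-is to obtain the 2018 base forecast.
base_2018 = imputed_2017.copy()

# Step 2: per meter, compare the Q4-2017 average with the median of the 2018 base months.
q4_avg = imputed_2017.iloc[:, -3:].mean(axis=1)
year_median = base_2018.median(axis=1)
rising = q4_avg > year_median

# Steps 3/4: null out the months that contradict the recent trend, then re-impute them
# with the neighbor logic so the 2018 profile follows the latest consumption level.
adjusted = base_2018.copy()
keep_high = base_2018.loc[rising].ge(year_median[rising], axis=0)
adjusted.loc[rising] = base_2018.loc[rising].where(keep_high)
keep_low = base_2018.loc[~rising].le(year_median[~rising], axis=0)
adjusted.loc[~rising] = base_2018.loc[~rising].where(keep_low)
adjusted = impute_with_neighbors(adjusted, pool=imputed_2017)

# Step 5: flat 5% year-on-year reduction for efficiency gains from smart energy devices.
forecast_2018 = adjusted * 0.95
```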

The 2018 forecast derived above placed among the Top 15 leaderboard entries out of several thousand submissions.

This simple data preparation and domain-specific tweaking easily outstripped the ultra-complex algorithms and models used by other contestants.

In the end, applied data science is all about data preparation. — Andrew Ng

Conclusion: From the experience shared above, I hope the importance of data analysis, preparation, imputation and domain knowledge is well established. Sooner or later, the AutoML offerings of various vendors and hyperscalers may surpass human data scientists at algorithm/model selection, training, testing and so on. In this AutoML era, strength in data analysis and preparation is what will help us stand out and stay better than the machines.

Syed Abdul Syed Allaudeen
Technical Program Manager - Data Engineering & Data Science