Optimization of the Distribution Electric Grid in Nigeria: Data Cleaning and Transformation

Samuel Shaibu
Jul 30, 2023 · 10 min read


A case study of applying machine learning techniques and data analytics operations to optimize an energy grid

Generated from Microsoft Image Creator

Here are some truths about me you might need to know as we go deep into this article:

  • I am a Nigerian
  • I have lived all my life in Nigeria
  • I love Nigeria

Now that my truths have been established, let’s do a little digging into Nigeria. Nigeria is a West African country and the most populous country in Africa (more than 200 million people). Despite its abundant mineral, natural, and human resources, the country has not been able to fully tap into its potential.

Why is this so? A lot of people might want to argue about the history of how she gained her independence and the amalgamation of the northern and southern protectorates. The general consensus is that the people at the helm of affairs are corrupt and power-hungry, hence the deplorable state of things in the country.

In this article, I will be focusing on the deplorable state of the energy grid in the country and how machine learning techniques and data analytics can play a huge role in making things better. Here are some facts you also have to take with you throughout this article:

  • The major problem Nigeria is facing as regards electricity is that the supply of electricity cannot meet the available demand. Statista did more justice to the current electricity problem in the country, so you might want to check it out: Here
  • Machine learning or artificial intelligence cannot increase the supply of electricity, so it is not necessarily the solution to Nigeria’s electricity problem but a means to solving it.

So why, then, are we using machine learning to optimize the grid, and what does optimization mean in this context? It is common knowledge that the maintenance culture in Nigeria is very poor, and the government spends heavily on repairs and replacements due to failures and persistent outages in the national grid.

How about we have a system that can provide insights to grid operators and other stakeholders on what is going to happen on the grid? In this article, the optimization I will be focusing on is the enhancement of the reliability and performance of a distribution grid in Nigeria. With this system in place, there will be better maintenance schedules, and most of the maintenance done on the grid will be preventive maintenance, which will improve the reliability of the grid and also reduce the cost of maintenance.

Several papers have been written on using various machine learning models to predict failures in the energy grid, but for the sake of this article, I will be using Python and libraries like scikit-learn, TensorFlow, Keras, Pandas, and NumPy to build three different machine learning models for predicting the type of outage that will occur in the future. With these models deployed, grid operators and professionals will be able to make informed decisions that avert future failures and disasters in the grid.

In the development cycle for the system, the first thing to do is select a location and collect the appropriate data for the study. Ibadan city in the South-western part of Nigeria will be considered for this case study, and I will be gathering data on all the feeders in the region from the Ibadan Electricity Distribution Company (IBEDC). The data that will be used in building the machine learning models is the interruption data on the different types of outages in all the feeders.

Here’s a representation of how the optimization of the grid will be done:

Optimization of the grid methods

From the above cycle, the interruption data will be collected from IBEDC and then cleaned and transformed to ensure data quality and consistency. This cleaned data will then be fed into the Microsoft Power BI desktop app for reporting and tracking of the grid in real time. In optimizing the grid, the cleaned and transformed data will then be used in building different machine learning and deep learning models for predicting future types of outages.

Let’s jump right into it.

What is data cleaning? Data cleaning is the process of identifying and correcting errors, inconsistencies, and inaccuracies in datasets to improve their quality and reliability. This step is crucial before performing any operations on the data.

What is data transformation? Data transformation refers to the process of converting and manipulating data from one format, structure, or representation into another to make it suitable for specific analysis or to meet certain requirements.

Here’s a breakdown of the step-by-step data cleaning and transformation process that was carried out.

The interruption dataset from IBEDC was collected on a monthly basis. What this means is that there are 12 spreadsheets that will be fed into the machine learning models for predictions.

However, there is a lot of cleaning and transformation to be done. The development environment used for this project is Google Colaboratory (Colab).

The first thing I did was to import some libraries for the project:

# Import libraries necessary for this project

import pandas as pd
import plotly.express as px
from copy import copy
from scipy import stats
import matplotlib.pyplot as plt
import numpy as np
import plotly.figure_factory as ff
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from tensorflow import keras
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import confusion_matrix, classification_report, r2_score
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM

For this article, I will mainly be using the Pandas and NumPy libraries. NumPy, short for “Numerical Python”, is a powerful library for numerical computing that is widely used in data science, scientific computing, and machine learning applications. Pandas is used for data manipulation, data analysis, and data cleaning tasks.

After importing the libraries, the next step is to import the dataset, but since I’m using the Google Colab environment, I will first have to mount Google Drive.

from google.colab import drive
drive.mount('/content/drive')

For this project, I will be importing 12 different spreadsheets representing interruption data from the 12 months of the year 2021.

interruption_df1 = pd.read_excel('*****.xlsx')
interruption_df2 = pd.read_excel('*****.xlsx')
interruption_df3 = pd.read_excel('*****.xlsx')
interruption_df4 = pd.read_excel('*****.xlsx')
interruption_df5 = pd.read_excel('*****.xlsx')
interruption_df6 = pd.read_excel('*****.xlsx')
interruption_df7 = pd.read_excel('*****.xlsx')
interruption_df8 = pd.read_excel('*****.xlsx')
interruption_df9 = pd.read_excel('*****.xlsx')
interruption_df10 = pd.read_excel('*****.xlsx')
interruption_df11 = pd.read_excel('*****.xlsx')
interruption_df12 = pd.read_excel('*****.xlsx')

“*****” represents the path of the dataset in Google Drive.
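As an aside, the twelve read_excel calls can also be written as a loop. The sketch below is not the code used in the project; the folder path and file names are hypothetical placeholders.

# Sketch: load all 12 monthly spreadsheets in a loop.
# The folder path and file names here are hypothetical placeholders.
base_path = '/content/drive/MyDrive/interruptions/'
monthly_dfs = [
    pd.read_excel(f'{base_path}interruptions_2021_{month:02d}.xlsx')
    for month in range(1, 13)
]
interruption_df1 = monthly_dfs[0]  # January, and so on for the other months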

Here’s a representation of the first DataFrame before any form of cleaning or transformation.

DataFrame before data cleaning and transformation

From the above image, it is easy to see that the following data cleaning and transformation steps must be done in this order:

  • Removal of the top 2 rows
  • Make the first row the header of the DataFrame
  • Remove the top row of the DataFrame

To remove the top 2 rows, make the first row the header of the DataFrame, and remove the top row of the DataFrame, the Python code below was implemented.

interruption_df1 = interruption_df1.drop(interruption_df1.index[:2])

## To make the first row the header of the dataframe
interruption_df1.columns = interruption_df1.iloc[0]
interruption_df1 = interruption_df1[1:]

## To remove the top row of the dataframe
interruption_df1 = interruption_df1.drop(interruption_df1.index[:1])
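As a side note, the same result can often be obtained at read time by telling read_excel which rows to skip and which row holds the header. This is only a sketch; the exact row offsets are an assumption based on the layout described above and may need adjusting for the actual spreadsheet.

# Sketch: skip the junk rows and pick up the real header directly at read time.
# The row offsets below are assumptions based on the layout described above.
interruption_df1 = pd.read_excel(
    '*****.xlsx',            # same placeholder path as before
    skiprows=[0, 1, 2, 4],   # the three rows above the real header and the stray row just below it
    header=0,                # the first remaining row becomes the column names
)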

It is also necessary to rename the “Date (For Carry Over Outages)” column to “Date”, so I used the lines of code below.

# To rename the date column
new_column_name = 'Date'
interruption_df1 = interruption_df1.rename(columns = {interruption_df1.columns[0]: new_column_name})

Here’s a representation of the DataFrame after the above cleaning and transformations:

DataFrame after renaming the Date column and other data transformations

There are some unwanted columns in the DataFrame that need to be dropped. At this stage, I will be dropping the 2nd column (NaN), the last column (NaN), and the Remarks column.

# To remove the first NaN column
column_index = 1  # Index of the column you want to remove
interruption_df1 = interruption_df1.drop(interruption_df1.columns[column_index], axis=1)

# To remove the last NaN column
column_index = -1  # Index of the column you want to remove
interruption_df1 = interruption_df1.drop(interruption_df1.columns[column_index], axis=1)

# To remove the Remarks column
column_index = -2  # Index of the column you want to remove
interruption_df1 = interruption_df1.drop(interruption_df1.columns[column_index], axis=1)

With the implementation of the above code, here is a representation of the DataFrame.

DataFrame after dropping some unwanted columns
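Dropping columns by position works, but it silently breaks if the column order ever changes. An alternative sketch, assuming the column labels match those shown in the screenshot ('Remarks' is an assumed label), is to keep only the columns you want by name:

# Sketch: keep only columns with a real (non-NaN) label other than 'Remarks'.
# 'Remarks' is an assumed column label based on the screenshot above.
keep_mask = [(not pd.isna(col)) and col != 'Remarks' for col in interruption_df1.columns]
interruption_df1 = interruption_df1.loc[:, keep_mask]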

The structure of the DataFrame seems to be fine. The next line of action is to pick each column and analyze it for any form of necessary data transformations.

For the Date column, there is a lack of consistency, with the rows having different data types. The first thing to do before performing any transformation on the Date column is to convert its data type to string.

interruption_df1['Date'] = interruption_df1['Date'].astype(str)

Now that the Date column has been converted to strings, I would like to have an idea of the format of the column and how all the values are structured. This can be done by finding the unique values in the Date column.

distinct_dates = interruption_df1['Date'].unique()
distinct_dates

Output: array(['2021-01-01 00:00:00', '2020-12-31 00:00:00', '2021-02-01 00:00:00', '2021-01-02 00:00:00', '2021-03-01 00:00:00', '2020-01-03 00:00:00', '2021-04-01 00:00:00', '2021-05-01 00:00:00', '2021-01-04 00:00:00', '2021-01-05 00:00:00', '2021-06-01 00:00:00', '2021-01-06 00:00:00', '2021-07-01 00:00:00', '1900-01-07 00:00:00', '2021-08-01 00:00:00', '1900-01-08 00:00:00', '2021-09-01 00:00:00', '1900-01-09 00:00:00', '2020-09-01 00:00:00', '2021-10-01 00:00:00', '2021-01-10 00:00:00', '2021-11-01 00:00:00', '2021-01-11 00:00:00', '2021-12-01 00:00:00', '2021-01-12 00:00:00', '13/1/2021', '2021-01-13 00:00:00', '14/1/2021', '2021-01-14 00:00:00', '15/1/2021', '2021-01-15 00:00:00', '16/1/2021', '17/1/2021', '2021-01-17 00:00:00', '18/1/2021', '2021-07-18 00:00:00', '19/1/2021', '2021-07-19 00:00:00', '20/1/2021', '2021-01-19 00:00:00', '2021-01-20 00:00:00', '21/1/2021', '22/1/2021', '2021-01-21 00:00:00', '23/1/2021', '2021-01-22 00:00:00', '2021-01-23 00:00:00', '24/1/2021', '2021-01-24 00:00:00', '25/1/2021', '2021-01-25 00:00:00', '26/1/2021', '2021-01-26 00:00:00', '27/1/2021', '2021-01-27 00:00:00', '28/1/2021', '2021-01-28 00:00:00', '29/1/2021', '2021-01-29 00:00:00', '30/1/2021', '2021-01-30 00:00:00', '31/1/2021', '2021-01-31 00:00:00'], dtype=object)

From the above output, you can deduce that there is a need to replace some values in the Date column and also drop some specific values that do not correspond to January 2021. In doing so, I will be ensuring that every value in the Date column follows a single ‘mm/dd/yyyy’ format, which is what the replacement mapping below produces. This can be done using the lines of code below.

# Replacing values in the Date column
interruption_df1['Date'] = interruption_df1['Date'].replace({
'2021-01-01 00:00:00': '01/01/2021',
'2021-02-01 00:00:00': '01/02/2021',
'2021-01-02 00:00:00': '01/02/2021',
'2021-03-01 00:00:00': '01/03/2021', '2021-04-01 00:00:00': '01/04/2021', '2021-05-01 00:00:00': '01/05/2021', '2021-01-04 00:00:00': '01/04/2021',
'2021-01-05 00:00:00': '01/05/2021', '2021-06-01 00:00:00': '01/06/2021', '2021-01-06 00:00:00': '01/06/2021', '2021-07-01 00:00:00': '01/07/2021',
'2021-08-01 00:00:00': '01/08/2021', '2021-09-01 00:00:00': '01/09/2021', '2021-10-01 00:00:00': '01/10/2021', '2021-01-10 00:00:00': '01/10/2021',
'2021-11-01 00:00:00': '01/11/2021', '2021-01-11 00:00:00': '01/11/2021', '2021-12-01 00:00:00': '01/12/2021', '2021-01-12 00:00:00': '01/12/2021',
'13/1/2021': '01/13/2021', '2021-01-13 00:00:00': '01/13/2021', '14/1/2021': '01/14/2021', '2021-01-14 00:00:00': '01/14/2021',
'15/1/2021': '01/15/2021', '2021-01-15 00:00:00': '01/15/2021', '16/1/2021': '01/16/2021', '17/1/2021': '01/17/2021', '2021-01-17 00:00:00': '01/17/2021',
'18/1/2021': '01/18/2021', '19/1/2021': '01/19/2021', '20/1/2021': '01/20/2021', '2021-01-19 00:00:00': '01/19/2021', '2021-01-20 00:00:00': '01/20/2021',
'21/1/2021': '01/21/2021', '22/1/2021': '01/22/2021', '2021-01-21 00:00:00': '01/21/2021', '23/1/2021': '01/23/2021', '2021-01-22 00:00:00': '01/22/2021',
'2021-01-23 00:00:00': '01/23/2021', '24/1/2021': '01/24/2021', '2021-01-24 00:00:00': '01/24/2021', '25/1/2021': '01/25/2021',
'2021-01-25 00:00:00': '01/25/2021', '26/1/2021': '01/26/2021', '2021-01-26 00:00:00': '01/26/2021', '27/1/2021': '01/27/2021',
'2021-01-27 00:00:00': '01/27/2021', '28/1/2021': '01/28/2021', '2021-01-28 00:00:00': '01/28/2021', '29/1/2021': '01/29/2021',
'2021-01-29 00:00:00': '01/29/2021', '30/1/2021': '01/30/2021', '2021-01-30 00:00:00': '01/30/2021', '31/1/2021': '01/31/2021',
'2021-01-31 00:00:00': '01/31/2021'
})

# Dropping specific values
drop_values =['2020-01-03 00:00:00', '2020-12-31 00:00:00', '1900-01-07 00:00:00', '1900-01-08 00:00:00', '1900-01-09 00:00:00', '2020-09-01 00:00:00',
'2021-07-18 00:00:00', '2021-07-19 00:00:00']
interruption_df1 = interruption_df1[~interruption_df1['Date'].isin(drop_values)]
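After the replacements, every remaining value in the Date column is an ‘mm/dd/yyyy’ string. As a quick sanity check (a sketch, not part of the original pipeline), the column can be converted to a proper datetime dtype and verified to fall entirely within January 2021:

# Sketch: convert the cleaned strings to datetime and confirm they all fall in January 2021.
interruption_df1['Date'] = pd.to_datetime(interruption_df1['Date'], format='%m/%d/%Y')
assert interruption_df1['Date'].between(pd.Timestamp('2021-01-01'), pd.Timestamp('2021-01-31')).all()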

The next transformation is to extract numbers from the Load column. The idea is that all values in the Load column should be float or integer data types.

# Extract only the numbers from the Load(MW) column
import re

column_index = 9  # Index of the 10th column (indexing starts from 0)
interruption_df1.iloc[:, column_index] = interruption_df1.iloc[:, column_index].apply(
    lambda x: re.sub('[^0-9.]', '', str(x))  # keep digits and the decimal point so float values survive
)

However, the data type of the “Load(MW)” column is still object. To convert it to a numeric column, this line of code can be used.

interruption_df1['LOAD LOSS (MW)'] = pd.to_numeric(interruption_df1['LOAD LOSS (MW)'], errors='coerce')

Now that this has been implemented, we might want to investigate the DataFrame for null values.

interruption_df1.isnull().sum()

The output from the above line of code will then determine the next line of action. For this DataFrame, I will be dropping the null values in the “Nature/Cause of Outage”, “No of Customers restored”, and “Customer-Hours Interruption (CHI)” columns.

# To drop null values in the Nature/Cause of Outage column
interruption_df1.dropna(subset=["Nature/Cause of Outage"], inplace=True)

# To drop null values in the No of Customers restored and Customer-Hours Interruption (CHI) columns
interruption_df1.dropna(subset=["No of Customers restored"], inplace=True)
interruption_df1.dropna(subset=["Customer-Hours Interruption (CHI)"], inplace=True)

For the “Event No” column, I decided to handle the null values by filling them down. What this means is that the null values are being replaced by the preceding value.

# To fill empty rows in the Event No column
interruption_df1["Event No"].fillna(method='ffill', inplace=True)

The “Relay Target” column has to be transformed such that only the value before any separator symbol (a comma or an ampersand) is kept.

interruption_df1['Relay Target'] = interruption_df1['Relay Target'].apply(lambda x: x.split(',')[0] if isinstance(x, str) and ',' in x else x)
interruption_df1['Relay Target'] = interruption_df1['Relay Target'].apply(lambda x: x.split('&')[0] if isinstance(x, str) and '&' in x else x)
interruption_df1['Relay Target'] = pd.to_numeric(interruption_df1['Relay Target'], errors='coerce').fillna(interruption_df1['Relay Target'])

This basically wraps up the data cleaning and transformation on this DataFrame. The same steps can be replicated on the other DataFrames, as sketched below.
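Since the other eleven DataFrames need the same treatment, one way to avoid repeating all of the code above is to wrap the steps into a function. The sketch below assumes every monthly sheet shares the January layout and column names; the month-specific parts, such as the date replacements, would still need to be adapted per file.

def clean_interruption_df(df):
    """Sketch: apply the cleaning steps described above to one monthly sheet.
    Assumes every sheet shares the January layout and column names."""
    df = df.drop(df.index[:2])                       # remove the top 2 rows
    df.columns = df.iloc[0]                          # promote the first remaining row to header
    df = df[1:]
    df = df.drop(df.index[:1])                       # remove the stray row under the header
    df = df.rename(columns={df.columns[0]: 'Date'})  # rename the date column
    # ...remaining steps (column drops, date fixes, type conversions, null handling)...
    return df

cleaned_dfs = [clean_interruption_df(df) for df in
               [interruption_df2, interruption_df3]]  # and so on through interruption_df12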

From this article, you should now have an idea of not just the electricity problems we face here in Nigeria but also how I am applying data analysis and data science to gain more insights into how to optimize the distribution grid.

I’ve gone through a step-by-step approach to cleaning and transforming the data. In the next articles, I will delve into the exploratory data analysis, data visualization, data preprocessing, data storytelling with Power BI, and machine learning model building parts of this case study.

One more thing… you can follow me on LinkedIn, where I post about data, artificial intelligence, ethics in AI, energy transition, and more. I also have some of my data and AI projects on my GitHub, so you might want to check them out. Follow me on Medium for more of my articles on data, artificial intelligence, and energy transition.
