U.S. electric disturbance events analysis (part 1): data cleaning and preparation.

A data cleaning process on a real dataset using Python.

Diogo Reis
10 min read · Dec 27, 2023

In this article, I show how I cleaned and prepared the dataset for future analysis, and also explain how I handled some problems along the way.

Introduction — What is an electric disturbance event?

Broadly speaking, an electric disturbance event is a term used to categorize problems involving energy interruptions, ranging from a loss of electric service for a short period of time to major blackouts.

Different factors can cause power interruptions, including severe weather, human or vegetation interference near power lines, system operations, and even cyber attacks.

Importing the datasets

The data used in this analysis comes from a report published annually by the U.S. Department of Energy and is available at DOE summaries. I downloaded the reports for the years 2019 through 2022.

After a quick look at the data, I imported the datasets using the code below. I skipped the first row because it contained the dataset title and used the header row as the column names. I did the same for the other datasets.

import pandas as pd
import numpy as np

df_19 = pd.read_excel("datasets/2019_Annual_Summary.xls", skiprows=0, header=1)
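
The other three years can be loaded the same way. A minimal sketch, assuming the 2020, 2021, and 2022 files follow the same naming pattern and layout as the 2019 file:

# Assumed file names; adjust to the actual names of the downloaded reports
df_20 = pd.read_excel("datasets/2020_Annual_Summary.xls", skiprows=0, header=1)
df_21 = pd.read_excel("datasets/2021_Annual_Summary.xls", skiprows=0, header=1)
df_22 = pd.read_excel("datasets/2022_Annual_Summary.xls", skiprows=0, header=1)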

In the next step, I concatenated the imported datasets (ignoring the original indexes so the result gets a single continuous index) and saved the result to a .csv file.

frames = [df_19, df_20, df_21, df_22]
df = pd.concat(frames, ignore_index=True)

# Saving the data to a .csv file
df.to_csv('datasets/df_orig.csv', index=False)

Basic information about the dataset

I used the .head() method to see the first five rows of the data.

Fig. 1 — First five rows of the dataset

Here is the data description:

  • Month: Month in which the event occurred
  • Date Event Began: Date of the event
  • Time Event Began: Time of the event
  • Date of Restoration: Date when the electric disturbance was restored
  • Time of Restoration: Time when the electric disturbance was restored
  • Area Affected: States and counties affected by the disturbance
  • NERC Region: Indicates the electrical region in which the event occurred
  • Alert Criteria: A description of the type of event
  • Event Type: Indicates the type of problem that caused the disturbance
  • Demand Loss (MW): Indicates how much load (residential, industrial, or other energy consuming equipment) was lost (in megawatts).
  • Number of Customers Affected: Indicates how many customers were affected

I checked the dataset shape and column names as follows:

#dataset shape
print(f'The dataset has {df.shape[0]} rows and {df.shape[1]} columns.')

Output:
The dataset has 1438 rows and 11 columns.

# dataset columns
df.columns

Output:
Index(['Month', 'Date Event Began', 'Time Event Began', 'Date of Restoration',
'Time of Restoration', 'Area Affected', 'NERC Region', 'Alert Criteria',
'Event Type', 'Demand Loss (MW)', 'Number of Customers Affected'],
dtype='object')

The dataset shows that 1438 electric disturbance events were reported in the U.S. over these four years. It may seem like a small number for a dataset, but the power grid is designed to avoid interruptions in the electrical system, and when an interruption does occur, it needs to be restored quickly. Moreover, large blackouts are rare events.

Removing duplicate data

The first step in the cleaning process was to check the number of duplicates, and I found that there were 12 duplicated rows.

df.duplicated().sum()

Output:
12

So I created a Boolean mask of the duplicated rows and used it to remove them from the dataframe.

mask = df.duplicated()
df = df[~mask]
df.duplicated().sum()

Output:
0

Correcting column names

It is a good practice to remove the whitespace from column names to avoid problems with dataframe manipulation. In addition, using column names without spaces allows you to directly access the columns by typing, for example, df.column_name.

I replaced the spaces with underscores using the .str.replace() method on the column index.

df.columns = df.columns.str.replace(' ','_')
df.columns

Output:

Index(['Month', 'Date_Event_Began', 'Time_Event_Began', 'Date_of_Restoration',
'Time_of_Restoration', 'Area_Affected', 'NERC_Region', 'Alert_Criteria',
'Event_Type', 'Demand_Loss_(MW)', 'Number_of_Customers_Affected'],
dtype='object')

Dataset dtypes and non-nulls

We can now see summary information about the dataset, such as the column dtypes and non-null counts, using the .info() method.

# dataset dtypes
df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1426 entries, 0 to 1437
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Month 1426 non-null object
1 Date_Event_Began 1426 non-null object
2 Time_Event_Began 1426 non-null object
3 Date_of_Restoration 1426 non-null object
4 Time_of_Restoration 1426 non-null object
5 Area_Affected 1426 non-null object
6 NERC_Region 1426 non-null object
7 Alert_Criteria 1426 non-null object
8 Event_Type 1426 non-null object
9 Demand_Loss_(MW) 1426 non-null object
10 Number_of_Customers_Affected 1425 non-null object
dtypes: object(11)
memory usage: 133.7+ KB

You can see that all columns are classified as the object dtype; we will correct this later. We also have 1426 rows (remember that 12 of the original 1438 rows were removed as duplicates).

However, the last column has 1425 non-null values, indicating that it contains one null value. I verified this by using the .isnull() method and summing the null values of each column.

df.isnull().sum()

Output:

Month 0
Date_Event_Began 0
Time_Event_Began 0
Date_of_Restoration 0
Time_of_Restoration 0
Area_Affected 0
NERC_Region 0
Alert_Criteria 0
Event_Type 0
Demand_Loss_(MW) 0
Number_of_Customers_Affected 1
dtype: int64

As expected, there is only one null value in the dataset. It is also possible that missing values were entered as strings, in which case they would not appear in the results above. In fact, this is the case for this dataset.

Replacing ‘Unknown’ values

In this dataset, when there is no information about a specific feature of an event, the value ‘Unknown’ is used. I replaced these with NaN values to make manipulation with pandas easier.

df = df.replace(to_replace='Unknown', value=np.nan)

Dropping columns

I decided to remove two columns from the dataset: ‘NERC_Region’ and ‘Alert_Criteria’, as follows:

df = df.drop(columns=['NERC_Region',
'Alert_Criteria'])

I dropped the ‘NERC_Region’ column because I want to focus the analysis on the U.S. states. The ‘Alert_Criteria’ column is just a description of the ‘Event_Type’ column.

Column data changes and corrections

In this part, since the dataset has only 9 columns, I checked each of them and made some corrections and modifications.

1. Correcting the ‘Month’ column

I applied the .unique() function to see if there were only twelve months in the column. However, as the code below shows, ‘January’ appeared twice.

df['Month'].unique()

Output:
array(['January ', 'January', 'February', 'March', 'April', 'May', 'June',
'July', 'August', 'September', 'October', 'November', 'December'],
dtype=object)

Actually, one of the ‘January’ values had a trailing space at the end of the string. To correct this, I used the .loc method to find every occurrence of the incorrect ‘January ’ string and replace it with the correct ‘January’.

df.loc[df['Month'] == 'January ','Month'] = 'January'
df['Month'].unique()

Output:
array(['January', 'February', 'March', 'April', 'May', 'June', 'July',
'August', 'September', 'October', 'November', 'December'],
dtype=object)
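
A more general way to guard against this kind of problem is to strip leading and trailing whitespace from every value in the column at once; a minimal sketch using the .str.strip() accessor:

# Remove leading/trailing whitespace from all month names in one step
df['Month'] = df['Month'].str.strip()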

2. Concatenating date/time columns

To make it easier to manipulate dates and times in pandas, I decided to concatenate the date and time columns of the electric events, i.e. to create one column representing the date and time when the event began and another representing the date and time when the disturbance was restored.

df['Date_Time_Event_Began'] = df['Date_Event_Began'] +' '+ df['Time_Event_Began']
df['Date_Time_Restoration'] = df['Date_of_Restoration'] +' '+ df['Time_of_Restoration']

# Dropping the old columns
df = df.drop(columns=['Date_Event_Began','Time_Event_Began','Date_of_Restoration','Time_of_Restoration'])

df.head()

This is what the dataset looks like with the columns added.

Fig. 2 — Date/time columns

These columns need to be converted to datetime (remember that all columns were classified as the object dtype). This was done using the .to_datetime() function. In the ‘errors’ parameter, I used ‘coerce’ so that invalid parsing results are set to NaT (Not a Time).

df['Date_Time_Event_Began'] = pd.to_datetime(df['Date_Time_Event_Began'], errors='coerce')
df['Date_Time_Restoration'] = pd.to_datetime(df['Date_Time_Restoration'], errors='coerce')

df.info()

Output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1426 entries, 0 to 1437
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Month 1426 non-null object
1 Area_Affected 1426 non-null object
2 Event_Type 1426 non-null object
3 Demand_Loss_(MW) 1426 non-null object
4 Number_of_Customers_Affected 1425 non-null object
5 Date_Time_Event_Began 1426 non-null datetime64[ns]
6 Date_Time_Restoration 1232 non-null datetime64[ns]
dtypes: datetime64[ns](2), object(5)
memory usage: 89.1+ KB

The ‘Date_Time_Restoration’ column has 1232 non-null values, indicating that some values in this column could not be parsed as datetimes and became NaT.
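
To see which rows ended up without a restoration timestamp, the NaT entries can be inspected directly; a quick sketch:

# Rows whose restoration date/time could not be parsed (NaT after coercion)
df.loc[df['Date_Time_Restoration'].isna()].head()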

3. Correcting the ‘Area_Affected’ column

I applied the .unique() function in the ‘Area_Affected’ column as follows:

df['Area_Affected'].unique()

Output:
array(['Washington:', 'California: Sacramento County;',
'Washington: King County, Thurston County, Pierce County;',
'Michigan:', 'New Mexico: San Juan County;',
'Montana: Valley County;', 'Maine:',
'Missouri: Jackson County; Kansas: Johnson County;',
'Missouri: Nebraska:', 'New Hampshire:',
'North Carolina: South Carolina:', 'Pennsylvania:', 'California:',
'Texas: Nueces County;', 'Nebraska:', 'Colorado: Larimer County;',
'Ohio:', 'Utah: Iron County;', 'Louisiana: Washington Parish;',
'Illinois: Scott County;', 'Illinois: Pike County;', 'Nevada',
'Massachusetts', 'California', 'Arkansas', 'Michigan',
'Pennsylvania', 'Washington', 'New Jersey', 'Utah', 'Texas',
'Wisconsin, Minnesota, Iowa, Illinois', 'Virginia',
'Ohio, Virginia, West Virginia', 'Ohio',
...

The column contains strings with the state affected by the electric disturbance, and some rows also include the counties. So I decided to keep only the state name in the column, and to do that I followed these steps:

1. Find all rows containing ‘:’, i.e. rows that include county names. To do this, I used the .str.contains() method.

2. Split the strings at the ‘:’ delimiter, using the .str.split() method.

3. Get the first element of the resulting list, i.e. the state name, using the .str.get() method.

Below is the code that executes these steps.

state_names_extracted = df.loc[df['Area_Affected'].str.contains(':'),'Area_Affected'] \
.str.split(':')\
.str.get(0)

Then I replaced the values of the rows containing ‘:’ with the extracted state names. Now the column contains only the U.S. state names.

df.loc[df['Area_Affected'].str.contains(':'),'Area_Affected'] = state_names_extracted
df['Area_Affected'].unique()

Output:
array(['Washington', 'California', 'Michigan', 'New Mexico', 'Montana',
'Maine', 'Missouri', 'New Hampshire', 'North Carolina',
'Pennsylvania', 'Texas', 'Nebraska', 'Colorado', 'Ohio', 'Utah',
'Louisiana', 'Illinois', 'Nevada', 'Massachusetts', 'Arkansas',
'New Jersey', 'Wisconsin, Minnesota, Iowa, Illinois', 'Virginia',
'Ohio, Virginia, West Virginia', ...

I also noticed that some values did not follow the desired pattern. So I corrected them as follows:

df.loc[df['Area_Affected']=='District of Columbia','Area_Affected'] = 'Washington'
df.loc[df['Area_Affected']=='Northern and Central California;','Area_Affected'] = 'California'
df.loc[df['Area_Affected']=='Western NY','Area_Affected'] = 'New York'
df.loc[df['Area_Affected']=='Central Oklahoma','Area_Affected'] = 'Oklahoma'
df.loc[df['Area_Affected']=='Texas panhandle, SE NM','Area_Affected'] = 'Texas, New Mexico'
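
The same corrections could also be written as a single step with a mapping dictionary passed to .replace(); a minimal sketch using the same substitutions as above:

# Equivalent one-step correction using a mapping dictionary
area_fixes = {
    'District of Columbia': 'Washington',
    'Northern and Central California;': 'California',
    'Western NY': 'New York',
    'Central Oklahoma': 'Oklahoma',
    'Texas panhandle, SE NM': 'Texas, New Mexico',
}
df['Area_Affected'] = df['Area_Affected'].replace(area_fixes)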

4. Correcting the ‘Event_Type’ column

In the ‘Event_Type’ column, I counted the number of values of each type of event and noticed some typos, such as ‘Sever’ instead of ‘Severe’.

df['Event_Type'].value_counts()

Output:
Severe Weather 445
System Operations 292
Vandalism 235
Transmission Interruption 143
...
Transmission Interruption/Distribution Interruption 1
Sever Weather 1
Name: Event_Type, dtype: int64

I corrected these typos with the following code.

df.loc[df['Event_Type'] == 'Sever Weather','Event_Type'] = 'Severe Weather'
df.loc[df['Event_Type'] == 'Suspicious activity','Event_Type'] = 'Suspicious Activity'
df.loc[df['Event_Type'] == 'Severe Weather/Transmisison Interruption','Event_Type'] = 'Severe Weather/Transmission Interruption'

5. Adding the ‘Event_Duration_(hours)’ column

It would be interesting to know how long each electric disturbance lasted. So I added a column to show this information.

df['Event_Duration_(hours)'] = df['Date_Time_Restoration'] - df['Date_Time_Event_Began']

df['Event_Duration_(hours)'] = [round(x.days*24 + x.seconds/(60*60),2) for x in df['Event_Duration_(hours)']]

The first part of the code calculates the difference between the start time and the restoration time. After that, I used a list comprehension to convert the timedelta values to hours in each row of the new column. It was necessary to include the number of days because some events began and ended on different days, and ignoring the days would make the calculation wrong. Finally, I rounded the results to two decimal places.
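
As a side note, the same conversion can be done without a list comprehension by using the vectorized .dt.total_seconds() accessor on the timedelta column; a minimal sketch, assuming the subtraction result is still available:

# Vectorized equivalent: convert the timedelta to hours and round to two decimals
# (NaT values become NaN automatically)
duration = df['Date_Time_Restoration'] - df['Date_Time_Event_Began']
df['Event_Duration_(hours)'] = (duration.dt.total_seconds() / 3600).round(2)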

When I checked this new column, I noticed that there was a negative event duration, as shown in Fig. 3.

df.loc[df['Event_Duration_(hours)'] <= 0]

Fig. 3 — Negative event duration

This happened because the restoration date was entered incorrectly, i.e. the year should be 2019 instead of 2018, so I fixed it:

df.loc[df['Date_Time_Restoration'] == '2018-08-18 23:00:00','Date_Time_Restoration'] = '2019-08-18 23:00:00'

I recreated the ‘Event_Duration_(hours)’ column with the corrected data and checked again for negative durations. Nothing wrong was found.
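
After recreating the column with the same code as above, a quick sanity check might look like this:

# Confirm that no negative durations remain after the fix
print(df.loc[df['Event_Duration_(hours)'] < 0].shape[0])  # expected: 0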

6. Converting columns to float type

In this part, I converted the ‘Demand_Loss_(MW)’ and ‘Number_of_Customers_Affected’ columns to the float datatype.

df['Demand_Loss_(MW)'] = df['Demand_Loss_(MW)'].astype('float64')
df['Number_of_Customers_Affected'] = df['Number_of_Customers_Affected'].astype('float64')

df.info()

Output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1426 entries, 0 to 1437
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Month 1426 non-null object
1 Area_Affected 1426 non-null object
2 Event_Type 1426 non-null object
3 Demand_Loss_(MW) 1016 non-null float64
4 Number_of_Customers_Affected 1241 non-null float64
5 Date_Time_Event_Began 1426 non-null datetime64[ns]
6 Date_Time_Restoration 1232 non-null datetime64[ns]
7 Event_Duration_(hours) 1232 non-null float64
dtypes: datetime64[ns](2), float64(3), object(3)
memory usage: 100.3+ KB
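
As a side note, if any non-numeric strings had still been present in these columns, .astype('float64') would have raised an error. A more forgiving alternative is pd.to_numeric with errors='coerce', which turns unparseable values into NaN; a minimal sketch:

# Tolerant alternative: unparseable values become NaN instead of raising an error
df['Demand_Loss_(MW)'] = pd.to_numeric(df['Demand_Loss_(MW)'], errors='coerce')
df['Number_of_Customers_Affected'] = pd.to_numeric(df['Number_of_Customers_Affected'], errors='coerce')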

7. Adding the ‘Year’ column and saving the final dataset

I created another new column in the dataset with the year of the electric disturbance events, using a list comprehension to get the year of each event in the ‘Date_Time_Event_Began’ column. I chose this column because it is complete, i.e. it has no null values.

df['Year'] = [x.year for x in df['Date_Time_Event_Began']]
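
The same result can be obtained without a list comprehension through the vectorized .dt.year accessor:

# Vectorized equivalent of the list comprehension above
df['Year'] = df['Date_Time_Event_Began'].dt.year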

This information will be useful for some aggregations and especially for plotting. Then I saved the final version of the dataset in a .csv file.

df.to_csv('datasets/df.csv', index=False)

This dataset is available in this repository of my GitHub account, along with the code used in the cleaning process.

Conclusion

In this post, we cleaned and prepared a real dataset using Python. Concepts such as Boolean masks, list comprehensions, and vectorized operations were used in the process.

In the next post, I will try to explore the dataset and answer some questions about electric disturbances in the U.S. over the last four years.

Thanks for reading! Please feel free to leave your suggestions.

I hope you find this post useful!

  • Linkedin — Connect with me on Linkedin
  • GitHub — Visit my Github to see my other projects
