An Exploratory Data Analysis of Carbon Emissions

Hey folks, ever wondered what’s really going on with carbon emissions worldwide?

Varun Tyagi
Operations Research Bit
14 min read · Apr 22, 2024


Image generated using DALL-E 3

Introduction

Well, buckle up because we’re about to dive into the nitty-gritty of it all. Today, we’re kicking off our journey into the fascinating world of exploratory data analysis (EDA) of carbon emissions across the globe.

Now, before you start yawning, let me tell you why this matters. Carbon emissions are like the pesky villains in our environmental story, wreaking havoc on our planet’s health. They come from all sorts of sources, from factories to cars to cows (yep, even cows contribute!). And understanding where they’re coming from, how much there is, and where they’re headed is crucial if we want to tackle climate change head-on.

So, in this series of analyses, we’re going to roll up our sleeves and dig deep into the data. We’ll be uncovering trends, spotting patterns, and maybe even uncovering a few surprises along the way. Because knowledge is power, my friends, and when it comes to saving our planet, we could all use a little more of it. So, let’s get cracking and see what the numbers have to say about our carbon conundrum!

Let us drill down into the code and conduct the data analysis.

Code Breakdown

In the following code, we will analyze and visualize CO2 emissions data by country, explore trends over time, and identify major contributors to CO2 emissions.

Import Libraries

Here, we’re importing the necessary libraries for our analysis:

  • pandas for data manipulation and analysis.
  • zipfile for handling ZIP files.
  • requests for making HTTP requests to download data from the web.
  • matplotlib.pyplot and seaborn for data visualization.
import pandas as pd
import zipfile
import requests
import matplotlib.pyplot as plt
import seaborn as sns

Download the dataset

In this process, we will utilize the requests library to download the dataset and the zipfile library to extract its contents. Additionally, the os library will be employed to remove the ZIP file after extraction. Finally, the pandas library will be used to read the dataset's contents from the CSV file into a dataframe.

# Download the ZIP file
url = "https://api.worldbank.org/v2/en/indicator/EN.ATM.CO2E.PC?start=2000&end=2023&downloadformat=csv"
response = requests.get(url)

# Save the ZIP file to a temporary file
with open("co2_emissions.zip", "wb") as zip_file:
    zip_file.write(response.content)

# Unzip the file
with zipfile.ZipFile("co2_emissions.zip", "r") as zip_ref:
    zip_ref.extractall()

# Delete the ZIP file
import os
os.remove("co2_emissions.zip")

# Load the CSV data into a DataFrame
df = pd.read_csv("API_EN.ATM.CO2E.PC_DS2_en_csv_v2_191.csv", skiprows=4)

Explore the first 10 rows of the dataframe

Let us check what our dataframe looks like. We can see that our dataframe is wide, not long: each year sits in its own column, as opposed to all years being gathered under a single year column. We also see certain NaNs under the year columns, which we will tackle in the preprocessing phase.

df.head(10)
The output

Data Preprocessing

As mentioned above, the dataset that we have is in a wide format, meaning that the years are in columns as opposed to being in rows. In order to work properly with our dataset, we need to gather all these year columns under one column named Year while retaining the numerical values of CO2 Emissions (metric tons per capita).

To achieve the desired result, we thankfully have the melt function in the pandas library to reshape a DataFrame (df). The melt function is commonly used to transform wide-form data into long-form data.

id_vars: These are the columns that we want to keep as-is and not melt. In this case, columns like Country Name, Country Code, Indicator Name, and Indicator Code will remain as they are.

var_name: This parameter specifies the name of the new column that will store the variable names (in this case, Year). The variable names are the columns that we want to melt into a single column.

value_name: This parameter specifies the name of the new column that will store the values corresponding to the variable names. In this case, it’s CO2 Emissions (metric tons per capita).

So, the resulting melted_df dataframe will have a structure where the columns Country Name, Country Code, Indicator Name, and Indicator Code remain unchanged, and the columns that represent different years (previously in wide format) are melted into two columns: Year and CO2 Emissions (metric tons per capita).

This reshaping is often useful when we want to perform analyses or visualizations that are easier to handle with long-form data, especially when dealing with time series data.

Later we also use the dropna function of the pandas library to remove null values from our newly created CO2 Emissions (metric tons per capita) column. The subset parameter specifies the column(s) to consider for missing values, and inplace=True means that the changes are applied directly to the original DataFrame (melted_df) without the need to create a new DataFrame.

The following code snippet removes rows from melted_df where the CO2 Emissions (metric tons per capita) column has missing values. It's a common practice to handle missing data before performing further analysis or visualization to ensure accurate and meaningful results.
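As a minimal, self-contained sketch of this melt-then-dropna reshaping, here is the same pattern applied to a toy wide-format frame (hypothetical country names and values, for illustration only):

```python
import pandas as pd

# Toy wide-format frame: one row per country, one column per year
# (hypothetical numbers, for illustration only)
wide = pd.DataFrame({
    "Country Name": ["Aland", "Borduria"],
    "2019": [1.2, None],
    "2020": [1.3, 4.1],
})

# Melt the year columns into a single 'Year' column
long = wide.melt(id_vars=["Country Name"], var_name="Year", value_name="CO2")

# Drop rows where the melted value is missing
long.dropna(subset=["CO2"], inplace=True)

print(long)
```

The two year columns collapse into Year/CO2 pairs, and the row with the missing 2019 value disappears, which is exactly what happens to the real dataframe below, just at a larger scale.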

# Melt the DataFrame to a format suitable for analysis
melted_df = df.melt(id_vars=["Country Name", "Country Code", "Indicator Name", "Indicator Code"], var_name="Year", value_name="CO2 Emissions (metric tons per capita)")

# Handle missing values (replace with appropriate methods based on your analysis)
melted_df.dropna(subset=["CO2 Emissions (metric tons per capita)"], inplace=True)

# Explore basic information (optional)
print("Number of rows:", melted_df.shape[0])
print("Number of columns:", melted_df.shape[1])
print("List of columns:", melted_df.columns.tolist())
print("Data types of each column:", melted_df.dtypes)
The output

Check unique values of years

Let us check if we have the data in the right format. As you can see, we have all the years in the Year column and all the numeric values corresponding to those years in the CO2 Emissions (metric tons per capita) column.

# Print the unique values of years
print(melted_df['Year'].unique())

print(melted_df['Indicator Name'].unique())

print(melted_df['CO2 Emissions (metric tons per capita)'].unique())
The output

Now our dataframe is ready for analysis.

Explore melted_df

Let us finally check our entire melted dataframe. As you can see, we now have all the years under the Year column and the corresponding CO2 emissions under the CO2 Emissions (metric tons per capita) column.

melted_df.head(10)
The output

Trends for India

Let us visualize how India has contributed to CO2 emissions over the years. You can see a strong dip since 2017, owing to major policy decisions taken by the government. Initiatives such as the National Action Plan on Climate Change (NAPCC), the International Solar Alliance (ISA), renewable energy targets, the UJALA Scheme (Unnat Jyoti by Affordable LEDs for All), the FAME India Scheme (Faster Adoption and Manufacturing of Hybrid and Electric Vehicles), green building initiatives, an enhanced focus on public transport, and afforestation and reforestation have all contributed to mitigating the effects of greenhouse gas emissions.

# Select the column containing CO2 emissions (assuming it's named "CO2 Emissions (metric tons per capita)")
co2_emissions_col = "CO2 Emissions (metric tons per capita)"

# Filter data for the desired country
country_name = "India" # Replace with the desired country
country_data = melted_df[melted_df["Country Name"] == country_name]

# Plot CO2 emissions over time for the specific country
country_data.plot(x="Year", y=co2_emissions_col, marker="o")
plt.title(f"CO2 Emissions in {country_name} over Time")
plt.xlabel("Year")
plt.ylabel(co2_emissions_col) # Add label for the y-axis
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
The Output

Trends for countries over years

We can also explore trends for other countries over the years and compare them against each other.

# Explore trends over time and identify major contributors
plt.figure(figsize=(20, 8))
sns.lineplot(x='Year', y=co2_emissions_col, hue='Country Name', data=melted_df)
plt.title('CO2 Emissions Trends Over Time')
plt.legend().set_visible(False)
plt.show()
The output

As the above chart is a bit cluttered, we can instead show the trends for the major contributors only, filtering for the top emitters and plotting just those on the graph.

# Calculate total emissions for each country
total_emissions = melted_df.groupby('Country Name')[co2_emissions_col].sum()

# Identify major contributors (e.g., top 10)
major_contributors = total_emissions.nlargest(10).index.tolist()

# Filter the data for major contributors
filtered_df = melted_df[melted_df['Country Name'].isin(major_contributors)]

# Plot the data with legends for major contributors only
plt.figure(figsize=(20, 8))
sns.lineplot(x='Year', y=co2_emissions_col, hue='Country Name', data=filtered_df)
plt.title('CO2 Emissions Trends Over Time')
plt.legend()
plt.show()
The output

Descriptive statistics by time period

The following code is using the groupby function in pandas to group the data in the DataFrame melted_df by the Year column. After grouping, it is applying the describe function to calculate various descriptive statistics for each group. Specifically, it is calculating the mean, standard deviation, minimum, 25th percentile (Q1), median (50th percentile or Q2), 75th percentile (Q3), and maximum for the CO2 Emissions (metric tons per capita) column within each year.

# Calculate descriptive statistics by time period (As 'Year' is the time period column)
print("Descriptive statistics by time period:")
print(melted_df.groupby('Year')['CO2 Emissions (metric tons per capita)'].describe())
The output

Descriptive statistics by country

Let us do the same for country.

# Calculate descriptive statistics by country
print("\nDescriptive statistics by Country:")
print(melted_df.groupby('Country Name')['CO2 Emissions (metric tons per capita)'].describe())
The output

Interactive Maps

We will now interact with our data in a geospatial way. Our code sets up a Python environment for visualizing geospatial data using the geopandas (handle geospatial data), pandas (data manipulation and analysis), and folium (create interactive maps) libraries in a Jupyter notebook or IPython environment. Additionally, it imports the MarkerCluster plugin (to display markers on a map efficiently, especially when there are a large number of markers in a small area) from folium and the display function from IPython.display (to render and display visualizations in Jupyter notebooks).

But before doing so, let us download the countries' geospatial data from the Natural Earth website using the wget command.

!wget 'https://naciscdn.org/naturalearth/110m/cultural/ne_110m_admin_0_countries.zip'
The output

Display on the map

The following code uses the geopandas, pandas, and folium libraries to create an interactive map using the data related to CO2 emissions. Here is a breakdown of the code:

Importing Libraries

  • geopandas: For handling geospatial data.
  • pandas: For data manipulation and analysis.
  • folium: For creating interactive maps.
  • MarkerCluster: A plugin from folium for clustering markers on the map.
  • IPython.display: For displaying the map in the Colab notebook.

Summing Total Emissions per Region

The code groups the data in melted_df by Country Name and sums the CO2 Emissions (metric tons per capita) for each country. The result is a DataFrame named region_emissions.

Loading World Map Data

It loads a world map dataset named 110m from a local file path (/content/ne_110m_admin_0_countries.zip) using geopandas.read_file().

Merging World Map with CO2 Emissions Data

The code merges the world map data (world) with the region_emissions DataFrame based on the country names (ADMIN and Country Name).

Filtering Rows

It filters out rows with a total CO2 emission count of 0.
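Note that a left merge keeps every map polygon and fills the sum column with NaN wherever no country name matches; because NaN > 0 evaluates to False, the filter drops unmatched countries along with true zeros. A toy sketch of this behavior (hypothetical country names and values):

```python
import pandas as pd

# Toy stand-ins for the world map attributes and the aggregated emissions
world = pd.DataFrame({"ADMIN": ["Aland", "Borduria", "Syldavia"]})
region_emissions = pd.DataFrame({"Country Name": ["Aland", "Borduria"],
                                 "sum": [12.5, 0.0]})

# Left merge keeps all polygons; Syldavia gets NaN in 'sum'
merged = world.merge(region_emissions, left_on="ADMIN",
                     right_on="Country Name", how="left")

# NaN > 0 is False, so the unmatched row and the zero row are both dropped
filtered = merged[merged["sum"] > 0]
print(filtered["ADMIN"].tolist())  # ['Aland']
```

This is why countries whose Natural Earth name does not exactly match the World Bank name silently vanish from the map; a name-mapping step would be needed to keep them.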

Creating a Folium Map

It creates a Folium map centered at the mean latitude and longitude of the world map geometries.

Adding GeoJSON Data to the Map

It adds GeoJSON data (world map with CO2 emission data) to the Folium map, including a tooltip with information about each country.

Adding Marker Clusters

It adds marker clusters to the map using MarkerCluster(). Each marker represents a country, and markers are clustered to improve map readability.

Adding Markers with Popup Annotations

It iterates over the rows of the world DataFrame and adds markers with popups containing information about the country name (ADMIN) and the total CO2 emissions.

Displaying the Map

Finally, it displays the interactive map in the Colab notebook using IPython.display.display(m).

import geopandas as gpd
import pandas as pd
import folium
from folium.plugins import MarkerCluster
from IPython.display import display

# Sum the total emissions per region/country
region_emissions = melted_df.groupby('Country Name')['CO2 Emissions (metric tons per capita)'].sum().reset_index().round(2)
region_emissions.columns = ['Country Name', 'sum']

# Load the '110m' cultural vectors dataset from a local file path
world = gpd.read_file('/content/ne_110m_admin_0_countries.zip')

# Merge the world map with the summed CO2 based on country names
world = world.merge(region_emissions, left_on='ADMIN', right_on='Country Name', how='left')

# Filter out rows with count 0
world = world[world['sum'] > 0]

# Create a Folium Map centered at the mean of latitude and longitude
m = folium.Map(location=[world.geometry.centroid.y.mean(), world.geometry.centroid.x.mean()], zoom_start=2)

# Add GeoJSON data to the map
folium.GeoJson(world, name='geojson', tooltip=folium.features.GeoJsonTooltip(fields=['ADMIN', 'sum'], aliases=['Country', 'sum'])).add_to(m)

# Add sum values as text annotations using MarkerCluster
marker_cluster = MarkerCluster().add_to(m)
for idx, row in world.iterrows():
    folium.Marker(
        location=[row.geometry.centroid.y, row.geometry.centroid.x],
        popup=f"{row['ADMIN']}: {row['sum']:.0f}",
    ).add_to(marker_cluster)

# Display the map in the Colab notebook
display(m)
The output

Display on the map between specific years

This section of code is designed to visualize and analyze CO2 emissions data over a specified time frame. Initially, the data is filtered to include only the years between 2000 and 2020. Then, the total CO2 emissions per country are calculated and aggregated. Next, a world map dataset is loaded and merged with the aggregated CO2 emissions data based on country names. The merged dataset is further filtered to remove countries with zero emissions.

A Folium map is then created, centered at the mean latitude and longitude of the countries in the dataset. The map is enriched with GeoJSON data representing country borders and CO2 emissions information, displayed as tooltips. Additionally, text annotations showing the country names and total emissions are added using MarkerCluster, which clusters markers to improve visualization.

Finally, the map is displayed in the Colab notebook to provide a visual representation of CO2 emissions across different countries over the specified time frame.

import geopandas as gpd
import pandas as pd
import folium
from folium.plugins import MarkerCluster
from IPython.display import display

# Define the starting and ending year of the desired time frame
start_year = 2000
end_year = 2020

# Convert the 'Year' column to integers before filtering
melted_df['Year'] = pd.to_numeric(melted_df['Year'], errors='coerce')

# Filter the DataFrame for the specified time frame
filtered_df = melted_df[(melted_df['Year'] >= start_year) & (melted_df['Year'] <= end_year)]

# Calculate the sum of CO2 emissions per region/country
region_emissions = (filtered_df.groupby('Country Name')['CO2 Emissions (metric tons per capita)']
                    .sum()
                    .reset_index()
                    .round(2)
                    )

region_emissions.columns = ['Country Name', 'sum']

# Load the '110m' cultural vectors dataset from a local file path
world = gpd.read_file('/content/ne_110m_admin_0_countries.zip')

# Merge the world map with the summed CO2 based on country names
world = world.merge(region_emissions, left_on='ADMIN', right_on='Country Name', how='left')

# Filter out rows with count 0
world = world[world['sum'] > 0]

# Create a Folium Map centered at the mean of latitude and longitude
m = folium.Map(location=[world.geometry.centroid.y.mean(), world.geometry.centroid.x.mean()], zoom_start=2)

# Add GeoJSON data to the map
folium.GeoJson(world, name='geojson', tooltip=folium.features.GeoJsonTooltip(fields=['ADMIN', 'sum'], aliases=['Country', 'sum'])).add_to(m)

# Add sum values as text annotations using MarkerCluster
marker_cluster = MarkerCluster().add_to(m)
for idx, row in world.iterrows():
    folium.Marker(
        location=[row.geometry.centroid.y, row.geometry.centroid.x],
        popup=f"{row['ADMIN']}: {row['sum']:.0f}",
    ).add_to(marker_cluster)

# Display the map in the Colab notebook
display(m)
The output

Correlation Coefficients

This code section performs correlation analysis between CO2 emissions, GDP, and population for different countries.

  1. It selects relevant columns from the melted DataFrame (co2_df) and separate DataFrames for GDP (gdp_df) and population (population_df).
  2. Then, it merges these DataFrames based on the ‘Country Name’ column, creating a new DataFrame (merged_df).
  3. The column names in the merged DataFrame are renamed for clarity.
  4. Next, it calculates correlation coefficients between CO2 emissions, GDP, and population using Spearman method.
  5. Finally, it visualizes the correlation matrix as a heatmap using seaborn, annotating the correlation values.
co2_df = melted_df[["Country Name",'CO2 Emissions (metric tons per capita)']]
gdp_df = world[['Country Name','GDP_MD']]
population_df = world[['Country Name','POP_EST']]

# Merge DataFrames based on a common identifier (e.g., 'Country Name' or 'Country Code')
merged_df = co2_df.merge(gdp_df, on='Country Name') # Replace with the appropriate merge key
merged_df = merged_df.merge(population_df, on='Country Name')

# Rename columns in the merged DataFrame
merged_df = merged_df.rename(columns={
    'CO2 Emissions (metric tons per capita)': 'CO2 Emissions',
    'GDP_MD': 'GDP',
    'POP_EST': 'Population'
})

# Calculate correlation coefficients
correlations = merged_df[['CO2 Emissions', 'GDP', 'Population']].corr(method='spearman')


# Plotting the correlation matrix as a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlations, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix')
plt.show()
The output

Scatter Plot with Regression Line

The code generates two scatter plots with regression lines, depicting the relationship between CO2 emissions and GDP, and between CO2 emissions and population. The x-axis represents GDP (in million USD) or the population estimate, while the y-axis represents CO2 emissions (in metric tons per capita). Each point on a scatter plot represents a country's GDP or population and its corresponding CO2 emissions. The regression line indicates the overall trend in the data, showing how changes in GDP or population relate to CO2 emissions. The plots are enhanced with titles, axis labels, and a specified figure size for clarity and readability.

# Scatter plot with regression line for CO2 Emissions vs. GDP
plt.figure(figsize=(12, 6))
sns.regplot(x='GDP', y='CO2 Emissions', data=merged_df, scatter_kws={'s': 50})
plt.title('CO2 Emissions vs. GDP with Regression Line')
plt.xlabel('GDP (Million USD)')
plt.ylabel('CO2 Emissions (metric tons per capita)')
plt.show()

# Scatter plot with regression line for CO2 Emissions vs. Population
plt.figure(figsize=(12, 6))
sns.regplot(x='Population', y='CO2 Emissions', data=merged_df, scatter_kws={'s': 50})
plt.title('CO2 Emissions vs. Population with Regression Line')
plt.xlabel('Population Estimate')
plt.ylabel('CO2 Emissions (metric tons per capita)')
plt.show()
CO2 Emissions vs GDP
CO2 Emissions vs Population

Conclusion

As we conclude our journey through the intricate web of carbon emissions data, it’s crucial to reflect on the profound implications of our findings. The numbers we’ve uncovered are more than just data points; they represent the collective impact of human activity on our planet. From the bustling streets of urban metropolises to the quiet corners of rural villages, every individual, every industry, and every nation plays a role in shaping the trajectory of our environmental future.

The stark reality laid bare by our analysis is that carbon emissions are not just a statistic to be glanced over; they are the silent agents of climate change, altering landscapes, disrupting ecosystems, and threatening the delicate balance of life on Earth. Behind every ton of CO2 emitted lies a story of human ingenuity, progress, and consumption, but also one of environmental degradation, ecological imbalance, and social injustice.

Yet, amidst the daunting challenges that lie ahead, there is also reason for hope. Our exploration has revealed not only the magnitude of the problem but also the potential for positive change. From the rapid growth of renewable energy sources to the emergence of sustainable practices in industries worldwide, there are signs of a global awakening to the urgency of addressing climate change. It’s a reminder that while the road ahead may be fraught with obstacles, it is also paved with opportunities for innovation, collaboration, and collective action.

As individuals, communities, and nations, we have a shared responsibility to take meaningful steps towards reducing our carbon footprint, mitigating the impacts of climate change, and safeguarding the future of our planet for generations to come. Whether it’s advocating for policy changes, adopting sustainable lifestyle choices, or supporting green initiatives in our local communities, each of us has the power to make a difference.

So, let us heed the call to action that resonates from the data before us. Let us harness the knowledge gained from our exploration to inform decisions, inspire change, and shape a more sustainable future. Together, we can rewrite the narrative of carbon emissions from one of destruction to one of resilience, from one of despair to one of hope. The time for action is now, and the opportunity to make a difference is in our hands. Let us rise to the challenge and chart a course towards a brighter, greener tomorrow.

Code

  1. eda_co2_emissions.py
