An Exploratory Data Analysis of Carbon Emissions
Hey folks, ever wondered what’s really going on with carbon emissions worldwide?
Introduction
Well, buckle up because we’re about to dive into the nitty-gritty of it all. Today, we’re kicking off our journey into the fascinating world of exploratory data analysis (EDA) of carbon emissions across the globe.
Now, before you start yawning, let me tell you why this matters. Carbon emissions are like the pesky villains in our environmental story, wreaking havoc on our planet’s health. They come from all sorts of sources, from factories to cars to cows (yep, even cows contribute!). And understanding where they’re coming from, how much there is, and where they’re headed is crucial if we want to tackle climate change head-on.
So, in this series of analyses, we’re going to roll up our sleeves and dig deep into the data. We’ll be uncovering trends, spotting patterns, and maybe even uncovering a few surprises along the way. Because knowledge is power, my friends, and when it comes to saving our planet, we could all use a little more of it. So, let’s get cracking and see what the numbers have to say about our carbon conundrum!
Let us dig into the code and conduct the data analysis.
Code Breakdown
In the following code, we will analyze and visualize CO2 emissions data by country, exploring trends over time and identifying the major contributors to CO2 emissions.
Import Libraries
Here, we’re importing the necessary libraries for our analysis:
- pandas for data manipulation and analysis.
- zipfile for handling ZIP files.
- requests for making HTTP requests to download data from the web.
- matplotlib.pyplot and seaborn for data visualization.
import pandas as pd
import zipfile
import requests
import matplotlib.pyplot as plt
import seaborn as sns
Download the dataset
In this process, we will utilize the requests library to download the dataset and the zipfile library to extract its contents. Additionally, the os library will be employed to remove the ZIP file post-extraction. Finally, the pandas library will be applied to read the dataset's contents from the CSV file and store them in a dataframe.
# Download the ZIP file
url = "https://api.worldbank.org/v2/en/indicator/EN.ATM.CO2E.PC?start=2000&end=2023&downloadformat=csv"
response = requests.get(url)
# Save the ZIP file to a temporary file
with open("co2_emissions.zip", "wb") as zip_file:
zip_file.write(response.content)
# Unzip the file
with zipfile.ZipFile("co2_emissions.zip", "r") as zip_ref:
zip_ref.extractall()
# Delete the ZIP file
import os
os.remove("co2_emissions.zip")
# Load the CSV data into a DataFrame
df = pd.read_csv("API_EN.ATM.CO2E.PC_DS2_en_csv_v2_191.csv", skiprows=4)
Explore the first 10 rows of the dataframe
Let us check what our dataframe looks like. We can see that our dataframe is wide, not long: the dataset has all the years as columns, as opposed to having them under a single year column. We also see certain NaNs under the year columns that we will tackle in the preprocessing phase.
df.head(10)
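To quantify those NaNs before preprocessing, a quick per-column null count helps. Here is a minimal sketch on a toy wide-format frame whose column names merely mimic the World Bank layout (the values are made up); on the real data, `df.isna().sum()` gives the same per-column picture.

```python
import pandas as pd

# Toy wide-format frame mimicking the World Bank CSV layout (values are illustrative)
toy = pd.DataFrame({
    "Country Name": ["Aruba", "India"],
    "2000": [25.0, 0.98],
    "2001": [None, 1.01],  # a missing value, as in the real data
})

# Count missing values per column
print(toy.isna().sum())
```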
Data Preprocessing
As mentioned above, the dataset we have is in a wide format, meaning that the years are in columns as opposed to rows. In order to work with our dataset properly, we need to reshape all these years under one column named Year while retaining the numerical values of CO2 Emissions (metric tons per capita).
In order to achieve the desired result, we thankfully have a melt function in the pandas library to reshape a DataFrame (df). The melt function is commonly used to transform wide-form data into long-form data.
- id_vars: These are the columns that we want to keep as-is and not melt. In this case, columns like Country Name, Country Code, Indicator Name, and Indicator Code will remain as they are.
- var_name: This parameter specifies the name of the new column that will store the variable names (in this case, Year). The variable names are the columns that we want to melt into a single column.
- value_name: This parameter specifies the name of the new column that will store the values corresponding to the variable names. In this case, it's CO2 Emissions (metric tons per capita).
So, the resulting melted_df dataframe will have a structure where the columns Country Name, Country Code, Indicator Name, and Indicator Code remain unchanged, and the columns that represent different years (previously in wide format) are melted into two columns: Year and CO2 Emissions (metric tons per capita).
This reshaping is often useful when we want to perform analyses or visualizations that are easier to handle with long-form data, especially when dealing with time series data.
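To make the reshaping concrete, here is a toy melt example on a two-country, two-year frame (the country names and values are illustrative, not from the real dataset):

```python
import pandas as pd

# Wide format: one column per year
wide = pd.DataFrame({
    "Country Name": ["Aruba", "India"],
    "2000": [25.0, 0.98],
    "2001": [24.8, 1.01],
})

# Long format: one row per (country, year) pair
long = wide.melt(
    id_vars=["Country Name"],
    var_name="Year",
    value_name="CO2 Emissions (metric tons per capita)",
)
print(long)
```

Each (country, year) pair now occupies its own row, so two wide rows become four long rows.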
Later we also use the dropna function of the pandas library to remove null values from our newly created CO2 Emissions (metric tons per capita) column. The subset parameter specifies the column(s) to consider for missing values, and inplace=True means that the changes are applied directly to the original DataFrame (melted_df) without the need to create a new DataFrame.
The following code snippet removes rows from melted_df where the CO2 Emissions (metric tons per capita) column has missing values. It's a common practice to handle missing data before performing further analysis or visualization to ensure accurate and meaningful results.
# Melt the DataFrame to a format suitable for analysis
melted_df = df.melt(id_vars=["Country Name", "Country Code", "Indicator Name", "Indicator Code"], var_name="Year", value_name="CO2 Emissions (metric tons per capita)")
# Handle missing values (replace with appropriate methods based on your analysis)
melted_df.dropna(subset=["CO2 Emissions (metric tons per capita)"], inplace=True)
# Explore basic information (optional)
print("Number of rows:", melted_df.shape[0])
print("Number of columns:", melted_df.shape[1])
print("List of columns:", melted_df.columns.tolist())
print("Data types of each column:", melted_df.dtypes)
Check unique values of years
Let us check whether we have the data in the right format. As you can see, we have all the years in the Year column and all the numeric values corresponding to those years in the CO2 Emissions (metric tons per capita) column.
# Print the unique values of years
print(melted_df['Year'].unique())
print(melted_df['Indicator Name'].unique())
print(melted_df['CO2 Emissions (metric tons per capita)'].unique())
Now our dataframe is ready for analysis.
Explore melted_df
Let us finally check our entire melted dataframe. As you can see, we now have all the years under the Year column and the relevant CO2 emissions under the CO2 Emissions (metric tons per capita) column.
melted_df.head(10)
Trends for India
Let us visualize how India has contributed to CO2 emissions over the years. You can see a strong dip since 2017, owing to major policy decisions taken by the government. Initiatives such as the National Action Plan on Climate Change (NAPCC), the International Solar Alliance (ISA), renewable energy targets, the UJALA Scheme (Unnat Jyoti by Affordable LEDs for All), the FAME India Scheme (Faster Adoption and Manufacturing of Hybrid and Electric Vehicles), green building initiatives, an enhanced focus on public transport, and afforestation and reforestation have contributed to mitigating the effects of greenhouse gas emissions.
# Select the column containing CO2 emissions (assuming it's named "CO2 Emissions (metric tons per capita)")
co2_emissions_col = "CO2 Emissions (metric tons per capita)"
# Filter data for the desired country
country_name = "India" # Replace with the desired country
country_data = melted_df[melted_df["Country Name"] == country_name]
# Plot CO2 emissions over time for the specific country
country_data.plot(x="Year", y=co2_emissions_col, marker="o")
plt.title(f"CO2 Emissions in {country_name} over Time")
plt.xlabel("Year")
plt.ylabel(co2_emissions_col) # Add label for the y-axis
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
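To quantify the dip rather than eyeball it on the plot, one could compute year-over-year percentage change. A minimal sketch using a synthetic series (the numbers below are made up; the real `country_data` values from above would drop in the same way):

```python
import pandas as pd

# Hypothetical per-capita emissions for a single country, one value per year
emissions = pd.Series(
    [1.6, 1.7, 1.8, 1.7, 1.5],
    index=[2015, 2016, 2017, 2018, 2019],
    name="CO2 Emissions (metric tons per capita)",
)

# Year-over-year percentage change; negative values mark declining years
yoy = emissions.pct_change() * 100
print(yoy.round(1))
```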
Trends for countries over years
We can also explore trends for other countries over the years and compare them against each other.
# Explore trends over time and identify major contributors
plt.figure(figsize=(20, 8))
sns.lineplot(x='Year', y=co2_emissions_col, hue='Country Name', data=melted_df)
plt.title('CO2 Emissions Trends Over Time')
plt.legend().set_visible(False)
plt.show()
As the above chart is a bit cluttered, we can instead filter for the major contributors and show only those on the graph.
# Calculate total emissions for each country
total_emissions = melted_df.groupby('Country Name')[co2_emissions_col].sum()
# Identify major contributors (e.g., top 10)
major_contributors = total_emissions.nlargest(10).index.tolist()
# Filter the data for major contributors
filtered_df = melted_df[melted_df['Country Name'].isin(major_contributors)]
# Plot the data with legends for major contributors only
plt.figure(figsize=(20, 8))
sns.lineplot(x='Year', y=co2_emissions_col, hue='Country Name', data=filtered_df)
plt.title('CO2 Emissions Trends Over Time')
plt.legend()
plt.show()
Descriptive statistics by time period
The following code uses the groupby function in pandas to group the data in the DataFrame melted_df by the Year column. After grouping, it applies the describe function to calculate various descriptive statistics for each group. Specifically, it calculates the count, mean, standard deviation, minimum, 25th percentile (Q1), median (50th percentile or Q2), 75th percentile (Q3), and maximum for the CO2 Emissions (metric tons per capita) column within each year.
# Calculate descriptive statistics by time period (As 'Year' is the time period column)
print("Descriptive statistics by time period:")
print(melted_df.groupby('Year')['CO2 Emissions (metric tons per capita)'].describe())
Descriptive statistics by country
Let us do the same for country.
# Calculate descriptive statistics by country
print("\nDescriptive statistics by Country:")
print(melted_df.groupby('Country Name')['CO2 Emissions (metric tons per capita)'].describe())
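Beyond full describe tables, ranking countries by their mean per-capita emissions gives a quick leaderboard. A sketch on synthetic data (the country labels and values are made up; the same groupby chain would apply to melted_df):

```python
import pandas as pd

# Toy long-format frame with two observations per country
toy = pd.DataFrame({
    "Country Name": ["A", "A", "B", "B", "C", "C"],
    "CO2 Emissions (metric tons per capita)": [10.0, 12.0, 3.0, 4.0, 7.0, 8.0],
})

# Mean per-capita emissions per country, largest first
ranking = (
    toy.groupby("Country Name")["CO2 Emissions (metric tons per capita)"]
    .mean()
    .nlargest(3)
)
print(ranking)
```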
Interactive Maps
We will now interact with our data in a geospatial way. Our code sets up a Python environment for visualizing geospatial data using the geopandas (handling geospatial data), pandas (data manipulation and analysis), and folium (creating interactive maps) libraries in a Jupyter notebook or IPython environment. Additionally, it imports the MarkerCluster plugin from folium (which displays markers on a map efficiently, especially when there are a large number of markers in a small area) and the display function from IPython.display (which renders and displays visualizations in Jupyter notebooks).
But before doing so, let us download the countries' geospatial data from the Natural Earth website using the wget command.
!wget 'https://naciscdn.org/naturalearth/110m/cultural/ne_110m_admin_0_countries.zip'
Display on the map
The following code uses the geopandas, pandas, and folium libraries to create an interactive map using the data related to CO2 emissions. Here is a breakdown of the code:
Importing Libraries
- geopandas: For handling geospatial data.
- pandas: For data manipulation and analysis.
- folium: For creating interactive maps.
- MarkerCluster: A plugin from folium for clustering markers on the map.
- IPython.display: For displaying the map in the Colab notebook.
Summing Total Emissions per Region
The code groups the data in melted_df by Country Name and sums the CO2 Emissions (metric tons per capita) for each country. The result is a DataFrame named region_emissions.
Loading World Map Data
It loads the Natural Earth '110m' world map dataset from a local file path (/content/ne_110m_admin_0_countries.zip) using geopandas.read_file().
Merging World Map with CO2 Emissions Data
The code merges the world map data (world) with the region_emissions DataFrame based on the country names (the ADMIN and Country Name columns).
Filtering Rows
It filters out rows with a total CO2 emission count of 0.
Creating a Folium Map
It creates a Folium map centered at the mean latitude and longitude of the world map geometries.
Adding GeoJSON Data to the Map
It adds GeoJSON data (world map with CO2 emission data) to the Folium map, including a tooltip with information about each country.
Adding Marker Clusters
It adds marker clusters to the map using MarkerCluster(). Each marker represents a country, and markers are clustered to improve map readability.
Adding Markers with Popup Annotations
It iterates over the rows of the world DataFrame and adds markers with popups containing the country name (ADMIN) and the total CO2 emissions.
Displaying the Map
Finally, it displays the interactive map in the Colab notebook using IPython.display.display(m).
import geopandas as gpd
import pandas as pd
import folium
from folium.plugins import MarkerCluster
from IPython.display import display
# Sum the total emissions per region/country
region_emissions = melted_df.groupby('Country Name')['CO2 Emissions (metric tons per capita)'].sum().reset_index().round(2)
region_emissions.columns = ['Country Name', 'sum']
# Load the '110m' cultural vectors dataset from a local file path
world = gpd.read_file('/content/ne_110m_admin_0_countries.zip')
# Merge the world map with the summed CO2 emissions based on country names
world = world.merge(region_emissions, left_on='ADMIN', right_on='Country Name', how='left')
# Filter out rows with count 0
world = world[world['sum'] > 0]
# Create a Folium Map centered at the mean of latitude and longitude
m = folium.Map(location=[world.geometry.centroid.y.mean(), world.geometry.centroid.x.mean()], zoom_start=2)
# Add GeoJSON data to the map
folium.GeoJson(world, name='geojson', tooltip=folium.features.GeoJsonTooltip(fields=['ADMIN', 'sum'], aliases=['Country', 'sum'])).add_to(m)
# Add sum values as text annotations using MarkerCluster
marker_cluster = MarkerCluster().add_to(m)
for idx, row in world.iterrows():
folium.Marker(
location=[row.geometry.centroid.y, row.geometry.centroid.x],
popup=f"{row['ADMIN']}: {row['sum']:.0f}",
).add_to(marker_cluster)
# Display the map in the Colab notebook
display(m)
Display on the map between specific years
This section of code is designed to visualize and analyze CO2 emissions data over a specified time frame. Initially, the data is filtered to include only the years between 2000 and 2020. Then, the total CO2 emissions per country are calculated and aggregated. Next, a world map dataset is loaded and merged with the aggregated CO2 emissions data based on country names. The merged dataset is further filtered to remove countries with zero emissions.
A Folium map is then created, centered at the mean latitude and longitude of the countries in the dataset. The map is enriched with GeoJSON data representing country borders and CO2 emissions information, displayed as tooltips. Additionally, text annotations showing the country names and total emissions are added using MarkerCluster, which clusters markers to improve visualization.
Finally, the map is displayed in the Colab notebook to provide a visual representation of CO2 emissions across different countries over the specified time frame.
import geopandas as gpd
import pandas as pd
import folium
from folium.plugins import MarkerCluster
from IPython.display import display
# Define the starting and ending year of the desired time frame
start_year = 2000
end_year = 2020
# Convert the 'Year' column to integers before filtering
melted_df['Year'] = pd.to_numeric(melted_df['Year'], errors='coerce')
# Filter the DataFrame for the specified time frame
filtered_df = melted_df[(melted_df['Year'] >= start_year) & (melted_df['Year'] <= end_year)]
# Calculate the sum of CO2 emissions per region/country
region_emissions = (filtered_df.groupby('Country Name')['CO2 Emissions (metric tons per capita)']
.sum()
.reset_index()
.round(2)
)
region_emissions.columns = ['Country Name', 'sum']
# Load the '110m' cultural vectors dataset from a local file path
world = gpd.read_file('/content/ne_110m_admin_0_countries.zip')
# Merge the world map with the summed CO2 emissions based on country names
world = world.merge(region_emissions, left_on='ADMIN', right_on='Country Name', how='left')
# Filter out rows with count 0
world = world[world['sum'] > 0]
# Create a Folium Map centered at the mean of latitude and longitude
m = folium.Map(location=[world.geometry.centroid.y.mean(), world.geometry.centroid.x.mean()], zoom_start=2)
# Add GeoJSON data to the map
folium.GeoJson(world, name='geojson', tooltip=folium.features.GeoJsonTooltip(fields=['ADMIN', 'sum'], aliases=['Country', 'sum'])).add_to(m)
# Add sum values as text annotations using MarkerCluster
marker_cluster = MarkerCluster().add_to(m)
for idx, row in world.iterrows():
folium.Marker(
location=[row.geometry.centroid.y, row.geometry.centroid.x],
popup=f"{row['ADMIN']}: {row['sum']:.0f}",
).add_to(marker_cluster)
# Display the map in the Colab notebook
display(m)
Correlation Coefficients
This code section performs correlation analysis between CO2 emissions, GDP, and population for different countries.
- It selects the relevant columns from the melted DataFrame (co2_df) and separate DataFrames for GDP (gdp_df) and population (population_df).
- Then, it merges these DataFrames based on the Country Name column, creating a new DataFrame (merged_df).
- The column names in the merged DataFrame are renamed for clarity.
- Next, it calculates correlation coefficients between CO2 emissions, GDP, and population using the Spearman method.
- Finally, it visualizes the correlation matrix as a heatmap using seaborn, annotating the correlation values.
co2_df = melted_df[["Country Name",'CO2 Emissions (metric tons per capita)']]
gdp_df = world[['Country Name','GDP_MD']]
population_df = world[['Country Name','POP_EST']]
# Merge DataFrames based on a common identifier (e.g., 'Country Name' or 'Country Code')
merged_df = co2_df.merge(gdp_df, on='Country Name') # Replace with the appropriate merge key
merged_df = merged_df.merge(population_df, on='Country Name')
# Rename columns in the merged DataFrame
merged_df = merged_df.rename(columns={
'CO2 Emissions (metric tons per capita)': 'CO2 Emissions',
'GDP_MD': 'GDP',
'POP_EST': 'Population'
})
# Calculate correlation coefficients
correlations = merged_df[['CO2 Emissions', 'GDP', 'Population']].corr(method='spearman')
# Plotting the correlation matrix as a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlations, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix')
plt.show()
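Why Spearman rather than the default Pearson? Spearman works on ranks, so it captures any monotonic relationship even when it is nonlinear, which is a common shape for GDP and population versus emissions. A toy illustration (the values below are made up purely to show the contrast):

```python
import pandas as pd

# A perfectly monotonic but nonlinear relationship
df_demo = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
df_demo["y"] = df_demo["x"] ** 3

pearson = df_demo["x"].corr(df_demo["y"], method="pearson")
spearman = df_demo["x"].corr(df_demo["y"], method="spearman")
print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

Spearman comes out at exactly 1.0 here while Pearson falls short, because only the ranks line up perfectly, not the raw values.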
Scatter Plot with Regression Line
The code generates two scatter plots with regression lines depicting the relationship between CO2 emissions and GDP, and between CO2 emissions and population. The x-axis represents GDP (in million USD) or the population estimate, while the y-axis represents CO2 emissions (in metric tons per capita). Each point on a scatter plot represents a country's GDP or population and its corresponding CO2 emissions. The regression line indicates the overall trend in the data, showing how changes in GDP or population relate to CO2 emissions. The plots are enhanced with titles, axis labels, and a specified figure size for clarity and readability.
# Scatter plot with regression line for CO2 Emissions vs. GDP
plt.figure(figsize=(12, 6))
sns.regplot(x='GDP', y='CO2 Emissions', data=merged_df, scatter_kws={'s': 50})
plt.title('CO2 Emissions vs. GDP with Regression Line')
plt.xlabel('GDP (Million USD)')
plt.ylabel('CO2 Emissions (metric tons per capita)')
plt.show()
# Scatter plot with regression line for CO2 Emissions vs. Population
plt.figure(figsize=(12, 6))
sns.regplot(x='Population', y='CO2 Emissions', data=merged_df, scatter_kws={'s': 50})
plt.title('CO2 Emissions vs. Population with Regression Line')
plt.xlabel('Population Estimate')
plt.ylabel('CO2 Emissions (metric tons per capita)')
plt.show()
Conclusion
As we conclude our journey through the intricate web of carbon emissions data, it’s crucial to reflect on the profound implications of our findings. The numbers we’ve uncovered are more than just data points; they represent the collective impact of human activity on our planet. From the bustling streets of urban metropolises to the quiet corners of rural villages, every individual, every industry, and every nation plays a role in shaping the trajectory of our environmental future.
The stark reality laid bare by our analysis is that carbon emissions are not just a statistic to be glanced over; they are the silent agents of climate change, altering landscapes, disrupting ecosystems, and threatening the delicate balance of life on Earth. Behind every ton of CO2 emitted lies a story of human ingenuity, progress, and consumption, but also one of environmental degradation, ecological imbalance, and social injustice.
Yet, amidst the daunting challenges that lie ahead, there is also reason for hope. Our exploration has revealed not only the magnitude of the problem but also the potential for positive change. From the rapid growth of renewable energy sources to the emergence of sustainable practices in industries worldwide, there are signs of a global awakening to the urgency of addressing climate change. It’s a reminder that while the road ahead may be fraught with obstacles, it is also paved with opportunities for innovation, collaboration, and collective action.
As individuals, communities, and nations, we have a shared responsibility to take meaningful steps towards reducing our carbon footprint, mitigating the impacts of climate change, and safeguarding the future of our planet for generations to come. Whether it’s advocating for policy changes, adopting sustainable lifestyle choices, or supporting green initiatives in our local communities, each of us has the power to make a difference.
So, let us heed the call to action that resonates from the data before us. Let us harness the knowledge gained from our exploration to inform decisions, inspire change, and shape a more sustainable future. Together, we can rewrite the narrative of carbon emissions from one of destruction to one of resilience, from one of despair to one of hope. The time for action is now, and the opportunity to make a difference is in our hands. Let us rise to the challenge and chart a course towards a brighter, greener tomorrow.
Code