Using SQL To Understand Deforestation Trends

Patrick Okore
Learning SQL
Published in
14 min readNov 6, 2023
Photo by Lukasz Szmigiel on Unsplash

Introduction

In my journey learning data science, I found that more emphasis was placed on learning Python or R, and less on SQL. However, when I truly discovered SQL, and its advantages in data analysis over python, my mind was blown. Some tasks that would take several lines of code and maneuvering in python, were pretty easy to execute on SQL. I have therefore decided to share my experiences with SQL, to help others see the light .

In this blog we will explore SQL for data analysis using a case study on deforestation, with World Bank data⁴. The queries were written in PostgreSQL, but should be transferrable to other SQL versions with minimal edits. By the end of this analysis, you will have a grasp of SQL for the following:

  • Translating a business problem into an SQL problem
  • Breaking down a complex analysis problem into smaller, less complex parts, and solving the smaller parts to solve the larger problem
  • Using CREATE to create a view
  • Simple SQL queries for filtering (WHERE), aggregation (GROUP BY) and sorting (ORDER BY)
  • Combining separate tables on one or multiple columns using JOIN
  • Writing complex SQL commands using sub queries and common table expressions

In this analysis, we seek to understand deforestration trends from 1990 to 2016 at a global, regional and country level.

We have a database with 3 tables, as outlined in the schema below:

image by author: schema

To work through this project, we will follow the structure outlined below:

  • Data Preparation
  • Global Scenario
  • Regional Snapshot
  • Country Trends
  • Summary & Implications
  • Instructions To Setup Database

For those who want to follow along by running the queries, refer to the last section of the blog: Instructions To Setup Database. Now let’s dig in!

Data Preparation

Photo by Isaac Smith on Unsplash

We will start by checking for missing values — this will enable us to know whether we need to fill any missing values, depending on the extent of missing data. Further investigation revealed less than 6% missing data on either forest area or land area. We will exclude these countries in the analysis in our next step. From the query below, we see 20 countries missing either land or forest area

--check for missing values
SELECT country_name
FROM forest_area
WHERE forest_area_sqkm IS NULL
UNION
SELECT country_name
FROM land_area
WHERE total_area_sq_mi IS NULL;
image by author: 20 countries missing land or forest area

Next, we will create a view called forestation using JOIN clauses on the three tables from our schema. Our challenge here is that we need to use all the information from the 3 tables to conduct our analysis. However, it will be impractical and inefficient to write joins every other time we need to perform an analysis, hence views. A view⁶ allows you to create virtual tables from your queries, to perform complex operations, pull a wide range of data from different tables or select specific data. As we create the view:

  • We will create a new column with the percentage of the forested land area.
-- create view joining all tables and new column with % forest area
CREATE VIEW forestation AS
SELECT fa.country_code AS country_code,
fa.country_name AS country_name,
fa.year AS year,
fa.forest_area_sqkm ,
la.total_area_sq_mi,
-- create new column with % forest area in sq km
-- convert total_area_sq_mi into sq km. 1 sq mi = 2.59 sq km
((fa.forest_area_sqkm / (la.total_area_sq_mi * 2.59)) * 100)
AS fore_pct_km, r.region, r.income_group
FROM forest_area AS fa
INNER JOIN land_area AS la
USING(country_code, year)
-- use inner join since the country TWN in regions but not in la or fa
INNER JOIN regions AS r
ON la.country_code = r.country_code
-- exclude countries with missing values
WHERE forest_area_sqkm IS NOT NULL
AND total_area_sq_mi IS NOT NULL;

Next, we confirm that the filter applied to the view to exclude NULL values works:

--confirm no missing values
SELECT DISTINCT country_name
FROM forestation
WHERE forest_area_sqkm IS NULL
OR total_area_sq_mi IS NULL;
image by author: confirm 0 missing values in the created view

As we have completed data preparation, we will proceed to data analysis.

Global Scenario

Photo by Markus Spiske from Pexels

We will set the context by looking at how the world’s forestation levels has evolved from 1990 to 2016:

  • What was the total world forest area (in km²) 1990?
  • What was the total world forest area (in km²) in 2016?
  • Find the absolute change (in km²) in the world forest area from 1990 to 2016
  • Find the % change (in km²) in the world forest area from 1990 to 2016
  • If you compare the amount of forest area lost between 1990 and 2016, to which country’s total area in 2016 is it closest to?

Total Forest Area (in km²) of the World in 1990

We start off by exploring the contents of our view in 1990, to understand how best to answer the question

SELECT *
FROM forestation
WHERE year = 1990
ORDER BY country_name DESC
LIMIT 5;
image by author: view exploration

From the above, we notice that we have a country_name “World”, which aggregates forest area for all nations in the database. This makes our work easier as all we need to do is to filter by the year 1990, and by the name “World” as shown below:

-- create view to be used
-- in subsequent analyses
CREATE VIEW fa_1990 AS
SELECT country_name,
forest_area_sqkm AS fa_1990_sqkm
FROM forestation
WHERE country_name = 'World'
AND year = 1990;

-- show output of view
SELECT *
FROM fa_1990;
image by author: total forest area of the world in 1990

Total Forest Area (in km²) of the World in 2016

We replicate the above query and simply change the year to 2016

-- create view to be used
-- in subsequent analyses
CREATE VIEW fa_2016 AS
SELECT country_name,
forest_area_sqkm AS fa_2016_sqkm
FROM forestation
WHERE country_name = 'World'
AND year = 2016;

-- show output of view
SELECT *
FROM fa_2016;
image by author: total forest area of the world in 2016

Change in the World’s forest area from 1990 to 2016

To solve this, we will use the 2 views created above for forest area:

  • subtract the output of 1990 from that of 2016 to calculate absolute change
  • divide the output of 2016 by that of 1990 to calculate % change
SELECT fa_1990_sqkm, fa_2016_sqkm,
(fa_2016_sqkm - fa_1990_sqkm) AS change,
ROUND(((fa_2016_sqkm / fa_1990_sqkm)::NUMERIC - 1) * 100, 3)
AS pct_change
FROM fa_1990
JOIN fa_2016
USING (country_name);
image by author: change in global forest area from 1990–2016

We can therefore summarize that the global forest area has declined from 41,282,694.9 km² in 1990 to 39,958,245.9 km² in 2016. This is a decrease of 1,324,449 km² or 3.208%.

If You Compare Forest Area Lost Between 1990 and 2016, To Which Country’s Total Area In 2016 Is It Closest To?

To solve this problem, we need to break it down into smaller parts to understand the steps we need to take:

  • convert loss in forest area to square miles (country total area is in mi²)
  • filter the data by year to only deal with 2016
  • subtract loss in forest area from each country’s total area. This gives us a comparison of loss in forest area vs. equivalent country area. For an exact match, the difference would be zero. Thus we need to look for the country with the smallest difference vs. loss in forest area
  • filter the output by a sensible range to identify any country areas closest 1,324,449 km² (above and/or below)
-- use common table expression for
-- better query design
WITH closest AS(
SELECT country_name,
total_area_sq_mi::NUMERIC,
-- calcualte loss in forest area, as delta_mi
-- convert loss in forest area to square miles
(((SELECT fa_1990_sqkm::NUMERIC FROM fa_1990) -
(SELECT fa_2016_sqkm::NUMERIC FROM fa_2016)) / 2.59) AS delta_mi
FROM forestation
WHERE year = 2016
AND total_area_sq_mi
--identify sensible range and convert to sqmi
BETWEEN 1000000/2.59 AND 1400000/2.59
)

SELECT country_name,
total_area_sq_mi, ROUND(delta_mi, 1),
/* subtract forest area, delta_mi from
each country's total area */
ROUND(delta_mi - total_area_sq_mi, 1) AS diff
FROM closest
ORDER BY diff ASC
LIMIT 3;

As you can see, translating these steps into a query involved a few other maneuvers that I expound on below:

Common Table Expressions (CTE)¹ — help to improve readability as queries increase in length and complexity. A CTE is declared before the main query. It is an excellent tool for organizing information and placing it in a logical order. In the above query it takes the form WITH closest AS () .

Sub queries in FROM clause² allow for more complex workings. They are an effective way of answering detailed questions that requires filtering or transforming data before including it in your final results. The content in the bracket WITH closest AS () is a sub query in a FROM clause.

Sub queries in SELECT³ help to bring summary values into a detailed dataset — useful since you can’t return aggregate values in un-grouped SQL queries. They help in performing complex mathematical equations on your dataset — e.g. by how much does the global loss in forest area differ from each country’s total area?

If you compare the forest area lost between 1990 and 2016, it is closest to Peru’s total area in 2016

Regional Snapshot

Photo by Bernd 📷 Dittrich on Unsplash

We will now move one level down, to get a regional snapshot:

  • Find the proportion (in %) of forested area of the world in 2016. Find the regions with the highest and lowest percentage of forested area in 2016
  • Find the proportion (in %) of forested area of the world in 1990. Find the regions with the highest and lowest percentage of forested area in 1990
  • Identify regions of the world that declined in forest area from 1990 to 2016

Highest & Lowest Regions by % Forested Area in 1990 & 2016

When we created the forestation view, we also created a column fore_pct_km, which shows % forested area for each country, as shown below:

-- explore 2016 data
SELECT country_name,
region,
ROUND(fore_pct_km::NUMERIC, 1) fore_pct_km
FROM forestation
WHERE year = 2016
LIMIT 5;
image by author: data exploration

From the output we see that we need to:

  • aggregate all percentages for each region, for 1990 and 2016
  • to aggregate the percentages, we will use AVG()
  • filter for 1990 or 2016
  • GROUP BY region
  • JOIN the views created for each year to display both years in one table
CREATE VIEW pct_2016 AS
SELECT region,
-- aggregate all percentages
ROUND(AVG(fore_pct_km)::NUMERIC,2) AS reg_pct_16
FROM forestation
WHERE year = 2016
-- show percentages for each region
GROUP BY region;

CREATE VIEW pct_1990 AS
SELECT region,
-- aggregate all percentages
ROUND(AVG(fore_pct_km)::NUMERIC,2) AS reg_pct_90
FROM forestation
WHERE year = 1990
-- show percentages for each region
GROUP BY region;

-- show both outputs in 1 table by joining the views
SELECT region, reg_pct_16,
reg_pct_90
FROM pct_2016
INNER JOIN pct_1990
USING(region)
ORDER BY 2 DESC;
image by author: highest & lowest regions by % forested area in 1990 & 2016

From the output above, we note that East Asia & Pacific has the highest % of forested area (47% in 1990 & 50% in 2016), while Middle East & North Africa has the lowest (2.7% in 1990 & 3.2% in 2016).

Regions That Declined in Forest Area From 1990 To 2016

Using the 2 views created in the previous section, we create a column change, which shows whether % forested area has increased or decreased from 1990 to 2016. This is done using a sub query in the FROM clause, enclosed in a common table expression below:

WITH pct_changes AS (
SELECT region, reg_pct_90, reg_pct_16,
(reg_pct_16 - reg_pct_90) AS change
FROM pct_2016
INNER JOIN pct_1990
USING(region)
)

SELECT region,
reg_pct_90,
reg_pct_16,
change
FROM pct_changes
ORDER BY change;
image by author: regions that declined in % forest area from 1990 to 2016

The only regions of the world that decreased in percent forest area from 1990 to 2016 were, Sub-Saharan Africa (dropped from 35.26% to 31.28%) and Latin America & Caribbean (43.34% to 41.64%). All other regions actually increased in forest area over this time period. However, the drop in forest area in the two aforementioned regions was so large, the percent forest area of the world decreased over this time period from 32.42% to 31.38%.

Country Trends

Photo by Ricardo Gomez Angel on Unsplash

Lastly we will deep dive into the country trends below:

  • Find the 5 countries with the highest absolute decline in forest area from 1990 to 2016
  • Find the 5 countries with the highest % decline in forest area from 1990 to 2016
  • Find all of the countries with % forested area greater than 75% in 2016
  • Find the top 5 countries with the lowest forest cover in %

Five Countries With Largest Absolute Drop in Forest Area From 1990–2016

Which countries are seeing deforestation to the largest degree? We can answer this question in two ways. First, we will look at the absolute square kilometer decrease in forest area from 1990 to 2016. To tackle this problem:

  • we need to create 2 tables, one with 1990 forest area, and one with 2016
  • subtract 1990 forest area from 2016 forest area for each country
  • order the result by ascending order of difference
  • the first 5 countries will be those with the largest drop
WITH fa_1990 AS (-- create 1990 table
SELECT country_name, region,
ROUND(forest_area_sqkm::NUMERIC,2) AS fa_90_sqkm
FROM forestation
WHERE year = 1990 AND forest_area_sqkm IS NOT NULL
),
-- create 2016 table
fa_2016 AS (
SELECT country_name, region,
ROUND(forest_area_sqkm::NUMERIC,2) AS fa_16_sqkm
FROM forestation
WHERE year = 2016 AND forest_area_sqkm IS NOT NULL
)

SELECT country_name,
fa_1990.region,
fa_16_sqkm,
fa_90_sqkm,
-- subtract 1990 from 2016 forest area
(fa_16_sqkm - fa_90_sqkm) AS change
FROM fa_1990
INNER JOIN fa_2016
USING(country_name)
WHERE country_name != 'World'
ORDER BY change ASC
-- top 5 countries with largest drop
LIMIT 5;
image by author: top 5 countries with largest drop in absolute forest area from 1990–2016

From the above table, we see that the top 5 countries with the largest drop in absolute forest area come from 3 regions:

  • Brazil (541,510 km²) from Latin America & Caribbean
  • Indonesia (282,194 km²) & Myanmar (107,234 km²) from East Asia & Pacific
  • Nigeria (106,506 km²) & Tanzania (102,320) from Sub-Saharan Africa

Five Countries With Largest % Drop in Forest Area From 1990–2016

Which countries are seeing deforestation to the largest degree? We can answer this question in two ways. The second way, is to look at the % square kilometer decrease in forest area from 1990 to 2016. We use the same approach as above, and just substitute from the column with absolute values, to that with % values

WITH fa_1990_pct AS (
SELECT country_name, region,
ROUND(fore_pct_km::NUMERIC,2) AS fa_90_pct
FROM forestation
WHERE year = 1990 AND fore_pct_km IS NOT NULL
),
fa_2016_pct AS (
SELECT country_name, region,
ROUND(fore_pct_km::NUMERIC,2) AS fa_16_pct
FROM forestation
WHERE year = 2016 AND fore_pct_km IS NOT NULL
)

SELECT fa_1990_pct.country_name, fa_1990_pct.region, fa_16_pct,
fa_90_pct, (fa_16_pct - fa_90_pct) AS change
FROM fa_1990_pct
INNER JOIN fa_2016_pct
USING(country_name)
WHERE country_name != 'World'
ORDER BY change ASC
LIMIT 5;
image by author: top 5 countries with largest drop in % forest area from 1990–2016

The countries with the largest percentage drop are different from those with largest absolute drop.

Find All Of The Countries With % Forested Area Greater Than 75% In 2016

SELECT country_name, region, 
ROUND(fore_pct_km::NUMERIC, 2) fore_pct_km
FROM forestation
WHERE year = 2016 AND fore_pct_km > 75
ORDER BY fore_pct_km DESC;
image by author: countries with highest forest %

Top 5 Countries With Lowest Forest Cover In %

SELECT country_name, region, 
ROUND(fore_pct_km::NUMERIC, 2) fore_pct_km
FROM forestation
WHERE year = 2016
ORDER BY fore_pct_km
LIMIT 5;
image by author: top 5 countries with the lowest forest cover in 2016

The 5 countries with the lowest forest cover are in 2 regions, namely Europe & Central Asia and Middle East & North Africa.

Summary

Photo by UX Indonesia on Unsplash
  • The world lost 1,324,449 km² (3.208%) of its forest area from 1990 to 2016. This is equivalent to the land area of Peru
  • There is some disparity in forest cover across regions. East Asia & Pacific has the highest % forest cover (50% in 2016), while Middle East & North Africa has the lowest (3% in 2016)
  • Latin America & Caribbean and Sub-Saharan Africa experienced decreases in % forest cover from 1990 to 2016. This was significant enough to negatively impact the global forestation percentage.
  • Brazil, Indonesia, Myanmar, Nigeria & Tanzania have seen the largest absolute square kilometer decreases in forest area.
  • Honduras, Korea (Dem. People’s Rep.), Zimbabwe, Cambodia, and Timor-Leste have experienced the largest percent decreases in forest area.
  • Suriname, Micronesia, Gabon, Seychelles, Palau, American Samoa, Guyana, Lao PDR, and Solomon Islands, are among the countries with highest % in forest cover in 2016

Implications

  • Target awareness campaigns and conservation efforts in regions experiencing the largest decreases in forest area, such as Brazil and Indonesia.
  • Collaborate with countries experiencing severe deforestation, like Honduras and Zimbabwe, to identify strategies for conservation.
  • Support programs and initiatives that have successfully increased forest area. Promote policies and practices to maintain high levels of forestation in countries like Suriname, Micronesia, and Gabon.
  • Spearhead efforts to conserve and rehabilitate forests in regions with the greatest potential, such as East Asia & Pacific.
  • Advocate for sustainable forestry practices and afforestation projects

Instructions To Setup Database

This section is for anyone who wants to follow along by running the queries on their machine. As they were written using PostgreSQL, those using other versions of SQL need to make necessary adjustments. Go here to download the csv data from my GitHub account. The queries below should set up the relevant tables and import the required data. Make sure to change the file paths in the FROM clauses below. You might want to save the downloaded data in the program folder where you installed your SQL software. Initially you might get a “Permission Denied” error, if you save it in a folder that your SQL has no read access. I found this to be the easiest solution.

-- Create the land_area table
CREATE TABLE "land_area" (
"country_code" VARCHAR(3),
"country_name" VARCHAR(255),
"year" INT,
"total_area_sq_mi" NUMERIC
);

-- Import data from land_area.csv into the land_area table
COPY land_area
FROM 'C:\Program Files\PostgreSQL\16\data\forestation_data\la.csv'
DELIMITER ',' CSV HEADER;

-- Create the forest_area table
CREATE TABLE forest_area (
country_code VARCHAR(3),
country_name VARCHAR(255),
year INT,
forest_area_sqkm NUMERIC
);

-- Import data from forest_area.csv into the forest_area table
COPY forest_area
FROM 'C:\Program Files\PostgreSQL\16\data\forestation_data\fa.csv'
DELIMITER ',' CSV HEADER;

-- Create the regions table
CREATE TABLE regions (
country_name VARCHAR(255),
country_code VARCHAR(3),
region VARCHAR(255),
income_group VARCHAR(255)
);

-- Import data from regions.csv into the regions table
COPY regions
FROM 'C:\Program Files\PostgreSQL\16\data\forestation_data\r.csv'
DELIMITER ',' CSV HEADER;

If this project piqued your interest in exploring other SQL projects, check out:

References

  1. Mona Khalil, Data Manipulation in SQL — Common Table Expressions, datacamp
  2. Mona Khalil, Data Manipulation in SQL — Sub Queries in FROM, datacamp
  3. Mona Khalil, Data Manipulation in SQL — Sub Queries in SELECT, datacamp
  4. world bank data
  5. world bank data
  6. Avinash Navlani, Views in SQL, datacamp

Go from SELECT * to interview-worthy project. Get our free 5-page guide.

--

--

Patrick Okore
Learning SQL

A data storyteller and insights generator with a passion for understanding and unpacking consumer behaviour