Data Analysis Portfolio Project using SQL and Tableau: NYC Tree Census and Income project

Elizaveta Chvorda
10 min readFeb 29, 2024

--

Introduction

In this data analysis portfolio project I delve into the fascinating intersection of urban forestry and socioeconomic factors in New York City. I explore data extracted from the NYC Tree Census conducted in 2015, along with income statistics across different zip codes in the city.

Step 1. Asking Questions

Research question: I there relationship between tees and income in NYC?

In this project, I aim to resolve the following questions:

  • How many trees are in metro NYC?
  • How many trees are in each borough?
  • What is the income of each borough?
  • What is the income across different zip codes?
  • What is the tree concentration across different zip codes?
  • Does tree health vary across rich vs pood zip codes?
  • Do certain boroughs have more living trees?
  • Do certain borough have newer/younger trees?

Step 2. Data Cleaning and Exploration in SQL

Trees Table

The ‘trees’ table offers a comprehensive look at street trees in Metro NY, providing insights into various aspects of tree health, distribution, and location:

  1. tree_id: Unique identifier for each tree.
  2. tree_dbh: Diameter at breast height of the tree.
  3. stump_diam: Diameter of the stump.
  4. status: Indicates whether the tree is alive, standing dead, or a stump.
  5. health: Indicates the user’s perception of tree health (Good, Fair, Poor).
  6. borocode: Borough code (1 for Manhattan, 2 for Bronx, 3 for Brooklyn, 4 for Queens, 5 for Staten Island).
  7. boroname: Name of the borough in which the tree point is located.
  8. nta_name: Neighborhood Tabulation Area name.
  9. state: State of the tree location.
  10. latitude: Latitude coordinate of the tree location.
  11. longitude: Longitude coordinate of the tree location.

Income Trees Table

The ‘income_trees’ table includes statistical information about income for each zipcode where trees are located, including:

  1. zipcode: Zipcode of the tree location.
  2. Estimate_Total: Total income estimate.
  3. Margin_of_Error_Total: Margin of error for the total income estimate.
  4. Estimate_Median_income: Median income estimate.
  5. Margin_of_Error_Median_income: Margin of error for the median income estimate.
  6. Estimate_Mean_income: Mean income estimate.
  7. Margin_of_Error_Mean_income: Margin of error for the mean income estimate.

I imported the datasets into a PostgreSQL database and performed initial data exploration. This included examining the first 10 rows of each table, understanding the column names and data types to gain insights into the data structure and content.

-- Creates the 'trees' table
CREATE TABLE trees (
index INTEGER PRIMARY KEY,
tree_id INTEGER,
tree_dbh INTEGER,
stump_diam INTEGER,
status VARCHAR,
health VARCHAR,
spc_latin VARCHAR,
spc_common VARCHAR,
address VARCHAR,
zipcode INTEGER,
borocode INTEGER,
boroname VARCHAR,
nta_name VARCHAR,
state VARCHAR,
latitude FLOAT,
longitude FLOAT
);

The table contains 683788 rows.

Then I checked the table for duplicayed rows.

-- Counts the number of duplicate tree IDs
SELECT tree_id, COUNT(*)
FROM trees
GROUP BY tree_id
HAVING COUNT(*) > 1;

No duplicates in the data.

— Retrieves the first 10 rows from the ‘trees’ table
SELECT *
FROM trees
LIMIT 10;

-- Creates the 'income_trees' table
CREATE TABLE income_trees (
zipcode INTEGER,
Estimate_Total INTEGER,
Margin_of_Error_Total INTEGER,
Estimate_Median_income INTEGER,
Margin_of_Error_Median_income INTEGER,
Estimate_Mean_income INTEGER,
Margin_of_Error_Mean_income INTEGER
);

218 row in the table.

-- Counts the number of duplicate zipcodes
SELECT zipcode, COUNT(*)
FROM income_trees
GROUP BY zipcode
HAVING COUNT(*) > 1;

No duplicated zipcodes in the table.

-- Retrieves the first 10 rows from the 'income_trees' table
SELECT *
FROM income_trees
LIMIT 10;

Step 3. Data Analysis using SQL and Data Visualization in Tableau

1. How many trees are in metro NYC?

-- Counts the total number of trees
SELECT COUNT(tree_id)
FROM trees;

683788 trees.

2. How many trees are in each borough?

-- Counts the number of trees in each borough and calculates their percentage of the total
SELECT boroname AS borough_name, COUNT(*) AS tree_count,
ROUND(COUNT(*) * 100.0 / (SELECT COUNT(*) FROM trees),2) AS percentage_of_total
FROM trees
WHERE boroname IS NOT NULL
GROUP BY boroname
ORDER BY 2 DESC;

Analysis

The table presents the distribution of trees across the boroughs of New York City. It shows the total count of trees in each borough along with the percentage of the total number of trees in the dataset that each borough represents.

  • The borough Queens represents the highest number of trees, accounting for approximately 36.64% of the total trees in the dataset. This suggests that Queens has a significant amount of greenery and tree cover compared to other boroughs.
  • Meanwhile, Manhattan has the fewest number of trees among the boroughs, comprising only 9.57% of the total. This is understandable given Manhattan’s dense urban landscape and limited space for large-scale greenery.

3. What is the income of each borough?

To get this information I create a temporary table ‘joined_trees’ by joining ‘trees’ and ‘income_trees’ tables.

-- Creates a temporary table 
CREATE TEMP TABLE joined_trees AS
SELECT
t.index,
t.tree_id,
t.tree_dbh,
t.stump_diam,
t.status,
t.health,
t.spc_latin,
t.spc_common,
t.address,
COALESCE(t.zipcode, it.zipcode) AS zipcode,
t.borocode,
t.boroname,
t.nta_name,
t.state,
t.latitude,
t.longitude,
COALESCE(it.Estimate_Total, 0) AS estimate_total,
COALESCE(it.Margin_of_Error_Total, 0) AS margin_of_error_total,
COALESCE(it.Estimate_Median_income, 0) AS estimate_median_income,
COALESCE(it.Margin_of_Error_Median_income, 0) AS margin_of_error_median_income,
COALESCE(it.Estimate_Mean_income, 0) AS estimate_mean_income,
COALESCE(it.Margin_of_Error_Mean_income, 0) AS margin_of_error_mean_income
FROM
trees t
FULL OUTER JOIN
income_trees it ON t.zipcode = it.zipcode;


-- Calculates median and average income for each borough
SELECT
boroname as borough,
ROUND(AVG(estimate_mean_income), 0) AS mean_income,
ROUND(AVG(estimate_median_income), 0) AS median_income,
COUNT(tree_id) AS count_trees
FROM
joined_trees
WHERE estimate_median_income > 0 and estimate_mean_income > 0 AND boroname IS NOT NULL
GROUP BY
boroname
ORDER BY 4 DESC;

Analysis

This table offers a comparison of income levels (mean and median) alongside the count of trees across different boroughs in New York City.

  • Despite having the highest income levels, Manhattan has the lowest tree count among the boroughs. This suggests a potential trade-off between urban development and green space preservation in densely populated areas.
  • Queens has the largest count of trees compared to the other boroughs, while its mean and median income levels are not significantly different from those of other boroughs.
  • The boroughs that have a substantial greenery presence (such as Queens and Brooklyn) do not exhibit notably higher or lower income.

4. What is the income across different zip codes?

-- Retrieves the top 5 zipcodes with the highest median income
SELECT
DISTINCT zipcode,
estimate_median_income,
boroname as borough
FROM
joined_trees
WHERE boroname IS NOT NULL
ORDER BY estimate_median_income DESC
LIMIT 5;
-- Retrieves the top 5 zipcodes with the lowest median income (excluding zero values)
SELECT
DISTINCT zipcode,
estimate_median_income,
boroname as borough
FROM
joined_trees
WHERE boroname IS NOT NULL AND estimate_median_income != 0
ORDER BY estimate_median_income
LIMIT 5;
WITH Borough_Ranks AS (
SELECT
boroname as borough,
zipcode,
estimate_median_income,
ROW_NUMBER() OVER (PARTITION BY boroname ORDER BY estimate_median_income) AS poorest_rank,
ROW_NUMBER() OVER (PARTITION BY boroname ORDER BY estimate_median_income DESC) AS richest_rank
FROM
joined_trees
WHERE boroname IS NOT NULL and zipcode != 83
)
SELECT
boroname,
MIN(CASE WHEN poorest_rank = 1 THEN zipcode END) AS poorest_zipcode,
MAX(CASE WHEN richest_rank = 1 THEN zipcode END) AS richest_zipcode
FROM
Borough_Ranks
GROUP BY
boroname;

The tables indicate 5 first richest zipcodes and 5 poorest in Metro NY. And the last one showcase the poorest and richest zipcode and each borough.

5. What is the tree concentration across different zip codes?

6. Does tree health vary across rich vs pood zip codes?

-- Calculates various statistics for each borough tree count, and percentage of trees in different health states
WITH Boroughs AS (
SELECT DISTINCT boroname
FROM joined_trees
),
Borough_Health_Tree_Counts AS (
SELECT
j.boroname,
j.health,
COUNT(*) AS tree_count
FROM joined_trees j
WHERE j.health IS NOT NULL AND j.health != '' /
GROUP BY j.boroname, j.health
),
Borough_Health_Total_Trees AS (
SELECT
bh.boroname,
SUM(bh.tree_count) AS total_trees
FROM Borough_Health_Tree_Counts bh
GROUP BY bh.boroname
)
SELECT
b.boroname AS Borough,
bh.health AS Health_State,
SUM(bh.tree_count) AS Total_Trees,
ROUND(SUM(bh.tree_count * 100.0) / bt.total_trees, 2) AS Percentage
FROM
Boroughs b
CROSS JOIN
(SELECT DISTINCT health FROM Borough_Health_Tree_Counts WHERE health IS NOT NULL) AS h
LEFT JOIN
Borough_Health_Tree_Counts bh ON b.boroname = bh.boroname AND h.health = bh.health
JOIN
Borough_Health_Total_Trees bt ON b.boroname = bt.boroname
GROUP BY
b.boroname, bh.health, bt.total_trees
ORDER BY
b.boroname, bh.health;

The table provides information about the distribution of tree health states (Fair, Good, and Poor) within different boroughs of New York City, along with the corresponding total tree count and percentage of trees in each health state.

Analysis:

  • Queens has the highest proportion of “Good” health state trees, followed closely by Brooklyn and Manhattan.
  • Staten Island has the highest proportion of “Fair” health state trees compared to other boroughs.
  • The distribution of “Poor” health state trees is relatively consistent across boroughs, with a small percentage in each.
  • The majority of street trees across all boroughs are in “Good” health, indicating generally favorable conditions for urban tree growth and maintenance.
  • The chart indicates that the wealthiest zip codes tend to have a slightly higher proportion of trees rated as ‘Good’ in health (with the exception of Brooklyn). The most significant disparity is observed in Staten Island borough, where it reaches approximately 6%.

7. Do certain boroughs have more living trees?

--Computes the percentage of total trees in each borough categorized by their status
SELECT boroname AS borough,
status,
ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (PARTITION BY boroname), 2) AS percentage_of_total
FROM trees
GROUP BY boroname, status;

Analysis

This table provides the distribution of tree statuses (Alive, Dead, and Stump) across different boroughs in New York City, represented as a percentage of the total number of trees in each borough.

  • The highest proportion of living trees is in Staten Island, accounting for 96.32% of the total.
  • A small percentage (1.77% and 1.78%) of trees are classified as dead in Staten Island and Queens.
  • The highest proportion (2.75%) of trees are categorized as dead in Manhattan.

8. Do certain borough have newer/younger trees?

The age of trees can often be estimated based on the diameter of the trunk.

-- Calculates the average trunk diameter of trees (only selects the top 10 boroughs with the largest average tree diameter) in each borough 
SELECT borough,
trunk_diameter
FROM (
SELECT boroname AS borough,
ROUND(AVG(tree_dbh), 2) AS trunk_diameter,
RANK() OVER (PARTITION BY boroname ORDER BY AVG(tree_dbh) DESC) AS rank
FROM trees
GROUP BY boroname
) AS ranked_trees
WHERE rank <= 10
ORDER BY 2 DESC;

The table provides information on the average trunk diameter of trees in different boroughs of New York City.

Analysis

  • Queens has the highest average trunk diameter at 18 inches, indicating potentially older and larger trees on average compared to other boroughs.
  • Manhattan has the smallest average trunk diameter at almost 11 inches, indicating a prevalence of younger or smaller trees, possibly due to limited space for tree growth in urban environments.

All charts and dashboards can be viewed interactively in Tableau.

Conclusion

Research Questions: Is there a relationship between trees and income in NYC?

Key Findings:

  1. Tree Population in Metro NYC:
  • The analysis revealed a total of 683788 trees across metro NYC.

2. Tree Distribution by Borough:

  • Queens has the largest number of trees, followed by Brooklyn, Staten Island, Bronx, and Manhattan, respectively.

3. Income Disparity by Borough:

  • The boroughs that have a substantial greenery presence (such as Queens and Brooklyn) do not exhibit notably higher or lower income. Meanwhile, Manhattan with its highest income levels, has the lowest tree count among the boroughs.

4. Income Variation Across Zip Codes:

  • Income varies significantly across different zip codes within each borough (from 20210$ in Bronx till 250000$ in Manhattan), as indicated by the differences in median income estimates.

5. Tree Concentration by Zip Code:

  • Certain zip codes exhibit a higher concentration of trees compared to others, suggesting variations in greenery distribution across the city.

6. Relationship Between Tree Health and Income:

  • The chart indicates that the wealthiest zip codes tend to have a slightly higher proportion of trees rated as ‘Good’ in health (with the exception of Brooklyn). The most significant disparity is observed in Staten Island borough, where it reaches approximately 6%.

7. Tree Vitality Across Boroughs:

  • In Staten Island, the highest proportion of trees (96.32%) are alive, while a small percentage (1.77% and 1.78%) are classified as dead in Staten Island and Queens, respectively. Manhattan has the highest proportion of dead trees at 2.75%.

8. Tree Age Distribution:

  • Queens as potentially having older and larger trees on average, while Manhattan has smaller trees, likely due to limited space for growth in urban environments.

Thank you for taking the time to read through this project. If you happen to notice any errors or have suggestions for improvement, please don’t hesitate to leave a comment. Your feedback is highly valued!

Connect with me on LinkedIn and explore more of my projects on GitHub.

Happy coding! 😊

--

--

Elizaveta Chvorda

A doctor interested in Statistical Method in Data Analysis. More content about Biostatistics and Bioinformatics coming soon ;)