Analyzing Mental Health Trends Across the United States in 2024

Published in

INST414: Data Science Techniques

5 min read · Sep 15, 2024

Introduction

In a time of rapidly advancing technology, mental health conditions are growing in prevalence. Students, in particular, face heightened levels of anxiety and depression, driven by factors ranging from academic pressure to the constant demands of digital connectivity. Curbing these rising levels requires prompt understanding and intervention. As the incidence of mental health disorders continues to rise, analyzing state-level data can help identify where support and assistance are most needed and direct efforts toward improved mental well-being across the nation.

Research Questions and Stakeholders

Questions

Which states have reported the highest levels of depressive and anxiety symptoms in the year 2024?
Which states have reported the lowest levels of depressive and anxiety symptoms in the year 2024?
What are the highest and lowest average percentages reported for each mental health indicator in 2024?

Stakeholders

Mental health researchers: Interested primarily in understanding how mental health conditions vary across states and in designing better studies of that variation.
Public health officials: Looking to plan interventions informed by which states report the highest and lowest percentages of anxiety and depression symptoms.

Informed Decisions

The dataset and code are designed to provide stakeholders with insights into the states exhibiting the highest and lowest levels of depressive and anxiety symptoms. These results can inform the implementation of targeted campaigns and initiatives aimed at benefiting the general public and raising awareness about mental health issues.

Tools and Technologies

To curate and publish this initial research, the following tools and software were used:

Apple Numbers
Microsoft Excel
Visual Studio Code
Medium
Programming Language: Python (pandas, matplotlib)

Data

The data used to answer these questions comes from the U.S. Department of Health and Human Services. The dataset includes several variables that provide valuable insights:

Observations: 16,326 rows x 14 columns

Variables:

Indicator: Specifies the type of mental health condition being measured (anxiety, depression, anxiety or depression).
Group: The category of data aggregation.
State: Identifies the U.S. state where the data was collected.
Subgroup: Identifies the U.S. state where the data was collected (redundant variable).
Phase: Data collection stage/phase.
Time period: Duration (in days) of the data.
Time Period Label: Describes the time frame during which the data was recorded.
Time Period Start Date: Starting date of the time period (format: m/d/yy).
Time Period End Date: Ending date of the time period (format: m/d/yy).
Value: Represents the percentage of individuals exhibiting symptoms of the condition.
Low CI: Lower bound of the confidence interval.
High CI: Upper bound of the confidence interval.
Confidence Interval: Provides a range within which the true value is likely to fall, reflecting the precision of the measurement.
Quartile Range: Values classified in different ranges.

While all of these variables help describe the mental health picture in each state and support questions about average prevalence, not all of them are included in the primary analysis.
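As a rough illustration of this first look at the data, the sketch below loads a CSV with pandas and checks its dimensions, column types, and missing values. The three rows are invented stand-ins for the real file, and the column spellings and indicator labels are assumptions based on the variable list above.

```python
import io

import pandas as pd

# Hypothetical extract standing in for the downloaded CSV (values are invented).
sample_csv = io.StringIO(
    "Indicator,Group,State,Phase,Time Period Start Date,Value\n"
    "Anxiety,By State,Alabama,4.0,1/9/24,18.1\n"
    "Depression,By State,Alabama,4.0,1/9/24,\n"
    "Anxiety,By State,Wyoming,4.0,1/9/24,16.9\n"
)
df = pd.read_csv(sample_csv)

# Dimensions and column types of the loaded table.
print(df.shape)
print(df.dtypes)

# Count of missing values per column, a starting point for cleaning.
missing_per_column = df.isna().sum()
print(missing_per_column)
```

With the real file, `sample_csv` would be replaced by the path to the downloaded CSV, and the shape should match the 16,326 rows and 14 columns reported above.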

Data Subsets and Filtering

Data Cleaning

Manual Cleaning: Removed rows with null or empty values in any column. Additionally, checked that no data was duplicated or misrepresented during the conversion from CSV to XLSX format.

Data Filtering

Filtering: Used Microsoft Excel’s “filter” feature to extract the data that answers the research questions:

Grouping the data by “State.”
Selecting data from Phase 4 and above.
Keeping only data collected in 2024 or later.
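The filtering here was done in Excel, but the same three steps could be sketched in pandas as below. This is not the code used for the article: the sample rows are invented, and it assumes that state-level rows are marked with a Group value of "By State" and that Phase can be compared as a number.

```python
import pandas as pd

# Invented sample rows; column names follow the variable list in the Data section.
df = pd.DataFrame({
    "Group": ["By State", "By State", "By Age", "By State"],
    "State": ["Alabama", "Wyoming", "United States", "Alabama"],
    "Phase": [4.0, 4.0, 4.0, 3.9],
    "Time Period Start Date": ["1/9/24", "2/6/24", "1/9/24", "10/18/23"],
    "Value": [18.1, 16.9, 20.0, 17.5],
})

# Parse the m/d/yy strings so they can be compared as dates.
df["Time Period Start Date"] = pd.to_datetime(
    df["Time Period Start Date"], format="%m/%d/%y"
)

# Keep state-level rows, from Phase 4 onward, collected in 2024 or later.
mask = (
    (df["Group"] == "By State")
    & (df["Phase"] >= 4)
    & (df["Time Period Start Date"] >= pd.Timestamp("2024-01-01"))
)
filtered = df[mask]
print(filtered)
```

Combining the conditions into one boolean mask keeps the filter readable and makes it easy to add or drop a condition later.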

Data Wrangling

Variable Selection: Selected the variables needed to answer the research questions:

Indicator
Group
State
Time Period Label
Value

Common Bugs

Inconsistent state names: Ensure that the state names are consistent throughout the dataset.
Outliers: Check for extreme values and handle them before analysis; no major outliers that might skew the data or mislead the conclusions were found.
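Both checks can be sketched in a few lines of pandas. The example below is a minimal illustration on invented values: it normalizes state names by trimming whitespace and fixing capitalization, then flags outliers with the common 1.5 x IQR rule, which is an assumed choice rather than the method used in this project.

```python
import pandas as pd

# Invented rows with deliberately inconsistent state spellings.
df = pd.DataFrame({
    "State": [" alabama", "Alabama", "WYOMING ", "Wyoming", "Nevada"],
    "Value": [18.1, 18.9, 16.5, 17.2, 19.0],
})

# Trim whitespace and use title case so each state has one spelling.
# Note: title case would also capitalize small words ("District Of Columbia").
df["State"] = df["State"].str.strip().str.title()
unique_states = sorted(df["State"].unique())
print(unique_states)

# Flag values outside 1.5 * IQR of the quartiles (a common rule of thumb).
q1, q3 = df["Value"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["Value"] < q1 - 1.5 * iqr) | (df["Value"] > q3 + 1.5 * iqr)]
print(outliers)
```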

Exploratory Data Analysis (EDA) Plan

Data Overview: Confirm that the dataset loads and reads correctly, and understand all of its variables and values prior to analysis.
Distinct Indicators: Identify the distinct values of the key variable “Indicator.”
Maximum and Minimum occurrences: Group the dataset by “State” and count the maximum and minimum occurrences of each indicator to understand the intensity of mental health symptoms in each state (Q1).
Time Period Analysis: Determine which time period had the highest recorded percentage of symptoms in each state (Q2).
Average Mental Health cases: Use bar graphs to show the average percentage of anxiety and depression symptoms in each state (Q3).
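The first steps of this plan can be sketched in pandas as follows. The rows, indicator labels, and period labels ("P1", "P2") are invented placeholders, so this shows only the shape of the approach, not the project's actual code or results.

```python
import pandas as pd

# Invented rows; "P1" and "P2" are placeholder time period labels.
df = pd.DataFrame({
    "State": ["Alabama", "Alabama", "Alabama", "Wyoming", "Wyoming", "Wyoming"],
    "Indicator": ["Anxiety", "Depression", "Anxiety",
                  "Anxiety", "Depression", "Anxiety"],
    "Time Period Label": ["P1", "P1", "P2", "P1", "P1", "P2"],
    "Value": [18.0, 15.0, 19.0, 16.0, 12.0, 17.0],
})

# Distinct values of the key variable.
indicators = sorted(df["Indicator"].unique())
print(indicators)

# Number of records per state and indicator, as a State x Indicator table.
counts = df.groupby(["State", "Indicator"]).size().unstack(fill_value=0)
print(counts)

# For one indicator, the time period with the highest value in each state.
anxiety = df[df["Indicator"] == "Anxiety"]
peak_rows = anxiety.loc[anxiety.groupby("State")["Value"].idxmax()]
peak_by_state = peak_rows.set_index("State")["Time Period Label"]
print(peak_by_state)
```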

Data Analysis and Findings

The top 5 states that recorded the highest anxiety, depression, and anxiety or depression symptoms were:
- Alabama: 4
- Pennsylvania: 4
- Nevada: 4
- New Hampshire: 4
- New Jersey: 4
The top 5 states that recorded the lowest anxiety, depression, and anxiety or depression symptoms were:
- Maryland: 4
- Massachusetts: 4
- Michigan: 4
- Minnesota: 4
- Wyoming: 4
The states with the highest and lowest average anxiety, depression, and anxiety or depression percentages were:
- Highest anxiety percentage: Alabama = 18.525
- Highest depression percentage: Alabama = 15.275
- Highest anxiety or depression percentage: Alabama = 22.475

- Lowest anxiety percentage: Wyoming = 16.875
- Lowest depression percentage: Wyoming = 12.275
- Lowest anxiety or depression percentage: Wyoming = 20.15
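State-level averages like those above can be computed with a group-by followed by `idxmax` and `idxmin`. The sketch below uses invented values, so its numbers deliberately do not reproduce the findings, and the column names and indicator labels are assumptions.

```python
import pandas as pd

# Invented values: two records per state and indicator.
df = pd.DataFrame({
    "State": ["Alabama"] * 4 + ["Wyoming"] * 4,
    "Indicator": ["Anxiety", "Anxiety", "Depression", "Depression"] * 2,
    "Value": [18.0, 19.0, 15.0, 16.0, 16.0, 17.0, 12.0, 13.0],
})

# Mean percentage per state and indicator: rows are states, columns indicators.
avg = df.groupby(["Indicator", "State"])["Value"].mean().unstack("Indicator")
print(avg)

# State with the highest and lowest average for each indicator.
highest = avg.idxmax()
lowest = avg.idxmin()
print(highest)
print(lowest)
```

Reshaping to one column per indicator means a single `idxmax` or `idxmin` call answers the question for every indicator at once.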

Visualizations

Bar Graph: Average Percentage of Anxiety (left) & Depression (right) symptoms recorded in each state
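A side-by-side bar chart of this kind could be drawn with matplotlib roughly as follows. The averages are invented for illustration, and the layout (two panels sharing a y-axis) is an assumption about how the original figure was arranged.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; remove to open a window

import matplotlib.pyplot as plt
import pandas as pd

# Invented state averages, one column per indicator.
avg = pd.DataFrame(
    {"Anxiety": [18.5, 16.5, 17.8], "Depression": [15.5, 12.5, 14.1]},
    index=["Alabama", "Wyoming", "Nevada"],
)

# Two panels sharing a y-axis so the bar heights are directly comparable.
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
avg["Anxiety"].plot.bar(ax=ax_left, title="Anxiety")
avg["Depression"].plot.bar(ax=ax_right, title="Depression")
ax_left.set_ylabel("Average % reporting symptoms")
fig.tight_layout()
```

Calling `fig.savefig("averages.png")` (a hypothetical filename) or `plt.show()` afterwards would write or display the figure.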

Limitations and Biases

Limitations

Uniform Sample Size: The initial dataset contains the same number of records, 5,442, for each of the three indicators. This uniformity may reflect a data collection design that does not account for variability in different subsets of the population.
Flawed Variables: The variable “Time period” does not match the number of days implied by the “Time period label” or by the span between “Time Period Start Date” and “Time Period End Date.” This discrepancy raises questions about the exact time frame being referred to and whether “Time period” has a different meaning in other contexts.
Redundant Variables: The variable “Time period label” and the pair “Time Period Start Date” and “Time Period End Date” provide the same information in different formats. This redundancy could increase the data’s size unnecessarily, especially with a large sample, and should be consolidated into a single variable.
Duplicated Meanings: The “State” and “Subgroup” variables convey the same meaning. This redundancy should be addressed to enhance clarity and reduce unnecessary complexity in the dataset.
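Two of these limitations can be checked directly in pandas. The sketch below is a hedged illustration on invented rows: it compares “Time period” with the number of days between the start and end dates, and tests whether “State” and “Subgroup” hold identical values. The column spellings and sample values are assumptions.

```python
import pandas as pd

# Invented rows; the "Time Period" values are placeholders.
df = pd.DataFrame({
    "State": ["Alabama", "Wyoming"],
    "Subgroup": ["Alabama", "Wyoming"],
    "Time Period": [65, 66],
    "Time Period Start Date": ["1/9/24", "2/6/24"],
    "Time Period End Date": ["2/5/24", "3/4/24"],
})

# Days between the start and end of each collection period.
start = pd.to_datetime(df["Time Period Start Date"], format="%m/%d/%y")
end = pd.to_datetime(df["Time Period End Date"], format="%m/%d/%y")
df["Span Days"] = (end - start).dt.days

# Does "Time Period" equal the span in days on every row?
matches_days = (df["Time Period"] == df["Span Days"]).all()

# Do "State" and "Subgroup" carry the same value on every row?
same_column = (df["State"] == df["Subgroup"]).all()

print(df[["Time Period", "Span Days"]])
print(matches_days, same_column)
```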

Biases

Uniformity: All states have four recorded values for each indicator, which may suggest that data was collected simultaneously across states, or it could indicate a uniform data reporting frequency.
Data Collection Methodology: Different states may use varying methods for data collection and reporting, potentially leading to inconsistencies in the dataset.
Sampling Bias: The dataset may not accurately represent the entire population if certain groups are underrepresented or if there are differences in the willingness or ability of individuals to report their symptoms.

GitHub Repository

To view the project, visit my GitHub.