Predicting High School Dropout Rates in NY State: A Data Science Approach

May 3, 2024

Contributors: Abdulla Mamun, Mapalo Lukashi, Rachael Ojopagogo

Introduction

In recent years, the educational landscape has increasingly turned to data science to unravel the complex factors influencing student success and retention. Among these factors, high school dropout rates serve as a critical indicator of both individual and systemic educational challenges. Recognizing the importance of this issue, our project delves into analyzing high school dropout rates in New York State, utilizing a dataset provided by the NY State Education Department for the 2018–2019 school year.

Spanning over 73,000 observations, this dataset offers a comprehensive view of graduation metrics across various NY State school districts, presenting a unique opportunity to explore the underlying causes of student dropout. Through this exploration, our primary objective emerges: to construct and evaluate a series of regression models capable of predicting dropout counts with precision. By achieving this goal, we aim not only to shed light on the factors contributing to high school dropouts but also to empower educational institutions and policymakers with data-driven insights to forge effective interventions.

We carried out the work in Python and Jupyter Notebooks, open-source tools that are flexible enough for this kind of analysis. This blog post outlines our journey through the data science project lifecycle, from the initial data collection to the final model evaluation, highlighting the methodologies, challenges, and key findings along the way.

Join us as we navigate through the intricate process of transforming raw data into actionable knowledge, demonstrating the pivotal role of data science in enhancing educational outcomes and fostering a brighter future for students across New York State.

Methodology

Our project followed the data science lifecycle from tooling and data loading through to model selection.

We began our analysis with Python, leveraging its robust libraries that serve as the bedrock for data science:

● Pandas: For data manipulation and analysis.

● NumPy: For numerical computations.

● Matplotlib and Seaborn: For crafting informative data visualizations.

● Scikit-learn: For applying machine learning techniques and data preprocessing.

● Statsmodels: For building and interpreting complex statistical models.

● Data Loading: The raw data, covering high school graduation metrics for the 2018–2019 academic year, was sourced from the New York State Education Department. We used Pandas to load the dataset and carry out preliminary processing, setting the stage for the rest of the analysis.

● Exploratory Data Analysis (EDA): Our in-depth EDA was pivotal in detecting patterns, spotting anomalies, and identifying key variables. This step gave us a clear picture of the factors associated with dropout rates.

● Data Preparation: We meticulously prepared our dataset for modeling. This included imputation strategies for missing data, outlier management to ensure data quality, and encoding categorical variables into a machine-readable format.

● Regression Modeling: We built multiple regression models, including Linear Regression and Poisson Regression. We then interpreted these models using SHAP values, which made the contribution of each feature to the predictions more transparent.

● Model Evaluation and Selection: The final phase involved a critical comparison of models using metrics like MSE, RMSE, and MAE. This quantitative evaluation allowed us to identify and select the most accurate model for predicting dropout counts.
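As a rough illustration of the data loading step, here is a minimal sketch. It is not the project's actual notebook code: the column names (district, subgroup, enroll_cnt, grad_cnt, dropout_cnt) and the use of "-" to mark suppressed values are assumptions, and a tiny in-memory CSV stands in for the real file.

```python
import io

import pandas as pd

# Tiny stand-in for the state's graduation file.
# Column names and the "-" placeholder are hypothetical.
raw_csv = io.StringIO(
    "district,subgroup,enroll_cnt,grad_cnt,dropout_cnt\n"
    "District A,All Students,120,100,8\n"
    "District A,Female,60,-,3\n"
    "District B,All Students,300,240,-\n"
)

# Treat "-" as missing so count columns are parsed as numbers, not strings.
df = pd.read_csv(raw_csv, na_values=["-"])

print(df.shape)         # (rows, columns)
print(df.dtypes)        # check that count columns are numeric
print(df.isna().sum())  # missing values per column
```

With a real file, the StringIO object would be replaced by a file path, and the same three checks give a quick first look at size, types, and missingness.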

Exploratory Data Analysis (EDA)

Our EDA revealed critical insights into the dataset’s characteristics and the factors influencing high school dropout rates in New York State. Key findings include:

● Trends and Patterns: Dropout rates varied considerably from one school district to another. This suggested that location, and possibly socioeconomic factors, play a role in student retention.

● Anomalies and Outliers: The analysis identified outliers in the data, particularly in districts with exceptionally high or low dropout counts. These outliers were scrutinized to ensure they didn’t skew our model’s predictive accuracy.

● Variable Relationships: A closer examination of the variables revealed that certain factors, such as enrollment counts and graduation rates, were strongly associated with dropout rates. This guided our feature selection in the modeling phase.
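The kinds of checks described above can be sketched in a few lines of Pandas. This is only an illustration on synthetic data: the column names are hypothetical and the numbers are randomly generated, so they say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; column names are hypothetical.
rng = np.random.default_rng(0)
n = 200
eda_df = pd.DataFrame({
    "district": rng.choice(["District A", "District B", "District C"], size=n),
    "enroll_cnt": rng.integers(20, 500, size=n),
})
eda_df["grad_cnt"] = (eda_df["enroll_cnt"] * rng.uniform(0.6, 0.95, size=n)).round()
eda_df["dropout_cnt"] = rng.poisson(0.05 * eda_df["enroll_cnt"].to_numpy())

# Trends and patterns: summarize the dropout count per district.
district_summary = (
    eda_df.groupby("district")["dropout_cnt"]
    .agg(["count", "mean", "median", "max"])
    .sort_values("mean", ascending=False)
)

# Variable relationships: correlation of each numeric feature with the target.
correlations = (
    eda_df[["enroll_cnt", "grad_cnt", "dropout_cnt"]]
    .corr()["dropout_cnt"]
    .drop("dropout_cnt")
)

print(district_summary)
print(correlations)
```

A per-group summary highlights districts that stand out, while the correlation column gives a first, linear-only hint about which features may be worth keeping.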

Data Preparation

To ensure our dataset was primed for accurate modeling, we undertook several crucial steps in the data preparation phase:

● Handling Missing Values: Employed KNN imputation to fill in missing data, preserving the integrity and richness of our dataset.

● Addressing Outliers: Implemented statistical techniques to manage outliers, ensuring they did not adversely affect our model’s performance.

● Feature Encoding: Transformed categorical variables into numerical format through one-hot encoding, enabling their inclusion in our regression models.

● Standardization: Applied standardization to numerical features to normalize their scales, facilitating more stable and interpretable model training.

These preparation steps were pivotal in enhancing the quality of our dataset, setting a strong foundation for the modeling phase.
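The four preparation steps can be sketched roughly as follows. This is a minimal example under stated assumptions, not the project's code: the column names are hypothetical, the toy values are made up, and IQR-based clipping is just one possible choice for the unspecified "statistical techniques" used on outliers.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Toy data with hypothetical column names; 400 is a planted outlier.
prep_df = pd.DataFrame({
    "subgroup": ["All", "Female", "Male", "All", "Female", "Male"],
    "enroll_cnt": [120.0, 60.0, np.nan, 300.0, 150.0, 150.0],
    "grad_cnt": [100.0, 52.0, 48.0, 240.0, np.nan, 115.0],
    "dropout_cnt": [8.0, 3.0, 5.0, 400.0, 10.0, 15.0],
})
numeric_cols = ["enroll_cnt", "grad_cnt"]

# 1. Missing values: fill each gap from the 2 most similar rows.
imputer = KNNImputer(n_neighbors=2)
prep_df[numeric_cols] = imputer.fit_transform(prep_df[numeric_cols])

# 2. Outliers (assumed approach): clip the target to the 1.5 * IQR fences.
q1, q3 = prep_df["dropout_cnt"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
prep_df["dropout_cnt"] = prep_df["dropout_cnt"].clip(lower, upper)

# 3. Feature encoding: one-hot encode the categorical column.
prepared = pd.get_dummies(prep_df, columns=["subgroup"], dtype=float)

# 4. Standardization: zero mean and unit variance for numeric features.
prepared[numeric_cols] = StandardScaler().fit_transform(prepared[numeric_cols])

print(prepared.round(2))
```

In a real workflow the imputer and scaler would normally be fitted on the training split only and then applied to the test split, so that no information leaks from the held-out data.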

Figure: Boxplots of the ‘dropout count’ variable before (above) and after (below) outlier management, showing how the data preparation phase reduced the influence of extreme values ahead of modeling.

Regression Modeling

In the regression modeling phase, we transitioned from data preparation to predictive analytics. Here’s how we approached building and refining our models:

Model Construction:

We experimented with various regression techniques to best capture the nuances of our data. The models we tested included:

Linear Regression: To capture linear relationships between features and the target variable.

Comparative performance plots of our regression models, illustrating their predictive capabilities.

Poisson Regression: Ideal for count data, this model was used given the nature of dropout counts.

Comparative performance plots of our Poisson models, illustrating their predictive capabilities.
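To make the two model families concrete, here is a minimal sketch that fits both on synthetic count data. It uses scikit-learn's LinearRegression and PoissonRegressor to stay self-contained, which may differ from the project's own implementation; the features, coefficients, and data are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, PoissonRegressor
from sklearn.model_selection import train_test_split

# Synthetic data: three standardized features and a count target.
rng = np.random.default_rng(42)
n = 500
X = rng.normal(size=(n, 3))
true_rate = np.exp(1.0 + 0.5 * X[:, 0] - 0.3 * X[:, 1])
y = rng.poisson(true_rate)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Linear Regression: models the count as a linear function of the features.
linear_model = LinearRegression().fit(X_train, y_train)

# Poisson Regression: models log(expected count) as a linear function,
# so predictions are always positive. alpha=0.0 turns off regularization.
poisson_model = PoissonRegressor(alpha=0.0, max_iter=300).fit(X_train, y_train)

linear_pred = linear_model.predict(X_test)
poisson_pred = poisson_model.predict(X_test)

print("Linear coefficients: ", linear_model.coef_.round(3))
print("Poisson coefficients:", poisson_model.coef_.round(3))
```

One practical difference is worth noting: a linear model can predict negative counts, while the Poisson model's log link rules that out.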

Model Interpretability with SHAP:

To understand the contribution of each feature to our models’ predictions, we utilized SHAP values. This interpretability tool is essential in demystifying the model’s decision-making process.
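For a linear model, SHAP values have a simple closed form when features are treated as independent: the contribution of a feature to one prediction is its coefficient times the feature's deviation from its average. The sketch below computes this by hand with NumPy rather than the shap package, on synthetic data with hypothetical feature names, so it illustrates the idea and is not the project's analysis.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data; feature names are hypothetical.
rng = np.random.default_rng(7)
feature_names = ["enroll_cnt", "grad_cnt", "reg_cnt"]
X = rng.normal(size=(300, 3))
y = 4.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

model = LinearRegression().fit(X, y)

# Linear SHAP (independence assumption):
# contribution = coefficient * (feature value - feature mean).
background_mean = X.mean(axis=0)
shap_values = (X - background_mean) * model.coef_

# Additivity check: base value plus contributions equals the prediction.
base_value = model.predict(X).mean()
reconstructed = base_value + shap_values.sum(axis=1)

# Global importance: mean absolute contribution per feature.
mean_abs_shap = np.abs(shap_values).mean(axis=0)
ranking = sorted(
    zip(feature_names, mean_abs_shap), key=lambda kv: kv[1], reverse=True
)

for name, importance in ranking:
    print(f"{name}: {importance:.3f}")
```

The additivity check is the property that makes SHAP useful for explanation: every prediction splits exactly into a baseline plus one contribution per feature.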

Model Evaluation and Selection

A bar chart showcasing the MSE, RMSE, and MAE across different models.

Upon building our models, we embarked on a rigorous evaluation process to discern the most accurate and reliable one. This involved:

● Error Metrics: We computed Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE) for each model to gauge their performance quantitatively.

● Visual Assessments: Performance plots illustrated the spread of predicted versus actual dropout counts, shedding light on the models’ predictive accuracies.

● Statistical Significance: Coefficient p-values from the models’ summaries provided insights into the statistical significance of each predictor.

The selection was made not solely on the numerical scores but also on the interpretability and relevance of the models to stakeholders. The Linear Regression model emerged as the top performer, offering a balance of accuracy and clarity.
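The three error metrics can be computed as in the sketch below. The actual and predicted values are made-up numbers for two placeholder models (model_a and model_b), not results from this project.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up actual counts and predictions from two placeholder models.
y_true = np.array([3, 0, 12, 7, 5])
predictions = {
    "model_a": np.array([2.5, 0.5, 10.0, 8.0, 5.0]),
    "model_b": np.array([3.5, 1.0, 9.0, 6.0, 4.0]),
}


def regression_errors(actual, predicted):
    """Return MSE, RMSE, and MAE for one set of predictions."""
    mse = mean_squared_error(actual, predicted)
    return {
        "MSE": float(mse),
        "RMSE": float(np.sqrt(mse)),
        "MAE": float(mean_absolute_error(actual, predicted)),
    }


results = {name: regression_errors(y_true, pred) for name, pred in predictions.items()}

# Pick the model with the lowest RMSE (lower is better for all three metrics).
best_model = min(results, key=lambda name: results[name]["RMSE"])

for name, scores in results.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
print("Lowest RMSE:", best_model)
```

RMSE is in the same units as the target and penalizes large misses more heavily than MAE, so reporting both shows whether a model's error is driven by a few large mistakes.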

Conclusion

Our data science initiative has come full circle, demonstrating the use of regression analysis to predict dropout counts in New York State high schools. Working from a detailed dataset for the 2018–2019 academic year, we’ve distilled complex educational data into actionable insights about student retention.

The project’s thorough exploratory data analysis (EDA) revealed critical patterns, informing our robust data preparation and modeling. Our deployment of various regression models, enhanced by SHAP analysis, allowed us to quantify the influence of each feature on dropout rates.

The Linear Regression model offered the best balance of accuracy and interpretability among the models we compared, underlining the importance of careful data management and clear model understanding. These findings could inform real-world educational strategies, providing a data-driven foundation for initiatives aimed at improving graduation rates. This aligns with a broader commitment to education, highlighting data science’s role in addressing academic challenges and working toward a future where every student has the resources and support to succeed.

LinkedIn profile links:

Abdulla Mamun: https://www.linkedin.com/in/abdulla-mamun-4222b11a/

Mapalo Lukashi: http://linkedin.com/in/mapalo-lukashi-0129b17b/

Rachael Ojopagogo: www.linkedin.com/in/rachael-ojopagogo

GitHub links:

Abdulla Mamun: https://github.com/mamun21616

Mapalo Lukashi: https://github.com/Mapalo2023

Rachael Ojopagogo: https://github.com/Rakel2311