Google Summer of Code, HumanAI: Enhancing Program Evaluation Research by Leveraging AI for Integrated Analysis of Mixed-Methods Data (Shao Jin, Final Report)
Project Name: Enhancing Program Evaluation Research by Leveraging AI for Integrated Analysis of Mixed-Methods Data
Project Abstract: The objective of this project is to utilize AI and machine learning to improve the analysis of mixed-methods research involving surveys and focus group designs for program evaluation. By leveraging advanced technologies, we aim to gain a more comprehensive understanding of the data, extract meaningful insights, and enhance the efficiency of analysis processes.
Input Data: 2012, 2014, and 2016 Alabama Youth Tobacco Survey data.
My Work:
- Reverse-engineered the 2016 Alabama Youth Tobacco Survey to understand its sampling and weighting process, then applied the same process to sample and weight the 2024 Alabama student input data.
- Predicted the 2024 Alabama Youth Tobacco Survey data from the 2024 Alabama student input data using different machine learning models (a classification model and a regression model), and drew insights from the results through data visualization.
- Designed an open-ended qualitative question on tobacco usage and synthesized corresponding answers to demonstrate the potential of machine learning to code, analyze, and summarize large numbers of open-ended responses. In addition, I built an interactive website where users can submit their own answers and the model predicts which category their response falls into.
Important Note: To run the code below, you need to follow the guide in this link to upload the input data files to Google Drive; my full codebase for this project is in this link.
Detailed Explanation for part 1:
a. I delved into the Alabama Youth Tobacco Survey data set and categorized the columns into three sections: the weighted part (Weight, Strat, PSU), the student information part (CR5: Race, CR2: Gender, CR3: Grade, Year), and the survey answer part (CR6-CR80).
b. By studying the selection and weighting process detailed in the State-YTS-Methodology-Report, I attempted to replicate the 2016 Alabama Youth Tobacco Survey using 2016 raw data from publicly available demographic sources, including Alabama school enrollment by region and each school's race, gender, and grade distribution. I then sampled based on each school's enrollment and adjusted the weights for non-response at the school, class, and student levels. My reproduced data showed a lower mean and a lower standard deviation in weights, likely due to assumptions I made: I assumed every selected class participated in the survey (hence no class non-response adjustment), and I assumed uniform student response rates across all schools because school-specific response rate data were unavailable.
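The core of the weighting logic can be sketched as follows. The numbers here are made up for illustration, and the sketch covers only the within-school selection probability and student non-response adjustment; the real process also stratifies by region and post-stratifies by race, grade, and gender.

```python
import pandas as pd

# Hypothetical enrollment and response counts for three sampled schools.
schools = pd.DataFrame({
    "school": ["A", "B", "C"],
    "enrollment": [800, 500, 300],
    "students_sampled": [60, 60, 60],
    "students_responded": [54, 48, 57],
})

# Base weight: inverse of each student's within-school selection probability.
schools["base_weight"] = schools["enrollment"] / schools["students_sampled"]

# Student non-response adjustment: inflate weights by the inverse response rate,
# so responding students also represent the non-respondents.
schools["response_rate"] = schools["students_responded"] / schools["students_sampled"]
schools["adjusted_weight"] = schools["base_weight"] / schools["response_rate"]

print(schools[["school", "base_weight", "adjusted_weight"]])
```

Under this scheme an adjusted weight equals enrollment divided by the number of respondents, which is why assuming a uniform response rate (as noted above) compresses the spread of the final weights.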
c. To generate the 2024 Alabama weighted and student information part, I first gathered 2024 school enrollment data, along with race, grade, and gender distributions in Alabama from publicly available demographic data. I then applied the sampling and weighted processes from 2016 to this new data.
The code for this process is available at the following link: https://colab.research.google.com/drive/1wXhm9tkE7FvF072RskqTQiA6BkxlldJw?usp=sharing
Detailed Explanation for part 2:
Classification model (excluding the feature "Year"): choose the better of a KNN classifier and a Random Forest classifier, trained to predict all survey-answer columns.
KNN Classifier: A non-parametric, supervised learning classifier that uses proximity to make classifications or predictions about the grouping of an individual data point. Random Forest Classifier: A meta estimator that fits multiple decision tree classifiers on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control overfitting.
When I first attempted to train my model, I noticed that the predicted values for the 2024 dataset, such as in column CR26, were all ‘2’. This prompted me to review each output column in my training set, where I discovered that most values were dominated by a single number, indicating an uneven distribution. To address the underrepresented groups, I decided to use SMOTE (Synthetic Minority Over-sampling Technique). SMOTE is an oversampling technique for binary or multiclass tasks, designed to handle class imbalance issues. It generates new synthetic samples in the feature space to increase the number of minority class samples.
During the training process, I used grid search to choose the best hyperparameters for KNN and Random Forest.
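The model-selection step can be sketched as follows, using scikit-learn's GridSearchCV with the iris dataset as a stand-in for the survey features; the hyperparameter grids are illustrative, not the ones actually used.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # stand-in for the survey feature matrix
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One cross-validated grid search per candidate model.
searches = {
    "knn": GridSearchCV(KNeighborsClassifier(),
                        {"n_neighbors": [3, 5, 7],
                         "weights": ["uniform", "distance"],
                         "metric": ["euclidean", "manhattan"]}, cv=5),
    "rf": GridSearchCV(RandomForestClassifier(random_state=0),
                       {"n_estimators": [100, 200],
                        "max_depth": [None, 10]}, cv=5),
}
for search in searches.values():
    search.fit(X_train, y_train)

# Pick the model whose best cross-validation score is highest.
best_name = max(searches, key=lambda name: searches[name].best_score_)
best = searches[best_name]
print(best_name, best.best_params_, "test accuracy:", best.score(X_test, y_test))
```

The same pattern (fit both searches, compare `best_score_`, report held-out accuracy for the winner) matches the KNN-vs-Random-Forest comparison described above.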
Best Model: KNN Best Hyperparameters: {‘metric’: ‘euclidean’, ‘n_neighbors’: 7, ‘weights’: ‘distance’} ; KNN Validation Accuracy: 0.8186604886267904; KNN Test Accuracy: 0.8085509688289806.
The code for Classification Model is available at the following link: https://colab.research.google.com/drive/1tdgzzgPODIzrQ47CG1PKw9Gi6WHrONWf?usp=sharing
Regression model (including the feature "Year" to capture time effects in the data): choose the better of RandomForestRegressor and XGBRegressor for each survey-answer column.
XGBRegressor is based on the gradient boosting framework and can capture complex relationships in the dataset. RandomForestRegressor uses an ensemble of decision trees to make predictions and handles large datasets well.
During the training process, I also used grid search to choose the best hyperparameters for each column's model. The detailed hyperparameter choices are available at the code link below.
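The per-column model selection can be sketched like this. The data are synthetic, the grids are illustrative, and scikit-learn's GradientBoostingRegressor stands in for XGBRegressor so the sketch has no extra dependency; the structure (one grid search per candidate per column, keep the higher cross-validation score) is the same.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))  # stand-in features (demographics plus Year)
# Three synthetic "survey answer" columns, each a noisy linear signal.
Y = np.column_stack([X @ rng.normal(size=4) + rng.normal(scale=0.1, size=300)
                     for _ in range(3)])

best_models = {}
for col in range(Y.shape[1]):
    candidates = [
        GridSearchCV(RandomForestRegressor(random_state=0),
                     {"n_estimators": [50, 100]}, cv=3),
        GridSearchCV(GradientBoostingRegressor(random_state=0),
                     {"learning_rate": [0.05, 0.1]}, cv=3),
    ]
    for search in candidates:
        search.fit(X, Y[:, col])
    # Keep whichever model cross-validates best for this column.
    best_models[col] = max(candidates, key=lambda s: s.best_score_)
    print(col, type(best_models[col].best_estimator_).__name__,
          round(best_models[col].best_score_, 3))
```

Fitting a separate model per output column keeps each column's hyperparameters independent, at the cost of repeating the search once per survey question.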
The code for Regression Model is available at the following link: https://colab.research.google.com/drive/1zEqpN39AQh5SE6btBdRnHyOywf2zkWLJ?usp=sharing
Data visualization for input and output:
Input: student information part (CR5: Race, CR2: Gender, CR3: Grade). I present graphs showing their variation over the years (2012, 2014, 2016, and 2024). The code for input visualization: https://colab.research.google.com/drive/12qRQFlqTPYuKGgMGxtZWMIiXzI1qtht3?usp=sharing
Output: survey answer part. For each survey response column, I presented several graphs illustrating the variation in choices across different races, grades, and genders over the years. These visualizations provided insight into how different demographic groups have responded to the survey questions across multiple years, highlighting trends and patterns within the data. The code for output visualization: https://colab.research.google.com/drive/1sQf8LjHUj6b0iyI0LKA6Rr-aZcj9T-29?usp=sharing
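One such trend plot can be sketched with matplotlib as below. The response shares and the file name are made up for illustration; the real notebooks plot the actual survey columns.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np

years = [2012, 2014, 2016, 2024]
# Hypothetical share of students picking answer '1' on one survey item, by gender.
share_male = [0.30, 0.27, 0.22, 0.15]
share_female = [0.28, 0.26, 0.20, 0.12]

x = np.arange(len(years))
fig, ax = plt.subplots()
ax.bar(x - 0.2, share_male, width=0.4, label="Male")
ax.bar(x + 0.2, share_female, width=0.4, label="Female")
ax.set_xticks(x)
ax.set_xticklabels(years)
ax.set_ylabel("Share choosing answer '1'")
ax.set_title("Response trend by gender (illustrative data)")
ax.legend()
fig.savefig("trend_by_gender.png")
```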
Detailed Explanation for part 3:
I designed an open-ended qualitative question: What is your opinion on the health impacts of traditional smoking versus vaping? Since I didn’t have real input data, I utilized GPT tools to generate synthetic responses. I categorized each answer into four categories: 0 (Negative effects of traditional smoking), 1 (Negative effects of e-cigarettes), 2 (Positive effects of e-cigarettes), and 3 (Non-committal or unsure). After preprocessing the data using NLP techniques, I compared Logistic regression and K-means models for analysis. Logistic regression was selected due to its higher accuracy of 0.64, outperforming K-means. Finally, I developed an interactive website where users can submit their own answers, and the model predicts the category their response falls into.
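A minimal sketch of the text-classification step: the report does not name the exact NLP preprocessing, so TF-IDF features are assumed here, and a tiny hand-written corpus stands in for the GPT-generated responses and the four categories above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus; labels follow the report's coding scheme:
# 0: negative on smoking, 1: negative on vaping, 2: positive on vaping, 3: unsure.
texts = [
    "Smoking causes lung cancer and heart disease",
    "Cigarettes are deadly and damage your lungs",
    "Vaping harms teens and causes addiction",
    "E-cigarettes still contain nicotine and hurt your health",
    "Vaping is a safer alternative that helps people quit smoking",
    "E-cigarettes are less harmful than tobacco",
    "I am not sure which one is worse",
    "Hard to say, both could be risky or fine",
]
labels = [0, 0, 1, 1, 2, 2, 3, 3]

# TF-IDF vectorization followed by logistic regression, the model the
# report selected; the fitted pipeline can then classify new responses.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(texts, labels)
print(clf.predict(["Vaping seems much safer than cigarettes"]))
```

The same fitted pipeline is the kind of object an interactive site can wrap: the website passes a submitted answer to `clf.predict` and displays the predicted category.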
The code for the qualitative question is at the following link: https://colab.research.google.com/drive/1joFvQBdigNmxiMvioL4SnDhUIfvic9LG?usp=sharing and the interactive website is at the following link: https://huggingface.co/spaces/Shao11111/qualitative
I had a really great experience in GSoC this year, and I improved my machine learning and programming skills. I am grateful for the opportunity to have been a part of it.
I would also like to thank my mentor, Sarah Dunlap, for her guidance and support.