The relationship between income and diabetes prevalence

Kirstin Nichols
INST414: Data Science Techniques
5 min read · Feb 9, 2024

As someone curious about methods to improve health equity, I wanted to determine whether there is a correlation between income and a diagnosis of diabetes or prediabetes. It seems reasonable to assume that people with higher incomes fare better on social determinants of health, such as access to health care, nutritious food, and the time and equipment to exercise. However, some may argue that no matter one’s income, a person can still make healthy choices about food and activity level, and so improve their odds of good health outcomes. I decided to analyze a dataset about diabetes diagnosis and other lifestyle information to see if there is a relationship between diabetes and income level. Based on these results, we can make inferences about the correlation between income and health, as certain types of diabetes are typically seen as the result of poor health behaviors.

Stakeholders interested in this question are likely those working in public-facing roles, whether in political positions, government agencies, or hospitals. The relationship between income and prevalence of diabetes may help to inform decisions about where to allocate funds for health initiatives. For example, if a certain income level is associated with a higher prevalence of diabetes, government workers may set nutritional or agricultural policies that promote health in areas known to have many people within that income bracket. There are also programs aimed specifically at diabetes, such as the American Diabetes Association (ADA) Education Recognition Program, that may be interested in having a clearer idea of where to target their efforts. Decision makers may want to direct efforts like health education campaigns and nutritional resources towards areas with higher concentrations of people who are more likely to have diabetes.

The data I decided to analyze records diabetes status in three classes: 0 means no diabetes (or diabetes only during pregnancy), 1 means prediabetes, and 2 means diabetes. The other variables cover high blood pressure, high cholesterol, cholesterol checks, BMI, cigarette usage, stroke history, heart disease, physical activity, consumption of fruits and vegetables, alcohol consumption, health care coverage, ability to pay for doctor appointments, general health, mental health, physical health, difficulty walking, sex, age, education, and income level. For many of these factors, such as consumption of fruits and vegetables and health care coverage, I would assume that income plays a role in access. This is why I want to study only the correlation between income and diabetes; the result could provide a broader idea of whether the ability to spend more money on a healthy lifestyle truly is related to health status. In this dataset, income is coded in brackets from 1 to 8, where 1 is the lowest bracket (less than $10,000) and 8 is the highest ($75,000 or more).
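To make the coding concrete, here is a small sketch of how the codes described above could be turned into readable labels. The dictionary and function names are my own for illustration, not anything from the dataset. Only the labels stated in this article are filled in, so income brackets 2 through 7 are left generic.

```python
# Codes as described in the article. The names DIABETES_LABELS, INCOME_LABELS,
# and describe() are hypothetical helpers, not part of the dataset.
DIABETES_LABELS = {
    0: "no diabetes (or only during pregnancy)",
    1: "prediabetes",
    2: "diabetes",
}

# Only the two endpoint brackets are given in the article; the dollar ranges
# for brackets 2-7 are intentionally not filled in here.
INCOME_LABELS = {
    1: "less than $10,000",
    8: "$75,000 or more",
}


def describe(diabetes_code, income_code):
    """Return a human-readable description of one survey response."""
    status = DIABETES_LABELS[diabetes_code]
    income = INCOME_LABELS.get(income_code, f"income bracket {income_code}")
    return f"{status}, {income}"


print(describe(2, 1))
print(describe(0, 5))
```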

I obtained this data from Kaggle. The dataset I used comes from a health-related telephone survey called the Behavioral Risk Factor Surveillance System (BRFSS). This survey is conducted annually by the CDC and collects responses from over 400,000 Americans on health topics. This dataset in particular holds responses from 441,455 individuals and has 330 features, including direct questions and calculated variables. I decided to use the version with 3 classes of diabetes so that I could better visualize how prevalence of diabetes is spread across income levels. In particular, I thought it would be interesting to see whether income differs between those who are diabetic and those who are prediabetic. According to an article by Reuters, lifestyle changes can help keep prediabetes from progressing to diabetes, or even return blood sugar to normal levels. A correlation between income and rates of diabetes or prediabetes could indicate something about the impact income has on people’s ability to take preventative measures when a health risk is imminent.

I analyzed the income and diabetes variables in a Jupyter Notebook. I started by making a bar chart of the median income bracket for each diabetes classification. The median income bracket for those without diabetes was 7, whereas the median for both the prediabetic and diabetic groups was 6.
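For readers who want to try this kind of comparison, here is a minimal sketch of computing the median income bracket per diabetes class. It is not the code from my notebook: the rows are made-up toy values rather than BRFSS responses, the function name is hypothetical, and it uses only the Python standard library so it runs anywhere.

```python
from collections import defaultdict
from statistics import median

# Made-up (diabetes_class, income_bracket) pairs, for illustration only.
toy_rows = [
    (0, 8), (0, 7), (0, 6),
    (1, 6), (1, 5), (1, 7),
    (2, 6), (2, 4), (2, 7),
]


def median_income_by_class(rows):
    """Group income brackets by diabetes class and take each group's median."""
    groups = defaultdict(list)
    for diabetes_class, income in rows:
        groups[diabetes_class].append(income)
    return {cls: median(values) for cls, values in sorted(groups.items())}


print(median_income_by_class(toy_rows))
```

Because income is recorded as an ordered bracket rather than a dollar amount, the median here is a bracket number, not a salary.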

I then reduced the data to only the columns for diabetes and income. Once this was done, I created a histogram comparing the ‘percent per unit’ of each diabetes class across income levels. By far, the largest share of people without diabetes (over 50%) fell in the highest income group, followed by over 20% in the second highest income group. In these two income groups (as well as group 6), prediabetes was just barely above diabetes in terms of percent per unit. At all other income levels, diabetes had the highest percent per unit wherever data was available.
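The idea behind that histogram can be sketched as follows: within one diabetes class, what percentage of people fall into each income bracket? This is an illustrative sketch with made-up rows and a hypothetical function name, not my notebook code. It assumes each income bracket is one unit wide, in which case the percent per unit is simply the percent of the group in that bracket.

```python
from collections import Counter

# Made-up (diabetes_class, income_bracket) pairs, for illustration only.
toy_rows = [
    (0, 8), (0, 8), (0, 7), (0, 5),
    (2, 8), (2, 3), (2, 3), (2, 5),
]


def percent_by_income(rows, diabetes_class):
    """Percent of one diabetes class that falls in each income bracket."""
    incomes = [income for cls, income in rows if cls == diabetes_class]
    counts = Counter(incomes)
    total = len(incomes)
    return {income: 100 * counts[income] / total for income in sorted(counts)}


for cls in (0, 2):
    print(cls, percent_by_income(toy_rows, cls))
```

Normalizing within each class matters because the classes are very different sizes; raw counts would make the largest class dominate every income bracket.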

I then looked at the mean income bracket for each diabetes label. I found means of 6.21 for no diabetes, 5.35 for prediabetes, and 5.21 for diabetes. After seeing this difference, I began a hypothesis test on the data, with the null hypothesis that there is no relationship between income and prediabetes/diabetes and the alternative hypothesis that there is a relationship between income and prediabetes/diabetes.
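One common way to test hypotheses like these is a permutation test on the difference in group means. The sketch below shows that technique as an assumption on my part about how such a test could be set up; it is not a reproduction of the notebook. The two groups hold made-up income brackets, and the function name, number of permutations, and seed are all hypothetical choices.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the absolute difference in group means.

    Returns the observed difference and the fraction of shuffles whose
    difference is at least as large (the empirical p-value).
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the labels are interchangeable,
        # so shuffle the pooled values and re-split them.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    return observed, at_least_as_extreme / n_permutations


# Made-up income brackets, for illustration only.
no_diabetes = [8, 8, 7, 8, 6, 7, 8, 5, 8, 7]
prediabetes_or_diabetes = [5, 6, 4, 7, 3, 6, 5, 8, 4, 6]

observed, p_value = permutation_test(no_diabetes, prediabetes_or_diabetes)
print(f"observed difference in means: {observed:.2f}")
print(f"empirical p-value: {p_value:.4f}")
```

With a simulation-based test like this, the empirical p-value can come out as exactly 0 when none of the shuffles produces a difference as large as the observed one, which is more likely with a very large sample.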

My hypothesis test produced a p-value of 0, which is below the alpha level of 0.05, so we can reject the null hypothesis. We can therefore conclude that there is a relationship between income and diagnosis of diabetes or prediabetes. Looking at the earlier statistical summary, higher income appears to be associated with not having diabetes, and prediabetes seems to be more prevalent than diabetes in the higher income groups as well. However, we should run tests looking specifically at prediabetes versus diabetes in terms of income to get a better understanding. Our findings may be due to the fact that people with higher incomes can afford healthier lifestyles, but we would need to do additional research on variables such as spending habits to determine this. This data may also be biased because it comes from a telephone survey, so only those with access to a telephone and the time to participate could provide answers. This may exclude certain groups, potentially those with lower incomes, from the data. Despite these shortcomings, our findings indicate that agencies working on health initiatives may want to aim campaigns at lower income areas, since lower income is associated with higher rates of diabetes.

Here is the GitHub link to an HTML of my Jupyter Notebook: https://github.com/kirstinnichols/INST414/blob/f07cfdd24c7571364bb35e625267a1e9e17a52ac/assignment1%20(1).html
