Read For A Better Future!
Have you ever wondered how education and literacy influence economic and social outcomes? My hypothesis is that higher levels of education and literacy are associated with higher employment and income levels and lower occurrences of social issues, such as crime and poverty. This hypothesis rests on the assumption that individuals who excel at reading and writing are better able to get jobs, make an honest living, and stay out of trouble. This research article seems to agree with my hypothesis. Its authors conducted a systematic study, tracking participants’ academic performance while they were in school and following up later in life to see how they fared financially.
The University of Wisconsin’s County Health Rankings is an excellent source for a large number of county-level variables. Specifically, the dataset covers the 3000+ counties in the United States and reports close to 700 variables for each county. For my analysis, I picked the variables “Reading scores raw value” and “Math scores raw value” to represent education and literacy levels. I picked “Median household income raw value” and “Unemployment raw value” to represent economic indicators. For social outcomes, I picked “Disconnected youth raw value”, “Juvenile arrests raw value”, “Children in poverty raw value”, and “Violent crime raw value”. I am assuming that these eight variables are reasonable proxies for the outcomes I want to study; other choices are possible.
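To make the variable selection concrete, here is a minimal sketch in Python using pandas. The column names are the ones quoted above; everything else is an assumption for illustration: the helper name `select_variables` is hypothetical, and the tiny in-memory CSV stands in for the real download and contains made-up numbers, not actual county values.

```python
import io

import pandas as pd

# Column names as quoted in the post.
EDUCATION = ["Reading scores raw value", "Math scores raw value"]
ECONOMIC = ["Median household income raw value", "Unemployment raw value"]
SOCIAL = [
    "Disconnected youth raw value",
    "Juvenile arrests raw value",
    "Children in poverty raw value",
    "Violent crime raw value",
]
SELECTED = EDUCATION + ECONOMIC + SOCIAL


def select_variables(csv_source):
    """Read a county-level CSV and keep only the eight variables of interest.

    Hypothetical helper: `csv_source` can be a file path or a file-like object.
    """
    df = pd.read_csv(csv_source, usecols=SELECTED)
    # Coerce every column to numeric so stray text becomes NaN instead of
    # silently turning a whole column into strings.
    df = df.apply(pd.to_numeric, errors="coerce")
    return df[SELECTED]


# Toy stand-in for the real file: three made-up counties, one missing value.
header = ["Name"] + SELECTED
rows = [
    ["County A", 3.2, 3.1, 60000, 0.04, 0.05, 20, 0.12, 300],
    ["County B", 2.8, 2.7, 45000, 0.06, 0.09, 35, 0.25, 450],
    ["County C", 3.0, "", 52000, 0.05, 0.07, 28, 0.18, 380],
]
toy_csv = "\n".join(",".join(str(v) for v in row) for row in [header] + rows)

counties = select_variables(io.StringIO(toy_csv))
print(counties.shape)
print(counties.isna().sum())
```

Counting missing values per column right after loading is a cheap check worth doing here, since not every county necessarily reports every measure.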
The distribution of each variable and the relationship between each pair of variables are shown in the chart below, which is based on the United States county data from County Health Rankings. The histograms for each variable are shown along the diagonal, and each distribution appears roughly normal. Each scatter plot shows the relationship between one pair of variables, and each point in a scatter plot represents one particular county. By plotting data about the 3000+ counties on each scatter plot, we can look for patterns and correlations.
As the scatter plots show, some pairs of variables have a positive relationship (they move in the same direction), others have a negative relationship (they move in opposite directions), and a few pairs do not have any obvious relationship. The strongest positive relationship is between Reading and Math scores, which makes sense intuitively since they both measure literacy and education. The strongest negative relationship is between Median household income and Children in poverty. This also makes sense intuitively, since counties with higher income would likely have a lower percentage of children living in poverty.
Another useful way to visualize the relationships between variables is a heatmap, as shown in the chart below. The variables of interest are plotted on each axis to form a matrix, where each cell represents the correlation between the variable on the x-axis and the variable on the y-axis. Yellow represents a strong positive relationship, while dark blue represents a strong negative relationship. Roughly speaking, the cells in the top left quadrant of the heatmap are yellow and green because those variables (Reading scores, Math scores, and Median household income) tend to move in the same direction. The same is true for the cells in the bottom right quadrant: unemployment, poverty, crime, and arrests are expected to follow the same trend, meaning that where one is higher, the others tend to be higher too. On the other hand, the cells in the top right and bottom left are shades of blue, indicating a negative relationship. This also makes sense intuitively. For example, in counties with higher Reading or Math scores, we would expect to see lower unemployment, crime, or poverty.
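The matrix behind a heatmap like this is just a table of pairwise correlation coefficients. Here is a minimal sketch of how such a table can be computed with pandas, assuming Pearson correlation (the default for `DataFrame.corr`). The numbers are toy values for six hypothetical counties, not real County Health Rankings data, and the short column names are my own shorthand.

```python
import pandas as pd

# Toy values for six hypothetical counties; NOT real data.
toy = pd.DataFrame(
    {
        "Reading": [2.7, 2.9, 3.0, 3.1, 3.3, 3.4],
        "Math": [2.6, 2.8, 3.0, 3.0, 3.2, 3.5],
        "Income": [41000, 48000, 47000, 56000, 60000, 72000],
        "Child poverty": [0.30, 0.24, 0.22, 0.17, 0.15, 0.09],
    }
)

# Pairwise Pearson correlations: a symmetric matrix with 1.0 on the diagonal.
corr = toy.corr()

# Keep each pair once (upper triangle only) and look up the extremes.
pairs = {
    (a, b): corr.loc[a, b]
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
}
strongest_positive = max(pairs, key=pairs.get)
strongest_negative = min(pairs, key=pairs.get)

print(corr.round(2))
print("strongest positive pair:", strongest_positive)
print("strongest negative pair:", strongest_negative)
```

A plotting function such as seaborn's `heatmap` can then color each cell of `corr`. Scanning the upper triangle for the largest and smallest values is a quick way to confirm what the colors suggest.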
After looking at the scatter plot matrix and the heatmap above, it is now time to zoom in. First, we look at a pair of variables with a positive relationship, and then at a pair with a negative relationship. The chart below shows the positive relationship between Reading scores and Median household income. Counties with higher Reading scores also tend to have higher Median household incomes. We should be careful not to conclude that higher Reading scores cause higher incomes: correlation is NOT causation. There are many possible explanations for this correlation. Perhaps higher household incomes lead to higher Reading scores, since families in those counties are more educated and place more value on their children's education. There could also be an indirect relationship between these variables. For example, higher incomes may result in something else that in turn results in higher Reading scores. The correlation coefficient for these two variables is 0.41, which represents a moderate positive correlation.
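For readers who want to see what goes into a correlation coefficient like the 0.41 above, here is a minimal sketch of Pearson's r in plain Python. I am assuming Pearson's r is the coefficient being reported; the function name `pearson_r` and the toy lists are mine, for illustration only.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance of x and y divided by the product of
    their standard deviations (the 1/n factors cancel, so sums suffice)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / math.sqrt(var_x * var_y)


# Toy checks: a perfectly increasing line and a perfectly decreasing one.
perfect_up = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
perfect_down = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])

# Squaring r gives the share of variance a straight-line fit would explain.
share_explained = round(0.41 ** 2, 2)

print(perfect_up, perfect_down, share_explained)
```

Squaring the coefficient is a handy sanity check on wording: 0.41 squared is about 0.17, so a straight-line fit on Reading scores would account for roughly 17% of the county-to-county variation in income, leaving most of it to other factors.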
The following chart shows the negative relationship between Reading scores and Disconnected youth. Counties with higher Reading scores tend to have a lower proportion of Disconnected youth. As mentioned earlier, correlation should not be equated with causation. Perhaps counties with a lower proportion of Disconnected youth have higher Reading scores because a strong sense of family places high importance on education. Another explanation could be that counties with higher Reading scores have more successful youth, who therefore feel less disconnected. Or there could be an indirect relationship between the two variables. The correlation coefficient for these two variables is -0.35, which represents a moderate negative correlation, slightly weaker than the one between Reading scores and Median household income.
The analysis in this post examined the correlations between variables that measure literacy, financial well-being, and social issues across the counties of the United States. Counties that have higher Reading and Math scores also tend to have higher Median household incomes and lower unemployment. They also tend to have lower occurrences of social issues such as poverty, crime, arrests, and disconnected youth.