The old adage from Émile Borel’s 1913 analogy claims that an infinite number of monkeys with an infinite number of typewriters and an infinite amount of time could eventually replicate all the works of Shakespeare. Today’s cheap and readily accessible computing power won’t prove this true yet — but with machine learning it feels really possible!
To date, I’ve taken on some difficult subjects with machine learning. From fantasy football to chronic diseases — from start-up companies to global terrorism — machine learning has brought some great insights for myself and my clients.
Let’s tackle another difficult issue: poverty. Our goal is to determine the best predictors of higher or lower poverty rates across counties in the United States.
I’m using data from census.gov married to the county characteristics political database I used when discussing the 2016 presidential election.
Here’s the routine I intend to take (using our excellent tools at BigML.com):
- Process all of the fields into a dataset with poverty as a % of the county population as the predictive target field.
- Run some basic correlations based on some of my expectations: median income, race, population size, education level.
- Split the dataset into a training set and a testing set (20% held out for testing).
- Build a simple random forest model out of the training dataset and evaluate against the test dataset.
- Trim down a new dataset, filtering out the “noise.”
- Build another model and a boosted tree ensemble of models and evaluate for performance against the test dataset.
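If you’d rather see the split step outside of the BigML interface, here’s a minimal sketch in plain Python. It assumes the counties are already loaded as a list of rows; the `split_rows` helper, the seed, and the toy `counties` list are my own illustration, not anything BigML exposes.

```python
import random

def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and return (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical stand-in for the county dataset
counties = [{"county_id": i} for i in range(10)]
train, test = split_rows(counties)
print(len(train), len(test))
```

The fixed seed matters more than it looks: every model later on has to be evaluated against the same held-out counties for the scores to be comparable.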
Let’s go through a few simple correlations using the dynamic scatterplot tool to understand what influences poverty rate. (The tool is the upper-left button on the dataset page.)
First, poverty rate (on the y-axis) compared to county population (on the x-axis). We’re looking for the 2 numbers at the bottom to get closer and closer to 1. Here they don’t, so the relative size of the county population is not a good indicator of the poverty rate.
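For the curious, here’s a rough sketch of what two such correlation scores could look like in code. I’m assuming the two numbers are the Pearson and Spearman coefficients (the post doesn’t name them), and the sample values are made up for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: strength of the linear relationship."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Made-up values, NOT the real county data
population = [12_000, 48_000, 95_000, 310_000, 1_200_000]
poverty_rate = [18.0, 14.5, 21.0, 12.0, 16.5]
print(round(pearson(population, poverty_rate), 2),
      round(spearman(population, poverty_rate), 2))
```

Values near 1 or -1 mean the two fields move together; values near 0 mean one tells you little about the other.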
Next, let’s look at median age and median household income.
Here’s our first relatively strong correlation. This makes logical sense. In truth, we’re going to find that this is data noise… a pretty direct correlation to poverty because you’re measuring relatively similar data points. What we’re after, however, are data points that aren’t directly related.
Next we can look at race, showing the % of white and black populations in the counties relative to poverty. There’s a story to be told about the inverse relation of these data comparisons (higher % white population potentially indicates a lower poverty rate and vice versa for % black) but the score is too low to draw any conclusions.
Next we can look at education. Here there seems to be a stronger correlation, but still not strong enough to qualify as a key factor. More and more I’m thinking we’re gonna need the random forest to model this.
If a county has a higher % of uninsured residents, it is more likely to have a higher poverty rate.
But sometimes our assumptions can be very misleading. For example, the highest age correlation with poverty was the group ages 20–24, and while higher education typically indicated lower poverty, the chart flattens out when it gets to the counties with the highest % of graduate degree holders. As economist Thomas Sowell noted several decades ago, college towns (like Stanford) have lower incomes and sometimes higher poverty rates because of “starving students.” These same poverty-level folks will make 6 figures in the years to come.
Employment status and labor force participation rate, as you might expect, have a moderate correlation, but it’s still below .7.
Let’s cut to the chase and get into the models. Right away we can see that our assumption around median household income is correct. The first tree uses it over and over again to bookend the highs and lows. We’ll filter that out.
I also filtered out anything that indicated a higher % of financial payouts associated with poverty. Let’s run a simple model again and run a summary (the clipboard button with the down arrow at the top of the model).
I could show you the tree but I can also show you the top fields ranked by importance:
1. Employment_%: 37.90%
2. Married_%: 17.96%
3. Disability_%: 6.17%
4. Ed9_12_%: 4.97%
5. White_%: 3.33%
6. EdBachelorDegree_%: 2.51%
7. Age20_24_%: 2.41%
8. MedianHousingCosts: 2.10%
9. Uninsured_%: 1.93%
10. EdK8_%: 1.51%
Employment status and marriage move to the top with disability topping 5%. Interesting.
Now I can evaluate the models against the testing dataset (the 20% that was set aside earlier):
The key number is in the lower left — our R squared result: .52
That’s not gonna cut it. Too low.
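If R squared is new to you, here’s a small sketch of the standard formula: 1 minus the model’s squared error divided by the squared error you’d get from always guessing the average. The function name and sample numbers are mine, for illustration only.

```python
def r_squared(actual, predicted):
    """Share of the variation in `actual` that the predictions explain."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # model error
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)           # "guess the mean" error
    return 1 - ss_res / ss_tot

# Made-up poverty rates and predictions for five counties
actual = [12.0, 18.5, 25.0, 9.5, 16.0]
predicted = [13.0, 17.0, 22.5, 11.0, 16.5]
print(round(r_squared(actual, predicted), 2))
```

A score of 1 means perfect predictions, and 0 means the model does no better than guessing the average poverty rate for every county.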
Next I ran a basic ensemble model with 10 trees. Basically, it builds the model 10 different ways and then meshes the results together.
Much better! .75 is a solid number:
Small tweaks around field importance and moving uninsured into the top 5 fields seem to have done the trick:
Field importance from the ensemble models
1. Employment_%: 36.08%
2. Married_%: 17.22%
3. Disability_%: 4.92%
4. Ed9_12_%: 4.85%
5. Uninsured_%: 4.32%
6. White_%: 3.60%
7. EdBachelorDegree_%: 2.26%
8. EdK8_%: 2.19%
9. MedianHousingCosts: 1.76%
10. Age20_24_%: 1.72%
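To make the “10 different ways” idea concrete, here’s a toy sketch of a bagging-style ensemble. I’m assuming that’s roughly what the basic ensemble does: each tree is trained on a random resample of the rows and the predictions are averaged. The one-split “stump” and the sample data are simplifications of mine, not BigML’s actual trees.

```python
import random

def fit_stump(rows):
    """Fit a one-split regression tree on (x, y) pairs and return a predictor."""
    overall = sum(y for _, y in rows) / len(rows)
    xs = sorted({x for x, _ in rows})
    best_err, best = float("inf"), None
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2  # candidate threshold between neighboring x values
        left = [y for x, y in rows if x <= t]
        right = [y for x, y in rows if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if err < best_err:
            best_err, best = err, (t, lm, rm)
    if best is None:  # only one distinct x value: fall back to the mean
        return lambda x: overall
    t, lm, rm = best
    return lambda x: lm if x <= t else rm

def fit_bagged(rows, n_trees=10, seed=0):
    """Train each tree on a bootstrap resample, then average the predictions."""
    rng = random.Random(seed)
    trees = [fit_stump([rng.choice(rows) for _ in rows]) for _ in range(n_trees)]
    return lambda x: sum(tree(x) for tree in trees) / len(trees)

# Made-up (employment %, poverty %) pairs, NOT the real county data
rows = [(40, 30.0), (45, 28.0), (50, 22.0), (55, 18.0), (60, 12.0), (65, 10.0)]
model = fit_bagged(rows, n_trees=10)
print(model(42), model(63))
```

Because every tree sees a slightly different sample, their individual mistakes tend to cancel out when averaged, which is one plausible reason the score jumped.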
Still, we could have done any of this with good old Excel. Let’s get our machine learning up and learning!
Let’s run the model again but this time we’ll do a boosted tree. The best way to describe a boosted tree ensemble is a series of trees where each new tree is built to correct the errors left by the trees before it. If I asked you to guess how many cats were still alive in a shoebox, then I let you look, then I asked you to make a better guess… that’s analogous to what a boosted tree ensemble model is doing.
My first pass with the default settings at 64 iterations gave us a slight drop to .74 and used about TWICE as many fields in the model (25 fields over 1% importance).
For my next pass I chose to do a boost with 250 iterations. (From the dataset page choose ensemble and change the type to “Boosted Trees”.)
You can also play with the learning rate. Sometimes… to get the best results… it’s a LOT of testing but the machine does all of the heavy lifting in the end.
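Here’s a toy sketch of the boosting idea with the two knobs mentioned above, iterations and learning rate. It’s a simplified gradient-boosting loop for squared error using one-split “stumps”; the helper names and sample data are mine, and BigML’s real implementation is surely more involved.

```python
def fit_stump(rows):
    """Fit a one-split regression tree on (x, y) pairs and return a predictor."""
    overall = sum(y for _, y in rows) / len(rows)
    xs = sorted({x for x, _ in rows})
    best_err, best = float("inf"), None
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in rows if x <= t]
        right = [y for x, y in rows if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if err < best_err:
            best_err, best = err, (t, lm, rm)
    if best is None:
        return lambda x: overall
    t, lm, rm = best
    return lambda x: lm if x <= t else rm

def fit_boosted(rows, iterations=64, learning_rate=0.1):
    """Each new stump is fit to what the earlier stumps still get wrong."""
    base = sum(y for _, y in rows) / len(rows)  # first guess: the average
    residuals = [y - base for _, y in rows]
    stumps = []
    for _ in range(iterations):
        stump = fit_stump([(x, r) for (x, _), r in zip(rows, residuals)])
        stumps.append(stump)
        # Take only a small step toward the correction (the learning rate)
        residuals = [r - learning_rate * stump(x)
                     for (x, _), r in zip(rows, residuals)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def squared_error(model, rows):
    return sum((y - model(x)) ** 2 for x, y in rows)

# Made-up (employment %, poverty %) pairs, NOT the real county data
rows = [(40, 30.0), (45, 28.0), (50, 22.0), (55, 18.0), (60, 12.0), (65, 10.0)]
for n in (1, 64, 250):
    print(n, round(squared_error(fit_boosted(rows, iterations=n), rows), 3))
```

More iterations keep shrinking the training error, which is exactly why the held-out test set matters: a lower training error does not guarantee a better score on counties the model has never seen.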
Running the boosted trees, the result shows the top two fields determining poverty. In this case, MedianAge and Employment:
The lighter color predicts a higher poverty rate, indicating lower employment in relation to Median Age. Here are the fields in order of importance. This one had a score of .79!
1. MedianAge: 6.78%
2. Employment_%: 6.14%
3. Married_%: 5.31%
4. Ed9_12_%: 3.56%
5. Age45_54_%: 2.95%
6. White_%: 2.89%
7. Black_%: 2.70%
8. Female_%: 2.50%
9. Population25Plus_%: 2.33%
10. MedianHousingCosts: 2.32%
11. Uninsured_%: 2.30%
12. EdBachelorDegree_%: 2.03%
13. Disability_%: 2.01%
14. AmericanIndianAlaskaNative_%: 1.98%
15. Age20_24_%: 1.77%
16. NeverMarried_%: 1.68%
17. EdAssocDegree_%: 1.62%
18. Age5_9_%: 1.61%
19. EdK8_%: 1.58%
20. LaborForce_%: 1.52%
21. TotalEmp1980: 1.48%
22. Unemployment_%: 1.41%
23. Age35_44_%: 1.39%
24. EdCollNoDegree_%: 1.26%
25. Male_%: 1.24%
26. Age0_4_%: 1.17%
27. Age60_64_%: 1.15%
28. Separated_%: 1.11%
29. Age10_14_%: 1.07%
30. TotalEmp2015_%: 1.03%
So are we done? Not quite. We know the weighted fields the model uses across the entire span of poverty rates in 3100+ counties, and it explains nearly 80% of the variation in the poverty rate (an R squared of .79). What we don’t know is what drives HIGHER rates of poverty.
Using Excel and the STDEV function across the poverty rate we can determine that the standard deviation of the rates is 6.5%. Think of a mountain chart (a distribution graph) that peaks near the middle (where the highest number of counties are).
So anything to the right of the middle bars is above the average poverty rate (16.7% using these 2015 numbers). Using the standard deviation as a starting point we’ll narrow our list of 3100+ counties to any counties above a 23.3% poverty rate: 443 counties, to be precise.
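The same cutoff logic is easy to sketch outside of Excel. Python’s `statistics.stdev` is the sample standard deviation, the same flavor as Excel’s STDEV. The helper names and the sample rates below are made up for illustration.

```python
import statistics

def high_poverty_cutoff(rates):
    """Mean plus one (sample) standard deviation."""
    return statistics.mean(rates) + statistics.stdev(rates)

def flag_high_poverty(rates):
    """True for every rate above the cutoff."""
    cutoff = high_poverty_cutoff(rates)
    return [rate > cutoff for rate in rates]

# Made-up poverty rates for five counties, NOT the real data
rates = [10.0, 12.0, 14.0, 16.0, 30.0]
print(round(high_poverty_cutoff(rates), 2))
print(flag_high_poverty(rates))
```

The list of True/False flags is the kind of column you can later hand to a model as a category field.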
We can split that dataset again into training and testing sets and run our best model (the 250-iteration boosted tree ensemble). But this would be a mistake. Can you guess why?
Using a rich set of 100+ fields to predict the full span of poverty rates within only those counties will not help us, because we’d just be narrowing our focus to a new, smaller span with far fewer instances of data.
Instead, we’re going to use the original dataset, create a separate field in Excel to flag the 400+ high poverty counties, and use that as the field to predict. Be sure you treat it as a category field, which will give you binary predictions:
89.2% accuracy. I like where this is headed.
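To put an accuracy number like that in context, here’s a small sketch that compares it with the laziest possible model, one that always predicts the most common label. The function names and the toy labels are mine, for illustration only.

```python
def accuracy(actual, predicted):
    """Share of predictions that match the actual labels."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)

def majority_baseline(actual):
    """Accuracy of always predicting the most common label."""
    most_common = max(set(actual), key=actual.count)
    return actual.count(most_common) / len(actual)

# Made-up high-poverty flags for ten counties, NOT the real data
actual = [True, False, False, False, False, False, True, False, False, False]
predicted = [True, False, False, False, False, False, False, False, True, False]
print(accuracy(actual, predicted), majority_baseline(actual))
```

With 443 of the 3100+ counties flagged, always answering “not high poverty” would already score somewhere around 86%, so that’s the bar to read the 89.2% against.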
Here’s the field importance list:
1. Ed9_12_%: 7.56%
2. Married_%: 6.87%
3. Employment_%: 5.43%
4. MedianHousingCosts: 3.81%
5. Uninsured_%: 3.10%
6. EdK8_%: 2.96%
7. Disability_%: 2.80%
8. EdBachelorDegree_%: 2.52%
9. NeverMarried_%: 2.41%
10. LaborForce_%: 2.16%
11. Age20_24_%: 2.03%
12. EdCollNoDegree_%: 2.00%
13. Female_%: 2.00%
14. Age15_19_%: 1.80%
15. Black_%: 1.78%
16. Separated_%: 1.56%
17. AmericanIndianAlaskaNative_%: 1.49%
18. Unemployment_%: 1.38%
19. Age45_54_%: 1.36%
20. White_%: 1.27%
21. MedianAge: 1.26%
22. 2010 Land Area: 1.25%
23. TotalEmp2015_%: 1.17%
24. ANSI Code: 1.11%
25. Age0_4_%: 1.11%
26. Widowed_%: 1.06%
27. EvangelicalAdherents_%: 1.03%
In this best model we can see that issues of marital status, employment, housing, and education have the biggest influence on poverty.
Even though race plays a factor it’s fairly small by comparison. Also curious, I have some 20+ fields on political results and NONE of them are present.
We can see that the youth aspect we saw earlier (Age20–24) reared its head in the mix as well, showing that college counties might have an impact on “poverty.” This supports Sowell’s assertion that poverty is more transient than you might think.
In short, poverty rates indicate counties that have both generational poverty issues and counties where life is in start-up mode and income is just low. Education levels are pretty darn important in these considerations.
We’ll do some more soon!