Sometimes it’s fun to poke around in odd bits of data and see what interesting information it’s possible to extract. All kinds of data sit out there in the open for the taking; data.gov is a favorite go-to of mine for playing around.
Of course, with any data, and especially openly distributed data, there is a chance that the data are malformed, missing fields, inconsistent, or easily misinterpreted.
When I stumbled on the NEH grants dataset and started poking into its XML file for recent grants, I nearly missed an obvious issue within minutes. NEH releases its data by decade and updates it monthly, so when you compute averages, you might not notice that the result is skewed significantly by the absence of any data in the 2019 column as of October 2018, for obvious reasons!
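To make that pitfall concrete, here is a minimal Python sketch. The yearly counts are made-up numbers, not NEH figures, and the function name is my own invention; the only point is that dividing by every year in the file, including one with no data yet, drags the average down.

```python
# Hypothetical yearly grant counts (NOT real NEH figures).
# 2019 exists as a column but has no data yet.
grants_per_year = {2016: 900, 2017: 1000, 2018: 800, 2019: 0}

def mean_per_year(counts, skip_empty=False):
    """Average the yearly counts, optionally ignoring years with no data."""
    values = [n for n in counts.values() if n > 0 or not skip_empty]
    return sum(values) / len(values)

naive = mean_per_year(grants_per_year)                      # divides by 4 years
careful = mean_per_year(grants_per_year, skip_empty=True)   # divides by 3 years
print(naive, careful)
```

With these invented numbers, the naive average comes out a full quarter lower than the careful one, simply because of the empty year.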
After a bit more poking around, and a more careful attempt to understand the data for what it is, I came up with the following bits of information.
A very small number of words account for the large majority of the focus of NEH grant titles (excluding filler words like “a”, “the”, etc.):
Looking at this same data another way makes the trend even clearer:
Examining the grant descriptions provides similar results, but also some new key terms:
The graph of the descriptions shows an even stronger concentration on a few terms than the graph of the titles does.
(Granted, the most commonly used term is “project,” which is less than surprising.)
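For anyone who wants to try the same kind of word count, here is a rough Python sketch of the idea. The sample titles and the short stop-word list are invented for illustration, and this is not necessarily how the charts above were produced; a real run would read the titles or descriptions out of the NEH file and use a fuller list of filler words.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "and", "in", "for", "to", "on"}

def top_terms(texts, n=10):
    """Count lowercase words across all texts, skipping stop words."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)

# Made-up titles for illustration only (not taken from the dataset).
sample_titles = [
    "A History of the American West",
    "The History of Printing in America",
    "Digital Archives for American History",
]
print(top_terms(sample_titles, 3))
```

Note that this simple version treats “america” and “american” as different words; whether to merge such variants is a judgment call that changes the counts.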
Mapping out the organizations receiving NEH grants introduces an interesting visual element. It can be difficult to clearly demarcate the true layout due to clustering, but an interesting view results nonetheless:
This map was created with Google Docs and the FusionTables plugin, in case you’re interested.
Finally, we can perform a simple breakdown of grants awarded by year. As 2018 has not yet closed at the time of writing, the bar for that year should be taken with a grain of salt.
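A per-year tally like this is only a few lines of Python. The sketch below uses an invented list of award years (one entry per grant) and a hypothetical partial_year parameter, so that the incomplete year is flagged rather than silently compared with full ones.

```python
from collections import Counter

def grants_by_year(years, partial_year=None):
    """Tally grants per year and mark the year that is still in progress."""
    counts = Counter(years)
    return [
        (year, counts[year], year == partial_year)
        for year in sorted(counts)
    ]

# Made-up award years, one entry per grant (not real NEH data).
sample_years = [2016, 2017, 2017, 2018, 2016, 2017]
for year, total, partial in grants_by_year(sample_years, partial_year=2018):
    note = " (partial year)" if partial else ""
    print(f"{year}: {total}{note}")
```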
Whether any of these findings extend to other decades, or to other types of grants and other agencies, is hard to say. Obviously, some aspects of this data correlate strongly with the purpose for which they were created: it’s natural, for instance, that many NEH-funded project proposals would include the term “history,” whereas NIH grants might have a very different focus. Yet it’s interesting to observe how certain terms cluster and how grants are awarded.
There are always plenty of learning opportunities lurking in the piles of data, as long as you’re willing to put in the work — and not make too many assumptions!