User-Generated Content in Rural Areas

Preamble: as I begin writing more summative documents about my research (proposal + dissertation), I have been thinking a lot about some of my previous research projects and what has stuck with me about them. At the same time, I have always said that I wanted more accessible synopses of this work.


Starting with two papers from CHI 2016 about measuring representation online:

Isaac Johnson, Allen Yilun Lin, Toby Jia-Jun Li, Andrew Hall, Aaron Halfaker, Johannes Schöning, and Brent Hecht. Not at Home on the Range: Peer Production and the Urban/Rural Divide

Isaac Johnson, Subhasree Sengupta, Johannes Schöning, and Brent Hecht. The Geography and Importance of Localness in Geotagged Social Media

These two papers examine very different kinds of online communities (peer production vs. social media), but both raise important questions about how we measure online representation. Both also weigh the idealistic view of user-generated content, in which local individuals converse and contribute knowledge of their own communities, against a reality in which non-local contributors or bots produce a sizable proportion of the content.

This was by no means the first work to attempt to measure representation across these communities, but we were concerned that the rich qualitative literature on online representation and the promise of user-generated content was not being leveraged when it came to quantifying representation. Research and surveys often focused on the number of users or contributions as opposed to the salience or richness of these contributions. We felt that representation online had to be measured in the context of how these contributions were being “consumed” by users, researchers, or algorithms.

In Peer Production and the Urban/Rural Divide, we examine not just the quantity of Wikipedia and OpenStreetMap content in urban and rural communities, but also the process through which it was generated and the resulting quality of that content. We find large amounts of content in rural areas, but a disproportionate amount of it is produced through rapid edits or by bots that pull from census data. That is, very little of this content is produced by local contributors who might have a connection to these communities (and who, as we found, make “more interesting” edits). While quality census data and power editors make for a lot of data about rural areas in the United States (yay!), we find that the quality of this content is extremely low. The quantity itself is non-existent when we look to China, where there is a lack of governmental data to support content generation.
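To make the idea of "how much of this content is human-generated" concrete, here is a minimal sketch in Python. It is not the measure from the paper: the `Revision` fields, the byte-counting rule, and the example numbers are all hypothetical, and it simply treats content as human-generated when it comes from neither a bot nor a semi-automated tool.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Revision:
    """One edit to an article. All fields are hypothetical, for illustration only."""
    editor: str
    bytes_added: int
    is_bot: bool                 # assumed flag: edit made by a bot account
    used_semi_automated: bool    # assumed flag: edit made via a semi-automated tool


def human_content_share(revisions: Iterable[Revision]) -> float:
    """Return the fraction of added bytes that came from fully manual edits.

    Removals (negative byte counts) are ignored, which is a simplification.
    """
    revisions = list(revisions)
    total = sum(max(r.bytes_added, 0) for r in revisions)
    if total == 0:
        return 0.0
    human = sum(
        max(r.bytes_added, 0)
        for r in revisions
        if not r.is_bot and not r.used_semi_automated
    )
    return human / total


# Made-up example: a bot seeds the article from census data,
# a semi-automated pass tidies it, and one person adds some history.
EXAMPLE_ARTICLE = [
    Revision("census_bot", 3000, is_bot=True, used_semi_automated=False),
    Revision("power_editor", 500, is_bot=False, used_semi_automated=True),
    Revision("local_resident", 500, is_bot=False, used_semi_automated=False),
]

if __name__ == "__main__":
    print(human_content_share(EXAMPLE_ARTICLE))
```

Under these assumptions, an article can look substantial by length while only a small fraction of it was written by hand, which is the pattern described above for rural areas.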

In Localness of Social Media, we look at Flickr, Twitter, and Foursquare in the United States and find similar patterns. Rural areas have lower contribution rates than urban areas, and this disparity is especially evident when we look at the proportion of users who are “local” to the area, as opposed to passing through. While plenty of tourists visit cities, cities also have many more local individuals using social media. For a rural community, though, people tweeting from a highway that passes through the county can easily outnumber the residents’ own contributions. The importance of this is highlighted when we look at how researchers might study these populations via social media. We replicate a method for inferring the happiness of a community from its tweets and show that this method draws false conclusions if we do not remove non-local contributions. For instance, angry highway commuters (Iroquois County, IL, population 30K and Interstate 57) or happy hikers (Mariposa County, CA, population 18K and an entrance to Yosemite National Park) can overshadow the people who live in rural areas. And even in cities (e.g., Baltimore and St. Louis), we saw more serious conversations being drowned out by talk of birthdays and sports from those visiting.
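The effect of removing non-local contributions can be sketched with a toy example in Python. Everything here is an assumption made for illustration: the "plurality" rule (a user is local to the county where they post most often) is just one simple way to define localness, the word scores are invented, and the posts are made up. It is not the method or data from the paper.

```python
from collections import Counter, defaultdict

# Hypothetical word-level happiness scores (higher = happier).
WORD_SCORES = {"happy": 8.0, "beautiful": 7.5, "traffic": 3.0, "stuck": 2.5}


def home_county(posts):
    """Assign each user to the county where they posted most often (plurality rule)."""
    counts = defaultdict(Counter)
    for user, county, _text in posts:
        counts[user][county] += 1
    return {user: c.most_common(1)[0][0] for user, c in counts.items()}


def county_happiness(posts, locals_only):
    """Average the scores of known words per county, optionally keeping only local users."""
    homes = home_county(posts)
    scores = defaultdict(list)
    for user, county, text in posts:
        if locals_only and homes[user] != county:
            continue  # skip posts from users who are just passing through
        for word in text.lower().split():
            if word in WORD_SCORES:
                scores[county].append(WORD_SCORES[word])
    return {county: sum(vals) / len(vals) for county, vals in scores.items()}


# Made-up posts as (user, county, text): a commuter who mostly posts from
# county_a drives through rural county_b, where one resident also posts.
EXAMPLE_POSTS = [
    ("commuter", "county_a", "happy"),
    ("commuter", "county_a", "happy"),
    ("commuter", "county_b", "stuck traffic"),
    ("resident", "county_b", "beautiful"),
    ("resident", "county_b", "happy"),
]

if __name__ == "__main__":
    print(county_happiness(EXAMPLE_POSTS, locals_only=False))
    print(county_happiness(EXAMPLE_POSTS, locals_only=True))
```

In this toy data, the rural county looks noticeably less happy when the commuter's posts are included than when only the resident's posts are counted, which mirrors the highway example above in miniature.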

Thus, across these two works, we see that bots and non-local contributors are incredibly important to content production in rural areas. This is probably fine in the context of ensuring that there is some record of these areas, or for earthquake detection via Twitter, where it matters little who tweets that the ground is shaking. But when it comes to studying communities or producing more interesting and varied content about these areas, there is often no replacement for local contributors and, in these situations, representation is lacking.

Figure: While rural counties (top-left; light pink = rural) have more English Wikipedia articles per capita than urban counties (top-right), a lower proportion of the content in these articles is human-generated (bottom-left; human content = not produced by bots or through semi-automated tools like AutoWikiBrowser), and geotagged social media content is less likely to be produced by local individuals (bottom-right).

How should we approach the fact that 15+ years into Wikipedia and 10+ years into OpenStreetMap, certain rural areas of the United States still have almost no contributions outside of bots or semi-automated editors?

Ideally, we would see greater outreach to these rural communities and focused editing efforts. For instance, WikiProject Women Scientists has made huge gains and WikiProject Systemic Bias raises similar concerns about under-covered areas. These approaches would have to be massively successful to work for rural areas, however, due to the structural challenge of low population density: a relative lack of contributors in rural areas and a huge number of towns, as we outlined at the 2015 AAAI Spring Symposium.

Here, then, are just a few complementary approaches that I can see:

Restructuring articles: one might argue that census data is enough. I’d argue you’re wrong. After many hours of work, which required asking my wonderful mother to do some research for me at the local historical society and digging through newspaper archives, I wrote two paragraphs on the history of my town of 173. That single section, the only non-governmental data in the entire article as of July 2018, provides insight into how these rural communities were established across Pennsylvania, workers’ cooperatives, the struggle for women’s equality, and the slow removal of governmental services from rural areas. But it is true that the article for, say, Chicago (population 2.7M) will naturally be much richer than the article for my small town. Wikipedia has made adaptations for these large topics. To ensure that the article for Chicago is not too long, some sections, like the one on the history of Chicago, link to sub-articles that go into more depth about various aspects of that history. The opposite challenge arises with rural areas: even though nearby communities often share history and characteristics, the information can be spread out across articles about many small population centers. While templates at the end of the articles provide links to all of the communities in the same county, these are not specific to the topics that a reader might be looking for. Instead of sub-article links, we might consider super-article links within rural articles that, for example, link to the history of the county under the history section for the town. This approach would not necessarily increase the amount of content, but it could help surface the content that already exists.

More bots: the initial bots that have populated articles about rural areas have focused on population demographics and political boundaries such as neighboring counties. Expanding this to include, for example, USDA statistics about major crops would provide additional information about rural counties. Likewise, AI-assisted mapping within OpenStreetMap (e.g., road tracing) has shown promise for increasing coverage of basic features in under-mapped regions. These approaches are limited, though, in their ability to provide richer information, such as the history of a town on Wikipedia or amenity details on OpenStreetMap.

Support from adjacent communities: Wikipedia does not exist within a bubble. In research at CHI 2018 (Examining Wikipedia with a Broader Lens), my colleague Nick Vincent led a paper in which we explored how Wikipedia relates to Reddit and Stack Overflow. Using causal analysis methods, we ascertained that while Wikipedia content was of great value to both Reddit and Stack Overflow, there was little evidence that these external links to Wikipedia were having a substantive impact on readership or editing within Wikipedia. A bright spot in this research, though, is that this was not fully true for low-quality Wikipedia content: links to low-quality Wikipedia content led to significant increases in new edits for these articles. This suggests that while re-use of Wikipedia content outside of Wikipedia might not broadly increase direct engagement with Wikipedia, it could have benefits for under-covered topics.