Stories from the city, stories from the cloud: an introduction to city open data portals in the United States
What stories can you tell with a city’s open data? And what’s missing from the data?
Perhaps you want to understand the relationship between rates of poverty and emergency response times, or the distribution of curb cuts across neighborhoods, or the latest public real estate purchases. Maybe you want to hunt for lack of oversight in the list of recent food safety violations, or seek geometric patterns in distributions of city trees. Or maybe you want to visualize access to education within a single zip code.
Whatever your area of focus, I am imagining you, as I write, as somebody who cares about cities and the people who live in them. Given this, where might you start looking for data to pursue your questions?
It’s hard to provide a simple answer to that very reasonable question! Just a few months ago, Google launched a new service that attempts to take on this problem of helping folks engage meaningfully with data that is often heterogeneous and widely distributed:
There are many thousands of data repositories on the web, providing access to millions of datasets; and local and national governments around the world publish their data as well. To enable easy access to this data, we launched Dataset Search, so that scientists, data journalists, data geeks, or anyone else can find the data required for their work and their stories, or simply to satisfy their intellectual curiosity. (Google)
Google’s solution is focused squarely on the discoverability of datasets – if datasets are described in a cohesive way, then Google’s service can scoop up datasets from across the Internet and serve up optimal results for the user based on their choice of keywords. Note how, in this formulation, Google characterizes local and national governments as the groups that publish datasets, and repositories as services that provide access to datasets. In other words, both play a similar infrastructural role in delivering to you, the user, the optimal datasets.
As a data librarian who zeroes in on feminist & critical data science, it is my professional duty to get extremely worried when folks talk too much about the finding of data without also discussing context, power, and omission. Far from being just the infrastructure that delivers datasets, local government organizations and their data repositories, along with civic tech activists and other community members, shape and constrain what experiences and issues can be expressed via open data.
In this post, I will introduce you to the producers and gatekeepers of civic data that Google describes, and to how they operate in practice. When it comes to cities, questions of context, power, and omission tend to revolve around the city open data portal: a model adopted nationwide that hosts tens of thousands of datasets, a buzzing cross-section of hyper-local civic data activity, and a service built almost exclusively on just one tech company’s platform.
If you’ve searched for open data before, you’ve likely come across the federal government’s open data portal: data.gov
Data.gov was launched in May 2009 by the first Federal Chief Information Officer, Vivek Kundra, who served under President Obama. It hosts over 200,000 datasets, with a primary focus on federal agencies and federal-level data. If you searched for city-level data for cities like Chicago and Seattle, you would likely find hundreds of datasets as well. Data.gov accepts city- and state-level datasets on an ad hoc basis, and several cities appear to submit to data.gov automatically and/or push regular refreshes of their existing datasets to the platform.
If you visit data.gov around the time I’m publishing this, however, you’re out of luck. Due to the 2018–2019 federal government shutdown, here’s what the portal looks like currently:
Let’s consider this an opportunity to look beyond data.gov, which absolutely has its share of omissions. Data.gov only captures a sliver of the city-level open data activity happening across the United States. An April 2018 article in Forbes lists over 90 open data portals at the city level, although the current number is likely in the hundreds. Below are two examples of open data portals for Chicago, Illinois and Saint Paul, Minnesota.
In some cases, city open data portals are the outcomes of particular policy imperatives and executive orders. Chicago began changing its practices in sharing data online in 2010 to make more easily accessible the results of Freedom of Information Act (FOIA) requests:
The Portal’s earliest beginnings were in May 2010, when the Daley Administration added FOIA request logs, statements of financial interest, and other records to the City’s regular website to make them more accessible. A year later, this small set of data became a major focus for the Emanuel Administration. (source)
Open data activity built up in Chicago over the ensuing couple of years, until an executive order formally began the process of constructing an Open Data Portal:
In December 2012, Mayor Emanuel issued a rare Executive Order that mandated city agencies to publish public data sets under their control, and update them on a regular basis. The Order also called for the creation of a Chief Data Officer (CDO), who would develop datasets and further the mission of the Data Portal and open government. (source)
Other cities have noticed the trend towards open data and joined in, drawing on the experiences of their peers, funding from programs like Bloomberg Philanthropies’ What Works Cities, and expertise from organizations like the Sunlight Foundation. Saint Paul Mayor Chris Coleman announced his city’s data portal in just such a context in a 2016 State of the City address:
Thanks to a grant from Bloomberg Philanthropies What Works Cities, we have had access to some of the best open data consultants in the country from the reputable Sunlight Foundation. While we aren’t the first city to launch an open data portal, the Sunlight Foundation will tell you that we are leapfrogging ahead of what other communities have done. (source)
The landscape of city open data portals consists of just these sorts of hyper-local stakeholders and perspectives. New portals emerge that respond to previous portals. One constant across these portals, however, is the near-universal adoption of Socrata as the technology behind the portals themselves.
Socrata is a Seattle-based company that offers a platform and collection of data tools designed to help governments at all levels offer open data services (according to Socrata, they currently work with over 1,000 government organizations). They provide services from the beginning of the open data thought process to the largest data portals like the City of New York’s. You can get a sense of the scope of their work in their recent blog post, “Biggest Socrata Client Headlines of The Year.”
What you’ll notice as you explore city open data portals is that, despite varying button designs and category names, nearly all of them use the same platform for facilitating access to the datasets. This means that, across the country, datasets appear in similar configurations of data, metadata, and supporting tools (like APIs and data format conversion), at the expense of a presentation style that emphasizes local context & histories. Socrata even published a tool that scans across their entire open data network all at once: a phenomenal, if somewhat unwieldy, resource.
How can we understand the influence of the local context on these varied open data platforms, especially given their largely homogeneous presentation? This is a question that matters not only to folks seeking data, but also to those deeply invested in the open civic data movement across the United States.
Every year, the Sunlight Foundation (the same group that sent data consultants to the City of Saint Paul in 2016) partners with other civic tech/open data groups to hold the US Open Data Census. The Census mobilizes this community to evaluate (1) the comprehensiveness of open data offerings across key categories like Transit, Property Deeds, etc. and (2) the degree of openness of the data within that category (e.g. is it actually freely accessible?).
The Census reveals a huge range of openness and comprehensiveness among the 99 cities with adequate reviewer participation. You can see a cluster of the first 15 to 30 cities showing at least a fairly comprehensive spread of topic areas, followed by a long tail of 70–85 cities with more scattershot offerings. The Census also provides a peek into the internal conversations happening within civic tech communities.
It’s hugely impressive work. However, there are limitations to this type of manual community scoring. The Census’ coverage is not complete: for instance, St. Paul receives a score of zero and has no datasets scored in any categories (boo!). The Census highlights the scoring metric, both in terms of the overall score and within categories, while we as investigators and data-seekers may still wish to better interpret a specific dataset that already exists, or better place it in the context of a portal’s offerings.
At some point while poking around Socrata’s platform, I realized that the reliance on a single portal platform actually opens up some interesting opportunities for comparative analysis. How do these different open data portals compare to one another if you peek under the veil a bit? I decided to do a little data mining/sleuthing.
The Socrata platform is very clever about including all kinds of options for grabbing/downloading/exporting/transforming/visualizing data. Their commitment to public data and a public API for everything is wonderful, and that extends to the metadata describing the datasets. I noticed that there was a common way to dump all of the metadata out of a given city’s datasets. I used Python to grab this data for the 15 highest-ranked cities according to the 2017 US Open Data Census (Austin, TX to Providence, RI).
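Here is a minimal sketch of what such a metadata dump could look like in Python. It assumes the portal publishes a Project Open Data style catalog at /data.json, which many Socrata-hosted portals appear to do; that path, the example domain, and the helper names are assumptions for illustration, and this is not necessarily the method used for the analysis in this post.

```python
import json
from urllib.request import urlopen


def catalog_url(domain):
    """Build the assumed catalog URL for a portal domain."""
    return f"https://{domain}/data.json"


def parse_catalog(raw_json):
    """Return the list of dataset metadata records from a catalog document."""
    catalog = json.loads(raw_json)
    # In the Project Open Data schema, records live under the "dataset" key.
    return catalog.get("dataset", [])


def fetch_catalog(domain, timeout=30):
    """Download and parse a portal's catalog (requires network access)."""
    with urlopen(catalog_url(domain), timeout=timeout) as response:
        return parse_catalog(response.read().decode("utf-8"))


# Offline example using a tiny hand-made catalog (not real portal data).
sample = '{"dataset": [{"title": "Example trees", "keyword": ["parks", "gis"]}]}'
records = parse_catalog(sample)
print(len(records), records[0]["title"])
```

With network access, something like fetch_catalog("data.seattle.gov") would return the same kind of list for a live portal, provided that portal exposes the assumed endpoint.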
Below, you can see that while the highest-ranked cities tend to have more datasets on the whole, there is considerable variation between city open data portals:
To move beyond just counting datasets, I aggregated keywords that appear for datasets within a given city. My intuition was that observing the most frequently-used keywords across all of the datasets for a city might help characterize the areas of emphasis for a given city’s open data portal as a whole. I was curious whether these highly-ranked cities would all show similar patterns in their most common keywords, or if there would be noticeable variation between cities. It turns out to be very much the latter!
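The keyword aggregation can be sketched roughly as follows. This assumes each dataset's metadata is a dictionary with an optional "keyword" list, as in the Project Open Data schema; the records here are hand-made for illustration, and the function name is hypothetical.

```python
from collections import Counter


def keyword_counts(records):
    """Count how many datasets carry each keyword."""
    counts = Counter()
    for record in records:
        # Use a set so a keyword repeated on one dataset is counted once.
        counts.update(set(record.get("keyword") or []))
    return counts


# Hand-made records for illustration only (not real portal data).
records = [
    {"title": "A", "keyword": ["parks", "gis"]},
    {"title": "B", "keyword": ["parks", "planning"]},
    {"title": "C"},  # datasets without keywords contribute nothing
]
top = keyword_counts(records).most_common(1)
print(top)  # [('parks', 2)]
```

Counting datasets rather than raw keyword occurrences is a design choice: it keeps one heavily tagged dataset from dominating a city's profile.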
For instance, take a look at the keywords that most frequently describe datasets for Seattle, WA:
In Seattle’s datasets, you can observe a clear preference for environmental and spatial data. Parks and recreation rank very highly, as do more generic geospatial terms like gis, map, etc. You can also observe a cluster of planning and utilities concerns in keywords like spu (Seattle Public Utilities), planning, draining, etc.
In contrast, here is the distribution of dataset keywords for Austin, Texas:
Austin places a very noticeable emphasis on emergency services (EMS), ambulance, safety and policing, fire, etc. Utilities are also scattered in; however, new areas like housing, court, etc. emerge as common dataset focal points as well.
And here are common keywords used to describe the (much smaller) collection of Las Vegas datasets:
Here you can see a unique focus on homelessness, which is absent from Seattle’s and Austin’s most frequent keyword lists, as well as a range of topics like parks & rec, cultural affairs, information technology, schools, etc.
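One way to check a claim like "this keyword is absent from the other cities' lists" is a simple set difference. The sketch below uses a handful of keywords picked from the descriptions above purely for illustration; they are not the full ranked lists, and the function name is hypothetical.

```python
def unique_keywords(city, top_keywords):
    """Keywords in one city's list that appear in no other city's list."""
    others = set()
    for name, keywords in top_keywords.items():
        if name != city:
            others |= set(keywords)
    return set(top_keywords[city]) - others


# Illustrative subsets only, not the actual top-keyword lists.
top_keywords = {
    "Seattle": ["parks", "gis", "map", "planning"],
    "Austin": ["ems", "fire", "housing", "court"],
    "Las Vegas": ["homelessness", "parks", "schools"],
}
print(sorted(unique_keywords("Las Vegas", top_keywords)))
# ['homelessness', 'schools']
```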
It’s important to mention that this analysis relies upon keywords that are freely entered, as opposed to themes, which Socrata fixes across data portals. There is a chance that this variation is influenced by variations in data entry practices, for instance.
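One partial mitigation is to normalize keywords before counting, so that trivial differences in capitalization and spacing do not split one concept into several. This is only a sketch with made-up inputs; it does nothing about deeper inconsistencies such as synonyms or abbreviations, and the function names are hypothetical.

```python
from collections import Counter


def normalize(keyword):
    """Lowercase a keyword and collapse leading, trailing, and repeated spaces."""
    return " ".join(keyword.lower().split())


def normalized_counts(keywords):
    """Count keywords after normalization, skipping blank entries."""
    return Counter(normalize(k) for k in keywords if k.strip())


# Made-up free-text entries showing typical data entry variation.
raw = ["GIS", "gis ", "Parks  and Recreation", "parks and recreation", ""]
counts = normalized_counts(raw)
print(counts["gis"], counts["parks and recreation"])  # 2 2
```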
Even so, the analysis illustrates how we should not neglect the choice and composition of datasets, and certainly not assume these datasets form a random or representative sample of all city data. Even with a large data portal like New York City’s or Austin’s, the topics in focus represent a subset of all possible open data focal points, and this colors the types of interpretations possible.
Data mining and metadata text analysis offer one way to explore these variations and emphases. I’ve shared my analysis as a Jupyter notebook if you’d like to keep working with it (and I plan to do so myself!).
I want to leave you with one final example of the importance of context for open data portals. Consider the most frequently downloaded datasets from Detroit, Michigan’s portal:
The highest-ranked datasets demonstrate a clear focus on the built environment in Detroit, and specifically ownership and demolition of buildings. From digging into the history of open data in Detroit, I found evidence that the presence of these datasets is the result of specific community activism and civic data efforts by groups like Data Driven Detroit (D3) as well as the city’s Detroit Demolition Program, which claims to be the “largest and most transparent demolition program in the country.”
However, the specific histories of activism and civic work may not be readily legible for newcomers to the data portal or folks who do not seek out additional context. We might also consider these frequent downloads in the context of a pernicious trend of folks highlighting Detroit’s aesthetics of decline and blight at the expense of good-faith engagement with urban issues. In “The Case Against Disaster Porn”, Noreen Malone writes:
Detroit’s problem might not be on the level of war or famine, but it is serious — and yes, there are versions of those problems, writ smaller, in the city. More vitally, it is an ongoing problem, not a done deal. Those pictures are a funereal celebration, for a place that is sick but not dead. And it is a focus that irks activists in Detroit, who’ve been desperate for more coverage of their efforts to revive the city. (source)
To what extent are the frequent downloads of demolition-focused datasets the product of a fascination with blight, and to what extent are they integrated into local efforts to make meaningful change in urban communities? It would be impossible to determine from the download count indicator alone. And even as I attempt to identify the local context, I am a limited observer myself (living in Ann Arbor, MI for two years is nowhere near sufficient experience to make any definitive claims!).
Despite the difficulty of unearthing these hidden contexts and histories, I think this is exactly the type of uncertainty we should keep probing and educating ourselves about as we increasingly turn to open data sources. There’s extremely good data out there, but we cannot assume this data is a stand-in for an objective reality or lived experiences within urban communities. We need to look deeper, and anticipate that context, power, and omission may be part of the story too.
Hopefully this provides a few directions for you in your explorations, sparks some new questions, and gives you some new critical & feminist data science tools. Happy exploring!