Open Data, and How We Preserve It

Oct 30, 2018

This is the keynote talk I gave on October 12, 2018 at the workshop Big Data, Meager Returns? Fairness, Sustainability and Data for the Global South organized by Yasodara Córdova and Lorrayne Porciuncula at the Harvard Kennedy School. I’m grateful to them for the invitation, and for the audience’s warm reception.

I usually end up writing out my remarks as part of my talk prep, which makes it relatively easy for me to recreate the audience experience electronically. I have removed some of the worst jokes. Also, you will need to imagine a bunch of stammering and audiovisual problems if you wish to achieve true verisimilitude.

Hi. I’m Tom Lee, policy lead at Mapbox. We’re a map and location services company, reaching about 400 million users every month through the apps of customers like The Weather Company and Snap. I’m very pleased to be here today, because open data and the mechanisms for sustaining it have been a fascination of mine for a long time. Before Mapbox I was CTO at the Sunlight Foundation for a number of years, and using and preserving open data was among our most important preoccupations.

My job at Mapbox is about understanding how institutions outside of the marketplace affect us — governments and open source projects, for example. The answer turns out to be “a lot”, because mapping is crucially dependent on data, and gathering data about the world is a huge and expensive undertaking — one that almost invariably involves government. In fact, I’d argue that mapping data is among the most useful ways to understand how governments think about the creation and preservation of knowledge. If you go back to Scott’s Seeing Like a State, a surprising number of the book’s examples involve mapping in one form or another. So this is the lens through which I want to explore these questions today. Also, I have found that being in the mapping industry makes for much prettier slides than I got to use when I spent my time talking about federal spending data.

I’d like to talk about a few different geodata systems today, exploring how they’re created and governed, in the hopes that it provides some inspiration for the questions you’ve gathered here to examine. Landsat is the one to start with. It’s a special one. It’s been around long enough to let us be confident about how its data is used (and pessimistic about how it’s governed). This is a photo of Landsat 1.

That satellite launched in 1972. We are now up to Landsat 8, with a few different generations of technology. You can see that Landsat 6 did not have a great career, and Landsat 7 suffered from diminished capability in one of its instruments for part of its tenure. But a lot of the satellites have vastly outperformed their expected lifespan. Landsat 9 will launch in 2020, and NASA and USGS, which jointly run the Landsat mission, are currently exploring what Landsat 10 will look like.

This is Landsat’s orbit, more or less — approximately polar, with the earth rotating underneath it.

Landsat captures 11 spectral bands at a maximum resolution of about 30 meters per pixel across a variety of different instruments. MSS is the multi-spectral scanner; TM is the thematic mapper; and so on. Only the bands clustered at the left of this image are visible to humans. The others are useful for specific scientific applications like studying clouds, coastal aerosols or vegetation.

This is an example of a multispectral Landsat scene. Obviously those 11 dimensions have been jammed down into false-color using the 3 we can see. But it does give you some idea of why the distinctions are useful. Those clouds really pop, and it’s easy to see areas of human settlement.
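To make that "jamming down" concrete, here is a minimal Python sketch of a false-color composite: three bands are chosen and assigned to the red, green and blue display channels. The band names, the sample reflectance values and the band-to-channel assignment are all illustrative assumptions, not Landsat's actual product specification.

```python
# A toy "scene": each pixel holds a reflectance value (0.0 to 1.0) per band.
# Band names and values are made up for illustration.
scene = [
    [
        {"nir": 0.6, "red": 0.2, "green": 0.4, "swir": 0.1},
        {"nir": 0.2, "red": 0.4, "green": 0.2, "swir": 0.6},
    ],
]


def to_byte(value):
    """Scale a 0.0-1.0 reflectance to a 0-255 display value, clamping outliers."""
    return max(0, min(255, round(value * 255)))


def false_color(scene, channels=("nir", "red", "green")):
    """Assign three chosen bands to the display's (R, G, B) channels."""
    return [
        [tuple(to_byte(pixel[band]) for band in channels) for pixel in row]
        for row in scene
    ]


composite = false_color(scene)
```

Swapping the `channels` tuple changes which invisible band gets "borrowed" by a visible color, which is why the same scene can be rendered to emphasize clouds, vegetation or settlement.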

This is Mapbox’s Landsat Live viewer, centered on the Yucatan Peninsula. It shows the latest “scenes” of RGB data as they arrive from the satellite. Each of these rectangles is captured at a different time, which is why there are hard edges to the clouds. You can also get a sense of the orbital path from this. Google for “Landsat Live” and you’ll find this in the top few links. It’s a nice way to get a sense of what the satellite is up to, since the website tells you how old each image is and when the region will next be visited.

We used Landsat imagery to create our satellite layer at lower zoom levels. There are higher-resolution sources of satellite and aerial imagery, of course, but no archive goes back as far as Landsat. That lets us pick through its history and find the pixel that strikes our machine learning models as being least-cloudy for every spot on earth. That, plus some magic from my colleagues that I can’t pretend to really understand, lets us present a cloud-free view of the entire earth. This is Spain, Morocco and Gibraltar.
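The core idea of picking the least-cloudy pixel can be sketched in a few lines of Python. This is a toy under stated assumptions, not our production pipeline: the per-pixel `cloud_score` grids stand in for whatever a real model would output, and the function name and data layout are hypothetical.

```python
def least_cloudy_composite(scenes):
    """Build a mosaic by choosing, for each grid cell, the pixel from the
    scene with the lowest cloud score at that cell.

    scenes: list of (image, cloud_score) pairs covering the same grid.
    image[r][c] is a pixel value; cloud_score[r][c] is a 0.0-1.0 estimate.
    """
    rows = len(scenes[0][0])
    cols = len(scenes[0][0][0])
    mosaic = []
    for r in range(rows):
        row = []
        for c in range(cols):
            # Pick the scene whose cloud score at this location is lowest.
            image, _ = min(scenes, key=lambda s: s[1][r][c])
            row.append(image[r][c])
        mosaic.append(row)
    return mosaic


# Two made-up captures of the same 1x2 area, taken at different times.
older = ([["old-left", "old-right"]], [[0.9, 0.1]])
newer = ([["new-left", "new-right"]], [[0.2, 0.7]])
mosaic = least_cloudy_composite([older, newer])
```

The longer the archive, the more candidate scenes there are for each cell, which is why Landsat's history matters so much for this kind of product.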

We’re not the only company that uses Landsat, of course. It’s calibrated to scientific levels of precision, both in terms of the pixels it captures and how they’re reconciled against positions on the Earth, which makes its scenes useful as a sort of ground truth for many companies working with imagery.

The continuity of the archive is a quality unique to Landsat, and is among the things that make it most valuable. Landsat was crucial for understanding the environmental disaster unfolding around the shrinking Aral Sea, for instance. And it continues to be an essential tool for studying climate change. Here’s a visualization that makes this point: the growth of Las Vegas over the mission’s lifespan. You can see how water levels in Lake Mead can be tracked. You can also see the resolution of the imagery jump midway through as a better satellite came online.

Landsat is used for a lot of different things, and quantifying its full benefits is probably impossible. But USGS has studied some of them, and tallying up the top 16 governmental uses — mostly around things like aquifer or fire management — brings an estimate of $350 million to over $436 million in benefits per year. The satellites have gotten more expensive over time, but this is still a very good deal.

Aside from continuity, the other thing that makes Landsat invaluable is its availability: the data is open to everyone. In fact this was sort of baked into the program from the start, as the US didn’t have a global array of receivers for Landsat data. Countries that wanted to amass their own Landsat archive could set up receiving stations in exchange for kicking in some money and giving the US access to what they downloaded.

The US government sold Landsat data to those who wanted it in both digital and printed forms. Anyone could place an order, but the process had costs and was technically constrained: this is an example of how the data was exchanged.

That changed in 2008 with the move to a free and open access policy. Data use immediately skyrocketed. Today you can get data from the USGS EROS center, which is the central archive for Landsat. But it’s more typical to download it from Microsoft or Amazon’s public data programs, which allow storage costs to be shared and simplify a lot of the more confusing aspects of working with Landsat data. Or, of course, you can use something more like a finished product, often without realizing it. Mapbox is one example.

So: huge open data success story, right? It is. But it’s also not quite that simple.

Landsat is administered by USGS, which is part of the Department of Interior. DOI has recently asked USGS and the Landsat Advisory Group (the “LAG”) to investigate cost-sharing and public-private partnership opportunities for Landsat. This is perfectly reasonable, of course: it’s worth asking if we’re spending public money as well as we could be.

To long-time Landsat fans, though, this all might sound oddly familiar. Didn’t the LAG study this issue in 2012 and decide that charging for Landsat would inhibit its use, fail to cover new costs, and also be illegal? Well, sure. But you should think of this less as duplicative work and more like a franchise reboot. After all, this 2012 report came a scant ten years after an attempt to move Landsat to a private data buy fell apart.

And that effort came ten years after passage of The Land Remote Sensing Policy Act of 1992. I pause here so we can admire the haircuts and shoulder pads from that era.

But actually, the 1992 law that was debated on this day wasn’t bad: it was put in place to reverse The Land Remote Sensing Commercialization Act of 1984. I am only beating up on these poor congresspeople because C-SPAN’s digital archives don’t go back to the ’84 bill.

The 1984 law was the first attempt to privatize Landsat. My sense is it was the result of confusion as much as anything else. NASA had created the Landsat system, and it was an incredible technical achievement. But NASA is an experimental agency. Once the system was deemed to have achieved “operational” status, there was no mechanism by which they could keep running it, either organizationally or legally. This is why USGS got roped in (NOAA was involved at one point, too).

But even then, it was an open question whether running a fleet of space robots was a duty the federal government should take on in perpetuity. The founders didn’t have much to say about space robots. No agency wanted to put an enormous new and permanent expense on its books. The military was queasy about satellite imagery becoming available to civilians and from there to other countries. Landsat even prompted debates at the UN about whether you needed to get a country’s permission before taking pictures of it from space. These were new questions and no one knew the answers to them.

As tempting as it is to buy into the big government liberal vs. big business conservative framing, it was the Carter administration that began the process of privatization (though the Reagan administration did accelerate the timeline considerably). Landsat wasn’t the only satellite mission under consideration for transfer in this period: the feds were also examining if they should sell off our country’s weather satellites.

Ultimately they decided that weather should stay in-house but that Landsat should not. So Congress passed a law that allowed for a vendor to be selected and given data exclusivity that, they hoped, would create a market for Landsat and eventually take its costs off their books. EOSAT was the consortium of aerospace companies that won the bidding process.

These charts show the result. They’re the work of Kathleen Eisenbeis, who literally wrote the book on Landsat privatization. Landsat data came in different forms over the years as technology evolved. These charts show how privatization affected data availability. On the left is a budget-constrained buyer who opts for the cheapest product available. On the right is a better-funded buyer who purchases fancy Landsat output. You can see that the low end offerings suffered, experiencing large price hikes, while EOSAT focused on new product offerings and developing the high-end, more lucrative side of the market (there is a big, pre-EOSAT price jump on the right side too, but this reflects a new product that came on a much more expensive form of digital media). The survey work that Dr. Eisenbeis did within the academic community is even more striking: the remote sensing experts surveyed were nearly uniform in complaining about the price and consequent availability of Landsat data under EOSAT.

By 1992 it was clear that a self-sustaining market for Landsat data had not emerged, and that serious damage was being done to scholarship, so the decision was reversed, even though this took another act of Congress. They tried again in 2002, but the private sector bidder wound up backing out and the effort fell apart. In 2012 they investigated the question again, but were warned off by their own advisory group, as you saw. We seem to be revisiting the privatization question more frequently, but less deeply each time.

There are a number of ways to think about these questions. I want to briefly introduce two that I’ve found useful, even though by doing so I will probably make any real economists in the room mad at me.

The first is the idea of a public good. Sometimes we use this term in a vague way to evoke civic-mindedness, but of course it has a specific definition: something that is non-rivalrous, which means it doesn’t get used up as more people enjoy its benefits, and non-excludable, which means you can’t stop people who haven’t paid for it from benefiting.

I’d argue that Landsat is interesting not only because it’s a public good, but because it’s something that became a public good over its history. Digital technology took care of the rivalry dimension: Landsat data used to be shipped around on tapes, with substantial costs associated with delivery, processing equipment, expertise and even the actual media. Imagery processing isn’t child’s play, but it has been dramatically democratized by improving technology. Landsat is, for all practical purposes, no longer a rivalrous good. Technology figured out that axis of the graph for us.

At that point policymakers had to decide if it should be an excludable good or not. These privatization episodes are about that axis.
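The two axes form a standard four-cell grid, which a few lines of Python can spell out. The category names are the usual textbook ones; the mapping of Landsat's eras onto the grid is a simplified reading of the history described in this talk, offered as an assumption rather than a settled classification.

```python
def classify_good(rivalrous, excludable):
    """Return the textbook name for each corner of the two-axis grid."""
    if rivalrous and excludable:
        return "private good"
    if rivalrous:
        return "common-pool resource"
    if excludable:
        return "club good"
    return "public good"


# Assumed, simplified readings of three Landsat eras described in this talk.
landsat_eras = {
    "tapes by mail order": classify_good(rivalrous=True, excludable=True),
    "digital, sold under exclusivity": classify_good(rivalrous=False, excludable=True),
    "digital, free and open": classify_good(rivalrous=False, excludable=False),
}
```

Technology moved Landsat along the first argument; the privatization fights are arguments about the second.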

The other economic frame that might be useful is a basic equilibrium from ECON101. The market is supposed to clear at that point in the middle, but the magic happens in the triangle to its left: that’s the region where everyone is getting more out of a deal than they’re putting in. It’s how trade makes us wealthier, rather than just being a zero-sum shuffling of resources. In this case government is the supplier, so the shaded portion — the consumer surplus — is what’s most interesting. It’s the public benefit, more or less. You can think of digital technology as expanding supply — sliding that blue S curve over to the right — making that shaded triangle bigger and bigger as the equilibrium price point falls. A lot of otherwise foregone use becomes viable!
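For anyone who wants to see the triangle grow, here is a small worked example in Python. It assumes straight-line demand and supply curves with made-up coefficients; nothing here is an estimate of the real market for Landsat data.

```python
def equilibrium(a, b, c, d):
    """Linear demand P = a - b*Q and linear supply P = c + d*Q.
    Returns the market-clearing (quantity, price)."""
    quantity = (a - c) / (b + d)
    price = a - b * quantity
    return quantity, price


def consumer_surplus(a, b, c, d):
    """Area of the triangle between the demand curve and the price line."""
    quantity, price = equilibrium(a, b, c, d)
    return 0.5 * (a - price) * quantity


# Illustrative numbers only. Lowering the supply intercept c slides the
# supply curve to the right, as cheaper digital distribution would.
before = consumer_surplus(a=100.0, b=1.0, c=40.0, d=1.0)
after = consumer_surplus(a=100.0, b=1.0, c=10.0, d=1.0)
```

With these numbers the equilibrium price falls from 70 to 55, the quantity used rises from 30 to 45, and the surplus triangle more than doubles, from 450 to 1012.5.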

These models only go so far — if you extend the logic of digital distribution and consumer surplus to an absurd degree, you’ll wind up thinking that government should be funding all of our media consumption with tax dollars. I love NPR as much as the next guy, but I think that probably says more about these models’ limits than about what a desirable policy agenda is. Still, it’s worth considering how much bigger that triangle can get if we let data loose, embracing its infinite reproducibility.

This brings us to the second public dataset. What you see here is a small part of the National Bridge Inventory, a dataset managed by the US Department of Transportation (USDOT). Every bridge in the country — all 600,000 of them — is visited once every two years (four for low-risk bridges) and assessed by a team of structural engineers, support personnel, and divers when there’s water involved. There’s a huge inspection manual that they have to comply with, and the data they generate is used to prioritize repairs, plan new construction, and populate those press releases about crumbling infrastructure that you hear covered in the news.

Bridges are very relevant for companies that offer directions services to businesses, like Mapbox, because along with hazmat, load and other legal restrictions, truck fleets need to know clearances along a route. So the NBI is potentially useful for things beyond bridge repair.

The closest bridge to us today is the Anderson Memorial Bridge, as you can see on this beautiful hi-res imagery. Mapbox doesn’t offer boat routing yet but I’m going to use it as an example anyway.

I downloaded the 2017 edition of the National Bridge Inventory’s Massachusetts file, which has just under 5200 records. Extracting the latitude and longitude is a little bit of a pain since NBI uses an archaic encoding scheme, but after writing some bad Python I was able to determine that the Anderson’s identifier is B160114F2DOTNB.
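For the curious, the coordinate decoding looks roughly like the Python below. It is a reconstruction under assumptions, not the script I actually ran: it assumes latitude is stored as 8 digits (degrees, minutes, seconds to the hundredth) and longitude as 9 digits in the same pattern, with longitude implicitly west. Check the NBI coding guide before trusting that layout. The sample digits are invented and are not taken from any real record.

```python
def dms_to_decimal(field, degree_digits):
    """Convert a packed degrees/minutes/seconds string to decimal degrees.
    Assumed layout: D..D MM SSss, with seconds in hundredths."""
    degrees = int(field[:degree_digits])
    minutes = int(field[degree_digits:degree_digits + 2])
    seconds = int(field[degree_digits + 2:]) / 100
    return degrees + minutes / 60 + seconds / 3600


def decode_position(lat_field, lon_field):
    """Return (latitude, longitude) in signed decimal degrees."""
    latitude = dms_to_decimal(lat_field, degree_digits=2)
    # Assumption: the file stores unsigned west longitudes, so negate.
    longitude = -dms_to_decimal(lon_field, degree_digits=3)
    return latitude, longitude


# Made-up sample values: 42 deg 22' 00.00" and 071 deg 07' 30.00".
lat, lon = decode_position("42220000", "071073000")
```

Once every record has a decimal latitude and longitude, finding the bridge nearest a known point is a simple distance comparison.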

From there we can take a look at the NBI record format. This is some COBOL-era fixed-width stuff — folks hoping for something as fancy as a comma-separated value will be disappointed. But it’s dense, containing as many as 138 metrics for each recorded structure and details like whether the bridge has guardrails, what its components are made of, and even whether it has historical significance.

The NBI contains eight different types of clearance values that get captured in its records, but Item 39 — Navigation Vertical Clearance — is the one that’s of particular interest to a mapping company.

So, having found the record we simply extract columns 190 to 193, and there! 3.7 meters. This can then be fed into our routing engine. Repeat this for every bridge in America and you will quickly have a comprehensive dataset of bridge clearances. All you need to do is brief your B2B sales people, scale up the API server for your flood of new customers, and then wait for the first phone call about this:
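In Python, pulling a field out of a fixed-width record is a matter of slicing. The sketch below uses the column range mentioned above, counted from 1 and inclusive, and assumes the four digits carry one implied decimal place, which is consistent with the 3.7 meter reading but should be confirmed against the coding guide. The sample record is synthetic filler, not a real NBI line.

```python
def extract_field(record, first_col, last_col):
    """Slice a fixed-width record using 1-based, inclusive column numbers."""
    return record[first_col - 1:last_col]


def navigation_vertical_clearance_m(record):
    """Item 39 in meters, assuming four digits with an implied tenths place."""
    raw = extract_field(record, 190, 193)
    return int(raw) / 10


# Synthetic record: filler characters with "0037" placed in columns 190-193.
sample_record = "X" * 189 + "0037" + "X" * 50
clearance = navigation_vertical_clearance_m(sample_record)
```

The off-by-one between 1-based column numbers in the documentation and 0-based string indices is the usual place for this kind of parser to go wrong, which is why it lives in one helper.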

We do not use the NBI for truck routing. And if we ever do offer boat routing we probably won’t use it for that, either. To their credit, the NBI people do their very best to make sure no one makes this mistake: the documentation prominently notes that the data must not be used for navigation, despite having field names like “Navigation Clearance”.

So we are sending taxpayer-funded inspection teams to hundreds of thousands of bridges every year. We’re capturing a ton of data about them, using a uniform methodology that includes navigation clearance height. But we can’t use it for navigation. Why?

The answer is basically federalism. USDOT rolls up bridge data in the NBI, but data is submitted by states. States might perform bridge inspections centrally, or they might split them up by locality. The same division of labor applies to construction work, which might change the clearances of those bridges at any time. There is no uniform system for reporting those changes, propagating them to the NBI (not that the NBI is published more than once a year, mind you) and notifying end users. USDOT doesn’t have lawyers funded to deal with this or personnel assigned to making sure it happens. It’s not their job. They’re here to keep the bridges from falling down, not to keep trucks from running into them. That’s someone else’s job.

And in fact there are people doing that job. Mapbox offers truck routing. So do Bing, HERE and a bunch of vendors you probably haven’t heard of if you don’t own a fleet of semis. They’ve captured the data, or hired someone else to, and have mechanisms for monitoring for changes. That extra effort costs money, and this cost is passed on to fleets big enough to need software and their customers. This starts to look a bit like a user fee: the person getting the benefit is the one paying for it. When this works, it’s great: it connects use to investment, reducing the potential for waste. And it can be a more progressive funding mechanism than broad-based taxation.

Is this the ideal solution? Maybe not. But for now I think it’s probably the best option on the table. Coordinating this effort across at least dozens and probably thousands of governments isn’t simple or cheap.

OK, one more example: NAIP. The National Agriculture Imagery Program is run by USDA and, since 2002, has been publishing aerial imagery of the continental U.S. on a 2- to 3-year collection cycle. The program is designed to meet USDA’s own imagery needs (leaf-on collection, with spectral parameters optimized for looking at vegetation), but the data is freely available, which means I can share striking images like these:

Braided flats at the mouth of the Skagit River, Washington

A section of the Escalante River in Utah

NAIP is widely used by state governments, federal agencies outside of USDA, researchers and nonprofits.

It’s also used by businesses, which is what this slide is about. This is a microsat from the startup Planet (formerly Planet Labs). These tiny satellites aren’t the most capable, but they’re very cheap, allowing Planet to put a ton of them into orbit and provide revisit times as low as a day. That opens up new use cases, like detecting when crops are under stress quickly enough to do something about it.

One downside to these satellites being so cheap is that the imagery they produce isn’t very well rectified relative to the geography it represents. That’s where NAIP comes in. It’s very well aligned, and Planet can extract control points like the ones seen here to line their own imagery up in a post-processing step, producing a very high quality product.
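To give a feel for what control points buy you, here is a deliberately simple Python sketch. It is not Planet's method, which I can't speak to: it swaps in the crudest possible correction, a single constant shift estimated as the average disagreement between where each control point appears in the reference imagery and where it appears in the new image. Real rectification typically fits more flexible transforms. All names and coordinates are hypothetical.

```python
def mean_offset(control_points):
    """Estimate a constant (dx, dy) shift from matched point pairs.

    control_points: list of ((x_ref, y_ref), (x_obs, y_obs)) pairs, where
    'ref' is the trusted position and 'obs' is the position in the new image.
    """
    n = len(control_points)
    dx = sum(ref[0] - obs[0] for ref, obs in control_points) / n
    dy = sum(ref[1] - obs[1] for ref, obs in control_points) / n
    return dx, dy


def apply_offset(point, offset):
    """Move an observed point by the estimated shift."""
    return (point[0] + offset[0], point[1] + offset[1])


# Made-up pixel coordinates: the new image sits 3 px left and 4 px low.
pairs = [
    ((100, 200), (97, 204)),
    ((300, 50), (297, 54)),
    ((10, 10), (7, 14)),
]
offset = mean_offset(pairs)
corrected = apply_offset((97, 204), offset)
```

The quality of the result depends entirely on how trustworthy the reference positions are, which is the role a well-aligned public dataset plays.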

NAIP’s used all over — it’s the “Hello World” dataset in the Microsoft Azure image processing tutorials, to give you some idea.

Unfortunately, NAIP is funded by a “pass the hat” contribution scheme within government. Agencies at the state and federal levels that use the data are supposed to kick in money, but there’s no real binding mechanism and USDA has to chase them down. This hasn’t worked very well. Late last year USDA began considering whether to procure licensed imagery for its needs rather than sourcing the imagery and distributing it. Here’s the slide deck that scandalized the open imagery data community!

Meeting the 3-year collection cycle is USDA’s top priority. Buying licensed data is one way to make the numbers work.

But this approach would rule out redistribution to the public or even use by other agencies and levels of government. Even if we only count the cost of government imagery use, ignoring public benefits, the total cost to taxpayers could rise as current governmental NAIP users are forced to go out and procure replacements. Sadly, this mismatched set of budgetary incentives is typical of how we fund public goods.

Of course, the smart folks at USDA know how valuable NAIP is and what could be lost, and they understood the likely reaction to this proposal. That’s where we came in! We helped to share this news and get people interested. That, in turn, piqued the interest of some heroic bureaucrats, who joined the cause and eventually found money to support NAIP as-is for the next two years.

What happens after that, I couldn’t say. But it’s interesting to consider the recurring patterns across these examples.

Technology has made information non-rivalrous. It’s infinite, if we want it to be.

Now we have to decide whether to make it excludable: whether we make it scarce in order to ensure it is sustained. If we do, we can keep ourselves grounded and our funding fair, at the cost of some use.

On the other hand, if we make it fully open, we maximize use and benefits. But it can become a bit of a zombie: hard to find resources to sustain thanks to free-riders and potentially unmoored from its mission.

Either way, diffuse responsibility is bad. It makes it hard to adapt to changes and to secure resources.

There’s one last dimension I want to pull out. I mentioned that Landsat was interesting because changing technological conditions had altered its status and how we thought about it. The same is true of the other datasets I discussed today, and countless others.

The public/private solution we’ve arrived at for bridge data works in part because signs like this still exist. I will be glad to talk to you at length about the amazing benefits to be had from our routing API, but a human driver at the wheel of a truck can still make his or her way around our built infrastructure.

It’s not clear how much longer that will be true. This is video of our Vision SDK, working to extract structured data from an environment that’s built to be legible to humans. We are automating away these demands on human attention, and at some point it’s not going to make sense to prioritize them over machines for these tasks. Where will the necessary data be? I can tell you that right now we are not on track for it to wind up in public hands.

Let me leave you with this. It’s the cover of the report that the National Commission on Libraries and Information Science wrote during the first Landsat privatization debate: To Preserve the Sense of Earth From Space, surely the most poetic title for a report on government funding mechanisms I’ve ever seen. The famous Blue Marble photograph of Earth was twelve years old at the time. It was taken on the Apollo 17 mission, the last lunar mission and the last time a human ventured far enough to take that shot. That photo is famous because of the sense of shared responsibility and opportunity that it represented. Landsat was part of that dream.

In 1984 that sense of Earth from space was still fresh enough to matter, but it was already sinking into the language of cost/benefit analyses and bureaucratic turf wars. In the time since, the basic questions of how we produce and share knowledge haven’t changed, and the sophistication of the participants in that first debate shouldn’t be underestimated. Only the surrounding rhetoric has advanced, growing more baroque and disconnected from the underlying question.

So: what are we trying to do? What knowledge will we need, how will we produce it, preserve it, share it? I’m pleased to have the chance to explore these questions with you. Thank you.

Written by Tom Lee