tl;dr — I got the open data I’d been requesting for 15 years from the U.S. Congress.
But is open data what I really wanted?
2001: A legislative odyssey (ugh sorry, I know, I know)
I began my journey into open legislative data back in 2001. I was curious about the mechanics of government, idealistic about the value of information to society, and inspired by how companies like CQ (yes even them), and also nonprofits like the Center for Responsive Politics (aka opensecrets.org), could transform government information into useful tools.
The U.S. Congress’s official portal at the time, THOMAS.gov (which was finally shut down earlier this week, having been replaced by Congress.gov), had information about all of the bills introduced in Congress. It was six years old when I first encountered it in 2001. Well, the whole Web was barely more than six years old then, but THOMAS.gov was already arcane. It didn’t answer basic questions like:
- What is the next legislative step for this bill?
- What’s changed about this bill since the last time I looked?
- Who voted for it?
So I emailed THOMAS.gov and asked whether I could get access to their back-end database about the status of legislation. I thought there was a missed opportunity to take legislative data and put it to new and interesting uses beyond what a government website could — or should — do.
They said no.
I was mad. This was public information that the public should have access to!
There was no open legislative data.
That wasn’t even really a concept back then.
I wanted Congress instead to publish a spreadsheet of bills, with sponsors, the bill’s status, etc. An Excel file would have been a fine start. It wasn’t so hard, I thought! This data, in bulk, is what you need to create large-scale visualization, analysis, and tools.
So I built GovTrack.us and published the first comprehensive open database about bills and representatives & senators in Congress. GovTrack became as widely used as Congress’s own website, and even Members of Congress began using GovTrack for its congressional district maps and for its open data. (e.g., Sen. Elizabeth Warren’s website uses GovTrack’s API to show her legislative record.) An ecosystem of projects formed around the open legislative data that I collated and republished. I launched two startups based on the data, and through a successful Kickstarter last year hired a small team to write legislative summaries.
The high point of this arc for me was John Oliver doing a personification-impression of the site earlier this year.
(Jim Harper’s WashingtonWatch.com was the first free and independent legislative research website, launching a little before GovTrack. Because there was no open data community back then, it was years before Jim and I found out about each other’s efforts and began collaborating. PPF’s OpenCongress.org launched a few years later, using GovTrack’s open data, but shut down recently.)
Getting to open data
For the next 15 years after I wrote that email to THOMAS.gov, I ran “screen scrapers,” little programs that reverse-engineer the content on the pages of the THOMAS website to recover the legislative metadata in THOMAS’s database, to assemble the legislative data that GovTrack used and made open. (Beginning in 2012, the scrapers were community-developed through a project on GitHub with Eric Mill, Derek Willis, the Sunlight Foundation, and others.)
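To give a sense of what a screen scraper does, here is a minimal Python sketch. Everything in it is made up for illustration: the HTML fragment, the field names, and the `scrape_bill` function are assumptions, not THOMAS’s actual markup or GovTrack’s actual scraper code.

```python
import re

# Hypothetical page fragment, invented for illustration; this is not
# THOMAS's actual markup.
SAMPLE_HTML = """
<b>H.R. 1234</b><br>
<b>Sponsor:</b> Rep. Jane Doe (introduced 3/14/2001)<br>
<b>Latest Major Action:</b> 3/20/2001 Referred to House committee.<br>
"""


def scrape_bill(html):
    """Recover structured fields from HTML that was written for display."""
    # Each pattern is tied to how the page happens to be laid out.
    patterns = {
        "number": r"<b>(H\.R\. \d+)</b>",
        "sponsor": r"<b>Sponsor:</b>\s*(.+?)\s*\(introduced",
        "introduced": r"\(introduced ([\d/]+)\)",
        "latest_action": r"<b>Latest Major Action:</b>\s*(.+?)<br>",
    }
    record = {}
    for field, pattern in patterns.items():
        match = re.search(pattern, html)
        # If the layout changes, the pattern stops matching; the field is
        # recorded as None rather than guessed.
        record[field] = match.group(1) if match else None
    return record


record = scrape_bill(SAMPLE_HTML)
print(record)
```

The fragility is the point: every pattern depends on the page’s layout, so any redesign can quietly break the scraper. A structured data file published at the source avoids that guesswork.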
But reverse-engineering THOMAS wasn’t the way things should have been. I became the authoritative source, and a gatekeeper, for legislative data. It wasn’t a role I wanted.
And so I joined with other advocates to push the cause of open data throughout government in The Open House Project (2007, led by John Wonderlich), the Open Government Data Working Group (2007, led by Carl Malamud), CMF’s Communicating with Congress project (~2010), Open Government Data Licensing Best Practices (2013, with Eric Mill and others), Code for DC’s recommendations to the Mayor (2013, led by Justin Grimes and Matt Bailey), the Congressional Data Coalition (2014-, led by Daniel Schuman), and many other causes long forgotten.
Why was 15 years of advocacy necessary just to get a database of public information? Things that seem obvious now, even in government, weren’t before.
- Whose responsibility was it to publish data? The Senate told us in 2008 that it was the senators’ individual responsibilities to publish open data about their own votes if they should choose to do so.
- Who would use it? Regular Americans can’t code. (Journalists, advocates, researchers…)
- What if hackers alter the data (maliciously)? (More truth makes misinformation campaigns harder, not easier.)
- How can lawyers and judges be certain that the web pages they are looking at are correct/accurate? (Digital signatures, but don’t worry about that.)
- How much is it going to cost government to build open data and an API and a mobile app and… (Yikes, I just want a data file.)
Those advocacy efforts, led by my colleagues, leaned on the hard work of our allies within government who took risks within their agencies to push forward what they knew was best for their agency and the public. (I’d list their names but they probably wouldn’t appreciate it — staff are expected to keep a low profile. But you know who you are — thanks!)
These efforts changed our government and governments throughout the world. Open government data is now a real thing.
I’ve been particularly excited about progress within the DC city government, which I’ve also been involved in. In just the last year or so DC opened up its legal code (a project I did early work on with Dave Zvenyach and Tom MacWright), created an API for its legislative data (before the U.S. Congress did it!), cleaned up its open data policy, and proposed city-wide open source and open licensing guidelines (thanks again to Matt Bailey). I now serve on the city’s new-ish Open Government Advisory Group.
And of course the U.S. Congress has caught up too.
The road to open legislative data in Congress looks so certain in hindsight — but it was far from it along the way.
There were small victories every year or so. The U.S. Senate began publishing roll call vote results in an open data format, XML, in 2009 in response to the work of the Open House Project, and the Senate added more XML for other aspects of the legislative process over the following years. The Republican take-over of the House of Representatives in 2011 marked a major shift toward transparency. They began making much more data available, especially about the work of the House’s committees, and promised data about legislative status. A yearly official legislative transparency conference hosted by the House began, and Congress hosted two “hackathons,” in 2011 and 2015.
But progress was always two steps forward, one step back. Things got really bad when in 2012 a congressman called for reducing the information Congress was publishing! That at least led to a very flattering Washington Post article about my work. The Post story lit the fire under House leadership and led to the formation of the House Bulk Data Task Force. Advocates formed a new Congressional Data Coalition in 2014, spearheaded by Daniel Schuman (then at CREW), and we secured favorable language in the FY2015 legislative branch appropriations bill to keep the pressure on the Task Force to make legislative data available. The Task Force during this time made some progress, but without the engagement of the Senate its impact was limited (to half of what Congress does), and it did not appear that the Senate would engage with open data any time soon.
And then the Senate said ok.
If this post seems familiar, it’s because I celebrated success in a blog post in December 2014 when the Senate finally said yes to open legislative data.
What changed? Who made the decision? Did some web manager, or a partisan committee staffer, finally relent? I truly have no idea.
The Senate’s OK signaled that a culture shift had completed — that the institutional staff, their managers, and the relevant partisan staff working for the representatives and senators who are really in charge, were no longer afraid of greater public access to congressional records.
Over the next year, staff from the House’s Office of the Clerk, counterparts in the Senate, and staff from the U.S. Government Publishing Office and the Library of Congress worked together in a remarkable cross-agency collaboration to publish comprehensive information on the status of legislation pending before Congress. (This is information they always had. It was just the public publication in a structured data format that was new.)
If that makes it sound like it should have been an easy task, it wasn’t. Legislative information is stored in at least six separate systems in different legislative branch agencies, some of which go back to 1973. Making all of those systems work together to create something really wonderful for the public took real effort. They did it though. They created some of the best open government data I have seen.
Rep. Steny Hoyer announced the availability of the new data at a small meeting on the Hill, and the data came online in February of this year.
This is a major milestone for open government.
With the new data available and, well, because THOMAS was shut down (more on that from Daniel), this week I turned off the THOMAS.gov scraper and GovTrack began ingesting the new open data from Congress.
If all goes well, nothing will change for GovTrack’s users. It’s the same information we’ve always had, but we’re getting it in a better way.
It’s also a way that makes it easier for other people to build their own GovTrack with data straight from the source. THOMAS.gov was just 6 years old when I started. GovTrack is now 12 years old! I hope to be disrupted any time now.
Is this what I wanted?
Open government data has come a long way. It’s a global and distributed movement now.
I realize now that I made a mistake early on.
My natural tendency with new ideas is to break things down to principles. I spent a lot of time writing about the defining qualities of open government data. While I was typically careful to say that I was being descriptive and not normative, my writings and my work were not usually understood that way, and I got swept up with the predominantly normative open government data movement, which says there should be more open government data, that open government data is a good in and of itself.
My mistake was getting caught up with the normative movement.
Open data is great when there is a need. And there are tons of needs. (GovTrack has millions of users now — that’s a need.) But it’s a resource drain, both for government and for advocates, when it happens merely to satisfy open data advocates and presidential orders (which happens all too often).
I’m encouraged, however, by recent changes in the movement. If the first decade of the movement was characterized by internal angst over who benefits (Mike Gurstein long warned about unintended consequences), this next phase has been characterized by new inward-facing questions: whether everything should be built by us, and whether our movement has been welcoming to and empathetic toward issues facing women, minorities, the disabled, and other disempowered groups, and to people we disagree with politically.
I still believe in open government data. It is an extension of an open government movement which dates back to the 1940s. Open government and access to information about government is a proven critical part of maintaining a healthy democracy. No question.
But in government, the question is rarely whether to do things but in what order the taxpayers’ money should be spent on them.
And I don’t have an answer to that.