Monitoring Legislatures: The Long Game

or, Lessons from my first 3 years working on open data standards

These are my notes for a talk at g0v Summit in Taipei in May 2016. You can watch the video or read the abstract. The talk is broken into:

  • Scraping is unsustainable: A story
  • Three sustainable alternatives to scraping
  • Sidebar: Machine-readable data is not enough
  • Getting your standard adopted: A story
  • 10 strategies for standards adoption
  • Building an ecosystem for standardized data: A story
  • 6 strategies for standards development
Photo Credit: Eric Mill

I’m going to talk about how we can use data standards to stop scraping, which is one of the big barriers to achieving sustainability in civic tech. (A data standard just means always expressing the same information the same way, so that different systems can read and use that information.)
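To make "expressing the same information the same way" concrete, here is a minimal sketch (with invented inputs): three sources write the same date three ways, and normalizing to ISO 8601 lets any system read it identically.

```python
from datetime import datetime

# Three sources expressing the same date differently (invented examples).
RAW = ["May 14, 2016", "14/05/2016", "2016-05-14"]
FORMATS = ["%B %d, %Y", "%d/%m/%Y", "%Y-%m-%d"]

def to_iso(value):
    """Normalize a date string to the ISO 8601 standard form."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {value}")

print([to_iso(v) for v in RAW])  # all become '2016-05-14'
```

Once every publisher agrees on the target form, the consumer-side guessing (the `FORMATS` list) disappears entirely.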

Scraping is unsustainable: A story

The problem with scraping is that if the page layout changes, or if pages are moved, then the scraper stops working, and you need to fix it — and fixing scrapers can be expensive over the long-term.
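A toy scraper shows why. This sketch (the page structure is invented) pulls councillor names out of an HTML listing; if a redesign renames the `councillor` class or restructures the list, it returns nothing, silently.

```python
import xml.etree.ElementTree as ET

# A hypothetical government listing page.
HTML = """<ul>
  <li class="councillor"><span class="name">Jane Doe</span></li>
  <li class="councillor"><span class="name">John Smith</span></li>
</ul>"""

def scrape_names(page):
    """Extract names, relying on the page's current class names."""
    root = ET.fromstring(page)
    return [span.text
            for li in root.findall("li") if li.get("class") == "councillor"
            for span in li.findall("span") if span.get("class") == "name"]

print(scrape_names(HTML))  # ['Jane Doe', 'John Smith']
```

Every selector here is a bet on the site never changing; multiply that by dozens of sites and the maintenance cost compounds.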

Represent Civic Information API

To illustrate the challenges with scraping, let’s look at what a common, non-profit civic tech project looks like today. I’ll use Open North’s Represent as an example, a service I created with Michael Mulley. Represent lets you submit your home address to find who represents you. For this to work, we need information on representatives.

Open Data Hack Day, December 10, 2011 (Montréal, Québec)

At the time, no government was publishing this information as data, so we had to scrape it. We did the usual thing: we held a hackathon. We used morph.io’s predecessor, ScraperWiki, so that volunteers could write a scraper in their language of choice—so long as they saved the data in the same way. We scraped dozens of government websites like this.

Open Civic Data

However, maintaining dozens of scrapers, in three languages, in different code styles, is really annoying. In 2013, we switched to Pupa, a scraping framework created by Sunlight Labs. Now all our scrapers were in one language, in a consistent style. We no longer worried about common scraping concerns, because Pupa handled them for us. We could focus instead on extracting data from webpages. This let us scale to 100 scrapers, requiring less than a day per month to maintain.

This is a pattern I’ve seen a lot, whether it’s Represent, EveryPolitician or OpenCorporates. First, you start with one scraper. Then you realize, “Hey, I’m good at this,” so you write more scrapers for more jurisdictions. You build new tools to be more efficient. EveryPolitician has a sophisticated toolset, which has scaled to hundreds of scrapers.¹ But there are limits. Inevitably, a project ends up spending a lot of time fixing broken scrapers, whenever the government changes the way the website looks.

And the challenge, for monitoring legislatures, is that there are not just hundreds of governments in the world; there are hundreds of thousands. In the US alone, there are 89,000 local governments. Google has an API for elected representatives, but they are far from having all local governments in the US. Getting there is a problem for everyone, including Google.

Three sustainable alternatives to scraping

So what’s the future for these scraping projects? Let’s imagine:

It’s 2035. We’ve sent a human mission to Mars. We’ve mitigated climate change. People get around in self-driving electric cars. And, somewhere in Estonia, Tony Bowden is fixing scrapers for EveryPolitician.

How do we avoid ending up there? There are alternatives to scraping. The options I see are: artificial intelligence, crowdsourcing and standards.

With AI, the idea is to train a robot to extract the names, photos, and contact details of representatives from webpages, which may present each item of information in many different ways. I don’t know anyone using AI in cases needing 100% accuracy, like electoral representation, but it’s an option.

Crowdsourcing works if you’re not in a hurry to get the data, if you’re okay with never getting some of the data, or if your task isn’t that big — like if you only need to do it once. Wikipedia, for example, will eventually update the representatives for Canada’s federal and provincial governments, but it almost never has information for local governments.

So, going over our options: With AI, we outsource the work to robots. With crowdsourcing, we outsource to humans. With standards, we throw the work back to the publishers — which is an obvious option, right? Governments are already publishing this information, but not in standard, machine-readable data formats. If they did that, we would never have to scrape that data ever again, and we could focus on other things. That’s why I think standards are important—yet few people are working on them.

Sidebar: Machine-readable data is not enough

Before diving into standards, I want to be clear that machine-readable data on its own is not enough. It’s better than a webpage, but if every government is publishing data in different ways, we’re still losing.

Reconciling the different data formats of 8 APIs
Machine-readable data isn’t enough—we need standards

For example, in 2015, I worked on Influence Mapping, a project to bring together organizations building databases of how people in power are connected. There are dozens of databases, they all provide APIs, and they all do it differently. If you want to find out who has information on Justin Trudeau, you need to visit each website, read each API’s documentation, and write code to send requests and process responses for each one.

What I did was create a single, unifying API, so that you only have to learn one request format and one response format. The API proxies your request to the other APIs, and reformats the responses for you. Like scraping, there are limits to how far this can scale. The tool I created was just a proof-of-concept. The goal was for future APIs to adopt the standardized format.
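The adapter pattern at the heart of that proxy can be sketched like this (the upstream response shapes and source names are invented for illustration): each adapter maps one API’s format onto a single shared schema, so clients learn only one format.

```python
# Hypothetical upstream response shapes from two influence-mapping APIs.
def from_api_a(resp):
    # e.g. {"full_name": ..., "org": ...}
    return {"name": resp["full_name"], "organization": resp["org"]}

def from_api_b(resp):
    # e.g. {"person": {"name": ...}, "employer": ...}
    return {"name": resp["person"]["name"], "organization": resp["employer"]}

ADAPTERS = {"api_a": from_api_a, "api_b": from_api_b}

def unified_lookup(source, raw_response):
    """Reformat an upstream response into the shared schema."""
    return ADAPTERS[source](raw_response)

print(unified_lookup("api_a", {"full_name": "Justin Trudeau", "org": "Parliament"}))
```

The limit to scaling is visible in the code: every new upstream API means writing and maintaining another adapter, which is why the real goal was for future APIs to emit the standardized format themselves.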

Getting your standard adopted: A story

With that out of the way, how do you get started on standards? I’ll tell another story about Represent. Way back in 2012, an open data lead at a city asked me in what format I wanted the data on representatives. I shared a template spreadsheet, and the data was published. He shared the template with other open data leads through several local government networks, but only one other government used it.

One-page description of the Represent CSV Schema

In 2014, we got a grant to promote the template. We ran a modest campaign: we prepared a one-pager, a carefully crafted canned email, and a landing page with the logos of adopters. We shared this with a friendly open data lead to present at a meeting of a local government network, which was attended by other leads. Afterward, we sent the email to each lead, linking to the landing page, with the one-pager attached.

We got early adopters. A few weeks later, we sent follow-up messages asking the leads who hadn’t responded if they had a chance to read the information we had sent. We also shared, “So far this month, [this city of the same size, this neighboring city, etc.] adopted the standard.” We got more commitments to adopt. If a government committed to adopt, but we didn’t hear back within a month, we followed up to see if there was anything we could do to help. We kept this going for a year.

Within three months, we had 16 adopters out of the 62 local governments with open data, or 25%. By the end of the year, we had 23 adopters,² who represent over a quarter of the population of Canada. That’s a quarter of the population whose representatives’ data we never need to worry about again. That’s also a quarter of the scrapers that we no longer need to fix.

And we did all that with email. (More generally, it’s incredible to me how much we can accomplish by sending emails.)

10 strategies for standards adoption

It’s useful to look at what contributed to this success. Let’s first look at what’s transferable to other campaigns:

1. Use social proof. This is like when Facebook tells you which of your friends are attending an event, or when a product lists companies using it. Social proof is everywhere in sales. When our campaign launched, we already had two adopters — and those adopters were leaders in open data in Canada. Throughout the campaign, we added new adopters to our emails, to show more proof.

2. It’s good to have friends on the inside. In this campaign, we knew an open data lead who could pitch the standard to other leads. They could give us insight on what was working, so that we could refine our message.

3. Build momentum (or the illusion of momentum). When we followed up with governments, we told them about the new adopters since the last time we wrote to them, to give them a sense that, “Hey, this bandwagon’s moving and you’re not on it.” Another strategy is to have your message appear across multiple sources of information that your targets use. For example, in Canada — I think the issue was the clear-cutting of forests — environmental groups bought billboard space on the road that the CEO of a logging company drove to work on. The CEO would start and end every day with a reminder of the issue. In addition to news from other sources, some of which was seeded by the environmental groups, this created a sense that the issue wasn’t the concern of just a few, and he started to take it seriously.

With a standard, you can also create an illusion that everyone is talking about it. You can write an article in a publication for governments. You can find an organization that offers webinars and training to governments, and add your content. You can find government networks, and present at their meetings or conference calls. You can announce whenever a new government adopts the standard — it can be a tweet or a press release, depending on how major the announcement is. Even if a government isn’t attending the meeting or reading the article, they’ll catch on that it’s an issue, such that when you contact them directly, they will be familiar. It’s a lot easier for people to say “no” to an idea that’s not familiar yet.

4. Use peer pressure. In our follow-up messages, we specifically mentioned adoption and commitments from nearby cities and from cities of similar sizes. Sometimes, a small city will think, “Oh, only big cities can afford to adopt this standard.” Showing that other small cities are doing it is convincing. It can also be embarrassing to big cities if small towns do something they aren’t.

5. Use your reputation. Open North had a reputation in Canada as experts in open data. If you don’t yet have a reputation, you can find organizations or people who do. You can get their endorsement for what you’re doing. Their authority can convince governments, even if they aren’t directly involved in the campaign.

6. Be polite and helpful. Implementing a standard almost always requires a cooperative approach. If you fight for transparency in political donations, you can take an adversarial approach. But if a government already publishes data on political donations, and you want them to implement a standard, you need to use a different approach. To establish a cooperative environment, you can communicate and demonstrate that you are acting in good faith, that you are open to understanding the local conditions and nuances that are barriers to adoption, that you accept that there will be challenges and delays, but that you are all working together toward a common goal. When following up on commitments, we worded our messages as, “Is there anything we can do to help?” It wasn’t like, “What’s taking so long?”

7. External validation works. I said earlier that it’s useful to have an insider. It’s sad, but governments often don’t use good ideas that come from within. I talk to many public servants who share good ideas, which I sometimes repeat in newspaper editorials about what the government should do. It’s only then that their idea is taken seriously. So, if you work in government, share your ideas with outsiders, so that they can help you.

Patiner Montréal

8. Show that people will use the data. Governments fear that they will put in the effort to adopt a standard, and then no one will use the data. In our story, we pointed to Represent, which was used millions of times per year by dozens of recognized organizations.

But your evidence doesn’t need to be that strong. When working to convince my local government to adopt open data in 2010, I created a website to let people know which outdoor skating rinks were open, using data I extracted from PDFs. This got the government’s attention and demonstrated a demand for data. Building just one small, useful thing can be enough. And if the data for your demonstration doesn’t exist yet, you can use data from another government, or you can make it up.

9. Respect the governments’ and users’ needs. In Represent, this meant making the column headers human-readable (like “Source URL” instead of “source_url”), since humans use open data, not just machines. The columns were also limited to only those we had a use for. We didn’t include columns that would be nice but that we didn’t need.
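That design choice is easy to show. This sketch writes a row with the human-readable headers described above (the rows and the exact column list are invented; the real schema differs):

```python
import csv
import io

# Hypothetical rows, using human-readable headers like "Source URL"
# rather than machine-style ones like "source_url".
rows = [
    {"District name": "Ward 1", "First name": "Jane",
     "Last name": "Doe", "Source URL": "https://example.gov/council/ward-1"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

A spreadsheet user can read those headers without documentation, and a program can still key off them exactly; nothing is lost by choosing the human-friendly form.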

10. Set a deadline. We didn’t use this for Represent, but you can use Open Data Day or other landmark events to motivate governments. Once a few governments have committed to adopting the standard by the deadline, you can use that to motivate more governments. Otherwise, it’s easy to postpone a low priority activity indefinitely.

So that’s the story of Represent. Now we can come to the story of Popolo.

Building an ecosystem for standardized data: A story

Sample Popolo data

Popolo is a standard I created. It’s mainly used for legislative data: who represents you, what do they vote on, when do they meet, and so on. For example, here’s my Prime Minister Justin Trudeau as Popolo JSON. Here’s the story of Popolo:

  • In 2011 in Warsaw, at Open Knowledge International’s Open Government Data Camp, Tom Steinberg of mySociety observed that in civic tech, different groups kept building the same technology; groups in different countries had their own parliamentary monitors, but there was no sharing of code. Recognizing that you can’t transplant a website that monitors the UK government and have it work as-is in a new context, his idea was to break up the problem into small components that could be transplanted — like a component that just tracked people and the organizations they belong to.
  • In 2012, I started building a website to monitor my local government. But I didn’t want to have to rebuild the website for each future city that I wanted to monitor. So, I started looking for data standards that would work in different contexts. I discovered that the W3C was working on a standard to express data about organizations, so I joined the working group and learned a ton about data modelling and the standardization process.
  • At the same time, I researched and evaluated all relevant standards that I could find: dozens of standards, maybe over a hundred. None of them did everything parliamentary monitors needed. So I started stitching the best standards together to fit their needs.
  • A few months later, in early 2013, I learned mySociety and the Sunlight Foundation got grants to build reusable tools for government monitoring — like Tom’s components from 2011. I started conversations with them about adopting a common standard. I documented my earlier work, and this became Popolo, which both adopted in their new tools.
Group photo, PoplusCon, 2014
  • A year later, in 2014 in Santiago, Ciudadano Inteligente and mySociety hosted PoplusCon, for parliamentary monitors to discuss how to work together better and how to pursue a components strategy. I shared Popolo, and held a session to standardize parliamentary voting data, for which no good standard existed.
  • In 2015, things really got rolling. mySociety launched EveryPolitician, with elected officials from 233 countries, all in Popolo format. Kohovolit published data on 8 European parliaments in Popolo format. Councilmatic launched in New York City and relaunched in Chicago—reusing the Sunlight Foundation’s tools that adopted Popolo. OpenAustralia updated TheyVoteForYou— a legislative vote tracking tool—to import Popolo data, and supported its adaptation in Ukraine.

Now it’s 2016, and I never got around to building that website to monitor my local government. But I did learn a lot about making a successful standard.
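Before the strategies, it helps to see what Popolo data actually looks like. Here is an abridged sketch of a person record built in Python (the identifiers and constituency are invented for illustration; the Popolo specification defines many more fields):

```python
import json

# Abridged, illustrative Popolo person; IDs are invented.
person = {
    "id": "person/justin-trudeau",
    "name": "Justin Trudeau",
    "memberships": [
        {
            "organization_id": "organization/house-of-commons",
            "role": "Member of Parliament",
            "area": {"name": "Papineau"},
        }
    ],
}
print(json.dumps(person, indent=2))
```

Because people, organizations and memberships are separate, linked objects, the same shape works for a parliament, a city council, or a company board.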

6 strategies for standards development

1. Build on prior work. Something I see a lot: a developer recognizes that all existing standards are incomplete, over-complicated, under-documented, and unclear, and they figure, “I’ll create my own!” But sometimes standards are complicated because the world is complicated. Sometimes a standard uses an unclear term because an older standard used that term. And in any case, the standard author knew something about what they were doing; it’s important to understand why choices were made in the design of a standard, to avoid re-introducing an error in your standard. Another benefit to building on prior work is that it gives you instant credibility. Popolo re-uses and extends standards by the W3C, IETF, Dublin Core and GeoNames — which are recognized names in standards.

2. Build an ecosystem. When talking earlier about promoting adoption of the Represent CSV Schema, one strategy was to show that people will use the data — that there is some immediate benefit to adopting the standard. With Popolo, if you transform your data into this format, you can import it into open-source tools like Councilmatic or TheyVoteForYou. That’s an incentive. It’s like GTFS: if a city publishes its transit data in GTFS, Google will import it, and the city gets a trip planner for free through Google Maps. For Popolo, it took about three years for compelling benefits to emerge.

3. Target the bottom of the stack. mySociety and the Sunlight Foundation were building low-level tools to collect, manage and then expose data via an API. The users of each API didn’t care much about its data format, and each API had users, like Councilmatic, that didn’t even know about Popolo. We could get more people to adopt the standard, without them knowing it, by getting the low-level tools to adopt it first.

4. Value face-to-face meetings. Like in so many contexts, meeting people makes it easier to collaborate in the future. I don’t think Popolo would be as popular today if I didn’t meet a lot of the adopters at events like PoplusCon.

5. Use your reputation. In 2012, I didn’t have the credibility in civic tech that I do now. Getting early adopters like mySociety and the Sunlight Foundation was a way to build a reputation and convince more people to adopt.

6. Expertise counts. I started learning about standards in late 2012, and I got my first adopters in early 2013. It didn’t take long to become an expert. There really aren’t many people who are. The time invested pays off. A great way to develop expertise is to join or follow a working group at the W3C, IETF, or elsewhere. The W3C Community Groups are one accessible option.

I’ve talked a lot about why standards matter and how to get them adopted. I could have a whole other talk about how to design and develop a standard, discover and evaluate prior work, identify stakeholders, document use cases, build consensus, manage expectations, and so on.

But my goal here is to make you think twice before scraping. “Don’t wait, scrape” is a Civic Pattern, and it’s a good one. When starting a project, when building a proof-of-concept, when creating a one-time app, you should scrape. If the data never changes, scraping is good, too. And scraping is very much in the spirit of open source. “If you don’t like it, fork it!”

But if your goal is to build something that’s going to be collecting data for a long time, like a government monitoring tool, think about how to identify, rally around, and target common standards. There’s some momentum for building shared code on top of shared data, and we’re starting to see real success stories. And so my closing message is to think long-term about how you collect the data; don’t plan on scraping forever.

¹ mySociety now follows a crowdsourcing approach in Democratic Commons.

² Today, without additional effort, there are 35 adopters.