Turning hand-written forms into structured data

At The Engine Room’s latest replication sprint I had the chance to work with a team of amazing people to deploy a tool for coordinating a community of volunteers who transcribe scans of public documents and turn them into structured open data. It soon turned out that the existing tool we had been working with was not enough for the more complex cases we had to deal with.

Mapping features at the Engine Room’s sprint in Croatia

By the end of the sprint we wanted Hungarian K-Monitor to be able to transcribe Members of Parliament’s hand-written asset declarations, and Ukrainian Opora to be able to transcribe political parties’ donation reports in PDF format. We only managed to deliver a working website for K-Monitor, and not without some hacks and tricks. Given that experience and the demand from other organizations, it became clear that we needed a more robust tool dedicated to such cases. At the end of the sprint an idea was born to meet once again, this time solely to plan and develop new features, without tailoring them to the needs of any specific organization. We also set ourselves quite an ambitious goal: to prepare the tool in such a way that its basic functionality could be adjusted for a new organization within one day.

That’s how we ended up at a development sprint in Jahorina, Bosnia, just before the regional POINT conference, working on a more robust tool, codenamed Moonsheep. We engaged partners with past experience in the topic: Engine Room, which organized two replication sprints and performed a thorough evaluation of existing tools; Open Data Kosovo, which supported Engine Room in the Quien Compro implementation and recently created the Decode Darfur microtasking website for Amnesty International; and K-Monitor, which had practical experience with transcribing and verifying data using Vagyonnyilatkozatok, a website developed at the last sprint. The resulting team could not have been better prepared for the task in front of us:

  • Alan Zard — Engine Room’s Techno Gardener, UI and frontend specialist
  • Attila Juhasz — K-Monitor’s subject matter expert
  • Krzysztof Madejski — Fundacja ePaństwo’s TransparenCEE Product Manager
  • Partin Imeri — Open Data Kosovo’s Software Engineer
  • Tin Geber — former Engine Room’s Design and Technology Lead

The sprint

We started the sprint by agreeing not to evaluate any existing tool. Instead, we planned to design an ideal tool that could be smoothly replicated by low-tech organizations, and to see where that would take us.

It’s surprising how much we did in just three days:

We defined two beneficiary roles at the organization adapting the tool:
 1) a product owner who is a subject matter expert
 2) a techie who is eager to employ tech tools, but may have limited coding experience

We defined a replication process involving the two roles above as well as external experts

We sketched a needs assessment survey that an organization can use to prepare itself for the replication sprint; it checks preconditions such as organizational readiness, expected project impact and data availability

We defined core features and broke them down into GitHub issues

We designed mockups for the most crucial parts of the tool

We assessed which existing codebases and tools could be used to build Moonsheep’s functionality

What’s next?

Having created such a detailed action plan, we now just need to execute it. We will most likely extend the functionality of the PyBossa crowdsourcing framework and comprehensively document how to adapt it for transcribing complex structured documents. By the end of August we want to have a version that is ready for testing.

We have planned the replication process in such a way that any organization could carry it out with limited external support. Nevertheless, we are aiming to organize a replication sprint just before Personal Democracy Forum Ukraine, happening on 25–26 September in Kyiv. We would bring together several interested organizations and support them in adapting Moonsheep to their needs.

Are you working with hand-written documents, or do you have to go through tons of scanned PDFs? Could you engage your fans in helping you transcribe documents? If all you are missing is technical support, then let us know.

Team at work

Originally published at transparencee.org.