Poor quality of documents harms government transparency

Pete Forsyth
Wiki Strategies
Feb 3, 2018

The Glenn Simpson/Fusion GPS interviews

Recently, two committees of the U.S. Congress published transcripts of their interviews with Glenn Simpson (of the research firm Fusion GPS, which produced the infamous “Trump-Russia dossier”), who investigated connections between Donald Trump and Russian organized crime figures during the 2016 election cycle. These documents are full of fascinating information. Access to primary sources like this is vitally important to an open society; it supports meaningful participation in our democratic government, whether by individual citizens, media organizations, or interest groups.

But such documents should take advantage of modern computer technology — such as open formats, spell checking, readiness for machine translation, etc. In some respects, the quality of the documents provided by the government is really poor — in ways that are trivially easy to correct.

Wikisource is a lesser-known web site that complements Wikipedia, republishing texts that are free of copyright restrictions.

Fortunately, web sites like Wikipedia and the related Wikisource enable those interested in the documents to produce better (i.e., more accessible) versions of them, and otherwise address the problems introduced by the government-provided versions. But the work is painstaking, and could be greatly facilitated with little effort by the government agencies releasing the documents. If transparency is truly the goal, our government agencies should take a few simple steps to update their practices.

Google Translate is a powerful tool…for text that is readily accessible on the web. But it’s tough to use on PDF documents, especially when they’re full of typos and extraneous header text.

If we want to use a Google or Bing search to find documents containing certain names, typos in the transcripts matter. If a Russian speaker who might have relevant evidence wants to do a quick machine translation of the document, document quality matters. If a blind person wants to generate an audio version of the documents to “read” them, document quality matters. Or even if a sighted, English-speaking American citizen or journalist wants to read the testimony in a pleasant format on a Kindle or iPad, the quality of documents impacts that as well.

Fortunately, the copyright-free nature of works of the Federal government enables others to republish the documents with corrections, improvements, and annotations. Sites like Wikisource enable “the crowd” to do so in a collaborative fashion. What about the possibility of new errors being introduced, you might ask — either by mistake or for deceptive purposes? Specific, page-by-page reference to the original document addresses that possibility. A critical Wikisource reader can use side-by-side comparison of the transcription and the original to check for deviation.
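To make that idea concrete, here is a minimal sketch, in Python, of the kind of side-by-side check a skeptical reader could run, assuming plain-text copies of one page from each version (the file names are hypothetical, and the script is my illustration rather than part of any official Wikisource workflow):

    import difflib

    # Hypothetical plain-text exports of the same page from each version.
    original = open("page_042_original.txt", encoding="utf-8").read().splitlines()
    transcribed = open("page_042_wikisource.txt", encoding="utf-8").read().splitlines()

    # Any line the transcription changed, added, or dropped shows up in the diff.
    for line in difflib.unified_diff(original, transcribed,
                                     fromfile="original PDF text",
                                     tofile="Wikisource transcription",
                                     lineterm=""):
        print(line)

If the transcription silently altered a sentence, it shows up as a changed line; if the two match, the diff is empty.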

What the problems are

What kinds of problems crop up in these documents? Here are a few examples:

Try searching for “Comey” and “Corney” yourself in the original PDF.
  • James Comey is mentioned seven times. In every instance, his name is transcribed as “Corney” — that’s C-O-R-N-E-Y — rather than “Comey”. Clearly at fault, in this case, is the Adobe Acrobat Optical Character Recognition software used to transcribe the text. (House Intelligence)
  • Alimzhan Tokhtakhunov is identified as “Tokhtakhovno”. This (and those below) was human error by the transcription service. (House Intelligence)
  • Russian-born entrepreneur Alex Shnaider is identified as “Alex Schnaider”. (House Intelligence)
  • US law firm BakerHostetler is consistently spelled with an extra space in the name. (Both transcripts)

Such issues limit people’s ability to find the names via Internet search. On a basic search, you might falsely conclude that these people and organizations were not mentioned in the interview.
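Here is a rough sketch of what that looks like in practice, assuming you have extracted the transcript to a plain-text file (the file name and the list of names are just illustrations): an exact search for the correct spellings fails, while a fuzzy match still surfaces the mangled variants.

    import difflib
    import re

    # Hypothetical plain-text extraction of the House Intelligence transcript.
    ocr_text = open("house_intel_simpson_transcript.txt", encoding="utf-8").read()

    # Correct spellings of names mentioned in the interview.
    names = ["Comey", "Tokhtakhunov", "Shnaider", "BakerHostetler"]

    tokens = list(set(re.findall(r"[A-Za-z]+", ocr_text)))
    for name in names:
        if name in ocr_text:
            print(f"{name}: found by exact search")
        else:
            # Exact search fails; look for close variants such as "Corney".
            close = difflib.get_close_matches(name, tokens, n=3, cutoff=0.7)
            print(f"{name}: not found exactly; similar tokens: {close}")

A casual searcher, of course, runs only the exact search, and walks away thinking the name never came up.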

Beyond the inaccuracies, there are some strange judgment calls about what is presented to the public:

  • Do we the people, who ultimately funded the interview and its publication, get any value from Alderson Court Reporting, the company that produced the Senate Judiciary transcript, advertising its web site and phone number on every single one of the transcript’s 312 pages? I rather doubt it. (Senate Judiciary)
  • By contrast, do we have an interest in what company produced the House Intelligence Committee transcript? Because as far as I can tell, it isn’t identified anywhere in the document itself, nor in the committee’s press release. It may be a small detail, but considering the errors I identified above, wouldn’t it be worthwhile to know what company is responsible for them? That small piece of information would make it easier to correct errors, and the processes behind them, once they are identified. (House Intelligence)

How wiki volunteers address these things

Wikipedia and its sister sites, like Wikimedia Commons and Wikisource, are powerful platforms for organizing information and making it more accessible to the public, and to journalism, academic, and advocacy organizations.

These sites are built and curated by hundreds of thousands of volunteers.

Let’s take the House Intelligence Committee interview as an example.

The original PDF file provided by the committee was uploaded to Wikimedia Commons, with useful metadata identifying the original source, author, etc.

Then, on Wikisource, volunteers undertook the painstaking process of transcribing every page of the 165-page document, taking note of typos, marking page headings, etc. The result is a web-friendly version of the document, in which each page has been proofread by at least two people. Important names and topics are linked to relevant Wikipedia articles. Simple typos introduced by the Optical Character Recognition software used by the Committee’s transcription service are corrected. Typos introduced by the transcription service itself are left intact, but are marked up so that they link to the correct Wikipedia article.

If you want to peek under the hood, this page is our “workshop”. Yellow highlighting indicates that a page has been proofread by one person; green indicates that two have reviewed it. Uncolored pages have not yet been proofread.

This web-friendly version will show up well in any web browser — on your phone, tablet, or PC — but more than that, Wikisource offers convenient links to download the corrected and improved version as a PDF, or in a format for your Kindle or tablet.

On Wikipedia, typos like “Alex Schnaider” (which was transcribed with an extraneous “c”) are addressed with a “redirect”. This means that if you search for the incorrect spelling on Wikipedia, you will be smoothly redirected to the correct article.
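If you want to see a redirect at work programmatically, here is a small sketch that asks the MediaWiki API where the misspelled title lands (this assumes the redirect described above exists; the script is my illustration, not something you need in order to benefit from redirects):

    import json
    import urllib.parse
    import urllib.request

    # Ask the MediaWiki API what happens when you look up the misspelled title.
    params = urllib.parse.urlencode({
        "action": "query",
        "titles": "Alex Schnaider",   # the transcript's misspelling
        "redirects": 1,               # resolve redirects and report them
        "format": "json",
    })
    url = "https://en.wikipedia.org/w/api.php?" + params
    req = urllib.request.Request(url, headers={"User-Agent": "doc-quality-demo/0.1"})
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)

    for r in data["query"].get("redirects", []):
        print(f'"{r["from"]}" redirects to "{r["to"]}"')

The same lookup works for human readers: type the wrong spelling into Wikipedia’s search box and you land on the right article anyway.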

As I type, wiki volunteers are hard at work transcribing the controversial Nunes Memo, which was just published today. By the time you read this, the full transcription should be available here.

Conclusions

None of the problems identified above is, in itself, enormous. But any one of them has the potential for undesirable consequences that we might never see. If a document is a little less accessible to somebody who might be able to put it to good use, we might never benefit from their contribution.

If your child routinely misspells a word, you might be able to understand what they mean. But we don’t typically let that stuff go. A teacher or a parent will point it out. Why? Because down the road, misspelling that word might cost that kid a job, or might earn them derision from a particular reader.

For the same reason, we should insist that our government agencies make appropriate use of available technology, in order to be more genuinely transparent.

When a government agency publishes a document, the first step is to type it up. If it is then printed out, scanned, and then turned into a PDF document for public dissemination, much information is lost. The word processor knows how to distinguish a header from the main text, and knows that the person typed “COMEY” rather than “CORNEY”. But by printing and scanning, that information is lost. And when a Wikisource volunteer goes to transcribe the text, they have to reproduce that work that was already done, at taxpayer expense, by the transcription service.
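You can check the difference yourself. Here is a rough sketch, using the third-party pypdf library and a hypothetical file name, that reports whether a PDF still carries an extractable text layer; a born-digital PDF generally does, while a print-and-scan PDF usually does not, which is exactly why OCR has to guess at the text afterward:

    from pypdf import PdfReader   # third-party library: pip install pypdf

    reader = PdfReader("simpson_transcript.pdf")   # hypothetical file name
    empty_pages = 0
    for page in reader.pages:
        text = page.extract_text() or ""
        if not text.strip():
            empty_pages += 1

    print(f"{len(reader.pages)} pages, {empty_pages} with no extractable text layer")
    # Pages with no text layer are just pictures of text; anything searchable
    # has to be recreated by OCR afterward, which is where "Corney" creeps in.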

So, what’s the easy solution? It’s fine for an agency to publish the kind of PDFs that they currently do, and maybe there is good reason to do so. But they should also publish the original word processing document that generated it, in parallel.

There is really only one downside to doing so in a case like this, and it probably plays a significant role in these decisions: word processing documents tend to preserve editing history, so redacted text might live on in the “track changes” history of such a file. So respecting the redaction of classified material might require a bit of extra effort to make sure nothing slips through.

But safeguards against that kind of problem are not impossible to set up. And a concern like that shouldn’t hold us back from taking advantage of the computing technology of the 21st century.
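For example, here is a minimal sketch of one such safeguard (my illustration, with a hypothetical file name): before a Word document is published, check whether its main XML part still contains tracked insertions, deletions, or comment anchors, which is where redacted text could linger.

    import zipfile

    # A .docx file is a zip archive; the main text lives in word/document.xml.
    with zipfile.ZipFile("interview_transcript.docx") as docx:
        xml = docx.read("word/document.xml").decode("utf-8")

    # Tracked insertions/deletions and comment anchors use these element names.
    leftovers = [tag for tag in ("<w:ins ", "<w:del ", "<w:comment") if tag in xml]
    if leftovers:
        print("Warning: revision or comment markup still present:", leftovers)
    else:
        print("No tracked-change markup found in the main document part.")

A check like this takes seconds to run, and it is the kind of step a publishing workflow could automate.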

And, one final point: as we learned when the Trump Administration took the White House Petitions website offline for nearly a year, publication by the government may or may not be permanent. When volunteers for wiki sites do this work, it also guarantees that a high-quality version of the document will stay online…even if the original should get pulled by the government. In this way, independent sites like Wikisource add an extra guarantee to the notion of government “by the people” and “for the people”.


Pete Forsyth
Wiki Strategies

Wikipedia expert, consultant, and trainer. Designed & taught 6 week online Wikipedia course. Principal, Wiki Strategies. http://wikistrategies.net/pete-forsyth