How news media will take back control and shape the internet

A schema maintained by a consortium of news publishers would enable a range of innovations that harness new value from existing assets, with no retraining, and create new jobs in news. It would likely also change the trajectory of the wider internet, and of related Artificial Intelligence research, for the better.

Sach Wry
Digital Diplomacy


Interior of the Bibliotheca Alexandrina in Egypt
The present-day ‘Library of Alexandria’. Photo by Carsten Whimster under CC BY 3.0 via Wikimedia

Archaeology should not have been a thing. It is possible to have great respect for archaeologists, and for all we have learnt of history from their toils, and still feel that, ideally, archaeology should not have been required. It was required, though, largely because ever since inventing writing, humans have been bad not so much at record keeping as at helping those removed from their own context understand the records.

As an economist, I have spent time grappling with the value that society implicitly places upon things. Markets famously fail in several areas, such as teaching and housework. One of the urgent areas is news media, as is evident from the Twitter feeds of Margaret Sullivan, Jay Rosen and many others, and from opinion pieces by Tim Wu, Charlie Warzel and others. They compelled me to mull over and self-critique an assortment of smaller solutions over time. Then this Wired op-ed on Artificial Intelligence by Jaron Lanier and Glen Weyl inspired a broader new paradigm that can coherently encompass them.

What follows is a concrete model for news media that delivers what the pompous subtitle above teases. (The Semantic Web itself is not a radical innovation of course. This is just an implementation tailored to the specific crises facing news media, with its business rationale. It also speaks to the stalemate with search engines which might otherwise end in a very different internet well beyond news.)

In the current model, markets view what a writer and editor produce as text. For four centuries, text has been easily replicable. That is hardly the full measure of the value that a newspaper’s employees hold. If you chose to read a post tagged ‘journalism’, that last sentence probably seems a platitude. But what new parts of that value can you extract in real market terms?

Journalists working on a story know far more than the facts they’re reporting on. And not just amorphous ‘experience’. Even in that moment, they know where the story and its facts fit in that month, in that year, in that region, in that country. When they write an article, that information is not always included in the text for obvious reasons. Some years ago, ‘explainer’ journalism emerged to bring some of the context into the text (and only the text) in some of the articles. In every other piece, the granular understanding of the context is not included. The most obvious of the reasons is that a reader does not need it and space is at a premium. What is less obvious is that only the reader already on the page doesn’t need it and only the visible, human-readable space is at any premium.

By way of an analogy, consider everything we ever did since writing was invented. Humans have long been in the habit of letting contextual data slip away. (And destroying even limited attempts like the Libraries of Alexandria and Baghdad in wars). Enter archaeologists and historians.

We continued in the habit long after embedding data became extremely cheap and effortless. On the internet, text continues to be published just as on paper, with only the information new and necessary in the moment of writing (and for the reader who has already arrived at that item). Enter search engines.

When digital material grew into mountains of information, pushing the slightly older beneath thick sedimentary layers on a time frame of days rather than centuries, it was as impenetrable as unrecorded history. We needed archaeological sleuthing to help find it; that is to say, ‘index’ it all for us. Yet, because it was all digitised, we chose to take crude shortcuts (not ill-suited for 90s technology) and recruited machines. The PageRank algorithm was the machine equivalent of an archaeologist. Besides being way too unsophisticated (in terms of results) to merit the analogy to a breathing Humanities professor, it also set the world, online and off, irreversibly in the direction of that general ‘solution space’. And as has been said everywhere in recent years, unless we change now — and the time is ripe — the future is one of machines curating published content.

What will we change? Add machine-readable context. Eliminate, wherever it is easy, the need for algorithms that guess: ever-evolving algorithms, mutating in secret inside a black box, their slow evolution geared towards incrementally improving what essentially remains a statistical guess.

This will develop in layers, as all things digital do. For the first layer, all you need is what is called a schema. Simply attach unique IDs, not just to each article but to every newsworthy event and entity. Prototypes already exist in the likes of Wikidata: every topic covered on Wikipedia is a unique Wikidata ‘item’ with numerous relationships mapped.

Honestly, to get a real feel for this new media model, try a SPARQL query, or just glance through these examples. It is close enough to English to give a general idea.

Other potential parts of the first-layer schema might be: How confident are we in the reported information at this point — three sources of corroboration? Fewer? Is this the second update in a developing story? Does the authority of the op-ed contributor derive from academic expertise or professional/personal experience?
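To make the first layer concrete, here is a minimal sketch in Python of what such a record could carry. Every field name below is a hypothetical illustration, not an existing standard; the event ID reuses the invented example from later in this piece.

```python
import json
import uuid

# Hypothetical first-layer record under the proposed shared schema.
# The editor supplies context they already hold; the system attaches IDs.
def make_article_record(event_id, corroborating_sources, update_number,
                        authority_basis):
    """Bundle machine-readable context alongside the article text."""
    return {
        "article_id": str(uuid.uuid4()),     # unique ID for this article
        "event_id": event_id,                # shared ID for the news event
        "corroborating_sources": corroborating_sources,  # three? fewer?
        "update_number": update_number,      # e.g. 2 = second update in a story
        "authority_basis": authority_basis,  # 'academic', 'professional' or 'personal'
    }

record = make_article_record("event:b378af2550", corroborating_sources=3,
                             update_number=2, authority_basis="professional")
print(json.dumps(record, indent=2))
```

A machine reading this record no longer needs to guess how well-corroborated the report is, or where it sits in a developing story: the page declares it.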

Before I lose the journalists reading this: that may sound like yet more digital invasion of the workplace, but it is actually the route to a digital future with more human jobs for the specialist, trained personnel that journalists already are, in the Lanier/Weyl sense. For this is part of the new value we can tap from journalists. These IDs will not be mechanically assigned by machines.

Instead, editors and journalists will do what they do best every day: adjudicate what qualifies as a distinct event worthy of a new ID. Minimal retraining would be necessary. It would be the IT people’s job to design the system so that the schema stays behind the scenes. All that editors will do is debate their choices as they routinely do, and when their deliberations are done and they sit down to write or edit, simply click one extra button to either mark something as a subclass of something else or create a new class, not very different from workflows that already add (in-house) tags.
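Behind that one extra button, the system might do something as simple as the following sketch; the class and event names here are invented for illustration.

```python
# Sketch of the behind-the-scenes step when an editor files a story
# under an existing event class or mints a new one. All names invented.
class EventRegistry:
    def __init__(self):
        self.classes = {}  # event_id -> parent event_id (or None)

    def new_event(self, event_id, parent=None):
        """The editor's 'create new class' / 'mark as subclass' choice."""
        if parent is not None and parent not in self.classes:
            raise ValueError(f"unknown parent event: {parent}")
        self.classes[event_id] = parent
        return event_id

    def lineage(self, event_id):
        """Walk up the chain, the way in-house tags roll up to a story arc."""
        chain = []
        while event_id is not None:
            chain.append(event_id)
            event_id = self.classes[event_id]
        return chain

registry = EventRegistry()
registry.new_event("event:2020-hurricane-season")
registry.new_event("event:hurricane-laura", parent="event:2020-hurricane-season")
print(registry.lineage("event:hurricane-laura"))
# -> ['event:hurricane-laura', 'event:2020-hurricane-season']
```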

The potentially difficult part is that the system relies on as wide a consortium of media organisations as possible. When you think of the media empires that own numerous local outlets, it should really not be much more difficult than any other industry consortium. Such consortia are common in the development of standards, which is what this is. Perhaps that is an area where the news industry can emulate the tech industry, where collaborating with competitors on standards is just another Tuesday. For that is at the core of the new marketable value for society — the schema has to be shared by all news that deploys this model.

How is that value extracted? Both advertising and tech companies know all too well. Today, when you do a web search, you get 10 results on the first page, ranked above billions of others hidden on subsequent pages. The ranking algorithms are increasingly complex but, from a user’s perspective, remain much cruder, as we said, than even a single archaeologist.

This new system would allow search engines to lay out information by real relevance, not a semblance cobbled together by a machine. A news story develops over time. All articles under the consortium would tell computers the timeline, the degree of new information and analysis, the entities involved and so on, for any given news event. If the Storm Lake Times is the first to add a key update to a developing event ID#b378af2550, its primacy in results will not depend on whether search engines can statistically infer that from the words and the timestamp (they cannot). The page will just tell them so, in terms of professional opinion, regardless of which outlet used that ID first. That will go some distance towards levelling the playing field. Tech platforms may be reluctant to pay journalists for ‘mere text’. But they know advertisers will pay handsomely for contextually rich metadata.
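A sketch of how a search engine could rank coverage of one developing event purely from declared metadata; the outlets other than the Storm Lake Times, and all numbers, are invented.

```python
# Rank coverage of one developing event by declared metadata alone.
articles = [
    {"outlet": "BigCo Daily",      "event_id": "b378af2550", "update_number": 1},
    {"outlet": "Storm Lake Times", "event_id": "b378af2550", "update_number": 3},
    {"outlet": "Metro Herald",     "event_id": "b378af2550", "update_number": 2},
]

def coverage(articles, event_id):
    """Newest declared update first; no statistical guesswork needed."""
    relevant = [a for a in articles if a["event_id"] == event_id]
    return sorted(relevant, key=lambda a: a["update_number"], reverse=True)

print(coverage(articles, "b378af2550")[0]["outlet"])  # prints "Storm Lake Times"
```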

This is not just important to save newspapers, media, the Fourth Estate, democracy and all that. It is also a better direction for computing technology (or Artificial Intelligence if you prefer): to move away from guesstimating and then retelling us humans what we knew well back in the original context. To move towards building on top of what humans tell machines in great detail, at the scope and scale that only machines can see. (An elaboration of that is for another day, for other authors at Wired or Towards Data Science!)

It is hard to say here and now how the system might evolve. All too often, the recent history of Silicon Valley suggests, current paradigms soon recast new innovations in their own image. So it is likely that at some outlets you will see news events and people getting ratings or likes, as in an overused Black Mirror trope.

But having a publishers’ consortium in charge means they can direct it elsewhere. A better target for that sort of thing might be in-situ polling on policy items (bills, amendments). Even with politicians’ approval required, this can yield whole new segments of data. Compare a poll conducted by a single entity (with all kinds of sampling techniques) to reader responses collated automatically from a spectrum of outlets, from left to right, small to giant, local to national. The latter is of course not to supplant the former but to complement and inform it.

Mockup showing one co-benefit use case.
Mockup for a potential widget. The poll goes in where the bill is mentioned across the industry, across the political spectrum. Editors pick when to add it, publishers choose what to share with rival outlets. Users/analysts/policymakers/advertisers get a live snapshot of shifting opinions. [ Note: 1) though it gets an image, this is not the primary part of the new model. It is merely to illustrate potential further development and co-benefits at a mature stage. 2) IT folk, you’re wondering about authentication. Yes, given its cross-site nature, the ideal authentication mode might be phone OTPs, with explicit consent]
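Collating those reader responses across the consortium would be trivial once the bill carries a shared ID. A sketch, with invented outlet names and tallies:

```python
from collections import Counter

# Aggregate responses to the same bill poll (shared ID) collected at
# outlets across the spectrum. Outlet names and tallies are invented.
responses_by_outlet = {
    "Local Gazette": {"yes": 120, "no": 80},
    "National Post": {"yes": 900, "no": 1100},
    "City Tribune":  {"yes": 300, "no": 250},
}

total = Counter()
for tally in responses_by_outlet.values():
    total.update(tally)

print(dict(total))  # live cross-outlet snapshot for one bill
```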

For the next layer, examples might include:

  • Political inclination, for author and/or organisation, one data point self-assessed, the other peer-assessed. (Where on the Left–Right spectrum are you / is that other outlet, 0 to 10?) OK, fine, as a truce with the tech giants, also user (reader) ratings. Imagine being able to filter search results by that. Just as you toggle C/F on a weather website, being able to toggle L/C/R. After you’re done reading, give feedback on your own L–R assignment. Let’s leave firmly in the rear-view mirror the days when algorithms tried to ‘helpfully guess’ what you like reading and should read next, or tried to ascertain originality and other measures of value that the humans at the point of publishing are the best arbiters of.
  • What policy area / bill this relates to, which people are involved. [Image above.] Algorithms can classify this today with some success, but not without biases. A young African-American reporter and their female senior editor’s take on enumerating and relating policy areas, coordinated with their peers across the industry in real time, is far more valuable to societies and markets than an algorithm trained on even a single newspaper’s own decades-long corpus, let alone the entirety of US or English media.
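The L/C/R toggle above could be as simple as a range filter on declared scores. A sketch, with all titles and scores invented:

```python
# Filter results by declared political-inclination score
# (0 = Left, 10 = Right), the way a weather site toggles C/F.
results = [
    {"title": "Bill passes committee", "self_score": 3, "peer_score": 2},
    {"title": "Op-ed: a cautious yes", "self_score": 7, "peer_score": 7},
    {"title": "Explainer: what is in the bill", "self_score": 5, "peer_score": 5},
]

def toggle(results, band):
    """band: 'L' (0-3), 'C' (4-6) or 'R' (7-10), read off the peer score."""
    lo, hi = {"L": (0, 3), "C": (4, 6), "R": (7, 10)}[band]
    return [r for r in results if lo <= r["peer_score"] <= hi]

print([r["title"] for r in toggle(results, "C")])
```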

Incidentally, this is also a way to mine rising polarisation to approach that Holy Grail — more user engagement! Readers will see the aggregate and its relation to the curve at outlets of the ‘opposite’ political affiliation and feel driven to influence it. Equally, on occasion this can also counter polarisation. On some issues, readers will find that the curve for the rival outlet does not match their expectation, and the other side actually is not all that different.

Techies are today gatekeepers to published information because all that the media gives them to work with is text with minimal metadata (author, title, date, and in some cases, tags). Most outlets do not add those in a consistent machine-readable form. (I use Zotero.) And what we are talking about is more akin to tags that are common across outlets. It is long past time.

Give them contextually rich metadata and developers will naturally use that instead. That is how Wikipedia snippets get top billing in most search results! Its SEO performance, which newspapers can only dream of, draws on its rich schema. That is one way to leverage established tech workflows to organically put content creators back in charge.

Algorithms are not the enemy. Today they seem adversarial only because the interoperable contextual schema is missing. If publishers can come together to put that in place, algorithms too will evolve in a more ‘human-symbiotic’ direction.

And if a cursory speed-reading of this raises the spectre of some form of central control, remember that Wikimedia is the very antithesis of central control. Nothing above requires that the members of the consortium work from a common dataset. They will only share a schema so that all the data is interoperable. Again, as observed above, that is what creates new value.

Wikipedia gets top billing, left (first and second) and right, on most searches not merely because so many pages link to it or so many users pick it to click from results. And the main reason it gets a card to itself on the right is not that the content is CC licensed. It is there because that content is mapped onto a machine-readable schema that encapsulates all the related info and the relations. It tells Google this thing is a species, it is a tree, it says how this tree relates to others, where it is found and much more. [Image: Webpage snapshot reproduced under Fair Use assumptions.]
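The kind of markup doing that work can be sketched in schema.org JSON-LD. The types and property names below are real schema.org vocabulary; the values, and the event identifier, reuse this essay’s invented examples.

```python
import json

# schema.org JSON-LD of the kind that powers knowledge panels.
# NewsArticle, Event and their properties are real schema.org terms;
# the values are invented for illustration.
markup = json.loads("""
{
  "@context": "https://schema.org",
  "@type": "NewsArticle",
  "headline": "Storm makes landfall",
  "datePublished": "2020-08-27",
  "about": {
    "@type": "Event",
    "identifier": "b378af2550",
    "name": "Hurricane landfall, Gulf Coast"
  }
}
""")
print(markup["about"]["identifier"])  # the shared event ID a consortium would reuse
```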

At the moment the fight is over text snippets. You cannot hide the entire text, except with convoluted tricks that degrade reader experience. Worse, it is a strategic business mistake and betrays a wrong grasp of the internet, as Google’s lawsuits in the EU have shown, notwithstanding the untested promises from last week. But you can selectively share and hence charge for this rich data layer. The only constructive way to make the internet (giants) pay more is to render algorithms less valuable than what you offer in its stead.

In fact, even invoking this alternative paradigm can serve as a bargaining chip in the negotiations that publishers will apparently soon have in Australia, Brazil, Germany, and any other countries Google next picks in its wisdom.

If journalists, editors and publishers can form this consortium, they will not just save news media. They will change search engines in the very short term and shape the evolution of the internet itself in the medium term. The new model will create new journalism jobs in media rather than allowing suboptimal technologies to kill them. The only jobs lost in this future would be those of archaeologists and their ilk.

Only kidding! Archaeology will not disappear. Like much else, it will get more specialised. Because there will still be haystacks of virtual cuneiform tablets due to constant obsolescence of data formats, digital divides, and, because so many humans will remain in the loop by design, errors.

To me, Wikipedia is the greatest human accomplishment. A Library of Alexandria, finally, that is worthy of human progress in recent centuries. And that is just for the proportion of the collective human ken that is available on its surface. Deep in its genetic code though, it just so happens to also show the way forward to organise the written word. Some day not far from now, the entire corpus of human content will itself live, literally, on countless redundant copies written in DNA. Before that day, let’s begin writing the ‘first rough draft of history’ in a way that doesn’t make gatekeepers out of machines that would otherwise help search only the intentionally obscure.

This is the year to rebuild with a fresh slate, isn’t it?



Sach Wry
Economist, coder. Nonfiction title in the works. [ research . mentor @ outlook . com ]