Lessons in AI from the editors

Designing and building a human-in-the-loop AI system for Daily Maverick

Jason Norwood-Young
8 min read · Jun 14, 2023

The idea was simple: let’s take this new-fangled GPT-thing and use it to summarise the news. How hard can that be?

This was back in late 2022. Artificial intelligence had just taken a massive leap forward, both in real capabilities and in public perception; OpenAI's ChatGPT was blowing minds with its human-like responses to prompts ranging from "What should I bake with my overripe bananas?" to "Make a plan to end the human race." Everyone wanted in on the hype, and news publishers, still suffering from a mixture of FOMO and PTSD from the rise of the internet, were not going to be left out.

Daily Maverick was no exception. As a real digital leader in the publishing space, with an outrageously successful membership programme, the South African news publisher has always forged ahead where others fear to tread, both editorially and technologically. It has reaped the rewards, growing fast and constantly improving its technological and editorial offering, despite tough economic conditions in South Africa.

I had recently been inspired by a talk by Tim Part from FT Strategies on newsroom experimentation, and wanted to build something small as a starting point, just to see what the technology was capable of. I started experimenting with OpenAI’s Playground, a system that gives you many more options than the usual ChatGPT interface, and is more indicative of how we’d eventually interface with OpenAI’s API.

Feeding articles in, and asking GPT to summarise, create headlines, check for errors, and create Tweets, taught us its shortcomings quite quickly. GPT-3 was pretty good at summarisation, okay at headlines, and useless at subbing for copy errors. For Tweets, it was highly unpredictable, creating a brilliant Tweet one second and an unusable one the next.

But throughout it all, we could see the potential coming through: it was surprisingly creative, generating hashtags and occasionally even throwing in an emoji for the Tweets.

Its biggest struggle was counting: you could not expect GPT-3 to limit itself to 140 characters, or x number of sentences for a summary, or even "Summarise in four bullet points," with any reliability. It's the same reason GPT-3 is poor at rhyming: it cannot plan ahead. (What AI researchers call "thinking slow". Large language models like GPT-3 are good at "thinking fast", but planning ahead, thinking slow, is their Achilles' heel. As soon as they solve that problem, the human race is doomed, of course.)
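One pragmatic workaround, since the model cannot count, is to count for it: validate the output after the fact and regenerate anything that breaks the constraints. A minimal sketch in TypeScript (the `generate` callback is a hypothetical stand-in for an OpenAI API call; this is an illustration, not necessarily how SummaryEngine does it):

```typescript
// Sketch: since the model can't reliably count, we validate its output
// ourselves and regenerate when it breaks the constraints.

type Constraints = { maxChars: number; maxBullets?: number };

function fitsConstraints(summary: string, c: Constraints): boolean {
  if (summary.length > c.maxChars) return false;
  if (c.maxBullets !== undefined) {
    const bullets = summary
      .split("\n")
      .filter((line) => line.trim().startsWith("-"));
    if (bullets.length > c.maxBullets) return false;
  }
  return true;
}

async function summariseWithRetry(
  generate: () => Promise<string>, // hypothetical call to the API
  c: Constraints,
  maxAttempts = 3,
): Promise<string | null> {
  for (let i = 0; i < maxAttempts; i++) {
    const candidate = await generate();
    if (fitsConstraints(candidate, c)) return candidate;
  }
  return null; // give up and flag for a human, in human-in-the-loop style
}
```

Returning `null` rather than the best failed attempt keeps the decision with the editor, which fits the human-in-the-loop philosophy described below.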

Another problem with summarisation is that OpenAI’s models tended to try to summarise every major point in an article. This isn’t necessarily the behaviour a newspaper wants; depending on the use case, it might want to entice readers to read a summary giving a few top facts, and then click through to the full article if they want more detail. The summaries didn’t complement the articles: they made them redundant.

If an article wasn’t written in a strict pyramid style, the summaries were often wildly off the point. A flowery intro, common in opinion writing for instance, would throw GPT-3 off completely.

And occasionally, perhaps one time out of ten, the summaries would just miss the hook completely, or even get the article factually wrong.

Another issue was tone: without careful prompting, the summaries were dry, lacking the stab and thrust of Daily Maverick's typical jousting style. This was solved in large part by adding "… in the style of Daily Maverick" to the prompt. While many creators may complain about their content being added to the gaping maw of the AI models without consent, in some cases it's pretty useful that the AI is pre-trained on your unique flavour of language. Still, adding this would occasionally result in summaries that started like: "According to an article in Daily Maverick…", or, worse, "This article is about…". Gouge my eyes out!
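A cheap defence against those tell-tale openers is a post-filter that strips them before the summary gets anywhere near a reader. A sketch, with a phrase list that is illustrative rather than exhaustive, and a function name of my own invention:

```typescript
// Sketch: strip the boilerplate openers GPT-3 sometimes prepends.
// The phrase list here is illustrative, not exhaustive.
const badOpeners: RegExp[] = [
  /^according to (an article in )?daily maverick[,:]?\s*/i,
  /^this article is about\s*/i,
];

function cleanSummaryOpening(summary: string): string {
  let out = summary.trim();
  for (const re of badOpeners) {
    out = out.replace(re, "");
  }
  // Re-capitalise the first letter after stripping an opener.
  return out.charAt(0).toUpperCase() + out.slice(1);
}
```

A filter like this only papers over the worst cases; anything subtler still needs the editor's eye, which is exactly where human-in-the-loop comes in.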

Clearly, while GPT-3 was astonishingly good, it was also get-yourself-sued bad. A bit like a toddler, you didn’t want to leave it home alone, unsupervised for a day, or you could be sure that at some point, some mischief would occur.

Enter the concept of human-in-the-loop: as it says on the tin, a human-in-the-loop system has a real live person assist the technology. In the case of summaries, we simply marked a summary as unapproved — and therefore, not fit for public consumption — until an editor (be that a subeditor, section editor, or chief sub) had checked it. The editor could then decide to accept it as-is, to reject it (which immediately generates a new option), or to edit the summary.
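That workflow can be sketched as a tiny state machine. The names and shapes below are mine, not SummaryEngine's actual schema, and `regenerate` is a hypothetical callback standing in for a fresh API call:

```typescript
// Sketch of the human-in-the-loop states for a summary.
type SummaryStatus = "unapproved" | "approved" | "rejected";

interface Summary {
  articleId: number;
  text: string;
  status: SummaryStatus;
}

// New summaries always start life unapproved: nothing goes public
// until an editor has looked at it.
function newSummary(articleId: number, text: string): Summary {
  return { articleId, text, status: "unapproved" };
}

function approve(s: Summary): Summary {
  return { ...s, status: "approved" };
}

// Rejecting immediately queues a replacement; `regenerate` is a
// hypothetical callback, not a real API call.
function reject(
  s: Summary,
  regenerate: (articleId: number) => string,
): Summary {
  return newSummary(s.articleId, regenerate(s.articleId));
}

// Editors can also just fix the text themselves; an edited summary
// counts as approved, since a human has now signed off on it.
function edit(s: Summary, newText: string): Summary {
  return { ...s, text: newText, status: "approved" };
}
```

The key invariant is that the only paths to "approved" run through a human action, either an explicit approval or a manual edit.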

This is my favourite type of AI, and I think where we should be focussing our efforts: AI becomes a very powerful tool, assisting skilled people to do their jobs much, much better. It still takes someone who knows how to edit to make the system work, but it makes them much better at their job, capable of thinking at a higher level, while the AI assists with the humdrum work.

AI is like some kind of instant agricultural revolution, from plough to tractor, except everyone in the world was given a tractor overnight. That’s great, if you’re a farmer, but if you know nothing about farming, it’s pretty useless. (Except for towing Russian tanks, of course.) It makes highly skilled people better, but its ability to give unskilled people the powers of doing a skilled job is, at this point, highly overrated.

I built two separate interfaces for what became the SummaryEngine WordPress plugin. One interface appeared while the editor was editing the article; the second was an overview of all summaries for all articles, which lets an editor work through multiple summaries quickly. As these things go, neither interface was too cumbersome to build, although there was a lot of detail-work, and of course plenty of code and design that I would change if I were to build it again. But the point is that it was a pretty inexpensive experiment to get us generating summaries, to see how editors use it, and to build up a foundation of knowledge, awareness and acceptance of AI within the organisation.

Of the two interfaces, only one is really used: the article CMS. That was one of my big take-aways from this project. Editors are busy people, apparently. They want to do everything on one screen, and this screen is the one they work on the most. (This insight is also becoming clear in some other projects I'm working on.) Even when it's someone's job to use a web interface, the same rules that apply to a casual browser or potential customer still hold: "Don't make me think"; and definitely: "Don't make me load a new page".

We went through two major iterations of design with the user interface on the edit article page, with one of the smallest changes having one of the biggest impacts: changing "Unapprove" to "Reject". (Thanks Styli Charalambous for suggesting this.) Too often, as software developers, we base our nomenclature on our data model or application model, but it doesn't match the model in our users' heads of what's happening. Their perception of reality is actually more important than our coded reality, because at the end of the day we need their buy-in more than they need extra work to do on every article.

I recently ran into this problem again when I discovered that what editors refer to as "section pages" are, as far as the site is concerned, distinct front pages. I could either have tried to re-educate 20 editors on a semantic difference, or I could just roll with it and adapt my language, and the language of our software, accordingly. Software is more malleable than people's ingrained perceptions. And while it rankles me that people are using the wrong word for something, I need to remember that it's only the wrong word to me, not to them.

The editors, much to my delight, started using the summaries, but we still didn’t really know what we were going to do with them. (We didn’t tell the editors this, of course.) Options included a summary newsletter, use in our mobile app, or even as a popup or sidebar while reading an article. And while I wouldn’t usually recommend starting a project without a firm design, in this case it worked out very well: Daily Maverick has quietly launched a completely new interface with summaries, as another cheap experiment to see how our users interact with them. Two types of summaries are used in the new interface: a short summary, and bullet points. It’s quickly become my favourite way to read Daily Maverick, and has been visited by over 17,000 readers in the last week. Whether it becomes a mainstay of DM’s offering will depend on the data, but the fact that we could quickly put out this product, and maintain it, even though we’re just starting to experiment with AI, shows the promise of the technology.

The project has also opened up the possibility of other use cases: experiments with GPT-4 show much better results in terms of length and tone. It still cannot sub-edit or copy-edit worth a damn, but one thing it can do is translations. TranslateEngine is on the cards. It can also suggest headlines, which we will probably roll into the headline scorer I built, which has changed the way news editors structure headlines based on what our readers respond to, and has helped Daily Maverick surpass 10 million unique readers a month. Once again, these interventions will be built with our people in mind, to give them a tool, and not to try to replace them. If you're going to let AIs write headlines alone, expect the same result as giving a toddler access to a permanent marker, a bucket of paint and your makeup drawer, and then leaving them alone for a few hours.

Daily Maverick runs on WordPress, so SummaryEngine was built in PHP on top of WordPress, using MySQL as the data store, with some of the user interface built in Svelte with TypeScript. We used the GPT-3 Completion interface, but will be trialling GPT-3.5 and GPT-4 using the Chat interface.
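Most of the migration work between those two interfaces is that they take differently shaped request bodies: the Completion endpoint takes a single prompt string, while the Chat endpoint takes a list of role-tagged messages. A sketch of the two payloads (model names reflect what was available at the time; the prompt text is illustrative):

```typescript
// Sketch of the two request shapes. The prompt wording is illustrative,
// not SummaryEngine's actual prompt.

function completionRequest(article: string) {
  // Legacy Completions API: one flat prompt string.
  return {
    model: "text-davinci-003",
    prompt: `Summarise the following article in the style of Daily Maverick:\n\n${article}`,
    max_tokens: 256,
    temperature: 0.7,
  };
}

function chatRequest(article: string) {
  // Chat API: instructions move into a system message,
  // and the article itself becomes the user message.
  return {
    model: "gpt-4",
    messages: [
      {
        role: "system",
        content:
          "You summarise news articles in the style of Daily Maverick.",
      },
      { role: "user", content: article },
    ],
    max_tokens: 256,
    temperature: 0.7,
  };
}
```

One practical upside of the chat shape is that the standing instructions (style, length) live in the system message, separate from the article, which makes the prompt easier to tweak without touching the content pipeline.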
