How often does The Guardian link out to data sources?

Testing a hunch about how often The Guardian links to the ONS was a chance to explore good data journalism practice, revisit some Python and play with the excellent Guardian API. But what did I find out?

Andy Dickinson
4 min read · May 6, 2022

When introducing data (and data-driven journalism) to students, I always include the Office for National Statistics (ONS) as a go-to source. It's a really accessible gateway to the ‘low hanging fruit’ of government data sources as well as their own data gathering and analysis.

I always try to balance that with examples of how journalists might use the data. So I look around for contemporary examples that draw directly on ONS data. It’s important that students see that data is as much about content as it is grist for the research mill! But example articles are also a good way to highlight good practice. And there’s perhaps no more important good practice than being transparent with your sources: share the source of your data.

So I was frustrated that, as I looked around, I was still finding articles that used ONS data but didn’t link to it. And I tweeted my frustration with a pretty blatant example from the Guardian.

But does this mean that the Guardian is really bad at this? I had a quick look at the site, searching for ONS and scanning the stories and, on the first pass, it didn’t look too bad. But, if we’re talking about data journalism, this is really a lot of anecdote and not a lot of data. I needed to be a little more systematic. So, I dug out the old GIANT CAP and set about finding out just how much they do link.

The Guardian open platform API

Getting data from the Guardian.

The one thing you can’t accuse the Guardian of is being a closed shop when it comes to accessing their content. They have an easy-to-use API for searching their content called the Guardian Open Platform. You can sign up for free (with some minor limits) and start programmatically searching their content.

So I used the API to give me as many results as I could get for a search of “Office of National Statistics”. The result was 1389 results from 2008 to 2022.
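For anyone wanting to try something similar, here is a minimal sketch of that kind of search in Python. It is not the code behind the numbers in this post. It assumes the Content API's `search` endpoint with the `q`, `show-fields`, `page`, `page-size` and `api-key` parameters; the function names and the one-second pause between requests are my own illustrative choices, so check the limits on your own key.

```python
import json
import time
import urllib.parse
import urllib.request

API_URL = "https://content.guardianapis.com/search"


def build_search_url(query, api_key, page=1, page_size=50):
    """Build the URL for one page of search results."""
    params = {
        "q": f'"{query}"',               # quotes ask for the exact phrase
        "show-fields": "body,bodyText",  # HTML and plain-text article bodies
        "page": page,
        "page-size": page_size,
        "api-key": api_key,
    }
    return f"{API_URL}?{urllib.parse.urlencode(params)}"


def fetch_all(query, api_key, pause=1.0):
    """Walk through every page of results and return them as one list."""
    results, page, pages = [], 1, 1
    while page <= pages:
        url = build_search_url(query, api_key, page)
        with urllib.request.urlopen(url) as resp:
            payload = json.load(resp)["response"]
        results.extend(payload["results"])
        pages = payload["pages"]  # total page count reported by the API
        page += 1
        time.sleep(pause)         # assumed polite delay, not an official figure
    return results
```

Calling `fetch_all("Office of National Statistics", "YOUR_API_KEY")` with a real key should return a list of result dictionaries, each with a `fields` entry holding the article body.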

The Guardian helpfully includes lots of tags and other data in with the results. So we can see how those 1389 results spread across the different sections of the site.

A chart of the search results grouped by section. Anything after Opinion is basically n < 4.
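Grouping results by section is a small counting job. The sketch below assumes each result is a dictionary with a `sectionName` key, as in the API's JSON; the three sample records are made up purely to show the shape of the data.

```python
from collections import Counter


def count_by_section(results):
    """Tally results by the section name attached to each one."""
    return Counter(item.get("sectionName", "Unknown") for item in results)


# Made-up sample records, not real search results.
sample = [
    {"sectionName": "Business"},
    {"sectionName": "Business"},
    {"sectionName": "Opinion"},
]

for section, n in count_by_section(sample).most_common():
    print(f"{section}: {n}")
```

`most_common()` returns the sections in descending order of count, which is the order you would want for a bar chart.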

How many links to the ONS

It's interesting to see the distribution, but there are no real surprises, and it's not the main question. So, how many of the articles that mention the ONS actually link to its website?

Again, kudos to the Guardian API. They include a text version and an HTML version of the article's body text. So I searched the HTML version of each result for the text “ons.gov.uk/”, which would pick up any ONS URL in a link.
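As a rough illustration of that check, the sketch below counts how many times the marker text appears in an article's HTML. It assumes the HTML body sits under `fields` and then `body` in each result; the function names and the sample HTML are hypothetical.

```python
ONS_MARKER = "ons.gov.uk/"


def count_ons_links(body_html):
    """Count occurrences of the ONS marker text in an article's HTML."""
    return body_html.count(ONS_MARKER)


def ons_links_in_result(item):
    """Pull the HTML body out of one API result and count ONS markers."""
    return count_ons_links(item.get("fields", {}).get("body", ""))


# Made-up article body with two links to the ONS site.
sample_html = (
    '<p>See the <a href="https://www.ons.gov.uk/economy">ONS figures</a> '
    'and the <a href="https://www.ons.gov.uk/people">population data</a>.</p>'
)
print(count_ons_links(sample_html))
```

A plain substring match like this is deliberately crude: it would also count an ONS address written out in the visible text rather than in a link, so treat the totals as an approximation.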

And by that measure, 645 out of the 1389 results had at least one link to the ONS. That’s about 46%, so less than half. And that’s pretty much the same picture when you look at individual sections.

It’s not a pretty bottom half but you can explore the data directly on Datawrapper.

Some observations.

A couple of other things popped up in my quick look at the numbers. 175 of the results with links had more than one link and 24 had more than four.
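Once you have a link count for each article, the headline figures are simple tallies. This is a sketch under that assumption; `summarise` is a hypothetical helper and the sample counts are invented, not taken from the data.

```python
def summarise(link_counts):
    """Summarise a list of per-article link counts."""
    total = len(link_counts)
    linked = sum(1 for n in link_counts if n >= 1)
    return {
        "total": total,
        "with_link": linked,
        "share_with_link": linked / total if total else 0.0,
        "more_than_one": sum(1 for n in link_counts if n > 1),
        "more_than_four": sum(1 for n in link_counts if n > 4),
    }


# Invented per-article counts, just to exercise the function.
sample_counts = [0, 0, 1, 2, 8, 0, 5, 1]
print(summarise(sample_counts))

# The share reported in this post: 645 linked articles out of 1389.
print(f"{645 / 1389:.1%}")
```

Keeping the thresholds as strict comparisons (`> 1`, `> 4`) matches the wording "more than one" and "more than four".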

A quick look revealed that live blogs were commonly where multiple links appear, which kind of makes sense. One live blog I looked at had 8!

Liveblogs often had multiple links out which makes sense given the repetitive nature of the form.

Conclusions

There are lots of interesting things I want to dive into here to confirm some of my broad assumptions. I’d also like to look at ‘how’ the links are presented. But the short answer to the question is ‘clearly not enough.’

Even by my fairly unrefined analysis, not linking out to a data source in half the articles you produce feels like a missed chance to be a little more transparent and be more “data journalism” in what you do. Especially when, in this case, the source is open and available. No proprietary data sets here.

Of course, the key here is what we might measure that against. Are The Guardian any worse or better at this than other media orgs? I just don’t know. That’s my next project at some point. But grabbing data from multiple news sources is going to take a while! If only other news orgs were as open with their content as The Guardian is with their API.

But the long and short of it is this: regardless of how good or bad the Guardian is at this stuff… it's just good practice to link to your data sources!

After matter

Darren Waters, Head of Digital Content at the ONS, made a fair point about why some content may not be linked.

That idea ‘feels’ right, although I’m not sure there’s an equal or greater value in assuming the audience won’t understand.
