Finding networks of websites

Sara-Jayne Terp
Disarming Disinformation
Jan 22, 2020

My geekier self just published notes from a “data safari” on a network of news and business sites created across the US and beyond. These were tracked diligently and well by data journalists like Pri Bengani at Columbia J-School, who published the writeup Hundreds of ‘pink slime’ local news outlets are distributing algorithmic stories and conservative talking points (go read it), and Matt Grossman, who first raised the alarm.

“Safaris” are things I do occasionally to keep in contact with data, especially when I’m spending a lot of my time doing policy work, because IMHO it’s dangerous to do data or tech policy without keeping at least some contact with the craft. I did this one because, whilst much of the conversation about disinformation has centred on message-based disinformation and social media, the information ecosystems outside social media are IMHO equally important, especially the domain-based ecosystems (as in, things with URLs) that present as news and information sources. Hell, I thought they were important enough that I spent a year of my life designing algorithms and writing code behind the Global Disinformation Index, an organisation whose main mission is to reduce the flow of money through online disinformation sites. (Yeah, sorry: I wrote a lot about work building coordination tools for responses to large-scale disinformation campaigns, and some about the nearly 40,000 miles I drove across America between giving talks, setting up meetings and events, visiting researchers, exploring, and listening hard to the people in the centre of the country (and their radio stations, media etc.), but didn’t write at all about the part-time contract work that funded it. My bad…)

So apart from even more respect for the craftsmanship of data journalists, did I have any useful thoughts? Well, maybe (again, read the original article — it’s good). On finding sites:

  • It’s amazing how often stuff is found by accident. We search and search, then a researcher notices a ‘local’ paper in their community that they’ve never heard of. Sigh.
  • Never forget the low-hanging fruit. Each local news site had a list of other sites in its footer. There were sets of sites per US state, and there were unique phrases and articles to search on. (A footer-scraping sketch follows this list.)
  • The Google API is a great shortcut for finding things, especially if you have unique phrases and unusual page names to try. Use the “repeat the search with the omitted results included” button to stop it helpfully filtering out all the repeats that you actually do, yes, want to see. And it’s worth trying multiple search engines: sometimes DuckDuckGo had results when Google didn’t. (A search-API sketch follows this list.)
  • Context is king. You can get a long way searching on content (see above), but eventually you have to start looking at objects that are related to the things you’re looking for. Pri looked at companies (that owned and ran the sites), people (connected to the companies and each other), and third-party tags on each website. Social media is also useful: most sites have social media accounts and pages, and looking at the other accounts and pages they link to can help with your search.
  • Builtwith is great for finding third-party tags, and for finding relationships between sites that share those tags. (A do-it-by-hand tag-matching sketch follows this list.)
  • Many of the sites were registered on the same day. Could the registries alert when large batches of newsy URLs get registered together? (A WHOIS date-clustering sketch follows this list.)
  • I liked the New York Times’ grid of site front pages. It was a simple but powerful visual check on how similar a set of sites was, and one that’s worth doing again (a screenshot-grid sketch follows this list).
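
A few sketches for the bullets above. First, the footer trick: a minimal scraper, assuming the network list lives in a <footer> tag — real sites vary, so treat the selector and the seed URL as placeholders:

```python
# Sketch: pull the "other sites in our network" list out of a page footer.
# Assumes a <footer> tag; real sites vary, so treat this as a starting point.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

def footer_domains(url):
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    footer = soup.find("footer")
    if footer is None:
        return set()
    links = (urljoin(url, a["href"]) for a in footer.find_all("a", href=True))
    return {urlparse(link).netloc for link in links}

# seed = "https://example-local-news.com"  # hypothetical starting site
# print(footer_domains(seed))
```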
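For the search-API bullet, a minimal sketch against Google’s Custom Search JSON API; the key and engine ID are placeholders you’d supply yourself, and filter=0 is the API-side version of the “omitted results included” button:

```python
# Sketch: search for a unique phrase via Google's Custom Search JSON API.
import requests

API_KEY = "YOUR_API_KEY"   # placeholder
CX = "YOUR_ENGINE_ID"      # placeholder

def search_phrase(phrase):
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={
            "key": API_KEY,
            "cx": CX,
            "q": f'"{phrase}"',  # quotes force an exact-phrase match
            "filter": "0",       # turn OFF the duplicate-content filter
        },
        timeout=10,
    )
    resp.raise_for_status()
    return [item["link"] for item in resp.json().get("items", [])]

# print(search_phrase("some unusual boilerplate sentence from the sites"))
```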
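For the context and Builtwith bullets, the by-hand version of tag matching: scrape pages for shared third-party IDs (Google Analytics UA- codes, AdSense pub- codes), since sites that share one are worth a closer look. The example URLs are hypothetical; this is roughly what Builtwith automates at scale:

```python
# Sketch: find shared third-party IDs across a set of sites.
import re
import requests
from collections import defaultdict

ID_PATTERNS = [
    re.compile(r"UA-\d{4,10}-\d{1,4}"),  # Google Analytics (classic)
    re.compile(r"pub-\d{10,20}"),        # Google AdSense
]

def shared_ids(urls):
    owners = defaultdict(set)  # tag id -> set of sites carrying it
    for url in urls:
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        for pattern in ID_PATTERNS:
            for tag in pattern.findall(html):
                owners[tag].add(url)
    return {tag: sites for tag, sites in owners.items() if len(sites) > 1}

# print(shared_ids(["https://site-a.example", "https://site-b.example"]))
```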
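For the same-day-registration bullet, a sketch that clusters domains by WHOIS creation date, assuming the python-whois package (pip install python-whois); registrar records are messy, so expect to normalise:

```python
# Sketch: group domains by WHOIS creation date to spot batch registrations.
import whois
from collections import defaultdict

def registration_batches(domains):
    by_date = defaultdict(list)
    for domain in domains:
        try:
            record = whois.whois(domain)
        except Exception:  # lookups fail often; skip and move on
            continue
        created = record.creation_date
        if isinstance(created, list):  # some registrars return several dates
            created = created[0]
        if created:
            by_date[created.date()].append(domain)
    return {day: ds for day, ds in by_date.items() if len(ds) > 1}

# for day, batch in registration_batches(["site-a.example", "site-b.example"]).items():
#     print(day, batch)
```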
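And for the front-page grid, a rough rebuild using Playwright for screenshots (pip install playwright, then playwright install chromium) and Pillow for the montage; the thumbnail size and example URLs are arbitrary:

```python
# Sketch: screenshot a set of front pages and stitch them into one grid image.
from playwright.sync_api import sync_playwright
from PIL import Image

THUMB = (320, 240)  # arbitrary thumbnail size

def screenshot_grid(urls, cols=4, out="grid.png"):
    shots = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 960})
        for i, url in enumerate(urls):
            page.goto(url, wait_until="load")
            path = f"shot_{i}.png"
            page.screenshot(path=path)
            shots.append(path)
        browser.close()
    rows = -(-len(shots) // cols)  # ceiling division
    grid = Image.new("RGB", (cols * THUMB[0], rows * THUMB[1]), "white")
    for i, path in enumerate(shots):
        thumb = Image.open(path).resize(THUMB)
        grid.paste(thumb, ((i % cols) * THUMB[0], (i // cols) * THUMB[1]))
    grid.save(out)

# screenshot_grid(["https://site-a.example", "https://site-b.example"])
```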

And on what to do once you’ve found them:

  • This is astroturfing of local news, business news, and a bunch of other news domains. There are two possible parts to this: the filling of news ‘voids’ where there are no local outlets, and the overwhelming of other local voices. “This was purposely done to mislead people into thinking that was a publication from the district” (The Guardian). The articles that aren’t stock seem to be of a specific political perspective; this could possibly be setting up an arena that could be used to shift beliefs and emotions in a politically-charged year. One specific thing that could be done is to use a tool like Carto to map these local news locations against a) the news deserts in this University of North Carolina study, b) US election battleground states, and c) other large syndicates like Sinclair TV (a rough mapping sketch follows this list).
  • This is, ultimately, about trust and how we judge information. “In all of these cases, the issue is less about politicians promoting their points of view than hiding their affiliation with the content — making it hard for a reader who would naturally bring more skepticism to a campaign ad than they would a local news story.” Information analysts use the Admiralty Scale to judge each piece of information that they receive: where the information comes from is as important as the information itself. But if we just see the information, we’re making a snap judgement to fill in the “source” rating, and people in general are terrible at that, especially if they’re given lots of cues or padding, Jenna Abrams-style, to deliberately increase their trust. (A sketch of the scale follows this list.)
  • It’s also about learning from the internet’s history. The creation and population of a site for each vertical looks eerily similar to the story of how meetup.com became popular (by creating and populating pages for groups that would most likely form, then watching as people joined them). It’s smart — and it’s worth looking at our other origin stories to see how many of them have paths like this too.
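
A rough stand-in for the Carto mapping idea, using folium instead; the site coordinates and the news-desert GeoJSON file below are made up for illustration:

```python
# Sketch: plot site locations on a US map; overlay layers (news deserts,
# battleground states) would be loaded separately as GeoJSON.
import folium

sites = [  # hypothetical (name, lat, lon) triples geocoded from site mastheads
    ("Anytown Times", 42.73, -84.55),
    ("Springfield Sun", 39.80, -89.64),
]

m = folium.Map(location=[39.8, -98.6], zoom_start=4)  # centred on the US
for name, lat, lon in sites:
    folium.Marker([lat, lon], popup=name).add_to(m)
# e.g. a news-desert overlay, if you have the file (hypothetical name):
# folium.GeoJson("news_deserts.geojson").add_to(m)
m.save("site_map.html")
```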
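And the Admiralty Scale point, as a data structure: the grades and meanings below are the standard NATO ones, but the class itself is just an illustration of why a rating needs both axes:

```python
# Sketch: the Admiralty Scale as a two-axis rating.
from dataclasses import dataclass

SOURCE_RELIABILITY = {  # where the information came from
    "A": "completely reliable", "B": "usually reliable",
    "C": "fairly reliable", "D": "not usually reliable",
    "E": "unreliable", "F": "reliability cannot be judged",
}
INFO_CREDIBILITY = {  # the claim itself
    1: "confirmed by other sources", 2: "probably true", 3: "possibly true",
    4: "doubtful", 5: "improbable", 6: "truth cannot be judged",
}

@dataclass
class Rating:
    source: str       # A-F
    credibility: int  # 1-6

    def __str__(self):
        return (f"{self.source}{self.credibility}: "
                f"{SOURCE_RELIABILITY[self.source]} source, "
                f"{INFO_CREDIBILITY[self.credibility]}")

# A 'local paper' you've never heard of, making an unverified claim:
print(Rating("F", 6))
```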

In the end, for me, this was about tracing a network of sites to watch, and honing skills and tools to find and watch other potentially less benign networks as they emerge over the coming year (and several have already been found around the world). Also I really need to look into the Iffy Quotient.
