In the latest installment of the Lumen Researcher interview series, we spoke with Andrea Fuller, a reporter for The Wall Street Journal who specializes in data analysis. Before joining the WSJ, Fuller was a data journalist for Gannett Digital, The Center for Public Integrity, and The Chronicle of Higher Education.

Fuller’s work combines data, spreadsheets, and code with investigative journalism and storytelling. Last May, Fuller wrote an article for the Journal about her investigation into Google’s handling of DMCA takedowns. Using Lumen’s extensive database of copyright notices, Fuller exposed some of the shady copyright claims and abusive submitters that were flying under Google’s radar.

In this interview, Adam Holland, Project Manager at the Lumen Project, along with Anna Callahan and Gina Markov, Lumen’s 2021 summer interns, spoke with Fuller about her research and how the Lumen Database helped her develop her story about the bad actors in the world of DMCA takedowns.

Lumen: Could you give us a brief bio?

Andrea Fuller: I’ve been at the Journal since 2014, so, seven years now. I’ve done tech stories, I’ve done a lot about nonprofit shenanigans, I’ve done a lot about education. So, I’m a little bit of a jack of all trades when it comes to subjects. But I mainly work on data projects at the Journal — I do spreadsheets, I do SQL, I do R, I write code, I call people and talk to them, and then I write stories. So, yeah, I do a little bit of everything.

Lumen: In 2020, you wrote an article for The Wall Street Journal about how Google approaches DMCA takedown requests. Could you give us some background about your process of researching and writing that piece?

Andrea Fuller: We started this article in late 2019 after it came out of some other Google reporting. We came across some of Lumen’s own blog posts and great work about the DMCA and we wanted to take this a step further. So we spent months and months downloading lots of DMCA requests from the Lumen database. We got a Lumen API key and started loading the text of these requests into our own database.

I set up a MongoDB database that we imported all these JSON files into, and we started collecting them incrementally. I actually wrote a bunch of code to turn it into a relational database so I could search them more easily. I was working in SQL. Google releases its own transparency report, which does not have all the detailed information of the text of these requests, but does say how many and which of the URLs were removed, so I was able to cross reference that.
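[Editor's illustration: the workflow described above, flattening JSON notices into relational tables and cross-referencing them against removal data, might look something like the minimal sketch below. This is not Fuller's code. It uses an in-memory SQLite database for brevity, and every field name, table name, and sample row is a hypothetical stand-in rather than the actual Lumen API schema or Google transparency report format.]

```python
import sqlite3

# Hypothetical notice documents. The field names are illustrative
# assumptions, not the real Lumen API schema.
notices = [
    {"id": 1, "sender": "Example Sender A", "body": "We published this first.",
     "urls": ["https://news.example/story-1", "https://news.example/story-2"]},
    {"id": 2, "sender": "Example Sender B", "body": "Copied from our blog.",
     "urls": ["https://blog.example/post-9"]},
]

# Hypothetical transparency-report rows: (url, removed), where 1 means removed.
report_rows = [
    ("https://news.example/story-1", 1),
    ("https://news.example/story-2", 0),
    ("https://blog.example/post-9", 1),
]


def build_db(notices, report_rows):
    """Flatten nested notice documents into relational tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE notices (id INTEGER PRIMARY KEY, sender TEXT, body TEXT);
        CREATE TABLE notice_urls (notice_id INTEGER, url TEXT);
        CREATE TABLE report (url TEXT, removed INTEGER);
    """)
    for n in notices:
        conn.execute("INSERT INTO notices VALUES (?, ?, ?)",
                     (n["id"], n["sender"], n["body"]))
        # One row per targeted URL, so URLs can be joined and searched in SQL.
        conn.executemany("INSERT INTO notice_urls VALUES (?, ?)",
                         [(n["id"], u) for u in n["urls"]])
    conn.executemany("INSERT INTO report VALUES (?, ?)", report_rows)
    conn.commit()
    return conn


def removed_urls_by_sender(conn):
    """Cross-reference notice URLs with the removal flags in the report."""
    return conn.execute("""
        SELECT n.sender, u.url
        FROM notices n
        JOIN notice_urls u ON u.notice_id = n.id
        JOIN report r ON r.url = u.url
        WHERE r.removed = 1
        ORDER BY n.sender, u.url
    """).fetchall()


conn = build_db(notices, report_rows)
rows = removed_urls_by_sender(conn)
for sender, url in rows:
    print(sender, url)
```

[Splitting the URLs into their own table is the design choice that matters here: it turns each targeted URL into a row that can be joined against the removal data, which is awkward to do while the URLs remain nested inside each JSON document.]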

I looked for any DMCA requests against major websites, whether they be major retail sites or news sites. It was interesting — there weren’t any successful DMCA requests from sites like The Wall Street Journal or The New York Times. What I started finding was more obscure, niche news sites — whether they be blogs or international news sites in foreign languages — [where] Google had complied with the take-down requests [it had received]. It became evident to me that Google was doing a really good job at filtering out bad takedown requests against major newspapers and major consumer sites, but not so much against more esoteric media.

Lumen: Do you have any speculation as to why that is?

Andrea Fuller: One of the things we say in our article is that the team at Google that’s in charge of DMCA is not very big. To actually manually review all of these requests is a near impossible task as the volume of DMCA requests has increased dramatically. And it’s really hard to make those judgment calls, because whether something is a real news site or not can be a really difficult determination. You get into broader philosophical discussions about the credibility of media in particular.

In one of the most compelling interviews that I had, I talked to this journalist in Ukraine who was affiliated with an international reporting organization. He described these shady characters who will post fake copies of webpages on LiveJournal and then file DMCA take-down requests to remove the original pages. It works, forcing the original publishers to file counterclaims, but there’s a gap before the page can be restored. So it’s a pretty effective strategy for getting content removed if you are an alleged Ukrainian gangster.

I also talked to one girl who was a blogger in Singapore, and she had a blog that was taken down through a DMCA request. She was writing letters to Google saying, “Why? I don’t understand. What did I do?” She didn’t do anything. She wrote about this Instagram influencer who was promoting a questionable sham cryptocurrency product. And that [product’s] organization created a blog post and filed a take-down request, claiming that it wrote the article first, which was laughable because it was clearly written in her voice, in her person. So, it’s really hard for people to fight these. They can definitely file counterclaims, but there is a delay in the process and then sometimes people don’t necessarily have the resources.

It’s a really tricky issue. It would take a lot of manpower to manually review all of these. Even though I was able to use queries to identify likely problematic ones, this took me a lot of time. Eventually, I sent Google about 100 things that I pretty much knew were wrong. And from that, because they have much more data than I do internally, they were able to restore something like 50,000 pages that they had taken down erroneously. And that’s just out of my one little slice of research.

Lumen: Did you see any trends in how fraudulent notices looked as you were going through them?

Andrea Fuller: You would see examples of things that claimed to have been published earlier, but the date was something that couldn’t possibly match the dates in the context of the piece. And it was clearly just made up. So, it was pretty self-evident. I think there were a lot of really clever, yet nefarious, tactics. But once you looked at some of these articles it was really blatantly obvious: We had these LiveJournals that had been hacked and were suddenly all in Russian or Ukrainian, and had all these articles about oligarchs.

In other cases, it was harder. Both of these sites in Ukrainian look kind of scammy. Which is the real one and which isn’t? And it was really hard to assess that in some cases.

Lumen: What was the most shocking or unexpected thing that you found during your research?

Andrea Fuller: There were a couple things that I think surprised me, one being the hacking of LiveJournals. The idea that LiveJournals played such a big role was so bizarre to me. I actually, at one point, started searching for any take-down requests that were based on claiming that a LiveJournal had copyright, because that was a way I was identifying them. I think I was also pretty surprised by that story where the girl’s blog actually got taken down because it was a Google Blogger product, and that was pretty stunning to me. Not only was it not available in search, but it was just gone from the internet entirely and that was pretty troubling.

Lumen: What alternative approach do you think you would have been able to take had the notices not been available to you via Lumen? Do you think that approach would have been as effective?

Andrea Fuller: I think it would have been really difficult just going off the URLs. I don’t think this would have even been possible without the Lumen data. I don’t think we could have found such a large volume of stuff. Having the actual text of the complaint was what made this analysis possible and I can’t even really fathom how we could have done it simply based on the transparency report data. We could have probably found a few anecdotes, but I didn’t know what websites to search for because I don’t know the names of Ukrainian blogs, so I think it would have been really difficult without Lumen.

Lumen: Do you think there is any appetite for a follow-up piece about this stuff?

Andrea Fuller: Yes. My coworker, Rachel Levy, did a great story a couple months before mine about reputation management. This hedge fund manager had created fake websites to promote himself and manipulate Google search results. So, it’s certainly a really interesting area, and I think DMCA is a larger piece of that puzzle.

Lumen: As a data journalist who often covers issues pertaining to tech and business, how important, in your opinion or experience, is transparency through notice sharing?

Andrea Fuller: As journalists, we do our jobs based on this kind of transparency. If there had been no Google transparency report or Lumen database we wouldn’t have been able to do this. And so, you have to at least give Google credit for releasing that kind of data. And I think if there was more stuff like that out there in the tech world that would be really beneficial to researchers and journalists and people who can help with the accountability process. I think that the more transparent data we have, the more stuff like that becomes possible. And at the end of the day, it helps the public and consumers and it really helps the Ukrainian journalists. And that’s what we’re all here for.

[The quotes in this interview have been edited for clarity.]

--

Lumen Database Team
Berkman Klein Center Collection

Collecting and facilitating research on requests to remove online material. Visit lumendatabase.org and email us if you have questions.