ProPublica’s Documenting Hate Dataset — a Year in Hate Crimes.

Documenting Hate is a collaborative effort headed by the investigative non-profit ProPublica, aiming to catalog hate crimes in 2017. The project includes an all-star team of heavy-hitting collaborators, such as the New York Times, Google News Lab, WNYC and more. In addition to a potent body of investigative reporting, the project touts a publicly available dataset consisting of thousands of stories from an array of news organizations across the nation, ranging from local news stations to special interest outlets to national publications. This is presumably the same data that backs their front-end application.

The dataset isn’t perfect: some events are duplicated across multiple stories, non-authoritative organizations are heavily represented, and it generally lacks good metadata. Still, it’s worth digging through to get an idea of how the media reported on hate crimes this past year.

An initial glimpse gives us an idea of what organizations are represented in the collection. It’s unclear what the guidelines for inclusion in the dataset were; when I emailed ProPublica two days ago for further information, I didn’t receive a reply. However, it’s clear that HuffPost (formerly the Huffington Post) has been responsible for extensive coverage of hate crimes since February.
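As a rough illustration of how this kind of per-outlet tally can be produced, here is a minimal sketch. The field name "organization" and the sample rows are placeholders I made up for the example, not the dataset's actual schema or contents.

```python
from collections import Counter


def count_by_outlet(rows, field="organization"):
    """Count stories per outlet, skipping rows where the outlet is blank."""
    counts = Counter()
    for row in rows:
        name = (row.get(field) or "").strip()
        if name:
            counts[name] += 1
    return counts


# Invented placeholder rows; the real file's column names may differ.
sample_rows = [
    {"organization": "Outlet A", "title": "Placeholder headline 1"},
    {"organization": "Outlet A", "title": "Placeholder headline 2"},
    {"organization": "Outlet A", "title": "Placeholder headline 3"},
    {"organization": "Outlet B", "title": "Placeholder headline 4"},
    {"organization": "Outlet B", "title": "Placeholder headline 5"},
    {"organization": "", "title": "Placeholder headline 6"},
]

outlet_counts = count_by_outlet(sample_rows)
top_outlets = outlet_counts.most_common(2)  # largest contributors first
```

With the real file, the rows could come from csv.DictReader, assuming the data is distributed as a CSV.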

Interestingly, the Daily Caller is also represented in the sample. At different times, the outlet has advocated driving cars into crowds of protesters, carried articles by a noted white supremacist and manufactured a libelous controversy about a Democratic senator. Most of the stories included from the Caller focus on hate crimes against police officers or attempt to undermine stories about crimes directed at minorities.

While the Daily Caller should not be considered an authoritative source, its inclusion in the dataset, along with Breitbart News, is important given the rise of extreme right-wing commentary in recent years; excluding these outlets from the collection would mean ignoring a potent, if distasteful, movement in the modern media.

Hate crime reporting also varied considerably in volume throughout the year, with certain outliers coinciding with major news pieces.

  • March 2nd: numerous Jewish cemeteries are vandalized across the nation, most notably in Rochester, Philadelphia and St. Louis.

It appears that the Unite the Right rally in Charlottesville is significantly underrepresented in the dataset: August 12th and 13th received about as much coverage as almost any other day during the February-August news cycle. This is unfortunate given not only the importance of the events to the national dialogue, but also the tremendous volume of reporting that they continue to produce months later.
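One simple way to look for days with unusually heavy coverage is to count stories per day and flag any day that sits well above the average. The sketch below is only an illustration: the "date" field name, the ISO date format, the two-standard-deviation cutoff and the sample rows are all assumptions of mine, not details taken from the dataset.

```python
from collections import Counter
from datetime import date
from statistics import mean, pstdev


def daily_counts(rows, field="date"):
    """Count stories per calendar day from ISO-formatted date strings."""
    return Counter(date.fromisoformat(row[field]) for row in rows)


def outlier_days(counts, k=2.0):
    """Return days whose story count exceeds mean + k standard deviations."""
    values = list(counts.values())
    if len(values) < 2:
        return []
    threshold = mean(values) + k * pstdev(values)
    return sorted(day for day, n in counts.items() if n > threshold)


# Synthetic rows: ten quiet days with two stories each, then one busy day.
sample_rows = [
    {"date": f"2000-01-{day:02d}"} for day in range(1, 11) for _ in range(2)
]
sample_rows += [{"date": "2000-01-11"}] * 20

counts = daily_counts(sample_rows)
spikes = outlier_days(counts)
```

Note that days with no stories at all never appear in the counter, so they do not pull the average down. A day like August 12th would only show up as a spike if the dataset actually contained a surge of stories for it.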

While only a heuristic, performing keyword extraction from the titles of articles in the dataset gives some indication of what topics came up the most. “Muslim” was far and away the most common term, closely followed by “Islamic”. This isn’t surprising given the way in which the president’s rhetoric has cast religious discrimination into sharp relief.
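For readers who want to try something similar, the sketch below counts words in headlines after dropping common stopwords. It is a deliberately simple stand-in for keyword extraction and not necessarily the method used for this analysis; the stopword list is a small illustrative set and the titles are invented placeholders.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real analysis would use a fuller one.
STOPWORDS = frozenset(
    {"the", "and", "for", "after", "with", "from", "that", "this", "are", "was"}
)


def title_keywords(titles, stopwords=STOPWORDS, min_length=3):
    """Count lowercase word tokens across titles, ignoring stopwords."""
    counts = Counter()
    for title in titles:
        # Normalize curly apostrophes so contractions stay in one token.
        text = title.lower().replace("\u2019", "'")
        words = re.findall(r"[a-z]+(?:'[a-z]+)?", text)
        counts.update(
            w for w in words if len(w) >= min_length and w not in stopwords
        )
    return counts


# Invented placeholder titles, not taken from the dataset.
sample_titles = [
    "Vandalism reported at community center",
    "Police investigate vandalism at local school",
    "Community rallies after the incident",
]

keyword_counts = title_keywords(sample_titles)
```

Because this counts exact word forms, related terms such as “Muslim” and “Muslims” are tallied separately unless the tokens are stemmed or lemmatized first.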

Most unusual is the comparatively little coverage received by anti-Semitic hate crimes. According to the FBI’s 2016 hate crime statistics, more than half of religiously-motivated hate crimes that year targeted the Jewish population. Clearly the FBI’s 2016 figures have no overlap with the 2017 dataset, but it is surprising that hate crimes directed at Muslims would receive many times the amount of coverage while occurring half as often as attacks on Jews in the preceding year. However, as previously mentioned, a tide of religious vitriol targeting Muslims has unquestionably emerged in the wake of the election and inauguration, possibly explaining at least part of this deficit. What fueled crimes targeting members of the Jewish community remains unclear, though the recent public reemergence of Nazism in America and its tether to the Alt Right movement may provide an answer.

Lastly, the most common keywords for the five largest contributors to the dataset were evaluated. This gives an indication of what different media sources chose to cover between February and August. The Washington Post and the New York Daily News appear to be particularly focused on coverage of white supremacy and local hate crimes, respectively. By comparison, HuffPost’s keyword count is extremely low, despite HuffPost having the most representation in the data overall. This could indicate coverage dispersed across a broader diversity of topics, though it’s tough to say without dissecting each story.
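A per-outlet breakdown like this can be sketched by first picking the largest contributors and then counting title words within each one. As before, the field names "organization" and "title" and the sample rows are hypothetical, and plain word counting stands in for whatever keyword extraction method is preferred.

```python
import re
from collections import Counter, defaultdict


def keywords_by_top_outlets(rows, n_outlets=5, n_keywords=3, stopwords=frozenset()):
    """Return the most common title words for each of the largest outlets."""
    outlet_counts = Counter(row["organization"] for row in rows)
    top = [name for name, _ in outlet_counts.most_common(n_outlets)]

    per_outlet = defaultdict(Counter)
    for row in rows:
        if row["organization"] in top:
            words = re.findall(r"[a-z]+", row["title"].lower())
            per_outlet[row["organization"]].update(
                w for w in words if len(w) >= 3 and w not in stopwords
            )
    # Preserve the largest-first ordering of outlets in the result.
    return {name: per_outlet[name].most_common(n_keywords) for name in top}


# Invented placeholder rows with nonsense titles.
sample_rows = [
    {"organization": "Outlet A", "title": "alpha beta"},
    {"organization": "Outlet A", "title": "alpha gamma"},
    {"organization": "Outlet A", "title": "alpha beta"},
    {"organization": "Outlet B", "title": "delta"},
    {"organization": "Outlet B", "title": "delta epsilon"},
    {"organization": "Outlet C", "title": "zeta"},
]

result = keywords_by_top_outlets(sample_rows, n_outlets=2)
```

An outlet whose top keywords all have low counts relative to its number of stories, as with HuffPost here, would be consistent with coverage spread across many topics.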

While the results of this analysis are interesting, they also highlight some deficiencies in data collection efforts writ large. For example, the data provides no information on the nature of the hate crime each article describes — no field to specify what motivated the crime, who the target was, or who the aggressor (if known) was. There’s also uneven coverage of different events, most notably a lack of data for the Charlottesville rally and attack.

But it’s a start. As data collection and metadata curation improve, it will become easier to use the news to produce real insights into trends in hate crime. Hopefully this is the beginning of a new and vital direction for not just documenting, but understanding, hate.

All code for this analysis is available on GitHub.

Data Scientist, Freelance Journalist.