Engines of Truth: A Deep Dive

DCVC
Dec 14, 2018


Primer’s software distills millions of unstructured data sources into concise summaries by a) breaking documents down into the concepts they discuss, b) identifying the relationships among those concepts, and c) reporting patterns in those relationships that humans might miss. Their project on how Russian- and English-language news sources cover terrorism showcases these engines nicely. Each engine automates a task that an analyst would otherwise do by hand.

Source: “Primer uses AI to understand and summarize mountains of text”

Primer’s first and second engines, Structure and Ensemble, identify the concepts discussed in the documents, determine how they relate to each other, and reconcile them into multi-document models. These two sentences are from articles on Keaun Cook, an Illinois teen arrested in 2016 for allegedly planning terror attacks:

“Authorities arrested an 18-year-old Southern Illinois man who they allege had communicated with an unspecified terrorist group about a plan to attack at least one area location.” — Business Insider

“Cook, they say, was devising a plan to attack three or four Madison County venues.” — KSDK

Primer’s Structure algorithm would create separate “Keaun Cook” and “attack” concepts in each document, and would mark Southern Illinois and Madison County as locations. Ensemble would then consolidate these concepts into a multi-document model: the “attack” model would carry both the “at least one” and “three or four” venue estimates, the software would recognize Madison County as part of Southern Illinois, and, after reading other documents, it would pinpoint the planned attack’s location as Godfrey.
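A rough analogue of the Structure step can be sketched with an off-the-shelf named-entity recognizer. Primer’s models are proprietary, so the snippet below (using spaCy and its small English model, both stand-ins) only illustrates the shape of the task: extract typed concepts from each document separately, then merge them across documents.

```python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

sentences = [
    "Authorities arrested an 18-year-old Southern Illinois man who they "
    "allege had communicated with an unspecified terrorist group about a "
    "plan to attack at least one area location.",
    "Cook, they say, was devising a plan to attack three or four Madison "
    "County venues.",
]

# Structure-like step: pull typed entities out of each document on its own.
per_document = [{(ent.text, ent.label_) for ent in nlp(s).ents} for s in sentences]

# Crude Ensemble-like step: union the per-document concepts into one set.
for text, label in sorted(set().union(*per_document), key=lambda p: p[1]):
    print(f"{label:10s} {text}")
```

The real Ensemble step is much harder than a set union: it has to decide that “a plan to attack at least one area location” and “a plan to attack three or four Madison County venues” describe the same event before it can merge their estimates.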

The Event engine identifies patterns in document dates and keywords to isolate individual events in a body of documents. For instance, it flagged an uptick in articles containing the phrase “Istanbul nightclub shooting” in January 2017. The graph below shows relative search volume over time, but the daily article count would follow a similar trend. The engine groups documents and their models into distinct events and constructs a timeline of them.

Source: Google Trends
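A toy version of this kind of burst detection (assuming nothing about Primer’s actual method) flags days whose article count for a key phrase sits far above the series’ typical level; the counts below are invented for illustration.

```python
import statistics

# Invented daily counts of articles containing a watched phrase such as
# "Istanbul nightclub shooting" in the days around the event.
daily_counts = [2, 3, 1, 4, 2, 3, 418, 265, 97, 40, 12, 5]

mean = statistics.mean(daily_counts)
stdev = statistics.stdev(daily_counts)

# Flag days whose volume is well above the baseline; consecutive flagged
# days would then be grouped into a single event on the timeline.
bursts = [day for day, n in enumerate(daily_counts) if n > mean + 2 * stdev]
print(bursts)  # -> [6]: the spike marks the event's onset
```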

Primer’s Context engine analyzes how each piece of information connects to those around it. It can identify a claim’s origin and how it spread, find supporting or contradicting evidence for a fact, and establish a probable cause-and-effect timeline among related events. Primer’s terrorism news example doesn’t show this engine, but if it were applied to Keaun Cook’s case, it would connect his arrest, his competency evaluation, his assault of a county deputy, and his indictment for that assault.
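One simple way to represent such a chain is a directed graph whose edges mean “precedes and plausibly leads to.” The sketch below uses networkx rather than anything Primer has published, and its edges are entered by hand from the Cook example; a Context-style engine would infer them from dates and textual evidence.

```python
import networkx as nx

# Hand-entered "precedes" edges from the Keaun Cook example.
timeline = nx.DiGraph([
    ("arrest", "competency evaluation"),
    ("competency evaluation", "assault of a county deputy"),
    ("assault of a county deputy", "indictment for that assault"),
])

# A topological sort recovers a chronology consistent with every edge.
print(" -> ".join(nx.topological_sort(timeline)))
```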

The Difference engine detects differences at multiple levels of resolution: sentence, document, and corpus. Changes as minor as a single word can be important to catch in legal documents (a toy diff illustrating this appears at the end of this section). Differences between documents can reveal variations in how media covered an event. For instance, after the 2017 Barcelona attack, the eyewitnesses interviewed by Russian and English media sources conveyed noticeably different sentiments. The Russian eyewitness reported that some people were “walking calmly and smiling” and that, since terrorist attacks happen everywhere, “why be afraid?”, while the English eyewitnesses were far less sanguine.

Barcelona terror attack eyewitness quotes

Gazeta: “The terrorist attack in Barcelona: 13 dead”

“… The streets filled with the flashing lights and sounds of sirens, and panic broke out. Some fled in fear, while others walked calmly toward us and smiled.” …

“Terrorist attacks happen everywhere: why be afraid? We need to think about the good,” he added.

New York Times: “Van Hits Pedestrians in Deadly Barcelona Terror Attack”

“It was horrific,” said Sergi Alcazar, a 25-year-old photographer who arrived 10 minutes after the attack to find victims lying amid broken umbrellas, chairs and cafe tables.

Keith Fleming … looked out over his balcony and “saw women and children just running and they looked terrified.”

“It’s just kind of a tense situation,” the AP reported him as saying. “Clearly people were scared.”

At the corpus level, the Difference engine compared how much attention the Russian and English article corpora paid to different events. This information determined the size and shade of the regions on Primer’s interactive map. Tracking differences between data sources helps analysts identify misinformation, track how stories develop over time, and untangle the game of “telephone” that distorts information as it spreads.
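Here is the promised toy diff for the sentence level. difflib is a stand-in for whatever Primer actually uses, and the contract sentences are invented:

```python
import difflib

# Two invented contract sentences that differ by two single words.
old = "The supplier shall deliver within thirty days of the order date.".split()
new = "The supplier may deliver within sixty days of the order date.".split()

# Compare word by word and report only the spans that changed.
matcher = difflib.SequenceMatcher(None, old, new)
for op, i1, i2, j1, j2 in matcher.get_opcodes():
    if op != "equal":
        print(op, old[i1:i2], "->", new[j1:j2])
# replace ['shall'] -> ['may']
# replace ['thirty'] -> ['sixty']
```

A “shall” that becomes a “may” turns an obligation into an option, which is exactly the kind of single-word edit the engine is meant to surface.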

Finally, Primer’s Story engine summarizes the information the other engines analyzed. Researchers have pursued automatic text summarization since the 1950s, but have struggled to produce concise, readable summaries that contain the right information. When a human reads a document and summarizes it, there are two crucial intermediate steps: understanding the document’s content and inferring its most important points. Traditionally, computers have approximated this with heuristics: phrase frequency, key-phrase identification, sentence position (the first sentence in a paragraph is usually the most important one), and cue words (“significantly,” “therefore,” “finally,” and so on). These heuristics power extractive text summarization: identifying the most important sentences in a document and pulling them into a short summary. More recent tools use abstractive summarization, summarizing documents in their own words.
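A bare-bones version of the frequency heuristic, in the spirit of Luhn’s 1958 work rather than of Primer’s system, scores each sentence by the frequent content words it contains and keeps the top scorers in their original order:

```python
import re
from collections import Counter

# A tiny stoplist; a real system would use a much larger one.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "that", "was", "is"}

def summarize(text: str, n_sentences: int = 2) -> str:
    """Frequency-based extractive summary: keep the highest-scoring sentences."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    freq = Counter(words)

    def score(sentence: str) -> int:
        # Sum the corpus frequency of every content word in the sentence.
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower()))

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:n_sentences])  # restore document order
    return " ".join(sentences[i] for i in keep)
```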

Primer combines extractive and abstractive summarization to create succinct overviews of thousands or millions of documents. This requires two algorithms: one that reads the documents (the encoder) and one that generates a summary (the decoder). Both the encoder and the decoder are recurrent neural networks, a type of model adept at processing context-dependent sequences. These networks were trained on 300,000 article-summary pairs from CNN and the Daily Mail. Text passes through the encoder, whose output feeds the decoder, which emits the summary. Particular document features shape the encoder’s output, such as the article’s topic or whether it contains certain keywords. The decoder then processes the encoder’s output, attending to the words worth focusing on. When the decoder is confident that it understands a concept, it generates its own sentences (abstractive); when it is not, it copies text from the original documents (extractive); a sketch of this copy-versus-generate step follows the example. Below are examples of Primer’s summaries for a cryptocurrency news dataset and event:

Source: “Sean Gourley: Building machines that read and write”

Primer generated this four-paragraph summary from 1,000+ articles in a fraction of the time it would have taken an analyst. It covered the week’s major events by article volume and social interest, identified key people and regions, and included all relevant statistics.
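The copy-versus-generate behavior described above closely resembles the published pointer-generator network (See, Liu, and Manning, 2017), which was likewise trained on CNN/Daily Mail pairs; whether Primer’s decoder works exactly this way is an assumption. The snippet below uses made-up tensors to show only the mixing step at the heart of that architecture:

```python
import torch

vocab_size, src_len = 50_000, 400

# Stand-in tensors: vocab_dist is the decoder's distribution over a fixed
# vocabulary (the abstractive path); attn_dist is its attention over the
# source words, reused as a copy distribution (the extractive path).
vocab_dist = torch.softmax(torch.randn(vocab_size), dim=0)
attn_dist = torch.softmax(torch.randn(src_len), dim=0)
p_gen = torch.sigmoid(torch.randn(1))  # the model's confidence in generating

# Map each source position to its vocabulary id so copy probability mass
# can be scattered back onto vocabulary words.
src_ids = torch.randint(0, vocab_size, (src_len,))

# Mix: generate with probability p_gen, copy with probability 1 - p_gen.
final_dist = (p_gen * vocab_dist).scatter_add(0, src_ids, (1 - p_gen) * attn_dist)
next_token = final_dist.argmax()  # most probable next summary token
```

When p_gen is high, vocabulary words dominate and the output reads abstractively; when it is low, probability mass shifts to words that actually appear in the source, which amounts to copying.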

You can return to our main post on Primer AI here.

