Semantic DNA to Analyze Messaging Effectiveness: an Application of Explainable NLP

SumUp Analytics
Nov 23, 2019 · 6 min read


Identifying the problem

When you’re marketing a product, persona, or idea, you must communicate your message effectively. Your proposition is meaningless if it does not resonate with your audience. Measuring that effectiveness becomes challenging when you lack reliable and consistent Key Performance Indicators. Political campaigns provide a prime example: election polls suffer from self-selection bias, and election results are too infrequent, and too final, to be useful, since the landscape often changes by the next election cycle. Current tools cannot accurately measure the effectiveness of even the most tracked and researched political and cultural messaging.

The private sector has its own challenges when analyzing message effectiveness. Overlapping propositions from competing products or services become increasingly difficult to differentiate. How do we confidently measure the effectiveness of the general message shared with our competitors against that of our unique message? How do we improve A/B testing so that it starts to account for these factors?

We present a fast and simple method to identify the specific message of an entity and analyze its impact on general or specific content. This method uses SumUp Analytics’ flexible APIs which enable you to focus on the most specific aspects of message dissemination.

Providing a solution

Our method consists of a three-pronged approach:

• Identify the key characteristics of a message: the text components that allow it to be clearly recognized and differentiated from competing messages

• Recognize the presence or absence of these key characteristics within a well-defined, targeted text corpus of interest. This presence or absence is defined as the message penetration

• Rank the levels of penetration of different messages, if desired

Applied to the case of a political campaign, this approach enables the identification of the key differentiating characteristics of a specific candidate’s message within a corpus of well-identified news coverage, the level of penetration of this message, and a measure of the effectiveness of this penetration compared to similar or competing messages.


Defining a method

After having identified a text corpus representative of a targeted message and a contrasting corpus representative of the competing message(s), we use corpus representation, topic extraction and contrast analysis to extract key wording representing and distinguishing the targeted message from the competing message(s).

In particular, this process constructs a complete contrasting dictionary, allowing us to identify characteristics of the focus message which are absent in the competing message(s) and vice versa. This produces a unique ‘DNA’ for each message.
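SumUp Analytics’ APIs are proprietary, so purely as an illustration, here is a minimal Python sketch of a contrasting dictionary built from plain term frequencies. The function name, the whitespace tokenization, and the toy corpora are all assumptions for this sketch; a production version would use proper topic extraction, stop-word handling, and term weighting.

```python
from collections import Counter

def contrast_dictionary(focus_docs, competing_docs, top_k=10):
    """Return terms frequent in the focus corpus but absent from the
    competing corpus (and vice versa) -- a crude 'semantic DNA'."""
    focus = Counter(w for doc in focus_docs for w in doc.lower().split())
    other = Counter(w for doc in competing_docs for w in doc.lower().split())
    # Keep only terms unique to each side, ordered by frequency.
    focus_only = [w for w, _ in focus.most_common() if w not in other][:top_k]
    other_only = [w for w, _ in other.most_common() if w not in focus][:top_k]
    return focus_only, other_only

dna, anti_dna = contrast_dictionary(
    ["medicare for all now", "medicare expansion plan"],
    ["cut taxes now", "lower taxes plan"],
)
```

Terms shared by both corpora (here, “now” and “plan”) drop out of both sides, which is what makes the resulting dictionary differentiating rather than merely descriptive.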

Measuring actual relevance

The accuracy of the above method is easily validated through a cross-validation test (cf. Appendix). Note, however, that the value of the approach depends essentially on the quality of the data used to build the model: the focus corpus and the competing corpus must each be truly representative of the corresponding message. In some cases, the focus and competing messages may appear distinct yet not be meaningfully different. The proposed approach surfaces this issue, either by failing to identify a truly differentiating dictionary or by showing poor results in the cross-validation test. Both outcomes, though failures, still provide value: they signal either an excessive similarity between the messages being compared or an inadequate choice of corpora for the analysis.


Defining a target corpus

Once a message’s semantic DNA has been clearly defined, it becomes possible to look for that DNA in other corpora, such as social media platforms or news coverage, in specific regions, at specific times, or with specific focuses. This step requires domain expertise used in synergy with our explainable NLP tools. Given a well-defined corpus, step 1 provides the ammunition to search for the DNA within it; the conclusions to be drawn from its presence or absence lie in the hands of the domain experts.

Quantifying message penetration

We combine sentiment analytics, corpus representation, and topic extraction to quantify the level of presence of a specific message within a well-defined corpus, by instantly recognizing the existence of its key differentiating elements. Quantifying this presence lets us measure the extent to which a message is “present” in a given corpus.
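As a rough sketch of what such a penetration measure could look like, assuming a simple document-hit rate rather than SumUp’s actual metric (the function name and inputs are hypothetical):

```python
def penetration(dna_terms, corpus):
    """Fraction of documents in `corpus` containing at least one DNA term."""
    dna = {t.lower() for t in dna_terms}
    if not corpus:
        return 0.0
    hits = sum(1 for doc in corpus if dna & set(doc.lower().split()))
    return hits / len(corpus)

score = penetration(
    ["medicare", "childcare"],
    ["medicare debate heats up", "markets rally", "childcare costs rise"],
)
```

Here two of the three documents contain at least one DNA term, so the score is 2/3; a weighted variant could count term frequencies instead of document hits.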


For most observers, several aspects of messaging effectiveness can be important:

• Effectiveness of a specific message

• Effectiveness of a message compared to an alternative message

• Effectiveness of several competing messages

Our analytics platform, through the method described above, clearly quantifies the level of penetration of a message within a specific target corpus and lets the user address each of these aspects. For example, different messages can be ranked by level of penetration.
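A minimal sketch of such a ranking, assuming the simple document-hit penetration measure from the previous step (candidate names and DNA terms are invented for illustration):

```python
def rank_messages(dna_by_candidate, corpus):
    """Rank candidates by penetration of their DNA terms in `corpus`."""
    def pen(terms):
        t = {w.lower() for w in terms}
        return sum(1 for doc in corpus if t & set(doc.lower().split())) / len(corpus)
    # Highest penetration first.
    return sorted(
        ((name, pen(terms)) for name, terms in dna_by_candidate.items()),
        key=lambda item: item[1],
        reverse=True,
    )

ranking = rank_messages(
    {"A": ["healthcare"], "B": ["tariffs"]},
    ["healthcare reform bill", "healthcare costs", "tariffs on steel"],
)
```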


For the purpose of illustration, we are running the following ongoing experiment.

The 2020 Election news cycle is providing a wealth of study material: multiple candidates’ messages, competing on multiple dimensions, across fairly distinct regional and national platforms.

We identify the semantic DNA of six candidates (Joe Biden, Pete Buttigieg, Kamala Harris, Amy Klobuchar, Bernie Sanders, and Elizabeth Warren), using each candidate’s platform and respective plans as the focus and competing messages. The cross-validation tests in the Appendix suggest that these semantic DNAs are fairly accurate at identifying which campaign a given document comes from, and significantly outperform a benchmark methodology in that regard.

Exhibit 1 — Candidates textual DNA (excerpt): subset of the words most representative of each candidate against their opponents
Exhibit 2 — Candidates textual DNA (excerpt): subset of the words most representative of each candidate’s opponents

We identified a number of news outlets of interest in Iowa, intended to be representative of Iowa news coverage ([14] through [21]), as well as a set of political websites providing a daily RSS feed and a balanced coverage of Democrat and Republican inclinations ([1] through [13]).

We then used the approach described above in order to rank each candidate, in terms of the level of penetration of each candidate’s messaging in the identified press coverage.

Exhibit 3 — Relative rankings of each candidate’s message in online news corpora, from detecting their respective semantic DNA


We have validated the accuracy of our semantic DNA identification method through a cross-validation test performed on the content retrieved from each of the six candidates’ websites.

The test is implemented as follows:

- For a given candidate, a sub-corpus is created by excluding one of the documents retrieved from that candidate’s website. This sub-corpus is combined with all the documents from the candidate’s opponents

- A semantic DNA is identified for the candidate and the grouping of their opponents, following the methodology described in the first section of this post

- That semantic DNA is then traced in the document that was left out

- If the trace comes back positive, that is a match; if it comes back negative, that is a miss

- This process is repeated, excluding in turn each document retrieved from the candidate’s website, so as to obtain a positive or negative trace for each of the candidate’s documents

- The process is then applied in the same way to the corpus of documents from the candidate’s opponents, excluding documents one by one from the semantic DNA identification and using each to trace that DNA
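The leave-one-out procedure above can be sketched in Python as follows. This is a toy reimplementation under simplifying assumptions: the real pipeline uses SumUp’s topic-extraction and contrast APIs, whereas this sketch approximates the DNA with a plain set difference of whitespace tokens, and the corpora are invented.

```python
def trace(dna_terms, doc):
    """Positive trace: the document contains at least one DNA term."""
    return bool(set(dna_terms) & set(doc.lower().split()))

def leave_one_out_recall(candidate_docs, opponent_docs):
    """Hold each candidate document out, rebuild the DNA from the rest,
    and check whether the held-out document still matches it."""
    matches = 0
    for i, held_out in enumerate(candidate_docs):
        rest = candidate_docs[:i] + candidate_docs[i + 1:]
        focus = {w for doc in rest for w in doc.lower().split()}
        other = {w for doc in opponent_docs for w in doc.lower().split()}
        dna = focus - other  # crude stand-in for the contrast dictionary
        matches += trace(dna, held_out)
    return matches / len(candidate_docs)

recall = leave_one_out_recall(
    ["medicare for all", "medicare expansion", "medicare plan details"],
    ["cut taxes", "tax plan"],
)
```

The match rate returned here corresponds to recall, which is why recall (rather than precision) is the natural headline metric for this one-sided test.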

Standard performance metrics are computed. Precision is not reported due to the one-sided nature of these tests (it is always equal to 1). Our approach is benchmarked against a method developed by LSE. We also evaluated the impact of design choices in how weights are ascribed to each word extracted during semantic DNA identification. The results further establish that our methodology is accurate and significantly outperforms the benchmark. The following exhibit displays the results for Elizabeth Warren and the group of her Democrat opponents.

Exhibit 4 — Example of cross-validation test results: SumUp Analytics’ approach and benchmark LSE approach [22]