Chenhao Tan
Jul 28, 2017

Exploring the Friendships, Rivalries, and Trysts between Ideas in Texts

Ideas have been competing and collaborating throughout our history. For instance, it took a long time for Copernican heliocentrism to win out over the geocentric model in the western world; we are now experiencing a competition between protectionism and globalization. In this blog post, we introduce a simple yet effective framework to explore the relations between ideas from our ACL 2017 paper, “Friendships, Rivalries, and Trysts: Characterizing Relations between Ideas in Texts”.

An Ecosystem of Ideas

Ideas do not exist as isolated islands; they relate to each other in different ways. Inspired by the “meme” analogy from Dawkins’ famous book “The Selfish Gene”, we can think of the interactions between ideas like a set of species interacting in an ecosystem.

Tracing the Ecosystem in Texts

The first step towards exploring the ecosystem of ideas is to find the traces of ideas. Fortunately, although ideas exist in the mind, they are made manifest in language. We now have access to massive amounts of temporal textual data that span long periods of time, for example, the Google Ngram Viewer and news corpora that date back to the 1980s.

Next, we show how to quantitatively describe relations between ideas in texts. Within this quantitative framework, we find surprising ways that ideas may relate to each other beyond simple friendships and competitions. We then use examples to show how the framework gave us new insights into terrorism and immigration in news corpora. Moreover, our framework is independent of how we might represent ideas.

A Simple yet Effective Framework

What the heck do you mean?

Consider news articles on abortion or immigration from 1980 to 2000. Suppose we have extracted the ideas in each news article; they are represented by the dots in the following figure. These ideas may be related in different ways. For instance, we may expect “illegal alien” to compete with “undocumented immigrants”, as the two terms signify different political ideologies. The same goes for “pro-life” and “pro-choice”. Meanwhile, “Obama presidency” and “undocumented immigrants” may work hand in hand.

Can we find a way to describe these relations from textual data? And more importantly, can we effectively sift out potentially meaningful or interesting relations from all the possible pairs of ideas (the messy networks on the left)?

Each orange dot represents an idea. Our framework sifts out interesting relations to allow for effective exploration.

Temporal correlation is not enough!

One possible approach is to compute the prevalence correlation between two ideas, that is, how closely their popularity moves together over time. Under this view, competition between two ideas means that one wins out over the other (negative correlation), as suggested by the “marketplace of ideas” metaphor, while two friendly ideas are positively correlated. However, the first observation that we made is that “pro-life” and “pro-choice”, two seemingly diametrically opposed ideas, are strongly correlated in temporal prevalence.

The trend between “pro-life” and “pro-choice”. Their prevalences almost completely track each other. Prevalence correlation thus cannot reveal their competition.
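To make “prevalence correlation” concrete, here is a minimal sketch in Python. It assumes that the prevalence of an idea in a time period is the fraction of documents in that period that mention it, and that correlation is the Pearson correlation between two such time series. The input format, the function names, and the toy documents are all made up for illustration; this is not the code from our repo.

```python
from collections import defaultdict
from math import sqrt


def prevalence_by_period(docs, idea):
    """Fraction of documents in each period that mention `idea`.

    `docs` is a list of (period, set_of_ideas) pairs (a hypothetical format).
    Periods are returned in sorted order.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for period, ideas in docs:
        totals[period] += 1
        if idea in ideas:
            hits[period] += 1
    return [hits[p] / totals[p] for p in sorted(totals)]


def pearson(xs, ys):
    """Pearson correlation between two equally long series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series has no defined correlation; treat as 0
    return cov / sqrt(var_x * var_y)


# Toy corpus (invented): both ideas peak in 1990 but never share a document.
docs = [
    (1980, {"pro-life"}), (1980, set()), (1980, {"pro-choice"}), (1980, set()),
    (1990, {"pro-life"}), (1990, {"pro-choice"}),
    (1990, {"pro-life"}), (1990, {"pro-choice"}),
    (2000, {"pro-life"}), (2000, set()), (2000, {"pro-choice"}), (2000, set()),
]

life = prevalence_by_period(docs, "pro-life")
choice = prevalence_by_period(docs, "pro-choice")
r = pearson(life, choice)
print(life, choice, r)
```

In this toy corpus the two series rise and fall together, so the correlation is strongly positive even though the two ideas never appear in the same document. Correlation alone cannot tell such a pair apart from two ideas that are always mentioned together.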

It Takes Two to Tango: Prevalence Correlation and Cooccurrence

The main insight in this paper is to combine temporal prevalence correlation and cooccurrence, NLPers’ favorite statistic. Combining these two leads to the following four quadrants (examples are from an immigration news corpus from 1980 to 2016, ideas are represented using topics):

  • Friendship. Two ideas that tend to cooccur and are correlated in prevalence. A classic example of a nice friendship is “undocumented immigrants” and “president Obama”.
  • Head-to-head. Two ideas that rarely cooccur and are anti-correlated in prevalence. A canonical example of competition is “undocumented immigrants” vs. “illegal alien”.
  • Arms-race. Two ideas may rarely cooccur, but completely track each other’s popularity, similar to the Cold War situation between the U.S.S.R. and the U.S.A. These pairs are likely related to the same underlying cause, e.g., “immigration deportation” and “republican party”.
  • Tryst. Surprisingly, we also see pairs of ideas that tend to cooccur but are anti-correlated in prevalence: they meet briefly and then continue on different paths, with one becoming popular while the other fades out. An example is “immigration deportation” and “detention”.
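The four quadrants above can be sketched as a sign check on the two statistics. The snippet below is only an illustration: it assumes cooccurrence is measured with pointwise mutual information (PMI) over documents, so that a positive value means “cooccur more than chance” and a negative value means “cooccur less than chance”. The function names and toy documents are hypothetical.

```python
from math import log


def pmi(docs, a, b):
    """Pointwise mutual information of ideas `a` and `b` over documents.

    `docs` is a list of sets of ideas (a hypothetical format).
    Returns -inf when the two ideas never share a document.
    """
    n = len(docs)
    n_a = sum(1 for d in docs if a in d)
    n_b = sum(1 for d in docs if b in d)
    n_ab = sum(1 for d in docs if a in d and b in d)
    if n_a == 0 or n_b == 0:
        raise ValueError("both ideas must appear in at least one document")
    if n_ab == 0:
        return float("-inf")
    # log( P(a, b) / (P(a) * P(b)) ), written with raw counts
    return log(n_ab * n / (n_a * n_b))


def relation_type(cooccurrence, correlation):
    """Map the signs of the two statistics to one of the four quadrants."""
    if cooccurrence > 0 and correlation > 0:
        return "friendship"
    if cooccurrence < 0 and correlation < 0:
        return "head-to-head"
    if cooccurrence < 0 and correlation > 0:
        return "arms-race"
    if cooccurrence > 0 and correlation < 0:
        return "tryst"
    return "none"  # one of the statistics is exactly zero


# Toy documents (invented): "a" and "b" always appear together, never with "c".
toy_docs = [{"a", "b"}, {"a", "b"}, {"c"}, {"c"}]
print(pmi(toy_docs, "a", "b"))   # positive: cooccur more than chance
print(pmi(toy_docs, "a", "c"))   # -inf: never cooccur
print(relation_type(pmi(toy_docs, "a", "b"), 0.8))
```

A real corpus would likely need smoothing so that pairs with zero joint counts do not produce infinite values; the sketch leaves that out to keep the idea visible.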

A simple statistic obtained by multiplying prevalence correlation and cooccurrence allows us to sift out the following top relations.

The top four pairs in each quadrant, ranked by the product of prevalence correlation and cooccurrence.
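Here is a minimal sketch of that ranking step. It assumes each pair of ideas already comes with a cooccurrence score and a prevalence correlation, assigns the pair to a quadrant by the signs of the two scores, and ranks pairs within each quadrant by the absolute value of their product. The idea names and all numbers are invented placeholders, not results from our corpora.

```python
# Quadrant lookup keyed by (cooccurrence > 0, correlation > 0).
QUADRANTS = {
    (True, True): "friendship",
    (False, False): "head-to-head",
    (False, True): "arms-race",
    (True, False): "tryst",
}


def top_relations(scored_pairs, k=4):
    """Return the k strongest pairs per quadrant.

    `scored_pairs` holds (idea_a, idea_b, cooccurrence, correlation) tuples.
    Strength is |cooccurrence * correlation|; pairs with a zero score on
    either axis are skipped because they fall on a quadrant boundary.
    """
    ranked = {name: [] for name in QUADRANTS.values()}
    for a, b, cooc, corr in scored_pairs:
        if cooc == 0 or corr == 0:
            continue
        name = QUADRANTS[(cooc > 0, corr > 0)]
        ranked[name].append((abs(cooc * corr), a, b))
    return {
        name: sorted(items, reverse=True)[:k]
        for name, items in ranked.items()
    }


# Placeholder scores, made up for illustration only.
scored = [
    ("idea A", "idea B", 1.5, 0.9),
    ("idea A", "idea C", 0.4, 0.5),
    ("idea B", "idea C", -2.0, -0.7),
    ("idea C", "idea D", -1.0, 0.8),
    ("idea A", "idea D", 0.6, -0.5),
]

top = top_relations(scored, k=1)
for name, items in top.items():
    print(name, [(a, b) for _, a, b in items])
```

Ranking by the product means a pair has to be extreme on both axes to reach the top of a quadrant, which is what lets a handful of relations stand out from the messy network of all possible pairs.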

Refer to the paper for a more quantitative evaluation showing that combining these two statistics allows for effective exploration of relations between ideas. We also have a public code repo for exploration and visualization (links at the end).

Examples, I Want More Examples

This framework revealed a few interesting insights that we did not expect.

Islam and Arab

The relation ranked #2 among trysts in terrorism-related news is between the keywords “arab” and “islam”. This observation suggests a conjecture: the news media have increasingly linked terrorism to a religious group rather than an ethnic group, perhaps in part due to the tie between the events of 9/11 and Afghanistan, which is not an Arab or Arabic-speaking country. The underlying reason for this observation requires further investigation. Coincidentally, an article in the Huffington Post called for news editors to distinguish Muslim from Arab.

Tryst relation between keywords “arab” and “islam”, ranked #2 in news on terrorism from 1980 to 2016.

International relations in terrorism-related news

It turns out that the top relations between ideas in terrorism-related news center around a topic called “federal, state”, which describes domestic policy on terrorism. We see two intriguing patterns: 1) the domestic policy topic is in an arms-race with topics around Afghanistan and Pakistan (they gain popularity together but rarely cooccur), while it is in head-to-head relations with Iran and Israel (they rarely cooccur and are anti-correlated; Iran and Israel have faded out of terrorism discussions since 2000); 2) the connections follow structural balance theory, “the enemy of an enemy is a friend”.

Top relations between ideas in news on terrorism from 1980 to 2016, centering around “federal, state”.

Immigrants of different ethnicities

The final example is on relations between ethnic groups. Although the keywords “latino” and “asian” are likely to cooccur, their prevalence is anti-correlated: the discussion of Asian immigrants in the 1990s gave way to a focus on the word “latino” from 2000 onward. Possible theories to explain this observation include that undocumented immigration is generally perceived as a Latino issue, or that Latino voters are increasingly influential in U.S. elections. Again, further investigation is required.

Tryst relation between keywords “latino” and “asian”, ranked #8 in news on immigration.

Recap and Looking Forward

We show a simple yet effective framework to explore relations between ideas in temporal text corpora. It combines prevalence correlation and cooccurrence and is independent of idea representation.

Our work is only a small step into the ecosystem of ideas. Better semantic representations will be necessary for a deeper understanding of the relations between ideas. Moreover, these relations are inherently dynamic.

Code and visualization tool

Our code is available at https://github.com/Noahs-ARK/idea_relations. Our awesome team is releasing an interactive visualization tool here (https://github.com/nwrush/Visualization).

Screenshot of the visualization tool

Go explore!

People behind the above idea: Chenhao Tan, Dallas Card, Nikko Rush, Noah A. Smith.
