Evoking Syntax: Part 1, Part-of-Speech

Forge.AI — Fueling Machine Intelligence
Jan 12, 2019

By: Javier Velez, Ph.D.

In our last post, we began by trying to create a model for veracity, and ended with the idea of creating a model for intention using the syntax of sentences. In this post, we are going to start looking into the particulars for a model of intention using syntax.

In general, I create models by slowly peeling away salient questions and answering them one by one until one of two things happens:

  • We have a model whose performance is “good enough” for what we need
  • We have shown that a particularly bad property or performance must exist, and we need to rethink our premise

With that in mind, let us dive into syntax and intention.

The Base Question: Is it even possible to determine intention using syntax?

Let us dig into James W. Pennebaker’s research for some inspiration. An expert in natural language and social behavior, Pennebaker is mostly content with counting word choices, so we will also begin with simple counts of the different parts of speech (POS). Of course, we have decided that voice is something we want to avoid modeling (for now), so we will want to normalize our raw counts by something like style (as defined in our previous post: Veracity: Models, Methods, and Morals).

Let’s start with the simplest thing I can think of: different styles of writing have different sentence lengths. Therefore, we will normalize the raw counts by the number of tokens to get a “density” for each POS.
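To make that concrete, here is a minimal sketch of the density computation, assuming spaCy as the tagger (the post does not say which tagger the notebook actually uses; the ADV and DET labels referenced later follow the coarse POS tagset that spaCy emits):

from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def pos_densities(text):
    """Return {POS tag: count / total tokens} for a piece of text."""
    doc = nlp(text)
    counts = Counter(token.pos_ for token in doc)
    return {tag: count / len(doc) for tag, count in counts.items()}

print(pos_densities("How interested are you in keeping up-to-date on sports?"))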

Now that we have an idea of the things our model will include (densities of POS), we want to turn our question into a true experiment.

The most important question, which we will keep going back to as we create more and more models, is also the simplest: is it still possible to do X with Y? For us, this becomes: Is it still possible to model intention using syntax?

Experiment 1 — Parts-of-Speech Densities as Syntax

To build a machine learning model experiment, we need several things:

  • A hypothesis to test
  • A dataset
  • A validation strategy to compute our confidence in the answer

Often, the hypothesis and dataset come hand in hand, and that is the approach I will take here. Let us think about the clearest, most obvious link between intention and syntax…

With a quick think, I can come up with several links:

  1. Politicians’ words are “empty promises”
  2. Good survey designers and statisticians spend a lot of time making sure their questions do not bias the answers
  3. Clickbait headlines are all about clicking (the intention is very clear here!)

Option #1 I will throw out since I do not want to troll any particular group.

Option #3 has a very clear intention, but I’m not so sure about the syntax.

Option #2 seems perfect: the intention of non-bias is explicit, and word choice and syntax are clearly a part of survey question design.

Option #2 looks good; now, can we grab a dataset and build an experiment for it?

Survey Questions, not Answers

Thanks to the people at the Pew Research Center, we have a dataset of questionnaires. Of even more importance, looking at the Pew Research Center’s methodology, we see that the questions are very clearly created with an intention of non-bias in order to get true random sampling answers for the surveys. In fact, there’s an entire page on their methodology pertaining to question creation here.

OK, questionnaire questions can serve as one type of intent. There are also tons of different surveys, so we can sample questions across a diverse set of topics to make sure we are modeling syntax and not content!

Sadly, the Pew Research Center datasets mostly focus on answers instead of questions. As such, the questionnaires themselves are just DOCX-formatted documents (a non-free format) that are not easily machine-parseable. So, as any good modeler should, let’s roll up our sleeves and hand-massage some questions out of a few surveys. I grabbed four different Pew Research Center surveys:

  • Cyber-Security Knowledge — administered June 17–27, 2016
  • Libraries — administered March 7–April 4, 2016
  • Information Engaged Wary — administered September 29–November 6, 2016
  • American Trends Panel, Wave 24 — administered January 9–January 23, 2017

I then hand-extracted the questions into a single YAML file which can be found alongside this Jupyter Notebook in the repository. The result? 317 questions. There are a few caveats to my methodology when extracting questions:

  • I ignored all text that was a command or suggestion to the surveyor (usually demarcated with “[ and ]”).
  • Whenever a question was written as a multi-part question, there would often be two styles of the question: one for the first part, and one for the second part (to be used “when needed”). This, of course, means that the questions are actually different per survey administered. I decided to use the alternative question syntax (the “when needed” part) every three sub-questions.
  • Some questions are fill-in-the-end style with ellipses at the end. I kept the ellipses.
  • Some questions were only to be asked if a previous answer was within a specific range of values. I worked under the assumption that the previous answer was in the range that qualified for the follow-up question.
  • Some questions started with a sentence or two of information or direction. I kept those sentences as part of the question. My reasoning was that these statements were part of the question, and therefore should be part of the syntax of the intention.

An example from the ‘September 29…’ document:

ASK ALL:

Q1. How interested are you in keeping up-to-date on the following topics? (First,/Next,) [INSERT ITEMS; RANDOMIZE]. [READ FOR FIRST ITEM, THEN AS NECESSARY: Would you say you are very interested in keeping up-to-date on that, somewhat interested, not too interested, or not at all interested in it?] {new}

a. Business and Finance

b. Government and politics

c. Sports

d. Events in your local community

e. Schools or education

f. Health or medical news

g. Science and technology

h. Arts or entertainment

i. Foreign affairs or foreign policy

This question had a set of follow-up questions:

How interested are you in keeping up-to-date on the following topics? First, Business and Finance?

Next, Government and politics?

Next, sports?

Next, Events in your local community? Would you say you are very interested in keeping up-to-date on that, somewhat interested, not too interested, or not at all interested in it?

Next, Schools or education?

Next, Health or medical news?

Next, Science and technology? Would you say you are very interested in keeping up-to-date on that, somewhat interested, not too interested, or not at all interested in it?

Next, Arts or entertainment?

Next, Foreign affairs or foreign policy?

Note that context and sequence are inherently important for these questions. We’ll look into this in a later post.
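Before moving on, a quick note on loading these questions. The sketch below assumes the hand-extracted YAML is simply a flat list of question strings and uses a hypothetical file name; the actual file and schema live in the repository alongside the notebook.

import yaml  # PyYAML

with open("pew_questions.yaml") as f:  # hypothetical file name
    pew_questions = yaml.safe_load(f)

print(len(pew_questions))  # should print 317 if the file matches the extraction described above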

SQUAD: Reading Comprehension Questions

After hand-extracting the Pew Research Center questions, we now need a second set of questions that has a completely different intent. Thankfully, the Stanford Question Answering Dataset (SQUAD) has an easy-to-parse JSON set of questions intended to measure reading comprehension. That means we can download the JSON and extract the questions as-is; nice! In total, we have a little over 87K questions from SQUAD.

Here’s an example set of questions from SQUAD:

{
  "data": [{
    "title": "University_of_Notre_Dame",
    "paragraphs": [{
      "context": "Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend \"Venite Ad Me Omnes\". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.",
      "qas": [{
        "answers": [{
          "answer_start": 515,
          "text": "Saint Bernadette Soubirous"
        }],
        "question": "To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?",
        "id": "5733be284776f41900661182"
      }, {
        "answers": [{
          "answer_start": 188,
          "text": "a copper statue of Christ"
        }],
        "question": "What is in front of the Notre Dame Main Building?",
        "id": "5733be284776f4190066117f"
      }, {
        "answers": [{
          "answer_start": 279,
          "text": "the Main Building"
        }],
        "question": "The Basilica of the Sacred heart at Notre Dame is beside to which structure?",
        "id": "5733be284776f41900661180"
      }, {
        "answers": [{
          "answer_start": 381,
          "text": "a Marian place of prayer and reflection"
        }],
        "question": "What is the Grotto at Notre Dame?",
        "id": "5733be284776f41900661181"
      }, {
        "answers": [{
          "answer_start": 92,
          "text": "a golden statue of the Virgin Mary"
        }],
        "question": "What sits on top of the Main Building at Notre Dame?",
        "id": "5733be284776f4190066117e"
      }]
    }]
  }]
}
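Given that structure, a minimal sketch for pulling out just the question strings (the file name below is an assumption; the nesting follows the example above):

import json

with open("train-v1.1.json") as f:  # assumed name of the SQUAD v1.1 training file
    squad = json.load(f)

squad_questions = [
    qa["question"]
    for article in squad["data"]
    for paragraph in article["paragraphs"]
    for qa in paragraph["qas"]
]

print(len(squad_questions))  # a little over 87K for the training set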

Visual Inspection

I always like to visualize my data, so the first thing I did was plot the histograms of the POS densities for each of the datasets. We expect there to be differences if there is any hope of creating a model that can differentiate the two using these densities. As we see in the figure below, on the ADV plot, there are some nice visual differences in the histograms.
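Here is a minimal sketch of how such histograms might be produced, assuming the pew_questions and squad_questions lists from the loading sketches above and the pos_densities function from earlier (names are illustrative, not the notebook's):

import matplotlib.pyplot as plt

# Per-question density dictionaries, reusing the earlier sketches.
pew_densities = [pos_densities(q) for q in pew_questions]
squad_densities = [pos_densities(q) for q in squad_questions]

def tag_values(tag, density_dicts):
    """Collect one POS tag's density across a list of density dictionaries."""
    return [d.get(tag, 0.0) for d in density_dicts]

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, tag in zip(axes, ["ADV", "DET"]):
    ax.hist(tag_values(tag, pew_densities), bins=30, density=True, alpha=0.5, label="PEW")
    ax.hist(tag_values(tag, squad_densities), bins=30, density=True, alpha=0.5, label="SQUAD")
    ax.set_title(tag)
    ax.legend()
plt.show()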

So far, so good: it is still possible that we can model intent using syntax.

As a fun aside, look at the determiners, DET, plot below (a zoomed version of the DET subplot from above):

Pennebaker referred to these as “function words,” and you can see a clear difference in their distributions over the datasets. Always good to see evidence of the theory you used as inspiration for your own work.

Hypothesis and Statistical Significance

Now that we have two datasets that we believe have questions which differ in intent, let us create our hypothesis and test for the hypothesis.

Null Hypothesis

The average density of adverbs (per token) for Pew Research Center’s survey questions is the same as the average density of adverbs for SQUAD’s reading comprehension questions.

Alternate Hypothesis

The averages between the two sources differ.

With our hypotheses ready, we’ll run a standard two-sample Welch’s t-test since we do not think that the variances of the underlying distributions are equal.
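A minimal sketch of that test with SciPy, assuming the per-question density dictionaries from the earlier sketches (passing equal_var=False is what turns the standard two-sample t-test into Welch's version):

from scipy import stats

pew_adv = [d.get("ADV", 0.0) for d in pew_densities]      # per-question adverb densities
squad_adv = [d.get("ADV", 0.0) for d in squad_densities]

result = stats.ttest_ind(pew_adv, squad_adv, equal_var=False)  # Welch's t-test
print("statistic =", result.statistic, ", pvalue =", result.pvalue)
if result.pvalue < 0.01:
    print("null hypothesis REJECTED")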

Welch’s t-test result

  • statistic=-9.85252000171142
  • pvalue=3.747944550892387e-20
  • null hypothesis REJECTED

We could run a similar test for every other POS, but that is a bad idea: running many tests invites spurious rejections, and we just want to sanity check that something syntactic differs between the datasets.

“Impossible” Questions

It turns out that SQUAD v2.0 added a set of questions that are impossible to answer. That is, the questions ask about something that is not contained in the attached reading material for the question. We can ask whether these impossible questions have different POS statistics than both the Pew Research Center questions and the possible SQUAD questions. Instinctively, I would say that POS do not capture vocabulary choice, so the impossible questions should look exactly the same as the possible questions when viewed through POS tags (e.g., “Is Jean cold?” versus “Is Anita cold?” share the same syntax, but one may be impossible to answer).
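Splitting the SQUAD v2.0 questions into the two groups is straightforward, because each question entry in v2.0 carries an is_impossible flag (the file name below is an assumption):

import json

with open("train-v2.0.json") as f:  # assumed name of the SQUAD v2.0 training file
    squad2 = json.load(f)

possible, impossible = [], []
for article in squad2["data"]:
    for paragraph in article["paragraphs"]:
        for qa in paragraph["qas"]:
            (impossible if qa["is_impossible"] else possible).append(qa["question"])

print(len(possible), "possible /", len(impossible), "impossible")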

Let’s start with plotting the POS densities per token between the possible and impossible questions.

We can also run the same test but this time asking if the mean adverb density is different between the impossible and possible questions in the SQUAD dataset:

Welch’s t-test result

  • statistic=-16.096645828445343
  • pvalue=3.3082419596345946e-58
  • null hypothesis REJECTED

So it turns out that there is a difference between the mean density of adverbs for impossible questions and the mean density for possible questions. Who knew?! Now, the intention behind impossible questions is the same as for possible questions, so this test result actually adds some evidence towards not using POS as features to judge the intention of a piece of text. At the same time, it could be argued that the intention of impossible questions is to trick a reader, whereas possible questions are meant to measure understanding. It is all evidence (for, against, or unknown), and we are very early in our process, so we should keep the results in mind and continue to ask how intention can be modeled.

The Power of a Test

Showing that the means differ between the datasets is a pretty low bar for a feature. So what else can we learn from the hypothesis tests? We can look at the power of the tests, and specifically we can look at how much data is required to reject the null hypothesis with a set threshold.

With this information in hand, we can determine a number to keep in mind when we talk about the size of data required to train a model based on the properties tested in the hypothesis test. After all, the training data must exhibit the property in question for any model to hope to learn how to leverage it. We can ask the following question: on average, how much data do we need for the hypothesis test to reject the null hypothesis with a p-value below 0.01?

We will estimate the number of data points required to reject the null hypothesis (as above) with a p-value less than 0.01 by randomly sampling subsets of the full dataset. We do a number of iterations of this random sampling technique, where for each iteration we check how much data was needed to reject with p-value < 0.01. This will allow us to compute an estimate of the mean size of the data required to reject the null hypothesis (as long as the full dataset reflects truth) as well as a confidence bound over this mean estimate.

We run the procedure with two versions of random sampling. In the first version, we simply randomly shuffle the dataset and forward simulate as if the data were coming in the order of the newly-shuffled data. This will result in the expected amount of data given that the ratio of labeled data is the same as our full datasets.
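To make this concrete, here is a minimal sketch of that first (shuffle-and-grow) version; the notebook's data_size_estimate may well be implemented differently, and the pew_adv and squad_adv lists from the t-test sketch are assumed as inputs:

import numpy as np
from scipy import stats

def data_size_estimate(group_a, group_b, n_iters=1000, alpha=0.01, step=10):
    """Estimate how many shuffled data points are needed before Welch's test rejects at p < alpha."""
    labeled = [(x, 0) for x in group_a] + [(x, 1) for x in group_b]
    sizes = []
    for _ in range(n_iters):
        order = np.random.permutation(len(labeled))
        for n in range(step, len(labeled) + 1, step):
            sample = [labeled[i] for i in order[:n]]
            a = [x for x, label in sample if label == 0]
            b = [x for x, label in sample if label == 1]
            if len(a) < 2 or len(b) < 2:
                continue
            if stats.ttest_ind(a, b, equal_var=False).pvalue < alpha:
                sizes.append(n)
                break
    return np.mean(sizes), np.std(sizes)

mean_size, sigma = data_size_estimate(pew_adv, squad_adv)
print("Mean Total Data:", mean_size, "(with sigma:", sigma, ")")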

We can also sample each labeled dataset separately and assume a balanced distribution over the labels; this is the second random sampling we will investigate. The functions data_size_estimate and balanced_data_size_estimate implement the above procedures. After 100,000 iterations, we get the following:

Inherent Label Ratio Data Size Estimates

PEW vs SQUAD:

Mean Total Data: 5417.389 (with sigma: 4024.697)

Possible vs Impossible:

Mean Total Data: 1775.358 (with sigma: 1657.245)

Balanced Label Ratio Data Size Estimates

PEW vs SQUAD:

Mean Total Data: 200.792 (with sigma: 12.593)

Possible vs Impossible:

Mean Total Data: 1735.086 (with sigma: 1505.538)

While we rejected the null hypothesis for both PEW versus SQUAD and Impossible versus Possible, there is a rather large difference in the data requirements to see the effect.

Cooldown

We started with the idea of modeling intent using syntax. The first step was to create a very simple but well-defined hypothesis test for whether it was even possible to make such a model. We took the simplest syntax we could, POS, and the simplest property: the density of the different POS. It would be unfortunate if these densities did not differ between different intents (especially since Pennebaker suggested that they do), so we built a hypothesis test for whether the average density of adverbs differed or not.

I always believe we should crawl before we walk, and so we took two datasets for which our intuition strongly suggested that syntax and intent were coupled: survey questions versus reading comprehension questions. We chose these two datasets because the intention behind their creation is clear and differs between the datasets. Our intuition, guided by Pennebaker’s research, tells us that when we have clear and different intentions we expect clear and different syntax. After some data wrangling, we ran the hypothesis test and were able to reject the null hypothesis: the average adverb density was different between survey questions and reading comprehension questions. Whew! Even more interestingly, the average adverb density differed between impossible reading comprehension questions and those which were possible to answer.

The next step is to actually create a model using syntax; but for now we have some very interesting plots to look at.

The narrative above gives the impression that the trajectory from the original question (“is it possible to model intent using syntax?”) to the designed experiment was a smoothly flowing river of reason. It was not. In fact, the time between deciding to model intention using syntax and having the experiment designed was on the order of 1–2 weeks. I also want to give a shout-out to Jake Neely and Thomas Markovich, who acted as my sounding boards when I was considering possible datasets and models. There was also the struggle of *choosing* the dataset, between the political high-interest/high-troll-ability sources like Politifact, and the eventual “Questions” dataset.

All of the code for this post, including the datasets created and used, can be found in our github repo at evoking-syntax-01.

Note: This post was originally published on our blog: https://www.forge.ai/blog/evoking-syntax-part-1-part-of-speech
