Using Pandas DataFrames to analyze sentence structure

Fred Reiss
IBM Data Science in Practice
8 min read · May 20, 2021
an oak tree on a hillside at sunset or sunrise
Photo by Fred Reiss

In this article, we show how to use Pandas DataFrames to extract useful structure from the parse trees of English-language sentences.

This article was written in collaboration with Bryan Cutler.

If you use a natural language processing (NLP) library like SpaCy or Stanza for your work, you’ve probably noticed that it has a feature called “dependency parsing”. What exactly is dependency parsing, and what can you do with it? In this article, we’ll explain this powerful NLP technique and show you how you can use Pandas DataFrames to derive useful business results from the output of a dependency parser.

What is dependency parsing?

Dependency parsing is an NLP technique that identifies the relationships between the words that make up a sentence. These relationships contain useful facts about the words in the sentence: things like, “this noun is the subject for this verb,” or, “this adjective describes this noun.” With a bit of work, you can turn these facts about words into facts about the events, people, and things that these words describe.

We can treat the relationships between the words of a sentence as the edges of a graph. For example, here’s the graph that the SpaCy dependency parser produces for the sentence, “I like natural language processing”:

a dependency parse tree with the verb “like” as the root of the sentence, with the phrase “natural language processing” as the direct object and the pronoun “I” as the subject.
Dependency parse for the sentence, “I like natural language processing.”

This graph is always a tree, so we call it the dependency-based parse tree of the sentence. We often shorten the phrase “dependency-based parse tree” to dependency parse or parse tree.

Every word in the sentence (including the period at the end) becomes a node of the parse tree:

an image of the same parse tree of the sentence “I like natural language processing” showing a highlight of the words as nodes of a graph.

The most important verb in the sentence becomes the root of the tree. We call this root node the head node. In this example, the head node is the verb “like”.

Edges in the tree connect pairs of related words:

an image of the parse tree for the sentence “I like natural language processing” highlighting the relationships between nodes, i.e., the words, as edges.

Each edge is tagged with information about why the words are related. For example, the first two words in the sentence, “I” and “like”, have an nsubj relationship. This means that the pronoun “I” is the subject for the verb “like”.

Dependency parsing is useful because it lets you solve problems with very little code. The parser acts as a universal machine learning model, extracting many facts at once from the text. Pattern matching over the parse tree lets you filter this set of facts down to the ones that are relevant to your application.
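To make “pattern matching over the parse tree” concrete, here is a minimal sketch that stores the example sentence’s parse as a DataFrame and pulls out every subject and verb pair. The table is hand-built rather than produced by a parser, the column names are our own choice for illustration, and the labels on the modifier edges (amod, compound) are assumptions.

```python
import pandas as pd

# Hand-built parse of "I like natural language processing."
# One row per token; "head" holds the id of the token's parent.
# The root token points at itself (an assumption for this sketch).
tokens = pd.DataFrame({
    "id":   [0, 1, 2, 3, 4, 5],
    "text": ["I", "like", "natural", "language", "processing", "."],
    "head": [1, 1, 4, 4, 1, 1],
    "dep":  ["nsubj", "ROOT", "amod", "compound", "dobj", "punct"],
})

# Pattern: keep only nsubj edges, then look up the word at the head of each edge.
subjects = tokens[tokens["dep"] == "nsubj"].merge(
    tokens[["id", "text"]], left_on="head", right_on="id",
    suffixes=("", "_head"))
pairs = subjects[["text", "text_head"]].rename(
    columns={"text": "subject", "text_head": "verb"})
print(pairs)
```

The filter on the “dep” column is the whole pattern; everything else is bookkeeping to show the words at both ends of the edge.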

An enterprise application of dependency parsing

In a previous article, we showed how to use Watson Natural Language Understanding to find places where a press release quotes an executive by name. In this article, we’ll use dependency parsing to associate those names with job titles.

A person’s job title is a valuable piece of context. The title can tell you whether the person is an important decision maker, or what the relationship is between different employees at a company. By looking at how titles change over time, you can reconstruct a person’s job history.

Here’s an example of how names and job titles can appear in press releases. This example is from an IBM press release from December 2020:

An image of an executive quote: “By combining the power of AI with the flexibility and agility of hybrid cloud, our clients are driving innovation and digitizing their operations at a fast pace,” said Daniel Hernandez, general manager, Data and AI, IBM. The name of the executive, “Daniel Hernandez”, and the executive’s title, “general manager, Data and AI, IBM”, are highlighted.

This sentence is 45 words long, so the entire parse tree is a bit daunting.

the entire parse tree for the quote above

If we zoom in on just the phrase, “Daniel Hernandez, general manager, Data and AI, IBM,” some structure becomes clear.

a zoomed-in view of the parse tree for the executive quote, showing the appos edge or relationship between the name “Daniel Hernandez” and the job title “general manager, Data and AI”

The arrows in this diagram point “downwards”, from the head (root) node to the leaves. There’s a single edge from the head node of Daniel Hernandez’s name (“Hernandez”) to the head node of his job title (“manager”). So all the nodes that make up the job title are below the head node of the name.

The edge types in this parse tree come from the Universal Dependencies framework. The edge between the name and job title has the type appos. appos is short for “appositional modifier”, or “appositive”. An appositive is a noun that describes another noun. In this case, the noun phrase “general manager, Data and AI, IBM” describes the noun phrase “Daniel Hernandez”.

The pattern in the picture above happens whenever a person’s job title is an appositive for that person’s name. The title will be below the name in the tree, and the head nodes of the name and title will be connected by an appos edge. We can use this pattern to find the job title via a three-step process:

  1. Look for an appos edge coming out of any of the parse tree nodes for the name.
  2. The node at the other end of this edge should be the head node of the job title.
  3. Find all the other nodes that are reachable from the head node of the job title.

Remember that each node represents a word. Once you know all the nodes that make up the job title, you know all the words in the title.

Step 3 here requires a transitive closure operation:

  • Start with a set of nodes consisting of just the head node.
  • Look for nodes that are connected to nodes of the set. Add those nodes to the set.
  • Repeat the previous step until your set of nodes stops growing.
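
Before bringing in Pandas, the loop above can be sketched in a few lines of plain Python. The `reachable()` helper and the `children` dictionary are hypothetical names introduced for illustration, and the tiny tree is hand-built, not real parser output.

```python
def reachable(head, children):
    """Return head plus every node reachable from it by following edges downward."""
    selected = {head}
    while True:
        # All nodes that sit one edge below a node already in the set.
        found = {c for n in selected for c in children.get(n, [])}
        if found <= selected:  # nothing new: the set has stopped growing
            return selected
        selected |= found


# Maps each node to the nodes directly below it (hand-built example).
children = {
    "Hernandez": ["Daniel", "manager"],
    "manager": ["general", "Data"],
    "Data": ["and", "AI"],
}
title_words = reachable("manager", children)
print(sorted(title_words))
```

Starting from “manager”, the set grows to include “general” and “Data”, then “and” and “AI”, and then stops. The words of the name sit above the starting node, so the loop never reaches them.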

We can implement this algorithm with Pandas DataFrames.

Transitive closure with Pandas

We’re going to use Pandas to match person names with job titles. The first thing we’ll need is the locations of the person names. In our previous article, we created a function find_persons_quoted_by_name() that finds all the people that a news article quotes by name. If you're curious, you can find the source code here. The function produces a DataFrame with the location of each person name. Here's the output when you run the function over an example press release:

The second thing we will need is a parse tree. We’ll use the dependency parser from the SpaCy NLP library. Our open source library Text Extensions for Pandas can convert the output of this parser into a DataFrame, as can be seen below.

This tokens DataFrame contains one row for every token in the document. The term “token” here refers to a part of the document that is a word, an abbreviation, or a piece of punctuation. The columns “id”, “dep”, and “head” encode the edges of the parse tree.

Since we’re going to be analyzing the parse tree, it’s more convenient to have the nodes and edges in separate DataFrames. So let’s split tokens into DataFrames of nodes and edges. Here’s the code that performs this split.
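
A split along these lines can be as small as two column selections. The following is a minimal sketch under stated assumptions: the tokens table is hand-built, plain strings stand in for the span values, and the dependency labels are illustrative.

```python
import pandas as pd

# Hand-built stand-in for the tokens DataFrame (assumption: the real one
# stores SpanDtype values in "span" and has additional columns).
tokens = pd.DataFrame({
    "id":   [0, 1, 2, 3, 4],
    "span": ["Daniel", "Hernandez", ",", "general", "manager"],
    "dep":  ["compound", "ROOT", "punct", "amod", "appos"],
    "head": [1, 1, 1, 4, 1],
})

# Nodes: one row per token, keyed by "id".
nodes = tokens[["id", "span"]].reset_index(drop=True)

# Edges: one row per token, pointing from the token's parent ("head")
# to the token itself ("id"), labeled with the edge type ("dep").
edges = tokens[["id", "head", "dep"]].reset_index(drop=True)

print(nodes)
print(edges)
```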

We will start with the nodes that are parts of person names. To find these nodes, we need to match the person names in persons with tokens in nodes.

The “person” column of persons and the “span” column in nodes both hold span data. Spans are a common concept in natural language processing. A span represents a region of the document, usually as begin and end offsets and a reference to the document's text. The span data in these two DataFrames is stored using the SpanDtype extension type from Text Extensions for Pandas. If you’d like to learn more about how we added these new data types to Pandas, take a look at our article about extending Pandas.

Text Extensions for Pandas also includes functions for manipulating span data. In the code that follows, we use one of these functions, overlap_join(), to find all the places where the span of a token in the nodes DataFrame overlaps with the span of a name in the persons DataFrame. The code creates a new DataFrame, person_nodes, containing all the matching pairs of spans.
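
To show what an overlap join computes, here is a plain-Pandas sketch that keeps spans as integer begin and end character offsets. It is a stand-in for illustration, not the library’s overlap_join() implementation: the function name `overlap_join_sketch()`, the offset columns, and the hand-built data are all assumptions.

```python
import pandas as pd

# Character offsets into the text "said Daniel Hernandez, general manager".
# Spans are half-open: [begin, end).
nodes = pd.DataFrame({
    "id":    [0, 1, 2, 3, 4, 5],
    "text":  ["said", "Daniel", "Hernandez", ",", "general", "manager"],
    "begin": [0, 5, 12, 21, 23, 31],
    "end":   [4, 11, 21, 22, 30, 38],
})
persons = pd.DataFrame({
    "person": ["Daniel Hernandez"],
    "person_begin": [5],
    "person_end": [21],
})


def overlap_join_sketch(nodes, persons):
    """Pair every token with every person span that it overlaps."""
    pairs = nodes.merge(persons, how="cross")  # all token/person combinations
    overlaps = ((pairs["begin"] < pairs["person_end"])
                & (pairs["person_begin"] < pairs["end"]))
    return pairs.loc[overlaps, ["person", "id", "text"]].reset_index(drop=True)


person_nodes = overlap_join_sketch(nodes, persons)
print(person_nodes)
```

The cross join compares every token with every name, which is quadratic but fine for a sketch of this size.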

The parse tree nodes in the “span” column of person_nodes are our starting points for navigating the parse tree. Now we need to look for nodes that are on the other side of an appos link from these starting nodes. Since the nodes and edges of our graph are Pandas DataFrames, we can use the Pandas merge() method to match edges with nodes and walk the graph. The listing that follows defines a function, traverse_edges_once(), that finds all the nodes that are one edge away from the nodes in its argument, start_nodes.

Now we can use this function to follow all appos edges downward from the parse tree nodes in person_nodes. The next listing puts the nodes at the other side of those edges into a DataFrame called appos_targets.
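
The two steps just described can be sketched as follows. This is a minimal version under stated assumptions: this traverse_edges_once() takes the edges and nodes tables as extra arguments, the parse is hand-built with illustrative edge labels, and token text stands in for span values.

```python
import pandas as pd

# Hand-built parse of "said Daniel Hernandez, general manager, Data and AI".
nodes = pd.DataFrame({
    "id":   [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    "text": ["said", "Daniel", "Hernandez", ",", "general",
             "manager", ",", "Data", "and", "AI"],
})
# Each edge runs from "head" (parent) down to "id" (child).
edges = pd.DataFrame({
    "id":   [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "head": [2, 0, 2, 5, 2, 5, 5, 7, 7],
    "dep":  ["compound", "nsubj", "punct", "amod", "appos",
             "punct", "conj", "cc", "conj"],
})
# Tokens that overlap the person name, as an overlap join would return them.
person_nodes = pd.DataFrame({"person": ["Daniel Hernandez"] * 2, "id": [1, 2]})


def traverse_edges_once(start_nodes, edges, nodes):
    """Return the nodes that are one edge below the nodes in start_nodes."""
    children = start_nodes[["person", "id"]].merge(
        edges, left_on="id", right_on="head", suffixes=("_start", ""))
    # Keep the original person name alongside each child node.
    return children[["person", "id"]].merge(nodes, on="id")


# Follow only the appos edges downward from the name's nodes.
appos_targets = traverse_edges_once(
    person_nodes, edges[edges["dep"] == "appos"], nodes)
print(appos_targets)
```

Restricting the edges table to appos rows before the merge is what turns a generic one-step walk into a search for appositives.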

The “id” column in each row of appos_targets identifies the head node of a person’s title. To find the remaining nodes of the titles, we’ll do the transitive closure operation we described earlier. We use a Pandas DataFrame to store our set of selected nodes. We use the traverse_edges_once() function to perform each step of walking the tree. Then we use pandas.concat() and DataFrame.drop_duplicates() to add the new nodes to our selected set of nodes. The entire algorithm looks like this:
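
A minimal sketch of such a loop, assuming a simplified traverse_edges_once() that takes the edges table as a second argument and a hand-built parse of “said Daniel Hernandez, general manager, Data and AI”:

```python
import pandas as pd

# Each edge runs from "head" (parent) down to "id" (child); ids are token positions.
edges = pd.DataFrame({
    "id":   [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "head": [2, 0, 2, 5, 2, 5, 5, 7, 7],
})
# Head node of the job title: token 5, "manager".
appos_targets = pd.DataFrame({"person": ["Daniel Hernandez"], "id": [5]})


def traverse_edges_once(start_nodes, edges):
    """Return (person, id) rows for nodes one edge below start_nodes."""
    return start_nodes.merge(
        edges, left_on="id", right_on="head",
        suffixes=("_start", ""))[["person", "id"]]


selected = appos_targets[["person", "id"]]
while True:
    next_nodes = traverse_edges_once(selected, edges)
    grown = pd.concat([selected, next_nodes],
                      ignore_index=True).drop_duplicates()
    if len(grown) == len(selected):  # no new nodes: closure reached
        break
    selected = grown

print(selected.sort_values("id"))
```

Because the “person” column travels with every row, the same loop handles many names at once; each name accumulates its own set of title nodes.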

Now we know the spans of all the words that make up each job title. The “addition” operation for spans is defined as:

span1 + span2 = smallest span that contains both span1 and span2

We can recover the span of the entire title by “adding” spans using Pandas’ groupby() method:
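
If spans are kept as plain begin and end offsets, “adding” the spans in a group reduces to taking the smallest begin and the largest end. The sketch below assumes that representation (the article stores spans with SpanDtype instead) and uses hand-built offsets into the example text.

```python
import pandas as pd

text = "said Daniel Hernandez, general manager, Data and AI"

# One row per token of the job title, with half-open character offsets.
title_nodes = pd.DataFrame({
    "person": ["Daniel Hernandez"] * 6,
    "begin":  [23, 31, 38, 40, 45, 49],
    "end":    [30, 38, 39, 44, 48, 51],
})

# Smallest span containing all of a person's title tokens.
titles = (title_nodes.groupby("person")
          .agg(begin=("begin", "min"), end=("end", "max"))
          .reset_index())
titles["title"] = [text[b:e] for b, e in zip(titles["begin"], titles["end"])]
print(titles)
```

Because the combined span covers everything between the first and last token, it also picks up the whitespace between words, so the title reads exactly as it does in the document.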

Now we have found a job title for each of the executive names in this document!

Tying it all together

Let’s put all of the code we’ve presented so far into a single function. We’ll call this function find_titles_of_persons(). Here’s the source code.

If we combine this find_titles_of_persons() function with the find_persons_quoted_by_name() function we created in our previous post, we can build a data mining pipeline. This pipeline finds the names and titles of executives in corporate press releases. Here’s the output that we get if we pass a year’s worth of IBM press releases through this pipeline:

Our pipeline has processed 191 press releases, and it found the names and titles of 259 executives!

We hope you’ve enjoyed this brief introduction to dependency parsing and this example of using Pandas to turn parse trees into business insights. If you want to try these techniques for yourself, check out Text Extensions for Pandas here.

Fred Reiss is a Principal Research Staff Member at IBM Research and Chief Architect at IBM’s Center for Open-Source Data and AI Technologies (CODAIT).