Market Intelligence with Pandas and IBM Watson

Fred Reiss
IBM Data Science in Practice

In this article, we’ll show how to perform an example market intelligence task using Watson Natural Language Understanding and our open source library Text Extensions for Pandas.

This article was written in collaboration with Bryan Cutler.

Market intelligence is an important application of natural language processing. In this context, “market intelligence” means “finding useful facts about customers and competitors in news articles”. This article focuses on a market intelligence task: extracting the names of executives from corporate press releases.

Information about a company’s leadership has many uses. These uses include finding points of contact for sales or partnerships; estimating how much attention a company pays to different strategic areas; and recruiting executive talent.

Press releases are a good place to find the names of executives, because these articles often feature quotes from company leaders. Here’s an example quote from an IBM press release from December 2020:

Snippet of a press release: “By combining the power of AI with the flexibility and agility of hybrid cloud, our clients are driving innovation and digitizing their operations at a fast pace,” said Daniel Hernandez, general manager, Data and AI, IBM.

This quote contains information about the name of an executive:

The quote from the previous picture, highlighting the name “Daniel Hernandez” as the name of executive

This snippet is an example of the general pattern that we will look for:

  • The article contains a quotation.
  • The person to whom the quotation is attributed is mentioned by name.

The key challenge that we need to address is the many different forms that this pattern can take. Here are some examples of variations that we would like to capture:

Variations on the quote from the previous picture: (1) Present-tense “says” instead of “said”; (2) Name occurs before the quote; and (3) Name occurs in the middle of the quote

We’ll deal with this variability by using general-purpose semantic models. These models extract high-level facts from formal text. The text could express a given fact in many different ways, but all of those different forms produce the same output.

Semantic models can save a lot of work. There’s no need to label separate training data or write separate rules for all of the variations of our target pattern. A small amount of code can capture all these variations at once.

Let’s get started!

Use IBM Watson to identify people quoted by name

IBM Watson Natural Language Understanding includes a model called semantic_roles that performs Semantic Role Labeling, or SRL for short. You can think of SRL as finding subject-verb-object triples:

  • The actions that occurred in the text (the verb),
  • Who performed each action (the subject), and
  • On whom or what the action was performed (the object).

If we take our example executive quote and feed it through the semantic_roles model, the service returns its results as raw JSON.
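
As a rough sketch, a request like this could be issued with the ibm-watson Python SDK along the following lines. The function name run_semantic_roles, its parameters, and the API version date are assumptions made for illustration; you would supply the API key and service URL of your own Watson Natural Language Understanding instance.

```python
def run_semantic_roles(text, api_key, service_url):
    """Send `text` to Watson NLU and return the parsed JSON response as a dict.

    Hypothetical helper, not code from the original article.
    """
    # The imports live inside the function so that this sketch can be loaded
    # even when the SDK (pip install ibm-watson) is not installed.
    from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
    from ibm_watson import NaturalLanguageUnderstandingV1
    from ibm_watson.natural_language_understanding_v1 import (
        Features,
        SemanticRolesOptions,
    )

    nlu = NaturalLanguageUnderstandingV1(
        version="2021-01-01",  # assumed version date; use one your instance supports
        authenticator=IAMAuthenticator(api_key),
    )
    nlu.set_service_url(service_url)

    # Ask only for the semantic_roles feature, and have the service echo back
    # the text it analyzed so that later steps can refer to it.
    return nlu.analyze(
        text=text,
        return_analyzed_text=True,
        features=Features(semantic_roles=SemanticRolesOptions()),
    ).get_result()
```

Calling run_semantic_roles(paragraph, api_key, service_url) with valid credentials returns a Python dictionary whose "semantic_roles" entry holds one record per subject-verb-object triple.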

That format is a bit hard to read. Let’s make it clearer. In our previous article, we showed how our open-source library, Text Extensions for Pandas, can convert the output of another Watson model into a Pandas DataFrame. We can apply the same conversion to the semantic_roles model:

Now we can see that the semantic_roles model has identified four subject-verb-object triples. Each row of this DataFrame contains one triple. In the first row, the verb is "to be", and in the last row, the verb is "to say".

The last row is where things get interesting for us, because the verb “to say” indicates that someone made a statement. And that’s exactly the high-level pattern we’re looking for. Let’s filter the DataFrame down to that row and look at it more closely.

The subject in this subject-verb-object triple is “Daniel Hernandez, general manager, Data and AI, IBM”, and the object is the quote from Mr. Hernandez.

This model’s output has captured the general action of “[person] says [quotation]”. Different variations of that general pattern will produce the same output. If we move the attribution to the middle of the quote, we get the same result:

If we change the past-tense verb “said” to the present-tense “says”, we get the same result again:

All the different variations that we talked about earlier will produce the same result. This model lets us capture them all with very little code. All we need to do is to run the model and filter the outputs down to the verb we’re looking for.
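
To make the filtering step concrete, here is a minimal sketch in plain Python. The records are hand-written in the shape of the service's semantic_roles results (subject, action, and object, with a normalized form of the verb); they are illustrative values, not actual service output, and the helper name statements is made up for this example. With a DataFrame, the same step would be a Boolean mask on the column that holds the normalized verb.

```python
# Hand-written records in the shape of the semantic_roles results
# (illustrative values, not actual service output).
QUOTE = (
    "By combining the power of AI with the flexibility and agility of hybrid "
    "cloud, our clients are driving innovation and digitizing their operations "
    "at a fast pace"
)
SPEAKER = "Daniel Hernandez, general manager, Data and AI, IBM"

semantic_roles = [
    {"subject": {"text": "our clients"},
     "action": {"text": "are driving", "normalized": "drive"},
     "object": {"text": "innovation"}},
    # Past tense: "... said Daniel Hernandez ..."
    {"subject": {"text": SPEAKER},
     "action": {"text": "said", "normalized": "say"},
     "object": {"text": QUOTE}},
    # Present tense: "... says Daniel Hernandez ..."
    {"subject": {"text": SPEAKER},
     "action": {"text": "says", "normalized": "say"},
     "object": {"text": QUOTE}},
]


def statements(roles, verb="say"):
    """Keep only the triples whose normalized verb matches `verb`."""
    return [r for r in roles if r["action"].get("normalized") == verb]


quotes = statements(semantic_roles)
for q in quotes:
    print(q["subject"]["text"], "|", q["action"]["text"])
```

Because the filter looks at the normalized verb rather than the surface form, the "said" and "says" records both survive, while the unrelated triple is dropped.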

So far we’ve been looking at one paragraph. Let’s rerun the same process on the entire press release.

As before, we can run the document through Watson Natural Language Understanding’s Python interface and tell Watson to run its semantic_roles model. Then we use Text Extensions for Pandas to convert the model results to a DataFrame:

If we filter down to the subject-verb-object triples for the verb “to say”, we can see that this document has quite a few examples of the “person says statement” pattern:

The DataFrame quotes_df contains all the instances of the “person says statement” pattern that the model has found. We want to filter this set down to cases where the subject (the person making the statement) is mentioned by name. We also want to extract that name.

Identifying person names

In this press release, all three instances of the “person says statement” pattern happen to have a name in the subject. But there will not always be a name there. Consider this example sentence from another IBM press release:

27 percent of Gen Z surveyed said they will increase outside interaction, compared to 19 percent of Gen X surveyed and only 16 percent of those surveyed over 55.

Here, the subject for the verb “said” is “27 percent of Gen Z surveyed”. That subject does not include a person’s name.

How can we find the matches where the subject contains a person’s name? Fortunately for us, Watson Natural Language Understanding has a model for exactly that task. The entities model in this particular Watson service finds named entity mentions. A named entity mention is a word or phrase in the document that refers to an entity such as a person, place, or company by the entity’s name.

Watson’s entities model is very effective at finding the places where a document mentions a person by name. The code below tells the Watson service to run the entities model and retrieve mentions. Then we convert the result to a DataFrame using Text Extensions for Pandas:

The entities model's output contains mentions of many types of entity. For this application, we need mentions of the names of people. Let's filter our DataFrame down to just those types of mentions:
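
The sketch below shows this filter on plain Python data instead of a DataFrame. It assumes each entity record carries a type and a list of mentions, each with a [begin, end] location, which is the shape the service uses when mentions are requested. The records are hand-written: the location of "Daniel Hernandez" is the one quoted later in this article, and the "IBM" locations are invented for illustration. The helper name person_mentions is also made up.

```python
# Hand-written entity records (illustrative, not actual service output).
entities = [
    {"type": "Person", "text": "Daniel Hernandez",
     "mentions": [{"text": "Daniel Hernandez", "location": [1288, 1304]}]},
    {"type": "Company", "text": "IBM",
     "mentions": [{"text": "IBM", "location": [10, 13]},      # invented offsets
                  {"text": "IBM", "location": [1336, 1339]}]},  # invented offsets
]


def person_mentions(entity_records):
    """Flatten entities into one row per mention, keeping only people."""
    rows = []
    for entity in entity_records:
        if entity["type"] != "Person":
            continue
        for mention in entity.get("mentions", []):
            begin, end = mention["location"]
            rows.append({"type": entity["type"], "text": mention["text"],
                         "begin": begin, "end": end})
    return rows


people = person_mentions(entities)
print(people)
```

Each resulting row corresponds to one place in the document where a person is mentioned by name, which is the granularity needed for the alignment step that follows.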

Tying it all together

Now we have two pieces of information that we need to combine:

  • Instances of the “person said statement” pattern from the semantic_roles model
  • Mentions of the names of individual persons from the entities model

We need to align the “subject” part of the SRL output with the person mentions. We can use the span manipulation facilities of Text Extensions for Pandas to do this.

Spans are a common concept in natural language processing. A span represents a region of the document, usually as begin and end offsets and a reference to the document’s text. Text Extensions for Pandas adds a special SpanDtype data type to Pandas DataFrames. With this data type, you can define a DataFrame with one or more columns of span data. For example, the column called “span” in the DataFrame above is of the SpanDtype data type. The first span in this column, [1288, 1304): 'Daniel Hernandez', shows that the name "Daniel Hernandez" starts at character offset 1288 and ends just before offset 1304 in the document.
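
To illustrate the idea, here is a toy span class in plain Python. It is a stand-in written for this explanation, not the library's implementation: the class name, the covered_text property, and the printing format are assumptions chosen to echo the notation above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A half-open region [begin, end) of a target document."""
    target_text: str  # the full text of the document the span points into
    begin: int        # offset of the first character in the span
    end: int          # offset one past the last character in the span

    @property
    def covered_text(self) -> str:
        """The substring of the document that the span covers."""
        return self.target_text[self.begin:self.end]

    def __str__(self) -> str:
        return f"[{self.begin}, {self.end}): {self.covered_text!r}"


doc_text = (
    "our clients are driving innovation and digitizing their operations at a "
    "fast pace,” said Daniel Hernandez, general manager, Data and AI, IBM."
)
begin = doc_text.index("Daniel Hernandez")
name = Span(doc_text, begin, begin + len("Daniel Hernandez"))
print(name)
```

The span stores only two integers and a reference to the text, so the covered string can always be recomputed and two spans over the same document can be compared by their offsets alone.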

All the spans in this particular DataFrame are from the same document. Later on, we’ll stack several of these DataFrames to produce a series of spans across multiple documents. You can access information about the underlying document text via the target_text property of each span.

The output of the semantic_roles model doesn't contain location information. But that’s ok, because it's easy to create your own spans. We just need to use some string matching to recover the missing locations:
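
A minimal sketch of that string matching, in plain Python, might look like the following. The helper name locate_subjects is hypothetical, and the sketch assumes that the first occurrence of each subject string in the document is the right one.

```python
def locate_subjects(doc_text, subjects):
    """Return a (begin, end) pair for each subject string, or None if absent.

    Offsets are half-open: doc_text[begin:end] is the subject text.
    """
    spans = []
    for subject in subjects:
        begin = doc_text.find(subject)  # offset of the first occurrence, or -1
        spans.append(None if begin == -1 else (begin, begin + len(subject)))
    return spans


doc_text = (
    "“By combining the power of AI with the flexibility and agility of hybrid "
    "cloud, our clients are driving innovation and digitizing their operations "
    "at a fast pace,” said Daniel Hernandez, general manager, Data and AI, IBM."
)
subjects = [
    "Daniel Hernandez, general manager, Data and AI, IBM",
    "a subject that does not occur in the text",
]
spans = locate_subjects(doc_text, subjects)
print(spans)
```

One caveat of this approach: if the same subject string occurs several times in a document, every match resolves to the first occurrence. For the purpose of finding names inside the subject, that is usually harmless, because the text covered is the same.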

Now we have a column of span data for the semantic_roles model's output, and we can align these spans with the spans of person mentions. Text Extensions for Pandas includes built-in span operations. One of these operations, contain_join(), takes two columns of span data and identifies all pairs of spans where the first span contains the second span. We can use this operation to find all the places where the span from the semantic_roles model contains a span from the output of the entities model:
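
The logic of a containment join can be sketched in a few lines of plain Python. This nested-loop version is an illustration of the concept only, not the library's implementation; the function name contain_pairs is made up, spans are represented as (begin, end) tuples, and all offsets other than the [1288, 1304) span quoted above are invented.

```python
def contain_pairs(outer_spans, inner_spans):
    """Return every (outer, inner) pair where outer fully contains inner.

    Spans are half-open (begin, end) tuples over the same document.
    """
    return [
        (outer, inner)
        for outer in outer_spans
        for inner in inner_spans
        if outer[0] <= inner[0] and inner[1] <= outer[1]
    ]


# Subjects of "say": the first covers "Daniel Hernandez, general manager,
# Data and AI, IBM" starting at 1288; the second is an invented subject
# that contains no person name.
subject_spans = [(1288, 1339), (2000, 2028)]

# Person mentions: the first is the span quoted in the article; the second
# is an invented mention that falls outside every subject.
person_spans = [(1288, 1304), (3100, 3112)]

pairs = contain_pairs(subject_spans, person_spans)
print(pairs)
```

Subjects that contain no person mention, and person mentions that fall outside every subject, simply produce no pair, so the join performs both the alignment and the filtering in one step.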

To recap: With a few lines of Python code, we’ve identified the places where the article quoted a person by name. For each of those quotations, we’ve identified the name of the person and its location in the document (the person column in the DataFrame above).

Here’s all the code we’ve just created, condensed down to a single Python function:
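
As a rough, library-free approximation of those steps, the sketch below chains them together on a response dictionary that has already been retrieved. It is not the article's function: the name quoted_person_names is hypothetical, the input is assumed to contain the analyzed text, the semantic_roles results, and the entities results with mentions, and the sample response is hand-written by stitching together the two example sentences from this article.

```python
def quoted_person_names(response):
    """Find person names inside the subjects of "say" triples.

    `response` is assumed to be a dict with the keys "analyzed_text",
    "semantic_roles", and "entities" (with mention locations).
    """
    doc_text = response["analyzed_text"]

    # Step 1: keep subjects of the verb "to say" and recover their offsets.
    subject_spans = []
    for role in response["semantic_roles"]:
        if role.get("action", {}).get("normalized") != "say":
            continue
        text = role.get("subject", {}).get("text")
        begin = doc_text.find(text) if text else -1
        if begin != -1:
            subject_spans.append((begin, begin + len(text)))

    # Step 2: collect the locations of all person mentions.
    person_spans = [
        tuple(mention["location"])
        for entity in response["entities"] if entity["type"] == "Person"
        for mention in entity.get("mentions", [])
    ]

    # Step 3: pair each subject with the person mentions it contains.
    return [
        {"subject": doc_text[sb:se], "person": doc_text[pb:pe],
         "begin": pb, "end": pe}
        for sb, se in subject_spans
        for pb, pe in person_spans
        if sb <= pb and pe <= se
    ]


# Hand-written sample input (illustrative, not actual service output).
doc_text = (
    "“By combining the power of AI with the flexibility and agility of hybrid "
    "cloud, our clients are driving innovation and digitizing their operations "
    "at a fast pace,” said Daniel Hernandez, general manager, Data and AI, IBM. "
    "27 percent of Gen Z surveyed said they will increase outside interaction."
)
name_begin = doc_text.index("Daniel Hernandez")
response = {
    "analyzed_text": doc_text,
    "semantic_roles": [
        {"subject": {"text": "Daniel Hernandez, general manager, Data and AI, IBM"},
         "action": {"normalized": "say"}},
        {"subject": {"text": "27 percent of Gen Z surveyed"},
         "action": {"normalized": "say"}},
    ],
    "entities": [
        {"type": "Person",
         "mentions": [{"text": "Daniel Hernandez",
                       "location": [name_begin, name_begin + 16]}]},
    ],
}
result = quoted_person_names(response)
print(result)
```

The survey sentence also matches the "person says statement" pattern, but its subject contains no person mention, so only the quote attributed to Daniel Hernandez appears in the result.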

This function, find_persons_quoted_by_name(), turns a press release into a list of executive names. Here's the output that we get if we pass a year's worth of articles from the "Announcements" section of ibm.com through it:

Now we’ve turned 191 press releases into a DataFrame with 301 executive names. That’s a lot of power packed into one screen’s worth of code! To find out more about the advanced semantic models that let us do so much with so little code, check out Watson Natural Language Understanding.

Fred Reiss is a Principal Research Staff Member at IBM Research and Chief Architect at IBM’s Center for Open-Source Data and AI Technologies (CODAIT).