Introducing Text Extensions for Pandas

Fred Reiss
IBM Data Science in Practice
Apr 8, 2021

This blog was written in collaboration with Willie Tejada, GM & Chief Developer Advocate for IBM.

Introduction

IBM Research and the IBM Center for Open-Source Data and AI Technologies (CODAIT) team recently created the open source Text Extensions for Pandas project to make the “easy” parts of natural language processing actually easy.

Text Extensions for Pandas is a library of extensions that turns the popular Pandas DataFrames into a universal data structure for natural language processing. Since most data scientists already use Pandas on a daily basis for problems outside the natural language processing domain, we believe this extension will enhance their productivity for problems involving unstructured text.

In this article, we will look at how Text Extensions for Pandas makes natural language processing easier.

What’s so complicated about natural language processing?

In natural language processing, the hard things are easy, and the easy things are hard.

If you look online, you will find free versions of highly advanced natural language processing models, packaged so that you can call them using very little code. But the moment you step outside the bounds of what you can get off the shelf, the difficulty level goes way up.

Take, for example, named entity recognition, or NER for short. NER is the task of finding places where a document refers to an entity, such as a person or a company, by that entity’s name.

The transformers library from Huggingface includes a state-of-the-art NER pipeline based on BERT embeddings. So, if you want to find all the names of people in your document, you can do that with very little code.

Here’s an example of invoking that pipeline:

[{'entity_group': 'PER',
'score': 0.9987641175587972,
'word': 'Terry Gilliam',
'start': 0,
'end': 13},
{'entity_group': 'PER',
'score': 0.9878875613212585,
'word': 'John Cleese',
'start': 31,
'end': 42}]

In the example above, a four-line program takes in English-language text and outputs all the names of people in the text (PER is the identifier for "Person Name"). The fact that you can bring so much power to bear with so little code is a truly remarkable technical achievement.

However, in many enterprise applications, NER is just one step in a complex pipeline. The end goal of this pipeline is to find useful facts about people. The fact that a particular word is a person’s name is interesting, but you need to add additional context to turn that information into a useful fact.

Does the document express positive or negative sentiment about this person? Does the document mention that the person performed an interesting action or visited a particular place?

The context creates value.

Adding context to a search is difficult

Let’s see what happens when we add a tiny bit of context to our example. Instead of just finding person names, let’s find all the words that occur immediately to the left of a person’s name.

In principle, this new task is almost the same as the one we just did. But it turns out to be much more difficult.

The code above returns the character offset of each mention of a person. You can use an NLP tool called a tokenizer to find the character offsets of all words in the document. Then it’s a matter of cross-referencing those two sets of character offsets.

Here’s an algorithm for doing that cross-referencing:

Iterate over the tokens and build a lookup table that maps from begin offset to token index. Then iterate over the person mentions. For each person mention, use the lookup table to find the index of the first token in the mention. Then retrieve the information about the token that comes immediately before that first token and construct a result.
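The steps above can be sketched in plain Python. This is a minimal illustration, not the article's original listing: the person mentions are hard-coded in the shape of the output shown earlier, and a regular expression stands in for a real tokenizer (an assumption made to keep the sketch self-contained).

```python
import re

doc_text = "Terry Gilliam and the fabulous John Cleese live in England."

# Person mentions, hard-coded in the shape of the NER output shown earlier
# (scores omitted for brevity).
persons = [
    {"entity_group": "PER", "word": "Terry Gilliam", "start": 0, "end": 13},
    {"entity_group": "PER", "word": "John Cleese", "start": 31, "end": 42},
]

# Stand-in tokenizer (assumption): a regex that splits words and punctuation.
tokens = [
    {"word": m.group(), "start": m.start(), "end": m.end()}
    for m in re.finditer(r"\w+|[^\w\s]", doc_text)
]

# Step 1: build a lookup table from begin offset to token index.
begin_to_index = {tok["start"]: i for i, tok in enumerate(tokens)}

# Step 2: for each mention, find its first token, then step back one token.
results = []
for person in persons:
    first = begin_to_index[person["start"]]
    word_before = tokens[first - 1] if first > 0 else None
    results.append({"person": person, "word_before": word_before})

for result in results:
    print(result["person"]["word"], "<-", result["word_before"])
```

Note that the lookup assumes every mention begins exactly at a token boundary. If the model and the tokenizer disagree about boundaries, the dictionary lookup raises a KeyError, which is one more detail a hand-written algorithm has to handle.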

And here’s some code that implements this algorithm, using the SpaCy NLP library’s tokenization facilities:

And, yes, this code does work:

[{'person': {'entity_group': 'PER',
'score': 0.9987641175587972,
'word': 'Terry Gilliam',
'start': 0,
'end': 13},
'word_before': None},
{'person': {'entity_group': 'PER',
'score': 0.9878875613212585,
'word': 'John Cleese',
'start': 31,
'end': 42},
'word_before': {'word': 'fabulous', 'start': 22, 'end': 30}}]

Now, that’s quite a bit of complexity just to add a tiny bit of context to the earlier result.

What if, instead of finding the word before each person name, you wanted to find the word that comes after each person name? You would need to design a different algorithm.
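To make the difference concrete, here is a hedged sketch of the "word after" variant, again using a regular expression as a stand-in tokenizer (an assumption) and hard-coded person offsets taken from the output above. The lookup table now has to be keyed on end offsets instead of begin offsets, and the boundary check moves to the other end of the token list.

```python
import re

doc_text = "Terry Gilliam and the fabulous John Cleese live in England."

# (start, end) character offsets of the person mentions shown earlier.
person_spans = [(0, 13), (31, 42)]

# Stand-in tokenizer (assumption): (start, end, text) for words and punctuation.
tokens = [
    (m.start(), m.end(), m.group())
    for m in re.finditer(r"\w+|[^\w\s]", doc_text)
]

# The lookup table is keyed on END offsets this time, because the last token
# of a mention is the one that sits next to the following word.
end_to_index = {end: i for i, (_, end, _) in enumerate(tokens)}

words_after = []
for start, end in person_spans:
    last = end_to_index[end]
    following = tokens[last + 1] if last + 1 < len(tokens) else None
    words_after.append((doc_text[start:end], following))

print(words_after)
```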

As you can see, in NLP the easy stuff is hard.

Why is this the case? The answer is tied to the way that the NLP tool represents its results. The libraries we’ve shown here give you a thin veneer on top of the outputs of a model and expect you to do everything else. To perform even the most basic of follow-on tasks, you have to invent an algorithm. And inventing an algorithm is hard.

We can do better.

Text Extensions for Pandas offers transparency, simplicity, and compatibility

With the Text Extensions for Pandas project, our goal was to make NLP easier. Our approach is simple: don’t invent algorithms that navigate data structures based on the raw output of complex natural language processing models. Instead, use Pandas DataFrames to represent NLP data. The algorithms you need are already there.

To make this possible, we created Text Extensions for Pandas, a library of extensions that turns Pandas DataFrames into a universal data structure for NLP.

Text Extensions for Pandas includes Pandas extension types for representing natural language data, plus library integrations that turn the outputs of popular NLP libraries into easy-to-understand DataFrames. Compared with the custom data structures that most NLP libraries expose, DataFrames with Text Extensions for Pandas give three important benefits:

  • Transparency: You can look at a DataFrame and understand at a glance what information is present.
  • Simplicity: Pandas includes a huge collection of high-level routines to perform many common tasks with very little code.
  • Compatibility: If two models produce Pandas DataFrames, the outputs of those models are automatically compatible with each other.

Transparency in Text Extensions for Pandas

Let’s revisit our example from before, but this time using Pandas. We’ll leverage the integration between Text Extensions for Pandas and Watson Natural Language Understanding.

Here’s a bit of code that takes the text from before, runs it through Watson Natural Language Understanding, and translates the results into Pandas DataFrames:

One of these DataFrames contains the results of Watson Natural Language Understanding’s named entity recognition model:

A DataFrame with four columns: type, text, span, and confidence. In the first row, “type” is “person” and “span” is [0, 13): ‘Terry Gilliam’

This output is a great example of the transparency of Pandas DataFrames. Take a look at the table above. It conveys a great deal of information at a glance. There are rows and columns. The rows represent entity mentions. The columns represent properties of entity mentions. Everything is right there in front of you.

Compare that with the corresponding model output we looked at earlier:

[{'entity_group': 'PER',
'score': 0.9987641175587972,
'word': 'Terry Gilliam',
'start': 0,
'end': 13},
{'entity_group': 'PER',
'score': 0.9878875613212585,
'word': 'John Cleese',
'start': 31,
'end': 42},
{'entity_group': 'LOC',
'score': 0.9998065829277039,
'word': 'England',
'start': 51,
'end': 58}]

When you see this output, the first thing that comes to mind is probably, “Yes, that’s JSON data”. The second thing you’ll notice is that the first line introduces two levels of nesting. To understand what’s going on here, you need to visualize a tree in your head.
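For comparison, the same list of dictionaries can be loaded into a plain Pandas DataFrame with one call. This sketch uses only core Pandas, not the Text Extensions for Pandas integrations, and the variable names ner_output and entities_df are chosen here for illustration.

```python
import pandas as pd

# The model output shown above, as a list of dictionaries.
ner_output = [
    {"entity_group": "PER", "score": 0.9987641175587972,
     "word": "Terry Gilliam", "start": 0, "end": 13},
    {"entity_group": "PER", "score": 0.9878875613212585,
     "word": "John Cleese", "start": 31, "end": 42},
    {"entity_group": "LOC", "score": 0.9998065829277039,
     "word": "England", "start": 51, "end": 58},
]

# One row per entity mention, one column per property.
entities_df = pd.DataFrame.from_records(ner_output)
print(entities_df)
```

Once the records are in a table, the nesting disappears: each mention is a row and each property is a column.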

Our second DataFrame, words_df, contains the outputs of Watson Natural Language Understanding's tokenizer:

A DataFrame with two columns, “span” and “sentence”. In the first row, the “span” value is “[0, 5): ‘Terry’” and the “sentence” is “[0, 59): ‘Terry Gilliam …’”

Again, the DataFrame representation gives transparency. Each row contains information about a token. For each token, there is information about the location of the token and its containing sentence. Compare this with what happens when you print the output of SpaCy’s tokenizer:

Terry Gilliam and the fabulous John Cleese live in England.

That’s strange. When we try to print out the tokens, we just get back the document text.

It turns out that the variable tokens here is an instance of SpaCy's Doc class (the library's document object), whose printed form is just the document text rather than a readable list of tokens. The Doc class provides access to another class, Token, which holds token information. To understand and use these classes, you need to read through the API documentation at https://spacy.io/api/doc and https://spacy.io/api/token.

In order to write the algorithm you saw earlier, we needed to pull several non-obvious facts out of this documentation:

  • To iterate over the Tokens in a document, you iterate over the Doc object itself.
  • The beginning offset of a Token is stored in a field named idx.
  • The end offset of the token is not stored in the Token object, but can be computed from the value of idx and the token's length, which you get by calling len() on the Token (the Token class's __len__() method).
  • The location of the token (in tokens, not characters) is stored in a field called i.

Now, there’s nothing wrong with these design choices. Indeed, SpaCy is a well-designed library, with clear documentation. But the developers of the Token class needed to make a number of arbitrary choices. And users need to know these choices before they can use the code.

The transparency of the DataFrame representation makes it easier for users to be productive.

Simplicity in Text Extensions for Pandas

Let’s talk about the second benefit of Pandas: simplicity. When data is in a DataFrame, you have access to Pandas’ huge library of built-in high-level operations, including facilities for filtering data.

For example, it’s easy to turn our DataFrame of all entity mentions into a DataFrame of just person mentions:

A DataFrame with four columns: type, text, span, and confidence. In the first row, type is “Person”, text is “Terry Gilliam”, span is “[0, 13): ‘Terry Gilliam’”, and confidence is “0.995”.
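In plain Pandas, a filter like this is a boolean mask. The sketch below uses a small hand-built DataFrame as a stand-in for the library's output: the name entities_df is an assumption, the integer begin and end columns stand in for the span column, and the confidence column is left out. The offsets and text come from the outputs shown in this article.

```python
import pandas as pd

# Hand-built stand-in for the entity mentions DataFrame (assumed name).
entities_df = pd.DataFrame({
    "type": ["Person", "Person", "Location"],
    "text": ["Terry Gilliam", "John Cleese", "England"],
    "begin": [0, 31, 51],
    "end": [13, 42, 58],
})

# Keep only the rows whose type is "Person".
persons_df = entities_df[entities_df["type"] == "Person"]
print(persons_df)
```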

Now what about our task of adding context to these person names by finding the words that occur to the left of them? We just need to use one of the span manipulation functions from Text Extensions for Pandas to match up pairs of adjacent spans in words_df and persons_df:

A DataFrame with two columns: word and person. In the first row, word is “[22, 30): ‘fabulous’” and person is “[31, 42): ‘John Cleese’”

No need to invent a new algorithm! Once we’ve translated the models’ outputs to Pandas DataFrames, we have a selection of high-level operations readily available.

Remember how the algorithm we created earlier to find words to the left of person names wouldn’t work for finding words to the right of person names? Not so here! Just swap the arguments to adjacent_join():

A DataFrame with two columns: person and word. In the first row, person is “[0, 13): ‘Terry Gilliam’” and word is “[14, 17): ‘and’”.
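To show the idea behind matching adjacent spans, here is a rough sketch in plain Pandas. It is not the library's adjacent_join() implementation. The helper adjacent_pairs is hypothetical, spans are represented as integer begin and end columns, the tokens come from a regular expression, and "adjacent" is simplified to mean "separated by exactly one character" (all assumptions made for illustration).

```python
import re
import pandas as pd

doc_text = "Terry Gilliam and the fabulous John Cleese live in England."

# Stand-in token table (assumption): one row per word or punctuation mark.
words_df = pd.DataFrame(
    [(m.start(), m.end(), m.group())
     for m in re.finditer(r"\w+|[^\w\s]", doc_text)],
    columns=["begin", "end", "text"],
)

# Stand-in person table, using the offsets shown in this article.
persons_df = pd.DataFrame({
    "begin": [0, 31],
    "end": [13, 42],
    "text": ["Terry Gilliam", "John Cleese"],
})

def adjacent_pairs(left, right, gap=1):
    """Hypothetical helper: pair each span in `left` with every span in
    `right` that begins exactly `gap` characters after the left span ends."""
    keyed = left.assign(next_begin=left["end"] + gap)
    return keyed.merge(
        right, left_on="next_begin", right_on="begin",
        suffixes=("_left", "_right"),
    )

# Word immediately to the left of a person name.
word_then_person = adjacent_pairs(words_df, persons_df)
# Word immediately to the right: the same helper with the arguments swapped.
person_then_word = adjacent_pairs(persons_df, words_df)

print(word_then_person[["text_left", "text_right"]])
print(person_then_word[["text_left", "text_right"]])
```

Both directions are handled by one relational join. Swapping the arguments changes which table supplies the left-hand span, so no second algorithm is needed.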

Compatibility via Text Extensions for Pandas

The DataFrames persons_df and words_df that we have been looking at contain the results of two different models. Normally, combining the results of two models would require writing code to translate between the different data structures that the models use to represent their outputs. Indeed, a good deal of the complexity of our example algorithm earlier came from model incompatibilities. Huggingface's model used JSON-format output, while SpaCy's tokenizer used Python classes. The Huggingface model called the begin offsets "start", while SpaCy's tokenizer called the same thing "idx". And so on.

In contrast, when the two model outputs are DataFrames, they have instant compatibility. They are both the same kind of data, even though they came from different models. So we can skip all of that translation code and go directly to solving the problem. With one line of code.

IBM’s focus on improving natural language processing development

This project aligns with IBM’s goal to continually develop and deliver new natural language processing innovations, both in the open source community and through products like Watson Discovery and Watson Natural Language Understanding. In fact, the Text Extensions for Pandas project integrates with IBM Watson Natural Language Understanding and IBM Watson Discovery to help make it even easier for developers to uncover new insights from natural language text. In the 2021 Magic Quadrant for Cloud AI Developer Services, Gartner recognized IBM’s leadership in creating developer-friendly NLP solutions, and Text Extensions for Pandas will continue to help developers succeed with NLP.

Conclusion

We hope that this example has shown you why the easy things in NLP are so hard — and how Pandas can fix that problem. If you’d like to find out more about our Text Extensions for Pandas library, take a look at our web site at http://ibm.biz/text-extensions-for-pandas.

Originally published at https://developer.ibm.com.

Fred Reiss is a Principal Research Staff Member at IBM Research and Chief Architect at IBM’s Center for Open-Source Data and AI Technologies (CODAIT).