Text Extensions for Pandas: Tips and Techniques for Extending Pandas

Bryan Cutler
IBM Data Science in Practice
6 min read · May 12, 2021

Written in collaboration with Fred Reiss, Chief Architect at IBM CODAIT

Pandas is the de facto tool for data science and comes with a powerful extension framework to create custom data types. These data types can then be added to DataFrames and used with the standard Pandas library. This article will share what we learned when developing custom data types for Text Extensions for Pandas. We hope this will help those creating their own extensions or interested in the process.

Text Extensions for Pandas provides extensions to turn Pandas DataFrames into a universal data structure for Natural Language Processing (NLP) and to use with popular NLP libraries. For a blog post that covers basic usage, have a look at Introducing Text Extensions for Pandas.

Provided Extensions to Pandas

Text Extensions for Pandas has two main custom data types:

  • SpanDtype: character-based spans over a single target text. Each span records a begin and end position in the document, along with the text it covers.
  • TensorDtype: tensors (or numpy.ndarrays) of equal shape. These can be useful to store feature embeddings for tokenizer inputs.

The code below creates a DataFrame with these data types:

Although “span” and “embedding” are each one column, there is a lot of information behind them. The values are stored in the DataFrame as Pandas extension arrays. Let’s extract the SpanArray and TensorArray using the array attribute to do some work with them.

The SpanArray shows begin and end positions with the covered text for each span, for example [0, 5): 'Monty'. It can do this because the SpanArray is backed by two numpy.ndarrays, one for begin positions and one for end positions, plus a reference to the original text. The covered text shown in the column above is the tokenized text from the document and is lazily computed from the positions.
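
To make that layout concrete, here is a minimal sketch of the idea rather than the library's actual implementation. The class name ToySpans, its attributes, and the sample text are all hypothetical; only the "two position arrays plus one text reference" structure comes from the description above.

```python
import numpy as np


class ToySpans:
    """Hypothetical stand-in for a span array: two position arrays plus the text."""

    def __init__(self, text, begins, ends):
        self.text = text  # one shared reference, not a copy per span
        self.begins = np.asarray(begins, dtype=np.int64)
        self.ends = np.asarray(ends, dtype=np.int64)

    @property
    def covered_text(self):
        # Computed on demand from the positions instead of being stored.
        return [self.text[b:e] for b, e in zip(self.begins, self.ends)]


# Sample text chosen for illustration only.
text = "Monty Python and the Holy Grail"
spans = ToySpans(text, begins=[0, 6, 13], ends=[5, 12, 16])
print(spans.covered_text)
```

Because only integer positions are stored, the array stays compact no matter how long the covered strings are.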

The TensorArray stores a batch of tensors of the same shape and is backed by a single numpy.ndarray. This allows the column to contain high-dimensional values and still work with standard Pandas operations. In the above example, each token has an embedding vector of length 768, giving a total shape of (9, 768) for the entire document.
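
The shape arithmetic can be illustrated with plain NumPy. This sketch does not use TensorArray itself; it assumes only that the backing array has one leading "row" axis, as described above, and fills it with placeholder zeros.

```python
import numpy as np

# Sizes taken from the example in the text: 9 tokens, 768-wide embeddings.
n_tokens, embedding_dim = 9, 768
embeddings = np.zeros((n_tokens, embedding_dim), dtype=np.float32)  # placeholder values

# The first axis lines up with DataFrame rows; each element is itself a tensor.
first = embeddings[0]            # one row's tensor
subset = embeddings[[0, 2, 4]]   # selecting rows keeps the per-element shape

print(len(embeddings), first.shape, subset.shape)
```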

Extending the Base Classes

So far we have shown that extending Pandas requires defining a dtype and an array. The dtype is derived from pandas.api.extensions.ExtensionDtype and the array is derived from pandas.api.extensions.ExtensionArray. To learn more, see the API reference and the source files for the span and tensor extensions.

ExtensionDtype

Deriving from ExtensionDtype allows you to register your extension data type with Pandas to identify the array during selection, construction and other operations with a Pandas DataFrame. The definition for SpanDtype is:

@pd.api.extensions.register_extension_dtype
class SpanDtype(pd.api.extensions.ExtensionDtype):
    ...

The decorator takes care of registering the dtype so Pandas is aware of it. The class should be filled in with the attributes indicated by the interface definition from Pandas. Two properties are especially important. The first is type, which returns the class of a single element in the extension array. For SpanDtype this is:

@property
def type(self):
    return Span

A Span defines a single range of characters in a SpanArray. This class will come into play in different ways of working with the extension array. For example, binary operators might have a SpanArray on one side and a Span on the other. Internally, Pandas could build a list of Span values, then call _from_sequence() to construct a new SpanArray from them.

The second property, na_value, returns an instance of how your extension array represents an NA value. The default is numpy.nan, which is a float, so if your extension type cannot be mixed with that value, you should override it. For SpanDtype, we return a special Span that represents NA.
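
The pieces above can be put together in a small sketch. ToySpan, ToySpanDtype, the "toy_span" alias, and the TOY_NA sentinel are hypothetical names invented for illustration; the real SpanDtype may represent NA differently. The sketch also leaves out construct_array_type(), which a complete dtype needs in order to point at its array class.

```python
import pandas as pd
from pandas.api.extensions import ExtensionDtype, register_extension_dtype


class ToySpan:
    """Hypothetical scalar type: a single [begin, end) range."""

    def __init__(self, begin, end):
        self.begin, self.end = begin, end


# Hypothetical NA sentinel; the real library's choice may differ.
TOY_NA = ToySpan(-1, -1)


@register_extension_dtype
class ToySpanDtype(ExtensionDtype):
    name = "toy_span"  # string alias that Pandas can resolve after registration
    type = ToySpan     # class of a single element of the array

    @property
    def na_value(self):
        # Overrides the default numpy.nan, which would not mix with spans.
        return TOY_NA


# Registration lets Pandas resolve the dtype from its string name.
resolved = pd.api.types.pandas_dtype("toy_span")
print(type(resolved).__name__)
```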

ExtensionArray

Deriving from ExtensionArray defines a Pandas array that represents values of a Series or columns of a DataFrame. The definition for SpanArray is:

class SpanArray(pd.api.extensions.ExtensionArray, SpanOpMixin):
    ...

Note: this class also inherits from SpanOpMixin, which helps define operations for use with Span and SpanArray. We will cover it later.

There are many methods described in the API reference, and we won’t cover them all here. Instead, let’s focus on a common operation, applying a boolean mask to the DataFrame, and show how the SpanArray handles it.

That’s a simple operation, but there is actually a lot going on here. First, the command ~df.special_tokens_mask produces a new boolean series with special tokens set as False and all others True. Pandas will use this mask to compute a new index for the resulting DataFrame. Then it will call take() with that index to create a new array. The SpanArray.take() method has the signature:

def take(
    self, indices: Sequence[int], allow_fill: bool = False,
    fill_value: Any = None
) -> "SpanArray":
    ...

The input indices is a 1D numpy.ndarray of type int64. Since the SpanArray is backed by two numpy.ndarrays, we can defer to ndarray.take() to compute new begin and end arrays with the given indices. Finally, with the new arrays and the same target text reference, a new SpanArray is returned for the resulting DataFrame.

This shows that using numpy.ndarrays with built-in NumPy or Pandas functions will lead to efficient implementations in your own extension array.
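
The delegation to ndarray.take() might look roughly like the following sketch. ToySpans is a hypothetical class, not the library's SpanArray, and the sketch deliberately does not implement allow_fill, which a real extension array must handle. The mask-to-positions step uses np.flatnonzero as a stand-in for what Pandas does internally.

```python
import numpy as np


class ToySpans:
    """Hypothetical span container backed by two int64 arrays and one text."""

    def __init__(self, text, begins, ends):
        self.text = text
        self.begins = np.asarray(begins, dtype=np.int64)
        self.ends = np.asarray(ends, dtype=np.int64)

    def take(self, indices, allow_fill=False, fill_value=None):
        if allow_fill:
            # Filling missing positions (-1) is out of scope for this sketch.
            raise NotImplementedError("allow_fill is not handled here")
        indices = np.asarray(indices, dtype=np.int64)
        # Both backing arrays are indexed the same way; the text is shared.
        return ToySpans(self.text, self.begins.take(indices), self.ends.take(indices))


text = "Monty Python and the Holy Grail"
spans = ToySpans(text, begins=[0, 6, 13, 17], ends=[5, 12, 16, 20])

# Made-up mask standing in for special_tokens_mask.
mask = np.array([True, False, False, True])
indices = np.flatnonzero(~mask)  # integer positions of the rows to keep
taken = spans.take(indices)
print(taken.begins, taken.ends)
```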

Slicing the Array

A common operation is to take a slice of a DataFrame or Series. Pandas can do this in many different ways, such as brackets with an index, or with iloc[]. These will unwrap the extension array and call __getitem__ with the given index or slice.

For a single index, Pandas expects the return value to be the defined scalar type, e.g. SpanArray returns Span.
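
A sketch of that scalar-versus-array distinction follows. ToySpan and ToySpans are hypothetical stand-ins, and the target text is left out to keep the example short.

```python
from typing import NamedTuple

import numpy as np


class ToySpan(NamedTuple):
    """Hypothetical scalar type for one [begin, end) range."""
    begin: int
    end: int


class ToySpans:
    """Hypothetical array type backed by two int64 arrays."""

    def __init__(self, begins, ends):
        self.begins = np.asarray(begins, dtype=np.int64)
        self.ends = np.asarray(ends, dtype=np.int64)

    def __len__(self):
        return len(self.begins)

    def __getitem__(self, item):
        if isinstance(item, (int, np.integer)):
            # A single position returns the scalar type.
            return ToySpan(int(self.begins[item]), int(self.ends[item]))
        # Slices, integer arrays, and boolean masks return a new array.
        return ToySpans(self.begins[item], self.ends[item])


spans = ToySpans([0, 6, 13, 17], [5, 12, 16, 20])
print(spans[1], len(spans[1:3]))
```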

Binary Operations

You can define arithmetic and comparison operations by adding the standard Python operator methods to your class, such as __lt__ and __add__. For SpanArray, addition is defined as the minimal span that covers both spans.

Let’s look at this addition operation more closely. The left side is a single Span of "Monty", and the right side is a SpanArray slice with tokens ["and", "the"]. This means our operation must support both scalar and array classes as inputs. We define a SpanOpMixin that both classes inherit, so that addition works with any combination of the two. Here is the code (abbreviated):

Example Mixin for Operations on Extension Types

An important thing to note is the line if isinstance(other, (ABCDataFrame, ABCSeries, ABCIndexClass)). This checks if the right value is a Pandas wrapper object. In writing your extension array, you never need to deal with high-level Pandas objects, such as Series or DataFrame. Pandas will automatically take care of unwrapping these, and then wrapping the result back up accordingly. This makes writing an extension array much easier, with less boilerplate code, so you can focus on the actual implementation. For most cases, returning NotImplemented will tell Pandas to take care of unwrapping and call again with the actual array.
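
As a rough illustration of the pattern, here is a sketch with hypothetical ToySpan and ToySpans classes sharing one mixin. It is not the library's SpanOpMixin. In particular, it checks against the public pd.DataFrame, pd.Series, and pd.Index classes as a substitute for the internal ABC classes named above.

```python
import numpy as np
import pandas as pd


class ToySpanOpMixin:
    """Hypothetical mixin: addition yields the minimal span covering both sides."""

    def __add__(self, other):
        # Defer to Pandas for wrapper objects; it unwraps them and calls again.
        if isinstance(other, (pd.DataFrame, pd.Series, pd.Index)):
            return NotImplemented
        if not isinstance(other, (ToySpan, ToySpans)):
            return NotImplemented
        # np.minimum / np.maximum broadcast over scalar and array inputs alike.
        begins = np.minimum(self.begin, other.begin)
        ends = np.maximum(self.end, other.end)
        if isinstance(self, ToySpan) and isinstance(other, ToySpan):
            return ToySpan(int(begins), int(ends))
        return ToySpans(begins, ends)


class ToySpan(ToySpanOpMixin):
    def __init__(self, begin, end):
        self.begin, self.end = begin, end


class ToySpans(ToySpanOpMixin):
    def __init__(self, begins, ends):
        self.begin = np.asarray(begins, dtype=np.int64)
        self.end = np.asarray(ends, dtype=np.int64)


monty = ToySpan(0, 5)
and_the = ToySpans([13, 17], [16, 20])
result = monty + and_the
print(result.begin, result.end)
```

Using the same attribute names (begin, end) on both classes is a design choice of this sketch. It lets one __add__ body serve scalar and array operands through NumPy broadcasting.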

Reduction Operations

More complex operations are also supported, such as reductions. These are implemented in the _reduce() method with a name argument indicating the type of reduction. SpanArray supports the sum reduction, so when done on a series, _reduce() will compute the smallest span that contains all spans in that series:
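
A minimal sketch of such a reduction is shown below, using a hypothetical ToySpans class and returning a plain (begin, end) tuple in place of a Span. The skipna argument is accepted but ignored here, and the exact _reduce() signature can vary between Pandas versions.

```python
import numpy as np


class ToySpans:
    """Hypothetical span container backed by two int64 arrays."""

    def __init__(self, begins, ends):
        self.begins = np.asarray(begins, dtype=np.int64)
        self.ends = np.asarray(ends, dtype=np.int64)

    def _reduce(self, name, skipna=True, **kwargs):
        if name == "sum":
            # Smallest span containing every span: earliest begin, latest end.
            return (int(self.begins.min()), int(self.ends.max()))
        raise TypeError(f"'{name}' is not supported by this toy span array")


spans = ToySpans([0, 6, 13], [5, 12, 16])
covering = spans._reduce("sum")
print(covering)
```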

Testing

A big part of creating an extension type is making sure it can be used with the rest of the Pandas library. Pandas has a huge amount of functionality built in, and there are often several different ways to do a certain task. So how do you know your custom data type will work in every aspect of Pandas? Fortunately, Pandas comes with a testing base that you can extend and bring into your own continuous integration testing.

Pandas uses PyTest, so you will first need to define some PyTest fixtures. There is no documented list of what needs to be defined, so you might need to inspect the Pandas test base or other test files to figure out what you need. As an example, here are some of the fixtures used for the TensorArray tests:
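
The sketch below shows the general shape of such fixtures. It is not the project's actual test code. It substitutes the built-in nullable Int64 type for TensorArray so that it runs on its own. The fixture names dtype, data, and data_missing, and their expected contents, follow our understanding of the Pandas extension test conventions and should be checked against the Pandas version you test with. The make_* helper names are hypothetical.

```python
import pandas as pd
import pytest


# Plain helper builders, kept separate so they can also be called directly.
def make_data():
    return pd.array(list(range(1, 101)), dtype="Int64")


def make_data_missing():
    return pd.array([pd.NA, 1], dtype="Int64")


@pytest.fixture
def dtype():
    """Instance of the dtype under test (a built-in type as a stand-in)."""
    return pd.Int64Dtype()


@pytest.fixture
def data():
    """Array of 100 valid values whose first two elements differ."""
    return make_data()


@pytest.fixture
def data_missing():
    """Array of length 2 holding [NA, valid]."""
    return make_data_missing()
```

For a custom type, the helpers would build your own extension array instead, and a test class deriving from one of the Pandas extension test base classes would pick up these fixtures by name.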

There are quite a few testing base classes that can verify your custom data type on different aspects of Pandas. Occasionally, you might need to override certain tests to change or skip something that doesn’t make sense for your type. For more details, see the Pandas extension test base or test_span.py for SpanArray tests.

Conclusion

The excellent Pandas extension framework allows you to leverage the Pandas library to do high-level analysis on custom data types. We hope that this post, and the Text Extensions for Pandas project, will help others also interested in extending Pandas. To learn more about Text Extensions for Pandas, visit the project page at https://ibm.biz/text-extensions-for-pandas.

Bryan Cutler
IBM Data Science in Practice

Software Engineer at IBM - Center for Open-Source Data & AI Technologies (CODAIT)