Implementing a simple text preprocessing pipeline with spaCy

Tang U-Liang · Published in Voice Tech Podcast · Aug 10, 2020

spaCy, the Python-based natural language processing library, offers NLP practitioners an encapsulated and elegant way of writing text preprocessing pipelines.

I thought I would illustrate this by writing a FilterTextPreprocessing class which will be added to the default spaCy document pipeline.

Suppose we wanted to perform a simple stop word removal from a document. This is how we could do it.

import spacy

nlp = spacy.load('some_english_model')
doc = nlp('I saw a chicken crossing the road')
bow = [tkn.text for tkn in doc if not tkn.is_stop]

We can do this because spaCy exposes the is_stop attribute on its Token type. This is a step up from collection-based stop word removal, where we filter stop words against some set of words. In spaCy, the stop word collection can be customized separately, thus enforcing a kind of separation of concerns in your text processing script.
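For instance, the stop word list can be tweaked on the shared vocabulary without touching the filtering code. A minimal sketch (using the same placeholder model name as above):

import spacy

nlp = spacy.load('some_english_model')

# Stop words live on the shared vocabulary, separate from the filtering logic.
# Mark an extra word as a stop word...
nlp.vocab['chicken'].is_stop = True
# ...or un-mark one of the defaults.
nlp.vocab['the'].is_stop = False

doc = nlp('I saw a chicken crossing the road')
print([tkn.text for tkn in doc if not tkn.is_stop])  # 'chicken' is now filtered out as well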

The filtering step itself could even be wrapped in a function:

from typing import List
from spacy.tokens import Doc

def filter_stop_words(doc: Doc) -> List[str]:
    return [tkn.text for tkn in doc if not tkn.is_stop]

The problem with this is that the function forces any subsequent processing functions down the line to know that the return value of filter_stop_words is List[str] rather than the default Doc container, with all the conveniences it brings. Returning a sequence of str also makes further processing difficult because str does not have convenience attributes like like_num, is_punct, like_email and the like.

Matters only improve slightly by returning List[Token]. The main problem remains: preprocessing functions, if they are to be chained, are dependent on the arguments and return values of whatever comes before and after them in the pipeline. This makes it difficult to write reusable pipeline functions.
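For comparison, the List[Token] variant would look like this; the tokens keep their attributes, but the coupling problem remains:

from typing import List
from spacy.tokens import Doc, Token

def filter_stop_words(doc: Doc) -> List[Token]:
    # Tokens still expose like_num, is_punct, etc.,
    # but downstream functions no longer receive a Doc
    return [tkn for tkn in doc if not tkn.is_stop]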

Fortunately for us, spaCy encourages a better style of coding through the awesome Matcher class. This means we can write matching patterns for tokens we want to discard (or keep) and rely on the spaCy framework to handle the filtering for us.

Here’s how to do it:

  1. Define a custom attribute on the Doc container to hold our final list of filtered tokens. This ensures we still have access to the original Doc object anywhere in the subsequent chain of processing steps.
  2. Define a custom attribute on the Token to mark it as “keep” or “discard”. The idea is that after all filtering is done, we simply do [tkn.text for tkn in doc if tkn._.keep] to obtain the final Bag-Of-Words we want to pass to our downstream models.
  3. Define a set of patterns and pass them to the Matcher. The Matcher is spaCy’s preferred way of detecting tokens of interest to us. Don’t iterate over the doc to find tokens of interest; this is slow and makes for hard-to-understand code.
  4. Define a callback to handle matching events. This function fires every time the matcher detects a token satisfying the pattern you passed to it earlier. We would implement the marking logic in this callback.

Now let’s see how to implement each of these concerns in a single class.


There’s nothing magical about spaCy processing pipelines. They are just plain vanilla classes inheriting directly from object. But each one does need to be a callable taking a Doc and returning a Doc.

class FilterTextPreprocessing:
    def __init__(self, nlp, *options):
        pass

    def __call__(self, doc):
        return doc

Now to initialize the (global) Doc and Token classes to hold custom attributes. These containers expose the _ attribute, to which we can attach our own custom attributes.

from spacy.tokens import Doc, Token

class FilterTextPreprocessing:
    def __init__(self, nlp, *options):
        Doc.set_extension('bow', default=[])
        Token.set_extension('keep', default=True)

    def __call__(self, doc):
        return doc

Now we import the Matcher and initialize it with our Language’s vocabulary.

from spacy.tokens import Doc, Token
from spacy.matcher import Matcher

class FilterTextPreprocessing:
    def __init__(self, nlp, *options):
        Doc.set_extension('bow', default=[])
        Token.set_extension('keep', default=True)
        self.matcher = Matcher(nlp.vocab)

    def __call__(self, doc):
        return doc

I’m also going to replace the *options with a more specific patterns parameter that the pipeline must be initialized with. This allows me to pass in the required patterns for particular use cases.

from typing import List, Tuple, Dict

from spacy.tokens import Doc, Token
from spacy.matcher import Matcher

class FilterTextPreprocessing:
    def __init__(self, nlp,
                 patterns: List[Tuple[str, List[Dict]]]):
        Doc.set_extension('bow', default=[])
        Token.set_extension('keep', default=True)
        self.matcher = Matcher(nlp.vocab)
        for string_id, pattern in patterns:
            self.matcher.add(string_id, None, pattern)

    def __call__(self, doc):
        return doc

Notice the type annotation for patterns. It is a list containing tuples of a string_id and a list of dictionaries. In spaCy, each dictionary represents a single token match, and a list of such dictionaries determines a sequence of tokens.
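For example, a patterns argument covering the stop word case plus a two-token sequence might look like this (the string_ids are just illustrative labels):

# Each entry is a (string_id, pattern) tuple;
# each dict in a pattern matches exactly one token.
patterns = [
    ('stop_word', [{'IS_STOP': True}]),                       # any single stop word
    ('greeting', [{'LOWER': 'good'}, {'LOWER': 'morning'}]),  # a two-token sequence
]

make_bow = FilterTextPreprocessing(nlp, patterns)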

So currently, we have passed None to the callback argument. We will revisit that later. Now, let’s see how we can enable the class to read patterns from a pattern file. To this end, I’ve changed the typing annotation for patterns and reimplemented the logic to add patterns to the matcher.

from typing import List, Dict, Union

from spacy.tokens import Doc, Token
from spacy.matcher import Matcher
from srsly import read_json

class FilterTextPreprocessing:
    def __init__(self, nlp,
                 patterns: List[Dict[str, Union[str, List[Dict]]]]):
        Doc.set_extension('bow', default=[])
        Token.set_extension('keep', default=True)
        self.matcher = Matcher(nlp.vocab)
        for patt_obj in patterns:
            string_id = patt_obj.get('string_id')
            pattern = patt_obj.get('pattern')
            self.matcher.add(string_id, None, pattern)

    def __call__(self, doc):
        return doc

    @classmethod
    def from_pattern_file(cls, nlp, path):
        patterns = read_json(path)
        return cls(nlp, patterns)

Now we can initialize this class by passing it a path to a pattern file. More reusability ftw!

Such a json file would need to be structured like this:

[
{"string_id": "stop_word", "pattern": [{"IS_STOP": true}]},
// other patterns
]

Let’s implement the logic for a matching event. When we add patterns to a Matcher, we can pass a callback to fire on a matching event. This callback has the signature (matcher, doc, i, matches). matcher and doc are pretty self-explanatory. i is the index of the current match, and matches is the list of all successfully matched “tokens”. I say “tokens” because it’s not really tokens which are contained in the list, but the hashed string_id of the match and the start and end indices into the Doc (to allow for span matches).
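To make that signature concrete, here is a throwaway callback (not part of the class) that just prints what it receives; doc.vocab.strings recovers the original string_id from its hash:

def debug_on_match(matcher, doc, i, matches):
    # matches is a list of (match_id, start, end) tuples; i points at the match that fired
    match_id, start, end = matches[i]
    string_id = doc.vocab.strings[match_id]  # un-hash the string_id
    print(string_id, doc[start:end].text)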

from typing import List, Dict, Union

from spacy.tokens import Doc, Token
from spacy.matcher import Matcher
from srsly import read_json

class FilterTextPreprocessing:
    def __init__(self, nlp,
                 patterns: List[Dict[str, Union[str, List[Dict]]]]):
        Doc.set_extension('bow', default=[])
        Token.set_extension('keep', default=True)
        self.matcher = Matcher(nlp.vocab)
        for patt_obj in patterns:
            string_id = patt_obj.get('string_id')
            pattern = patt_obj.get('pattern')
            self.matcher.add(string_id, self.on_match, pattern)

    def on_match(self, matcher, doc, i, matches):
        _, start, end = matches[i]
        for tkn in doc[start:end]:
            tkn._.keep = False

    def __call__(self, doc):
        return doc

    @classmethod
    def from_pattern_file(cls, nlp, path):
        patterns = read_json(path)
        return cls(nlp, patterns)

So what this callback does is mark a matching token with keep = False. Recall that we set an extra attribute on the Token container, and that extra attribute is where we store the “keep”/“discard” state. In essence, the callback flips the state of a matching token from “keep” to “discard”.

Then, finally, we can implement the calling logic.
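Here is a sketch of the completed class based on the steps above: __call__ runs the matcher over the doc, which fires the on_match callbacks, and then collects the surviving tokens into doc._.bow.

from typing import List, Dict, Union

from spacy.tokens import Doc, Token
from spacy.matcher import Matcher
from srsly import read_json

class FilterTextPreprocessing:
    def __init__(self, nlp,
                 patterns: List[Dict[str, Union[str, List[Dict]]]]):
        # Custom attributes: the final bag of words on the Doc,
        # and a keep/discard flag on every Token
        Doc.set_extension('bow', default=[])
        Token.set_extension('keep', default=True)
        self.matcher = Matcher(nlp.vocab)
        for patt_obj in patterns:
            string_id = patt_obj.get('string_id')
            pattern = patt_obj.get('pattern')
            self.matcher.add(string_id, self.on_match, pattern)

    def on_match(self, matcher, doc, i, matches):
        # Mark every token in the matched span as "discard"
        _, start, end = matches[i]
        for tkn in doc[start:end]:
            tkn._.keep = False

    def __call__(self, doc: Doc) -> Doc:
        # Running the matcher fires on_match for every hit,
        # flipping keep to False on the tokens we want to drop
        self.matcher(doc)
        # Collect the surviving tokens into the custom bow attribute
        doc._.bow = [tkn.text for tkn in doc if tkn._.keep]
        return doc

    @classmethod
    def from_pattern_file(cls, nlp, path):
        patterns = read_json(path)
        return cls(nlp, patterns)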

To use this pipeline in a spaCy project,

>>> nlp = spacy.load('some_english_model')
>>> make_bow = FilterTextPreprocessing.from_pattern_file(nlp, '/path/to/patterns.json')
>>> nlp.add_pipe(make_bow, last=True)
>>> doc = nlp("I crossed the road to buy 5 drinks.")

Here’s a sample pattern file that implements stop word, punctuation and numerical filters. Further attributes that can be passed to the matcher are documented in spaCy’s Matcher docs.

[
  {"string_id": "stop_word", "pattern": [{"IS_STOP": true}]},
  {"string_id": "punctuation", "pattern": [{"IS_PUNCT": true}]},
  {"string_id": "numerical", "pattern": [{"LIKE_NUM": true}]}
]

Running the earlier example through this pipeline gives:

>>> print(*doc._.bow, sep=' ')
crossed road buy drinks

As you can see, spaCy’s pipeline-based architecture promotes reusable components and, hopefully, more structured and easier-to-maintain code. Happy hacking!
