🏎️ Fast bag-of-words using spaCy and cython

Mehdi Hamoumi
May 14, 2019 · 8 min read

When processing large amounts of text (more than 1 million books) like we do at Glose, it is crucial to optimize every step of the pipeline.

In this post, we will be focusing on a very common Natural Language Processing (NLP) task that involves counting the number of occurrences of every word in a text: building a bag-of-words (BoW).

🎒 BoW applications and a simple example

NLP pipelines usually start by converting a text to an array (or several arrays) of numbers (vectors). This vectorial representation is crucial because it is much easier to manipulate vectors than raw strings in machine learning algorithms (such as document classification, sentiment analysis, Part-of-Speech tagging, Named Entities Recognition…), or in recommendation systems that compute similarities between items based on their vectors.

Building the BoW representation is often the first step in obtaining the vectorial representation. For instance, in topic modeling, where every number in the vector corresponds to the contribution of a topic to the text, we start by the BoW, then compute the words’ frequencies (Tf-Idf), and finally compute the topics (using methods like LSA or LDA).

Let us start with a simple example, using this post’s introduction:

When processing large amounts of text, we optimize every step. In this post, we will be focusing on counting the number of occurrences of every word in a text.

Before building the BoW representation, we convert the text to lowercase, remove all the stopwords (meaningless words) and punctuation, and replace all words by their lemmas, which gives:

process large amount text optimize step post focus count number occurrence word text

Then we count the occurrences of each word, which gives:

{
'process': 1,
'large': 1,
'amount': 1,
'text': 2,
'optimize': 1,
'step': 1,
'post': 1,
'focus': 1,
'count': 1,
'number': 1,
'occurrence': 1,
'word': 1,
}

If we associate an integer identifier to every word (‘process’ ↔ 0, ‘large’ ↔ 1 …) — a.k.a build a dictionary — we can write the BoW representation in a more compact manner [(id, count), …]:

[(0,1), (1,1), (2,1), (3,2), … , (11, 1)]

We can then use this BoW representation in further processing steps, as mentioned above.

🐍 A naive implementation

Let us first write a simple python program that transforms a preprocessed text into a compact BoW representation:

We use python’s built-in collections.defaultdict to count the number of occurrences of words, and build the dictionary by iterating on all the words, and adding the missing ones with their integer identifier.

We can now try our BoW implementation on the previous preprocessed example:

sample_text = 'process large amount text optimize step post focus count number occurrence word text'

We get:

BOW representation: [(0, 1), (1, 1), (2, 1), (3, 2), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1)]

which is exactly what we got in the previous section, by manually counting the occurrences.

Finally let us time the text2bow function on a preprocessed text of 26893 words, corresponding to a single book:

4.36 ms ± 29.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

While this seems like a reasonable processing time for a single book, it is important to remember that it is only the first step of our NLP pipeline, and that it will be applied on ~1M books, which takes us to ~ 1 hour 12 minutes of processing time just to create BoW representations of all our books.

We will see in the following sections how to use Cython and spaCy’s Cython API to speed up this code.

⚙️ About cython and spaCy

Working with pure Python code is great for fast iteration and experimentation. However there are some use cases where one can benefit from a statically typed and compiled language. Mainly:

  1. When developing a production module, that needs to work at full speed.
  2. When a bottleneck is identified in Python code (profiling is key!). This is often related to portions of Python code with plain or nested for-loops.
  3. When native parallelism is needed (that means releasing Python’s Global Interpreter Lock).

This brings us to Cython, which is defined as follows in its homepage:

Cython is an optimising static compiler for both the Python programming language and the extended Cython programming language (based on Pyrex). It makes writing C extensions for Python as easy as Python itself. […]

The Cython language is a superset of the Python language that additionally supports calling C functions and declaring C types on variables and class attributes. This allows the compiler to generate very efficient C code from Cython code. The C code is generated once and then compiles with all major C/C++ compilers […].

Cython thus solves (1) and (3) by generating pure C/C++ code, and (2) by being a superset of Python, which means that Python code is valid Cython, and allows to only optimize the bottlenecks.

Leveraging all of Cython’s benefits for NLP can however be tricky. It is mentioned in the documentation that one should not use C strings because they require manual memory management and are more cumbersome than Python strings.

This is where spaCy comes of help. It is a fast NLP library written in Python/Cython, which uses a clever method to manage strings, that manipulates 64-bit hashes internally instead of Python strings, making the code much faster while keeping the flexibility of Python strings from the user’s perspective.

spaCy’s StringStore

Most of spaCy’s strings management is taken care of in the file strings.pyx, and its StringStore. As mentioned in the documentation’s header, the StringStore’s purpose is to:

Look up strings by 64-bit hashes. As of v2.0, spaCy uses hash values instead of integer IDs. This ensures that strings always map to the same ID, even from different StringStores.

This means that instead of manipulating strings internally, spaCy computes and manipulates their 64-bit hashes only. They are converted back to their Python string equivalent only when necessary (for instance when a user wants to print the output of the spaCy pipeline).

Hence, the StringStore is the data structure where the mapping between the 64-bit hashes and their Python unicode string equivalent is stored. It is in fact a Cython extension type, which means that from a Python module’s perspective it behaves like a Python class, but internally it can have cdef statements (either static attributes, or C methods), and is built as a C struct.

In order to leverage the speed of spaCy’s low level C structures for our fast BoW program, we need to understand what happens when a Python unicode string is added to the StringStore. Most of this behavior is located in the StringStore’s methods add, intern_unicode, and _intern_utf8.

Here are the main steps of this process:

  1. The unicode string is encoded in utf-8 (code)
  2. The encoded string is hashed to a 64-bit integer by using MurmurHash2 (code)
  3. The encoded string is transformed into a C char array (or pointer). This is done in a way that optimizes memory usage (by either using a char array of 8 elements if the string is small enough, or using a pointer array otherwise) (code for the _allocate function, that converts the encoded string to a C array/pointer, line where the _allocate function is called)
  4. The (64-bit hash, C char array/pointer) couple is stored as key-value in a Cython Hash Table for Pre-Hashed Keys, which is an attribute of the StringStore (code)
  5. The 64-bit hash is stored in a list attribute of the StringStore (code)

💡 The main lessons from this analysis are that in order to be able to use a fast C level hash table the unicode strings need to be converted to a C char array (or pointer)¹, and that all fast computations must be performed on the 64-bit hashes, not on the strings.

🏎️ Cythonized version of BoW

Now that we know how spaCy manages strings internally, we can start implementing our own Cython version of the BoW.

First, we will import the necessary C types and classes (or extension types):

Among all these imports let us comment on a few:

  • The cymem package is used to tie memory to a Python object, so that the memory is freed when the object is garbage collected.
  • The preshed package contains both the Hash table where we store the (64-bit hash, C char array/pointer) couples, and a fast counter extension type (PreshCounter) that we will use to perform the BoW counting.
  • We use cimport instead of import to access the extensions types’ C methods and attributes.

Let us now write the counting function:

The _insert_in_hashmap function (used in fast_count) is defined below, as well as other utility functions, directly inspired from spaCy’s strings.pyx:

Finally, we compile the Cython file with the following simple setup.py, by executing python setup.py — build-ext -if (more details on Cython compilation here):

This time, by executing text2bow on the example text we get:

BOW representation: [(2751841902330220293, 1), (6601561424492272668, 1), (11916616154811659322, 1), (4645701992108298564, 1), (15099781594404091470, 2), (9470301821735104089, 1), (9556349622722280057, 1), (6437996555066804658, 1), (1020421249059553464, 1), (12460925685579008443, 1), (18223104521466393082, 1), (8530854408006191868, 1)]

We can see the 64-bit hashes in the BoW, and a representation of the hash table using unicode strings.

Processing the same 26893-words book is now around 4 times faster:

1.1 ms ± 6.68 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Computing the BoW of 1M books now only takes ~18 minutes 🚀 compared to the previous 1 hour 12 minutes!

Conclusion

In this post we have presented both a naive implementation of Bag-of-Words, as well as an optimized Cython one, directly inspired by spaCy’s internal way of managing strings.

Exploring Cython and spaCy is a good way to write more optimized NLP pipelines, and we definitely advise you to read more about these two. In particular some great introductory Cython material can be found in Kurt Smith’s video, and his book, while more advanced topics can be studied in Cython’s documentation. For spaCy, same thing goes with the great documentation, and don’t hesitate to delve into the code!

¹ Which is exactly what Cython’s documentation was warning us about! The tricky manual memory management of the C strings is in fact taken care of in the _allocate function.

Glose Engineering

Stories from the Glose team

Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface. Learn more

Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox. Explore

If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic. Write on Medium

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store