How to avoid memory overloads using SciKit Learn
A small technique we found while text-mining 40,000 documents
This article is a part of our making of series on “Islam, media subject”, a study on the perception of Islam by french national daily newspapers. You can see our first published article here and the previous one here.
In “Islam, media subject”, we wanted to quantify how media were treating “Islam”. To do that we used text-mining tools and techniques to analyse the articles published in daily newspapers from France.
One of those techniques was the analysis of (co)occurrences of terms in the articles thanks to Document-Term matrices like bellow. Those matrices hold occurrences of terms (all terms compose the vocabulary) through a set of documents (the corpus). They constitute one of the basic components of many techniques like Topic Modeling, Clustering, study of similarities, etc.
As you can see on this schema, the vocabulary is constituted of part-of-speech tagged words. This is a part of our analysis because we wanted to be able to study what were the terms used in our corpus associated with “Islam”.
But this is not a part of DTM, so you can ignore it for the rest of this article. However if you’re interested in it you can read the first article of this series that details it a bit.
In order to create those kind of matrices, we used a feature took from the Sci-Kit Learn libray (or sklearn, a Python library providing machine learning algorithms and techniques) called CountVectorizer. This feature was exactly what we were looking for because it took in charge the creation of DTM while letting us be able to study more deeply what terms were used in the different parts of our corpus thanks to a vocabulary (as shown above).
Why is it taking so much memory?
This vocabulary is a Python dictionary associating terms to a matrix columns. As we can see in the previous scheme, those terms can be single words or association of words (or n-grams). In our case we used n-grams between 1 and 3 terms and that’s what made this vocabulary get really huge quickly.
In this chart, we can see that the main problem lies in this vocabulary. For instance, when the number of documents in our corpus gets above 14,000 the vocabulary is taking 750MB while the corpus itself (a pandas DataFrame) takes ~250MB and the associated DTM is taking only ~180MB. This also illustrate the way Python deal with memory allocation for dictionaries (like vocabulary). As we can see, it first allocate an amount of memory, then, when it got filled by new elements, it doubles its space, and so on. That’s why we see a huge gap between 13,000 and 14,000 documents.
So we can guess that the next allocated memory space, for this dictionary alone, will be around 1.5GB when we will reach a bigger number of documents. So as the number of documents get bigger, you will reach a point where the allocated memory space is so big that CountVectorizer can’t compute the DTM anymore without throwing a Memory Error.
How to resolve this memory overload?
First, if you don’t need to, don’t use CountVectorizer. Sklearn have other less memory-consuming features like HashingVectorizer. However it didn’t allow us to have an in-memory vocabulary (since text got hashed). This made us stick with CountVectorizer in order to be able to query this vocabulary to study terms usage in our corpus.
But if you need to use CountVectorizer, what you can do is to separate this costful operation into several smaller operations and construct the Document-Term matrix with multiple operations, not one. That’s what we’re going to see next.
Divide the Document-Term matrix creation process in smaller parts
This sounds like an easy task to do. It seems you “just” have to separate your documents in small batches (chunks); then you apply CountVectorizer on them to create small DTMs, then combine them to get a big DTM, right? Well, no. If you do that, you will end up with incompatible matrices. Because every chunks’ DTM will have columns corresponding to different vocabulary terms. Indeed, the vocabulary depends on the texts content.
Different set of texts will produce different vocabularies and this is what prevents you to concatenate those matrices; simply because it doesn’t make any sense to concatenate two matrices having columns representing different terms. So, in order to separate this DTM creation in smaller parts, you should instead:
- Create a single vocabulary of the whole corpus,
- separate your corpus in different chunks,
- for each chunk create a corresponding DTM with the vocabulary created earlier (this way you can be sure your matrices will have the same shape),
- and finally concatenate your matrices.
We’ve summed this in the schema bellow. Remember that the Part-of-Speech Tagging process is not mandatory at all, but if you want to use such technique then you should do it with the depicted order. You can also look at a simplified version of what we implemented in this gist.
This approach have several benefits. First, if your corpus computation fail (for various reasons) you could take back computation to where it failed (if you took care of saving computed parts during the process). This saves a lot of time and helps to keep a relatively good mental sanity.
Second, this solution could be parallelized/distributed since we have small independent corpus chunks and a single shared vocabulary, it won’t be an issue to process those matrices independently.
And finally, it makes computation of DTM smaller in terms of memory consumption. That’s why it’s less prone to break with a Memory Error.
Improvements to do
We didn’t spend additional time on this issue but we can imagine different improvements or different approaches to improve our solution.
As I’ve said earlier, the biggest problem lies in the memory space took by our vocabulary and our solution don’t reduce this space. And things will get worse when the corpus gets bigger (like in CountVectorizer); to a point where this vocabulary will get too big to do anything on the side. That’s why we should consider different approaches to reduce this space.
- Reduce the size of the vocabulary
You could, for instance, remove terms appearing only once in the whole corpus or terms appearing too many times (like tool words, stop words). However, this approach won’t be enough to reduce its size significantly.
- Get the vocabulary out of the RAM, store it in a database
This would be a huge improvement. By storing this vocabulary in a database would save a lot of memory space for other parts of the analysis. We didn’t look for what would be the “perfect” database for this job but if we can manage to have a working solution with PostgreSQL it will be good enough for us.
- Use a different data structure
Python dictionaries are taking a lot of memory space. Heading for a similar data structure but with a memory consumption focus could be the solution. The only clue I have right now is to look at Trie (and the marisa-trie implementation).
If you know simpler, more efficient or just different approaches that could tackle this issue, feel free to comment!