Machine Learning - Reducing Physical Memory Consumption - Jumping One Hurdle at a Time

How to handle RAM utilisation in data-science projects without killing your average laptop?

Jackson Sunny Rodrigues
3 min readAug 11, 2018

Version 2 available here

What does this blog talk about? It shows one view on how to use chunking in a project. There are plenty of tutorials on how to use PyTables/H5py/Pandas (HDF5); what is described here is how to integrate that capability with an algorithm.

This is targeted at an audience that is fairly new to Machine Learning, like me, with a small budget and a mountain of desire to learn ML. If that describes you, then you understand the pain. We have all hit situations where we try a new algorithm and the system eventually hangs, forcing us to restart the PC or kill the program, leaving us sad.

How can we solve this problem?
Very simple. Keep in memory only what you need to process or hold at the moment, i.e. process in chunks and load into RAM only what the current chunk needs:

  • Process data in Chunks
  • HDF5 — PyTables, Pandas and H5py come with support

Processing Data in Chunks

So this portion is simple. Say you have 100 items and you know you can safely load only 10 at a time without overloading system memory. What do you do? You load the first ten and, after processing them, you load the second ten, and so on until the last ten, collating the results together.
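The idea is just a loop over slices. A minimal sketch, with a toy `process_chunk` standing in for the real per-chunk work, might look like this:

```python
def process_chunk(chunk):
    # stand-in for the real per-chunk computation
    return [item * 2 for item in chunk]

items = list(range(100))   # pretend this is your full data set
chunk_size = 10
results = []

for start in range(0, len(items), chunk_size):
    chunk = items[start:start + chunk_size]   # load only ten items at a time
    results.extend(process_chunk(chunk))      # collate the partial results

print(len(results))  # 100
```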

But the problem is that most of the code available out there is written to process all of the items in one stretch, so you have to add steps to do the chunking yourself. That means writing a lot of extra plumbing on top of the logic of the actual process, and implementing the chunking code again and again for every section that handles a large data set. This not only clutters the code/logic, it also creates a lot of duplication.

The duplication can easily be handled with a simple concept called a decorator.

A decorator extends the functionality of a targeted function by wrapping it in another function.

Below is an implementation of this concept, where you extend a function that processes a whole collection so that it works in chunks. You can see that the wrapper takes additional arguments to support chunking, such as the total length, the chunk size, and functions to fetch a chunk's inputs and write its outputs.
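The original post embeds the implementation as a gist, which is not reproduced here; the following is a minimal sketch of such a decorator. The name `process_in_chunks` and the exact wrapper signature (`total_length`, `chunk_size`, `fetch_input`, `write_output`) are assumptions for illustration:

```python
from functools import wraps

def process_in_chunks(func):
    """Wrap a function that processes a whole collection so it runs chunk by chunk."""
    @wraps(func)
    def wrapper(total_length, chunk_size, fetch_input, write_output, *args, **kwargs):
        for start in range(0, total_length, chunk_size):
            end = min(start + chunk_size, total_length)
            chunk = fetch_input(start, end)           # load only this chunk into RAM
            result = func(chunk, *args, **kwargs)     # run the original whole-collection logic
            write_output(start, end, result)          # persist / collate the chunk's output
    return wrapper
```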

You can see that just by using the decorator we could extend a function, generate_shingles_for_items, which is designed to process a whole data set, so that it processes in chunks. See here.
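A hypothetical usage sketch, building on the `process_in_chunks` decorator above; the real `generate_shingles_for_items` in the linked code may differ, so treat the body and call signature as illustrative:

```python
@process_in_chunks
def generate_shingles_for_items(documents, k=5):
    # k-character shingles per document (simplified illustration)
    return [{doc[i:i + k] for i in range(len(doc) - k + 1)} for doc in documents]

documents = ["the quick brown fox jumps over the lazy dog"] * 100
all_shingles = []

generate_shingles_for_items(
    total_length=len(documents),
    chunk_size=10,
    fetch_input=lambda start, end: documents[start:end],                  # pull one chunk
    write_output=lambda start, end, result: all_shingles.extend(result),  # collate results
)
print(len(all_shingles))  # 100 shingle sets, produced ten documents at a time
```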

HDF5 — PyTables, Pandas and H5py Come with Support

Previously we have seen a way to process data in chunks. But none of that helps unless you also keep memory in check by loading only what is needed at that moment.

So let's look at an example.

We will read multiple documents and then generate shingles for each document. First we will do it the in-memory way. It is the same method as in the example above, except that instead of inline/lambda functions we use named functions.

In-memory processing
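The gist captioned above is not reproduced here; a rough, self-contained sketch of an in-memory version, with an illustrative folder path and helper names, could look like this:

```python
import os

def read_documents(folder):
    documents = []
    for name in sorted(os.listdir(folder)):
        with open(os.path.join(folder, name), encoding="utf-8") as f:
            documents.append(f.read())
    return documents

def generate_shingles(document, k=5):
    return {document[i:i + k] for i in range(len(document) - k + 1)}

documents = read_documents("data/docs")                   # whole corpus loaded at once
shingles = [generate_shingles(doc) for doc in documents]  # every shingle set held in RAM
print(len(shingles))
```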

What we see now is that RAM consumption increases steadily during the process. For a very large data set this will eventually flood the RAM and hang the process.

So let's extend the code with a pinch of PyTables support. With just a few modifications we can reduce memory consumption: roughly three lines at the start and one at the end.

With PyTables support
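Again, the gist itself is not shown here. Below is a sketch of how a pinch of PyTables might be added, with an illustrative file name and array layout: hashed shingles are appended to an on-disk variable-length array as each chunk is processed, instead of accumulating in a Python list.

```python
import tables

def generate_shingles(document, k=5):
    return {document[i:i + k] for i in range(len(document) - k + 1)}

documents = ["the quick brown fox jumps over the lazy dog"] * 100   # stand-in corpus

# ~3 extra lines at the start: open an HDF5 file and create an extendable array
h5file = tables.open_file("shingles.h5", mode="w")
store = h5file.create_vlarray(h5file.root, "shingles", tables.Int64Atom(),
                              "hashed shingles per document")

chunk_size = 10
for start in range(0, len(documents), chunk_size):
    for doc in documents[start:start + chunk_size]:
        store.append([hash(s) for s in generate_shingles(doc)])   # written to disk, not kept in RAM

# ...and one extra line at the end
h5file.close()
```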

Now when we observe the RAM usage, we can see that memory does not keep growing as before; it stays more or less steady. It does take more time, though, most likely due to the I/O operations.

See the snapshot below of memory consumption in the two cases.
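The snapshots are screenshots in the original post. If you want to take a comparable reading on your own machine, one option is to poll the current process's resident set size with psutil (an assumed extra dependency, not something the post itself uses):

```python
import os
import psutil

process = psutil.Process(os.getpid())
print(f"RSS: {process.memory_info().rss / 1024 ** 2:.1f} MiB")
```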

Conclusion

With a little modification to the code/approach, we are able to process a large data set without killing the system.

The source code can be found here. It shows how to iterate over an array and create a matrix for huge data on low-spec machines.

The MinHash implementation was adapted from the linked inspiration, modified to a functional style, and converted into this implementation.

Version 2 available here
