Python for big data computation on a single computer

Mateo Restrepo
7 min read · Jul 16, 2018
Many workers crunching big data on a single computer

Is it possible to process big data on a laptop? Most people would answer with a blatant no. In this post, I would like to convince you that not only is it possible but, in some cases, it can be convenient.

But first, let’s get back to the basics. What does big data analysis entail? Well, according to one of the simplest and most accepted operational definitions, big data computation happens whenever you need to process a data set that doesn’t completely fit in the RAM of a single computer.

As per this definition, big data computation doesn’t have to be distributed among many machines. If you only have one machine and your dataset fits in its hard disk, but not in your RAM, you are facing a big data challenge. This situation calls for techniques from the area of out-of-core algorithms. The name stems from the fact that the only possible approach in this case is to load one chunk of data into memory, do something with them, store (intermediate) results to disk, load another chunk of data into memory, rinse and repeat.
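To make the "rinse and repeat" pattern concrete, here is a minimal sketch in plain Pandas; the file name, the column and the aggregation are made up purely for illustration, and the small partial results are kept in memory here rather than spilled to disk.

import pandas as pd

# Stream the file in fixed-size chunks so only one chunk is in RAM at a time.
partial_counts = []
for chunk in pd.read_csv("huge_checkins.csv", chunksize=1_000_000):
    # "Do something" with the chunk: here, a per-chunk partial aggregation.
    partial_counts.append(chunk["user_id"].value_counts())

# Combine the small partial results into the final answer.
total_counts = (
    pd.concat(partial_counts)
    .groupby(level=0)
    .sum()
    .sort_values(ascending=False)
)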

But wait! — you would say — writing and reading intermediate results to disk is very expensive and time-consuming. You're 100% right. In fact, a comparison of the time needed to read 1000 MB of data from different media goes as follows:

Transfer rates of 1000 MB to different media. Sources: [1],[2]

We see there is a difference of more than two orders of magnitude between the transfer rate of DDR main memory and that of a traditional hard disk. Note, however, that not all disks are created equal: a solid-state drive is only around a factor of 10 slower than memory.

But wait! — you would say — having to divide my data into chunks and then coordinate the writing and re-reading of intermediate results to and from disk seems like a daunting, tedious and error-prone task. Again, my dear reader, you are right! The good news is that, in the Python ecosystem, there already exist good-quality, high-level libraries that take care of all the nitty-gritty details for you. And the best thing is: their APIs are very similar to the one you are probably used to by now, namely Pandas's.

Enter Turi Create and SFrames

About two years ago I had my first contact with Graphlab Create, a closed-source Python library for easy exploration, visualization and development of machine-learning models from large data sets, both structured and unstructured. Despite its misleading name, Graphlab Create was pretty awesome. Its only problem: it required an expensive license to use. Dato, the company behind Graphlab Create, became Turi and was then acquired by Apple. This acquisition had two nice side effects: 1) Graphlab Create was re-branded as “Turi Create”, a more abstract yet less misleading name. 2) Turi Create was open-sourced. Most of Turi Create’s functionality is built on top of the SFrame class, which is very similar in concept to a Pandas DataFrame, except that it can operate out-of-core.

A small (but significant!) case study: stalker-stalkee pair detection

In the remainder of this article, I will report on my experience trying out Turi Create and SFrame on a tricky problem using real data.

The data

The data we will use for our experiment comes from the (now defunct) Gowalla social networking site. Two nice data sets from this site are available here. We will be looking at the bigger one, which contains the event log of “check-ins” of Gowalla’s users at a set of locations. This data set contains 6.44 million records, each describing a single check-in with just a few columns, of which we will pick only three: user_id, location_id and checkin_ts (the second-resolution timestamp of the check-in event). The total size of the data set in uncompressed plain-text format is just below 380 MB, which cannot be considered big data by today’s standards, but please read on…

The problem and its (theoretical) solution

We will use Turi Create to attack what could be termed the “stalker-stalkee detection problem” on this data set. In this problem, we are asked to identify pairs of users (E, R) that maximize the ‘stalking measure between E and R’. The stalking measure between E and R is defined as the number of distinct locations where there was ever a check-in by user E (the stalkEE) followed by a check-in by user R (the stalkER).

As you can see, this is not an ML problem or even a problem requiring sophisticated statistical techniques. It’s just a “data munging” problem requiring simple structured data manipulations.

In fact, here is what the solution to this problem looks like using Pandas:
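(What follows is a sketch rather than the exact original snippet: the file layout, the column names and the exact method calls are assumptions, reconstructed from the step-by-step description in the next section.)

import pandas as pd

# Load the check-ins; the file name and column layout are assumed here.
chins = pd.read_csv(
    "loc-gowalla_totalCheckins.txt", sep="\t", header=None,
    names=["user_id", "checkin_ts", "lat", "lon", "location_id"],
)[["user_id", "location_id", "checkin_ts"]]

# Index by location_id: one index value can refer to many check-ins.
chins_by_loc = chins.set_index("location_id")

# Self-join on the index: all pairs of check-ins at the same location.
chin_pairs = chins_by_loc.join(chins_by_loc, lsuffix="_ee", rsuffix="_er")

# Keep pairs where the stalkee's check-in strictly precedes the stalker's
# and the two users differ (ISO-8601 timestamps compare correctly as strings).
pairs_filtered = chin_pairs[
    (chin_pairs["checkin_ts_ee"] < chin_pairs["checkin_ts_er"])
    & (chin_pairs["user_id_ee"] != chin_pairs["user_id_er"])
]

# Count distinct locations per (stalkee, stalker) pair and take the top 10.
top_pairs = (
    pairs_filtered.reset_index()
    .drop_duplicates(["user_id_ee", "user_id_er", "location_id"])
    .groupby(["user_id_ee", "user_id_er"])["location_id"]
    .count()
    .sort_values(ascending=False)
    .head(10)
)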

However, trying to run this code on a laptop or PC with the amount of RAM that is usual these days (say, 16 GB) will result in a MemoryError exception…

But why is this a big data problem?

To answer this question, we have to go into a bit of the minutiae of what the code does. The first step is to index the check-ins by location_id (remember that in Pandas a single index value can refer to more than one row). This makes the following computation easier.

Then comes the tricky part: for each location, we want to consider all pairs of check-ins where the check-in timestamp of the first user in the pair strictly precedes that of the second user. The code thus generates chin_pairs, a data frame containing all pairs of check-ins for the same location, and then filters it to enforce the conditions just described, generating pairs_filtered.

This is the tricky part because it implies time and memory requirements that are quadratic per location: a single location with 10,000 check-ins already produces 100 million candidate pairs. And it is precisely here that the naïve Pandas solution blows up memory-wise. Sadly, or rather, interestingly, there is no way to get around this. When we run the out-of-core solution below, we will also get an exact measurement of how big the intermediate data sets in this computation are.

What we are witnessing here is a problem that falls in the category of big data not because its input data set is big, but because an intermediate step can overflow our machine’s memory.

Turi Create to the rescue

We will try out Turi Create to implement and run the same algorithm. Luckily, Turi Create’s designers mimicked a good part of Pandas’s API and the translation is almost direct!

First, we load the data set directly from the tab-delimited text file and select the three columns of interest:

Loading and selecting some columns with turicreate.SFrame
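The snippet itself is not reproduced here; a minimal sketch of what it can look like with turicreate, assuming the same tab-delimited layout and column order as in the Pandas sketch above:

import turicreate as tc

# Read the tab-delimited file; with header=False, columns come in as X1..X5.
chins = tc.SFrame.read_csv(
    "loc-gowalla_totalCheckins.txt", delimiter="\t", header=False
)
chins = chins.rename({
    "X1": "user_id", "X2": "checkin_ts",
    "X3": "lat", "X4": "lon", "X5": "location_id",
})

# Keep only the three columns we need; the SFrame is backed by disk, not RAM.
chins = chins[["user_id", "location_id", "checkin_ts"]]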

Next, we generate the pairs of check-ins that satisfy the conditions of our detection algorithm. Along the way, we rename the columns in the data frame of pairs to identify the columns containing potential stalkee and stalker ids.

Construction of candidate stalkee-stalker check-in pairs.
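As before, the original snippet isn't shown; here is a sketch under the same naming assumptions:

# Two renamed "views" of the check-ins: potential stalkees and stalkers.
chins_ee = chins.rename({"user_id": "stalkee_id", "checkin_ts": "ts_ee"})
chins_er = chins.rename({"user_id": "stalker_id", "checkin_ts": "ts_er"})

# Inner join on location_id: every pair of check-ins at the same location.
chin_pairs = chins_ee.join(chins_er, on="location_id", how="inner")

# Keep pairs where the stalkee checked in strictly before the stalker and
# the two users differ (ISO-8601 timestamps compare correctly as strings).
pairs_filtered = chin_pairs[
    (chin_pairs["ts_ee"] < chin_pairs["ts_er"])
    & (chin_pairs["stalkee_id"] != chin_pairs["stalker_id"])
]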

The rest is smooth sailing.

Deduplicating, grouping, counting and selecting top pairs by unique location count.
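A sketch of this final step, continuing with the names above (k=10 is an arbitrary choice):

# One row per (stalkee, stalker, location) triple, so each location is
# counted at most once per pair.
unique_triples = pairs_filtered[
    ["stalkee_id", "stalker_id", "location_id"]
].unique()

# Count distinct locations per pair and keep the top pairs.
top_pairs = unique_triples.groupby(
    ["stalkee_id", "stalker_id"],
    {"num_locations": tc.aggregate.COUNT()},
).topk("num_locations", k=10)

print(top_pairs)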

Running the code above prints the top stalkee-stalker pairs by unique-location count.

And voilà! The whole thing ran to completion with no memory error. The total computation time for this script on a 2-core machine with just 4 GB of RAM and a 16 GB SSD, running Anaconda Python 3.6 on Ubuntu, was 1703 seconds, i.e. slightly below half an hour.

Measurements of the size of the intermediate results (using len()) yield the following:

Number of records for initial and intermediate datasets

Notice that each record in the data frame of pairs takes up 56 bytes: 16 bytes for the two user_ids, 8 for the location_id, 16 for the two check-in timestamps, and (at least?) 8 bytes of overhead implied by the Pandas index. Thus, the first intermediate result, chin_pairs, would take up at least 56 B × 562 million ≈ 31e9 B, or about 30 GB. This clearly explains the MemoryError exception we got with Pandas.

A note on alternatives

There is at least one very promising alternative to Turi Create out there, called Dask. Unfortunately, Dask hasn't even reached version 1.0, and in a preliminary experiment I wasn't able to reproduce the result above with it. Apparently, it requires careful fine-tuning of some configuration parameters, which was beyond my humble infrastructure-configuring abilities. If I ever figure out how to make Dask work for this problem, I will certainly write a post about it.

Conclusion

In this post we saw that doing big data computation on a single, regular-sized computer makes sense and is feasible by employing out-of-core algorithms. Further, the use of high-level and mature libraries such as Turi Create in Python makes it convenient. If you have an SSD, you can even do it efficiently for moderately sized datasets, say, less than 100 GB.

