Making Sense of Big Data
The Obstacle of Big Data
Techniques for handling 100+ million data points
While surfing through the open data platforms for New York City, San Francisco, and Seattle, in search of a time series project that struck my fancy, I stumbled upon a behemoth — a dataset so large that I questioned my (and my computer’s) ability to handle it.
A bit of background: naturally, being what my friend sardonically refers to as a bookhead, I was drawn to library data, and I began to wonder if I could build a project around forecasting future library activity, i.e., the number of checkouts across an entire library system. To my dismay, neither the New York Public Library (NYPL) nor the San Francisco Public Library (SFPL) had any data regarding checkouts; but data ho! Seattle Public Library (SPL) did. And did it ever.
Enter the behemoth
The dataset in question, entitled Checkouts by Title (Physical Items), has the following description:
This dataset includes a log of all physical item checkouts from Seattle Public Library. The dataset begins with checkouts occurring in April 2005. Renewals are not included.
While it feels like so long ago (but what is time, really?), I seem to recall a sort of willful ignorance, not really considering just what it meant that this dataset was started all the way back in 2005. But on that fateful day — December 15, 2020 — I clicked Download, then CSV, and found myself waiting several hours for a 26.93 GB file to download. An eye-opening number, and certainly the largest dataset I have yet encountered.
May I see it?
Not entirely sure what I would find, I plopped the CSV file in my project’s Data folder and tried loading it into a Pandas DataFrame. I wrote an Amphetamine prescription for my computer and walked away, assuming it would take a while to load. A while was an understatement; an indeterminate amount of time later (at least an hour, maybe two, maybe more?), and with my computer’s fans intermittently at full blast, the file had still not loaded. Ah, I was so much younger then; I didn’t even think to keep track of exactly when I had started running the cell.
I have since built a function called status_update, which takes in a message as a string and prints out a (you guessed it) status update with the current timestamp. This has been a lifesaver for any code that takes even a non-negligible amount of time to run. There aren’t many feelings worse than uncertainty when it comes to working with large datasets or complex code. Incorporating time trackers and quality checks into my pipeline has been crucial for this project thus far, and these are habits that will forever stay with me in future projects.
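A minimal sketch of such a helper might look like this (the exact implementation details here are illustrative, not my verbatim code):

from datetime import datetime

def status_update(msg):
    # print a message alongside the current timestamp
    print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] {msg}")

# example usage
status_update('Beginning to load the raw CSV...')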
Armed with my status updater, I let my computer run all night, and in the morning, I was still in the dark, with no Pandas DataFrame in sight.
Creating a peephole
After I finally realized the nature of the beast with which I was dealing, it dawned on me that I would need to peek at a small chunk of the data and build a data transformation pipeline around that. I took to the Pandas documentation to see if there was any way I could limit the number of rows and columns that I loaded from the raw CSV. Of course, there is! Seeing as this dataset ends up being a full three orders of magnitude larger than any I had previously worked with, I hadn’t really needed to use the nrows, usecols, or skiprows arguments in the past. Below is an example of the code I first used to peep into the data.
import pandas as pd

# load data
df = pd.read_csv(
    'data/Checkouts_By_Title__Physical_Items_.csv',  # file path
    nrows=1000000  # number of rows to load
)
Since I hadn’t yet seen the data, I wasn’t sure which columns I might consider not loading; eventually, however, I would make use of the usecols argument, which allows you to choose which columns to load from the source. There is also a skiprows argument, which you can use if you have a specific chunk of the data you want to load. In this case, I just wanted to see something, so the first million rows, with all the columns, would do.
Luckily, I immediately noticed a fair number of superfluous columns, and knew that by not loading those, I could decrease my load time of all the rows considerably. Columns I didn’t load include:
- ID — a generated identification number with metadata present elsewhere within the data.
- CheckoutYear — information already contained within the more necessary column, CheckoutDateTime.
- BibNumber, CallNumber, and ItemBarcode — unique numbers per item, which could be helpful in determining which editions of items are most popular, but I more simply wanted to look at the most popular books/movies/items in general, regardless of edition. For this, I can use the ItemTitle column.
- ItemType — a potentially useful column when paired with the data dictionary, but one that contains less information than the Collection column.
And so, from the original 10 columns, I chose only 4 — Collection, ItemTitle, Subjects, and CheckoutDateTime. I can then amend my previous code before building the rest of my pipeline.
# list of columns to load
usecols = ['Collection', 'ItemTitle', 'Subjects', 'CheckoutDateTime']

# load data
df = pd.read_csv(
    'data/Checkouts_By_Title__Physical_Items_.csv',  # file path
    nrows=1000000,   # number of rows to load
    usecols=usecols  # specify columns to load
)
Spoiler alert: by specifying these 4 columns, my load time of all 106,581,725 rows decreased from untold hours to 15 minutes, a drastic improvement, albeit still a decent chunk of time for loading raw data.
Bypassing the behemoth with a pipeline
Before actually loading all the data, it was necessary to use this smaller million-observation chunk to construct a pipeline for all of the necessary data transformations. Below is a brief overview of the processes in my pipeline (a rough code sketch of the first couple of steps follows the list):
- check for any NaN values
- convert the CheckoutDateTime column from a string to a datetime object (necessary for any time series project)
- merge item information from the data dictionary
- drop any unnecessary columns
- transform any values that can be consolidated into more appropriate categories
- save the cleaned dataset
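As a rough illustration, the first couple of steps might look something like this (the column name comes from the dataset; everything else is a sketch rather than my exact pipeline):

# check for any NaN values
status_update('Checking for NaN values...')
print(df.isna().sum())

# convert CheckoutDateTime from string to datetime
# (passing an explicit format string, if known, can speed this up)
status_update('Converting CheckoutDateTime to datetime...')
df['CheckoutDateTime'] = pd.to_datetime(df['CheckoutDateTime'])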
I knew that one of the biggest tasks would be merging in data from the data dictionary, which contains information that is mainly important for exploratory data analysis (EDA), such as:
- Format Group — whether an item is Print, Media, Equipment, or Other
- Format Subgroup — a more specific categorical with 15 categories, the top three being Book, Video Disc, and Audio Disc
- Category Group — generally whether or not an item is Fiction or Nonfiction, although other categories are included
- Age Group — whether an item is considered Adult, Teen, or Juvenile
All of these columns contain string objects, which can be very memory- and time-intensive to parse through. After some research I found a solution via conversion to a Pandas Categorical object.
Converting strings to Categorical
To save memory and time during data manipulation and subsetting, I used the Categorical class in Pandas. To read more on how the Categorical class improves memory usage (among other benefits), refer to the documentation. The basic idea is that instead of one of my columns containing 100+ million strings, I can have n number of categories. In my case, the above columns’ values for n ranged from 3 to 15, notably fewer than 100+ million.
Spoiler alert: I have since calculated the size of each column as Categorical data and string data and found that the Categorical data were exactly an eighth (12.5%) the size of the string data! Accounting for all four of these columns, I saved over 3GB in memory and a significant amount of time during the merge process, as well as during my data transformation process and subsequent EDA.
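In code, the conversion and the size comparison can be sketched roughly as follows (I’m assuming the merged columns keep the data-dictionary names listed above; the actual names may differ):

# convert the merged string columns to Categorical and compare memory usage
cat_cols = ['Format Group', 'Format Subgroup', 'Category Group', 'Age Group']

for col in cat_cols:
    as_string = df[col].memory_usage(index=False, deep=True)    # bytes as strings
    df[col] = df[col].astype('category')
    as_category = df[col].memory_usage(index=False, deep=True)  # bytes as Categorical
    print(f'{col}: {as_string / 1e6:,.1f} MB -> {as_category / 1e6:,.1f} MB')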
Everything is i̶l̶l̶u̶m̶i̶n̶a̶t̶e̶d̶ elongated
With my pipeline more or less complete (one must always leave room open for further optimization and improvement), I was ready to load in all of the necessary data and send it through the pipeline. It quickly became clear that dealing with Big Data doesn’t just end with how long it takes to load in the raw data; everything takes longer with a dataset of this magnitude. Even checking for NaN values took me almost four minutes. Converting the date column from a string to a datetime object took over six minutes.
Merging data from the data dictionary was surprisingly quick at four minutes, but dropping the columns necessary for the merge (but useless afterward) took a whopping 45 minutes. I was unable to find a faster way of doing this natively within Pandas, although there may be ways of optimizing it, which I discuss below. All other data transformations, thanks to NumPy, were nearly instantaneous.
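For context, the merge-and-drop step looks roughly like this (the data dictionary file name and its key column below are placeholders, not the exact names in my pipeline):

# merge item information from the data dictionary, then drop the merge keys
# (file name and 'Code' key column are placeholders)
data_dict = pd.read_csv('data/data_dictionary.csv')

status_update('Merging data dictionary...')
df = df.merge(data_dict, how='left', left_on='Collection', right_on='Code')

status_update('Dropping merge-only columns...')
df = df.drop(columns=['Code'])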
Which brings me to the final hurdle…
The problem with saving
Perhaps the biggest roadblock I faced, and something I need to do more work on, is actually saving my final cleaned and transformed dataset. For a variety of reasons, including stability and memory savings, I settled on saving the dataset as a compressed Pickle. Still, even though my pipeline worked just fine on all other steps, I ran into issues saving the data in one bunch: memory was still being exhausted, and my kernel kept crashing on the last, crucial step.
My solution for now has been to save it in batches of 10 million rows, for a total of 11 files. The very large downside is that this process takes around 8 hours! The upside is that my data, previously about 27 GB, is now clean and totals only 3 GB. Loading in the now-clean data still takes a while (~20 minutes), but I view that as time to make some coffee and eat a quick breakfast.
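In rough terms, the batched save looks something like this (file names and paths are illustrative):

# save the cleaned DataFrame in batches of 10 million rows as compressed pickles
batch_size = 10_000_000

for i, start in enumerate(range(0, len(df), batch_size)):
    batch = df.iloc[start:start + batch_size]
    batch.to_pickle(f'data/clean/checkouts_clean_{i:02d}.pkl.zip', compression='zip')
    status_update(f'Saved batch {i + 1}')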
An important note
I should note that this very large dataset, which takes a really long time to save and a fair amount of time to load, is, in a sense, optional. It is only important for the exploratory data analysis portion of the project, which, to me, is very important indeed. The actual time series data, however, can be created nearly instantaneously (after the aforementioned data transformations) by dummying and summing up any of the category columns you want, as well as the all-important target variable of total_checkouts. This data can fit into a Pickle file totaling a whopping 277KB.
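As a sketch, building such a daily series from the cleaned data might look like this (the category column used here is just one example, and its post-merge name may differ):

# build a daily time series by one-hot encoding a category column
# and summing the dummies per day
daily = (
    pd.get_dummies(df.set_index('CheckoutDateTime')['Format Group'])
      .resample('D')
      .sum()
)
daily['total_checkouts'] = daily.sum(axis=1)  # total checkouts per day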
Always more to optimize
Depending on your use case, the significant amounts of time it takes to perform what are usually quick tasks (when dealing with smaller data) could be unacceptable. For my personal project, after building a data transformation pipeline with the first million rows, it was no big deal to run the script on the entire raw dataset overnight and save a cleaned DataFrame that can now be accessed with ease.
Much of this “excess time” can be attributed to the fact that Pandas only uses a single core when processing data. I’m intrigued by a library called Modin, which can parallelize your data and make use of all your computer’s cores. Unfortunately, it is not currently functional with the latest version of Pandas, so downgrading may be necessary. Dask and PySpark are other tools I’m considering, although early experiments with Dask have shown a negligible (and sometimes zero) increase in performance during data transformation. Checking for NaN values, for instance, takes just as long.
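In principle, Modin is meant to be a drop-in replacement, so the change would be as small as the import (subject to the version caveats above):

# Modin aims to parallelize Pandas operations across all CPU cores;
# ideally, only the import changes
import modin.pandas as pd

usecols = ['Collection', 'ItemTitle', 'Subjects', 'CheckoutDateTime']
df = pd.read_csv('data/Checkouts_By_Title__Physical_Items_.csv', usecols=usecols)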
As of now, I’ve got my cleaned data to run with, but in the future, it will still be beneficial to make my pipeline more efficient. Working through these issues so far has taught me that there is always more work to be done and that not all beasts can be tamed (or slain).
Visit my project’s repository
While I’m still in the early(ish) stages, the process detailed in this post, as well as my EDA so far, can be seen in my project’s repository. As a bonus, I recently performed my first API calls for the project in order to obtain the remaining data for the year of 2020, which I then ran through my pipeline and cleaned with ease.