4 Easy Tips for Working with Multi-CSV Datasets in Python

Jamie Winger
Published in Analytics Vidhya
14 min read · Feb 18, 2021

Has someone ever handed you a dataset sliced into dozens (or even hundreds) of .csv files? Or maybe you’d like to try your hand at your first Kaggle competition, but that cool dataset comes in an endless list of smaller datasets. Even worse, by the time you’ve managed to read in all those files one-by-one, your data science friends have moved on to the next hot competition and you are left behind, frustrated and alone, clutching an unused machine learning model you learned about on YouTube.

Yes, before you’ve even started your project, opening a zip folder for the first time and discovering a ton of .csv files can leave you feeling like this:

Fear not! Having a dataset with more .csv files than the number of beets on Dwight Schrute’s farm (for non-fans of The Office, that’s a lot) does not have to be a permanent roadblock. With just a few tips that are simple for any beginning data scientist to implement, you can overcome these challenges and learn approaches that will serve you well as you level up throughout your Data Science career.

Why Multi-CSV Datasets Are Challenging (Especially for Beginners)

Datasets in this form pose a few practical challenges when working with them in Python:

  1. They are often large, so memory constraints can be an issue.
  2. Reading in 100+ .csv files without some automation is time-consuming and inefficient.
  3. Unforeseen errors while building data frames with tens of millions of rows can waste a lot of valuable time.
  4. When things do work properly, having to rebuild that enormous data frame every time we restart our program costs time and memory.

At the start of my data science career, these challenges were true barriers to getting started with my first few projects. These 4 easy tips for working with large, multi-CSV datasets in Python will help you overcome them, so you can easily build and store your data frames and then get to work.

For this explainer, I will be using the Stock Market India Dataset available for download here on Kaggle.com. It consists of 159 .csv files, and each file is a minute-by-minute time series of a different stock on the NSE starting on January 1st, 2017. The dataset is clean and has a ton of files to work with, which is what we want. Click here for the complete step-by-step Jupyter Notebook.

Tip 1: Connect an external hard drive to Jupyter Lab (for macOS)

Let’s assume that as new data scientists, we’ve made the wise decision to purchase an external hard drive so as to not unnecessarily fill up our local hard drive with all the fascinating and large datasets we will encounter throughout our storied data career.

For this project, we’ve downloaded the entire India Stock Exchange Kaggle file (6.2 GB) onto our new 1 TB external hard drive and are ready to work our wonders on these .csv files with a Notebook in Jupyter Lab. Here’s a screenshot of our Jupyter Lab file directory:

Hmm… where is our external hard drive? How can we work with our dataset if we can’t connect our external hard drive to Jupyter Lab’s file directory? Enter: terminal shell!

On macOS, you can access your terminal by clicking the Launchpad icon in your Mac’s dock and typing terminal in the search bar:

Click the terminal icon, and you’ll see a terminal shell that looks like this:

If you’ve never worked with a terminal shell before, it offers a lot of advantages beyond the scope of this article. A quick Google search turns up plenty of YouTube videos for familiarizing yourself with the techniques and advantages of the terminal. Here is just one example I found.

Now that we have our terminal shell open, we can create a direct connection from our external hard drive to Jupyter Lab. To help visualize what we are doing, check out the screenshot below of my file navigator. In essence, I am going to tell my terminal to open Jupyter Lab from the location “Backup Plus” (which is the name of my particular external hard drive — yours will probably be different) instead of opening from the default location which contains those folders visible under the “Favorites” section.

Okay, here it goes:

  1. First, in the terminal, type cd /Volumes, which changes your working directory to the Volumes folder where macOS mounts all drives (including external ones).
  2. Next, type ls, which will show you the names of all mounted volumes.

In the screenshot below you can see Macintosh HD, the default volume we just exited. You can also see Backup Plus, my external hard drive, which is the volume we want to switch to. Almost there!

To switch to Backup Plus, let’s use our change directory command again and simply type: cd Backup\ Plus/

NB: A useful shortcut to make sure you are typing the directory path correctly is to use the “tab complete” feature in the terminal shell. Instead of physically typing “Backup\ Plus/” and fumbling with the right direction of the slashes, what I actually do is just type the first letter of my desired drive “B” and then press the “tab” button, which automatically fills in the rest of the directory path in the correct format.

Of course, your directory names and pathways will differ from mine, but the same shortcut will work on any terminal shell and speeds things up considerably, especially when navigating through several folders in a directory. To keep things simple, however, I’ve kept my path very short and this is what we end up with:

Okay, now we are operating out of the desired directory. We just need to open Jupyter Lab. To do this, simply type jupyter lab in the terminal. The terminal lights up with text, and Jupyter Lab automatically opens in a browser tab.

There we go! All the files on our external hard drive (including our India Stock Exchange folder containing our dataset) are now visible in our file directory in Jupyter Lab. Success!

Tip 2: Use the glob module to create a custom list of files to read in from the dataset.

Now that Jupyter Lab can talk to our external hard drive and our dataset, we need to figure out an efficient way to read all these files into a pandas data frame. Let’s take a closer look at our India Stock Exchange files using some basic directory navigation commands.

pwd displays your current directory and cd allows you to change to a specified new directory, just like it did in the terminal. Use that “tab complete” feature we discussed earlier to produce the correct directory names. It works the exact same way in Jupyter cells as it does in the terminal shell. Once you’ve navigated to the folder with your .csv files, type ls to list all the .csv files in the folder.

Here’s a screenshot of my directory navigation commands as an example. Your exact navigation path will, of course, differ depending on your file organization and directory names:

The output displays a very long list of .csv files, 159 to be exact. Getting a look at the entire list so we can notice patterns in the file naming conventions will come in handy for what we do next.

Before we go any further, we should try to read in a single .csv just to get a glimpse of our data and confirm everything looks okay. Let’s use pandas to do a quick read-in of a random file in our dataset (again, your exact file path will differ).

import pandas as pd

df = pd.read_csv('/Volumes/Backup Plus/India Stock Exchange/FullDataCsv/NIFTY_100__EQ__INDICES__NSE__MINUTE.csv')
df

Here is our output:

Everything looks good, but if we want to load our entire dataset into a pandas data frame, we are going to need a more efficient method to read in all these files. Thankfully, we can use the glob function from Python’s built-in glob module to create a customized list of file paths matching a pattern that we specify.

To test this out, let’s try making a glob of a subset of files from our data. If we go back and look at the list of all the files in our FullDataCsv Folder, we notice that there is a handful of .csv files representing different NIFTY indices.

Let’s grab those. To do this, we can specify the pathname with a wildcard pattern using the asterisk symbol (*).

Essentially, we are telling Python to grab all files within the specified path that have the string ‘NIFTY’ in the filename.

Here’s how it works:
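
A minimal sketch of that glob call (the nifty_files variable name is my own, and the path mirrors the read_csv example above; yours will differ):

# Importing the glob function from the glob module
from glob import glob

# Grab every file in the folder whose name contains the string 'NIFTY'
nifty_files = glob(pathname='/Volumes/Backup Plus/India Stock Exchange/FullDataCsv/*NIFTY*')
nifty_files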

Our output is a list of 9 NIFTY files in our glob:

With this list of file paths stored conveniently in one variable, we can pass that variable into a simple algorithm that uses a for-loop to read in and append each file to a pandas data frame as it iterates over the list of pathnames.

Notice that in this loop I am also taking the “filename” in each iteration, stripping off the pathname and the “.csv” extension so that I am just left with a string that represents the name of the stock or index represented by each individual file. I then create a new column that is populated by the stripped name of each file. This keeps the stock and index names attached to their corresponding minute-by-minute time series when they are all appended together in the larger data frame, preserving the analytical value of the dataset. If all the .csv files represented a time series for just one stock over several years, for example, you wouldn’t need to take this step because you’d know all the data is from one stock.
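
Here’s a minimal sketch of a loop along those lines, assuming the nifty_files glob from above; the list-then-concat pattern, variable names, and column name are my own choices rather than the exact original code:

import os
import pandas as pd

# Build one data frame per file, then concatenate them all at the end
df_list = []
for filename in nifty_files:
    # Strip the path and the '.csv' extension, leaving just the stock/index name
    stock_name = os.path.basename(filename).replace('.csv', '')
    temp_df = pd.read_csv(filename)
    # New column so each row stays tagged with its stock or index name
    temp_df['stock_or_index_name'] = stock_name
    df_list.append(temp_df)

nifty_df = pd.concat(df_list, ignore_index=True)
nifty_df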

Here’s our output (with our new “stock or index name” column):

Great! Using glob, we have managed to read in 9 .csv files from our dataset, creating a pandas data frame with almost 3 million rows in a few seconds. We can use the same glob method to read in the entire dataset using the exact same algorithm we used above. Should we use that exact algorithm though? Answers after the jump…

Tip 3: When building a data frame with all your .csv files, write algorithms that sample first, fail fast, and use print statements and timestamps to debug before scaling.

Let’s say, without my knowledge, someone planted a junk file in my folder of .csv files.

In reality this could be a file in a different format that accompanies the dataset, maybe a data dictionary or some other supplemental info. In a list of 150+ files and less obvious naming than “EVIL_FILE,” such a thing could be easy to miss. While this is just a toy example, there are a million different reasons why the algorithm we used above might encounter problems with an individual file in our glob. It’s worth demonstrating this real quick.

(If you are coding along with me, feel free to plant your own EVIL_FILE in the India Stock Exchange FullDataCSV folder. Otherwise, just sit back and witness the horror show this problem file causes.)

Say we don’t know our EVIL_FILE exists, and we try to use the same algorithm we used to build our NIFTY data frame, except now we use a glob that grabs all the files in our dataset folder. From looking at the complete file list again, it looks like all the file names have a double underscore. Let’s use that to create our glob. It works the exact same way as before, we just use the double underscore in the wildcard instead of NIFTY.
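
As a sketch, the only thing that changes is the wildcard pattern (the path is again mine; the all_files variable name matches the one referenced below):

# Same call as before, just a different wildcard: grab every file
# whose name contains a double underscore
all_files = glob(pathname='/Volumes/Backup Plus/India Stock Exchange/FullDataCsv/*__*')
len(all_files)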

Cool, now we have our glob variable again. This time that variable is storing a list of all 160 files. Let’s charge forward and read in all our files with no other modifications to our algorithm.

Yikes! Here is our output:

In this case, we know the source of our error is the EVIL_FILE. However, if we actually had a problem file that we didn’t know about, it might not be immediately clear from this error output what went wrong. The most important takeaway from this little detour is that my code ran for over an hour before outputting this error. That’s an expensive fail for us busy data scientists.

So we don’t repeat the same mistake, let’s add a few easy modifications to our code to muscle up this algorithm.

First, let’s create a basic timer function so we can track how long each iteration of our for-loop takes. We will use the datetime class from Python’s built-in datetime module. This is a great tool for monitoring our algorithms so we can better understand what's happening under the hood as our code is running.

# Importing the datetime class from the datetime module
from datetime import datetime

# Defining our timer function, which we will use below
def time_elapsed(start_time):
    return datetime.now() - start_time

We will also use our same glob with the double underscore wildcard that we used before, except this time we are going to pass that glob variable into a new, more robust algorithm.

Notice that this code is almost exactly the same as the algorithm we used the first time we tried to read in all 160 .csv files, except for 3 powerful additions (sketched in code after the list below):

  1. At the beginning of our code, we’ve defined a “start” time variable paired with a print statement and a “file_count” variable set to the len of our “all_files” glob. These variables will both come in handy for tracking our code’s progress.
  2. We’ve added print statements corresponding to key lines in our for-loop so we can track the same 4 operations every time our for-loop iterates. Notice we’ve populated the print statements with our “time_elapsed()” function we defined above the for-loop. Also, notice that in the second print statement (line 16), we inserted the “filename” variable, so that we will know exactly what file we are reading in with each iteration. This will tell us exactly when our code hits our problem EVIL_FILE.
  3. Finally, we’ve activated a very powerful parameter in the “pd.read_csv()” section of our code: the “nrows” parameter (line 17). We’ve set the nrows parameter to 10,000, which means that our algorithm will only read in 10,000 rows from each .csv file. Recall that the first .csv file we read in had over 370,000 rows. By limiting the row number to only 10,000, we will greatly speed up our algorithm as it reads in our 160 files, which should help us discover our problem EVIL_FILE file much faster.

Okay, let’s give this a try and take a look at our output.

See all those beautiful print statements and timestamps? They allow us to see exactly where the error happened (we see you, EVIL_FILE). Furthermore, because we sampled only 10,000 rows from each file, our algorithm only took 34 seconds to fail instead of an hour. With just a few simple additions to our code, we’ve cut the time it takes to discover and diagnose a problem by orders of magnitude. Sampling, print statements, and timestamps FTW!

In a real life situation, depending on the content of the problem file, we could simply move the file outside our target folder or refashion our glob to skip that file. Let’s take the second approach. It turns out all the dataset files we want have the string MINUTE, so let’s use that for our glob wildcard…

all_correct_files = glob(pathname='/Volumes/Backup Plus/India Stock Exchange/FullDataCsv/*MINUTE*')

…and pass our new glob variable into our modified algorithm. This time, the only change we are making is deactivating the nrows parameter, since we are confident we’ve solved the issue. (If we wanted to be diligent, we could run the code with 10,000 row sampling again to confirm everything works. That would produce a sample data frame of 1,590,000 rows, which is what we would expect from 159 files, but we are going to skip that step for now.)

Now that we have eliminated our problem, we can run our algorithm and confidently walk away while our code runs for an hour (binge watching a couple episodes of our favorite workplace comedy set in Scranton, perhaps). If we get curious, we can always stop by and glance at our screen to check out our print statements and the code’s progress.

Here’s the tail end of our output and a look at our complete data frame:

Wow! Our code executed without error, and we now have a data frame containing our entire dataset, which is almost 60 million rows. We are now free to manipulate and analyze our data in whatever way we desire.

Tip 4: Save large data frames in parquet files for quick and easy access.

What a journey! After all that, wouldn’t it be nice to save our data frame in a way where we can always access it quickly and easily without having to spend an hour reading it into Python again? Saving the data frame into a parquet file is a great solution. Here’s how it works:
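
A minimal sketch of the save step, assuming a hypothetical full_df variable holding the complete data frame and a made-up output path (pandas needs pyarrow or fastparquet installed for parquet support):

# Save the full data frame as a single parquet file
# ('full_df' stands in for whatever your complete data frame is called)
full_df.to_parquet('/Volumes/Backup Plus/India Stock Exchange/full_stock_data.parquet')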

When we read the parquet back into Python, we have the exact same data frame:
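
Reading it back in is just as simple (same hypothetical path and variable name as above):

# Read the parquet file straight back into a pandas data frame
full_df = pd.read_parquet('/Volumes/Backup Plus/India Stock Exchange/full_stock_data.parquet')
full_df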

Saving large data frames to parquet files is also a great way to manage storage constraints. Let’s make a .csv file of the exact same data frame and compare the file sizes.
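
Here’s a sketch of that comparison using os.path.getsize, again with hypothetical paths:

import os

# Write the same data frame out as a .csv for comparison
full_df.to_csv('/Volumes/Backup Plus/India Stock Exchange/full_stock_data.csv', index=False)

# Compare the two file sizes in gigabytes
parquet_size = os.path.getsize('/Volumes/Backup Plus/India Stock Exchange/full_stock_data.parquet')
csv_size = os.path.getsize('/Volumes/Backup Plus/India Stock Exchange/full_stock_data.csv')
print(f'parquet: {parquet_size / 1e9:.2f} GB | csv: {csv_size / 1e9:.2f} GB')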

Notice that the parquet file is less than a quarter the size of the .csv file. When we are talking about huge datasets that are greater than 10 or 20 GB, using parquets to store our data frames can yield huge savings!

Conclusion

Michael is happy again and ready to take on these .csv files! Let’s quickly recap what we’ve covered in order to efficiently build and store our Stock Market India dataset which came to us in the form of 159 .csv files:

  1. First, we used our terminal shell to run Jupyter Lab directly from our external hard drive. This freed us from the storage constraints of keeping our dataset on our local hard drive.
  2. Next, we used the glob package to define a variable that stored a list of files that we specified using glob’s wildcard feature. This allowed for automated reading of our .csv files into a pandas data frame.
  3. Then we discussed the benefits of beefing up our read-in algorithm with the “nrows” parameter to allow for sampling and adding print statements and timing functions to help us see our code’s progress, fail fast, diagnose, and debug before scaling. Then we walked away and let our code build a data frame of all 159 Stock Market India files.
  4. Finally, we used parquets to save and store our data frame (using less space than a traditional .csv), and now we can read in our entire Stock Market India dataset in just a few minutes whenever we need it.

With these tips, you are now ready to explore and conquer ever larger and new datasets. Happy adventuring!

Click here for the complete Jupyter Notebook with all code.

Say hi on LinkedIn.


Jamie Winger
Analytics Vidhya

Currently studying Data Science at Loyola University Chicago's School of Business. Former director of an urban farming job training program. Eternal BSG fan.