Optimized ways to Read Large CSVs in Python

Published in

Analytics Vidhya

5 min readJul 29, 2020

Hola! 🙋

In the current time, data plays a very important role in the analysis and building ML/AI model. Data can be found in various formats of CSVs, flat files, JSON, etc which when in huge makes it difficult to read into the memory. This blog revolves around handling tabular data in CSV format which are comma separate files.

Problem: Importing (reading) a large CSV file leads Out of Memory error. Not enough RAM to read the entire CSV at once crashes the computer.

Here’s some efficient ways of importing CSV in Python.

Now what? Well, let’s prepare a dataset that should be huge in size and then compare the performance(time) implementing the options shown in Figure1.
Let’s start..🏃

Create a dataframe of 15 columns and 10 million rows with random numbers and strings. Export it to CSV format which comes around ~1 GB in size.

df = pd.DataFrame(data=np.random.randint(99999, 99999999, size=(10000000,14)),columns=['C1','C2','C3','C4','C5','C6','C7','C8','C9','C10','C11','C12','C13','C14'])df['C15'] = pd.util.testing.rands_array(5,10000000)
df.to_csv("huge_data.csv")

Let’s look over the importing options now and compare the time taken to read CSV into memory.

PANDAS

The pandas python library provides read_csv() function to import CSV as a dataframe structure to compute or analyze it easily. This function provides one parameter described in a later section to import your gigantic file much faster.

1. pandas.read_csv()

Input: Read CSV file
Output: pandas dataframe

pandas.read_csv() loads the whole CSV file at once in the memory in a single dataframe.

start = time.time()
df = pd.read_csv('huge_data.csv')
end = time.time()
print("Read csv without chunks: ",(end-start),"sec")Read csv without chunks:  26.88872528076172 sec

This sometimes may crash your system due to OOM (Out Of Memory) error if CSV size is more than your memory’s size (RAM). The solution is improved by the next importing way.

2. pandas.read_csv(chunksize)

Input: Read CSV file
Output: pandas dataframe

Instead of reading the whole CSV at once, chunks of CSV are read into memory. The size of a chunk is specified using chunksize parameter which refers to the number of lines. This function returns an iterator to iterate through these chunks and then wishfully processes them. Since only a part of a large file is read at once, low memory is enough to fit the data. Later, these chunks can be concatenated in a single dataframe.

start = time.time()
#read data in chunks of 1 million rows at a time
chunk = pd.read_csv('huge_data.csv',chunksize=1000000)
end = time.time()
print("Read csv with chunks: ",(end-start),"sec")
pd_df = pd.concat(chunk)Read csv with chunks:  0.013001203536987305 sec

This option is faster and is best to use when you have limited RAM. Alternatively, a new python library, DASK can also be used, described below.

DASK

Input: Read CSV file
Output: Dask dataframe

While reading large CSVs, you may encounter out of memory error if it doesn't fit in your RAM, hence DASK comes into picture.

Dask is an open-source python library with the features of parallelism and scalability in Python included by default in Anaconda distribution.
It extends its features off scalability and parallelism by reusing the existing Python libraries such as pandas, numpy or sklearn. This makes it comfortable for those who are already familiar with these Python libraries.
How to start with it? You can install via pip or conda. I would recommend conda because installing via pip may create some issues.

pip install dask

Well, when I tried the above, it created some issue aftermath which was resolved using some GitHub link to externally add dask path as an environment variable. But why make a fuss when a simpler option is available?

conda install dask

Code implementation:

from dask import dataframe as ddstart = time.time()
dask_df = dd.read_csv('huge_data.csv')
end = time.time()
print("Read csv with dask: ",(end-start),"sec")Read csv with dask:  0.07900428771972656 sec

Dask seems to be the fastest in reading this large CSV without crashing or slowing down the computer. Wow! How good is that?!! A new Python library with modified existing ones to introduce scalability.

Why DASK is better than PANDAS?

Pandas utilizes a single CPU core while Dask utilizes multiple CPU cores by internally chunking dataframe and process in parallel. In simple words, multiple small dataframes of a large dataframe got processed at a time wherein under pandas, operating a single large dataframe takes a long time to run.
DASK can handle large datasets on a single CPU exploiting its multiple cores or cluster of machines refers to distributed computing. It provides a sort of scaled pandas and numpy libraries.
Not only dataframe, dask also provides array and scikit-learn libraries to exploit parallelism.

Some of the DASK provided libraries shown below.

Dask Arrays: parallel Numpy
Dask Dataframes: parallel Pandas
Dask ML: parallel Scikit-Learn

We will only concentrate on Dataframe as the other two are out of scope. But, to get your hands dirty with those, this blog is best to consider.

How Dask manages to store data which is larger than the memory (RAM)?

When we import data, it is read into our RAM which highlights the memory constraint.
Let’s say, you want to import 6 GB data in your 4 GB RAM. This can’t be achieved via pandas since whole data in a single shot doesn’t fit into memory but Dask can. How?
Dask instead of computing first, create a graph of tasks which says about how to perform that task. It believes in lazy computation which means that dask’s task scheduler creating a graph at first followed by computing that graph when requested. To perform any computation, compute() is invoked explicitly which invokes task scheduler to process data making use of all cores and at last, combines the results into one.

It would not be difficult to understand for those who are already familiar with pandas.

Couldn’t hold my learning curiosity, so happy to publish Dask for Python and Machine Learning with deeper study.

Conclusion

Reading~1 GB CSV in the memory with various importing options can be assessed by the time taken to load in the memory.

pandas.read_csv is the worst when reading CSV of larger size than RAM’s.
pandas.read_csv(chunksize) performs better than above and can be improved more by tweaking the chunksize.
dask.dataframe proved to be the fastest since it deals with parallel processing.

Hence, I would recommend to come out of your comfort zone of using pandas and try dask. But just FYI, I have only tested DASK for reading up large CSV but not the computations as we do in pandas.

You can check my github code to access the notebook covering the coding part of this blog.

References

Dask latest documentation
Book worth to read
Other options for reading and writing into CSVs which are not inclused in this blog.

Reading and Writing CSV Files in Python - Real Python

Watch Now This tutorial has a related video course created by the Real Python team. Watch it together with the written…

realpython.com

3. To make your hands dirty in DASK, should glance over the below link.

From chunking to parallelism: faster Pandas with Dask

When data doesn't fit in memory, you can use chunking: loading and then processing it in chunks, so that only a subset…

pythonspeed.com

Feel free to follow this author if you liked the blog because this author assures to back again with more interesting ML/AI related stuff.
Thanks,
Happy Learning! 😄

Can get in touch via LinkedIn.