Reading Large Datasets Using pandas
In data science, we often come across scenarios where we need to read a dataset that is larger than our system's memory. In such cases the system runs out of RAM while loading the data, which can lead to a kernel shutdown in a Jupyter notebook or even a system crash. To avoid such scenarios there are some pretty good techniques that help us read large datasets.
For any data scientist, the obvious choice of Python package to read a CSV file would be pandas. In this blog, we will be unraveling some cool techniques to read such datasets.
Before we dive deep into these techniques, let us get our environment set up and ready:
conda install -c conda-forge jupyterlab
conda install -c anaconda pandas
jupyter notebook
Data
I am using the dataset from the PUBG Finish Placement Prediction (Kernels Only) Kaggle competition. We will be using the training set for our analysis. It consists of about 4.4 million rows, which add up to roughly 700 MB of data!
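Before loading anything, it can be handy to confirm the file's size and row count without reading it into memory. Here is a minimal sketch, assuming the file is named train_V2.csv and sits in the working directory:
import os
# size of the CSV on disk, in MB
print(os.path.getsize('train_V2.csv') / (1024 ** 2))
# count the rows without loading the file into memory (minus 1 for the header line)
with open('train_V2.csv') as f:
    print(sum(1 for _ in f) - 1)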
Methods
Using the normal pandas method to read the dataset
>>>> pd.read_csv('train_V2.csv')
This is the standard way to read a CSV file. Now let us look at the time taken by this piece of code to execute, using IPython's %time magic command:
>>>> %time pd.read_csv('train_V2.csv')
CPU times: user 16.5 s, sys: 1.33 s, total: 17.8 s
Wall time: 27.5 s
This took about 27 seconds to read our training set. We can still increase the speed and reduce the wall time.
To measure the amount of memory consumed by this method, we can use the %memit magic command, which comes with the memory_profiler extension for IPython:
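memory_profiler is a separate package, so if the magic is not available yet, it has to be installed and the extension loaded first. A possible setup (the conda-forge channel is an assumption; pip install memory_profiler works too):
conda install -c conda-forge memory_profiler
and then, inside the notebook:
>>>> %load_ext memory_profiler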
>>>> %memit pd.read_csv('train_V2.csv')
peak memory: 3085.73 MiB, increment: 3001.68 MiB
Here, peak memory indicates the total memory used while the read_csv call runs. For roughly 700 MiB of training data it took around 3 GiB of memory! Let us look at how much memory our next method consumes.
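As a cross-check, pandas itself can report how much memory the loaded DataFrame occupies. A small sketch (the variable name df is just illustrative):
import pandas as pd
df = pd.read_csv('train_V2.csv')
# per-column usage in bytes; deep=True also counts the strings inside object columns
print(df.memory_usage(deep=True))
# total footprint in MiB
print(df.memory_usage(deep=True).sum() / (1024 ** 2))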
Using the data-type modification method to read the dataset
This method simply involves changing the data type of each column of our dataset to a less memory-consuming one. In our example, let us first list all the data types in the training set. We won't read the whole training set again via the traditional method; instead, we will take a few sample rows and save them in another file. To do so, type the following command in a Jupyter notebook cell:
!head train_V2.csv > sample_train.csv
The head command prints the first 10 lines of train_V2.csv, and the > operator redirects its standard output into sample_train.csv. Now that we have our subsampled training set, we can easily read it with the normal pandas.read_csv function:
sample_train = pd.read_csv('sample_train.csv')
dict(zip(sample_train.columns,sample_train.dtypes))
The last line gives us a dictionary mapping column names to their respective data types:
{'Id': dtype('O'),
'groupId': dtype('O'),
'matchId': dtype('O'),
'assists': dtype('int64'),
'boosts': dtype('int64'),
'damageDealt': dtype('float64'),
'DBNOs': dtype('int64'),
'headshotKills': dtype('int64'),
'heals': dtype('int64'),
'killPlace': dtype('int64'),
'killPoints': dtype('int64'),
'kills': dtype('int64'),
'killStreaks': dtype('int64'),
'longestKill': dtype('float64'),
'matchDuration': dtype('int64'),
'matchType': dtype('O'),
'maxPlace': dtype('int64'),
'numGroups': dtype('int64'),
'rankPoints': dtype('int64'),
'revives': dtype('int64'),
'rideDistance': dtype('float64'),
'roadKills': dtype('int64'),
'swimDistance': dtype('float64'),
'teamKills': dtype('int64'),
'vehicleDestroys': dtype('int64'),
'walkDistance': dtype('float64'),
'weaponsAcquired': dtype('int64'),
'winPoints': dtype('int64'),
'winPlacePerc': dtype('float64')}
Generally, whenever a dataset is loaded, pandas assigns any numerical column the int64 or float64 data type by default. Since these defaults are 64-bit types, you can imagine how much space they take for datasets with millions of rows. For reference, the following table shows the ranges covered by the common int and float data types:
int8 Byte (-128 to 127)
int16 Integer (-32768 to 32767)
int32 Integer (-2147483648 to 2147483647)
int64 Integer (-9223372036854775808 to 9223372036854775807)
float16 Half precision
float32 Single precision
float64 Double precision
For more information on the data types used in pandas, please refer to the pandas and NumPy documentation.
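Before downcasting, it is worth checking that a column's values actually fit into the smaller type. A minimal sketch using NumPy's iinfo and finfo (run here on the sample frame, with kills as an example column; on the full data the check is of course more meaningful):
import numpy as np
# ranges of the smaller types we plan to use
print(np.iinfo('int16'))    # -32768 to 32767
print(np.finfo('float16'))  # max around 65504, about 3 decimal digits of precision
# does this integer column stay inside the int16 range?
col = sample_train['kills']
print(col.min() >= np.iinfo('int16').min and col.max() <= np.iinfo('int16').max)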
It is obvious that larger data types consume more space. In our scenario, there are 19 int64, 6 float64 and 4 object columns. Now let us create a dictionary that maps every integer column to int16 and every float column to float16:
dtype_list = list()
for x in sample_train.dtypes.tolist():
    if x == 'int64':
        dtype_list.append('int16')
    elif x == 'float64':
        dtype_list.append('float16')
    else:
        dtype_list.append('object')
dtype_list = dict(zip(sample_train.columns.tolist(), dtype_list))
dtype_list
The above code snippet gives us a data-type dictionary similar to the one below:
{'Id': 'object',
'groupId': 'object',
'matchId': 'object',
'assists': 'int16',
'boosts': 'int16',
'damageDealt': 'float16',
'DBNOs': 'int16',
'headshotKills': 'int16',
'heals': 'int16',
'killPlace': 'int16',
'killPoints': 'int16',
'kills': 'int16',
'killStreaks': 'int16',
'longestKill': 'float16',
'matchDuration': 'int16',
'matchType': 'object',
'maxPlace': 'int16',
'numGroups': 'int16',
'rankPoints': 'int16',
'revives': 'int16',
'rideDistance': 'float16',
'roadKills': 'int16',
'swimDistance': 'float16',
'teamKills': 'int16',
'vehicleDestroys': 'int16',
'walkDistance': 'float16',
'weaponsAcquired': 'int16',
'winPoints': 'int16',
'winPlacePerc': 'float16'}
Now that we have our custom dictionary of smaller data types, let us pass it to the read_csv function via its dtype parameter, so that pandas reads our training set exactly the way we want, i.e. with smaller-typed columns.
>>>> %time pd.read_csv('train_V2.csv',dtype=dtype_list)
CPU times: user 13.4 s, sys: 667 ms, total: 14.1 s
Wall time: 16.6 s
This took around 16.6 seconds to read the entire training set. Compared with our plain method, that is roughly a 40% reduction in wall time.
Now let us again measure the memory consumed, using the same magic command as in the previous method:
>>>> %memit pd.read_csv('train_V2.csv',dtype=dtype_list)
peak memory: 1787.43 MiB, increment: 1703.09 MiB
So this method peaked at about 1.8 GiB, roughly 40% less memory than our traditional method, which is pretty good.
Using the chunking method to read the CSV
This method involves reading the data in chunks, using the chunksize parameter of the read_csv function. Let us set a chunk size to read our dataset this way:
>>>> chunk_size = 10**6
>>>> chunk_size
1000000
Let us split our dataset into chunks of 1,000,000 rows. Since the training set has about 4.4 million rows, this gives four full chunks plus a smaller final chunk whose size pandas works out for us, i.e. five chunks in total. Now let us measure how much time this method takes:
>>>> %timeit [chunk for chunk in pd.read_csv('train_V2.csv',chunksize=chunk_size)]
29.4 s ± 2.26 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
So iterating over all five chunks, i.e. reading the whole file, took around 30 seconds per run.
Also, the memory consumed by this method is:
>>>> %memit [chunk for chunk in pd.read_csv('train_V2.csv',chunksize=chunk_size)]
peak memory: 1966.75 MiB, increment: 1880.20 MiB
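Note that collecting every chunk into a list, as in the cells above, still ends up holding the entire dataset in memory; the real benefit of chunking comes from processing each chunk and discarding it before moving on. A minimal sketch (the per-chunk aggregation is only illustrative, and the dtype dictionary from Method 2 is reused):
total_rows = 0
kills_sum = 0
# only the current chunk lives in memory at any point in time
for chunk in pd.read_csv('train_V2.csv', chunksize=chunk_size, dtype=dtype_list):
    total_rows += len(chunk)
    kills_sum += chunk['kills'].sum()
print(total_rows, kills_sum / total_rows)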
Comparing the time and memory consumed by all the methods:
Method 1 (plain read_csv): 27.5 s wall time, 3085.73 MiB peak memory
Method 2 (read_csv with dtype dictionary): 16.6 s wall time, 1787.43 MiB peak memory
Method 3 (read_csv with chunksize): 29.4 s wall time, 1966.75 MiB peak memory
Conclusion
In general, it completely depends on the dataset you are trying to read. Method 2, where we changed the data types of the columns while reading, can certainly be used in scenarios where you have limited system resources, or in Kaggle competitions. But this method has one caveat: there may be some loss of information, since we shrink the range (and precision) of acceptable values when moving from the larger data types to the smaller ones. If speed is not much of a concern and you want to maintain the integrity of the data, then Method 3 is more advisable.
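To make that caveat concrete, here is a small illustration of what can happen when values do not fit the smaller types (the numbers are made up for demonstration):
import numpy as np
import pandas as pd
# float16 holds values only up to about 65504, with roughly 3 decimal digits of precision
s = pd.Series([1500.55, 70000.0])
print(s.astype('float16'))  # the first value gets rounded, the second overflows to inf
# int16 covers -32768 to 32767, so larger integers would not survive the cast either
print(np.iinfo('int16'))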
Thanks to the Intro to Machine Learning course from fast.ai and its awesome forums for inspiring me to write this blog.
You can find my code for this blog post:
Please let me know how you liked this blog post in the comments section, or reach out to me with any constructive feedback on Twitter at @keurplkar.