Working efficiently with Large Data in pandas and MySQL (or any other RDBMS)

James Mukuya · Published in Analytics Vidhya · Jul 3, 2020 · 5 min read
Hello everyone, this brief tutorial is going to show you how you can efficiently read large datasets from a CSV, Excel or an external database using pandas and store them in a centralized database.

pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

The dataset we will read is a CSV file of air pollution in Seoul. You can download it from here (if you have a Kaggle account) or here.

Thanks to Seoul City, Seoul Open Data Plaza, and Air Quality Analysis Center for providing data.

Without taking much of your time, let’s jump right into it…

1. Importing the required libraries

import pandas as pd
import mysql.connector as sql

2. Reading the CSV file (the traditional way)

df = pd.read_csv('Measurement_info.csv', sep=',')

Let’s have a preview of how the file looks:

df.head()

Let’s check how many rows and columns we have:

df.shape >> (3885066, 5)

That is approximately 3.9 million rows and 5 columns.

Since we used the traditional way, our memory management was not efficient. Let us see how much memory each column consumed, and the total. NB: the output is in bytes.

df.memory_usage(index=False, deep=True)

Measurement date     283609818
Station code          31080528
Item code             31080528
Average value         31080528
Instrument status     31080528

df.memory_usage(index=False, deep=True).sum()

407931930 bytes.

Approximately 408MB, while the actual file size is 124.4MB on disk. NOT GOOD!!!

3. Memory Management

While we may sometimes get away with it, using memory like this is not best practice. To manage memory well we first need to understand the data types of our dataframe. Let’s do that…

df.dtypes

Measurement date      object
Station code           int64
Item code              int64
Average value        float64
Instrument status      int64

By default pandas assigns these wide “dtypes” so that no value is ever too large to fit, e.g. the maximum value of a signed int64 (a 64-bit integer) is 9,223,372,036,854,775,807.

Let’s check whether any of our int64 columns actually come close to this number:

df['Station code'].max() >> 125

df['Item code'].max() >> 9

df['Instrument status'].max() >> 9

df['Average value'].max() >> 6256

None of our values come remotely close to the assigned memory space!! Let us reallocate the memory more efficiently.

NB: As a developer, especially when working with a DBMS, it is always important to know or anticipate the maximum value(s) your data will hold.
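If you are unsure which narrower types are safe, NumPy’s iinfo and finfo helpers will show you the range each type can hold. A quick check (a small addition on my part; NumPy ships as a pandas dependency but is not imported above):

import numpy as np

# int8 holds -128 to 127: enough for Station code (125), Item code (9) and Instrument status (9)
print(np.iinfo(np.int8))

# float16 tops out around 65504, comfortably above our maximum Average value of 6256
print(np.finfo(np.float16))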

In this example we read the file into a new dataframe, this time with explicit dtypes:

df_eff_memory = pd.read_csv('Measurement_info.csv', sep=',',
                            dtype={'Station code': 'int8', 'Item code': 'int8',
                                   'Instrument status': 'int8',
                                   'Average value': 'float16',
                                   'Measurement date': 'string'})

df_eff_memory.memory_usage(index=False, deep=True)

Measurement date     31080528
Station code          3885066
Item code             3885066
Average value         7770132
Instrument status     3885066

# total memory used
df_eff_memory.memory_usage(index=False, deep=True).sum()

50505858 bytes

Approximately 51MB. That is a reduction of roughly 87.6% in memory usage!!!

4. Storing the data in MySQL

Sometimes we want to centralize and protect the data for our organization. In this case we will use the community version of MySQL.

a) Let’s create our credentials

db = 'mysql'
user = 'db_admin'
paswd = 'db_admin_password'  # SHOULD BE KEPT HIGHLY SECRET!!!!
host = 'localhost'

b) create the connection object

conn = sql.connect(db=db, user=user, host=host, password=paswd, use_unicode=True,
                   charset='utf8', auth_plugin='mysql_native_password')

c) create a cursor to read or write

cur = conn.cursor()

d) insert query
measurement_insert_query = """
insert into measurement_info (measurement_date, average_value, instrument_status,
    measurement_item_info_id, measurement_station_info_id)
values (%s, %s, %s,
    (select measurement_item_info.id from measurement_item_info where item_code = %s),
    (select measurement_station_info.id from measurement_station_info where station_code = %s))
"""
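The query above assumes the three tables already exist. If you are following along from scratch, a minimal sketch of the schema it implies might look like this (the table and column names come from the query; the column types are my own assumptions, so adapt them to your data):

# Assumed schema implied by the insert query; adjust types and constraints to your needs
cur.execute("""
    create table if not exists measurement_station_info (
        id int auto_increment primary key,
        station_code smallint not null
    )""")
cur.execute("""
    create table if not exists measurement_item_info (
        id int auto_increment primary key,
        item_code smallint not null
    )""")
cur.execute("""
    create table if not exists measurement_info (
        id bigint auto_increment primary key,
        measurement_date datetime not null,
        average_value float not null,
        instrument_status smallint not null,
        measurement_item_info_id int,
        measurement_station_info_id int
    )""")
conn.commit()
# The two lookup tables need to be populated (e.g. from Measurement_item_info.csv and
# Measurement_station_info.csv in the same dataset) before running the insert query.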

e) data to insert

data = list(zip(df_eff_memory['Measurement date'], df_eff_memory['Average value'],
                df_eff_memory['Instrument status'], df_eff_memory['Item code'],
                df_eff_memory['Station code']))

f) now finally insert the data

try:
    cur.executemany(measurement_insert_query, data)
    conn.commit()
    print('success')
except (sql.Error, sql.Warning) as e:
    print(e)

2055: Lost connection to MySQL server at 'localhost:3306', system error: 32 Broken pipe

Wait! What just happened? After a couple of minutes we receive an error, even though we utilized the memory efficiently. The server got tired of waiting on one enormous insert and closed the connection, even though we did not ask it to in our code.

5. Final step: Reading and inserting the data in CHUNKS

Since we were not able to insert the data into MySQL in one go, we will do it in chunks, or small bits. pandas provides a chunksize option when you initially read the data, so we need to write some kind of loop to read and write.

Here we read 10,000 rows at a time and write them to the database, printing a success message after each insert, along the lines of the sketch below.
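A minimal sketch of such a loop, reusing the connection, insert query and dtypes from earlier (treat this as one possible implementation rather than the exact original code):

# Read the CSV in chunks of 10,000 rows and insert each chunk before reading the next
chunk_iter = pd.read_csv('Measurement_info.csv', sep=',', chunksize=10000,
                         dtype={'Station code': 'int8', 'Item code': 'int8',
                                'Instrument status': 'int8',
                                'Average value': 'float16',
                                'Measurement date': 'string'})

for chunk in chunk_iter:
    # Depending on your connector version you may need to cast the NumPy scalars
    # to native Python types before passing them to executemany()
    data = list(zip(chunk['Measurement date'], chunk['Average value'],
                    chunk['Instrument status'], chunk['Item code'],
                    chunk['Station code']))
    try:
        cur.executemany(measurement_insert_query, data)
        conn.commit()
        print('success')
    except (sql.Error, sql.Warning) as e:
        print(e)
        break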

success
success
success
success
success

We were finally able to successfully read and write Large Data to our database!!

Bonus

Speed improvement using NumPy vectors

NumPy is the fundamental package for scientific computing with Python. With a small tweak to our chunks we can read our data faster.

All you need to do is add .to_numpy() at the end of each column selection, as in the sketch below.
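Applied to the chunked loop above, the tweak might look like this (again a sketch under the same assumptions; .to_numpy() hands zip() plain NumPy arrays, which iterate faster than pandas Series):

for chunk in pd.read_csv('Measurement_info.csv', sep=',', chunksize=10000,
                         dtype={'Station code': 'int8', 'Item code': 'int8',
                                'Instrument status': 'int8',
                                'Average value': 'float16',
                                'Measurement date': 'string'}):
    # .to_numpy() skips the pandas Series iteration overhead when building the tuples
    data = list(zip(chunk['Measurement date'].to_numpy(), chunk['Average value'].to_numpy(),
                    chunk['Instrument status'].to_numpy(), chunk['Item code'].to_numpy(),
                    chunk['Station code'].to_numpy()))
    cur.executemany(measurement_insert_query, data)
    conn.commit()
    print('success')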

Conclusion

Your output and memory usage may be slightly different from mine depending on the computer you are using.

In the event that you fail to read the data completely, you may be in Big Data territory. You should consider libraries such as PySpark, which are built primarily to work with data at that scale.

pandas is a great tool to use when doing data analytics in Python. I have barely scratched the surface of what you can do with the data, and I encourage you to experiment and research more.

Thank you so much for your time. Happy coding :)

About the Author

James has over 15 years of experience in Information Technology (software development, data science and web development), with extensive experience and the ability to drive strategic innovation in the areas of business growth, leadership and organizational effectiveness. He designs and builds algorithms for data analysis and visualization for both the web and stand-alone systems.

James is also licensed in Flight Operations and Dispatch, with over 7 years of experience in aviation. In this role he ensures the implementation of safety, security, efficiency and emergency standards at all times. He also takes part in controlling and coordinating the daily fleet operations of the airline over its total route network: international, regional and domestic.
