Working efficiently with Large Data in pandas and MySQL (or any other RDBMS)
Hello everyone, this brief tutorial is going to show you how you can efficiently read large datasets from a csv, excel or an external database using pandas and store in a centralized database.
pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.
The dataset we will read is a csv file of air pollution in Seoul. You can download it from here (if you have a kaggle account) or here.
Thanks to Seoul City, Seoul Open Data Plaza, and Air Quality Analysis Center for providing data.
Without taking much of your time, let’s jump right into it…
- Importing the required libraries
import pandas as pd
import mysql.connector as sql
2. Reading the csv file (traditional way)
df = pd.read_csv(‘Measurement_item_info.csv’,sep=’,’)
let’s have a preview of how the file looks
df.head()
lets check how many rows and columns we have
df.shape >> (3885066, 5)
That is approximately 3.9 million rows and 5 columns.
Since we have used a traditional way, our memory management was not efficient. Let us see how much memory we consumed with each column and the sum of memory consumed. NB: the output is in bytes
df.memory_usage(index=False, deep=True)
Measurement date 283609818
Station code 31080528
Item code 31080528
Average value 31080528
Instrument status 31080528
df.memory_usage(index=False, deep=True).sum()
407931930 bytes.
Approximately 408MB, while the actual file size is 124.4MB on disk. NOT GOOD!!!
3. Memory Management
While sometimes we may get away with it using memory like above is not the best practice. To be able to manage memory we need to be able to understand the data types of our dataframe. Lets do that…
df.dtypes
Measurement date object
Station code int64
Item code int64
Average value float64
Instrument status int64
By default pandas allocates the above “dtypes” to prevent any data from being too large to fit in storage. e.g the maximum value for a signed int64 i.e 64-bit integer is 9,223,372,036,854,775,807.
Lets check if any of our int64 dtypes actually come close to this number
df[‘Station code’].max() >> 125
df[‘Item code’].max() >> 9
df[‘Instrument status’].max()>> 9
df[‘Average value’].max() >> 6256
None of our values come remotely close to the assigned memory space!! Let us reallocate the memory more efficiently.
NB: As a developer especially when working with DBMS, it is always important to know or anticipate the maximum value(s) that your data will hold
In this example we read a new dataframe
df_eff_memory = pd.read_csv(‘Measurement_info.csv’,sep=’,’,dtype={‘Station code’:’int8',’Item code’:’int8',’Instrument status’:’int8',
‘Average value’:’float16',’Measurement date’:’string’})
df_eff_memory.memory_usage(index=False,deep=True)
Measurement date 31080528
Station code 3885066
Item code 3885066
Average value 7770132
Instrument status 3885066
# total memory used
df_eff_memory.memory_usage(index=False,deep=True).sum()
50505858 bytes
Approximately 51MB. This is a reduction of 87.61% memory usage!!!
4. Storing the data in MySQL
Sometimes we want to centralize and protect the data for our organization. In this case we will use community version of MySQL.
a) let’s create our credentials
db=’mysql’
user=’db_admin’
paswd=’db_admin_password'###SHOULD BE KEPT HIGHLY SECRETIVE!!!!
host=’localhost’
b) create the connection object
conn = sql.connect(db=db,user=user,host=host,password=paswd,use_unicode=True,
charset=”utf8", auth_plugin=’mysql_native_password’)
c) create cursor to read or writecur = conn.cursor()
d) insert querymeasurement_insert_query = “””
insert into measurement_info (measurement_date,average_value,instrument_status,measurement_item_info_id,
measurement_station_info_id) values(%s,%s,%s,(select measurement_item_info.id from measurement_item_info
where item_code = %s),(select measurement_station_info.id from measurement_station_info where station_code = %s))
“””
e) data to insert
data = list(zip(df_eff_memory[‘Measurement date’],df_eff_memory[‘Average value’],df_eff_memory[‘Instrument status’],
df_eff_memory[‘Item code’],df_eff_memory[‘Station code’]))
f) now finally insert the data
try:
cur.executemany(measurement_insert_query,data)
conn.commit()
print(‘success’)
except (sql.Error,sql.Warning) as e:
print(e)
2055: Lost connection to MySQL server at 'localhost:3306', system error: 32 Broken pipe
Wait! What just happened! After a couple of minutes, we receive an error, even though we utilized the memory efficiently. It seems the server got tired of waiting and closed the connection even though we did not ask it to in our code
5. Final step Reading and inserting data in CHUNKS
Since we were not able to insert the data into MySQL as a whole, we will do it in chunks or small bits. pandas provides chunksize option when you initially read the data, so we need to write some kind of iteration/loop to read and write
here we are reading 10000 rows at a time and writing to the database. We also get a success output after each insert
success
success
success
success
success
We were finally able to successfully read and write Large Data to our database!!
Bonus
Speed improvement using numpy vectors
NumPy is The fundamental package for scientific computing with Python. Using a small tweak on our chunks we can achieve faster read of our data.
All you need to do is add to_numpy()
at the end of each column feature
Conclusion
Your output and memory usage may be slightly different from mine depending on the computer you are using.
In the event you fail to read the data completely, you may be in BigData territory. You should consider libraries such as pyspark which are primarily built to work on them.
pandas is a great tool to use when doing data analytics in Python. I have barely scratched the surface with what you can do with the data and I encourage you to experiment and research more.
Thank you so much for you time. Happy coding :)
About the Author
James has over 15 years experience in Information Technology — software development, data science and web development with extensive experience and ability to drive strategic innovations in areas of business growth, leadership and organization effectiveness. He designs and build algorithms for data analysis and visualization for both the web and stand-alone systems.
James is also a licensed in Flight Operations and Dispatcher with over 7 years experience in aviation. In his role he ensures implementation of safety, security, efficiency and emergency standards at all times. He also takes part in controlling and coordinating the daily fleet operations of the airline over total route network — International, Regional and domestic.