100 Scripts in 30 Days challenge: Scripts 2 & 3 — Loading a 3 GB CSV file into a DB using Pandas & Odo
I regularly download data from various sources, but a big issue with analyzing large CSV files is their sheer size, and SQL makes analysis easy through queries. For such use cases I usually reach for Pandas or SQLAlchemy, but I recently found odo, part of the pydata family of libraries, which makes scalable data loading possible.
So my first script is written in pandas and loads a 3 GB CSV file into a Postgres database, and my second script loads the same CSV file, with minor modifications, using odo. Things I had to do before uploading the CSV file:
- Added column headers to the file, which made the load easier
- Ran the dos2unix utility on the CSV file, since Postgres only parses files that use \n as the line terminator
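The line-ending prep step can be sketched like this. File names here are placeholders, and `tr` is shown as a portable stand-in that does the same CRLF-to-LF rewrite as dos2unix:

```shell
# Create a tiny stand-in CSV with Windows-style \r\n line endings.
printf 'price,date\r\n250000,2016-01-04\r\n' > pp-demo.csv

# Strip the carriage returns so only \n terminators remain.
# Equivalent in effect to: dos2unix pp-demo.csv
tr -d '\r' < pp-demo.csv > pp-clean.csv

# Inspect the bytes: no \r should appear in the cleaned file.
od -c pp-clean.csv | head -n 2
```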
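The pandas approach boils down to streaming the file in chunks with `read_csv` and appending each chunk with `to_sql`, so the 3 GB file never has to fit in memory at once. A minimal runnable sketch of that pattern, using an in-memory SQLite engine and a tiny inline CSV as stand-ins for the real Postgres URL and the real file (all names here are placeholders, not my actual script):

```python
import io

import pandas as pd
from sqlalchemy import create_engine

# Stand-in for a real connection string such as
# postgresql://user:password@localhost/dbname (placeholder, not from the post).
engine = create_engine("sqlite://")

# Stand-in for the 3 GB price-paid CSV; headers were added to it manually first.
csv_data = io.StringIO(
    "price,date,postcode\n"
    "250000,2016-01-04,AB1 2CD\n"
    "180000,2016-01-05,EF3 4GH\n"
)

# chunksize keeps memory usage flat; in a real run it would be far larger
# (e.g. 100_000 rows). Each chunk is appended to the target table.
for chunk in pd.read_csv(csv_data, chunksize=1):
    chunk.to_sql("price_paid", engine, if_exists="append", index=False)

row_count = pd.read_sql("SELECT COUNT(*) AS n FROM price_paid", engine)["n"][0]
print(row_count)  # → 2
```

Swapping the SQLite URL for a `postgresql://` one and the `StringIO` for the real file path gives the shape of the first script.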
For more details, reach me on Twitter (https://twitter.com/twitmyreview) or LinkedIn (https://www.linkedin.com/in/priyabrata-dash-21616b15/).
For me, the first script using Pandas took almost four times as long to process the 3 GB file as odo, which finished in about 30 minutes.
Documentation on Odo:
Documentation on pandas:
Data can be found here: download the Price Paid Data (PPD) in text or CSV format from www.gov.uk.
Code details are given below: