How to use Pandas to access databases
Pandas is a great tool to explore the data stored in files (comma-delimited, tab-delimited, Parquet, HDF5, etc). In fact, most tutorials that you’ll find on Pandas will start with reading some kind of a sample file (or files), most likely using .read_csv:
Sometimes, the data that you want to analyze is stored in a different kind of storage, for example, a relational database. It is possible to load this data into Pandas dataframes, with an SQLAlchemy connection (or a DBAPI2 connection for sqlite):
The code is very simple, and it looks nice and easy. Load all the tables into dataframes, and do the analysis on them in Pandas.
But should you?
As it often happens, the answer it not black-and-white. It depends, mostly, on the size of your data. Why does it matter? Remember that with Pandas, things are not lazy. The Pandas dataframe is a structure in memory. If your table has lots of fields and millions of records, and you try loading the whole thing into memory, you might just crash your computer, or at the very least, have an OOM (Out Of Memory exception). And loading multiple large tables? Not going to happen.
So what are the options then? When the data is stored in a database, start exploring the data from there. The database is a storage highly optimized for querying. It would be wrong not to take advantage of it.
Explore the database using a CLI
Note that since these queries only return one record, or a few records at most, they can be safely issued via Pandas as well:
On the other hand, you’d lose SQL autocompletion, syntax highlighting, and perhaps other features that your database client provides.
For the queries above, it comes to preference — whether you use pandas to query the database or the database client application. This is not the case with queries that retrieve a lot of data. When using Pandas, it makes sense to minimize the amount of data you load into memory. There are a few strategies for that:
Limit the fields to retrieve
Hopefully, when doing the initial exploration of the data (as shown above), you zeroed out on the subset of tables and fields you’re interested in. So query for only those fields and tables when loading your Pandas dataframes, with pd.read_sql_query:
Limit the records to retrieve
This means either sampling the records (LIMIT XXX clause might do that), or only retrieving records that fit a specific criteria (if you only want to analyze payments in the current year, it does not make sense to load the whole table, so you’re going to need a WHERE condition in that SQL query):
Let database server handle joins
Any time you think of retrieving records from tables into Pandas dataframes with the purpose of later joining these dataframes, it’s probably not the best idea. Databases are highly optimized for joins. It makes much more sense to unload this task to the database server. Besides, one dataframe in memory (the resulting view of the JOIN) is better than two:
Estimate memory usage
It is very easy to estimate memory usage with Pandas, with .memory_usage call on dataframe:
If a sample of 100 records take up 5,728 bytes, then 1,000,000 records will take up approximately 57280000 bytes, or 54Mb. This is very inexact, but at least it can give you an idea.
Reduce memory usage with data types
Sometimes, it is possible to use more memory efficient datatypes on the dataframe fields. By default, Pandas will read all integer data types in database as int64, even though they might have been defined as smaller data types in database. For example, let’s look at this table:
When loaded with Pandas, data types are not equivalent to those in database:
Database smallint and tinyint types got converted into int64 — which means using more memory that we have to. Can we do better?
Here is one way to optimize things:
- Read the table (or query) in chunks, providing the chunksize parameter.
Since we assume the table is large, we can’t load the whole result into memory at once.
- Convert datatypes of each chunk to smaller datatypes.
Here, we know which fields can be converted into smaller types, because we saw the table definition in database.
- Concatenate updated chunks into a new dataframe.
Not bad — we now have 4500 bytes instead of 5728 bytes, a 21% reduction in memory. This excellent IPython notebook shows how to optimize data types even further, and provides a generic function to do that:
Reducing DataFrame memory size by ~65%
Download Open Datasets on 1000s of Projects + Share Projects on One Platform. Explore Popular Topics Like Government…
To reiterate the important points of using Pandas to explore a database:
- Dataframes in Pandas are not lazy, they are loaded into memory, be aware of the memory usage.
- Start exploring with a SQL client to determine the size and shape of data.
- Proceed based on the size of data, to either load whole tables into Pandas, or query for only selected fields and possibly limit to a sample of records.
- Let database do the joins, it’s good at it.
- Estimate the size of data. If necessary, use memory-efficient data types.
Special thanks to Sheila Tüpker, who inspired this article by asking a very good question.