Write Data from Google Sheets to an Internal Database Using Python
There are plenty of tutorials on how to read data into Google Sheets from a database, but I recently found myself needing to do the opposite: take data that was only available in a private Google Sheet and move it to our database. This practice may be far from the gold standard of database management, but it’s a scrappy way to allow non-technical users to create and manage data sets that need to be ingested and linked to other internal data sources and then queried through tools like Looker, Wagon, and Tableau. Moreover, in my data science work I often need to pull data from Google Sheets into a Pandas dataframe for analysis. This is easy using StringIO for a public Google Sheet, but I needed to find a method to retrieve private company data. In this tutorial, I’ll go over how to read a Google Sheet into a dataframe and then write this dataframe to a database.
Step 1: Writing from Sheets to a DataFrame
The simplest way to get data from Google Sheets into Pandas is to export the sheet as a csv and use the read_csv function. However, this process can be cumbersome and introduces latency for any source data that changes over time. Another approach is to use the Google API, but it’s a bit overwhelming for this simple use case. My preferred method is to use gspread, a python wrapper developed specifically for google sheets. You will still need to authenticate by using/creating OAuth credentials for your Google account, but from there the process is super easy. Here’s step by step directions:
- Open up terminal and install gspread and oauth2 (and pandas if you have not done so already)
pip install gspread
pip install oauth2client
2. Create OAuth credentials for your Google account. Open the developers console and select or create a project. On the left hand side you should see “API Manager,” and under that click on “Credentials.” Click create credentials → service account key. The service account doesn’t matter, and the key type should be JSON. Hit ‘create’ and you will download a json file with the credentials.
3. Move the credentials download to whichever folder you’ll be using to store your python script. I renamed mine ‘sheets_client_secret.json’ to make it easier to reference in my code. I kept the name generic since I’ll use this same json file for multiple data sources.
4. Share your google sheet with your oauth email address. You can find this if you open up the json file, or check your credentials in the Google developers console.
5. Create a python script with the following code. Note that this is a bare bones code that assumes you want all the data from your Google sheet and that it’s stored in a standard format (e.g. column headers in first row, column names in the right format):
from oauth2client.service_account import ServiceAccountCredentials
import pandas as pd
spreadsheet_key = "yourspreadsheetkeywhichisfoundinthesheetsurl"
scope = ["https://spreadsheets.google.com/feeds"]
credentials = ServiceAccountCredentials.from_json_keyfile_name(‘sheets_client_secret.json’, scope)
gc = gspread.authorize(credentials)
wks = gc.open_by_key(spreadsheet_key).sheet1
data = wks.get_all_records()
cols = wks.get_all_values() #keeps cols in right order
df = pd.DataFrame(data,columns=cols)
Super easy! Now you have your data in a dataframe, ready to analyze, transform, or load into a database. Of course, you will need to replace ‘spreadsheet_key’ and the json file name. For example, in this spreadsheet the spreadsheet key would be “1PiBQ5mYp8FB9WWeoqskvsnktPYKujZunn5NSuk5M5oo”
‘sheet1’ is a convention to reference the first tab in the Google Doc. If you want to reference a different tab, use the following convention instead:
wks = gc.open_by_key(spreadsheet_key).worksheet('tab name')
If you need help on the above steps or want to read about additional gspread options for accessing Google sheets and reading data, check out http://gspread.readthedocs.io. You can also write to Google Sheets using gspread, but that’s beyond the scope of this tutorial.
Step 2: Writing from Pandas to a Database
This part is relatively easy now that you have your data in a DataFrame.
- Clean or transform your data. This is one of the advantages to using Pandas as an intermediate step: this will make it easy to clean your data, pivot it, add columns, drop columns, drop nulls, etc. Since I’m working working with Postgres, which has specific conventions around field names that must be followed, I apply a string function at this point to change the dataframe column names. I also will often drop certain columns that don’t need to be stored in our database, such as personally identifying information or lengthy text columns. Finally, I may leverage the vast world of Python libraries at our disposal to do anything that would be difficult in Sheets before retrieving the data or via SQL after it’s entered the database, such as NLP.
- Connect to your database using sqlalchemy. This is relatively straightforward; you just need credentials to your database. You can either type these in directly to your script (not recommended), prompt the user for credentials, or use an environment variable. Since this is a bare bones tutorial, I’ll show you the first method even though it’s the least secure:
connstring = “dbtype://username:password@hostname:port/dbname”
engine = sqlalchemy.create_engine(connstring)
‘hostname’ is often a url to your database connection. However, when creating a pipeline for the first time, I like to connect to my local database first and test it out. Here’s an example of my connstring for a local connection:
connstring = “postgres://lucas:secretpassword@localhost:5432/sandbox”
3. Write your dataframe to the database:
By default, Pandas will throw an error if you try to overwrite an existing table, so you may need to add if_exists=’replace’. Of course, deleting and re-creating a database table every time you want to update it is not a good long-term solution. A faster and more robust process would be to read the existing database table into Pandas (you can use psycopg2 for this), compare against your dataframe, and only update changed rows or append new rows. Read the documentation for help. You may also want to write a cron job to update your database on a periodic basis, but we’ll save that for later.