Scaling From Manual to Automated Reporting Using Pandas, Google Scripts, and Google Sheets

shaina.jordan
Ibotta Engine
Published in Ibotta Engine
Jul 29, 2016

For growing companies like Ibotta, aggregating data, ensuring it’s accurate and actionable, and streamlining the reporting process can be tedious and time-consuming. The process is freckled with inaccuracies and inefficiencies, and it often becomes a monotonous burden on team members across the organization, resulting in double, sometimes triple, the work of an automated process. As companies scale, removing manual reporting becomes a critical piece in increasing organizational efficiency, improving trust in reported data, and helping teams and leaders make better data-driven decisions.

Plugging into third-party solutions and APIs, and owning as much data as possible across the entire user flow, is the ideal, but that would be an unheard-of novelty for a fast-moving start-up. Plus, these solutions require internal engineering resources as well as a budget for external tools. Fortunately for Ibottalytics, the founders and early Ibotta team had the foresight to lay a strong analytics and data foundation. Currently we rely heavily on Looker’s business intelligence software on top of an ultra-fast Redshift database to provide transparency across the organization for internal and external daily and monthly reporting. This solution works impeccably well across a variety of use cases.

However, our current analytics solution cannot produce the high-level and granular analyses of customer acquisition that we need to move the company forward efficiently. Because a large portion of our mobile acquisition crosses the black box of the iTunes store and the mystery world of Facebook advertising, reporting is difficult, time-consuming, and inaccurate. As a result, growth teams resort to making decisions on the shaky ground of aggregate KPIs. While attribution reporting is improving across the board, mapping data from the first-touch ad impression down to valuable in-app behavior remains a difficult problem for most mobile-first startups.

Beyond that, the marketing, growth, and executive teams who have a vested interest in seeing and manipulating user-level data by acquisition source are often accustomed to reporting that ties into familiar, easy-to-access tools like Google Sheets and Excel. For many marketing teams, including Ibotta’s, existing reporting efficiently tells the high-level story of how marketing efforts are performing, but it relies on copying and pasting, typing, and a lot of keystrokes to create a repeatable and consistent report. It’s a system that lags when someone is on vacation, out sick, or generally has higher-value items to work on.

Enter Pandas and Google Sheets.

Leveraging the power of Python and the accessibility of Google Sheets can provide the much-needed bridge between disconnected third-party acquisition reporting and high-value customer data.

Our Background:

For years, the Marketing team at Ibotta has been reporting weekly KPIs in a Google Sheet. The sheet has a few summary tabs with things like 7 day spend, 7 day installs, cost per install, and other key metrics that the team looks at on a 7 day, 14 day, and 28 day scale. The summary tabs are key high-level views and rely on native Google Sheet functions to create comprehensive summaries. These summary tabs rely on a mess of other tabs that feed into them, including one that is updated manually every day with values typed in or copied and pasted from a number of different sources.
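To make that concrete, here is a minimal sketch of the kind of trailing 7, 14, and 28 day rollup those summary tabs compute. The DataFrame named daily and its date, spend, and installs columns are hypothetical stand-ins for illustration, not our actual schema:

import pandas

# hypothetical daily marketing data (a stand-in for the manually updated tab)
daily = pandas.DataFrame({
    'date': pandas.date_range('2016-07-01', periods=28),
    'spend': [1000.0] * 28,
    'installs': [400] * 28,
}).sort_values('date')

# trailing 7, 14, and 28 day windows, mirroring the summary tabs
for window in (7, 14, 28):
    spend = daily['spend'].tail(window).sum()
    installs = daily['installs'].tail(window).sum()
    # cost per install over the trailing window
    print '%d day: spend $%.2f, installs %d, CPI $%.2f' % (
        window, spend, installs, spend / float(installs))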

Each one of these cells must be manually updated every day:

A picture is worth everything here: the update process involves filtering, pasting, and typing data across many columns, then double-checking the work by scrolling back and forth to make sure values populate. Not only is this time-consuming and error-prone, it’s extremely limited in how it can highlight our growth performance. By revamping the reporting process, we can increase visibility into how our advertising campaigns are performing, down to the highest-value actions within our app.

The approach:

  1. Start high level. Be able to answer: what’s the goal? What question are the people looking at this report trying to answer? Reporting that evolves over time with many people contributing is riddled with inefficiencies, and not all of them are worth improving.
  2. Begin with quick wins. The current reporting process for our marketing data is partially automated with existing queries. That’s the easiest place to start.
  3. Set the stage. Once the bones of the script are in place, it’s easy to tack on additional data and analyses.

The ingredients:

Pandas, psycopg2*, oauth2client, and gspread. Documentation around these modules is robust.

*The Python module for Postgres. If you’re using MySQL, check out MySQLdb.

What we did:

1. Using existing queries unioned together, we created a job to populate an existing sheet. This starts with a simple psycopg2 connection reading data from our Redshift database into a pandas DataFrame.

Example:

import sys
import pandas
import psycopg2

# input database credentials
user = raw_input("username:")
password = raw_input("password:")

con = None
# connect to db
try:
    con = psycopg2.connect(database='yourdatabase', user=user,
                           password=password, host='yourhost', port='yourport')
    cur = con.cursor()  # not strictly needed; read_sql_query takes the connection directly
    # run query and read to dataframe
    query_results = pandas.read_sql_query("""SELECT * FROM table""", con)
# if there's an error, print out what the error is
except psycopg2.DatabaseError, e:
    print 'Error %s' % e
    sys.exit(1)
# close the database connection
finally:
    if con:
        con.close()

# print out the first five rows of the query to verify that results match what's expected
print query_results.head()

2. Once the data is in the DataFrame, we opened a connection to Google Sheets.

import gspread
from oauth2client.service_account import ServiceAccountCredentials

# initiate and execute gspread connection to connect to google sheets
scope = ['https://spreadsheets.google.com/feeds']
# input the correct name of your json key file - you get this when you set up your gspread module
credentials = ServiceAccountCredentials.from_json_keyfile_name('yourjsonfile.json', scope)
# authenticate the connection
gc = gspread.authorize(credentials)
# open the google sheet - see the gspread docs for the many ways to do this
wks = gc.open("thegooglesheet").sheet1
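One setup note: the spreadsheet has to be shared with the service account’s client_email address (it’s in the JSON key file) before gc.open() can find it; authenticating alone isn’t enough.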

3. Set up a system to batch-update the Google Sheet (reference: https://www.dataiku.com/learn/guide/code/python/export-a-dataset-to-google-spreadsheets.html).

# helper from the referenced guide: convert a 1-based column number to its A1 letter(s)
def numberToLetters(q):
    result = ''
    while q > 0:
        q, r = divmod(q - 1, 26)
        result = chr(r + ord('A')) + result
    return result

# column names
columns = query_results.columns.values.tolist()
# selection of the header range that will be updated ('A1:F1' assumes six columns)
cell_list = wks.range('A1:F1')
# modifying the values in the range
for cell in cell_list:
    val = columns[cell.col - 1]
    if type(val) is str:
        val = val.decode('utf-8')
    cell.value = val
# update column headers in batch to google sheet
wks.update_cells(cell_list)

# number of lines and columns
num_lines, num_columns = query_results.shape
# select a cell range for the results
cell_list = wks.range('A2:' + numberToLetters(num_columns) + str(num_lines + 1))
# update values
for cell in cell_list:
    val = query_results.iloc[cell.row - 2, cell.col - 1]
    if type(val) is str:
        val = val.decode('utf-8')
    elif isinstance(val, (int, long, float, complex)):
        # note that we round all numbers
        val = int(round(val))
    cell.value = val
# send results to update the sheet in batch mode
wks.update_cells(cell_list)
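The batch update is the key design choice here: update_cells sends the whole modified range back to Google in a single API call rather than issuing one request per cell, which keeps the job fast and avoids hammering Google’s API rate limits even when a report spans thousands of cells.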

This recipe provides a lightweight framework to remove manual processes from current reporting that pulls from disparate sources, but it should be considered temporary. The most sustainable, least manual solutions involve pulling in user-level data from attribution APIs (we are in the process of a full integration with Tune and Singular), owning that data, and reporting through existing BI software (Looker).

Looking forward, we intend to move completely away from Google Sheets, but this solution alleviates a pain point and will help us move our growth forward at a faster rate. What solutions are you currently using to accurately report on your mobile acquisition funnel?
