5 different data sources accessed from one Jupyter notebook — Sroka library use cases

Dorota Mierzwa
Fandom Engineering
Dec 11, 2019

Sroka is an open source Python library that enables quick access to various data sources (e.g. S3, Athena, Google Ad Manager). The library’s name comes from Polish: “sroka” means magpie.

Having introduced Sroka to the public in the first Medium article, I now want to share an example of a real analysis where Sroka shows a wide spectrum of its capabilities (though not all of them!). As an Ads Data Analyst, I will show Sroka usage in a project that leveraged advertising technology. If you are interested in how to use Sroka to access a specific data source, feel free to jump straight to the respective section.

Why

Before diving into the code, I would like to point out the major benefits of using Sroka in data analyses:

  • Simplicity — after initial configuration, each API call is as simple as a one-line function.
  • Usability — getting familiar with multiple data platforms’ UI takes time. Sroka can serve as a single interface that is also favoured by more technical members of the team.
  • Reproducibility — no local files are referenced, so anyone at the company can rerun the analysis on their laptop.
  • Flexibility — Jupyter notebook can be reused for similar analyses, just by modifying arguments passed to Sroka functions.

Project intro

In the Ad Engineering team we work on the technology behind serving ads on Fandom wiki pages. We use data to optimise various metrics related to, for example, page layout (how many and what kind of ad placements are available on the page) or ad behaviour (e.g. disappearing or sticking ads).

Some time ago we set up an experiment where, instead of showing ads, we asked users some questions. The image was interactive and each of the possible answers was clickable. An example of such an ad-based survey looks like this:

The goal was to see how we can influence users’ engagement, as measured by clicks. (Psst. If you want to learn more about how this project developed, go check the article written by Julia summarising one of her internship UX/design projects.) I will be referring to this ad as Feedback from now on.

Analysis goals

In this analysis we wanted to:

  • Calculate the click-through rate (CTR) for Feedback and compare it to a benchmark advertisement.
  • Cross-check our own click-tracking data with the data from Google Ad Manager.
  • Calculate the % share of each of the answers.
  • Check whether session duration differs between users who chose different answers.
  • Compare the answer distribution between different groups of users.

The data to answer these questions sits in 5 different data sources.

Google Ad Manager

GAM (previously known as DoubleClick for Publishers) is a platform to configure advertising campaigns, as well as to track various metrics on their performance and delivery status. As Feedback is technically configured as an ad, this is where we’ll check the CTR and compare it with a chosen benchmark.

GAM can be queried directly from within a notebook:

# Download data from Google Ad Manager
from sroka.api.google_ad_manager.gam_api import get_data_from_admanager

query = "WHERE ORDER_ID IN (2488556116)"
dimensions = ['DATE',
              'CREATIVE_ID',
              'CREATIVE_NAME']
columns = ['TOTAL_LINE_ITEM_LEVEL_IMPRESSIONS',
           'TOTAL_LINE_ITEM_LEVEL_CLICKS']
start_date = {'year': '2019',
              'month': '02',
              'day': '13'}
stop_date = {'year': '2019',
             'month': '02',
             'day': '23'}

df_gam_raw = get_data_from_admanager(query, dimensions, columns,
                                     start_date, stop_date)

In query you can pass conditions on how your data should be filtered (specific countries, devices, key-values, ids of GAM entities, etc.). The values that can be filtered this way are listed in the GAM API documentation, along with the accepted dimensions.

Running this cell may take up to a few minutes, depending on the date range and granularity of the report. Once done, the data is returned as a pandas DataFrame.

After basic data cleaning, we see two ad types: Feedback and the benchmark we aimed to compare it with.
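As a rough sketch of that comparison (the per-creative aggregation is my assumption about the cleaning step; the column names come from the dimensions and columns lists above):

df_gam = (df_gam_raw
          .groupby('CREATIVE_NAME')[['TOTAL_LINE_ITEM_LEVEL_IMPRESSIONS',
                                     'TOTAL_LINE_ITEM_LEVEL_CLICKS']]
          .sum())
# CTR = clicks / impressions, per creative
df_gam['CTR'] = (df_gam['TOTAL_LINE_ITEM_LEVEL_CLICKS']
                 / df_gam['TOTAL_LINE_ITEM_LEVEL_IMPRESSIONS'])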

Visualising this data answers our first question: initial Feedback data shows that it’s doing really well in terms of CTR. Please do not treat this CTR or any other metrics presented here (for the Feedback ad or the benchmark ad) as a reference. These were campaigns with specific targeting and may not reflect the situation on your traffic.

Athena

We use Athena to query our internal data warehouse, where we gather custom data based on events sent from Fandom pages. For this project we wanted to track not only the general click event, but also which answer was connected to that specific click.

I wrote the query using Athena’s Query Editor and simply copied its id from the URL.

from sroka.api.athena.athena_api import done_athena, query_athena

df_dw_raw = done_athena('fb4dd4e7-39c9-4fc1-ab01-20f30ab995c2')
df_dw_raw.head()

Again, the data is returned as a pandas DataFrame with the columns defined in the query. Here you can see it aggregated at the creative_id level, which references the specific image loaded, as the images differed e.g. in size.

I usually write queries in the Query Editor and pass the id as an argument to done_athena, as it helps with identifying typos or syntax errors. It also allows you to monitor the query’s progress and the amount of data scanned. However, you can get the same result by passing the query as a string to query_athena, as presented in the example below:

example_query = query_athena("""
SELECT DISTINCT day
FROM statsdb.fact_adengadinfo_events
WHERE year = '2019' AND month ='10'
""")

Now that the data is ready, I’ll check two things:

  • Compare the CTR from our custom tracking to the one measured by GAM. Since we do not record click events in the data warehouse by default, this tells us whether the tracking implementation was successful. I am aggregating data from the clicked_any column that was calculated in the query. As we can see, there is an almost perfect match, so I can proceed with the analysis and extract information on how users responded.
  • See the answers distribution (a rough sketch of both checks follows below):
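Both checks boil down to short pandas aggregations. A minimal sketch, assuming event-level data before the creative_id aggregation shown earlier, with the clicked_any flag computed in the query and a hypothetical answer column holding the chosen answer:

# Cross-check: custom CTR per creative, to compare against the GAM numbers
ctr_dw = df_dw_raw.groupby('creative_id')['clicked_any'].mean()

# Answer distribution among recorded clicks
# ('answer' is an illustrative column name, not the actual schema)
answer_share = (df_dw_raw.loc[df_dw_raw['clicked_any'] == 1, 'answer']
                .value_counts(normalize=True))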

Google Analytics

Google Analytics is a tool for gathering web analytics data, providing information on e.g. site usage and performance. For the Feedback project, it lets us extract aggregated information about users who saw and/or interacted with the ad. Specifically, we’ll see whether session duration differs between users who gave different answers.

To do that, I’ll use GA’s Session Duration Bucket dimension.

A request to the Google Analytics API requires you to specify a few things. Apart from the usual date range, metrics, filters, and dimensions, you need to provide the id of the specific View you would like to query. You can find it on the home screen after logging into GA. The allowed values for dimensions and metrics are listed here.

from sroka.api.ga.ga import ga_request

request = {"ids": "ga:12345678",
           "start_date": "2019-03-18",
           "end_date": "2019-03-19",
           "metrics": "ga:users",
           "filters": "",
           "segment": "",
           "dimensions": "ga:sessionDurationBucket"}
df_ga_all = ga_request(request, print_sample_size=True)
df_ga_all

The result shows the global distribution of Fandom users by the bucketed duration of their session in seconds. In the above ga_request I passed an additional argument, print_sample_size=True. This is what printed the message about the sample size used to generate the data. We use it because GA often provides data based on a sample rather than the whole traffic. Consider the request below, identical to the one above except for the chosen metric.

request = {"ids": "ga:12345678",
           "start_date": "2019-03-18",
           "end_date": "2019-03-19",
           "metrics": "ga:totalEvents",  # metric changed
           "filters": "",
           "segment": "",
           "dimensions": "ga:sessionDurationBucket"}
df_ga2 = ga_request(request, print_sample_size=True)
df_ga2

With this seemingly innocent change, the sample used decreased to less than 4%, and there is no way to force a larger sample via an API call. While this may be fine for a total traffic overview, in other situations where we have only a limited amount of data (e.g. small segments) it may result in artificially high variability in the data, or in no data returned at all. The former was precisely the case we encountered in the Feedback ad analysis.

Currently, this situation makes it necessary to take a step back and use the Google Analytics UI. But not all hope is lost: we’ll still preserve the online-only analysis workflow that keeps the notebook easily reproducible.

Google Sheets

Most online data platforms allow users to export query results to Google Sheets, and the Feedback data retrieved from the Google Analytics UI can be easily exported this way. From here, all we need are the document id from the URL and the name of the tab (both marked in red on the screenshot) to download the contents directly into the notebook.

from sroka.api.google_drive.google_drive_api import google_drive_sheets_read

df_ga = google_drive_sheets_read('{spreadsheet_id_from_url}',
                                 sheet_range='content-event-events!A7:C266',
                                 first_row_columns=True)
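The values read from a sheet typically arrive as strings, so the cleaning is mostly type conversion. A minimal sketch with illustrative column names (the actual columns come from the GA UI export):

import pandas as pd

# Sheet values are read as text; convert numeric columns before aggregating.
df_ga['sessionDurationBucket'] = pd.to_numeric(df_ga['sessionDurationBucket'])
df_ga['totalEvents'] = pd.to_numeric(df_ga['totalEvents'])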

After some data cleaning we have our data ready to be aggregated and visualised:

Now we can again match data from different sources: total Fandom users downloaded directly through GA vs. Feedback users read in from Google Sheets.
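To read that comparison off the data rather than the plot, we can compute the cumulative share of sessions up to each duration bucket. A sketch, assuming each frame has a numeric duration-bucket column and a count column (the names below are illustrative, not Sroka’s actual output):

def cumulative_share(df, bucket_col, count_col):
    # Cumulative share of sessions that lasted up to each bucket boundary.
    counts = (df.assign(bucket=df[bucket_col].astype(int))
                .groupby('bucket')[count_col].sum()
                .sort_index())
    return counts.cumsum() / counts.sum()

# Total Fandom traffic (from GA) vs. Feedback users (from Google Sheets)
total_cdf = cumulative_share(df_ga_all, 'ga:sessionDurationBucket', 'ga:users')
feedback_cdf = cumulative_share(df_ga, 'sessionDurationBucket', 'totalEvents')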

From the plot above we see that users who chose “found” and “still searching” more often had longer sessions than those who chose “didn’t find”. We can read, for example, that 0.6 of all “didn’t find” sessions were up to 80 seconds long, while for “found” the same 0.6 share covers sessions up to 100 seconds, i.e. 20 seconds longer. Interestingly, the “didn’t find” group is much closer to total traffic than the other answers.

S3

To make the analysis complete and keep all the insights gathered along this project in one notebook, I also want to include an analysis done by a fellow analyst on my team. She ran a specific check on how the answer distribution differs among user groups, split by their Fandom usage. The data is available on S3, and all I need to know is its location.

from sroka.api.s3_connection.s3_connection_api import s3_download_data

df_s3 = s3_download_data('s3://folder_name/martyna/feedback/clicks_users',
                         prefix=True)

By setting the prefix keyword argument to True, s3_download_data treats the string as a prefix; otherwise, it would be treated as a direct file path.
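For comparison, without the prefix flag the same function expects an exact key; the file name below is made up for illustration:

# Hypothetical exact-key download (file name invented for this example)
df_one_file = s3_download_data(
    's3://folder_name/martyna/feedback/clicks_users/part_0.csv',
    prefix=False)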

Downloaded data:

And plot:

Summary

We now have a good overview of how the Feedback project is performing, along with some insights into how users are interacting with it. The above analysis seamlessly connected to 5 different data sources from a single Jupyter notebook. The use cases presented are a perfect match for our needs at Fandom. However, as each of the API connections offers a wide range of functionalities, Sroka may not cover them all. If you notice that this is the case, join us and contribute to this open source library to make it better!

Originally published at https://dev.fandom.com.
