Sroka — a Python library to simplify data access

Martyna Urbanek-Trzeciak
Published in Fandom Engineering · 5 min read · Apr 24, 2019

Some say that in the data world, getting and preparing the data constitutes 80% of the whole process, leaving the rest to the actual analysis and visualisation. That is true in many cases, and one thing is certain: combining several data sources can be a tedious task. At Fandom, we saw this as an area for improvement and built a Python library, Sroka (https://github.com/Wikia/sroka), to ease our lives. Sroka lets us query or download data from various sources (through their APIs) with simple Python functions, in an analysis-ready format.

What is Sroka?

Sroka is the Polish name for the magpie, a bird known for collecting shiny things. By analogy, the Sroka library lets you easily access all your important data and gather it in one place.

Sroka provides a set of custom functions for each data source; their descriptions are gathered in the respective README.md files. The library is a product of teamwork, and you can find all contributors on GitHub.

The library currently supports the following data sources:

  • AWS (Athena and S3)
  • Google Ad Manager
  • Google Analytics
  • Google Sheets
  • Moat
  • Qubole
  • Rubicon

Why did we build it?

Working in the Ad Engineering team, we have access to many kinds of data: information gathered in our own data warehouse, ads data from Google Ad Manager, webpage data from Google Analytics, and, if that is not enough, data and dashboards provided by various third-party partners. Apart from differences in how the data is aggregated and what information it provides, each of the APIs we use has its own way of querying results, and each requires a separate set of credentials. Sroka keeps them all in a sroka_config.ini file in your home directory and automatically passes them with your function calls.

An example sroka_config.ini file
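The original post showed a screenshot of the config here. As a rough sketch of the idea, the file is a plain INI file with one section per data source; the section and key names below are illustrative assumptions, so check each source's README for the exact format:

```ini
; Illustrative sketch of ~/sroka_config.ini.
; Section and key names here are assumptions, not the exact
; Sroka format; see the per-source README files.
[aws]
aws_access_key_id = ...
aws_secret_access_key = ...

[qubole]
qubole_api_token = ...
```

Because the credentials live in one file, every Sroka function call can pick them up without you passing secrets around in notebook code.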

Who can use it?

Anyone who uses data and has access to the above-mentioned data sources. We built it within the analytics part of the engineering team, since it is analysts who query all of these sources on a daily basis. However, the ultimate goal is to make data querying easier and more accessible to a wider audience. Taking the data out of clickable UIs and into code-based analysis encourages engineers to take part in the analysis process too.

Example

Before you use Sroka, you have to install the library. You can do it using pip:

pip install sroka

Most of our analyses are conducted in Jupyter Notebooks, and it is common for us to combine two (or more) data sources. Sometimes the goal is to combine different information from two sources; other times we want to make sure that the data matches. The latter lets us cross-check, for example, whether our custom events are sent or whether Google Ad Manager is configured correctly. Sroka makes this very easy. We start off with all the necessary imports, including the Sroka functions. Here we will use data stored in Athena and in Google Ad Manager.

# Athena API
from sroka.api.athena.athena_api import query_athena, done_athena
# GAM API
from sroka.api.google_ad_manager.gam_api import get_data_from_admanager
# data wrangling
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.dates as mdates
# seaborn default plot design
sns.set()

We then query the sources directly from the notebook. A simplified data acquisition and wrangling workflow looks like this:

# GAM query
start_day = '01'
end_day = '31'
start_month = '03'
end_month = '03'
year = '2019'
query = """WHERE CUSTOM_TARGETING_VALUE_ID IN ([wiki_ids])"""
dimensions = ['DATE']
columns = ['TOTAL_LINE_ITEM_LEVEL_IMPRESSIONS']
start_date = {'year': year,
              'month': start_month,
              'day': start_day}
stop_date = {'year': year,
             'month': end_month,
             'day': end_day}
print('starting...')
data_raw_gam = get_data_from_admanager(query, dimensions, columns,
                                       start_date, stop_date)
print('data gathered')
# work on a copy of the raw data frame
df_gam = data_raw_gam.copy()
# rename columns
df_gam.rename(
    columns={
        'Dimension.DATE': 'Date',
        'Column.TOTAL_LINE_ITEM_LEVEL_IMPRESSIONS': 'Impressions'
    },
    inplace=True)
# convert the Date column to datetime, needed for plotting
df_gam['Date'] = pd.to_datetime(df_gam['Date'])
# set index
df_gam.set_index('Date', inplace=True)
# check the first rows
df_gam.head()

Querying and preparing data from Athena:

# download Athena data
df_athena = query_athena(
    '''
    SELECT
        concat(year, '-', month, '-', day) AS day,
        count([ad_impressions])
    FROM [fandom_ads_data_table]
    WHERE
        year = '2019'
        AND month = '03'
        AND wiki_id IN ([wiki_ids])
    GROUP BY
        concat(year, '-', month, '-', day)
    ORDER BY
        day ASC
    ''')
# rename columns
df_athena.rename(
    columns={
        'day': 'Date',
        'count[ad_impressions]': 'Impressions'
    },
    inplace=True)
# convert the Date column to datetime, needed for plotting
df_athena['Date'] = pd.to_datetime(df_athena['Date'])
# set index
df_athena.set_index('Date', inplace=True)
df_athena.head()
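The same rename → to_datetime → set_index steps recur for each source. If that pattern grows with more sources, a small helper can factor it out; this is a hypothetical convenience function, not part of Sroka:

```python
import pandas as pd


def tidy(df, rename_map):
    """Rename columns, parse the Date column, and index by Date.

    Hypothetical helper to avoid repeating the same wrangling
    steps for every data source.
    """
    out = df.rename(columns=rename_map)
    out['Date'] = pd.to_datetime(out['Date'])
    return out.set_index('Date')


# toy frame standing in for a raw query result
raw = pd.DataFrame({'day': ['2019-03-01', '2019-03-02'],
                    'count': [5, 7]})
df = tidy(raw, {'day': 'Date', 'count': 'Impressions'})
```

With such a helper, each source needs only its own rename map.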

Having both data sets prepared, we can visualise and compare the results.

# initiate plot
fig, ax = plt.subplots(figsize=(15,6))
ax.plot(df_athena['Impressions'], label='Athena')
ax.plot(df_gam['Impressions'], label='Google Ad Manager')
# plot elements, title, labels etc.
plt.ylabel('# impressions', fontsize=13)
plt.title('[wiki_id] all ad impressions by source', fontsize=16, pad=20)
plt.xticks(rotation=30)
plt.legend(bbox_to_anchor=(1.25, 0.5), frameon=False, fontsize=14)
plt.ylim(0)
plt.yticks([])
ax.xaxis.set_major_locator(mdates.DayLocator(interval=3))
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y/%m/%d'));

Output plot generated in a Jupyter notebook with the code above.

With these few code cells we gain a lot of insight, with no need to jump between data sources and manually download reports. We can immediately see that both data sources follow the same pattern and show the same seasonality, but also that as the impressions increase, so do the discrepancies between the sources. Depending on the situation, this may be a known issue or a signal to verify the reporting.
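The discrepancy can also be quantified rather than just eyeballed on the plot. A minimal sketch, using toy frames with made-up values in place of the real df_gam and df_athena (both indexed by Date, as prepared above):

```python
import pandas as pd

# toy stand-ins for df_gam and df_athena; values are invented
idx = pd.to_datetime(['2019-03-01', '2019-03-02', '2019-03-03'])
df_gam = pd.DataFrame({'Impressions': [100, 210, 320]}, index=idx)
df_athena = pd.DataFrame({'Impressions': [98, 200, 290]}, index=idx)

# join on the shared Date index and compute the relative gap per day
comparison = df_gam.join(df_athena, lsuffix='_gam', rsuffix='_athena')
comparison['rel_diff'] = (
    (comparison['Impressions_gam'] - comparison['Impressions_athena'])
    / comparison['Impressions_gam']
)
```

Sorting or plotting rel_diff then makes it easy to spot the days where the two sources drift apart the most.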

What’s next?

We hope that Sroka will help simplify the process of data acquisition in places outside of Fandom too. If you have any suggestions on how to improve or extend Sroka's capabilities, feel free to open an issue on GitHub (https://github.com/Wikia/sroka). And if you feel that additional data sources should be supported, everyone is welcome to contribute and create a PR.

This article was co-authored by Dorota Mierzwa


Martyna Urbanek-Trzeciak
Fandom Engineering

Data geek by passion, biologist and computer scientist by education, educator by heart.