Automated Scheduled Reporting using Airflow

Ayush Dosajh
Pasarpolis Product + Tech
4 min read · May 27, 2020

We have four types of scheduled reports at PasarPolis, each on a different platform. Most of them, however, use the same data source: the Reporting dataset on BigQuery -

  1. In-house app/cron scheduled reports
  2. Metabase Pulse which are reports sent over Slack or Email
  3. Tableau scheduled reports and dashboards sent as emails with attachments
  4. Airflow scheduled reports and dashboards sent as emails with attachments

Create and Run Scheduled Reports on Airflow

Here, we elaborate on the Airflow scheduled reports, category 4 in the list above. Reports currently in category 1 are being moved to category 4 as we phase out the custom apps. The consumers of these insights receive the reports and dashboards as Excel or CSV attachments.

A canonical scheduled reporting requirement looks like this: send weekly reports to our partners and providers that help them track the performance of our insurance products. Each report contains a detailed data extract and a summary dashboard that visualizes how metrics change, for example, the payout ratio over time.

Broadly, each such requirement is the same — run a query on BigQuery, write the output to GCS, send the email with the link to download the output. Things that change per requirement are — the specific query, whom to send the output to, format choice, the schedule, etc. With this understanding, we implemented a Standardized Automated Reporting approach.

The graph for any fresh report implementation looks like the diagram below.

Reports could have any schedule — daily, weekly, or monthly. We create a unique DAG for each fresh requirement using our standardized approach. That makes it easy for us to track their historic runs.

We created a custom operator on Airflow that allows us to use our in-house Notification Service to send an email (parameterized content) with the reference URL to download the output file upon completion of the data processing job.
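A minimal sketch of such an operator is shown below. The class name, the Notification Service endpoint, and the request fields are placeholders for illustration; what matters is that the operator POSTs the recipients and the GCS download URL to the service once the data job finishes.

```python
# Sketch of the custom notification operator (Airflow 1.10-style imports).
# The endpoint URL and request shape below are assumptions, not our exact API.
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults
import requests


class NotificationEmailOperator(BaseOperator):
    """Sends a templated email via the in-house Notification Service,
    embedding the GCS download URL of the finished report."""

    template_fields = ("recipients", "subject", "download_url")

    @apply_defaults
    def __init__(self, recipients, subject, download_url, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.recipients = recipients
        self.subject = subject
        self.download_url = download_url

    def execute(self, context):
        payload = {
            "to": self.recipients,
            "subject": self.subject,
            "body": "Your report is ready: {}".format(self.download_url),
        }
        # Hypothetical internal endpoint of the Notification Service.
        requests.post("http://notification-service/api/v1/email", json=payload)
```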

Anybody within the tech team can create scheduled reports using the Reporting Automation module we built on Airflow. That individual need not be familiar with Airflow at all. The module provides a configuration-driven, generic implementation that automates creating a DAG for each specific requirement.

The report requestor/creator has to provide a payload containing information such as the SQL to fetch the data, the list of email recipients, and so on.

Its structure looks like this -
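The sketch below is representative; the field names are illustrative rather than our exact schema.

```python
# Illustrative payload for one scheduled report (field names are assumptions).
payload = {
    "dag_id": "pp_weekly_partner_report",
    "schedule_interval": "0 6 * * 1",        # every Monday at 06:00
    "sql": "SELECT ... FROM reporting.claims WHERE ...",
    "output_format": "xlsx",                 # "csv" or "xlsx"
    "gcs_bucket": "pp-reports",
    "recipients": ["partner-ops@example.com"],
}
```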

This payload is then appended to the master dictionary. The master dictionary, which contains multiple payloads, is in turn consumed by the Driver Code to spawn all the DAGs.
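In code terms, that looks roughly like this; general_payload is the master dictionary referenced later in this post.

```python
# The master dictionary maps each dag_id to its payload.
# Every new report adds exactly one more entry here.
general_payload = {
    "pp_weekly_partner_report": payload,
    # "pp_daily_claims_report": {...},  # further reports are appended the same way
}
```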

The Driver Code leverages Airflow's behaviour of picking up DAG references from the global scope.

  1. The Driver Code loops over the master dictionary and creates a DAG in the global scope using globals()[dag_id] = dag_reference (preferably a function that takes the dag_id as input and returns a DAG)

The DAG reference function looks like this -
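The sketch below assumes the payload fields shown earlier and the stock BigQuery/GCS operators from Airflow's contrib package; the real function carries more of our specifics.

```python
# Sketch of a DAG factory: one payload in, one DAG out.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.contrib.operators.bigquery_to_gcs import BigQueryToCloudStorageOperator

# The custom operator sketched earlier; the import path is a placeholder.
from notification_email_operator import NotificationEmailOperator


def dag_reference(dag_id, config):
    """Builds one report DAG from its payload in the master dictionary."""
    dag = DAG(
        dag_id=dag_id,
        schedule_interval=config["schedule_interval"],
        start_date=datetime(2020, 1, 1),
        default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
        catchup=False,
    )
    # {{ ds }} is templated by Airflow to the run date at execution time.
    output_uri = "gs://{}/{}/{{{{ ds }}}}.csv".format(config["gcs_bucket"], dag_id)
    with dag:
        run_query = BigQueryOperator(
            task_id="run_query",
            sql=config["sql"],
            destination_dataset_table="reporting.tmp_" + dag_id,
            write_disposition="WRITE_TRUNCATE",
            use_legacy_sql=False,
        )
        export_to_gcs = BigQueryToCloudStorageOperator(
            task_id="export_to_gcs",
            source_project_dataset_table="reporting.tmp_" + dag_id,
            destination_cloud_storage_uris=[output_uri],
        )
        notify = NotificationEmailOperator(
            task_id="send_email",
            recipients=config["recipients"],
            subject="Scheduled report: " + dag_id,
            download_url=output_uri,
        )
        run_query >> export_to_gcs >> notify
    return dag
```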

To assign a global reference to a DAG -
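A minimal version, using the dag_reference factory above:

```python
# Driver loop: expose one DAG per payload so Airflow's DAG parser picks
# all of them up from the module's global scope.
for dag_id, config in general_payload.items():
    globals()[dag_id] = dag_reference(dag_id, config)
```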

Note that general_payload here is the master dictionary that holds all the payloads.

  2. Operators are attached to each of the DAGs. This lets us scale the whole workflow to accommodate any number of reports with minimal dev time invested.
  3. We can customize the DAGs however we want. By leveraging Airflow Variables, we can run a DAG for any date without tinkering with the child DAG or adding another key, e.g. {"pp_dag_1": "date_to_run"} (a rough sketch follows below).
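A rough sketch of that Variable-driven override (the Variable name report_run_dates is hypothetical) could sit inside the factory:

```python
from airflow.models import Variable

# Hypothetical Variable "report_run_dates" holding, e.g., {"pp_dag_1": "2020-05-01"}.
# Each DAG picks up its own override and falls back to the scheduled date.
run_dates = Variable.get("report_run_dates", default_var={}, deserialize_json=True)
date_to_run = run_dates.get(dag_id, "{{ ds }}")
```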

Features and advantages

  1. Allow the report creator to choose the output type, CSV or Excel. Writing CSV to GCS from BigQuery is available out of the box using the BigQueryOperator. For Excel, we created a cloud function that takes the CSV as input and converts it into Excel; it is triggered whenever that choice is made in the input payload (a rough sketch follows this list).
  2. Break a detailed output file into smaller units based on row count, since opening a file with 500,000+ rows is difficult.
  3. Overcome the file size/data extraction limits that self-service reporting tools such as Tableau and Metabase impose, which become roadblocks for detailed reports. Tableau can theoretically load unlimited rows, but it gives way when trying to view over a million rows in tabular format. Metabase cannot create an Excel/CSV download from its GUI if the row count is greater than 16,000 and the data source is BigQuery.
  4. Integrated with our in-house Notification Service that sends out emails and SMS: smaller files are sent as attachments, while larger files have their GCS download paths linked within the email content.
  5. Allows the report creator to run back-dated jobs by simply giving the beginning and ending dates as input in the JSON file
  6. Report creator need not learn Airflow
  7. Save time scheduling a fresh report
  8. Quick deployment
  9. Standardization means there is very little chance of things not working as expected for the creator, and debugging becomes easy
  10. Out-of-the-box advantages of using Airflow — easy maintenance and management of all the jobs
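
The CSV-to-Excel cloud function mentioned in point 1 could be sketched roughly as below; the trigger wiring, bucket layout, and library choice are simplified assumptions rather than the exact production code.

```python
# Sketch of a GCS-triggered Cloud Function that converts a CSV report to Excel.
# Requires pandas plus an Excel writer such as openpyxl in the function's deps.
import pandas as pd
from google.cloud import storage


def csv_to_excel(event, context):
    """Triggered when a new object lands in the reports bucket; if it is a CSV,
    writes an .xlsx with the same name next to it."""
    bucket_name = event["bucket"]
    blob_name = event["name"]
    if not blob_name.endswith(".csv"):
        return

    client = storage.Client()
    bucket = client.bucket(bucket_name)

    local_csv = "/tmp/" + blob_name.split("/")[-1]
    local_xlsx = local_csv.replace(".csv", ".xlsx")

    bucket.blob(blob_name).download_to_filename(local_csv)
    pd.read_csv(local_csv).to_excel(local_xlsx, index=False)
    bucket.blob(blob_name.replace(".csv", ".xlsx")).upload_from_filename(local_xlsx)
```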
