How discreETLy helped us improve trust and communication between tech and business.

The context

At Fandom we love data. We process vast amounts of data every hour, moving it between services and systems. In this commotion it is easy to lose track of all the ETL (Extract-Transform-Load) pipelines deployed across the organization. Moreover, the plethora of technologies, such as Apache Hive, Apache Spark, AWS Kinesis, Lambda or S3, does not make monitoring any easier. Fortunately, there is Apache Airflow.

The challenge

Handling as much data as we do can become dirty and messy. There are literally hundreds of issues that may emerge daily, and they usually do. Data is constantly shifting, so malformed input is more than common. Add to that network partitions, cluster unresponsiveness and dependency hell. On top of that, information may also be delivered by external vendors, whose APIs' stability and resilience are beyond the reach of Fandom's Data Engineering team. Yes, we are equipped with a ton of patience and knowledge to tackle the whole bestiary of possible issues in the land of data processing. Our users, however, are not.

At some point the Data Engineering team, besides doing its usual work, became a communication hub for every issue with data, whether we could do something about the malfunction or not (in the latter case we just load balanced the flow of information). We wanted to make this communication process more robust and meaningful, and in order to do so the users needed to be equipped with a powerful weapon, the sceptre of knowledge.

I know your first thought: let's give the users access to the Airflow dashboard, right? Problem solved. Well, actually, we found out that such an approach is treacherous. Business users do not usually think in terms of ETL pipelines; their focus revolves around the data that results from running multiple ETL tasks. In that sense, the DAG (Directed Acyclic Graph) view available in the Airflow UI might be misleading, as it may contain dummy tasks, staging tasks, a single task populating multiple tables or some other intermediate steps.

Moreover, giving access to Airflow means the users receive more power than they can handle. Where there is power, there is also the temptation to use it.

At the pinnacle of possible solutions was a dashboard that would map the business user's point of view to parts of ETL pipelines, namely concrete DAG tasks. As Data Engineering had not found such a solution available, the team decided to forge a new product: a dashboard that would empower the users of data with arcane knowledge of data engineering processes. This is how discreETLy was born.

DiscreETLy

DiscreETLy is a dashboard service written on top of Flask, a Python micro-framework. The dashboard retrieves the most important information about the status of ETL processes from Airflow's database. The bare minimum setup allows the users to browse the status of particular DAGs without the need to visit the Airflow UI.
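
To give a rough idea of what reading DAG status from Airflow's database can look like, here is a minimal sketch. It is not discreETLy's actual code. It assumes a simplified stand-in for Airflow's `dag_run` table with only the `dag_id`, `execution_date` and `state` columns, uses an in-memory SQLite database instead of a real Airflow backend, and the DAG names and the `latest_dag_states` helper are made up for illustration.

```python
import sqlite3

# Simplified stand-in for Airflow's `dag_run` table (assumption: only the
# three columns used here; the real table has more).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dag_run (dag_id TEXT, execution_date TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO dag_run VALUES (?, ?, ?)",
    [
        # Hypothetical DAGs and runs, purely illustrative.
        ("pageviews_daily", "2019-03-01T00:00:00", "success"),
        ("pageviews_daily", "2019-03-02T00:00:00", "failed"),
        ("revenue_hourly", "2019-03-02T10:00:00", "running"),
    ],
)


def latest_dag_states(connection):
    """Return {dag_id: state} for the most recent run of each DAG."""
    rows = connection.execute(
        """
        SELECT d.dag_id, d.state
        FROM dag_run d
        JOIN (
            SELECT dag_id, MAX(execution_date) AS last_run
            FROM dag_run
            GROUP BY dag_id
        ) latest
          ON latest.dag_id = d.dag_id
         AND latest.last_run = d.execution_date
        """
    ).fetchall()
    return dict(rows)


states = latest_dag_states(conn)
for dag_id, state in sorted(states.items()):
    print(dag_id, state)
```

A dashboard view only needs this kind of read-only summary, which is why the users never have to touch Airflow itself.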

DiscreETLy is secure. It supports Google OAuth to restrict access. For development purposes the OAuth functionality is optional and can be switched on by providing specific options in the discreETLy configuration file.

DiscreETLy is visually appealing. The dashboard offers concise views that provide just the right amount of information. The users do not feel intimidated by an ocean of irrelevant pieces of data.

DiscreETLy is flexible. Airflow is only one way of governing ETL processes across a company. Although discreETLy focuses mainly on data provided through Airflow, it can display information from other sources as well (this requires a little more work while setting up the service).

DiscreETLy is easy to set up and stateless. Just pull the image from Docker Hub and provide your configuration file with the `-v` flag when executing the `docker run` command. Et voilà, ready to use!

DiscreETLy joins two worlds: business and data engineering. The dashboard maps real warehouse tables to specific DAG tasks through a user-friendly YAML definition. Moreover, tables can be joined into groups if together they make up a particular report (also through a YAML configuration file).
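
The idea of mapping tables to DAG tasks, and grouping tables into reports, can be sketched as follows. This is an illustration only: the field names (`name`, `dag_id`, `task_id`, `tables`), the table and report names, and both helper functions are hypothetical and do not reproduce discreETLy's real YAML schema. The dictionaries stand in for what a parsed YAML file might yield.

```python
# Hypothetical table definitions, shaped like the result of parsing a yaml file.
TABLES = [
    {"name": "pageviews", "dag_id": "pageviews_daily", "task_id": "load_pageviews"},
    {"name": "revenue", "dag_id": "revenue_hourly", "task_id": "load_revenue"},
]

# Hypothetical report definition: a named group of tables.
REPORTS = [{"name": "weekly_kpi", "tables": ["pageviews", "revenue"]}]

# Task states as they might be read from Airflow, keyed by (dag_id, task_id).
TASK_STATES = {
    ("pageviews_daily", "load_pageviews"): "success",
    ("revenue_hourly", "load_revenue"): "running",
}


def table_states(tables, task_states):
    """Translate task states into per-table states, the view business users care about."""
    return {
        table["name"]: task_states.get((table["dag_id"], table["task_id"]), "unknown")
        for table in tables
    }


def report_ready(report, states):
    """A report counts as ready only when every table it depends on has loaded."""
    return all(states.get(name) == "success" for name in report["tables"])


states = table_states(TABLES, TASK_STATES)
for report in REPORTS:
    print(report["name"], "ready" if report_ready(report, states) else "not ready")
```

The point of the indirection is that a user asks "is my table fresh?" and never needs to know which DAG or task produces it.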

If you are interested in more details, have a look at the discreETLy documentation.

DiscreETLy at Fandom

At Fandom, we use discreETLy in production. As the Data Engineering team, we have found that we sometimes start the day by navigating first to discreETLy to get a high-level view of the current situation, and only later to Airflow to fix any emerging issues. Our users are happy that they can easily verify whether a particular task that populates a table in the warehouse is still running, has already finished or has failed. We have noticed that the users have become more thoughtful about the processes, and the tickets they file are more concrete and precise.

The future

We are extremely proud of our solution. DiscreETLy stood up to the task and has been an invaluable ally for data engineering at Fandom since its deployment. However, it is still very young and has a lot to learn.

We plan on developing discreETLy further, and we welcome any help in the process. We would like to invite everybody to try out discreETLy and share their experience with us in any form possible: send us a message to say hi, comment under this blog post, or file an issue on GitHub. There are already open issues on GitHub, so if you feel like it, fork the repo and submit a PR.

We hope that discreETLy will find its place in many data engineering teams as a means of monitoring and communication.