How to build a coronavirus dashboard

Alexandros Nafas
The Startup

An introduction to data pipelines

Ever since the novel coronavirus emerged, I’ve been closely monitoring the numbers. But all the dashboards I found were a bit limited in how they let you view the data. Having previously worked with Druid, I wanted to be able to easily switch between cumulative and daily cases, filter by country, or create my own formulas and sort by them.

So I built it. I found the Johns Hopkins dataset and created a data pipeline for Covid-19. I deployed it to my company’s servers, where we already have Hadoop and Druid clusters as well as the other necessary analytics tools (e.g. Airflow, Metabase), and ta-da, there was my dashboard.

Coronavirus dashboard

But then I thought, why not deploy it somewhere public so I could send a link to friends or anyone else who wanted to see it? Great idea, but deploying Airflow as well as EMR and Druid clusters to AWS would be very costly. Is there a way to minimise the number of required VMs? Can you deploy everything to a single one?

Turns out you can. In this article, we’ll go through how to build all the different parts of the pipeline and how to bring them together on a single machine.

Disclaimer: this won’t be so much an in-depth explanation of the technologies used as a beginner’s introduction to data pipelines. We’ll walk through the process of designing and building a data pipeline, containerizing it, and getting it all to run locally.

Our plan

First, we need a plan. Our goal is to schedule a workflow that executes daily and loads the Johns Hopkins dataset into Druid. Sounds easy, right? But wait, when we take a look at the dataset we can immediately spot a problem: there’s no “date” column. The table is pivoted, with the dates as columns instead of rows:

Sample from Johns Hopkins dataset

This format is not suitable for Druid, a time-partitioned data store, so we need to transform it first. Once the data has been transformed, we can ingest it into Druid. In short, we’ll be building an ETL (extract, transform, load) process which looks something like this:

extract (download the dataset) -> transform (unpivot the tables) -> load (ingest into Druid)

After our data has been loaded into Druid, we can use a visualization tool to query it. With our plan set, let’s start building. But before we do, now is a good time to git clone the repo of this dashboard so you can follow along.

Transforming our dataset

For the first step of our pipeline, we need to download two CSV files from the dataset (one for confirmed cases and one for deaths) and unpivot them. And what better tool for the job than Spark, an analytics processing framework built for big data? Below is the code for unpivoting the table:

Spark unpivot function

You can find this function, as well as the rest of the Spark project, here. It might look a bit complicated at first, but in essence what we are doing is the following (a rough code sketch follows the list):

  • load the two CSV files from GitHub into DataFrames
  • unpivot the two DataFrames
  • join them into one
  • subtract from each day the previous day in order to get the daily numbers
  • write the result to a new CSV file
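
To give a flavour of what that looks like in code, here is a minimal PySpark sketch of the same steps. It is not the project’s actual Spark code, just an illustration: CONFIRMED_URL, DEATHS_URL and OUTPUT_PATH are placeholders for the two CSV URLs and the output folder, and it assumes Spark is running in local mode, as it does in our single-machine setup.

from pyspark import SparkFiles
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("coronavirus-sketch").getOrCreate()

CONFIRMED_URL = "<confirmed_url>"  # placeholder: raw CSV URL for confirmed cases
DEATHS_URL = "<deaths_url>"        # placeholder: raw CSV URL for deaths
OUTPUT_PATH = "<output_path>"      # placeholder: folder for the transformed CSVs

ID_COLS = ["Province/State", "Country/Region", "Lat", "Long"]

def load_csv(url):
    # addFile can fetch http(s) URLs; reading the downloaded copy works in local mode
    spark.sparkContext.addFile(url)
    local_path = "file://" + SparkFiles.get(url.split("/")[-1])
    return spark.read.csv(local_path, header=True, inferSchema=True)

def unpivot(df, value_name):
    date_cols = [c for c in df.columns if c not in ID_COLS]
    # stack() turns the wide date columns into (date, value) rows
    stack_expr = "stack({}, {}) as (date, {})".format(
        len(date_cols),
        ", ".join("'{0}', `{0}`".format(c) for c in date_cols),
        value_name,
    )
    return df.selectExpr(*["`{}`".format(c) for c in ID_COLS], stack_expr)

confirmed = unpivot(load_csv(CONFIRMED_URL), "confirmed")
deaths = unpivot(load_csv(DEATHS_URL), "deaths")

joined = (confirmed
          .join(deaths, on=ID_COLS + ["date"])
          .withColumn("date", F.to_date("date", "M/d/yy")))

# subtract the previous day's cumulative total to get daily numbers
w = Window.partitionBy("Province/State", "Country/Region").orderBy("date")
daily = (joined
         .withColumn("daily_confirmed", F.col("confirmed") - F.lag("confirmed", 1, 0).over(w))
         .withColumn("daily_deaths", F.col("deaths") - F.lag("deaths", 1, 0).over(w)))

daily.write.mode("overwrite").csv(OUTPUT_PATH, header=True)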

We can execute the code by running the CoronavirusJob class with the following arguments:

CoronavirusJob <confirmed_url> <deaths_url> <output_path>

where:

  • <confirmed_url> is the URL of the confirmed cases CSV
  • <deaths_url> is the URL of the deaths CSV
  • <output_path> is the local path where the transformed CSV files will be written

Also, be sure to include the dependencies with “provided” scope (in IntelliJ it’s just a click inside the run configuration). After the job finishes, if we go to <project_folder>/.filesystem/spark_out we should see a bunch of CSV files, which are in fact one table split into multiple parts. Notice that the table now contains a “date” column, which is exactly what we wanted from this step.

Loading our transformed data

For our next step, we want to load the transformed data into Druid, a high-performance analytics database. This step is much easier, as the data is now in the format we want and Druid provides a user interface we can use. The only problem is that we need access to a running Druid instance. Setting up a new cluster is normally a very time-consuming process, which involves sizing and provisioning numerous nodes with specific resources for the different services (ZooKeeper, broker, coordinator etc.).

Luckily, the people at Apache have us covered, as they’ve included scripts to run lightweight versions of all those processes on a single machine. And with the help of our old friend Docker, we can have it up and running with just one command:

$ docker-compose up druid

It might take a while for the image to build and the processes to start. Once they are up and running we can go to http://localhost:8888/ and select:

Load data -> Start a new spec -> Edit spec

and paste the following JSON:

Druid spec file
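
As an aside, if you’d rather script this step than click through the console, the same spec can be submitted over HTTP. Here is a small Python sketch, assuming the spec is saved locally as druid_spec.json and that the router on port 8888 proxies the indexer API, as it does in the quickstart configuration:

import json
import requests

# Submit the ingestion spec to Druid's task endpoint (proxied by the router)
with open("druid_spec.json") as f:
    spec = json.load(f)

response = requests.post("http://localhost:8888/druid/indexer/v1/task", json=spec)
response.raise_for_status()
print("Submitted ingestion task:", response.json()["task"])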

Thanks to Docker volumes, the ingestion job will be able to access the files generated in our previous step. After approximately 10 seconds the ingestion task should complete, which means our data is now loaded into Druid and ready to be queried. Time to put a nice front end on our data pipeline.

Adding a presentation layer

There is a wide variety of visualization tools that work well with Druid. For our dashboard we chose Metabase, a powerful yet user-friendly business intelligence tool. The dashboard configuration is already included in the repo inside the metabase folder, so all we need to do is:

$ docker-compose up metabase

After that we can view our dashboard here, or we can go to http://localhost:3000 to create and edit plots with our freshly crunched data, logging in with the following credentials:

  • Email: admin@admin.com
  • Password: qwerty123

Scheduling our pipeline

So we’ve loaded our dataset and we can query it from our shiny new dashboard. Looks like we’re done, right? Sure, if we don’t care about getting the latest numbers. But of course we care: what’s the use of an analytics dashboard that doesn’t refresh? So for our last step, we’ll schedule our pipeline to run periodically and pick up the latest changes to the dataset.

For that we’ve chosen Airflow, another great tool from Apache, which will allow us to schedule our workflow and monitor it through a graphical interface. Airflow works with DAGs (Directed Acyclic Graphs), so we need to define a simple one that will automate the steps that we manually executed earlier:

Coronavirus DAG

In other words, we want our workflow to first (a) execute the Spark application and then (b) send the output files to be loaded into Druid via an indexing job.

For Airflow to be able to trigger a Spark application, it needs Java installed, as well as the Spark binaries. Once again our friend Docker is going to help us skip all the tedious work, by using an image built specifically for our needs. We’ll also use a build stage in our Dockerfile to compile the Spark app and copy the resulting jar to /usr/local/airflow/deps/spark.jar.

All that’s left is to write a Python file with the definition of our two tasks. For our Spark task, the BashOperator can be used to execute any bash script:

Airflow Spark task
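
In rough terms, the task boils down to something like the sketch below (an illustration, not necessarily the exact code in the repo). The dataset URLs are left as placeholders, the jar path is the one produced by the Docker build stage above, and dag refers to the DAG object defined in the same file, which we’ll get to in a moment. The import path is the Airflow 1.10-style one.

from airflow.operators.bash_operator import BashOperator

# Run spark-submit against the jar copied in by the Dockerfile build stage.
# Use the fully qualified class name if CoronavirusJob lives inside a package.
spark_task = BashOperator(
    task_id="spark_transform",
    bash_command=(
        "spark-submit --class CoronavirusJob /usr/local/airflow/deps/spark.jar "
        "{{ params.confirmed_url }} {{ params.deaths_url }} {{ params.output_path }}"
    ),
    params={
        "confirmed_url": "<confirmed_url>",  # placeholder
        "deaths_url": "<deaths_url>",        # placeholder
        "output_path": "<output_path>",      # placeholder
    },
    dag=dag,
)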

And here is the definition of our Druid task, where druid_spec.json is the index specification json that we used earlier:

Airflow Druid task
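
One way to define it is with Airflow’s DruidOperator, which submits an index spec file to Druid through a configured connection. A sketch, assuming the Airflow 1.10 contrib import path and the default connection id (both assumptions on my side):

from airflow.contrib.operators.druid_operator import DruidOperator

# Submit the ingestion spec to Druid; the druid_ingest_default connection
# must point at our local Druid instance.
druid_task = DruidOperator(
    task_id="druid_ingest",
    json_index_file="druid_spec.json",
    druid_ingest_conn_id="druid_ingest_default",
    dag=dag,
)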

Finally, we need to tell Airflow in which order to execute these tasks, as well as how often to execute them. Although we can see from the GitHub repo that there’s an automated update to the dataset every night at 23:50 GMT, we can also see some sporadic corrections during the rest of the day. In order not to miss those updates, let’s set our DAG’s schedule interval to every 3 hours, which would be defined as:

0 */3 * * *
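
Putting it all together, the DAG object and the task ordering might look something like the sketch below (in the actual DAG file the DAG object comes before the task definitions; the start_date, retries and catchup settings here are illustrative defaults, not taken from the repo):

from datetime import datetime, timedelta
from airflow import DAG

dag = DAG(
    dag_id="coronavirus",
    default_args={"owner": "airflow", "retries": 1, "retry_delay": timedelta(minutes=5)},
    start_date=datetime(2020, 3, 1),  # illustrative start date
    schedule_interval="0 */3 * * *",  # every 3 hours
    catchup=False,                    # don't backfill past intervals
)

# run the Spark transformation first, then the Druid ingestion
spark_task >> druid_task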

You can see the whole DAG definition here, and with everything in place we can start our scheduler:

$ docker-compose up airflow

Then we can go to the Airflow UI at http://localhost:8080 and switch on the coronavirus DAG:

Airflow UI

And that’s it: our scheduler is now up and running. It will “wake up” every 3 hours and reload the updated dataset into Druid. We can now say with confidence that we have a solid data pipeline.

Wrapping up

In this tutorial we saw how to build a small but complete analytics pipeline with all the basic elements: an ETL process that pulls data from one data source and loads it into another, a scheduler that runs the process periodically, and a presentation layer that displays the results.

All these components can be easily started locally with the following command:

$ docker-compose up

While we focused on building a coronavirus dashboard, the same principles can be used to solve a wide range of analytics problems. You can find all the code in the GitHub repo; feel free to copy or fork it to create your own dashboards.
