COVID-19 — Data Correlation among Cases, Tweets, Mobility, Flights & Weather with Azure

Mauro Krikorian
SOUTHWORKS
Apr 5, 2021

Continuing with the Azure Data Services family, this time we bring you a scenario that demonstrates how to use data analysis tools to build a modern data platform capable of handling one of the most common data challenges in an organization.

The solution described in this article combines a range of Azure Data Services that are used to ingest, process, store, serve, and visualize data from different sources, both structured and unstructured.

This solution architecture shows how a single unified data platform can be used to meet the most common requirements for:

  • Handling traditional relational data pipelines
  • Performing big data transformations
  • Serving insights for data-driven applications and rich data visualization

Defining the scenario

For the sake of this exercise, we intend to identify the impact of different factors on the evolution of the COVID-19 pandemic and how that evolution is reflected in social networks, in this case Twitter. For this purpose, we collect, via public APIs, daily information on incoming flights, weather temperatures, traffic and mobility data, tweets with the tag #covid and, of course, data on the pandemic (new cases and demises).

The plan is to leverage different tools provided by Azure, mainly around 2 high-level use cases:

  1. Dynamic reports: Which consume data on demand and perform the corresponding calculations to generate reports in real time.
  2. Batch reports: Which display previously normalized and calculated information. This process must be executed once a day.

Related use cases

This approach can also be used to:

  • Establish an enterprise-wide data hub consisting of a data warehouse for structured data and a Data Lake for semi-structured and unstructured data. This data hub becomes the single source of truth for data.
  • Integrate relational data sources with other unstructured datasets with the use of big data processing technologies.
  • Use semantic modeling and powerful visualization tools for simpler data analysis.

Selected Dataset APIs

All of the dataset APIs used in the development of this application are free to use or provide a free subscription account:

  • OpenSky for airport arrivals datasets.
  • Weather Bit for current and historic weather information.
  • Covid-19 public historical pandemic data.
  • Mobility for JAM factor of different cities.
  • Tweepy for tweet data.

Designing the platform architecture

A modern data platform using Azure Data services:

Data Ingestion

  1. Azure Data Factory pipelines pull data from a wide variety of semi-structured and structured data sources, both on-premises and in the cloud.
  2. Still as part of the Azure Data Factory pipeline, Azure Data Lake Storage Gen2 stores big data files retrieved from the data sources. You can save the data in delimited text format or compressed as Parquet files. Azure SQL stores the relational data that we want to keep that way.

Data Preparation & Processing

1. Dynamic Reports

  • Azure Synapse PolyBase is used for fast ingestion into data warehouse tables.
  • It also normalizes and processes data via Stored Procedures and Views.
  • Relevant data from the Azure Synapse DW is loaded into Power BI datasets for data visualization. Power BI models implement a semantic model to simplify the analysis of business data and relationships.

2. Batch Reports

  • Azure Databricks notebooks are used to process the data and apply the necessary transformations to it before it is used for reporting. Resulting datasets are stored as Parquet files in the Data Lake.
  • Just as with dynamic reports, Power BI models implement a semantic model from Data Lake files to simplify the analysis of business data and relationships.

Data Visualization

Power BI dashboards are embedded into apps that can be shared with different stakeholders and parties.

Considerations

The technologies chosen in this architecture provide the necessary functionality to handle the scenario described above. These services also meet the requirements for scalability and availability, while giving you the chance to control costs accurately.

  • The massively parallel processing architecture of Azure Synapse provides scalability and high performance.
  • Azure Synapse has guaranteed SLAs and recommended practices for achieving high availability.
  • When analysis activity is low, the company can scale Azure Synapse on demand, reducing or even pausing computing to lower costs.
  • The Azure Synapse security model provides connection security, authentication and authorization via Azure AD or SQL Server authentication, and encryption.
  • Using the free tier of the dataset APIs can result in data being truncated by date or geographic area.

Alternatives

  • For situations where device management, authentication, and provisioning are required, Azure IoT Hub may be a preferred solution to Event Hubs. Event Hubs should still be considered for streaming other data sources.
  • In the architecture above, Azure Data Factory is the service responsible for data pipeline orchestration. Azure Databricks can also be used to perform the same role through the execution of nested notebooks.
  • In the architecture above, Azure Databricks and Azure Synapse Analytics were used to normalize and process the data independently for different reports. Azure Databricks could first be used to normalize and preprocess the data to be consumed later by Azure Synapse Analytics.

Cleaning & preparing the ingested data

Incoming null values

The information provided by the APIs we used often contains null values. Therefore, the first challenge encountered is trying to infer information to fill in the blanks.

  • Databricks approach: Notebooks provide a comfortable environment to manage data and perform calculations. We chose Python (among the supported languages; other options are R and Scala) to develop our notebooks, accessing all the useful libraries available in a Spark cluster, including Pandas. So, for this case we just switched our PySpark DataFrame into a Pandas DataFrame and leveraged the interpolate function provided by Pandas (see the sketch after this list).
  • Synapse approach: Unfortunately, with Synapse the solution was not as simple in terms of development. We decided to use only the SQL pool and, for the time being, discarded any development with Spark in Synapse. So, using just T-SQL and a few additional features, we applied Linear Interpolation as the mathematical method to estimate null values.
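Below is a minimal sketch of the Databricks side of this, assuming a PySpark DataFrame with a numeric column that contains nulls (the names weather_df and temperature are illustrative, not the actual notebook identifiers):

```python
# Minimal sketch of the null-filling step in a Databricks notebook.
# weather_df and "temperature" are placeholder names.
import pandas as pd

def fill_missing_values(spark_df, column: str) -> pd.DataFrame:
    # Switch from a PySpark DataFrame to a Pandas DataFrame so we can use
    # Pandas' interpolate function.
    pdf = spark_df.toPandas()

    # Linear interpolation estimates each missing value from its neighbors;
    # limit_direction="both" also fills leading/trailing gaps.
    pdf[column] = pdf[column].interpolate(method="linear", limit_direction="both")
    return pdf

# Usage inside a notebook cell:
# weather_pdf = fill_missing_values(weather_df, "temperature")
```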

Processing ingested data & building the reports

Data correlation

Correlation is any statistical association, though it commonly refers to the degree to which a pair of variables are linearly related.

To find the correlation per country between the different data sources and the COVID data, we used the Correlation Coefficient to identify how each separate data source correlates with COVID data.

  • Databricks approach: To obtain the Correlation Coefficient we took advantage of Databricks' ability to execute Python code and used the corr function of Pandas. In this case, given how simple it was in terms of development, we calculated the Pearson, Kendall, and Spearman correlation coefficients. These provided different interpretations, which allowed us to understand a little more about how the data is interrelated (a minimal sketch follows after this list).
  • Synapse approach: We used the SQL pool and wrote the calculation required to get the Pearson correlation coefficient directly in T-SQL. The population correlation coefficient ρX,Y between two random variables X and Y with expected values µX and µY and standard deviations σX and σY is defined as:

$$\rho_{X,Y} = \operatorname{corr}(X,Y) = \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y}$$

where E is the expected value operator, cov means covariance, and corr is a widely used alternative notation for the correlation coefficient. The Pearson correlation is defined only if both standard deviations are finite and positive. An alternative formula purely in terms of moments is:

$$\rho_{X,Y} = \frac{E[XY] - E[X]\,E[Y]}{\sqrt{E[X^2] - E[X]^2}\;\sqrt{E[Y^2] - E[Y]^2}}$$
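For illustration, here is a minimal sketch of the Databricks-side calculation with Pandas; the country_df frame and its cases/flights columns are made-up stand-ins for the normalized datasets:

```python
import pandas as pd

# Hypothetical daily series for one country; in the real notebook these come
# from the normalized Data Lake datasets.
country_df = pd.DataFrame({
    "cases":   [100, 150, 210, 180, 260, 300],
    "flights": [80, 95, 120, 110, 140, 160],
})

# Pandas' corr supports the three coefficients we compared.
for method in ("pearson", "kendall", "spearman"):
    coefficient = country_df["cases"].corr(country_df["flights"], method=method)
    print(f"{method}: {coefficient:.3f}")
```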

Building dynamic reports using Azure Synapse Analytics

We worked with both a dedicated SQL pool and a serverless SQL pool. The choice depends on how accessible your data is and on the amount of data you plan to process and store. In our case we decided to use the serverless SQL pool, as it was the option that best fit our budget and delivered the functionality we wanted with the expected performance.

This is how we are using Azure Synapse in this scenario (Synapse configurations and T-SQL code are included in the provided templates — check the repo):

  1. Set up PolyBase to load the data into tables. Twitter data is copied directly to the warehouse using a CopyData pipeline in Azure Data Factory, while the rest of the data is loaded from the files in Azure Data Lake Storage using external tables.

Collecting Twitter Data: Since Twitter’s Public API does not offer a way to directly get the number of tweets containing a specific hashtag or word, we developed a Python script that queries the API for sets of tweets containing the #COVID hashtag and stores them. To run the script, we used an Azure Function running on a daily timer. The Azure Data Factory pipeline is triggered when a file is uploaded to the corresponding storage container. Upon being triggered, the data from the container is copied for further processing.
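The following is a condensed, hypothetical sketch of such a timer-triggered function using Tweepy and the Azure Storage SDK; the names, credentials, and fields stored here are placeholders and differ from the script in the repo:

```python
import datetime
import json
import os

import azure.functions as func
import tweepy
from azure.storage.blob import BlobServiceClient

def main(mytimer: func.TimerRequest) -> None:
    # Application-only auth against the Twitter API (Tweepy 4.x); credentials
    # come from the Function App settings.
    auth = tweepy.AppAuthHandler(os.environ["TWITTER_KEY"], os.environ["TWITTER_SECRET"])
    api = tweepy.API(auth, wait_on_rate_limit=True)

    # Page through recent tweets containing the #COVID hashtag.
    tweets = [
        {"id": status.id, "created_at": status.created_at.isoformat()}
        for status in tweepy.Cursor(api.search_tweets, q="#COVID").items(1000)
    ]

    # Upload the day's batch to the storage container that triggers the
    # Azure Data Factory pipeline.
    blob_service = BlobServiceClient.from_connection_string(os.environ["STORAGE_CONNECTION"])
    blob_name = f"tweets/{datetime.date.today().isoformat()}.json"
    blob_service.get_blob_client(container="raw-tweets", blob=blob_name).upload_blob(
        json.dumps(tweets), overwrite=True
    )
```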

2. Set up external tables to serve data.

The use of External Tables ensures that the data is queried directly from the files in Azure Data Lake Storage without going through an intermediate table. This way we were able to query the data as it was being generated.

3. Set up SQL views to perform calculations and join data from different tables.

As we need to perform certain calculations while inferring data to fill empty data points, and we need to normalize the data of the different sources, we decided to use a SQL View instead of performing that process in Power BI for performance reasons. The report loads faster if data processing is delegated to Synapse.

4. Set up Synapse copy data pipelines to generate static and processed data to visualize in Power BI.

Some data is static and does not change frequently, for example countries, cities, and airports. We decided to use a pipeline to copy it into Synapse, which is less expensive than querying it directly from Data Lake Storage. In addition, this way we were able to apply complex data transformations up front, which is a benefit when loading a report in Power BI.

Historic tracker report

A simple report that just shows all the captured data (number of Demises, Flights, Jam Factor and Temperature) by date in a multiline chart. We are using two Y axes to show two different scales: one for the Cases and Flights (Y axis on the left), and the other for the Jam Factor (Mobility Index) and Temperature (Y axis on the right).

Tweets, Demises, and Cases report

Shows the relation between Tweets containing the #COVID hashtag, Cases, and Demises by date. We are using two Y axes to show two different scales: one for Tweets and Demises (Y axis on the left), and the other for the Cases (Y axis on the right). This chart is global, in order to understand the pandemic's evolution and people's concerns about it.

Pandemic evolution vs people’s interest/concern

COVID forecast report

Based on historical data, we are forecasting the growth of COVID-19 cases. We are using an external Power BI visualization tool that handles the time series forecasting with one of the most popular forecasting algorithms, ARIMA. There are other algorithms that could be used; another very popular one is ETS. You can learn more about time series forecasting algorithms and the differences between ETS and ARIMA in this article.
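The forecasting itself is handled by the Power BI visual, but for illustration, this is roughly what an ARIMA forecast looks like in Python with statsmodels (the series values and the (1, 1, 1) order are placeholders, not tuned values from our data):

```python
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical daily case counts indexed by date.
cases = pd.Series(
    [120, 135, 150, 170, 160, 185, 210, 230, 225, 260],
    index=pd.date_range("2021-03-01", periods=10, freq="D"),
)

# Fit an ARIMA(p, d, q) model; (1, 1, 1) is a placeholder order, not a tuned one.
model = ARIMA(cases, order=(1, 1, 1)).fit()

# Forecast the next 7 days of cases.
print(model.forecast(steps=7))
```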

Building batch reports using Azure Databricks

Our Databricks cluster runs a notebook that normalizes and processes the raw data and produces the information used for reporting.

The behavior of the process developed in the notebook can be summarized in the following 6 steps (a condensed sketch of the first steps follows the list):

  1. Establishes the connection with Azure Data Lake to upload the files with the raw data.
  2. Mounts the containers and folders that store the files with the raw data; this operation improves the performance of the reading/querying process.
  3. Loads the information from the files into data frames for normalizing and processing.
  4. Normalizes the information of the different data frames and infers the missing and/or null data, to combine it all into a single data frame.
  5. Processes the data of each report.
  6. Correlates the data, evaluating the influence and relevance of each of the evaluated factors in the evolution of COVID.
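As a condensed sketch of steps 1 to 3 (the storage account, container, secret scope, and mount point names are placeholders; dbutils and spark are only available inside a Databricks notebook):

```python
# Step 1: service principal credentials for Azure Data Lake Storage Gen2,
# pulled from a Databricks secret scope (placeholder names).
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": dbutils.secrets.get("scope", "client-id"),
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get("scope", "client-secret"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

# Step 2: mount the Data Lake container holding the raw files.
dbutils.fs.mount(
    source="abfss://raw@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/raw",
    extra_configs=configs,
)

# Step 3: load the raw files into Spark DataFrames for normalization.
covid_df = spark.read.parquet("/mnt/raw/covid/")
flights_df = spark.read.parquet("/mnt/raw/flights/")
```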

Past 15 days tracker report

In this report we are showing data for the past 15 days (similar to the Historic Tracker report), but introducing a 15-day time offset between the COVID dataset and the others (a small sketch of the shift follows the figure). We tried to identify the relation of cases and reported demises to several factors like flights, mobility, etc. We are gathering data from the following countries: USA, Brazil, Spain, France and Japan.

COVID to flights, mobility, temperature relation using a 15 days offset for US
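One way to introduce that offset, sketched here with Pandas and hypothetical column names, is to shift the COVID columns so that each day's flights, mobility, and temperature line up with the cases reported roughly two weeks later:

```python
import pandas as pd

OFFSET_DAYS = 15

def apply_covid_offset(df: pd.DataFrame) -> pd.DataFrame:
    # df is a merged daily frame for one country with a DatetimeIndex and
    # hypothetical "cases"/"demises" columns. Shifting them back 15 rows
    # aligns today's factors with the COVID figures reported 15 days later.
    shifted = df.copy()
    shifted[["cases", "demises"]] = shifted[["cases", "demises"]].shift(-OFFSET_DAYS)
    return shifted.dropna()
```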

Data correlation report

These gauges show how strongly each factor is related to COVID evolution. We are using the Correlation Coefficient to evaluate the relation. The value of this coefficient lies between -1 and 1, where values close to 1 indicate a strong positive correlation, values close to -1 a strong inverse correlation, and values close to 0 little or no correlation.

Data correlation for demises in the US

Sharing data visualizations

We created 2 Web Apps leveraging different approaches to embedding Power BI reports: a simple serverless JavaScript/React application that uses Azure AD B2B for authentication and login through our tenant, and a server-side MVC application developed with .NET Core that uses Azure AD B2C to allow external users to authenticate and log in.

Azure AD B2B approach

Using Azure AD B2B authentication service, the user is redirected to a Microsoft account login site when first entering the web app. Upon logging in successfully, the user is redirected to the original website, where the reports are embedded utilizing the access token from B2B (you can see more about it in the repo).

Azure AD B2C approach

Using Azure’s B2C authentication service, the user is redirected to a Microsoft login site (in a pop-up browser window) after pressing the “Sign-in” button. Upon logging in successfully, the user is redirected to the original website, where the reports are embedded by calling a custom API (using the access token from B2C) that connects to the Power BI API as a Service Principal and gets the required embed token (a rough sketch of this flow follows below).

We enabled a custom “User Flow” to sign up and sign in users via standard Azure Active Directory and some social networks, called “Identity Providers” in Azure (you can see more about it in the repo).
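A rough Python sketch of what that custom API does, acquiring a Service Principal token with MSAL and exchanging it for an embed token through the Power BI REST API (the actual implementation in the repo is .NET Core, and all identifiers here are placeholders; error handling is omitted):

```python
import os
import requests
import msal

# Acquire an Azure AD token for the Power BI API as a Service Principal.
app = msal.ConfidentialClientApplication(
    client_id=os.environ["CLIENT_ID"],
    client_credential=os.environ["CLIENT_SECRET"],
    authority=f"https://login.microsoftonline.com/{os.environ['TENANT_ID']}",
)
token = app.acquire_token_for_client(
    scopes=["https://analysis.windows.net/powerbi/api/.default"]
)["access_token"]

# Exchange it for an embed token for a specific report in a workspace.
workspace_id, report_id = os.environ["WORKSPACE_ID"], os.environ["REPORT_ID"]
response = requests.post(
    f"https://api.powerbi.com/v1.0/myorg/groups/{workspace_id}/reports/{report_id}/GenerateToken",
    headers={"Authorization": f"Bearer {token}"},
    json={"accessLevel": "View"},
)
embed_token = response.json()["token"]
```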

Results & Conclusions

The datasets used to perform the calculations start at the beginning of March 2020 for COVID (cases and deaths), Mobility and Flights. For Temperature, we have data from November 2020 (when we started this exercise).

With Databricks, and the advantage of running Python on Spark, we used the Pandas library to calculate the correlation coefficients. Using Pandas was straightforward thanks to the data preparation and processing already done, which allowed us to tune the datasets and improve the accuracy of the results. Furthermore, by adding a fifteen-day offset to every data point from statistics not related to COVID (flights, mobility and temperature), we tried to account for the lag between contagion and the time cases are detected and reported.

As you might have expected, most of the results for Flights and Mobility have a direct correlation with the cases and demises of COVID, although we got a strange result for the United States showing an inverse correlation with Mobility (*).

Finally, the temperature factor (which we now know does not have much influence on the pandemic's evolution, as determined by the WHO) gave correlation values that are simple to interpret but add no value to the whole picture. The only notable fact is that, at the time the data was collected, Brazil was entering its summer, so we got a positive correlation value, while all the other countries, in the northern hemisphere, were entering winter (temperature was decreasing). Anyway, it makes no sense to go further into these correlation values, as temperature does not seem to be a significant factor in the pandemic's evolution.

(*) For sure we haven't discovered anything new here. Far from being statistics experts, all these results showed us that correlating this kind of data is not as easy as you might have thought at the beginning. The 'randomness' of the selected offset versus the virus incubation period (which varies from person to person) makes it even more complicated to align all data points in a direct time-series fashion. Bottom line, we brought our experience into building this ingestion, correlation and visualization engine leveraging Azure, at least as an exercise, and the good news is that the whole platform can be reused for any other scenario with similar requirements with little to no architectural changes.
