Data warehouse tech stack with MySQL, DBT, Airflow

Objective of the project

Aerial video surveillance with wide field-of-view sensors has opened new opportunities for traffic monitoring over extensive areas. Unmanned aircraft systems equipped with automatic position-stabilization units and high-resolution cameras are used to acquire data of sufficient quality. Multiple vehicles are tracked in order to obtain detailed and accurate information about their trajectories as they pass through an intersection.

The visual data acquired by an unmanned aerial vehicle (UAV) are recorded on a memory card and subsequently post-processed on a high-performance computer.

The project will focus on building a scalable data warehouse to contain the vehicle trajectory data collected from footage captured by swarm drones and static roadside cameras.

The project uses the public pNEUMA dataset. pNEUMA is an open, large-scale dataset of naturalistic trajectories of roughly half a million vehicles, produced by a unique experiment in which a swarm of drones flew over the congested downtown area of Athens, Greece. Each file covering a single (region, date, time) combination is roughly 87 MB.

Source CSV data: the raw CSV data file.

Data loader: a Python script that combines a dataset and a sampler and provides an iterable over the given dataset.

Data migration tool: Python scripts used to move data from one location to another, one format to another, or one application to another.

Apache Airflow: an open-source platform for authoring, scheduling, and monitoring data and computing workflows. Airflow uses Python to create workflows that can be easily scheduled and monitored.

A workflow is a sequence of operations, from start to finish. Workflows in Airflow are authored as Directed Acyclic Graphs (DAGs) using standard Python code.

PostgreSQL: an open-source relational database management system that uses the Structured Query Language (SQL). It is popular because it is open source and can be adapted for numerous purposes. You can interact with PostgreSQL through a number of free and paid graphical user interfaces (GUIs) or from the command line and terminal.
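As a quick illustration, the sketch below connects to a local PostgreSQL instance from Python and runs a simple query. The connection parameters and the `trajectories` table name are placeholders, not details from the original project.

```python
import psycopg2

# Placeholder connection details -- adjust to your own PostgreSQL instance.
conn = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="traffic_dw",
    user="postgres",
    password="postgres",
)

with conn, conn.cursor() as cur:
    # Hypothetical table holding the pNEUMA vehicle trajectories.
    cur.execute("SELECT COUNT(*) FROM trajectories;")
    print("rows loaded:", cur.fetchone()[0])

conn.close()
```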

Data Build Tool (dbt): an open-source Python application that uses modular SQL queries to let data engineers and analysts transform data in their warehouses. It takes care of the 'T' in ELT and handles the data engineering that comes before analytics. It does not handle extraction or loading and assumes that the raw data has already been loaded into the data warehouse.

Redash: an open-source web application used for querying databases and visualizing the results. It integrates quickly with data warehouses, lets you write SQL queries to pull subsets of data for visualization, and makes dashboards easy to share.

Data Build Tool (better known simply as “dbt”) is a fantastic tool that makes your transformation processes much simpler. dbt fits nicely into the modern Business Intelligence stack, coupling with products like Redshift, Snowflake, Databricks, and BigQuery. Its main function is to take your custom code, compile it into SQL, and then run it against your warehouse. The code is a combination of SQL and Jinja (a templating language used in Python).

Multiple databases are supported, including:

  • Postgres
  • Redshift
  • BigQuery
  • Snowflake
  • Presto

Initialising a project in dbt is very simple; running “dbt init” in the CLI automatically creates the project structure for you. This ensures that all engineers work from the same template and thereby enforces a common standard.

Airflow allows you to define pipelines or workflows as Directed Acyclic Graphs (DAGs) of tasks. In these graphs, tasks are defined as nodes and dependencies as directed edges between the tasks. A minimal example DAG is sketched after the list of Airflow components below.

  • The Airflow scheduler — which parses DAGs, checks their schedule interval, and (if the DAG’s schedule has passed) starts scheduling the DAG’s tasks for execution by passing them to the Airflow workers.
  • The Airflow workers — which pick up tasks that are scheduled for execution and execute them. As such, the workers are responsible for actually ‘doing the work’.
  • The Airflow webserver — which visualizes the DAGs parsed by the scheduler and provides the main interface for users to monitor DAG runs and their results.
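To make the DAG idea concrete, here is a minimal sketch of an Airflow DAG with two dependent tasks. The DAG id, task callables, and schedule are illustrative placeholders rather than the project's actual pipeline.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_data():
    # Placeholder: read the raw pNEUMA CSV files.
    print("extracting raw CSV files...")


def load_data():
    # Placeholder: write the extracted records into PostgreSQL.
    print("loading records into the warehouse...")


with DAG(
    dag_id="pneuma_elt",                 # hypothetical DAG name
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_data)
    load = PythonOperator(task_id="load", python_callable=load_data)

    extract >> load                      # load runs only after extract succeeds
```

The scheduler parses this file, the workers execute the two tasks in order, and the webserver shows each run's status.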

Load the data
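The loading code itself is not shown in this post, so below is a rough sketch of how the raw CSV might be read in manageable chunks rather than all ~87 MB at once. The file path and chunk size are assumptions, and the sketch assumes the export has already been flattened into a plain rectangular CSV.

```python
import pandas as pd

# Hypothetical path to one pNEUMA export (region, date, time).
RAW_CSV = "data/20181024_d1_0900_0930.csv"


def load_raw_chunks(path: str, chunksize: int = 100_000):
    """Yield the raw CSV in chunks instead of reading the whole file at once."""
    for chunk in pd.read_csv(path, chunksize=chunksize):
        yield chunk


if __name__ == "__main__":
    total = sum(len(chunk) for chunk in load_raw_chunks(RAW_CSV))
    print(f"read {total} rows from {RAW_CSV}")
```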

Migrating the data
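Likewise, here is a minimal sketch of the migration step, pushing each chunk into PostgreSQL with SQLAlchemy. The connection string and target table name are placeholders, not the project's actual configuration.

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string and target table.
ENGINE = create_engine(
    "postgresql+psycopg2://postgres:postgres@localhost:5432/traffic_dw"
)
TARGET_TABLE = "raw_trajectories"


def migrate_csv_to_postgres(path: str, chunksize: int = 100_000) -> None:
    """Append the CSV to the warehouse table chunk by chunk."""
    for chunk in pd.read_csv(path, chunksize=chunksize):
        chunk.to_sql(TARGET_TABLE, ENGINE, if_exists="append", index=False)


if __name__ == "__main__":
    migrate_csv_to_postgres("data/20181024_d1_0900_0930.csv")
```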

Redash

Redash is a data engineering tool used to show the outcomes of querying database tables. Redash will instantly construct a dashboard for you where you can view these outcomes. We'll use Redash to query the warehouse and visualize the results.

A lot of our work becomes easier if we integrate these tools into a (near) real-time ELT pipeline; a sketch of how Airflow can orchestrate the load and dbt steps is shown below.
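As an illustration of that integration, this sketch extends the earlier DAG so that Airflow triggers the CSV load and then runs the dbt transformations and tests. The dbt project path, DAG id, and task names are assumptions, not the post's actual setup.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def load_csv_to_postgres():
    # Placeholder for the migration step sketched above.
    print("loading raw CSV into PostgreSQL...")


with DAG(
    dag_id="pneuma_elt_with_dbt",        # hypothetical DAG name
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    load = PythonOperator(
        task_id="load_raw_data", python_callable=load_csv_to_postgres
    )

    # Assumes a dbt project lives at /opt/dbt/traffic_dw on the Airflow worker.
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="cd /opt/dbt/traffic_dw && dbt run",
    )
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command="cd /opt/dbt/traffic_dw && dbt test",
    )

    load >> dbt_run >> dbt_test
```

Once the transformed models exist in PostgreSQL, Redash can query them directly for dashboards.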
