From Batches to Streams: Streamlining Data Engineering for Comprehensive Internet Oversight (Part 3)

Mohamed Awnallah
4 min read · Sep 10, 2023

TLDR

This article concludes my Google Summer of Code project on Internet alarms correlation and aggregated reports, built with JavaScript, Vue.js, and Plotly. The project improves Internet infrastructure monitoring, applies an MVC architecture to its ETL code, features dynamic visualizations, automates CI/CD with GitHub Actions and Jest, and validates data with Joi. It is meant to grow through user feedback and new data sources.

Table of Contents (TOC)

  • I. Introduction
  • II. Business Value and Context
  • III. Technical Components and Practices
  • IV. Expanding the Project

Introduction

This article concludes our three-part series. In this final part, we look at my Google Summer of Code project at the Internet Health Report (IHR), which delivers alarms correlation and aggregated reports through an online tool. The project was carried out under the mentorship of Romain Fontugne and Emile Aben. To see the project in action, visit the IHR Global Report Page.

Business Value and Context

In today’s Internet landscape, effective monitoring of alarms concerning BGP hijacking, BGP routing, Internet delays, and outages is critical. This project addresses that need and has been featured in publications on the RIPE NCC and APNIC blogs. For a deeper understanding of the business value and context, please refer to the following articles:

  1. RIPE NCC Article: RIPE NCC is one of the five Regional Internet Registries (RIRs) responsible for managing the allocation and registration of Internet resources, including IP addresses and autonomous system (AS) numbers. Read here.
  2. APNIC Article: APNIC, the Asia Pacific Network Information Centre, is the Regional Internet Registry serving East Asia, Oceania, South Asia, and Southeast Asia. Read here.

Technical Components and Practices

TLDR

This project uses an MVC architecture for its JavaScript ETL processes, offers dynamic visualizations with Plotly, automates CI/CD with GitHub Actions, tests with Jest, validates data with Joi, and provides comprehensive documentation. Collaboration runs through GitHub Discussions and GitHub Projects.

ETL with MVC Architecture

The project adopts the Model-View-Controller (MVC) architecture for extraction and transformation, which improves modularity and data consistency. It handles missing or nullable data, diverse data types, and differing time zones across all sources.

Figure 1 — ETL with MVC Architecture
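To make the separation of concerns concrete, here is a minimal sketch of how a model, a controller, and a view could split the extraction and transformation work. It is an illustration only, not the project's actual code: the function names (toAlarm, aggregateByCountry, renderSummary), the record fields, and the source names sourceA and sourceB are all assumptions.

```javascript
// Model (hypothetical): normalizes one raw record into a common alarm shape.
// Missing values become null and every timestamp is converted to UTC.
function toAlarm(raw, source) {
  if (raw.timestamp == null) return null;
  const time = new Date(raw.timestamp); // parses ISO 8601, including offsets
  if (Number.isNaN(time.getTime())) return null;
  const asn = raw.asn == null || raw.asn === '' ? null : Number(raw.asn);
  return {
    source,
    country: raw.country ? String(raw.country).toUpperCase() : null,
    asn: Number.isInteger(asn) ? asn : null,
    timestamp: time.toISOString(), // always UTC
  };
}

// Controller (hypothetical): walks every source, builds models,
// and aggregates the valid alarms per country.
function aggregateByCountry(rawBySource) {
  const counts = new Map();
  for (const [source, rows] of Object.entries(rawBySource)) {
    for (const raw of rows) {
      const alarm = toAlarm(raw, source);
      if (!alarm || !alarm.country) continue; // skip unusable records
      counts.set(alarm.country, (counts.get(alarm.country) ?? 0) + 1);
    }
  }
  return counts;
}

// View (hypothetical): turns the aggregate into something displayable.
function renderSummary(counts) {
  return [...counts].map(([country, n]) => `${country}: ${n}`);
}

// Illustrative input, not real IHR data.
const counts = aggregateByCountry({
  sourceA: [{ country: 'br', asn: '64512', timestamp: '2023-08-15T09:00:00-03:00' }],
  sourceB: [
    { country: 'BR', asn: null, timestamp: '2023-08-15T12:30:00Z' },
    { country: 'FR', timestamp: 'not a date' }, // dropped: invalid timestamp
  ],
});
console.log(renderSummary(counts)); // [ 'BR: 2' ]
```

Keeping normalization inside the model means the controller and the view never have to deal with nulls, string-typed numbers, or local time offsets.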

Dynamic Visualizations with Plotly

We introduce dynamic visualizations such as World Maps, Time Series, and TreeMaps, offering adaptable insights at different levels, from countries to Autonomous System Numbers (ASNs).

Figure 2 — The World Map shows how many alarms happened in Brazil on 15 August 2023.
Figure 3 — The Time Series shows the significant spike of alarms when the power outage occurred in Brazil.
Figure 4 — The TreeMap shows the details of ASes that are impacted by the power outage in Brazil
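The three chart types map onto Plotly trace types: choropleth for the world map, scatter for the time series, and treemap for the AS breakdown. The sketch below only builds the trace objects, so it runs without a browser. The helper names, the input shapes, and the numbers are assumptions made for illustration, not the project's code or real alarm counts.

```javascript
// World map: one value per country, keyed by ISO 3166-1 alpha-3 code.
function worldMapTrace(countsByCountry) {
  const entries = Object.entries(countsByCountry);
  return {
    type: 'choropleth',
    locationmode: 'ISO-3',
    locations: entries.map(([iso3]) => iso3),
    z: entries.map(([, count]) => count),
    colorbar: { title: { text: 'Alarms' } },
  };
}

// Time series: alarm counts over time, sorted by UTC timestamp.
function timeSeriesTrace(points) {
  const sorted = [...points].sort((a, b) =>
    a.time < b.time ? -1 : a.time > b.time ? 1 : 0
  );
  return {
    type: 'scatter',
    mode: 'lines',
    x: sorted.map((p) => p.time),
    y: sorted.map((p) => p.count),
  };
}

// Treemap: the country is the root node, each AS is a child sized by its count.
function treemapTrace(country, countsByAsn) {
  const entries = Object.entries(countsByAsn);
  return {
    type: 'treemap',
    labels: [country, ...entries.map(([asn]) => `AS${asn}`)],
    parents: ['', ...entries.map(() => country)],
    values: [0, ...entries.map(([, count]) => count)],
  };
}

// Illustrative values only.
const mapTrace = worldMapTrace({ BRA: 42, FRA: 3 });
const treeTrace = treemapTrace('BR', { 64512: 30, 64513: 12 });

// In a page that has loaded Plotly and contains matching elements:
// Plotly.newPlot('world-map', [mapTrace]);
// Plotly.newPlot('as-treemap', [treeTrace]);
```

Because each trace is plain data, the same aggregated counts can feed a country-level or an AS-level chart simply by changing the grouping key.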

Automated CI/CD with GitHub Actions

We streamline our development cycle with Continuous Integration and Continuous Deployment (CI/CD) on GitHub Actions, which automates the testing and deployment processes.

Transformation Code Testing with Jest

We used Jest, a JavaScript testing framework, to test the transformation code. We focused mainly on integration tests because they give greater insight into system reliability and module interactions.

Figure 5 — By Kent C. Dodds
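As an illustration of what an integration-style test looks like, the sketch below runs two small transformation steps together and checks only the final result. The functions normalizeTimestamps, countPerHour, and transform are hypothetical stand-ins for the real transformation code, and the describe/test block is registered only when a Jest-style runner is present so that the file can also be loaded on its own.

```javascript
// Step 1 (hypothetical): drop rows without a parsable timestamp, convert the rest to UTC.
function normalizeTimestamps(rows) {
  return rows
    .filter((r) => r.timestamp != null && !Number.isNaN(Date.parse(r.timestamp)))
    .map((r) => ({ ...r, timestamp: new Date(r.timestamp).toISOString() }));
}

// Step 2 (hypothetical): count alarms per UTC hour.
function countPerHour(rows) {
  const buckets = {};
  for (const { timestamp } of rows) {
    const hour = `${timestamp.slice(0, 13)}:00Z`; // e.g. 2023-08-15T12:00Z
    buckets[hour] = (buckets[hour] ?? 0) + 1;
  }
  return buckets;
}

// The pipeline under test: both steps chained together.
const transform = (rows) => countPerHour(normalizeTimestamps(rows));

// Jest provides describe/test/expect as globals when it runs the file.
if (typeof describe === 'function' && typeof test === 'function') {
  describe('alarm transformation pipeline', () => {
    test('buckets mixed-timezone rows per UTC hour and skips bad rows', () => {
      const rows = [
        { timestamp: '2023-08-15T09:10:00-03:00' }, // 12:10 UTC
        { timestamp: '2023-08-15T12:45:00Z' },
        { timestamp: '2023-08-15T13:05:00Z' },
        { timestamp: null },
        { timestamp: 'garbage' },
      ];
      expect(transform(rows)).toEqual({
        '2023-08-15T12:00Z': 2,
        '2023-08-15T13:00Z': 1,
      });
    });
  });
}
```

The test never calls the individual steps directly, so it would still pass if they were refactored, while a mismatch between them (for example, one step emitting local time) would make it fail.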

Data Validation with Joi

We employed Joi for robust data validation, ensuring data quality and adherence to predefined schemas in the Internet Health Report project. This enhances the accuracy of integrated external data sources.
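A schema-based check of this kind could look like the sketch below. The field names and constraints are assumptions chosen for illustration and do not reproduce the project's actual schemas. The Joi module is passed in as an argument (typically obtained with require('joi')) so the sketch does not fix how Joi is loaded.

```javascript
// Hypothetical schema for one alarm record from an external source.
function buildAlarmSchema(Joi) {
  return Joi.object({
    source: Joi.string().required(),
    country: Joi.string().length(2).required(), // assumed two-letter code
    asn: Joi.number().integer().positive().allow(null), // nullable
    timestamp: Joi.date().iso().required(),
    count: Joi.number().integer().min(0).default(0),
  });
}

// Splits rows into valid records and rejected ones with their reasons.
// `schema` is anything exposing Joi's validate(value, options) method.
function partitionAlarms(schema, rows) {
  const valid = [];
  const rejected = [];
  for (const row of rows) {
    // abortEarly: false collects every problem instead of stopping at the first.
    const { error, value } = schema.validate(row, { abortEarly: false });
    if (error) {
      rejected.push({ row, reasons: error.details.map((d) => d.message) });
    } else {
      valid.push(value); // value includes applied defaults and conversions
    }
  }
  return { valid, rejected };
}

// Usage, assuming Joi is installed:
// const Joi = require('joi');
// const { valid, rejected } = partitionAlarms(buildAlarmSchema(Joi), rows);
```

Collecting rejected rows together with their reasons, rather than throwing on the first failure, makes it easier to see when an external source has changed its format.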

Comprehensive Documentation with README

We provide extensive documentation on adding data sources and alarm types to keep the system extensible. You can access the documentation on our IHR GitHub Repository.

Communication with GitHub Discussions

We love to build and discuss in public. We used GitHub Discussions and GitHub Projects, plus video meetings when necessary, for effective collaboration.

Expanding the Project

User and stakeholder feedback is invaluable for refining and expanding this open-source project. Collaboration with network administrators, policymakers, and analysts will aid in enhancing the alarms correlation and aggregated reports tool.

While the project meets its GSoC proposal requirements, a dedicated table for filtered alarms is planned for version 2. Potential developments include incorporating additional data sources, such as DNS query data and HTTP request data, for broader monitoring. Contributions and suggestions for new data sources are welcome.

Future work also aims to refine visualization techniques to provide more actionable insights. An open approach to development will pave the way for a more comprehensive and impactful tool. Your input on possible data sources is highly appreciated, and we look forward to collaborating on the project’s evolution.

Read Also

From Batches to Streams: How to Navigate the Data Product Life Cycle: A Comprehensive Guide (Part 2)

From Batches to Streams: Different Ways for Ingesting Data (Part 1)

Credits

- Written by Mohamed Awnallah

- Reviewed by Stanley Ndagi

Mohamed Awnallah

Data Engineer with a strong understanding of the Data Product Life Cycle and fully passionate about contributing to Open Source.