Building an IoT Monitoring System with Spark Structured Streaming, Kafka and Scala

Rafael VM
3 min read · Jul 24, 2024


In my recent project, I had the opportunity to develop a comprehensive IoT Smart Farm Monitoring system using Scala, Apache Spark Structured Streaming, and Kafka. This project was part of my master's coursework and provided valuable hands-on experience in real-time data processing.

GitHub Repository

Project Overview

The primary objective was to monitor various environmental parameters in a smart farm, such as CO2 levels, temperature, humidity, and soil moisture, using sensor data. Sensor readings are sent to Kafka topics, processed in real time with Spark Structured Streaming, and stored in Delta Lake for efficient querying and further analysis.

Key Components

1. KafkaDataGenerator

The data generator simulates sensor readings for CO2, temperature, humidity, and soil moisture and publishes them to dedicated Kafka topics, mimicking the real-time data flow from IoT devices.
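A minimal sketch of what such a generator can look like, using the plain Kafka Java client from Scala. The topic name, sensor IDs, and value range here are illustrative assumptions, not the exact values from the project:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import scala.util.Random

object KafkaDataGenerator extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

  val producer = new KafkaProducer[String, String](props)
  val sensorIds = (1 to 5).map(i => s"sensor-$i")

  // Emit one JSON reading per sensor every second.
  while (true) {
    for (sensorId <- sensorIds) {
      val ts  = System.currentTimeMillis()
      val co2 = 350 + Random.nextDouble() * 150 // ppm, hypothetical range
      val payload = s"""{"sensorId":"$sensorId","co2":$co2,"timestamp":$ts}"""
      producer.send(new ProducerRecord[String, String]("co2", sensorId, payload))
    }
    Thread.sleep(1000)
  }
}
```

The same loop can be replicated for the temperature, humidity, and soil-moisture topics.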

2. Apache Kafka

Kafka acted as the message broker, buffering the sensor data in separate topics. It was configured with Docker and Docker Compose, which kept the setup reproducible and easy to integrate.

3. Spark Structured Streaming

Spark Structured Streaming was used to read data from the Kafka topics, process it, and perform real-time analytics and transformations. The key features implemented, with a combined sketch after the list, were:

  • Watermarks and Windows: To handle late data and compute windowed aggregations.
  • Error Monitoring Service: To track defective sensor data using accumulators and custom streaming query listeners.
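A condensed sketch of that streaming side, continuing the generator assumptions above (field names, window sizes, and the logging in the listener are illustrative; the project's actual error-monitoring service tracked defective readings rather than just printing progress):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

val spark = SparkSession.builder()
  .appName("SmartFarmMonitoring")
  .getOrCreate()
import spark.implicits._

// Schema of the generator's JSON payload (field names are assumptions).
val schema = new StructType()
  .add("sensorId", StringType)
  .add("co2", DoubleType)
  .add("timestamp", LongType) // epoch millis

// Read the raw readings from Kafka and parse the JSON value.
val co2 = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "co2")
  .load()
  .select(from_json($"value".cast("string"), schema).as("r"))
  .select($"r.sensorId", $"r.co2",
    ($"r.timestamp" / 1000).cast("timestamp").as("eventTime"))

// Tolerate events up to one minute late and average CO2 per sensor
// over five-minute windows.
val windowed = co2
  .withWatermark("eventTime", "1 minute")
  .groupBy(window($"eventTime", "5 minutes"), $"sensorId")
  .agg(avg($"co2").as("avgCo2"))

// A bare-bones listener in the spirit of the error-monitoring service.
spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(e: QueryStartedEvent): Unit = ()
  override def onQueryProgress(e: QueryProgressEvent): Unit =
    println(s"Batch ${e.progress.batchId}: ${e.progress.numInputRows} rows")
  override def onQueryTerminated(e: QueryTerminatedEvent): Unit = ()
})
```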

4. Delta Lake

Processed data was stored in Delta Lake, providing ACID transactions and efficient data management. This made it possible to handle schema evolution and run scalable queries on large datasets.
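Continuing from the windowed stream sketched above, the Delta sink could look roughly like this; the paths are placeholders, and mergeSchema is just one way to opt in to additive schema evolution:

```scala
// Append finalized windows to a Delta table, with a checkpoint so the
// query can recover exactly where it left off after a restart.
val query = windowed.writeStream
  .format("delta")
  .outputMode("append")
  .option("checkpointLocation", "/tmp/checkpoints/co2")
  .option("mergeSchema", "true") // tolerate additive schema changes
  .start("/tmp/delta/co2_windowed")

query.awaitTermination()
```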

5. Real-Time Analytics

The sensor data was enriched by joining it with static zone data from a JSON file. This made it possible to attach contextual information such as sensor location (latitude and longitude) and zone type, improving the overall quality and usability of the data.
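A stream-static join along these lines does the enrichment, with the small zone table broadcast so the streaming side is not shuffled; the file path and field names are assumptions:

```scala
import org.apache.spark.sql.functions.broadcast

// Static zone metadata, loaded once from JSON.
val zones = spark.read
  .option("multiLine", "true")
  .json("data/zones.json")
  .select($"sensorId", $"zoneType", $"latitude", $"longitude")

// Stream-static join: each incoming reading picks up its zone context.
val enriched = co2.join(broadcast(zones), Seq("sensorId"))
```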

Technical Details

Environment Setup

The following technologies and versions were used (a matching build definition is sketched after the list):

  • Scala: 2.13.14
  • Spark: 3.5.1
  • Delta: 3.2.0
  • Kafka and Zookeeper: Confluent images (version 7.2.1)
  • Java: 17
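A build.sbt along these lines would pin those versions; the kafka-clients version is my assumption, since the Confluent images only fix the broker side:

```scala
ThisBuild / scalaVersion := "2.13.14"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"            % "3.5.1",
  "org.apache.spark" %% "spark-sql-kafka-0-10" % "3.5.1", // Kafka source for Structured Streaming
  "io.delta"         %% "delta-spark"          % "3.2.0", // Delta Lake format
  "org.apache.kafka"  % "kafka-clients"        % "3.7.0"  // for the data generator
)
```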

Development and Testing

A structured approach to development was followed, ensuring that the code adhered to clean code and SOLID principles. Refactoring was done to improve maintainability and scalability.

Testing was an integral part of the process. I wrote unit tests for various components to ensure that defective sensors were correctly identified and processed. Mocking and dependency injection were used extensively to simulate Kafka streams and test the streaming queries. However, those tests were written against a very early version of the code, and due to lack of time I wasn't able to update them. I dare you to bring them up to date if you want. Just send a PR ;)
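As a taste of what such a test can look like, here is a minimal ScalaTest sketch; the SensorValidator object and its thresholds are hypothetical illustrations, not the project's actual code:

```scala
import org.scalatest.funsuite.AnyFunSuite

// Hypothetical validation rule: a CO2 reading outside a plausible
// range marks the sensor as defective.
object SensorValidator {
  def isDefective(co2: Double): Boolean = co2 < 0 || co2 > 5000
}

class SensorValidatorSpec extends AnyFunSuite {
  test("readings in the plausible range are accepted") {
    assert(!SensorValidator.isDefective(420.0))
  }
  test("negative and extreme readings are flagged as defective") {
    assert(SensorValidator.isDefective(-1.0))
    assert(SensorValidator.isDefective(9999.0))
  }
}
```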

Challenges and Learnings

One of the main challenges was handling late data and ensuring data consistency. Implementing watermarks and window functions in Spark helped me manage late arrivals and perform accurate aggregations. Another challenge was ensuring efficient joins between streaming data and static data, which I tackled using broadcast joins.

This project has been a significant learning experience, enhancing my understanding of real-time data processing, stream analytics, and big data technologies.

Acknowledgements

I would like to extend my heartfelt gratitude to Professor Mario Renau for his fantastic mentorship throughout the course. His extensive material, exceptional explanations, and insistence on pushing us out of our comfort zones have been invaluable. I hold him in high regard for his dedication and support.
