Building an IoT Data Pipeline for Energy Consumption Monitoring on the Edge with DuckDB and Malloy

Zsombor Földesi · Hiflylabs · May 4, 2023

As many of you probably know, energy prices in the EU skyrocketed in 2022. My primary heating source is electric, so last winter, I wanted to understand my home electricity usage better.

That’s why I developed a data pipeline that allows me to monitor the electricity consumption of my home.

At work, I spend most of my engineering-related time interacting with Cloud Services, so to explore something different, I challenged myself to develop a data pipeline that runs entirely on the edge.

My first idea was to invest in smart plugs that let the end user pull the measured consumption data from the device, but at the time, I couldn’t find one that matched my preferences because most of them lacked a public API. In the past, I’ve spent more time than I’d like to admit trying to reverse-engineer things like my Nobo Hub, but in this situation, it seemed like too much of a hassle. And while I’m impressed by the flourishing smart home community, utilizing a home assistant such as home-assistant.io didn’t appear worthwhile to me since I was only after energy monitoring.

Fortunately, 2022 was the year I discovered DuckDB, and ever since, I have been looking for opportunities to get more out of it, whether it’s building CLIs or bringing the “Cloud” to my laptop.

After a bit of tinkering, I came up with this simple architecture:

To sum it up, I set up an ESP8266 to read the measurement data from the Smart Meter via its serial port. The collected events are parsed and sent to an API running on a Raspberry Pi, which persists the raw events as parquet files. Lastly, I use DuckDB with Malloy for transformation and analysis. In the next few paragraphs, I will walk you through each stage.

Event Collection

Fortunately, the Smart Meter provided by my electricity supplier is equipped with a DSMR standard P1 port. DSMR stands for Dutch Smart Meter Requirements, a smart-meter standard created in the Netherlands that enables end users to monitor their energy consumption and billing data.

I came across several projects that aimed to parse data from Smart Meters, such as dsmr-reader, but ultimately, none of them were compatible with the particular Smart Meter I have. However, with some troubleshooting and online research, I found a datasheet that helped me parse the messages received via the serial interface. The raw data arriving over the serial connection comes in so-called telegrams.
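A telegram is a small block of OBIS-coded lines bracketed by a device header and a CRC footer. The sample below is reconstructed for illustration (codes, identifiers, and values vary by vendor), but it shows the shape of what the P1 port emits:

/AUX5\2M103030299999

0-0:1.0.0(230421091600S)
0-0:96.1.0(AUX103030299999)
1-0:1.8.0(009994.902*kWh)
1-0:32.7.0(240.5*V)
1-0:52.7.0(239.8*V)
1-0:72.7.0(236.3*V)
1-0:31.7.0(000*A)
1-0:51.7.0(000*A)
1-0:71.7.0(000*A)
!A1B2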

I programmed a NodeMCU to collect the events from the Smart Meter. The app retrieves data from the P1 serial connection, then parses the relevant information from the raw event, and finally transfers it as a JSON object to a simple Python API. Additionally, I added a DHT22 sensor to measure outdoor temperature and humidity.
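The parsing step essentially boils down to mapping OBIS codes to field names. The firmware runs on the ESP, but the idea translates to a few lines of Python (a sketch; the field mapping and helper name are mine, not the firmware’s):

# Map OBIS codes in the telegram to the field names used downstream
# (sketch only: the mapping mirrors the sample telegram above)
OBIS_FIELDS = {
    "0-0:1.0.0": "power_timestamp",
    "1-0:1.8.0": "power_consumption",
    "1-0:32.7.0": "l1_phase_voltage",
    "1-0:52.7.0": "l2_phase_voltage",
    "1-0:72.7.0": "l3_phase_voltage",
    "1-0:31.7.0": "l1_phase_current",
    "1-0:51.7.0": "l2_phase_current",
    "1-0:71.7.0": "l3_phase_current",
}

def parse_telegram(telegram: str) -> dict:
    """Turn one raw telegram into the flat event dict sent to the API."""
    event = {}
    for line in telegram.splitlines():
        # Data lines look like '1-0:32.7.0(240.5*V)'
        code, _, rest = line.partition("(")
        if code in OBIS_FIELDS:
            event[OBIS_FIELDS[code]] = rest.rstrip(")")
    return event

Running parse_telegram on a telegram like the sample above, with the DHT22 readings merged in, yields a payload like the one below.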

Example output from the ESP:

{
  "power_timestamp": "230421091600S",
  "power_device_name": "AUX103030299999",
  "power_breaker_status": "000",
  "power_consumption": "009994.902*kWh",
  "l1_phase_voltage": "240.5*V",
  "l2_phase_voltage": "239.8*V",
  "l3_phase_voltage": "236.3*V",
  "l1_phase_current": "000*A",
  "l2_phase_current": "000*A",
  "l3_phase_current": "000*A",
  "humidity": 40.4,
  "temperature": 7.6
}
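One detail about the payload: the meter reports numeric values as strings with the unit embedded (“009994.902*kWh”), so before any analysis they need to be cast to plain numbers. A trivial helper (the name is mine) takes care of that later in the pipeline:

def strip_unit(raw: str) -> float:
    # '009994.902*kWh' -> 9994.902, '240.5*V' -> 240.5
    return float(raw.split("*")[0])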

You can check out the PoC version of the event collection here.

Persist Service

Storage structure

To store data collected by the ESP8266, I created a simple Flask service with Python that persists the received data as partitioned parquet files. I chose this approach because it’s simple and maintainable.

from flask import Flask, request
import json

app = Flask(__name__)

@app.route("/sensor", methods=["POST"])
def dummy_handle_request():
    # Parse the JSON payload sent by the ESP8266
    raw_data = json.loads(request.data)
    # Clean the event and derive the partition keys, then append to storage
    transformed = transform(data=raw_data)
    persist_to_parquet(data=transformed)
    return "ok"

Since the DSMR protocol’s update interval is between 1 and 10 seconds, I didn’t have to worry about throughput. A simple HTTP service was enough to turn the serialized JSON objects into parquet. Moreover, since my plan was to deploy it on a Raspberry Pi, I didn’t want to introduce the extra complexity and operational cost of a full streaming platform/message broker like Redpanda. Realistically, a lightweight MQTT broker would also have worked, and in a production environment I would obviously choose one of those options. For the persistence itself, I used the delta-rs writer, which stores each appended batch as parquet files under Hive-style date partitions:

from deltalake.writer import write_deltalake

# Append the event batch to the raw table, partitioned by date
write_deltalake(
    '/raw/',
    df,
    mode='append',
    partition_by=['year', 'month', 'day'],
)
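One thing worth noting: partition_by expects the partition keys to already exist as columns, so the transform step derives them before writing. A minimal sketch of the idea (the real transform does a bit more; here the partition keys come from ingest time for simplicity):

import pandas as pd

def transform(data: dict) -> pd.DataFrame:
    # One event -> one-row DataFrame
    df = pd.DataFrame([data])
    # Cast the unit-suffixed strings to numbers (see strip_unit above)
    df["power_consumption_kwh"] = df["power_consumption"].map(strip_unit)
    # Derive the partition keys consumed by write_deltalake
    now = pd.Timestamp.now()
    df["year"], df["month"], df["day"] = now.year, now.month, now.day
    return df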

Transformation and exploration with DuckDB and Malloy

Once the data was available in a date-partitioned parquet structure, I could use DuckDB to query and explore the dataset directly. DuckDB is optimized for analytical workloads, so it boasts exceptional query performance thanks to its columnar storage and vectorized query execution engine. I chose it because it’s lightweight and embeddable, which makes it an excellent fit for edge computing.

With DuckDB, reading from parquet files is as easy as:

select
    measure_timestamp,
    power_consumption_kwh,
    outside_temperature
from read_parquet('~/raw/year=*/month=*/day=*/*.parquet')
limit 10;
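Because the files follow the Hive-style year=/month=/day= layout, DuckDB can also expose the partition keys as regular columns and prune files when a query filters on them. A quick sanity check I like to run (nothing here is specific to my schema beyond the path):

select
    year, month, day,
    count(*) as readings
from read_parquet('~/raw/year=*/month=*/day=*/*.parquet', hive_partitioning = true)
group by all
order by year, month, day;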

After some basic wrangling of the collected data, I wanted to explore and visualize it to uncover patterns and insights. I used Malloy to analyze the data and create charts that display energy consumption trends and check the correlation with outside temperature and humidity levels.

A little background info: Malloy is an innovative language designed to define data relationships and transformations. It serves as both a semantic modeling language and a structured querying language (pun intended), enabling users to run queries against relational databases. DuckDB is natively supported in Malloy, along with BigQuery and Postgres, which makes it a superb choice for creating reusable explores against your data model.
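To give a flavor, here is a small model over the same parquet files (a sketch: the field names follow the SQL above, and the syntax matches the Malloy releases current at the time of writing):

source: readings is table('duckdb:raw/year=*/month=*/day=*/*.parquet') {
  measure: kwh_used is max(power_consumption_kwh) - min(power_consumption_kwh)
}

query: readings -> {
  group_by: reading_day is measure_timestamp.day
  aggregate: kwh_used
  order_by: reading_day
}

Because kwh_used lives on the source, every query, daily, hourly, or monthly, can reuse it without repeating the expression.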

I used the Malloy Visual Studio Code extension over Remote Explorer to streamline the process of constructing queries and generating visualizations and dashboards.

Monthly Power Consumption

I let this pipeline run for the past five months to make sure I had enough data to base my observations on. Through these visualizations, I identified several energy consumption peaks throughout the day that could be addressed by adjusting my home’s heating schedule. Armed with this knowledge, I was able to optimize my home’s energy usage, cut down my electricity bill, and reduce my impact on the environment.

Hourly Power Consumption in a day

It was very instructive for me to build a data processing pipeline using only open-source technologies, without touching cloud services or large on-prem infrastructure. Most of the prototype code is available here. If you want to learn more about Malloy, check out this interactive demo.

In conclusion, I hope this blog post has given you valuable insight into the process of creating efficient IoT pipelines and demonstrated DuckDB’s impressive potential to deliver local performance that rivals the Cloud.

Feel free to reach out if you’re interested in building a similar data pipeline or have any questions about my setup.
