Weather-Driven Boiler Automation Done “RIGHT”

Boris Churzin
Fundbox Engineering
Jan 20, 2022

It’s winter again. And like every winter, I start the “boiler dance.”

The first month is the worst. First, I have a cold shower realizing I forgot that boilers are a thing. Next, I set up the cheapest smart timer I can find locally to heat the boiler for 30m at a specific time. Then comes a sudden temperature drop the following week, which means another cold shower and some timer tweaking. The cold streak doesn’t last, which triggers The Optimizer in me to tweak the timer back.

This dance continues until my lazy side (in charge of some extra budget for utility bills) wins over The Optimizer and sets the timer higher than needed for a couple of months. Then in spring, the reverse dance begins.

Not this winter; this winter — I’ll fix the problem once and for all, and I’ll do it RIGHT!

Why RIGHT? Because my previous projects were not done RIGHT, and eventually were not done at all because I didn’t want to deal with THAT mess.

So what does “RIGHT” mean?:

  • I don’t have to deal with any consequences.
  • It should please the future me.
  • It should please my current and future wives (probably the same person).
  • It should work longer than two winters.
  • It shouldn’t cost me anything (apart from electricity and ISP).

Let’s throw everything I know about making systems work onto this project and see if it’s worth it!

So, what does “RIGHT” mean in this context?:

  • Good design
  • CI/CD
  • SRE

Let’s go!

Requirements

  • I want to have hot water at certain hours.
  • I want to change the schedule quickly.
  • I want the system to be efficient with electricity.

Inventory

  • A relatively smart timer of a local brand, surprisingly controllable by Google Assistant.
  • Some scrap metal parts like a couple of Arduinos and a Raspberry Pi Zero.
  • A short session of searching the internet for an existing solution (negative).

Design

This one looks straightforward:

  • I set up a schedule in a calendar
  • Some cron picks it up and calls the Scheduler script
  • Which fetches the weather data from some provider
  • Checks the schedule
  • Asks “AI” to figure out how much time is needed to heat the boiler to the requested “hotness” (“intensity” is the term I went for)
  • Switches the boiler on or off accordingly
  • Via the timer’s API or Google Assistant
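The decision step of this flow fits in a few lines. Here is a minimal sketch (the `should_heat` helper and the example times are my own illustration, not the repo’s actual code):

```python
# Sketch: decide whether the boiler should be on right now, so the water
# is hot when the scheduled slot starts. All names are hypothetical.
from datetime import datetime, timedelta

def should_heat(now: datetime, slot_start: datetime, hours_needed: float) -> bool:
    """True while we are inside the pre-heating window that ends
    when the scheduled hot-water slot begins."""
    heat_from = slot_start - timedelta(hours=hours_needed)
    return heat_from <= now < slot_start

# Example: a 19:00 slot needing 1.5h of heating means "on" from 17:30.
slot = datetime(2022, 1, 20, 19, 0)
print(should_heat(datetime(2022, 1, 20, 18, 0), slot, 1.5))  # True
print(should_heat(datetime(2022, 1, 20, 16, 0), slot, 1.5))  # False
```

Cron runs this check every few minutes, and the result is passed to the timer’s API (or Google Assistant) as the desired on/off state.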

Development

I decided to extract the technical bits into a post of their own, as this one is long enough as it is.

See the “boring details” post for: Design decisions, Sniffing, Google Assistant POC, Research, Classes design, Raspberry Pi, and Datadog agent rPi issues.

Here are a couple of TL;DRs:

TL;DR: Classes Design

Self-explanatory; both CalendarSync and WeatherProvider run as stand-alone jobs. Repo.

TL;DR: Calculations

To calculate how many hours I need to heat the boiler from temperature A to B, I need to know:

How much kWh is required to heat water?

  • kWh = 4.2 × liters × temp. delta ÷ 3600 (source: Google)

Convenient, isn’t it? Suspiciously convenient if you ask me, a fact that still amazes me: you can take a battery, release some of it into a kitchen pot, and despite all the transformations that the energy goes through — the water will be linearly this much hotter. Nature is supposed to be messy, isn’t it? So why is this “energy” concept so neat?

How much kW is my boiler? What is the capacity?

  • kW = 10 Amperes × 220 Volts ÷ 1000 (source: Ohm)
  • 100L (source: roof)

What is the current temperature of the boiler water?

  • Boiler water temperature: I assumed it’s the same as the average temperature in the last X hours (currently configured as six hours)
  • Sun energy output: visualcrossing has an API that reports how much energy the sun delivers onto an area. I take into account the sun angle based on the sunrise/sunset times and multiply by a factor (currently 0.1) to account for the dirt on the solar panel and other losses. I can probably tweak it later to be much more precise if I calculate the geometry between my boiler and the sun.
  • Boiler sun panel area: 2m² (source: roof)

What temperature range do I want in my shower?

  • 37–41°C shower temperature range (source: Google)
  • +10°C so it won’t be “just enough” (source: common sense)
  • + some wiggle room (source: desire for freedom)
  • = 30–55°C
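Putting all the numbers above together, the whole model comes down to a few lines of arithmetic. This is a sketch of the formula as described, not the repo’s actual code; the solar-contribution term is my reading of the description:

```python
# Heating-time model from the numbers above (sketch, not the repo code).
SPECIFIC_HEAT = 4.2          # kJ per liter per °C
BOILER_LITERS = 100          # capacity (source: roof)
BOILER_KW = 10 * 220 / 1000  # 2.2 kW heating element
PANEL_AREA_M2 = 2            # solar panel area (source: roof)
SUN_FACTOR = 0.1             # dirt-on-the-panel fudge factor

def heating_hours(current_temp: float, target_temp: float,
                  sun_kwh_per_m2: float = 0.0) -> float:
    """Hours of electric heating to go from current_temp to target_temp,
    minus whatever the solar panel is expected to contribute."""
    kwh_needed = SPECIFIC_HEAT * BOILER_LITERS * (target_temp - current_temp) / 3600
    kwh_from_sun = sun_kwh_per_m2 * PANEL_AREA_M2 * SUN_FACTOR
    return max(0.0, (kwh_needed - kwh_from_sun) / BOILER_KW)

# Heating 100L from 20°C to 55°C with no sun:
# 4.2 * 100 * 35 / 3600 ≈ 4.08 kWh, so ≈ 1.86h on a 2.2 kW element.
print(round(heating_hours(20, 55), 2))
```

The `current_temp` input is the assumed boiler water temperature (the six-hour average mentioned above), and `sun_kwh_per_m2` is what the weather API reports for the day.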

Results

Surprisingly enough, this chain of guesses, some physics, and simple math gave me a decent model to work with. Except for the sun “nerfing,” I didn’t have to tweak anything to get the same baseline results. I was expecting a huge error margin instead.

SRE

Now that the coding part is done, manually tested, refactored, and cleaned up — it’s time to ensure I won’t have to take another cold shower in winter.

  • My main concern is that the program would stop working: no matter where I deploy, it’ll eventually fail, and I need to know about it — I need monitoring. No monitoring — no shower.
  • Another concern is unwanted behavior: I need a way to fix a problem from anywhere anytime a user complains — I need CI/CD. No CI/CD — dirty wife at home.
  • A cat eats the server: it’s a legit concern — I have multiple cats and no high availability in place. My solution is backups (not covered in this post).

Risk assessment — inconclusive.

Remember (from the “boring details” post that I’m sure you haven’t read) how I was upset about the Google Assistant code and how much time it took out of my life? Little did I know.

CI/CD

First of all, let’s deal with the easy and fun part — CI/CD.

Well… Calling it CI/CD would be a stretch. But it does its job. The goal is to be able to do the critical tasks remotely:

  • Testing: if tests don’t pass, nothing should be deployed.
  • Code deployment: as I decided to use cron for all my needs, this should be easy. I just need to make it safe so a deployment won’t collide with a running process.
  • New packages: a package added to requirements.txt should be installed automatically.
  • Cron: I need to be able to change the cron configuration remotely.
  • Credentials: sometimes credentials expire. I need a way to fix them remotely.

The deployment happens in three stages:

  1. Get the code stage: git clone into “pre-ready” dir, copy secrets, install requirements, run tests.
  2. Mark it as “ready”: replace the old “ready” directory with the new code.
  3. Modules self-deploy: each module before running deploys itself — the same way, replacing the “module” directory with the new code.

Here is the script that orchestrates the process and the self-deploy script.
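The stage-3 swap can be sketched like this. It’s a hypothetical reconstruction of the idea, not the linked script; the paths and names are made up:

```python
# Sketch of a module's self-deploy step: stage a copy of the "ready"
# code next to the module, then swap directories via rename, so a
# deployment never leaves a half-copied tree behind.
import os
import shutil
import tempfile

READY_DIR = "/home/pi/boiler/ready"  # hypothetical path

def self_deploy(module_dir: str, ready_dir: str = READY_DIR) -> None:
    parent = os.path.dirname(module_dir)
    staging = tempfile.mkdtemp(dir=parent)
    shutil.copytree(ready_dir, os.path.join(staging, "code"))
    old = module_dir + ".old"
    if os.path.isdir(module_dir):
        os.rename(module_dir, old)       # park the running version
    os.rename(os.path.join(staging, "code"), module_dir)  # swap in
    shutil.rmtree(old, ignore_errors=True)
    os.rmdir(staging)
```

Because the actual replacement is two `rename` calls rather than a long copy, a cron job that fires mid-deploy sees either the old tree or the new one, never a mix.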

Monitoring

What does monitoring mean? For me, it’s the feeling of security that comes from knowing what happens in my system. For that, I need:

  • Metrics
  • Logs
  • Alerts

And I’d like it accessible remotely if possible. (For larger systems, I would want tracing too, but it’s not critical at all in this case.)

The only two solutions I considered were Datadog and Grafana (cloud or self-hosted).

  • Datadog is an excellent product with a free tier that includes some basics. One of its main killer features is quick onboarding (ironically, in this case). It’s my monitoring weapon of choice when someone else is paying for it, so I wanted to see what the experience is like when you refuse to pay for anything.

A note from future me: long story short, I discovered (only after the trial period ended) that the free tier doesn’t include alerts. I was very sad that day. Not researching this part was a huge blunder on my side; knowing it would have saved me a week of pain.

  • Grafana self-hosted: just the good old Grafana with Prometheus. I had only worked with Grafana on top of Graphite before, so it was an excellent opportunity to learn about Prometheus.
  • Grafana Cloud: the same thing but managed. It also has a free tier, one much more generous than Datadog’s.

TL;DR: Making the Datadog agent do anything on a Raspberry Pi is not trivial at all. For example, even after installing it, I couldn’t make it collect logs. It’s not really Datadog’s fault; the target audience for “Datadog on rPi” is probably meager; they don’t have to support every platform out there. But. It’s not Grafana’s target audience either, and their agent did work out of the box. So I went for the Grafana Cloud solution.

Logs

Logs were easy to do; Grafana’s Loki knows how to collect logs from text files.

Grafana’s logs UI, though, wasn’t something I was willing to use daily. That pushed me to invest more in metrics to increase visibility, and I feel it was the right way to go. I keep only the basic logs to make sure nothing breaks, and of course, for when exceptions start to rain.

Metrics

A note about collecting Prometheus metrics from short-running jobs: Prometheus collects data by pulling it from other services. For short-lived jobs this doesn’t work, as the jobs die before Prometheus can scrape anything. There are two solutions: the Push Gateway (an always-up process you can push metrics into) and the Textfile Collector (write your metrics into a text file, and the node exporter scrapes them from there). After trying both, I went for the latter, as it makes more sense for my goals.
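The Textfile Collector approach boils down to atomically writing a small file in the Prometheus text format, which the node exporter then picks up. A sketch (the metric names and the path are hypothetical, not the project’s actual ones):

```python
# Sketch: publish job metrics via the node exporter's textfile collector.
import os
import tempfile

def write_textfile_metrics(metrics: dict, path: str) -> None:
    """Write metrics in Prometheus text format, atomically, so the
    collector never scrapes a half-written file."""
    lines = [f"{name} {value}\n" for name, value in metrics.items()]
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path))
    with os.fdopen(fd, "w") as f:
        f.writelines(lines)
    os.rename(tmp, path)  # atomic replace on the same filesystem

# e.g. at the end of a Scheduler run:
# write_textfile_metrics(
#     {"boiler_on": 1, "boiler_heating_hours_planned": 1.86},
#     "/var/lib/node_exporter/textfile/boiler.prom",
# )
```

The write-to-temp-then-rename dance matters here: the collector scrapes on its own schedule, and a partially written file would produce garbage samples.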

Business metrics

As an end-user, what interesting information can I pull out from the system?:

  • The boiler state right now, and history.
  • The weather right now, and history (temperature, sun energy output).
  • When is the next schedule?
  • How long will the boiler be heating next?
  • How long was the boiler heating in the last 24h?
  • Configuration changes history.
  • Schedule changes history.

Technical metrics

If something goes wrong, what do I need to know to figure out what happened?:

  • Number of schedules.
  • Intermediate calculations.
  • Data I can rely on to tweak the formula.
  • Logs.
  • Alerts statuses.
  • CPU and memory.
  • Deployments history.

Alerts

What can go wrong? Technically:

  • Does the Grafana agent report to the cloud?
  • Do the jobs finish when they should?

Sanity checks:

  • If the temperature was lower than 20°C for the last six hours, the boiler should heat for at least 30m.
  • If the temperature was higher than 30°C for the last six hours, the boiler shouldn’t heat at all.
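These two sanity checks reduce to a small predicate over the recent average temperature and the boiler’s heating minutes. A sketch (not the actual alert rules; the inputs and messages are made up):

```python
# Sketch of the sanity-check logic: flag boiler behavior that
# contradicts the weather, using the thresholds from the rules above.
def sanity_alerts(avg_temp_6h: float, heating_minutes: float) -> list:
    alerts = []
    if avg_temp_6h < 20 and heating_minutes < 30:
        alerts.append("cold outside but boiler barely heated")
    if avg_temp_6h > 30 and heating_minutes > 0:
        alerts.append("hot outside but boiler still heated")
    return alerts

print(sanity_alerts(15, 10))  # ['cold outside but boiler barely heated']
print(sanity_alerts(32, 0))   # []
```

In Grafana this same logic lives in alert rules over the recorded metrics, but having it as a predicate makes the thresholds easy to unit-test.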

Finish Line

Some screenshots

Google Calendar to control the schedule:

Grafana dashboard in all its glory:

The Timeline

The timeline below is approximately how much time I spent on each task (not necessarily in that order); pretty much as I estimated, except for monitoring: I wasn’t expecting this much friction from the Datadog and Raspberry Pi combo.

1d = ~3–4h of evening tinkering.

Was it worth it?

This was generally a positive experience. I didn’t enjoy dealing with the Google ecosystem. And the Datadog agent was a colossal waste of time. But I think I did reach my goal:

  • It’s stable — I know it’ll work this evening.
  • It works surprisingly well — I always have just enough hot water.
  • I can see if everything is ok at a glance.
  • The code is extendable; all the parts are encapsulated and communicate via agnostic channels.
  • It’s somewhat efficient (I’m working with what I have here; any ideas are welcome).
  • Although it’s a small project, it seems like it’s worth putting effort into SRE even at this scale.
  • No more unintentional cold showers!

PS The link to the somewhat boring part of this post in case you missed it: https://medium.com/fundbox-engineering/weather-driven-boiler-automation-done-right-boring-details-7f63f0cbe89d
