Underfloor Heating Optimisation Using Offline Reinforcement Learning

Yap Wei Yih
Sep 12, 2022

Charles Prosper, Chen Wu, Verdi March, Wei Yih Yap, Eden Duthie @ AWS

Introduction

In this post, we use an underfloor heating use case to demonstrate a practical approach to temperature set-point optimisation for reducing energy consumption, using Amazon Accessible RL (A2RL), a Python library for offline reinforcement learning, to recommend optimal set-points.

According to the International Energy Agency (IEA) 2020 buildings tracking report, buildings accounted for 28% of global energy-related CO2 emissions in 2019 when indirect emissions from upstream power generation are included. Energy saving has therefore become a high-priority consideration in the design and development of modern HVAC systems.

Underfloor Heating

Underfloor heating is a form of central heating that achieves indoor climate control for thermal comfort using a heated water loop embedded in the floor. Heat is delivered by conduction, radiation and convection.

Central heating is commonly provided by a combined heat and power plant, which is more energy efficient because the heat generated can be transferred to a district heating network via a heat exchanger, reducing heat waste. In a district heating network, the heat produced by the plant is distributed to consumers as heated supply water for underfloor heating, and the colder return water is then circulated back to the district heating plant.

During winter, to keep a room at its optimal comfort temperature, the return water flow rate is adjusted dynamically by a PID controller based on the return water temperature set-point, the room temperature and other external variables, as sketched below. The higher the return water flow rate, the more heat is transferred to the room. The optimal comfort temperature is the preferred room temperature set on the thermostat; it varies from person to person, but 21°C is generally considered ideal for living areas.
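To make this control loop concrete, here is a minimal, purely illustrative PID sketch; the gains, variable names and units are made up and are not the controller used in the actual plant.

```python
# Minimal PID sketch (illustrative only): nudges the return water flow rate so
# that the measured return water temperature tracks its set-point.
class PID:
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, setpoint, measured, dt=1.0):
        error = setpoint - measured
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


pid = PID(kp=0.8, ki=0.05, kd=0.1)
flow_rate = 1.0                                   # arbitrary units
return_water_setpoint, return_water_temp = 35.0, 32.0   # °C, illustrative values
flow_rate += pid.step(return_water_setpoint, return_water_temp)
```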

Amazon Accessible RL

Amazon Accessible RL (A2RL) is a custom-built open-source package that provides everything a data scientist needs to work on time-series sequential decision-making problems. It is a low-code package that guides the data scientist through problem formulation, initial data analysis to understand the variance in the offline data, training a simulator, and producing recommended actions. Under the hood, A2RL is based on the transformer technology that powers Gato, the Trajectory Transformer and the Decision Transformer. A2RL uses offline data consisting of states, actions and rewards to train a simulator, simulates the potential rewards of different valid actions, and recommends the best action to take in order to maximise your reward.

The following diagram shows the A2RL workflow:

A2RL workflow

Problem Formulation

In this use case, the goal is to keep the room temperature close to the optimal temperature by adjusting the return water temperature set-point, which in turn controls the amount of heated water flowing through the underfloor water loop. By keeping the room temperature close to the optimal temperature, less energy is wasted on overheating.

This is done by first identifying all the state variables and actions, then defining the objective (reward) as a measure of the deviation from the optimal comfort temperature. A transition function is built using a random forest model to predict the next hour's room temperature, given the current room temperature and the other states. Finally, the transition function is incorporated into a custom underfloor heating OpenAI Gym environment so that A2RL and other offline RL agents can interact with it, learn, recommend an action and observe the reward.

States

  • Outside air temperature
  • Room temperature
  • Supply water temperature
  • Encoded date and time features (time of day and week)

Actions (Decision variable):

  • Return water temperature set-point

Transition Function:

  • Room temperature (t+1) = F(state, action)

Reward:

  • The reward function measures how far the room temperature deviates from the optimal room temperature. The deviation is negated, so a larger (less negative) reward is better (see the sketch after this list).
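For concreteness, here is a hypothetical sketch of how the transition and reward functions could be implemented. The column names, feature encoding and model hyper-parameters are illustrative assumptions, not the exact code used in this post; `df` is assumed to hold the hourly offline data described below.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

OPTIMAL_TEMP = 21.0  # °C, comfort target used in this post

# Illustrative column names for the states and the action (decision variable).
STATE_COLS = ["outside_temp", "room_temp", "supply_water_temp", "hour_sin", "hour_cos"]
ACTION_COL = "return_water_setpoint"


def fit_transition_model(df: pd.DataFrame) -> RandomForestRegressor:
    """Transition function: room_temp(t+1) = F(state(t), action(t))."""
    X = df[STATE_COLS + [ACTION_COL]].iloc[:-1]
    y = df["room_temp"].shift(-1).iloc[:-1]  # next-hour room temperature
    return RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)


def reward(room_temp: float) -> float:
    """Negated deviation from the comfort temperature: closer to 0 is better."""
    return -abs(room_temp - OPTIMAL_TEMP)
```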

Data Ingestion

The synthetic offline data has hourly granularity and consists of the following columns.

Overview of dataset columns

Before moving to the next step, A2RL requires a clear problem formulation: you indicate which columns are states, actions and rewards in order to create a Whatif DataFrame.

A2RL provides a simple API to create a Whatif DataFrame, as shown in the following sketch:
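This is a minimal sketch: the column names are the illustrative ones used earlier, and `df` is assumed to be a pandas DataFrame holding the offline dataset.

```python
import a2rl as wi

# Wrap the offline dataset into a Whatif DataFrame by declaring which columns
# are states, actions and rewards (column names here are illustrative).
wi_df = wi.WiDataFrame(
    df,
    states=["outside_temp", "room_temp", "supply_water_temp", "hour_sin", "hour_cos"],
    actions=["return_water_setpoint"],
    rewards=["reward"],
)
```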

Data Property Check

For many sequential decision-making problems, we look for some key patterns in the data. More precisely, the data should exhibit the MDP (Markov Decision Process) property for offline RL techniques to be effective.

A2RL lets you check your data for the Markov property using its plot information API, which shows the relationships between states, actions and rewards. You can verify whether your offline data has the desired Markov property before proceeding further, and consider whether you need to gather more data, introduce more variance in the actions taken, and so on.
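A call along the following lines produces a chart like the one below; the helper name `plot_information` is taken from the A2RL documentation as we understand it and may differ in your installed version.

```python
# Plot mutual-information / Markov-property diagnostics for the Whatif DataFrame.
# Assumed helper name; consult the A2RL documentation for the exact API.
wi.plot_information(wi_df)
```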

Markov property check visualisation

Take the first Markov-order property in the preceding graph as an example: it shows that the current state and action affect the next state. Since Lag 1 has the higher normalised score, the current state and action are likely to have more influence on the next state than on the state after that.

Train a Simulator

Training a simulator takes just a few lines of code; the offline data is tokenized and converted into a sequence of tokens for training, for example:
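A minimal sketch, continuing from the Whatif DataFrame above; the block size and model directory are illustrative choices rather than recommended settings, and the exact signatures may vary between A2RL versions.

```python
# Tokenize the Whatif DataFrame and train the GPT-style simulator model.
tokenizer = wi.AutoTokenizer(wi_df, block_size_row=2)
builder = wi.GPTBuilder(tokenizer, model_dir="model-underfloor-heating")
model = builder.fit()
```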

Recommendation

Once a simulator is trained, you can pass the current context information to the sample API to obtain recommended actions. The sample API generates the next action and estimated reward from the given context multiple times, and you can select the action that maximises or minimises the reward.

The package does the heavy lifting to ensure the simulator generates states, actions and rewards in the correct sequence, and validates that the values returned are consistent with the offline data.
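A sketch of what this step might look like, continuing from the trained simulator above; the way the context is assembled and the reward column name follow the illustrative naming used earlier and may differ from the actual example code.

```python
# Build a simulator around the trained model.
simulator = wi.Simulator(tokenizer, model)

# Illustrative context: the tail of the tokenized history stands in for the
# current situation (in production this would come from live sensor readings).
custom_context = tokenizer.df_tokenized.sequence[-tokenizer.block_size:]

# Sample several candidate actions with their estimated rewards, then keep the
# one with the best (least negative) reward.
recommendations = simulator.sample(custom_context, max_size=10)
best_action = recommendations.sort_values(by="reward", ascending=False).head(1)
```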

The following is an end-to-end flow example.

End to end example

Evaluation

We evaluated over an episode of 500 steps to assess how well A2RL's recommendations perform, comparing against Conservative Q-Learning for Offline Reinforcement Learning (CQL), an established offline RL algorithm.

Rewards comparison between CQL and A2RL

The preceding graph shows the deviation from the optimal temperature of 21°C, indicated by the blue area. The total reward achieved by A2RL over the 500-step episode is comparable to, and in fact higher than, CQL's: -189 versus -249.
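For illustration, an episode-level comparison of this kind could be run with a simple loop such as the one below, assuming the custom Gym environment described earlier is available as `env` and each agent is wrapped in a `policy` callable (both hypothetical names).

```python
# Hypothetical evaluation loop: roll a policy through a 500-step episode of the
# custom underfloor-heating Gym environment and sum the per-step rewards.
def evaluate(env, policy, n_steps=500):
    obs = env.reset()
    total_reward = 0.0
    for _ in range(n_steps):
        action = policy(obs)                      # e.g. A2RL recommendation or CQL policy
        obs, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            obs = env.reset()
    return total_reward
```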

Conclusion

In this post, we demonstrated how to use A2RL to optimise the underfloor heating return water temperature set-point to reduce energy consumption. To further improve results for your own use case, you may explore incorporating other related information such as occupancy, weather data or load, or tweak the transformer model and perform hyper-parameter tuning when training the simulator.

To follow the latest developments to this solution, check out the GitHub repository, which contains the A2RL Python package and the underfloor heating example shown in this post. To explore further how to use A2RL, visit the full A2RL documentation.

Opinions expressed are solely my own and do not express the views or opinions of my employer.


Yap Wei Yih

I am a Machine Learning Practitioner specialising in industrial operations, helping customers deliver business value through the use of ML/RL on the AWS Cloud.