Oral Temperature Prediction with Python

Viktoriia Untilova
Published in Data And Beyond
Apr 16, 2024 · 16 min read

The aim of this tutorial article is to explore an infrared thermography dataset and to implement the linear regression algorithm step by step in Python.

We will start by describing the general context of the studied data as well as a detailed feature description. Next, we will perform an Exploratory Data Analysis to understand the correlations between features and their relation to the target variable.

In the Modeling part we will go through the basic intuition and math behind linear regression, as well as a step-by-step implementation of the learning algorithm in Python. Finally, we will compare the results obtained for the different models.

Photo by Madalyn Cox on Unsplash

Context

Why Clinical Thermography Data?

Clinical thermography serves as a valuable tool in the diagnostic and treatment process, offering important insights into the body’s physiology and helping to improve patient outcomes. It is important for several reasons:

  1. Early Detection: Thermography can detect abnormalities in the body’s temperature patterns, often before they are visible to the naked eye or detectable through other diagnostic methods. This early detection can lead to timely intervention and treatment, potentially preventing the progression of diseases.
  2. Preventative Medicine: By detecting abnormalities early, thermography can contribute to preventative medicine efforts by allowing for proactive interventions to address potential health issues before they become more serious.
  3. Non-invasive: Unlike many other diagnostic procedures, thermography is non-invasive and does not involve radiation or contact with the body. This makes it a safe option for patients, particularly those who may be sensitive to other diagnostic techniques.
  4. Risk Assessment: By identifying areas of abnormal heat patterns, thermography provides valuable information that can help determine the likelihood of malignancy and tailor further diagnostics.
  5. Monitoring Treatment Progress: Thermography can be used to monitor the effectiveness of treatments over time. Changes in temperature patterns can indicate how well a treatment is working or if adjustments need to be made to the treatment plan.
  6. Pain Management: Thermography can also be used in pain management to identify areas of inflammation or nerve dysfunction, helping healthcare providers develop targeted treatment plans for pain relief.

This temperature measurement method gained popularity after COVID-19 for numerous reasons. The pandemic accelerated the adoption of remote healthcare monitoring solutions. Thermography can be integrated into telemedicine platforms, allowing healthcare providers to remotely monitor patients’ temperature patterns and detect potential health issues without the need for in-person visits.

Thermographic cameras have been deployed in various public spaces to screen individuals for elevated body temperatures as part of COVID-19 prevention measures. This increased visibility of thermography has raised awareness about its potential applications beyond pandemic-related screening, contributing to its popularity in clinical settings.

Finally, advances in thermographic imaging technology have improved the accuracy, resolution, and usability of thermography systems, making them more accessible and practical for clinical use.

Input Data

The input data was taken from the UC Irvine Machine Learning Repository.

A set of infrared thermography measurements was performed by Wang, Quanzeng, et al. [1]

The image below, taken from the original paper, shows the facial regions used to extract the data points available in the dataset [2]:

Delineated facial regions and critical points on thermal images (image source: ref. [1])

Different regions of thermal images were defined using identified facial key-points and the temperatures at these regions were further extracted using the method proposed in [3, 4].

The imported dataset contains temperatures read from various locations of infrared face images of different patients, with the addition of oral temperatures measured for each individual. It contains 33 features such as gender, age, ethnicity, ambient temperature, humidity, distance, and other temperature readings from the thermal images. [2]

Variable Description:

Exploratory Data Analysis

Let’s begin by exploring the thermography data described above.

The dataset is available through the ucimlrepo Python package:

# install the necessary package using pip or conda
! pip install ucimlrepo

Importing data into Jupyter or Google Colab Notebook:

# Import the dataset into code notebook
from ucimlrepo import fetch_ucirepo

# Fetch dataset
infrared_thermography_temperature = fetch_ucirepo(id=925)

# Data (as pandas dataframes)
X = infrared_thermography_temperature.data.features
y = infrared_thermography_temperature.data.targets

What is the distribution of the target variable? Below is the code to visualize it using the seaborn library:

import seaborn as sns
import matplotlib.pyplot as plt

# Create the figure
fig = plt.figure(figsize=(8, 6))

# Set parameters of the figure
sns.set_context("paper", rc={"font.size": 10, "axes.titlesize": 8,
                             "axes.labelsize": 10, "xtick.labelsize": "small",
                             "ytick.labelsize": "small"})

# Plot the histogram
temp_dist = sns.histplot(data=y.aveOralF.sort_values(),
                         color="green", bins=40, kde=True)
temp_dist.set(xlabel='Average Oral Temperature (°C)')

# Plot the boxplot in a separate figure
plt.figure(figsize=(8, 6))
y.boxplot("aveOralF")
plt.xlabel("Average Oral Temperature (°C)")
Distribution of the Oral Temperature

Most of the target data points are concentrated between 36.3 °C and 37.5 °C, while the maximum is around 39.6 °C and the median around 36.9 °C.

Feature Correlation

Let’s explore the relation of each feature with the target variable. Below we plot a set of box plots for the categorical features and a set of scatter plots for the numerical features.

import random
import matplotlib.pyplot as plt
import seaborn as sns

# Merge features and targets into one dataframe
df = X.copy()
df["aveOralF"] = y.aveOralF

# Reorganise columns
temp_cols = df.columns.tolist()
cols = temp_cols[2:3] + [temp_cols[1]] + [temp_cols[0]] + temp_cols[3:]

# Remove the age category that has only ten rows
# and remove an outlier of the Distance feature
df = df[cols][(df.Age != "21-30") & (df.Distance < 1)]

# Define color shades for the numerical plots
colors = [random.randrange(10, 80, 5) / 100 for i in range(len(X.columns))]

# Set grid dimensions
dims = (11, 3)

# Make subplots
f, axes = plt.subplots(dims[0], dims[1], figsize=(30, 60))
axis_i, axis_j = 0, 0
c = 0

# Define graph settings
sns.set_context("paper", rc={"font.size": 20, "axes.titlesize": 8,
                             "axes.labelsize": 20, "xtick.labelsize": "small",
                             "ytick.labelsize": "small"})

for col in cols:
    # Box plots for the categorical features
    if df[col].dtype == object and col != "aveOralF":
        order = None
        if col == "Age":
            order = ["18-20", "21-25", "26-30", "31-40", "41-50", "51-60", ">60"]
            order.reverse()
        box_plot = sns.boxplot(x=df.aveOralF, y=df[col], ax=axes[axis_i, axis_j],
                               order=order, palette="Set2", hue=df[col])
        box_plot.set(xlabel='Average Oral Temperature (°C)')

    # Scatter plots for the numerical features
    if df[col].dtype != object and col != "aveOralF":
        sns.regplot(x=df.aveOralF, y=df[col], ax=axes[axis_i, axis_j],
                    color=str(colors[c]), line_kws=dict(color=str(colors[c] + 0.2)))
        c += 1

    axis_j += 1
    if axis_j == dims[1]:
        axis_i += 1
        axis_j = 0
Distribution of the average oral temperature across the categories of Age, Ethnicity and Gender

The categorical features present in the dataset are: patient’s Age, Ethnicity and Gender. From the set of graphs above we observe that:

  • the median temperature differs between age groups, notably between 18–20 y.o. and 21–25 y.o., as well as between 51–60 y.o. and the rest of the dataset; however, it is worth noting that 95% of the data comes from patients under 30 y.o.
  • the female box plot shows a narrower temperature distribution than the male one
  • the median temperature varies by up to 0.2 °C between ethnicities

Below, the correlation of the temperature readings from different face regions with the oral temperature is shown.

As seen from the graph below, most of the continuous features correlate linearly with the target as well as with each other. That is why, in the Modeling part, we start by fitting a univariate linear regression and later fit the whole dataset to compare model performance.

However, features such as Ambient Temperature, Humidity and Distance show practically no correlation with the target:
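As a quick way to reproduce this kind of overview, a pairwise correlation heatmap of the numerical features can be drawn with a snippet like the one below (a minimal sketch reusing the df dataframe built above):

import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise Pearson correlations between the numerical features and the target
corr = df.select_dtypes(include="number").corr()

plt.figure(figsize=(14, 12))
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.show()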

Finally, let’s plot the Face Maximum Temperature as a function of the Oral Temperature, as it is easily measurable and seems to correlate the most with the target. Out of curiosity, we will plot two graphs with the hue axis representing Ethnicity and Age, using the plotly library:

import plotly.express as px

fig = px.scatter(df, x="aveOralF", y="T_Max1",
                 labels=dict(aveOralF="Average Oral Temperature (°C)",
                             T_Max1="Face Maximum Temperature (°C)"),
                 color="Ethnicity", size_max=20, width=800, height=400)

fig.update_layout(
    font=dict(
        size=16,
    )
)

fig.show()
Whole Face Maximum Temperature vs. Average Oral Temperature. The hue axis represents Ethnicity
fig = px.scatter(df.sort_values(by=["Age"]), x="aveOralF", y="T_Max1",
                 labels=dict(aveOralF="Average Oral Temperature (°C)",
                             T_Max1="Face Maximum Temperature (°C)"),
                 color="Age", size_max=20)

fig.update_layout(
    font=dict(
        size=16)
)

fig.show()
Whole Face Maximum Temperature vs. Average Oral Temperature. The hue axis represents Age

Modeling

In supervised learning, each example in the training dataset consists of an input (or features) and the corresponding desired output (or label). In case of regression tasks, the goal is to predict a continuous numerical value based on input features. The output variable is continuous, meaning it can take on any value within a range.

Any supervised machine-learning task can be roughly summarized in the schema below, where the training dataset is used by the learning algorithm in order to map the input data X to the output y using a predefined hypothesis function.

A general schema for supervised ML modeling: the hypothesis function allows mapping X’s to y’s using the training dataset and the learning algorithm

Let’s consider in more detail one of the most common machine learning algorithms:

Univariate Linear Regression

Linear regression is a statistical method used to model the relationship between two variables by fitting a linear equation to observed data. It is one of the simplest forms of regression analysis and is widely used in statistical modeling and machine learning.

Mathematically, this relationship or hypothesis function is represented by the equation below:
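h(x) = θ₀ + θ₁·x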

where:

h(x) — hypothesis function that best describes target

θ’s — parameters

x — variable (feature)

In our case x corresponds to the whole face maximum temperature feature, whereas the oral temperature is the actual measured y value:

| Whole face max temperature, x (T_Max1) | Oral temperature, y (T_Oral) |
|----------------------------------------|------------------------------|
| 35.6925                                | 36.59                        |
| 35.1750                                | 37.19                        |
| 35.9125                                | 37.34                        |
| ...                                    | ...                          |

How do we choose the correct θ’s? In order to guide the process of parameter estimation and evaluate how well the model is performing, we use a cost function J(θ):
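J(θ) = (1 / 2m) · Σ ( h(x⁽ⁱ⁾) − y⁽ⁱ⁾ )²,  summed over i = 1, …, m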

where:

m — number of training examples

y’s — observed values

i — index of the i-th training example

The primary purpose of a cost function in linear regression is to quantify the difference between the predicted values of the dependent variable (based on the linear model) and the actual observed values from the dataset.

The cost function provides a quantitative measure of how well the linear regression model fits the observed data. By calculating the difference between the predicted and actual values (squared to emphasize larger errors), the cost function gives an indication of the overall performance of the model.

So the goal is to find the optimal parameters (coefficients θ) of the linear equation that minimize the difference between the predicted and actual values.

Gradient descent

In order to optimize J(θ) we will use the gradient descent algorithm. The goal is to minimize the cost function J(θ) given above:

Algorithm outline:

  • Initialize the parameters θ’s
  • Keep modifying the θ’s to reduce the cost function until reaching the minimum

So basically, we need to iteratively modify the parameters simultaneously until convergence, using the update rule below:
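θⱼ := θⱼ − α · ∂J(θ)/∂θⱼ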

where:

α — learning rate hyperparameter

j — parameter index (for univariate linear regression, j ∈ {0, 1})

It is worth noting that both parameters must be updated simultaneously during each iteration. After calculating the partial derivatives of J(θ) with respect to the θ’s, the final gradient descent equations are:
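θ₀ := θ₀ − α · (1/m) · Σ ( h(x⁽ⁱ⁾) − y⁽ⁱ⁾ )

θ₁ := θ₁ − α · (1/m) · Σ ( h(x⁽ⁱ⁾) − y⁽ⁱ⁾ ) · x⁽ⁱ⁾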

The pseudocode below summarizes the algorithm:

# Pseudocode

initialize parameters θ0, θ1

repeat until convergence {

    loss := calculate cost function J(θ)

    temp0 := θ0 - learning rate * partial derivative of J(θ) with respect to θ0
    temp1 := θ1 - learning rate * partial derivative of J(θ) with respect to θ1

    θ0 := temp0
    θ1 := temp1
}
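As a minimal illustration of this loop, a univariate batch gradient descent can be sketched in a few lines of NumPy (x and y are NumPy arrays; the names are illustrative and differ from the project code shown later):

import numpy as np

def univariate_gradient_descent(x, y, alpha=0.01, iterations=1000):
    """ Fit h(x) = theta0 + theta1 * x with batch gradient descent """
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        h = theta0 + theta1 * x  # current predictions
        # Simultaneous update of both parameters (see equations above)
        temp0 = theta0 - alpha * np.sum(h - y) / m
        temp1 = theta1 - alpha * np.sum((h - y) * x) / m
        theta0, theta1 = temp0, temp1
    return theta0, theta1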

Multivariate Linear Regression

In the case of multiple features, we need slightly different notation: we treat θ and x as row and column vectors, respectively, so that both instances can be multiplied correctly and unnecessary for-loops are avoided:
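h(x) = ΘᵀX = θ₀·x₀ + θ₁·x₁ + … + θₙ·xₙ   (with x₀ = 1)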

where:

n — number of features

X — feature vector

Θ — parameter vector

Let’s plug the updated hypothesis function into the gradient descent equations to obtain the formulas valid for the multivariate case (n ≥ 1):
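θⱼ := θⱼ − α · (1/m) · Σ ( h(x⁽ⁱ⁾) − y⁽ⁱ⁾ ) · xⱼ⁽ⁱ⁾,   simultaneously for j = 0, 1, …, n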

where:

i — training example’s index

j — feature index

n — number of features

m — number of training examples
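In vectorized form, with X the m × (n + 1) design matrix whose first column is all ones and θ the parameter vector, the simultaneous update of all parameters reduces to a single line of NumPy (a minimal sketch):

import numpy as np

def gradient_step(theta, X, y, alpha):
    """ One simultaneous batch gradient descent update of all θ's """
    m = X.shape[0]
    return theta - alpha * (X.T @ (X @ theta - y)) / m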

Implementing Linear Regression

Below we will consider in detail the code implementation of the algorithm described above, so feel free to skip this part and go directly to the results. The code given below is available in my GitHub repository, so you can also clone it and run it locally.

Stack used for the implementation: Python, NumPy, pandas, MLflow, Plotly, seaborn, Matplotlib

Data load

We start by importing the dataset from the remote repository and splitting it into train and test sets:

def train_test_split(X, y, SEED, frac=None):
    """ Split data into train, validation and test sets """

    # Shuffle data before splitting
    # (frac=1 returns all rows in a random order)
    X = X.sample(frac=1, random_state=SEED)
    y = y.sample(frac=1, random_state=SEED)

    # Train, Validation and Test fractions
    if frac is None:
        frac = [0.6, 0.2, 0.2]

    size_train = int(X.shape[0] * frac[0])
    size_val = int(X.shape[0] * frac[1])

    X_train = X[:size_train]
    y_train = y[:size_train]

    X_val = X[size_train:size_train + size_val]
    y_val = y[size_train:size_train + size_val]

    X_test = X[size_train + size_val:]
    y_test = y[size_train + size_val:]

    return X_train, y_train, X_val, y_val, X_test, y_test

For simplicity we keep the validation part empty so far:

from ucimlrepo import fetch_ucirepo

# fetch dataset
infrared_thermography_temperature = fetch_ucirepo(id=925)

# data (as pandas dataframes)
X = infrared_thermography_temperature.data.features.loc[:, ['T_Max1']]
y = infrared_thermography_temperature.data.targets.aveOralM

# Train Test Split
X_train, y_train, X_val, y_val, X_test, y_test = train_test_split(X, y, SEED=29, frac=[0.8, 0.0, 0.2])

Model

Let’s create a Python class called GradientDescentRegressor that will have the following attributes:

  • learning_rate — learning rate
  • iterations — number of iterations
  • history (to store learning history after training)
  • model — model parameters found during training
  • and some other additional attributes related to the data

class GradientDescentRegressor:
    """ Batch Gradient Descent for Linear Regression """

    def __init__(self, iterations: int, learning_rate: float):

        self.learning_rate = learning_rate
        self.iterations = iterations

        self.history = {}

        self.model = None
        self.std = None
        self.mean = None
        self.schema = None

To this class we add the following methods:

  • preprocessing

The preprocessing method of the regressor serves to prepare the input data before feeding it to the algorithm. It consists of encoding categorical features into dummy variables and scaling the numerical features and the target.

As a rule, preprocessing steps are adapted to the specific data we are dealing with. To be more precise, for this dataset preprocessing consists of:

  • creating one-hot encodings (dummy columns) for the categorical features in order to turn them into numerical values equal to either 0 or 1
  • filling missing values of the Distance feature with the median value
  • performing feature scaling in order to ensure that all features contribute equally to the learning process and to reduce the impact of outliers; the method used scales the features so that they have a mean of 0 and a standard deviation of 1
  • completing the dummy columns with zeros in case some categories are not present in the test set
  • creating the additional column necessary for modeling that represents the x₀ feature, where all values are equal to 1 (see the Modeling part)
  • performing target scaling similarly to feature scaling, for homoscedasticity and numerical stability

    def preprocessing(self, X, y):
        """
        Preprocess the dataset:
        one-hot encode categorical features,
        scale numerical features and the target

        :param X, y: input features and target
        :return: transformed X, y
        """

        dummy_cols = []

        if len(X.columns) > 1:
            # One-hot encode the categorical features
            # (the first three columns of the dataset are the categorical ones)
            dummy_cols = X.columns.to_list()[:3]

            # Fill missing values of the Distance feature
            X.loc[:, 'Distance'] = X['Distance'].fillna(X['Distance'].median())

        # Feature Scaling
        for i in X.columns:
            if i not in dummy_cols:
                X.loc[:, i] = (X[i] - X[i].mean()) / X[i].std()

        X = pd.get_dummies(X, dtype=float, columns=dummy_cols)

        if self.schema is None:
            self.schema = X.columns
        else:
            missing_cols = list(set(X.columns).symmetric_difference(self.schema))
            if missing_cols:
                for c in missing_cols:
                    X.loc[:, c] = 0

            X = X[self.schema]

        X = np.array(X)
        # Add the column that corresponds to x0 = 1
        # for each training example
        X = np.insert(X, 0, 1, axis=1)

        if self.std is None:
            self.std = y.std()
            self.mean = y.mean()

        # Target Scaling
        y = (y - self.mean) / self.std

        y = np.array(y).reshape(y.shape[0], 1)

        return X, y
  • fit

Fit serves to perform model training:

    def fit(self, X, y):
        """ Fit the model to the data """

        X, y = self.preprocessing(X, y)

        # Initialize weights
        params = initialize_parameters(X.shape[1])

        history = {'iteration': [], 'loss': []}

        nb_iter = self.iterations

        while self.iterations > 0:
            # Repeat until convergence
            loss = calculate_loss(X, y, params)

            mlflow.log_metric("loss", loss)
            mlflow.log_metric("iteration", nb_iter - self.iterations)

            # Update parameters
            params = update_parameters(X, y, params, self.learning_rate)

            history['iteration'].append(nb_iter - self.iterations)
            history['loss'].append(loss[0])

            if self.iterations % 25 == 0:
                print(f"Iteration - {history['iteration'][nb_iter - self.iterations]},", "\t",
                      f"Loss: {history['loss'][nb_iter - self.iterations]}")

            self.iterations -= 1

        self.model = params
        self.history = history

        os.mkdir(OUTPUT)

        return history
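The fit method relies on three helpers, initialize_parameters, calculate_loss and update_parameters, which are imported from the utility module but not listed in this article. Below is a minimal sketch of what they could look like, consistent with the gradient descent equations above and with how they are called in fit (an assumption, not necessarily the exact code from the repository):

import numpy as np

def initialize_parameters(n):
    """ Return a flat vector of n zero-initialized parameters θ """
    return np.zeros(n)

def calculate_loss(X, y, params):
    """ Cost J(θ) = 1/(2m) * Σ (h(x) - y)², returned as a 1-element array """
    m = X.shape[0]
    errors = X @ params - y.ravel()
    return np.array([np.sum(errors ** 2) / (2 * m)])

def update_parameters(X, y, params, learning_rate):
    """ One simultaneous batch gradient descent step on all θ's """
    m = X.shape[0]
    gradient = X.T @ (X @ params - y.ravel()) / m
    return params - learning_rate * gradient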
  • predict

The predict method is used for the inference part:

    def predict(self, X_test):
        """ Predict the target """

        res = np.sum(X_test * self.model, axis=1) * self.std + self.mean
        return res
  • plot_learning_curve

This allows us to visualize the learning history:

    def plot_learning_curve(self, OUTPUT):
        """ Plot the learning curve """
        plt.title("Learning curve")

        plt.plot(self.history['iteration'], self.history['loss'])

        plt.xlabel("# iteration")
        plt.ylabel("Loss")

        plt.savefig(OUTPUT + "learning_curve.png")
        plt.close('all')

        mlflow.log_artifact(OUTPUT + "learning_curve.png")

Experiment Tracking

Before we consider the main Python script for training, I would like to say a couple of words about the mlflow library and why we use it in this project. This MLOps tool serves to simplify machine learning workflows.

MLflow’s core concepts are the Model Registry and Tracking. It is a great free tool to perform experiment tracking either in a local environment or in the cloud. In this project it is used to log information such as: the features used for training, the data split, loss values, metrics and artifacts such as images (the learning curve and residuals plots).
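Logged runs can then be browsed locally by running the mlflow ui command and opening the tracking dashboard (served at http://localhost:5000 by default).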

Training

Let’s create additional modules to store auxiliary functions and parameters (utils.models, utils.functions, utils.constant) and import them in train.py:

from utils.models import GradientDescentRegressor
from utils.functions import train_test_split, estimate_performance, save_to_csv, log_params
from utils.constant import OUTPUT, FRAC, NB_ITER, ALPHA, SEED, RUN_NAME, FEATURES
import pickle
from ucimlrepo import fetch_ucirepo
import mlflow


if __name__ == '__main__':

    # Fetch dataset
    infrared_thermography_temperature = fetch_ucirepo(id=925)

    # Data (as pandas dataframes)
    X = infrared_thermography_temperature.data.features.loc[:, FEATURES]
    y = infrared_thermography_temperature.data.targets.aveOralM

    # Train Test Split
    X_train, y_train, X_val, y_val, X_test, y_test = train_test_split(X, y, SEED, frac=FRAC)

    # Compile the model
    model = GradientDescentRegressor(iterations=NB_ITER, learning_rate=ALPHA)

    with mlflow.start_run(run_name=RUN_NAME) as run:

        log_params(FRAC, NB_ITER, ALPHA, SEED, X)

        # Fit to the training set
        history = model.fit(X_train, y_train)

        save_to_csv(history, OUTPUT + "history.csv")

        model.plot_learning_curve(OUTPUT)

        # Save the model in a .pkl file
        model_pkl_file = OUTPUT + "model.pkl"

        with open(model_pkl_file, 'wb') as file:
            pickle.dump(model, file)

        mlflow.log_artifact(model_pkl_file)

        # Inference on the train/test sets
        estimate_performance(OUTPUT, model, X_train, y_train, "train")

        estimate_performance(OUTPUT, model, X_test, y_test, "test")

    mlflow.end_run()

Let’s first run a univariate linear regression using only one feature, ‘canthiMax1’ (the maximum value in the extended canthi area). According to [1], both the inner canthi and face maximum temperatures provided the highest clinical accuracy.

Constants declared in constant.py and used during the first run:

from datetime import datetime

OUTPUT = "out/" + datetime.now().strftime("%Y-%m-%d_%H-%M-%S") + '/'

# List of features used for training
FEATURES = ['canthiMax1']

# Tracking settings
n = len(FEATURES)
RUN_NAME = f"Train on {n} feature(s)"

FRAC = [0.8, 0.0, 0.2]
NB_ITER = 1000
ALPHA = 0.005
SEED = 29

After running train.py we see that the loss decreases at every iteration, which is a good sign that the algorithm is working properly.

Inference

Let’s estimate our model’s performance on the train and test sets.

The common metrics for linear regression are the root mean squared error (RMSE) and the mean absolute error (MAE):
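RMSE = √( (1/m) · Σ ( ŷ⁽ⁱ⁾ − y⁽ⁱ⁾ )² )

MAE = (1/m) · Σ | ŷ⁽ⁱ⁾ − y⁽ⁱ⁾ |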

where:

y — actual value

ŷ — predicted value
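The RMSE and MAE helpers called in the evaluation function below are not listed in the article; a minimal sketch consistent with the formulas above (an assumption about the utility module, not the author’s exact code):

import numpy as np

def RMSE(y_pred, y_true):
    """ Root mean squared error """
    return float(np.sqrt(np.mean((np.ravel(y_pred) - np.ravel(y_true)) ** 2)))

def MAE(y_pred, y_true):
    """ Mean absolute error """
    return float(np.mean(np.abs(np.ravel(y_pred) - np.ravel(y_true))))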

Let’s estimate model performance and plot residuals:

def estimate_performance(model_folder, model, X, y, dataset="test"):
    """ Estimate model performance """

    # Preprocess the data and evaluate the model
    X, y = model.preprocessing(X, y)

    y_pred = model.predict(X)

    rmse = RMSE(y_pred, y)
    mlflow.log_metric(f"{dataset}_set RMSE", round(rmse, 3))

    mae = MAE(y_pred, y)
    mlflow.log_metric(f"{dataset}_set MAE", round(mae, 3))

    print(f"\n{dataset} set\n"
          f"RMSE: {round(rmse, 3)}\n"
          f"MAE: {round(mae, 3)}\n")

    plt.scatter(y_pred, y)
    plt.xlabel("y_pred")
    plt.ylabel("y_test")

    plt.savefig(model_folder + f"/{dataset}.png")

    plt.close()

    # Plot Residuals
    residuals = y.reshape(y.shape[0]) - y_pred

    sns.residplot(x=y_pred, y=residuals, scatter_kws={"color": "green"})

    # Set plot labels and title
    plt.xlabel('Predicted Oral Temperature (°C)')
    plt.ylabel('Residuals')
    plt.title(f'Residual Plot - {dataset} set')

    plt.savefig(model_folder + f"/{dataset}_set_residuals.png")

    plt.close()

    mlflow.log_artifact(model_folder + f"/{dataset}.png")
    mlflow.log_artifact(model_folder + f"/{dataset}_set_residuals.png")

The difference in error between the train and test sets is only 0.021 °C for the RMSE and 0.004 °C for the MAE, which means that the algorithm generalizes well to the test data. However, the residuals plot shows that the difference between the actual and predicted values can reach 0.6-0.9 °C for some data points.

Results

In this section I share some results for a set of runs with different parameters, in particular the number of features. For the sake of simplicity I keep only the train and test sets, as well as a fixed learning rate, since it gave good results during the first run.

Table view of experiment runs in MLFlow

From the table above we can see that the univariate models trained on the canthi maximum and face maximum temperatures have the smallest errors: 0.247 °C and 0.213 °C respectively for the MAE on the test set.

When training on the full set of available features the test set mean absolute error is 0.251 °C.

However, removing all categorical and poorly correlated features such as Ambient Temperature, Humidity and Distance results in poorer performance: 0.422 °C and 0.522 °C on the test set for MAE and RMSE respectively.

To conclude, having too many features that correlate with the target in a similar way may confuse the model and give worse results than training on a single feature that generalizes sufficiently well; in our case that feature was the face maximum temperature.

Conclusions

In this tutorial you learned how to implement one of the most popular ML algorithms, multivariate linear regression, step by step using Python and infrared thermography data.

I hope you enjoyed this explorative article. Thanks for reading!

Feel free to comment and share your thoughts below ✍
