My Heart Will Go on: Heart Rate Tracking Methods from Electrocardiography to Machine Learning

Neurodata Lab
Dec 3, 2019 · 11 min read


People have a hard time believing that a person’s heart rate (HR) can be measured by analyzing a video of them. Indeed, if we’ve already trained computers to calculate the HR using a video then why haven’t we learnt to do this ourselves in the course of our evolution? After all, somehow we’ve learnt to interpret tiny movements of facial muscles or detect emotions in each other’s voices. That would be a great skill, especially for people whose work involves measuring the HR — such as cardiologists, sports coaches, and even reporters.

Yet even when we look closely at someone, we cannot estimate their HR by sight or tell whether they're suffering from a cardiac irregularity without special equipment. However, we are getting closer to being able to measure HR just by looking at someone. New machine learning-based methods, such as computer vision and time series analysis, are appearing that can extract data on someone's HR even from a webcam video. In this article we'll talk about how these methods work and how you can develop your own algorithm for HR estimation.

Common Methods of Heart Rate Measurement

Let's start with the basics: how do we register a heartbeat, and which methods are suitable for the human eye and for video analytics?

Picture 1. The process of heart excitation and ECG.

ECG
Electrocardiography is a method used in medicine to measure the heart rate. The results are depicted on an electrocardiogram, a diagram of cardiac action potentials. In fact, it reflects the process of heart excitation that is initiated in the sinus node and then spreads to the rest of the heart. As you can imagine, there's no way you could see this electric impulse with the naked eye (see Pic. 1).

Picture 2. Ballistocardiography.

Ballistocardiography
While you can't see the electric impulse, you can hear the heart contractions it causes just by putting your ear to someone's chest. Moreover, sometimes you can see slight movements of the skin right over the heart, which are also caused by heart contractions. When the heart contracts, it pumps blood into the aorta and further into the arteries. Since arterial walls are elastic, they widen and then narrow with every new blood wave. This widening is also visible in places where blood vessels are located close to the skin surface, such as the wrists (see Picture 2). Ballistocardiography is basically a method of calculating the HR by registering such widenings and skin movements. This method is applied in certain pulsometers (such as chest strap heart rate monitors), but they aren't always convenient to wear. In addition, you can't always spot ballistocardiography-eligible body parts in a random video, and even if you do, tiny skin movements are difficult to separate from other body movements.

Photoplethysmography (PPG)
When the pulse wave finally reaches the capillaries, they widen. This widening manifests as skin reddening, such as the one occurring during physical exercise. The easiest way to see the reddening is to press a finger pad to a clear glass (if your finger pad turns white, you're pressing too hard and cutting off the blood flow). You'll probably be able to see the pad alternating between slight reddening and whitening. If you don't have a glass nearby, just look at this gif:

Gif 1. The finger pad of Neurodata Lab's Scientific Director and COO Olga Perepelkina.

When you have a signal this clear, you can detect the peak moments of skin color change and calculate the HR — that’s more or less how modern fitness trackers or smart watches work. They are equipped with a light-emitting diode that casts light upon your wrist (otherwise the area under the band is too dark to detect anything at all), and a light sensor that registers the reflected signal.
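For a signal this clear, the peak detection itself can be sketched in a few lines. This is only a minimal illustration (not a production tracker); the input array and sampling rate are hypothetical:

import numpy as np
import scipy.signal

def estimate_hr_from_ppg(ppg_signal, fps):
    """Rough HR estimate (in bpm) from a clean PPG-like signal.

    ppg_signal -- 1D array of per-frame brightness values (hypothetical input)
    fps        -- sampling rate of the signal, frames per second
    """
    # Keep peaks at least 0.3 s apart, i.e. assume the HR stays below 200 bpm
    peaks, _ = scipy.signal.find_peaks(ppg_signal, distance=int(0.3 * fps))
    if len(peaks) < 2:
        return None
    mean_interval_s = np.mean(np.diff(peaks)) / fps   # seconds between beats
    return 60.0 / mean_interval_s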

Meme 1. The limitations of PPG.

While you can ask someone to press their finger to a glass in a lab environment, how do you measure the HR in regular videos? In this case, we study the face, as it is the body part most frequently captured by cameras. In addition, there are many face detectors and training datasets that significantly simplify the development of this method. There are a couple of differences between faces and finger pads when it comes to HR measurement: in fingers, the capillaries lie closer to the skin surface, and we draw them even closer by pressing the pad to the glass. While the capillaries in facial skin still widen with every heartbeat, we cannot do the same to make the pulse waves more visible (see Meme 1).

ML-based Methods of HR Measurement

(Here comes the part heavy on machine learning terms. Prepare yourselves.)
Measuring HR on faces “in the wild” has several issues:

  1. A video's color depth usually equals 24 bits. This means that the color of every pixel is coded with 3 values ranging from 0 to 255 for the 3 color channels: Red, Green and Blue (RGB). The amplitude of the heartbeat-related color change in a face can be <1 on this scale. In other words, the discretization of color itself is a serious obstacle to HR measurement.
  2. If the lighting is dim, the video is recorded with a high ISO sensitivity setting, which generates digital noise. This noise can have an amplitude greatly exceeding that of the pulse signal.
  3. If a person moves or talks, the moving face parts might become either lit or shadowed, which can change their color more significantly than the pulse wave we're trying to spot. A similar synchronized change in pixel color can be seen when the lighting changes, for instance, when a person is looking at a monitor or watching a movie.

The first two problems can be solved by simply averaging the color of pixels over the whole face (see Formula 1):

Formula 1

where Sᵢ is the area corresponding to the face in frame i, and r(p), g(p) and b(p) stand for the color of pixel p in the RGB palette.
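Formula 1 is an image in the original post; judging by the definitions above, the averaging most likely has this form (a reconstruction, not the exact original figure):

R_i = \frac{1}{|S_i|} \sum_{p \in S_i} r(p), \qquad
G_i = \frac{1}{|S_i|} \sum_{p \in S_i} g(p), \qquad
B_i = \frac{1}{|S_i|} \sum_{p \in S_i} b(p)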

Picture 3. An example of an Sᵢ area.

The Sᵢ area used for signal collection can be localized by skin segmentation using neural networks, or with the assistance of a facial landmark detector (the easiest way is to start with creating a polygonal face mask using this resource). Some researchers collect signals in the forehead and cheek areas. Yet our experiments have shown that the most efficient method is to collect signals from the whole face. Note that you shouldn't take into account the color in the eye and mouth areas, since their average color can change significantly during blinking and speech (see Picture 3).
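As a rough sketch of this step, the per-frame averaging over a polygonal face mask could look like the snippet below; the polygon is an assumed input from whichever landmark detector you use, and the eye/mouth exclusion is omitted for brevity:

import numpy as np
import cv2

def mean_face_color(frame_bgr, face_polygon):
    """Average R, G, B inside a polygonal face mask (illustrative sketch).

    frame_bgr    -- H x W x 3 frame as read by OpenCV (BGR channel order)
    face_polygon -- (N, 2) array of pixel coordinates outlining the face,
                    e.g. taken from a facial landmark detector (assumed input)
    """
    mask = np.zeros(frame_bgr.shape[:2], dtype=np.uint8)
    cv2.fillPoly(mask, [face_polygon.astype(np.int32)], 255)   # the S_i area
    b, g, r = cv2.mean(frame_bgr, mask=mask)[:3]                # per-channel means
    return r, g, b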

If digital noise has a standard deviation σ for every pixel and a face contains n pixels in total, then according to the Central Limit Theorem the averaged signal for the whole face will be accompanied by digital noise with a variance equal to σ²/n. In other words, for a face the size of 200x200 pixels we have just reduced the amplitude of the digital noise by approximately 200 times. This is possible because the digital noise is independent across pixels.
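A toy simulation (synthetic noise, not real video data) illustrates the effect:

import numpy as np

rng = np.random.default_rng(0)
n_pixels = 200 * 200                                  # a 200x200-pixel face
noise = rng.normal(0.0, 3.0, size=(100, n_pixels))    # 100 frames of per-pixel noise
averaged = noise.mean(axis=1)                         # average over the face per frame
print(3.0 / averaged.std())                           # roughly sqrt(n_pixels) = 200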

Now we know the average color of the face for every video frame and can depict it as 3 time series: R, G, B.

These series (normalized) are shown in the picture below (Picture 4).

Picture 4. Normalized time series.

It's evident that all 3 channels are highly correlated. To measure the HR, we can use a linear combination of these channels. One way to select its coefficients is independent component analysis (ICA). Moreover, there are methods of extracting the pulse signal from RGB channels developed specifically for photoplethysmography, such as the Plane Orthogonal to Skin (POS) method. To simplify the process, for now we'll use only the Green channel. We can see it's not periodic: this 8-second-long video fragment is supposed to contain 6 to 25 peaks, but we can't see them yet, even though the data has been extracted from a video of a still, seated subject in bright lighting conditions (see Video 1).

Video 1. Sample from the UBFC-RPPG Dataset.
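As an aside, the ICA-based mixing mentioned above can be sketched with scikit-learn's FastICA; which of the resulting components actually carries the pulse still has to be decided afterwards (e.g. by the strength of its spectral peak), and the normalization here is just one reasonable choice:

import numpy as np
from sklearn.decomposition import FastICA

def ica_components(r, g, b):
    """Decompose normalized R, G, B series into 3 independent components (sketch)."""
    X = np.stack([r, g, b], axis=1)                    # shape (n_frames, 3)
    X = (X - X.mean(axis=0)) / X.std(axis=0)           # normalize each channel
    ica = FastICA(n_components=3, random_state=0)
    return ica.fit_transform(X)                        # columns = candidate pulse signals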

When a person moves or the lighting changes, many pixels change their color simultaneously; therefore, averaging their color cannot free the signal of these artifacts. However, the color changes caused by the heartbeat have certain specifics: they are rhythmic and have a frequency of, say, 45 to 200 beats per minute. To filter out the frequencies that don't fit into this range, we can use FIR filters (see below).

import numpy as np
import scipy.signal


def apply_FIR_filter(
        signal,
        fps=25,          # for a 25 fps webcam
        filter_size=2,   # filter length = 2 seconds
        min_f=0.75,      # 0.75 Hz = 45 bpm minimum heart rate
        max_f=3.3):      # 3.3 Hz = 198 bpm maximum heart rate
    # Band-pass FIR filter; cutoffs are normalized to the Nyquist frequency (fps / 2)
    FIR_filter = scipy.signal.firwin(
        numtaps=filter_size * fps,
        cutoff=[min_f * 2 / fps, max_f * 2 / fps],
        window='hamming',
        pass_zero=False)
    # Convolve the filter with the signal (the per-frame average color values)
    filtered_signal = np.convolve(
        signal,
        FIR_filter,
        mode='valid')
    return filtered_signal

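For example, for a Green channel series collected from a 25 fps webcam (green_series is a placeholder name for the per-frame averages computed with Formula 1):

# green_series: 1D array of per-frame average Green values (placeholder)
filtered_green = apply_FIR_filter(green_series, fps=25)
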
Picture 5. FIR filtered Green channel.

After the signal is filtered, we can see its periodic component and sometimes even the HR frequency or the signs of cardiac irregularity!

If the video contains digital noise even after the FIR filtering and you can't detect the exact heartbeat episodes in the time series, you can try the discrete Fourier transform (DFT). Put simply, the DFT represents any signal as a sum of periodic signals, each with a different period. The sequence of amplitudes of these periodic signals is called the spectrum of the time series. The maximum amplitude of the spectrum will probably correspond to the HR.
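A minimal sketch of this step, assuming filtered_signal is the FIR-filtered series from above:

import numpy as np

def hr_from_spectrum(filtered_signal, fps, min_bpm=45, max_bpm=200):
    """Pick the strongest spectral peak within a plausible HR band (sketch)."""
    spectrum = np.abs(np.fft.rfft(filtered_signal))
    freqs_bpm = np.fft.rfftfreq(len(filtered_signal), d=1.0 / fps) * 60.0
    band = (freqs_bpm >= min_bpm) & (freqs_bpm <= max_bpm)
    return freqs_bpm[band][np.argmax(spectrum[band])]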

Since the human HR might vary over time, we can use the sliding window method, applying the DFT to fragments that are 10–15 seconds long, as HR rarely changes significantly during such a short period. This method is called the short-time Fourier transform (STFT). After we apply it, we get a different spectrum for every fragment (see Picture 6):

Picture 6. Sliding window spectrum.

The collection of spectra for all windows makes up a spectrogram. It can be visualized as a heatmap like this (Picture 7):

Picture 7. The filtered Green channel spectrogram.

Every column of this spectrogram represents the spectrum at some timestamp. Lighter colors correspond to higher spectral coefficients. We see that in this particular video, a person’s HR first increased from 90 to 120 bpm, and then lowered slightly.
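Such a spectrogram can be computed, for example, with scipy.signal.stft; the window length and overlap below are illustrative choices, and the input signal here is a synthetic placeholder:

import numpy as np
import scipy.signal

fps = 25
# Placeholder for the FIR-filtered Green channel: a synthetic 90 bpm (1.5 Hz) sine
filtered_signal = np.sin(2 * np.pi * 1.5 * np.arange(0, 60, 1 / fps))

freqs, times, stft_values = scipy.signal.stft(
    filtered_signal,
    fs=fps,
    nperseg=10 * fps,      # 10-second windows
    noverlap=9 * fps)      # slide the window by 1 second
spectrogram = np.abs(stft_values)   # rows: frequencies (Hz), columns: timestamps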

How do we make the HR tracking method more precise?

  1. After applying the STFT with a 10-second window, the frequency resolution of the spectrogram is only 6 bpm (0.1 Hz). We can make the frequency grid denser by applying zero padding prior to the DFT (see the sketch after this list).
  2. At some moments, the maximum value of the spectrum might not correspond to the HR. This might happen if a person is moving their head in that part of the video. Yet if we precisely measured their HR before they moved and it was, say, 100 bpm, and now the spectrum maximum is at 55 bpm, we can conclude that the current HR should not be taken to be the frequency with the maximum amplitude. In other words, for longer pulse rate tracking we have to look at the spectrogram above and find the best track from left to right. The track should go through the brightest points of the spectrogram (high-amplitude frequencies, see Picture 8) while not making overly large jumps in the frequency domain. To find such a track, we can use dynamic programming. Another way of taking past spectra into account when predicting the HR is to train a recurrent neural network on a sequence of spectra. To do so, however, we'll require labeled data, since we need to know the true HR value at any given moment of time.
  3. Experiment with signal pre-processing. You can adaptively choose the facial fragments that are most suitable for HR measurement in every video frame. Try mixing the color channels using POS and other signal separation methods. Experiment with different filter parameters. Before applying the FIR filter, you might want to interpolate the R, G, B signals to imitate a higher fps.
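A minimal sketch of the zero padding mentioned in point 1; the 4x padding factor and the synthetic input are arbitrary illustrative choices:

import numpy as np

fps = 25
# Placeholder 10-second window: a synthetic 90 bpm (1.5 Hz) sine instead of a real signal
window = np.sin(2 * np.pi * 1.5 * np.arange(0, 10, 1 / fps))
n = len(window)
spectrum = np.abs(np.fft.rfft(window, n=4 * n))           # zero-pad to 4x the length
freqs_bpm = np.fft.rfftfreq(4 * n, d=1.0 / fps) * 60.0
print(freqs_bpm[np.argmax(spectrum)])                     # grid step is now 1.5 bpm, not 6 bpm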

Another means of measuring the HR is to use convolutional neural networks (CNN), with or without an RNN. Several studies describing this approach have been published this year (such as this one). Currently, 3D CNNs and RNNs show the most impressive results for analyzing videos. However, they are usually trained to accomplish tasks that can be easily performed by humans: gesture classification, vehicle detection, etc. As we explained above, the heartbeat is not something you can see on someone's face, so you'll have to use some tricks if you want to train your network to do this.

  1. Regardless of what you do, first you'll need to reduce the influence of digital noise, at least by averaging the color of pixels within a face segment. To do so, we can simply reduce the face to a smaller resolution, such as 30x30 pixels. Note that the optimal resolution is lower for low-quality videos and higher for high-quality ones.
  2. If you plan to measure the HR using a 3D CNN, it would be great to reduce the movements of the person in the video. Face alignment can help you get rid of small head turns and tilts by applying an affine transform to the face in every frame so that the eyes and nose are pinned to fixed positions in the frame (a rough sketch follows after this list).
  3. Another helpful instrument is the frequency filters described above. By partially filtering out the frequencies that are far from plausible HR values (at every point of the face), you'll significantly lower the influence of changing lighting, movements, and breathing on the HR measurement accuracy.
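A rough sketch of the alignment from point 2, assuming the eye and nose coordinates come from a landmark detector; the output size and target landmark positions are arbitrary illustrative values:

import numpy as np
import cv2

def align_face(frame, eye_left, eye_right, nose, size=128):
    """Warp the frame so the eyes and nose land on fixed positions (sketch).

    eye_left, eye_right, nose -- (x, y) landmark coordinates (assumed inputs)
    """
    src = np.float32([eye_left, eye_right, nose])
    dst = np.float32([[0.3 * size, 0.35 * size],     # left eye target
                      [0.7 * size, 0.35 * size],     # right eye target
                      [0.5 * size, 0.60 * size]])    # nose target
    M = cv2.getAffineTransform(src, dst)
    return cv2.warpAffine(frame, M, (size, size))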

Since developing an HR estimation method that involves neural networks is by no means easy, we suggest you test the tool we've developed by trying our API. You can also apply our API tools when building your own apps: they can provide invaluable insights for such fast-growing industries as customer analytics, smart mobility, and robotics. In addition, the API's algorithms can be used to improve drivers' experience, experiment with building natural human-robot interaction (HRI), and advance personalized solutions in digital-out-of-home (DOOH).

Bonus: Video-based Heart Rate Detection by Neurodata Lab

Video 2. Oldie but goodie. Kit Harington Trying to Hide the Truth about the GoT Finale at The Late Show with Stephen Colbert.

Authors: Mikhail Artemyev, Machine Learning Specialist at Neurodata Lab; Francesca Del Guidice, PR Specialist at Neurodata Lab.


Neurodata Lab

We create multi-modal systems for emotion recognition and develop non-contact methods of physiological signal processing. Reach us at contact@neurodatalab.com