**Probing the Limits of Irregularly Sampled Short Time Series (ISSTS) Clinical Data: Defining the Problem (Part 1)**

## Time-series data, such as blood pressure readings, is often collected sporadically and inconsistently in primary care. Is it still analytically useful?

Time series data is abundant in healthcare settings but presents significant analytical challenges. In primary care, where data is collected only during occasional visits to the GP clinic, extracting clinically relevant insights from a longitudinal perspective is especially difficult.

Most people visit a GP only a few times a year, and usually only when they are sick. These visits follow no particular seasonality or frequency. As a result, the longitudinal data collected in primary care for blood pressure, body mass index, blood glucose, cholesterol levels, etc. is inherently problematic from an analytical perspective. While potentially useful, it provides a skewed perspective for early risk detection or any inference on disease progression.

Hence, the motivation for this study is to examine possible approaches to overcome the limitations of time series analysis in primary care. Blood pressure was selected as the topic of interest, as it was the most abundantly collected physiological indicator at GP clinics, regardless of a person’s chronic disease profile.

**Shape and structure of our time series dataset**

This analysis was performed on a de-identified dataset of clinic visits over five years. The focus is on addressing the analytical challenges specific to time series data, in particular blood pressure (BP) recordings. Each BP recording has two components — systolic and diastolic blood pressure — which were tagged to a specific date.

A descriptive analysis of the BP time series dataset revealed several internal inconsistencies that have analytical implications:

**Different monitoring durations.** While the dataset covered a range of five years, it also captured patients who registered or deregistered midway. As a result, the mean duration of BP monitoring was only two years. The short time series (STS) data is further complicated by the fact that the degree of ‘shortness’ is itself inconsistent, with a standard deviation of 19.4 months. A good analytical approach will need to flexibly accommodate or compensate for different STS lengths.

**Inconsistent sampling count.** Since BP is recorded during each clinic visit, the number of recordings depends on a patient’s visit frequency. Of the 27,600 patients in our dataset, 75% have at least 11 BP recordings. This manifests as another dimension of ‘shortness’ in our STS dataset, wherein the same monitoring duration could be represented by a varying number of data points. Hence, there are ‘gaps’ within the STS.

**Non-uniform time intervals.** The time elapsed between BP recordings is irregular, again depending on the pattern of clinic visits. Even for the same patient, BP could be recorded in quick succession (e.g. a follow-up visit a few days after an acute episode) or only months (or years) later. On average, the duration between BP recordings is 3–4 months.

**Missing not at random.** Underlying the challenges above is the fact that the variation between patients is not purely random. A patient’s BP is recorded during a clinic visit, which typically only happens when they have fallen ill. At times, these episodes could directly influence their BP levels, leading to a skewed representation of the patient’s average BP.

Altogether, BP is just one example of irregularly sampled short time series (ISSTS) data, which requires novel approaches to analyse.
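The first three properties described above (monitoring duration, recording count, and the interval between recordings) can be quantified per patient with a few lines of code. The following is a minimal sketch using only the Python standard library; the record layout, the toy values, and the function name `sampling_profile` are assumptions made for illustration and do not reflect the actual dataset schema.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Hypothetical toy records: (patient_id, visit_date, systolic, diastolic).
records = [
    ("p1", date(2019, 1, 10), 128, 82),
    ("p1", date(2019, 1, 17), 135, 88),   # follow-up a week later
    ("p1", date(2019, 6, 2), 124, 80),
    ("p1", date(2020, 3, 15), 130, 84),
    ("p2", date(2018, 5, 1), 142, 91),
    ("p2", date(2018, 9, 20), 138, 89),
]

def sampling_profile(records):
    """Summarise recording count, monitoring duration and gaps per patient."""
    visits = defaultdict(list)
    for patient_id, visit_date, _systolic, _diastolic in records:
        visits[patient_id].append(visit_date)

    profile = {}
    for patient_id, dates in visits.items():
        dates.sort()
        # Days between consecutive recordings (empty if only one recording).
        gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
        profile[patient_id] = {
            "n_recordings": len(dates),
            "duration_days": (dates[-1] - dates[0]).days,
            "mean_gap_days": mean(gaps) if gaps else None,
            "min_gap_days": min(gaps) if gaps else None,
            "max_gap_days": max(gaps) if gaps else None,
        }
    return profile

profile = sampling_profile(records)
for patient_id, stats in profile.items():
    print(patient_id, stats)
```

In this toy example, patient `p1` has gaps ranging from 7 to 287 days within a single series, which is the kind of non-uniformity described above.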

**Sketch of possible approaches**

Deriving insights from ISSTS data is uniquely challenging. In particular, clustering these ISSTS into subgroups could yield clinically relevant insights, such as whether a specific ISSTS trend or motif could be indicative of early chronic disease progression or deterioration. Learning deep representations of ISSTS could also lead to more robust phenotyping of patients.

However, existing methods often fall short due to the inherent complexities of ISSTS. Data-intensive approaches, such as recurrent neural networks (RNN) and long short-term memory (LSTM) models, typically work better for longer time series data. These methods are also not well adapted to handle the irregular sampling rate and ‘shortness’ of ISSTS data.

To address the ‘incompleteness’ within ISSTS, researchers have suggested two broad approaches. The first is data imputation, which treats the underlying dataset as having ‘missing data’ that requires some form of interpolation, e.g. using k-means. The second is to accept the ‘incompleteness’ as an intrinsic feature of the dataset, which is to say that the degree of irregularity itself tells us something clinically useful. Though there are merits to both, this study focuses more on the former approach due to the additional complexity introduced by the ‘shortness’ of the ISSTS data.
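To make the contrast between the two approaches concrete, here is a small sketch on a single, made-up series. Forward-filling stands in for the imputation family purely because it is the simplest option to show (it is not the method used in this study), while a mask plus a "steps since last observation" counter illustrates how missingness can be kept as a feature. All names and values are hypothetical.

```python
import math

NAN = float("nan")

# Hypothetical quarterly systolic BP; NaN marks a quarter with no recording.
series = [128.0, NAN, NAN, 135.0, NAN, 130.0]

def forward_fill(values):
    """Approach 1 (imputation): carry the last observed value into each gap."""
    filled, last = [], NAN
    for value in values:
        if not math.isnan(value):
            last = value
        filled.append(last)
    return filled

def missingness_features(values):
    """Approach 2 (missingness as a feature): describe the gaps explicitly.

    Returns (mask, steps_since_last), where mask is 1 for observed values and
    steps_since_last counts time steps since the previous observation.
    """
    mask, steps_since_last, gap = [], [], 0
    for value in values:
        observed = not math.isnan(value)
        mask.append(1 if observed else 0)
        steps_since_last.append(gap)
        gap = 1 if observed else gap + 1
    return mask, steps_since_last

print(forward_fill(series))
print(missingness_features(series))
```

The first function produces a complete series that hides where the gaps were; the second leaves the gaps in place and hands their pattern to the model as extra inputs.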

**Variational Autoencoder for Deep Embedding into Clusters (VaDER)**

One of the more applicable deep learning techniques is the Variational Autoencoder for Deep Embedding into Clusters, or VaDER, developed by de Jong et al. (2019). By integrating data imputation into the training process based on variational autoencoder principles, the researchers address the incompleteness of ISSTS data.

VaDER is in part based on VaDE, a clustering algorithm based on variational autoencoder principles, with a latent representation forced towards a multivariate Gaussian mixture distribution. Additionally, VaDER (i) integrates 2 long short-term memory (LSTM) networks into its architecture, to allow for the analysis of multivariate time series; and (ii) adopts an approach of implicit imputation and loss reweighting to account for the typically high degree of missingness in clinical data.
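One common way to realise this kind of loss reweighting is a masked reconstruction error, in which only observed values contribute to the loss and the model is free to output anything at missing positions. The sketch below shows the general idea in plain Python on a two-dimensional toy example; it is not taken from the VaDER code, and the function name `masked_mse` and the use of NaN to mark missing values are assumptions.

```python
import math

NAN = float("nan")

def masked_mse(x, x_hat):
    """Mean squared reconstruction error over observed entries only.

    x     : nested list of inputs, with NaN marking missing values
    x_hat : nested list of reconstructions with the same shape
    """
    total, n_observed = 0.0, 0
    for row, row_hat in zip(x, x_hat):
        for value, reconstruction in zip(row, row_hat):
            if math.isnan(value):
                continue  # missing entries get zero weight in the loss
            total += (value - reconstruction) ** 2
            n_observed += 1
    if n_observed == 0:
        raise ValueError("no observed values to compute a loss on")
    return total / n_observed

# Hypothetical systolic BP for two patients over three quarters.
x = [[120.0, NAN, 130.0],
     [NAN, NAN, 140.0]]
x_hat = [[118.0, 125.0, 133.0],
         [100.0, 100.0, 140.0]]

loss = masked_mse(x, x_hat)
print(round(loss, 3))  # 4.333
```

Because the missing positions carry no weight, changing the reconstructed values at those positions leaves the loss unchanged, so the values the model produces there act as its implicit imputations.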

VaDER emerged as an ideal candidate for several reasons:

- **Handling Missing Values**: VaDER’s architecture deals with missing values by directly integrating imputation into model training. It models the auto-correlation between time points, as well as cross-correlation between variables.
- **Short Time Series Adaptability**: VaDER learns latent representations from ISSTS, which can be used for clustering and other learning tasks (e.g. generating simulated patient trajectories).
- **Unsupervised Clustering**: VaDER facilitates unsupervised clustering, thus removing the reliance on ground truth labels, which may be absent or incomplete in some cases.

To make VaDER work for the chronic disease dataset, some assumptions were made during the preprocessing:

- Aggregate clinical measurements on a quarterly basis (every 3 months), since the average duration between BP recordings was 3–4 months.
- Select the most recent 12 quarters of data for each patient, counting back from the most recent clinic visit, so that every patient’s ‘monitoring window’ is standardised to 3 years.
- Transform static variables (e.g. age, ethnicity, weight) into 12 repeated encoded labels (1,0), thus mirroring the standardised length of the time series data.
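The three preprocessing steps above can be sketched roughly as follows. This is an illustrative outline rather than the code used in the study: it assumes calendar quarters, uses the mean as the aggregation function, and marks quarters without a recording as NaN. The function names, the visit tuples, and the example static flag are all hypothetical.

```python
from datetime import date
from statistics import mean

N_QUARTERS = 12  # 12 quarters = the 3-year monitoring window
NAN = float("nan")

def quarter_index(d):
    """Running calendar-quarter number, e.g. 2020-Q1 -> 2020 * 4 + 0."""
    return d.year * 4 + (d.month - 1) // 3

def to_quarterly_window(visits, n_quarters=N_QUARTERS):
    """Aggregate one patient's visits into a fixed-length quarterly series.

    visits: non-empty list of (visit_date, systolic, diastolic).
    Returns n_quarters rows of [mean systolic, mean diastolic], oldest first,
    ending at the quarter of the most recent visit. Quarters without a
    recording become [NaN, NaN].
    """
    by_quarter = {}
    for visit_date, systolic, diastolic in visits:
        by_quarter.setdefault(quarter_index(visit_date), []).append((systolic, diastolic))

    last = max(by_quarter)  # quarter containing the most recent visit
    window = []
    for q in range(last - n_quarters + 1, last + 1):
        readings = by_quarter.get(q)
        if readings:
            window.append([mean([r[0] for r in readings]),
                           mean([r[1] for r in readings])])
        else:
            window.append([NAN, NAN])
    return window

def repeat_static(encoded_value, n_quarters=N_QUARTERS):
    """Repeat an encoded static variable once per quarter."""
    return [encoded_value] * n_quarters

# Hypothetical patient with four visits, two of them in the same quarter.
visits = [
    (date(2019, 1, 10), 128, 82),
    (date(2019, 1, 17), 136, 88),
    (date(2019, 6, 2), 124, 80),
    (date(2020, 3, 15), 130, 84),
]
window = to_quarterly_window(visits)
is_female = repeat_static(1)  # hypothetical binary-encoded static variable

for row in window:
    print(row)
```

For this patient only 3 of the 12 quarters contain a recording, so most of the standardised window is missing. That is the situation the imputation-aware training described earlier is meant to handle.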

From here, training VaDER on the time series data followed the researchers’ publicly available implementation. Using the trained model, a series of benchmarks and experiments were conducted to test the rigour of VaDER; the results will be published in a follow-on article. Thanks for reading and follow us to stay updated on the findings!

*Phoebe Tan is a final-year Data Science and Analytics student at the National University of Singapore, with a keen interest in data science applications in healthcare.*

**Key References**

de Jong, J., et al. (2019). Deep learning for clustering of multivariate clinical patient trajectories with missing values. GigaScience, 8(11), giz134. https://doi.org/10.1093/gigascience/giz134

Harutyunyan, H., Khachatrian, H., Kale, D. C., et al. (2019). Multitask learning and benchmarking with clinical time series data. Scientific Data, 6, 96. https://doi.org/10.1038/s41597-019-0103-9

Kaushik, S., Choudhury, A., Sheron, P. K., Dasgupta, N., Natarajan, S., Pickett, L. A., & Dutt, V. (2020). AI in Healthcare: Time-Series Forecasting Using Statistical, Neural, and Ensemble Architectures. Frontiers in Big Data, 3, Article 4. https://doi.org/10.3389/fdata.2020.00004

Sun, C., Hong, S., Song, M., & Li, H. (2020). A Review of Deep Learning Methods for Irregularly Sampled Medical Time Series Data. arXiv preprint arXiv:2010.12493.