Probing the Limits of Irregularly Sampled Short Time Series (ISSTS) Clinical Data: Hidden Patterns or Hunting Phantoms? (Part 2)

Identifying clinically meaningful insights requires a different set of analytical techniques from what we are used to. What does such a toolkit look like?

Marymount Labs
6 min read · Sep 3, 2023

In the previous article, we defined the problem of analysing irregularly sampled short time series data, or ISSTS for short. VaDER was identified as a promising analytical technique, which simultaneously learns the latent representations and cluster assignments of its input samples.

During the VaDER implementation, several challenges were identified. These included:

  1. How should the optimal number of clusters be determined?
  2. How can ground truth labels be incorporated into model training?
  3. How can clustering results be refined?
  4. How does the data sampling rate (‘completeness’) affect clustering?

Before beginning the experiments, we constructed a baseline of clustering results using manual feature extraction, to provide a comparison for the VaDER clustering results. Time series features were extracted using the tsfresh Python package, and the elbow method was applied to determine an optimal number of clusters. The cluster centroids were visually well separated and could have phenotypic implications. For example, cluster 1 (orange) exhibited strong variability between quarters, and BP levels were generally elevated.

Baseline construction using tsfresh Python package
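The baseline pipeline can be sketched roughly as follows. The article used tsfresh for feature extraction; to keep the snippet self-contained, a few hand-rolled features and a minimal k-means stand in for it, and the toy data, feature choices, and cluster range are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
# toy stand-in for quarterly BP readings (12 quarters over 3 years)
series = rng.normal(loc=130.0, scale=10.0, size=(100, 12))

# hand-rolled features standing in for tsfresh's automated extraction
feats = np.column_stack([
    series.mean(axis=1),           # overall BP level
    series.std(axis=1),            # variability between quarters
    series[:, -1] - series[:, 0],  # net change over the period
])

def kmeans_inertia(x, k, iters=20):
    """Minimal Lloyd's algorithm; returns the sum of squared distances."""
    cent = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(x[:, None] - cent[None], axis=2)
        a = d.argmin(axis=1)
        cent = np.array([x[a == c].mean(axis=0) if np.any(a == c) else cent[c]
                         for c in range(k)])
    return float(((x - cent[a]) ** 2).sum())

# elbow method: compute inertia over a range of k and look for the bend
inertias = {k: kmeans_inertia(feats, k) for k in range(2, 8)}
print(inertias)
```

In practice, tsfresh's `extract_features` would replace the hand-rolled features above, producing hundreds of candidate features per series.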

1. Determining optimal number of clusters

In the original VaDER paper, the authors trained VaDER for different numbers of clusters k. To determine the optimal k, “prediction strength” was computed for each iteration:

  1. Train VaDER on the training data
  2. Assign clusters to the test data using the training model
  3. Train VaDER on the test data
  4. Assign clusters to the test data using the test model
  5. Compare the two resulting clusterings, i.e. compute the fraction of pairs of samples in the same test cluster that are also assigned to the same cluster by the training model

This method provides a sense of “cluster stability”, assuming that if robust latent representations were learnt from the training set, then these representations should also be learnt from the test set. In other words, there should be an optimal number of clusters given an adequately strong signal.
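The pair-agreement computation in step 5 can be sketched as follows. Aggregating by the minimum over clusters (as in Tibshirani and Walther's original prediction strength) is an assumption here, and the cluster labels are toy values:

```python
from itertools import combinations

def prediction_strength(test_assign, train_assign):
    """For each cluster in the test clustering, the fraction of its pairs
    that the training model also places in a single cluster; the minimum
    over clusters is reported (following Tibshirani & Walther)."""
    strengths = []
    for c in set(test_assign):
        idx = [i for i, a in enumerate(test_assign) if a == c]
        if len(idx) < 2:
            continue  # singleton clusters carry no pair information
        pairs = list(combinations(idx, 2))
        agree = sum(train_assign[i] == train_assign[j] for i, j in pairs)
        strengths.append(agree / len(pairs))
    return min(strengths)

# test-model clustering vs. the training model's assignments on the same points
test_model  = [0, 0, 0, 1, 1]
train_model = [2, 2, 0, 1, 1]
print(prediction_strength(test_model, train_model))  # → 0.333...
```

Note that the cluster indices themselves need not match between the two models; only pairwise co-membership matters.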

However, this assumption ignores the possibility that the same dataset may admit multiple sensible clusterings. Should this be the case, it is reasonable to expect that clustering results may not converge to a single optimum. The number of clusters should therefore be chosen to surface clinically meaningful outcomes, rather than to achieve the rather abstract benefit of “cluster stability”. Other mathematical metrics, such as inertia (the elbow method) and silhouette scores, suffer from the same weakness.

To give greater weight to clinical relevance, we suggest incorporating ground truth labels into model training and iteratively refining clustering results.

2. Incorporating ground truth labels into model training

To broadly test the feasibility of incorporating ground truth labels, each time series was tagged with either “hypertension” or “non-hypertension” based on the patient’s chronic disease profile. By introducing hypertension diagnosis as an outcome measure, the VaDER model jointly optimises for both reconstruction and prediction tasks.

This also allows evaluation of the clustering based on “cluster purity”, which is the percentage of data points in each cluster belonging to the dominant class of the outcome variable. For example, if data points in a specific cluster are generally labelled “hypertension”, this could indicate that a certain time series pattern is associated with hypertensive patients. Such a clustering could thus have clinical significance. Below, we demonstrate an average cluster purity of 0.9.
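Concretely, cluster purity can be computed along these lines (taking the unweighted average over clusters is an assumption, as are the toy labels):

```python
from collections import Counter

def cluster_purity(cluster_ids, labels):
    """Per cluster, the fraction of members in the dominant class;
    returns the unweighted mean over clusters (an assumption — one
    could also weight by cluster size)."""
    by_cluster = {}
    for c, y in zip(cluster_ids, labels):
        by_cluster.setdefault(c, []).append(y)
    fracs = [Counter(ys).most_common(1)[0][1] / len(ys)
             for ys in by_cluster.values()]
    return sum(fracs) / len(fracs)

# toy example: two clusters of BP trajectories tagged by diagnosis
clusters = [0, 0, 0, 1, 1, 1]
labels = ["hypertension", "hypertension", "non", "non", "non", "hypertension"]
print(cluster_purity(clusters, labels))  # → 0.666...
```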

VaDER clustering on 75% data sampling with ground truth labels

Visually, however, the cluster centroids do not appear as well separated as those from the baseline clustering using manual feature extraction. The centroids also fluctuate erratically between quarters. This is likely because the clustering is based on higher-dimensional features that are not observable in a 2-D plot.

3. Iteratively refining clustering results

Clinical significance of a clustering is also affected by how compactly the data points cluster around their centroids. To illustrate, a loosely defined clustering with high inertia (measured as the sum of squared distances) means that a wide range of time series patterns can be assigned to the same cluster. Any clinical implications or recommendations associated with such a cluster would then apply to patients exhibiting quite different patterns of BP levels, diminishing the possibility of targeted recommendations.

To achieve a more well-defined clustering, we implemented a straightforward approach to iteratively refine the clustering. Two parameters are considered:

  • P1: the maximum distance (e.g. Euclidean, cosine, etc.) between a data point and its cluster centroid. Data points that exceed P1 are considered outliers.
  • P2: the minimum proportion of data points in a cluster that must fall within the P1 threshold. Clusters that do not meet the P2 threshold are considered poorly defined.

Applying these thresholds successively to refine the clustering results takes the following steps:

  1. Initial clustering. Apply VaDER to initiate the clustering.
  2. Identify outliers. Every data point whose distance to its centroid exceeds the P1 threshold is labelled an outlier. Every cluster whose proportion of inliers falls below the P2 threshold is removed. All remaining clusters are considered well-defined, and only the outlier data points within them are removed.
  3. Handling outliers. All outliers removed in Step 2 are reclassified by training a new VaDER model on them. Steps 2 and 3 are repeated until all clusters meet the P2 threshold.
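Step 2 can be sketched on toy data as follows, assuming Euclidean distance and illustrative values for P1 and P2; the reclustering in step 3 (retraining VaDER) would then be applied to the returned outlier indices:

```python
import numpy as np

def identify_outliers(points, assign, centroids, p1, p2):
    """Step 2 sketch: flag points farther than P1 from their centroid,
    then split clusters into well-defined vs dissolved by the P2
    inlier fraction. Returns (well-defined cluster ids, indices of
    points to recluster)."""
    dist = np.linalg.norm(points - centroids[assign], axis=1)
    is_outlier = dist > p1
    well_defined, to_recluster = [], []
    for c in np.unique(assign):
        members = np.flatnonzero(assign == c)
        inlier_frac = 1.0 - is_outlier[members].mean()
        if inlier_frac >= p2:
            well_defined.append(int(c))
            to_recluster.extend(members[is_outlier[members]])
        else:
            to_recluster.extend(members)  # whole cluster dissolved
    return well_defined, np.array(to_recluster)

# toy data: two tight clusters plus one stray point assigned to cluster 0
pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.1], [2.5, 2.5]])
assign = np.array([0, 0, 1, 1, 0])
cents = np.array([[0.05, 0.0], [5.0, 5.05]])
wd, redo = identify_outliers(pts, assign, cents, p1=1.0, p2=0.6)
# both clusters stay well-defined; only the stray point 4 is reclustered
print(wd, redo)
```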

This approach generates additional clusters through the iteration. For example, Cluster E (pink), with a purity of 0.71, contained the most non-hypertensive patients of all the clusters. It is also the cluster with the least variance and the lowest mean BP levels throughout the 3-year period, indicating well-controlled BP.

VaDER clustering on 75% data sampling with iterative refinement applied

4. Testing effect of data sampling rate on clustering results

While one of VaDER’s benefits is imputation of missing data, we tested whether the quality of clustering differed between 75% and 100% data completeness. With refinement, the cluster centroids learnt from 100% complete data appeared visually more separated, and the average cluster purity also increased slightly to 0.92.

Potentially clinically significant clusters also surfaced. For example, Cluster B (red), which has a 100% purity of hypertensive labels, seems indicative of patients who had untreated hypertension and elevated BP levels for the first two years, followed by a sharp drop in BP which suggests better control via medication.

VaDER clustering on 100% data sampling with iterative refinement applied

Hidden patterns or hunting phantoms?

Extracting clinically significant insights from irregularly sampled short time series data is not trivial. Do the inherent shortcomings of the dataset remove all potential signals? Or are the methods used to induce signals from short time series data merely hunting phantoms?

There are no simple answers, but we know better evaluative techniques are needed. Simple metrics like prediction strength and cluster purity are inadequate for evaluating whether the generated clusters provide clinically meaningful insights. Better ground truth labels and post-clustering analysis (e.g. multivariate analysis) are possible strings to tug at.

Phoebe is a final-year Data Science and Analytics student at the National University of Singapore, with a keen interest in data science applications in healthcare.


de Jong, J., et al. (2019). Deep learning for clustering of multivariate clinical patient trajectories with missing values. GigaScience, 8(11), giz134.

Harutyunyan, H., Khachatrian, H., Kale, D. C., et al. (2019). Multitask learning and benchmarking with clinical time series data. Scientific Data, 6, 96.

Kaushik, S., Choudhury, A., Sheron, P. K., Dasgupta, N., Natarajan, S., Pickett, L. A., & Dutt, V. (2020). AI in Healthcare: Time-Series Forecasting Using Statistical, Neural, and Ensemble Architectures. Frontiers in Big Data, 3, Article 4.

Sun, C., Hong, S., Song, M., & Li, H. (2020). A Review of Deep Learning Methods for Irregularly Sampled Medical Time Series Data. arXiv preprint arXiv:2010.12493.


