Analysis of longitudinal data made easy with Leaspy

Igor Koval
8 min read · Aug 23, 2019


This tutorial will guide you through the main features of Leaspy, a Python library designed to analyze longitudinal data.

Leaspy : Learning Spatiotemporal Patterns in Python

[ Table of contents ]

  • Overview
  • Step 1. Let’s start
  • Step 2. Reconstruct the long-term process associated with the longitudinal short-term data
  • Step 3. Personalize the average trajectory to individual data
  • Step 4. Simulate virtual patients
  • Conclusion
  • [ Installation ]
  • [ Contacts & Reference ]

Overview

Leaspy, standing for LEArning Spatiotemporal Patterns in Python, has been developed to analyze longitudinal (or sequential) data that correspond to measurements of a long-term progression. Said differently, each sequence of repeated observations derives from a portion of the global process, with a certain variability between sequences.

Figure 1: Disease progression, measured in terms of conversion of a feature from a normal to an abnormal state, for three patients (yᵢⱼ corresponds to the j-th observation of the i-th patient), their respective individual trajectories ηᵢ in blue and the group-average process 𝛾₀ in grey. (a) shows the evolution of the feature during aging. (b) corresponds to the mathematical description of the disease progression, embedded into a Riemannian manifold M, with a geometrical shift (wᵢ) of 𝛾₀ and a temporal reparametrization of its timeline (αᵢ: acceleration factor, τᵢ: time-shift)

This scenario encompasses various phenomena, especially disease progression. For instance, let’s consider patients presenting first signs of dementia who undergo clinical examinations at different ages, as shown in Fig. 1(a). One can consider these to be short-term fractions of the long-term scenario of progression, which can be modeled as a group-average “curve” 𝛾₀, as shown in Fig. 1(b). The individual trajectories derive from 𝛾₀ according to the following individual parameters:

  • a temporal reparametrization of the timeline t ↦ αᵢ (t − τᵢ) such that the time-shift τᵢ delays (τᵢ < 0) or advances (τᵢ > 0) the disease onset, and the acceleration factor αᵢ accelerates (αᵢ > 1) or decelerates (αᵢ < 1) the individual progression.
  • a spatial variability wᵢ that accounts for the geometrical variability between progressions: while technical (see [1] for details and [3] for examples), in the case of a sequence of logistic progressions this variability takes the form of a possibly different ordering of the sequence at the individual level, compared to the mean progression.
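To make the temporal reparametrization above concrete, here is a tiny numerical illustration. The `reparametrize` helper below is purely illustrative, not part of the Leaspy API:

```python
def reparametrize(t, alpha=1.0, tau=0.0):
    """Map an individual's age t onto the group timeline: t -> alpha * (t - tau)."""
    return alpha * (t - tau)

ages = [70.0, 72.0, 74.0]

# tau shifts the timeline (here by 2 years); alpha = 1 leaves its pace unchanged:
shifted = [reparametrize(t, alpha=1.0, tau=2.0) for t in ages]      # [68.0, 70.0, 72.0]

# alpha = 2 doubles the pace of progression along the mapped timeline:
accelerated = [reparametrize(t, alpha=2.0, tau=0.0) for t in ages]  # [140.0, 144.0, 148.0]
```
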

To better outline the potential of Leaspy, we consider repeated cognitive measurements of patients who present signs of Alzheimer’s Disease, at different ages and stages. From this dataset, Leaspy essentially enables us to:

  • Reconstruct the long-term scenario of progression
  • Characterize the individual progressions
  • Analyze how cofactors modulate the individual progressions

An additional feature is presented in Step 4 : while characterizing the individual progressions, Leaspy learns their distribution and is thus able to simulate virtual patients that present the same characteristics as in the initial cohort.

This tutorial is fully reproducible from the example/start folder of the Leaspy website, after installing the software.

Analyzing longitudinal data does not only mean using Leaspy as a black box but also being able to develop new techniques and algorithms. To this end, Leaspy has been developed with user-friendliness in mind, so that users with limited Python skills can obtain substantial results, while its internal modularity allows researchers to implement new features.

This mindset is key to bring research from theory to practice, rapidly.

Step 1. Let’s start

Data

Figure 2: Example of a csv file used as input of the model. The first column is the id of the subject, the second corresponds to the age at the given visit, and the next ones are different measurements.

The input format is a csv file where the first column corresponds to the id of the subject, the second column to the age (or any time-related feature), and the next columns to the features. The first two header names must be
ID, TIME, as shown in Fig. 2, followed by the names of the features. In our case, they correspond to the memory, concentration, praxis and language capabilities of the patients.
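A minimal input file matching this layout can be put together as follows (toy ids and values; the feature names follow this tutorial’s dataset):

```python
import csv
import os
import tempfile

# Toy rows in the expected layout: ID, TIME, then one column per feature.
rows = [
    ["ID", "TIME", "MEMORY", "CONCENTRATION", "PRAXIS", "LANGUAGE"],
    ["patient_1", 71.2, 0.05, 0.10, 0.00, 0.02],
    ["patient_1", 72.4, 0.12, 0.18, 0.04, 0.07],
    ["patient_2", 68.9, 0.30, 0.25, 0.10, 0.15],
]

path = os.path.join(tempfile.mkdtemp(), "data.csv")
with open(path, "w", newline="") as f:
    csv.writer(f).writerows(rows)

# The file is now ready for: data = Data.from_csv_file(path)
```
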

The following Python snippet reads the csv file and transforms it into the appropriate format:

data = Data.from_csv_file('path/to/your/data.csv')

[ Note ] Even though not available yet, Leaspy will shortly offer the possibility to deal with data that embed a spatial structure, e.g. the intensity variations of pixels and voxels in images, or the signal change across the edges of a mesh. This method is described in [2].

Parameters

While out of the scope of this tutorial, the Leaspy algorithms rely on settings that describe their behavior. To simplify their use, default values have been implemented so that an algorithm can be called simply by its name:

settings = AlgorithmSettings('mcmc_saem')

[ Advanced ] It is possible to change the default values of the settings, either by loading a json file (an example can be found in example/start/_inputs/algorithm_settings.json), or by adding kwargs arguments that correspond to the ones in the json file.
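The override mechanism can be pictured as layered dictionaries: built-in defaults first, then the json file, then the kwargs, with later layers winning. This is a schematic sketch of that pattern, not the actual Leaspy internals, and the default values shown are made up:

```python
import json

# Hypothetical defaults for the 'mcmc_saem' algorithm (illustrative values only).
DEFAULTS = {"n_iter": 10000, "seed": None}

def build_settings(json_path=None, **kwargs):
    """Defaults first, then the json file, then kwargs: later layers win."""
    settings = dict(DEFAULTS)
    if json_path is not None:
        with open(json_path) as f:
            settings.update(json.load(f))
    settings.update(kwargs)
    return settings

# build_settings(n_iter=500) keeps the default seed but overrides n_iter.
```
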

Step 2. Reconstruct the long-term process associated with the longitudinal short-term data

Figure 3: Types of temporal progression the model can fit to the data

Once the data and settings are set, you’re ready to reconstruct the long-term evolution of your features of interest. You first need to choose the type of profile you want to deduce from your data, either logistic or linear (soon exponential), with the possibility to enforce a parallelism between the features, as shown in Fig. 3, with the following command:

leaspy = Leaspy("logistic_parallel")
leaspy.fit(data, settings)

The first line describes the type of model you want to instantiate, among “logistic”, “logistic_parallel”, “linear” and “linear_parallel” (and soon exponential), matching the profiles shown in Fig. 3.

Figure 4: Logs during the fitting process: (a) a console print describes the algorithm and model state every n iterations, (b) pdf files show how the model fits some individual observations, every n iterations (here, 1, 100 and 1000 iterations), (c) a pdf file shows the convergence of some parameters: the noise, three population parameters and two individual parameters with their mean and variance.

[ Advanced ] Prior to the fit command, it is possible to add optional parameters in a dictionary that is loaded with the command leaspy.model.load_hyperparameters(dict).

[ Advanced ] Prior to the fit command, settings.set_logs(path/to/logs) enables saving logs during the algorithm training in order to check the convergence of the algorithm, especially with the plots shown in Fig. 4.

Figure 5: Long-term group-average progression of four cognitive capabilities during the course of Alzheimer’s Disease, from 65 to 90 years-old.

Leaspy offers a visualization toolbox to plot various outcomes of the model. Fig. 5 shows the average progression of the four aforementioned cognitive assessments during the course of Alzheimer’s Disease, obtained with:

plotter = Plotter('path/to/output')
plotter.plot_mean_trajectory(leaspy.model)

[ Advanced ] The path/to/output argument is optional but may be useful to save the plots as pdf files.

[ Advanced ] There are also optional commands that you might want to run in a daily analysis: leaspy.save(path/to/output/json/file) and Leaspy.load(path/to/output/json/file), which enable you to store the model and reload it another day.

[ Note ] Even though the fit procedure computes some values of the individual parameters, as shown in Fig. 4(b), they cannot be used directly. They have to be accurately estimated in a second step.

Step 3. Personalize the model to individual data

Figure 6: Subject-wise analysis of the longitudinal data with (a) the personalization of the model to individual data, (b) the distribution of an individual parameter in different subpopulations, and, (c) a scatter plot of two individual parameters, the log-acceleration αᵢ on the x-axis, the time shift τᵢ on the y-axis, colored by a cofactor, each dot corresponding to an individual.

While the mean trajectory is of particular interest to better understand the course of a disease, one might be interested in personalizing this trajectory to individual measurements, which present a temporal variability (faster, slower, earlier, later) and a geometrical variability (different ordering of the worsening of the cognitive scores). Not much sweat is needed to obtain it:

settings = AlgorithmSettings('scipy_minimize')
results = leaspy.personalize(data, settings)
plotter.plot_patient_trajectory(leaspy.model, results, indices)

The last snippet corresponds to Fig. 6(a), which shows the personalization of the mean trajectory to individual progressions. The analysis can be taken further by considering cofactors (gender, marital status, genetic mutations, …) that might modulate the disease progression. The snippets:

# Considering df a dataframe of cofactors
data.load_cofactors(df, cofactors=['gender', 'mutation'])
plotter.plot_distribution(results, param='tau', cofactor='mutation')
plotter.plot_correlation(results, 'xi', 'tau', cofactor='mutation')

produce Fig. 6(b) and (c), which show how cofactors might affect the individual parameters of disease progression.

[ Advanced ] Note that Leaspy does not provide statistical tools to analyze the association between the cofactors and the individual parameters. We believe the user must understand the statistical assumptions she or he is making when dealing with statistical tests, rather than relying on black-box tests.

Figure 7: Given the observations (crosses) and their reconstruction (lines), it is possible to temporally reparametrize the individual trajectories on the average timeline, for the four cognitive assessments. It basically shows that the model reconstructs a long-term progression from short-term snapshots. The non-alignment of the individual trajectories is perfectly normal, as they might present some geometrical variations relative to the mean.

A last interesting check lies in the ability of the model to reconstruct the long-term group-average trajectory from short-term data. The snippet plotter.plot_patients_mapped_on_mean_trajectory(leaspy.model, results) produces Fig. 7, which essentially shows that once the individual trajectories are mapped onto the group-average timeline, they are consistent with being snapshots of a long-lasting disease progression.

[ Advanced ] It is possible to save and load the individual parameters with the commands leaspy.save_individual_parameters(path, parameters) and leaspy.load_individual_parameters(path).

[ Advanced ] It is very tempting to check whether the personalization to individual data provides a good prediction of future timepoints. This is easy to achieve: for a patient with n time-points, give only the first (n−1) for personalization and compare the prediction at the last time-point with the real value. The same idea can be used to impute missing data at any time-point [4].
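The evaluation protocol described above can be sketched as a simple split helper (a generic function for illustration, not part of Leaspy):

```python
def leave_last_visit_out(timepoints, values):
    """Keep the first n-1 visits for personalization, hold out the last one
    so the model's prediction at that age can be compared to the real value."""
    assert len(timepoints) == len(values) and len(timepoints) >= 2
    train = (timepoints[:-1], values[:-1])
    heldout = (timepoints[-1], values[-1])
    return train, heldout

# Example: 4 visits, keep 3 for personalization, predict the 4th.
train, heldout = leave_last_visit_out([70, 71, 72, 73], [0.10, 0.20, 0.35, 0.50])
# train   -> ([70, 71, 72], [0.10, 0.20, 0.35])
# heldout -> (73, 0.50)
```
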

Step 4. Simulate virtual patients

As a result of the previous steps, Leaspy offers the possibility to simulate virtual patients, i.e. virtual longitudinal trajectories that mimic the ones in the initial cohort. Enabled by leaspy.simulate(results, settings), the resulting virtual cohort reproduces the characteristics of the initial one, with an arbitrary number of patients and follow-up visits, and potentially a finer granularity between the visits.
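Conceptually, the simulation draws new individual parameters from the distribution learned during personalization. A minimal sketch of that idea, with made-up distribution values (in Leaspy these moments are estimated from the personalized cohort):

```python
import random

# Hypothetical learned distribution of the individual parameters.
TAU_MEAN, TAU_STD = 74.0, 5.0   # time-shift (years)
XI_MEAN, XI_STD = 0.0, 0.4      # log-acceleration

def simulate_virtual_patients(n, seed=0):
    """Draw n virtual patients, each as a dict of individual parameters."""
    rng = random.Random(seed)
    return [
        {"tau": rng.gauss(TAU_MEAN, TAU_STD), "xi": rng.gauss(XI_MEAN, XI_STD)}
        for _ in range(n)
    ]

cohort = simulate_virtual_patients(100)
```

Each sampled (τᵢ, αᵢ = exp(ξᵢ)) pair then defines a full virtual trajectory through the model, which is how an arbitrarily large cohort can be generated.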

These virtual cohorts have been shown to improve the predictive power of algorithms that lack sufficient training data [5], which is often the case with medical data.

The data used in this tutorial and available in the example/start folder are simulated data, derived from cognitive assessments of the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset.

Simulating anonymized data is particularly relevant for medical data whose sharing policies are often complex and time-consuming.

Conclusion

This tutorial presented some of the main features of Leaspy. We believe it to be a powerful tool to analyze the progression of longitudinal phenomena, especially disease progression, both at a group level and at an individual level. It aims to gather a vast community of users, from clinicians with new data to developers and researchers with outstanding algorithms.

[ Installation ]

The software, hosted on GitLab, can be cloned with the following commands:

git clone https://gitlab.com/icm-institute/aramislab/leaspy/
cd leaspy/
pip install -r requirements.txt

The git clone command might ask for a username and password (related to your GitLab account, which you might need to create). If preferred, the previous lines can be run directly from the Jupyter notebook in example/start.

The needed imports are:

from leaspy.main import Leaspy
from leaspy.inputs.data.data import Data
from leaspy.utils.output.visualization.plotter import Plotter
from leaspy.inputs.settings.algorithm_settings import AlgorithmSettings

[ Contacts & References ]

This work has been developed at the Aramis Lab based at the Pitié-Salpêtrière hospital in Paris, France. It can deeply benefit from your questions, concerns or feedback, which you can address as issues on the Leaspy code or directly to igor [dot] koval [ at ] inria [dot] fr.

The development of the model has been made possible by the support of the European Research Council, within the Horizon 2020 Research & Innovation program, and by the following institutions : ICM, Inserm, Inria, CNRS and Sorbonne Université.

  • [ Leaspy ] Sources of the code
  • [1] [ Schiratti et al, 2017 ] A bayesian mixed-effects model to learn trajectories of changes from repeated manifold-valued observations.
  • [2] [ Koval et al, 2018 ] Statistical learning of spatiotemporal patterns from longitudinal manifold-valued networks
  • [3] [ Koval et al, 2018 ] Simulating Alzheimer’s disease progression with personalised digital brain models (preprint)
  • [4] [ Couronné et al, 2019 ] Learning disease progression models with longitudinal data and missing values.
  • [5] [ Koval et al, 2019 ] Simulation of virtual cohorts increases predictive accuracy of cognitive decline in MCI subjects (preprint)



Igor Koval

PhD Student — Machine Learning @ Brain and Spine Institute