What’s wrong with aging clocks?

Dmitrii Kriukov
Nov 21, 2024

Hello, Medium!

Recently, our team published a paper at the intersection of aging biology and machine learning, in which we critically evaluated the use of so-called epigenetic aging clocks for measuring cellular rejuvenation during cellular reprogramming. Aging clocks have already been discussed on Medium (here, here, and here), which reflects how popular the topic has become in modern biology with the rise of machine learning methods. As for cellular reprogramming (the rejuvenation it supposedly produces is discussed, for example, here), it has grown into a massive field in cell biology and tissue engineering.

In this article, however, I want to share a more personal story. It’s a story of how I slowly delved into the depths of mathematics and the concept of aging clocks. At one point, I was shocked to realize how deeply misconceptions and cognitive biases have taken root in this area of science. To illustrate how machine learning can play tricks on researchers, I will systematically introduce all the key terms and then explain why uncertainty estimation is so crucial in practical machine learning — and in aging biology in particular.

I understand that it’s unlikely I’ll be able to exhaust this topic in a single article. However, I’ll do my best to make you skeptical enough every time you hear terms like “biological age” or “aging clocks.”

Reprogramming-Induced Rejuvenation

After completing my master’s degree in electrical engineering, I entered a PhD program with a clear goal: to do everything possible to help knowledgeable people find a cure for aging, offering my computational skills to those who need them. Even then, I found the topic of aging clocks rather strange. Although it resonated with my skills in computation and coding, I wanted to work on something more biological. Nevertheless, I assumed that the “clockmakers” knew what they were doing, and I had no doubt that it was one of the worthiest branches of aging biology — until I was asked to use aging clocks in my own research.

The topic of my research was cellular reprogramming, specifically the molecular features of rejuvenation that supposedly occur during reprogramming. Cellular reprogramming refers to the process of inducing “stemness” in a cell by exposing it to certain genetic or chemical factors. For example, you can take skin cells, treat them with special substances, and turn them into stem cells. Stem cells are unique in their ability to transform into almost any type of cell in the body: muscle, bone, neurons, etc. (for biological details, see a recent review).

About 10 years ago, it was calculated that the epigenetic (biological) age of cells undergoing reprogramming is reduced to zero! This result has been replicated many times, and I myself have plotted graphs like the one below:

Fig. 1 — Epigenetic age of cells at different stages of reprogramming, calculated using S. Horvath’s aging clock. Source.

In our study, we examined another aspect of the biological age of cells — the transcriptomic age, which is based on the gene expression within the cell. We calculated it and indeed observed how this age dropped to zero. However, in some experiments, it dropped below zero, while in others, it did not decrease at all. These results already raised doubts in my mind, doubts I could not put into words at the time — I lacked the “language” to express them concisely and precisely. I did not truly understand what biological age actually was.

Biological Age and Aging Clocks: A Brief Excursion

Let’s take a moment to discuss biological age, starting with a definition:

Biological age — a number, expressed in units of time (e.g., years), that reflects the health status of an organism.

It is important to note that biological age (B) is generally not equal to chronological age (C). For example, your passport age may be 30 years, but you might look 40, while having the blood sugar and blood pressure levels of an 18-year-old. The difference between biological and chronological age is commonly referred to as the “aging acceleration” Δ, and the three concepts fit neatly into the following simple equation:

$$B = C + \Delta$$

When I say that we estimate one age based on a photograph and another based on blood parameters, I can formally write it as follows:

$$B = f(X)$$

where X represents the data used to assess your biological age (e.g., blood test parameters, urine analysis, facial photographs, psychological questionnaires, etc.), and f is the aging clock algorithm used for this assessment.

Here, I deliberately use an abstract function f to denote aging clocks. In specific cases, f could be a machine learning algorithm: decision trees, linear regression, neural networks, etc. But it could also be any arbitrary algorithm, an expert evaluation, or even your neighbor estimating your age based on a new outfit.
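To make the abstraction concrete, here is a minimal Python sketch of the two definitions above, B = f(X) and Δ = B − C. Both "clocks" are hypothetical: their names, input fields, and coefficients are invented purely for illustration and do not correspond to any published model.

```python
from typing import Callable, Mapping

# An aging clock f maps some data X to a biological age B (in years).
Clock = Callable[[Mapping[str, float]], float]


def toy_blood_clock(x: Mapping[str, float]) -> float:
    """Hypothetical clock with made-up coefficients (illustration only)."""
    return 30.0 + 0.5 * (x["systolic_bp"] - 120.0)


def toy_photo_clock(x: Mapping[str, float]) -> float:
    """Another hypothetical clock that looks at entirely different data."""
    return 25.0 + 30.0 * x["wrinkle_score"]


def aging_acceleration(
    clock: Clock, x: Mapping[str, float], chronological_age: float
) -> float:
    """Delta = B - C, where B = f(X)."""
    return clock(x) - chronological_age


person = {"systolic_bp": 110.0, "wrinkle_score": 0.5}
C = 30.0  # passport age

print(aging_acceleration(toy_blood_clock, person, C))  # -5.0
print(aging_acceleration(toy_photo_clock, person, C))  # 10.0
```

The same person comes out five years "younger" by one clock and ten years "older" by the other, simply because f and X differ.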

Attentive readers might already feel uneasy about the arbitrariness of this definition of biological age and the fact that it significantly depends on the data X used to define it.

This leads us to the question: “Is there a gold standard for biological age?” Unsurprisingly, the answer is “No.” Biological age is a kind of latent variable: a characteristic that we calculate but do not measure directly. It is, in essence, a health index, the value of which depends entirely on how it is defined. This is both its beauty and its curse (we will revisit this below).

Whenever you hear that “scientists have invented epigenetic aging clocks,” or “cognitive aging clocks,” or “blood-based aging clocks,” or “a biological age calculator based on photographs,” know that this almost certainly refers to another machine learning model trained on specific data. Among the plethora of aging clocks, epigenetic clocks, those where machine learning models are trained on DNA methylation data (a type of epigenetic data), stand apart.

DNA methylation — a chemical modification sitting at many sites in our genome — when measured in a human blood sample, correlates well with chronological age. In other words, this is an unusual type of data that can predict your chronological age with remarkable accuracy. And yes, I emphasized chronological age for a reason. Let me now explain how most (fortunately, not all) of the aging clocks created to date — the so-called first-generation clocks — actually work. Watch closely.

Step 1.

Take a machine learning model g. It could be linear regression, random forest, or gradient boosting; it doesn’t matter. Train the model to predict chronological age (C) from some biomarker data (X, e.g., DNA methylation). A well-trained model will satisfy the following relationship:

$$C = g(X) + \varepsilon$$

where ε is the error of the model relative to the true chronological age C.

Step 2.

Now define the model’s predictions g(X) as the biological age:

$$B = g(X)$$

PROFIT!

If you thought I missed something or made a mistake — no, I didn’t. Here is a recent review in a reputable journal on the subject (URL). This is exactly how biological age is defined in the view of most scientists — through a substitution of concepts.
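The two steps can be sketched in a few lines of Python. This is a toy under stated assumptions: the biomarker is synthetic, the model g is a one-feature least-squares line, and I take ε = C − g(X), the sign convention that matches the Fig. 2 caption. None of the numbers come from a real clock.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


rng = random.Random(0)
ages = [rng.uniform(20, 80) for _ in range(200)]  # chronological age C
# Hypothetical biomarker X that drifts linearly with age, plus noise.
marker = [0.2 + 0.005 * c + rng.gauss(0, 0.02) for c in ages]

# Step 1: train g to predict chronological age C from X.
slope, intercept = fit_line(marker, ages)


def g(x):
    return intercept + slope * x


# Step 2: relabel the prediction as "biological age".
biological_age = [g(x) for x in marker]

epsilon = [c - b for c, b in zip(ages, biological_age)]       # model error
acceleration = [b - c for b, c in zip(biological_age, ages)]  # Delta = B - C

# The "aging acceleration" is just the regression residual with a flipped sign.
print(all(abs(d + e) < 1e-9 for d, e in zip(acceleration, epsilon)))  # True
```

Nothing in this procedure ever sees a health outcome: whatever the model fails to explain about chronological age gets reported as accelerated or decelerated aging.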

Fig. 2 — Summary of the first generation aging clock design procedure. The model error with the opposite sign is equal to the acceleration of aging. Source: Jarod Rutledge et al. / Nature, 2022.

It was astonishing for me to observe how, behind layers of complex terminology and scientific expressions, this implicit but simple conceptual substitution was hidden. So, the natural question arises: “How has this approach persisted and self-replicated for over a decade — and, broadly speaking, more than 50 years?” Well, sometimes it actually works!

It just so happens that many biomarkers that correlate well with chronological age also correlate with pathological conditions in the same direction.

This phenomenon is linked to the concept of the identical association assumption, which was thoroughly discussed in a recent landmark paper. It’s remarkable that it took 10 years to articulate this concept. The article certainly deserves its own review and discussion.

For instance, it’s well-known that systolic blood pressure increases with age across populations. But if you suffer from chronic hypertension and ignore your doctor’s advice, your risk of early cardiovascular disease and death rises significantly. Thus, it’s reasonable to assume that biological age for hypertensive individuals will generally be higher. If we train an aging clock on blood pressure using the algorithm described above, it will sometimes work correctly. Unfortunately, “sometimes” is not enough.

Uncertainty in Machine Learning

Before returning to cellular rejuvenation, let’s delve into the concept of uncertainty in machine learning and why it’s so important in our field.

Consider the classic problem of binary classification of cat and dog photos. You diligently train a model on a standard dataset of cat and dog images. You test it and suddenly encounter a photo of a couch. What will the model output? Experts familiar with the topic will recognize the issue of an outlier or out-of-distribution sample and correctly answer that such a model cannot produce a reliable prediction. “It might output something like 0.51 as the probability of the class ‘dog,’” they’d say. But they’d say this as experts in distinguishing cats, dogs, and couches, as all these objects are intuitively familiar to them.

Fig. 3 — A classic example of out-of-distribution prediction for intuitive data.
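Here is a small Python sketch of the same effect, assuming each photo has been reduced to a single hypothetical numeric feature. The classifier is a plain logistic regression trained by gradient descent on synthetic "cats" and "dogs"; the "couch" is simply a feature value far outside anything seen in training.

```python
import math
import random


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(xs, ys, lr=0.1, epochs=500):
    """Fit p(dog | x) = sigmoid(w * x + b) by plain gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


rng = random.Random(0)
cats = [rng.gauss(-2.0, 1.0) for _ in range(100)]  # label 0
dogs = [rng.gauss(2.0, 1.0) for _ in range(100)]   # label 1
w, b = train_logistic(cats + dogs, [0] * 100 + [1] * 100)

couch_feature = 15.0  # far outside the training range
p_dog_couch = sigmoid(w * couch_feature + b)
print(f"P(dog | couch) = {p_dog_couch:.3f}")
```

Depending on where the couch happens to land in feature space, the answer may be close to 0.5 or, as in this toy, close to 1. Either way the model returns a number, and nothing in that number says "I have never seen anything like this."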

The situation becomes much worse when the classified objects are represented by some biological data, about which we have far less intuitive knowledge. This is the case when machine learning enters science: our intuition decreases, while the stakes increase. A physician using ML algorithms in practice cannot afford to misdiagnose pancreatic cancer, meaning the ML model must not only be accurate but also reliable. A reliable model in this context is one that informs the physician of its uncertainty in predicting cancer. Formally speaking, the model must output not just a prediction but also a measure of its uncertainty.

Uncertainty in machine learning comes in various forms. Quantifying it can be challenging and often requires advanced ML models. For the purpose of this discussion, the focus is on epistemic uncertainty (classification taken from here), which refers to uncertainty arising from a lack of knowledge about something.

Returning to the example of cats and dogs: the neural network knew nothing about the existence of an object like a couch. To the model, it is something entirely incomprehensible, something outside the original distribution of the training dataset. It is perfectly reasonable to assume that if the observed set of features was never encountered during training, the model cannot know anything about it — much like a person who has never heard of a platypus is unlikely to classify it correctly upon encountering it for the first time.

Now we are ready to return to the example of cellular rejuvenation, bringing together all the concepts introduced in this article. Just as in the cats-and-dogs example, an aging clock (a regression model) encounters only data from healthy human tissue samples during training. The test dataset, however, contains something entirely different.

Fig. 4 — An example of out-of-distribution prediction for non-intuitive biological data.

In the test dataset, we observe cells that initially resemble normal human cells but, during reprogramming with the help of chemical or genetic factors, undergo transformations that the model has never encountered before. With each passing day of transformation (see Fig. 1), the cell increasingly deviates from the characteristics of a normal, healthy tissue cell, progressively acquiring traits of a stem cell. This is difficult to imagine conceptually but is clearly visible in the data from our publication.

Fig. 5 — Projection of DNA methylation data onto the axes of the first two principal components. Source: Dmitrii Kriukov et al. / Aging Cell, 2024

Here, high-dimensional DNA methylation data is projected onto a plane to visualize how, with each additional day of reprogramming (red squares), the cells increasingly diverge from the original training dataset (blue circles). This gradual departure from the training dataset inevitably leads to an increase in the prediction error for biological age.
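A rough way to quantify this kind of drift without any plotting is to score each new sample by how far it sits from the training data. The Python sketch below is not the PCA analysis from the paper: it swaps in a simpler root-mean-square z-score, and both the "training" samples and the "reprogramming trajectory" are synthetic values invented for illustration.

```python
import math
import random
from statistics import mean, stdev


def fit_reference(train):
    """Per-feature mean and standard deviation of the training samples."""
    cols = list(zip(*train))
    return [mean(c) for c in cols], [stdev(c) for c in cols]


def ood_score(sample, mus, sds):
    """Root-mean-square z-score of one sample against the training set."""
    return math.sqrt(
        mean(((v - m) / s) ** 2 for v, m, s in zip(sample, mus, sds))
    )


rng = random.Random(0)
n_sites = 5  # hypothetical methylation sites
train = [[rng.gauss(0.5, 0.05) for _ in range(n_sites)] for _ in range(50)]
mus, sds = fit_reference(train)

# Synthetic trajectory: every site drifts a little further on each day.
days = [0, 5, 10, 15, 20]
trajectory = [[0.5 + 0.02 * day] * n_sites for day in days]
scores = [ood_score(sample, mus, sds) for sample in trajectory]

for day, score in zip(days, scores):
    print(f"day {day:2d}: distance from training data = {score:.1f} SD")
```

A score of several standard deviations is a warning that any prediction made for that sample is an extrapolation, whatever number the clock prints.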

Finally, to demonstrate the growth of this error, one can train an ML model capable of assessing its own uncertainty and apply it to the reprogramming data — which is precisely what we did.

Fig. 6 — Predictions of the age of reprogrammed cells along with the confidence interval. To estimate the confidence interval, we used a Gaussian process model. Source: Dmitrii Kriukov et al. / Aging Cell, 2024
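To show why a Gaussian process is a natural fit here, below is a tiny self-contained Python sketch of GP regression with an RBF kernel. It is a toy, not the model from our paper: the single input feature, the training data, and all hyperparameters (`length`, `signal_sd`, `noise_sd`) are assumptions chosen for illustration.

```python
import math


def rbf(a, b, length=1.0, signal_sd=20.0):
    """Squared-exponential kernel between two scalar inputs."""
    return signal_sd ** 2 * math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def gp_predict(x_train, y_train, x_new, noise_sd=3.0):
    """Posterior mean and standard deviation of the latent function at x_new."""
    y_mean = sum(y_train) / len(y_train)  # use the mean age as the prior mean
    K = [
        [rbf(a, b) + (noise_sd ** 2 if i == j else 0.0)
         for j, b in enumerate(x_train)]
        for i, a in enumerate(x_train)
    ]
    k_star = [rbf(x_new, a) for a in x_train]
    alpha = solve(K, [y - y_mean for y in y_train])
    v = solve(K, k_star)
    mu = y_mean + sum(k * a for k, a in zip(k_star, alpha))
    var = rbf(x_new, x_new) - sum(k * w for k, w in zip(k_star, v))
    return mu, math.sqrt(max(var, 0.0))


x_train = [0.5 * i for i in range(9)]         # biomarker values 0.0 .. 4.0
y_train = [20.0 + 15.0 * x for x in x_train]  # ages 20 .. 80

mean_near, sd_near = gp_predict(x_train, y_train, 2.0)   # inside training range
mean_far, sd_far = gp_predict(x_train, y_train, 10.0)    # far outside it

print(f"near: {mean_near:.1f} +/- {1.96 * sd_near:.1f} years")
print(f"far:  {mean_far:.1f} +/- {1.96 * sd_far:.1f} years")
```

Inside the training range the interval is narrow. Far from it, the prediction falls back to the average training age and the interval widens to the full prior spread, which is the model's way of admitting it has nothing to go on.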

By day 20, the predicted age had dropped into negative values. This should not surprise us, as the model typically has no constraints on the sign of age — it is simply an algorithm performing its task.

Incidentally, even the aforementioned Horvath clocks yield negative raw predictions when applied to reprogrammed cells. So where does the zero age come from? As I discovered, it is a result of post-processing applied to the predictions of Horvath clocks. Specifically, the output of the linear model undergoes a transformation resembling the ELU activation function from neural networks, which squashes negative inputs exponentially toward a fixed lower bound instead of letting them decrease without limit, so strongly negative raw outputs end up close to zero.
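For reference, here is the textbook ELU next to ReLU in Python. This is only the generic activation function, not the actual post-processing code of any clock, and `alpha` is the usual ELU parameter rather than a value taken from a published model.

```python
import math


def elu(x: float, alpha: float = 1.0) -> float:
    """ELU: identity for x >= 0, smooth exponential floor at -alpha below 0."""
    return x if x >= 0 else alpha * (math.exp(x) - 1.0)


def relu(x: float) -> float:
    """ReLU: identity for x >= 0, exactly zero below 0."""
    return max(0.0, x)


for raw in (-50.0, -5.0, -1.0, 0.0, 3.0):
    print(f"raw={raw:6.1f}  elu={elu(raw):8.4f}  relu={relu(raw):4.1f}")
```

An ELU-style transform never lets the output fall below `-alpha`, however negative the raw value is. That is how a wildly negative raw prediction can be turned into a tidy number near zero.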

Ultimately, we conclude that due to the enormous prediction error (with confidence intervals spanning 100 years), epigenetic aging clocks can hardly be considered a “reliable” tool for predicting rejuvenation in reprogramming. Someone might argue, “But we see a decreasing trend in age — isn’t that sufficient?” To this, I would respond as follows:

First, not always. According to some aging clocks, the trend is actually increasing. Second, this seemingly decreasing trend has nothing to do with normal physiological aging or rejuvenation. It reflects a third, model-unseen state, whose biological age can best be described as “confusing.” This state should by no means be interpreted as an invitation for large-scale in vivo reprogramming of organisms. Injecting Botox into your face might make you look younger, but you wouldn’t call it a systemic rejuvenation intervention.

That said, I must explicitly clarify here: I do not know whether reprogramming can lead to systemic rejuvenation of the organism. So far, the accumulated experimental results (which you can also find mentioned in our paper) suggest it probably cannot.

The arguments I’ve presented above represent only a small portion of the arguments discussed in our article, but they are sufficient to convey the essence of what my colleagues and I wanted to express — and what I hope to discuss further here.

Aging Clocks: A Way to Impress Your Supervisor or a Real Clinical Tool?

Aging clocks are cursed. I became convinced of this during the four years of my PhD studies, which I dedicated to this topic for various reasons. But what exactly is this curse? It affects anyone who, instead of using honest, explicit, and admittedly boring surrogate metrics of aging (such as the Frailty Index), chooses to dabble with implicit ML algorithms, creating latent variables and ultimately getting entangled in them. By juggling their cognitive biases, such researchers produce contradictory or even absurd claims, presenting them as positive scientific results. The hype surrounding ML/DL algorithms only exacerbates the situation: some doctors and biologists lack a deep understanding of machine learning, their math students conversely lack an understanding of biology and medicine, and together they churn out dozens of papers for the sake of publications. Strangely, this cycle works: grants are awarded, hackathons are held, companies are established, and services for measuring biological age are sold. But does all this have a bright side? Can aging clocks actually be useful in the clinic?

Yes! And again, yes! But it’s far from simple. The first-generation aging clocks I described earlier, thankfully, are only part of the story. According to our second formula, biological age is just a function of data. But what if we trained this function to predict something other than chronological age — for example, remaining lifespan or the time until the onset of a chronic disease? It turns out that, after proper training, the resulting algorithm is no longer a toy but a serious surrogate biomarker that can be relied upon in geriatrics or personalized medicine. If such an algorithm is equipped with uncertainty estimation, it becomes clinically relevant.

The problem arises from an unexpected place. We lack large, open biobanks annotated with diagnosis dates or, at the very least, dates of death that were preceded by biomarker measurements. Existing datasets are either locked behind layers of licensing restrictions, prohibitively expensive, inaccessible in certain regions, or all of the above. The only large and well-known dataset of this kind that I am aware of is NHANES. (By the way, if you know of other extensive open datasets with biomarkers and death dates, please share them in the comments). In some cases, legislation prevents such biobanks from being publicly available; in others, there is simply no initiative to create them. This brings us to what I call the minor transhumanist dilemma:

Are you willing to sacrifice the privacy of your personal medical data in exchange for enabling large-scale research into aging biomarkers?

Epilogue: “…Research for the sake of research…”

Having completed my dissertation and successfully defended it, I took a deep breath and began reflecting on the journey I had taken. One more thought about aging clocks came to mind, which I would like to share.

Aging clocks are an example of what happens to science in the absence of a theory. As of today, there is no unified and widely accepted theory of biological aging in organisms. There are monumental works that encyclopedically accumulate all the observable hallmarks of aging, but theories are a different story. Of course, there are attempts to formulate such theories, and there are researchers working narrowly within their frameworks, conducting many honest and elegant experiments (see, for example, this article). However, aging clocks in their current form do not utilize the fruits of these theories — they even ignore them. By proposing more and more new biomarkers, scientists unwittingly flood PubMed with noise — a slew of new correlates of chronological age or other associations with what they interpret as aging (which, incidentally, also lacks a universal definition). In short, they just want to research for the sake of research. Not for solving the problem of aging.

In my opinion, this effectively agnostic approach to aging has run its course and discredited itself. We now know enough about aging to begin a coherent search within the framework of the most successful existing aging theories. Aging biomarkers should no longer be divorced from aging theories but must be tied to them, complementing, expanding, or refuting them. In this, I hope to see the continuation of the path — both my own and that of aging science as a whole.

If, like me, you are deeply interested in the question of combating aging through computational methods, feel free to join our Telegram chat, where we discuss such topics, share useful resources, links, and more. And of course, there’s my personal Telegram channel, where everything is concise and to the point.

For independent exploration of the topic, you might find our free online course on computational approaches in aging biology helpful. My colleagues and I developed it, and it also covers the topic of aging clocks, though the chapters are periodically updated as new knowledge emerges.

Thank you!

Dmitrii Kriukov, research scientist in the “AGI in Medicine” lab at AIRI.
