Battling the Noise in Health Data

Aliaksandr Kazlou
Published in Flo Health UK
Jul 8, 2021 · 8 min read

At Flo, we work with real-world health data generated by millions of people who have diverse backgrounds, behaviors, and interests. Unlike data from controlled environments, this data is a byproduct of how users interact with the app, which is shaped by their current focus, interests, habits, and much more. In short, the data is messy.

That messy data raises issues of causality, model transparency, and explainability. We deal with every one of these in all the ML products we build, including menstrual cycle length prediction. In this article, I explain how we deal with these challenges and how they affect the predictions we make.

Data as a curse and as a blessing

Imagine someone who’s trying to conceive. She consistently logs sex and ovulation test results but doesn’t care about the app otherwise. Another person is sexually active but doesn’t want to get pregnant in the near future. She’s also stressed at work, acutely aware of her mental health, and the happy owner of a brand new smart scale. A third woman likes to approach things thoroughly, tracking everything that happens to her. She demands maximum transparency from the apps she uses.

Now let’s imagine how this might look in the data. The first woman has a higher chance of getting pregnant, which means her periods will stop. She also logs sex and ovulation tests very consistently because that’s what’s important to her. The second woman doesn’t use or log ovulation tests, and she doesn’t care about logging sex either, even though she’s sexually active. She’s more concerned about logging instances of anxiety, mood swings, and other mood-related symptoms. She also likes her smart scale and actively tracks her weight. The third woman’s life is very rich with happiness, anxiety, travel, drinking, mood swings, sex, acne, occasional headaches, bloating, and everything else she logs.

These idealized, hypothetical examples illustrate what can go wrong with data. The first woman is more likely to have an abnormally long cycle: a cycle that is actually a pregnancy she simply didn’t report. She also looks like a person with a perfectly stable BMI because the only time she provided information on her weight was during onboarding. She might appear to be the most sexually active among users, although that’s not necessarily true. The second woman, in addition to appearing sexually inactive in the data, may look like she’s having a much harder time in life with so much logged stress and anxiety. Again, that’s not necessarily true: the pressure of trying to conceive can be immense. Finally, her BMI looks unstable compared to people who don’t actively track their weight. The third woman decided to start logging sex. Then the prediction for her next period jumped from 28 to 32 days. She found this confusing and asked support what was going on. If we told her this change was because of many unreported pregnancies and miscarriages in our data, she might argue that that makes no sense. She would be right.

There are many issues to unpack here. The first one is true zeros versus false zeros, or the difference between not experiencing something and not logging it. The second issue is spurious correlations that emerge not from direct causal links but from much more complex, intricate, and often unobserved causal structures. Plus, so far, we have only considered sex, ovulation tests, moods, and weight. There are hundreds more variables that Flo tracks, which complicate the issue even further. An approximation of this mess would look like a dense, tangled web of causes, proxies, and missing values.

Some of these problems are fundamentally unsolvable, given the nature of the data we collect: observational or shared voluntarily. For others, there are tools to make things less messy. I will focus on the latter.

Simple solutions and problems they bring

One relatively straightforward way to battle noise in the data is to weight observations based on our trust in them. We can approach the problem of true zeros versus false zeros by assigning a “trust score.” We base that trust on medical insight as well as on users’ behavior inside the app. Consider three cases: one user didn’t interact with the app at all; another user opened the app, maybe even read a post, or interacted with the Health Assistant a couple of times; and the last user logged symptoms constantly throughout the cycle. Had we been able to observe the truth, we would have seen different error rates between these three groups, with the last one being the most accurate in what they report. At least, that’s the assumption.
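To make this concrete, here is a minimal sketch of how such a trust score could be turned into sample weights during training. Everything in it is an assumption made for illustration: the engagement columns, the caps, and the choice of XGBoost say nothing about how Flo’s actual pipeline works.

```python
import numpy as np
import pandas as pd
from xgboost import XGBRegressor

# Toy data: one row per (user, cycle). Column names and values are
# illustrative placeholders, not Flo's real schema.
cycles = pd.DataFrame({
    "mean_past_cycle_length": [28.0, 30.5, 27.0],
    "age":                    [24, 31, 37],
    "bmi":                    [21.4, 26.0, 23.1],
    "cycle_length":           [29, 31, 27],   # target
    # Engagement signals used only to derive trust, not as predictors.
    "sessions_in_cycle":      [0, 3, 24],
    "symptom_logs":           [0, 2, 31],
})

def trust_score(row, session_cap=10, log_cap=20):
    """Map raw engagement onto a 0..1 trust score (caps are arbitrary)."""
    sessions = min(row["sessions_in_cycle"], session_cap) / session_cap
    logs = min(row["symptom_logs"], log_cap) / log_cap
    return 0.5 * (sessions + logs)

# Down-weight, rather than drop, cycles from users we trust less.
weights = cycles.apply(trust_score, axis=1).clip(lower=0.05)

features = ["mean_past_cycle_length", "age", "bmi"]
model = XGBRegressor(n_estimators=200, max_depth=3)
model.fit(cycles[features], cycles["cycle_length"], sample_weight=weights)
```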

Suppose we reweight the data and assign a higher weight to observations from those who interact with the app regularly. At the extreme end of weighting, you could also drop problematic observations altogether. Although it would be less noisy, that subsample might be skewed toward users who are more dependent on the app: users who are trying to conceive, people who experience more acute symptoms, and so on. For instance, women with serious health conditions are more likely to have irregular cycles. That, in turn, can skew the outcome in many different ways.

Another straightforward approach is to simply ask users to clarify. That has its downsides as well. User attention is a valuable resource, not one to spend lightly. The number of push notifications and other communications a user can tolerate is pretty limited. Competing with other Flo services for that valuable resource usually leaves little room for maneuvering. Plus, users self-select into answering additional questions just as eagerly as they self-select into using the app’s main functionality.

Taking control over the learning process

Having sex doesn’t cause longer menstrual cycles. If you are worried, neither do ovulation tests. Yet, these patterns inevitably occur in the data.

Let’s consider sex. Sex, unlike acne or backache, is something over which people should have direct control. So how does it affect cycle length prediction? There is evidence that sexual activity increases during the most fertile days of the menstrual cycle [1]. Knowing that helps to pinpoint ovulation more accurately, and ovulation marks the beginning of the luteal phase. That phase is more constant in length than the first part of the cycle (the follicular phase) and thus more predictable.

This information comes in handy in feature engineering. Imagine a predictor capturing how sexual activity during the current cycle compares with the user’s historical pattern. Simply put, are the most active days earlier or later than usual? To escape the curse of spurious patterns in the data, one can then apply monotonicity constraints. Doing this tells a model that the predictor can only influence the outcome in one direction: either positively or negatively. You can then construct a model where having sex later than usual always leads to an increase in predicted cycle length, and vice versa. Despite this restriction, a sufficiently complex model is still free to learn sufficiently complex patterns: how much to increase or decrease the prediction is conditioned on the current cycle day, age, BMI, and all other available information. A similar logic can be applied to many other predictors.
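Here is a rough sketch of what that could look like, assuming a hypothetical sex_day_shift feature and XGBoost’s monotone_constraints parameter. The feature name, the synthetic data, and the library choice are all illustrative, not a description of our production model.

```python
import numpy as np
import pandas as pd
from xgboost import XGBRegressor

def sex_day_shift(sex_days_this_cycle, historical_mean_day):
    """How much later (positive) or earlier (negative) than usual the
    user's logged sexual activity falls in the current cycle."""
    if len(sex_days_this_cycle) == 0:
        return 0.0  # no signal at all; see the trust discussion below
    return float(np.median(sex_days_this_cycle)) - historical_mean_day

# Synthetic stand-in for real training data; names and numbers are made up.
rng = np.random.default_rng(0)
n = 2_000
df = pd.DataFrame({
    "sex_day_shift": rng.normal(0.0, 3.0, n),          # output of the helper above
    "mean_past_cycle_length": rng.normal(28.0, 2.0, n),
    "age": rng.integers(18, 45, n),
})
df["cycle_length"] = (
    df["mean_past_cycle_length"] + 0.4 * df["sex_day_shift"] + rng.normal(0.0, 1.5, n)
)

# "+1" on the first feature: the prediction may only increase as the shift
# grows later; the other features stay unconstrained ("0").
model = XGBRegressor(
    n_estimators=300,
    max_depth=4,
    monotone_constraints="(1,0,0)",
)
model.fit(df[["sex_day_shift", "mean_past_cycle_length", "age"]], df["cycle_length"])
```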

[Figure: an illustration of the effect monotonicity constraints have on cycle length predictions]

Monotonicity constraints are relatively easy to enforce with simple models such as Generalized Linear Models. However, they work equally well with decision trees [2], various decision tree ensembles [3], neural nets [4], and other complex models [5].
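For a linear model the constraint boils down to the sign of a coefficient. A minimal sketch with scikit-learn on a toy two-feature problem, assuming nothing beyond standard library behavior: negate any feature that must act negatively, then require all coefficients to be non-negative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))          # col 0 must act positively, col 1 negatively
y = 28 + 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(0, 1, 500)

# Flip the sign of the "must decrease" column so a single non-negativity
# constraint on the coefficients enforces both directions at once.
X_signed = X * np.array([1.0, -1.0])
glm = LinearRegression(positive=True).fit(X_signed, y)

increasing_coef = glm.coef_[0]     # guaranteed >= 0
decreasing_coef = -glm.coef_[1]    # guaranteed <= 0 after undoing the flip
```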

Another useful form of constraint is pairwise trust [6]. As shown above, people differ significantly in their logging patterns. If someone logs sex consistently, the solution above works just fine. Suppose instead that someone has been using Flo for six cycles but has logged sex only once. Here, we are not sure whether all those days without logged sex are true zeros, meaning they didn’t have sex, or false zeros, meaning they simply didn’t log it.

Naturally, it would be great to adjust our trust in the predictor accordingly. Pairwise trust is one way to do this. Aggregates such as the average number of times she logged sex per cycle, or the total count, capture some information about how consistent the logging pattern is. One can then make a model less sensitive to the “sex” variable described above when another variable, say “sex_logging_consistency,” is low. This approach is not immune to the problems discussed above. However, it offers researchers more flexibility in controlling the model.
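The full shape-constrained formulation from [6] (exposed, for example, as trust constraints in TensorFlow Lattice) is beyond the scope of this post, but the intuition can be approximated crudely: shrink the timing signal toward its neutral value whenever a hypothetical sex_logging_consistency score is low. This is only a hand-rolled stand-in for the real constraint, sketched under those assumptions.

```python
import numpy as np

def shrink_by_consistency(sex_day_shift, sex_logging_consistency):
    """Very rough stand-in for a pairwise trust constraint.

    `sex_logging_consistency` in [0, 1] could be, say, the fraction of past
    cycles with any sex logged. When it is low, the timing signal is pulled
    toward 0 (its "no information" value), so the downstream model reacts
    to it less.
    """
    return sex_day_shift * np.clip(sex_logging_consistency, 0.0, 1.0)

# The same +4-day shift matters less for an inconsistent logger.
print(shrink_by_consistency(4.0, 0.9))   # 3.6: consistent logger, keep most of the signal
print(shrink_by_consistency(4.0, 0.1))   # 0.4: sparse logger, barely moves the prediction
```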

There are many other ways to restrict a model: linearity, concavity/convexity, variable-wise regularization, and interaction constraints, among others. When used in combination, they help steer the models we build toward more medically sound, common-sense behavior.
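As a final sketch, here is how monotonicity and interaction constraints could be combined in XGBoost. The feature set and target are synthetic assumptions carried over from the earlier examples; the point is only that such constraints compose.

```python
import numpy as np
from xgboost import XGBRegressor

# Synthetic stand-in data; feature names and order are assumptions.
rng = np.random.default_rng(3)
n = 1_000
X = np.column_stack([
    rng.normal(0, 3, n),        # 0: sex_day_shift
    rng.uniform(0, 1, n),       # 1: sex_logging_consistency
    rng.normal(28, 2, n),       # 2: mean_past_cycle_length
    rng.integers(18, 45, n),    # 3: age
    rng.normal(23, 3, n),       # 4: bmi
])
y = X[:, 2] + 0.4 * X[:, 0] * X[:, 1] + rng.normal(0, 1.5, n)

model = XGBRegressor(
    n_estimators=300,
    max_depth=4,
    # The prediction may only grow as sexual activity shifts later than usual.
    monotone_constraints="(1,0,0,0,0)",
    # The timing signal may interact only with its consistency score;
    # the remaining features form a separate interaction group.
    interaction_constraints=[[0, 1], [2, 3, 4]],
).fit(X, y)
```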

What’s next

Building ML products at Flo is an ongoing process with many challenges and opportunities. We will continue working on improving the cycle prediction model’s transparency and explainability. One of many projects in this direction is building a service that provides users with on-demand explanations of why they receive certain predictions and how these predictions were made. We’ve also recently started working on symptom prediction, with the aim of informing users about inconveniences they might face. This project relies heavily on data obtained from wearable devices. Another project we’re currently focusing on is automated pattern identification. The idea is to provide users with insights into patterns in their symptoms, health conditions, and everyday activities to help them form a better understanding of their bodies and themselves.

We are hiring!

At Flo, we use our cutting-edge technology on top of super reliable platforms to enhance the overall experience of our users, giving them access to resources that enable them to take a proactive role in their well-being and to understand how their body works. If the tools mentioned in the article sound familiar to you or you have better alternatives, come and join us!

References

  1. Wilcox, A. J., Donna Day Baird, David B. Dunson, D. Robert McConnaughey, James S. Kesner, and Clarice R. Weinberg. “On the frequency of intercourse around ovulation: evidence for biological influences.” Human Reproduction 19, no. 7 (2004): 1539–1543.
  2. Potharst, Rob, and Adrianus Johannes Feelders. “Classification trees for problems with monotonicity constraints.” ACM SIGKDD Explorations Newsletter 4, no. 1 (2002): 1–10.
  3. Bartley, Christopher, Wei Liu, and Mark Reynolds. “Enhanced random forest algorithms for partially monotone ordinal classification.” In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, pp. 3224–3231. 2019.
  4. Daniels, Hennie, and Marina Velikova. “Monotone and partially monotone neural networks.” IEEE Transactions on Neural Networks 21, no. 6 (2010): 906–917.
  5. Cano, José-Ramón, Pedro Antonio Gutiérrez, Bartosz Krawczyk, Michał Woźniak, and Salvador García. “Monotonic classification: An overview on algorithms, performance measures and data sets.” Neurocomputing 341 (2019): 168–182.
  6. Cotter, Andrew, Maya Gupta, Heinrich Jiang, Erez Louidor, James Muller, Taman Narayan, Serena Wang, and Tao Zhu. “Shape constraints for set functions.” In International Conference on Machine Learning, pp. 1388–1396. PMLR, 2019.
