Overcoming Data Science Challenges in Biosensor Analytics

Bálint Kovács
HCLTech-Starschema Blog
Jul 19, 2023

The integration of technology and healthcare continues to revolutionize the way we monitor and promote human well-being. The recent pandemic provided both a rare and alarming glimpse at the limits of our healthcare systems and a boost to solutions aimed at addressing these limits, often by leveraging AI- and ML-driven solutions to automate processes and unlock predictive capabilities.

Our team of data scientists recently completed a project whose key aspects are indicative of the way in which AI-powered healthcare — specifically, the area of biosensor analytics — is evolving, and which brought us face-to-face with a set of challenges typical of the field. Here’s how we solved them — and what the project tells us about the place of data science in the future of healthcare.


Sensors are the New Black

Wearable devices — essentially biosensors that can be attached to the human body without significantly restricting its mobility — have emerged as powerful tools in the realm of healthcare analytics. These devices are equipped with various sensors capable of capturing, often in real time, an array of measurements such as temperature, color, heart rate and blood oxygen levels. More advanced sensors can capture even more intricate data, including bioimpedance measurements for body composition analysis or pH monitoring. This technology offers an unprecedented level of insight into our physiological and biological processes and carries the promise of elevating the efficiency and impact of the work that still necessitates attention and intervention by human staff.

However, the amount and nature of data generated by wearables presents unique challenges for data scientists and researchers. With wearables becoming increasingly sophisticated, the potential for data-driven healthcare insights is immense, but the systems built around wearable hardware often require novel solutions to effectively leverage and analyze the wealth of data they generate.

Show Me Where It’ll Hurt

The project we look at in this article was motivated by the need to better utilize biosensor data for diagnostic purposes. Without going into too much detail, the aim was to predict the formation of a postoperative complication. Using currently available tools and practices, it takes roughly a week to detect and begin to treat this pathological change.

Depending on how much the detection time for the pathological change could be improved, healthcare providers could save significant resources — both human and material — currently dedicated to patient monitoring and treatment. The result would be better health outcomes due to more timely treatment as well as a generally more optimal allocation of expert attention. Moreover, sensor-based solutions are expected to be cheaper overall compared to traditional imaging technologies and other alternative diagnostic methods.

By the project’s conclusion, our team was able to reduce the detection time for the condition from the current standard of one week to just 2–3 days. In addition to demonstrating that an AI-driven approach could provide suitably accurate results in a relatively short time, the project validated the sensor-based approach as commercially viable in terms of hardware cost. What this took from a data science perspective was making the most of very scarce data amid a special set of circumstances.

Necessary and Unavoidable Limitations

At the project’s outset, there were no definitive physiological markers or features available that we could’ve relied on as thresholds to clearly identify the change we were after. This necessitated generating actual cases of the change and creating conditions appropriate for enterprise-grade research within the limitations of a pilot-scale project.

This early phase of development involved a limited number of test specimens, each with multiple sensors attached along an area that displayed an instance of the postoperative change to be detected in some cases, and no change in others. Each sensor acted as an independent measurement point for multiple biological features. The resulting system works on the premise that anomalies in sensor values can be detected before the pathological change becomes noticeable to a doctor.

The combination of these practical necessities and the project’s relatively limited scope left us with very scarce training data — which was often difficult to work with, as values went missing for a variety of common practical reasons that simply become more of a factor when dealing with a smaller sample. On top of this, the first phase of the project was also “unsupervised,” meaning our team received only raw data without labels — adding the further challenge of having to figure out which data was coming from a sensor in contact with the pathological change. This was to simulate an eventual human trial scenario where the presence of the pathological change wouldn’t even be guaranteed.
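To illustrate what such an unsupervised phase can look like in practice, here is a minimal sketch of flagging atypical sensor observations without any labels. The data is synthetic stand-in data and the choice of scikit-learn's IsolationForest is our own illustrative assumption, not the project's actual method:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(7)

# Hypothetical unlabeled readings: each row is one (sensor, day)
# observation of several biological features; the last few rows
# simulate a sensor sitting on a pathological change.
normal = rng.normal(loc=0.0, scale=1.0, size=(95, 4))
anomalous = rng.normal(loc=4.0, scale=1.0, size=(5, 4))
X = np.vstack([normal, anomalous])

# Flag the most atypical observations. In a real project, such flags
# would be reviewed with a domain expert rather than trusted blindly.
detector = IsolationForest(contamination=0.05, random_state=0)
labels = detector.fit_predict(X)  # -1 = flagged as anomalous, 1 = normal
flagged = np.where(labels == -1)[0]
```

The point of the sketch is the workflow, not the specific detector: any method that scores observations by how unusual they are can serve as a first pass at locating the sensors worth a closer look.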

Making a Little Go a Long Way

Our team set out to solve the dual challenge of scarce, often incomplete data and the lack of labels by using low-complexity models suited to small datasets and by first filtering out features that wouldn’t actually be relevant when seeking to identify the pathological change at an early stage. The purpose of this phase was to select the sensors that contained the most useful information for predicting a complication during healing. Due to the small amount of data at our disposal, it was critical to drop irrelevant features, as they could have easily led to model overfitting. An overfitted model would have perfectly learned to separate the pathological changes in our limited testing framework, but it would have been left with subpar generalization capabilities for specimens different from our test cases — e.g. during eventual clinical trials.
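A simple way to realize this kind of filtering is to score each candidate feature on its own with a deliberately low-complexity model and keep only the ones that beat a chance-level baseline. The sketch below uses synthetic stand-in data and scikit-learn's logistic regression; the data, threshold and model choice are illustrative assumptions, not the project's actual pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in data: 40 specimens, 6 candidate features.
# Only features 0 and 1 carry signal about the outcome; the rest are noise.
n = 40
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, 6))
X[:, 0] += 2.5 * y  # informative feature
X[:, 1] += 2.0 * y  # informative feature

# Score each feature in isolation with a simple model: with this little
# data, anything more complex risks overfitting the test framework.
scores = []
for j in range(X.shape[1]):
    score = cross_val_score(LogisticRegression(), X[:, [j]], y, cv=5).mean()
    scores.append(score)

# Keep only features whose cross-validated accuracy clearly beats chance.
selected = [j for j, s in enumerate(scores) if s > 0.6]
```

Cross-validation matters here: with 40 samples, a single train/test split would make the accuracy estimates too noisy to rank features by.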

Replacing the occasional missing values in the sensor data — e.g. temperature — was necessary to improve the robustness of the eventual solution. After auditioning a series of potential fixes — including time-series-based averaging of data from subsequent days and averaging the corresponding values from the sensors neighboring the one with the missing data — we found that time-series interpolation almost always produced the most usable replacement data. Throughout this process, we greatly benefited from access to a biologist, who provided a sanity check on our estimations for missing values, verified the relevant methodologies and helped identify the most salient features for the predictive model, especially in the unsupervised phase of the project. This expert input went a long way toward optimizing the eventual biosensor-based analytics system.

In addition to zeroing in on the sensor modalities (temperature, bioimpedance, etc.) that actually measure relevant information with regard to the target variable, the project also aimed to optimize the eventual biosensor analytics system by identifying the sensors that would actually need to make it into the finalized hardware. This process went hand in hand with the model-building: we created models for each sensor and evaluated how well they worked on their own, which helped us arrive at a more mature model that leveraged a shortlist of the better-performing sensors.
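The per-sensor evaluation step can be sketched as training one simple model per sensor and ranking the sensors by cross-validated score. Everything below is synthetic and hypothetical: the sensor names, the data and the use of logistic regression are our own illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)

# Hypothetical setup: 40 specimens, 3 features per sensor; only some
# sensors carry real signal about the outcome.
n = 40
y = rng.integers(0, 2, size=n)
sensors = {}
for name, signal in [("sensor_1", 2.5), ("sensor_2", 0.0), ("sensor_3", 2.0)]:
    X = rng.normal(size=(n, 3))
    X[:, 0] += signal * y
    sensors[name] = X

# One simple model per sensor; rank sensors by cross-validated accuracy.
ranking = sorted(
    ((cross_val_score(LogisticRegression(), X, y, cv=5).mean(), name)
     for name, X in sensors.items()),
    reverse=True,
)

# Sensors that clearly beat chance form the hardware shortlist.
shortlist = [name for score, name in ranking if score > 0.6]
```

Ranking sensors this way directly serves the hardware question: a sensor whose standalone model performs at chance level is a candidate for removal from the finalized device.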

The Path Forward

Developing machine learning solutions for detecting rare pathological changes has unique challenges: sourcing training data is about as costly as it realistically gets, and monitoring often requires rarely used sensor modalities. And it’s already apparent that more testing will be necessary to allow the creation of more complex models — upscaled data collection architectures leveraging hourly data loads, rather than daily, would already represent a major step up in methodology.

However, despite these challenges, the above sections have hopefully made apparent that the next step in the evolution of AI- and sensor-based healthcare analytics systems is well within reach. The data science solutions that these systems necessitate might be somewhat unusual, and close cooperation with domain experts will be critical, but our current knowledge and toolset appear adequate to effectively tackle them.
And if you’d like to learn more about the data science work that goes into forward-thinking healthcare solutions, reach out — we’d love to talk.

About the Authors

Bálint Kovács is a data scientist at Starschema with a background in software development. He has worked on a diverse range of projects in a variety of roles, including as a research fellow and assistant lecturer at a leading Hungarian university, a deep learning developer at a big multinational company and, currently, as a consultant data scientist. He enjoys diving deep into user data to uncover hidden insights and leverage them to create effective prototypes. Connect with Bálint on LinkedIn.

Zsombor Haász is a data scientist at Starschema with a master’s degree in financial mathematics. He started his career in the banking industry as a credit risk specialist before switching to data science consultation. In his work, he uses his analytical and modeling skills to extract non-obvious information from raw data and utilize the acquired knowledge in a novel way. Connect with Zsombor on LinkedIn.
