Feng Zhang
Aetion Technology
Jul 16, 2019


Building a Rule-based Validation Framework (RVF) for Real-World Healthcare Data

Written by Hariesh Rajasekar, Data Scientist and Feng Zhang, Software Engineer

Hariesh: A few years ago, my first assignment at Aetion as a Data Scientist was to download a collection of de-identified electronic health records, which were delivered as compressed and encrypted SAS data files from an FTP server. I was then told to use a language of my choice to convert them into CSV format and upload them to an S3 bucket. It was a good introductory assignment and I was able to finish by the end of the day.

The next day, one of my Science colleagues slacked me a thank you note, and asked: “how can I test if the CSV files are good to use?” My first thought was “What do you mean ‘good to use’? If you open the file, inside is the data in CSV format. What more do you need to know?”

It soon became clear to me that we could verify neither the consistency of the source data nor the accuracy of my code's output simply by scanning a few lines of the CSV. There were innumerable corner cases relating to things like varying date formats, encoding inconsistencies, relational integrity, and the representation of missing values that my code might not handle properly and that would therefore produce problems in the output.

Did I mention that these files were big? How big? In nearly all cases over 10 million rows, with some exceeding 100 million. And wide: over 100 columns in some cases. I then learned that after conversion to CSV, the data would undergo another transformation into Aetion's proprietary binary format, which is optimized for fast access by the Aetion Evidence Platform (AEP). So any quirks or inconsistencies left behind or created by my code would likely get passed down the line and possibly multiplied by any deficiencies in this second transformation process.

Finally, I learned that this was a process that we would be repeating with new and updated data sets from multiple disparate sources, with new data arriving almost every week. I realized we had a problem — both we and our customers needed to rely on accuracy, consistency, and transparency of all data transformation steps and the output they produced. This needed to be provided at a grand scale even though each new data set carried its own unique schema and encoding variations. This is where I began working with Feng and others to figure out what we were going to do.

Feng: I have been with Aetion almost since its founding and have worked on the design and development of many different parts of the AEP. We were evolving from a start-up into an operating company with multiple customers, and thus beginning to shift from just making things work to making them work at scale, with consistency and reliability. When Hariesh raised his concerns about scalability, my responsibility shifted to focus entirely on our Data Ingestion Pipeline (DIP). From then on, Hariesh, I, and others from our respective teams became laser-focused on building scalable and reliable data ingestion into Aetion's platform.

When we first started working together a few years ago, with the help of many others, we asked: "how can our end users trust our data ingestion and integration process?" We clearly did not have a good answer at that point. We spent a lot of time at the drawing board thinking through the problem to define a clear scope for the challenge. While developing our approach, we employed several methods to validate our direction, conducting user and customer research from the beginning. After many discussions, meetings, and brainstorming sessions around possible techniques, we settled on implementing a data validation process that would be as well documented, consistent, and transparent as possible. At the outset, our data validation strategy was viewed as a two-level process to check for:

  1. The integrity of the data (i.e. consistency with the expected IT schema requirements)
  2. Logical and statistical consistency of the data — before, and after data transformation (including schema merge and data enrichment)

We defined our validation framework as a decision procedure based on a set of conditional checks, each evaluated for a specific purpose. These checks include testing for acceptable values in a given field, profiling how many records are missing data for each field, measuring data density, and verifying relational integrity. If the data satisfy all conditions, no behavioral rules are violated and the data are considered valid for their intended use.
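To make the idea of a conditional check concrete, here is a minimal plain-Python sketch. The field names and records are hypothetical, and the real RVF evaluates equivalent rules at scale with PySpark and Spark SQL rather than in-memory Python; this only illustrates the shape of a rule.

```python
# Sketch of two rule-based checks (hypothetical field names and data;
# the production RVF runs equivalent rules on Spark).

def acceptable_values(records, field, allowed):
    """Rule: every non-missing value of `field` is in the allowed set."""
    return all(r[field] in allowed for r in records if r.get(field) is not None)

def missing_ratio(records, field):
    """Profile: fraction of records with no value for `field`."""
    missing = sum(1 for r in records if r.get(field) in (None, ""))
    return missing / len(records)

claims = [
    {"patient_id": "p1", "sex": "F", "dx": "E11.9"},
    {"patient_id": "p2", "sex": "M", "dx": None},
    {"patient_id": "p3", "sex": "U", "dx": "I10"},
]

print(acceptable_values(claims, "sex", {"F", "M", "U"}))  # True
print(round(missing_ratio(claims, "dx"), 2))              # 0.33
```

A rule either passes (the condition holds for the whole dataset) or contributes a profiled statistic, such as the missingness ratio, to the validation report.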

This conceptual definition seemed self-evidently correct, but could it be implemented in a way that would be agile and scalable? Understanding and validating data may seem trivial for tiny amounts of data that can be inspected manually. In practice, however, the data were far too large for manual inspection and were arriving with ever-increasing frequency, so it became necessary to automate and scale the task of data validation. That led us to develop Aetion's Rule-based Validation Framework (RVF).

Our Design Approach for RVF

The Aetion RVF was designed as an autonomous framework built in PySpark and Spark SQL to bring flexibility and scalability to Aetion's data validation process. It allows for a more agile way of performing checks on data transformation workflows by improving the turnaround time and accuracy of the validation process.

Our spectrum of clients ranges from big pharma to healthcare payers, all of whom require research-quality data for analyses. But how good is good enough? That is hard to answer, because "good" is a relative, contextual concept: the level of data quality required depends on the purpose of the study concerned. Explicitly defining a standard against which metrics can be developed and measured makes it less relative.

We thought of RVF conceptually as an automated, Turing Test-like methodology that evaluates data quality from the perspective of a Scientist. Data produced by systems created by Data Engineers would be evaluated against a set of rules provided by Scientists. Reports generated via RVF would give complete information on the tests that were applied and the resulting compliance with Science's expectations. Such reports not only enable transparency but also provide a useful profile for investigators and scientists using the dataset downstream in an analytical study. Success would be determined by RVF's ability to replicate the results one would expect from diligent manual inspection by a Scientist.

Investigators in particular are looking for data quality metrics relevant to their analyses rather than a set of measures that defines the overall quality of a dataset. For example, a Scientist can specify that the ratio of patients with a missing primary diagnosis code in outpatient events is an important quality metric, based on their understanding that this metric influences risk scores or confidence intervals. However, this creates a bottleneck, since such work can't begin until the data ingestion processes are complete: trade-offs must be defined between efficiency and usefulness to justify the engineering effort. Our approach is to provide generic data quality metrics for each dataset deployed to a client's instance and also to enable end users to measure their own data quality metrics on the platform. The latter is made possible by connecting not only the derived (transformed) data but also the raw data to the platform in Aetion's data-agnostic model.

Requirements for Reproducibility and Transparency

As mentioned earlier, because our platform supports regulatory-grade analysis, there are non-negotiable requirements for reproducibility and transparency at every step along the pipeline of transformations, filters, and analyses.

To enable end users to reproduce our data ingestion and integration process for connecting data to the Aetion Evidence Platform (AEP), we believe it should, at a minimum, satisfy the following four conditions:

  • All programs, scripts, and steps are well documented, so users can follow and understand them on their own.
  • End users can redo the whole data connection with "one click" using the scripts provided.
  • Reproducible outputs are reported in a format that is both readable and comparable for end users.
  • The data connection and its results are reproducible within a reasonable time frame.

Figuring out the right approach from many options…

Feng: To satisfy all of our requirements, it was important to understand the full spectrum of people who would take part in the validation process: their domain knowledge, tech stack, preferred coding languages, and the other factors defining their roles, abilities, and know-how. Healthcare scientists, software engineers, and data scientists come from a multitude of backgrounds, so it was difficult to determine which option would best satisfy the four requirements listed above while also being sufficiently easy to use and comprehend.

We attempted proof-of-concept work with several technology implementations, with an eye toward scalability (could it handle our huge data sets?) and extensibility and maintainability (how efficiently and accurately could we implement new or modified rules from Scientists?). Some of these options included:

  • multithreading in Java running on AWS instances
  • enterprise-licensed SAS software installed on dedicated Windows systems accessed from RDP clients
  • Python and R scripts reading data from a mounted S3 bucket
  • manually filled validation spreadsheets with numbers copied from random ad-hoc queries
  • a Hive data store connected via JDBC or ODBC

Each of the options above satisfied some of the requirements, but none met all of the required criteria. One notable example was our early alignment that SAS would be the right technology platform for the RVF. A key reason we were so attracted to SAS is that it is one of the most common languages used by Epidemiologists, and we initially thought it could serve as a universal language to process, transform, and validate our data. However, integrating SAS as a fit-for-all tool proved challenging. It was not easy to set SAS up on non-Windows environments, scale it to distributed environments, or support multi-level and heterogeneous input data formats in the way we wanted. We also tried using two separate technologies (SAS and Java) for data connection and validation, but that approach added more complexity to configuration management and scaling. Incorporating new data sets from different sources, with different schemas and encoding standards, required too much Engineering time, and run-time issues relating to setup and configuration management were too numerous to count.

At that point in time, we were fortunate to benefit from convergence in the big data unified analytics space. Apache Spark had achieved a number of notable successes in high-profile use cases, and as a result a very effective ecosystem of components, managed services, and talent-pool expertise was emerging. We were able to quickly validate several POCs using Apache Spark and Apache Airflow. I felt this would provide the right foundation to:

  • execute analytic and transformation loads on scalable infrastructure (Spark),
  • express rules in a language (SQL) that was sufficiently familiar to Data Scientists, and
  • let Data Engineers with advanced programming skills use Python, R, Scala, or Java to implement more advanced functions and reusable extensions.
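A validation rule expressed in SQL might look like the sketch below. The table, column names, and values are hypothetical, and an in-memory SQLite database stands in purely for illustration; in production, a query like this would run on Spark SQL over the full dataset.

```python
import sqlite3

# Illustrative only: SQLite stands in for Spark SQL, and the
# `outpatient` table and its columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE outpatient (patient_id TEXT, primary_dx TEXT)")
conn.executemany(
    "INSERT INTO outpatient VALUES (?, ?)",
    [("p1", "E11.9"), ("p2", None), ("p3", "I10"), ("p4", None)],
)

# Rule: ratio of outpatient events missing a primary diagnosis code.
rule = """
SELECT 1.0 * SUM(CASE WHEN primary_dx IS NULL THEN 1 ELSE 0 END) / COUNT(*)
FROM outpatient
"""
dx_missing_ratio = conn.execute(rule).fetchone()[0]
print(dx_missing_ratio)  # 0.5
```

The appeal of this layering is that a Scientist only needs to read or write the SQL string, while engineers own the surrounding execution and scaling machinery.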

The Missing Link: Databricks

At Aetion, our two biggest departments are Technology and Science. We are tightly knit and collaborate closely on platform-related work, including the data ingestion and integration processes. For the RVF platform to achieve maintainability, we needed to empower the Science team to perform semi-technical operations when defining new validation rules. At the time, the Science team was highly aligned with SAS, and the idea of shifting from SAS to SQL received some pushback.

Feng: What we lacked was an abstraction layer for Scientists with varying backgrounds and coding skills. In some cases, Scientists need to understand what transformations have been done to the data and what validation checks have been applied, without having to write any code. So we decided to create another layer on top of Spark SQL to provide the necessary transparency and reproducibility. We were able to build this additional layer quickly and effectively by adding Databricks to our technology mix.

Ready — Get Set — RVF!

As shown in Figure 1, RVF automates the comparison between data from Aetion’s longitudinal format and transformed raw data. On both sides, the data are loaded from the same data catalog source.

Figure 1: Rule-based Validation Framework (RVF) — Architecture

Rules applied are designed to explicitly answer questions that may be of interest to investigators. These are generally classified into two different types of checks: descriptive checks and comparative checks.

Descriptive checks: Checks implemented to test and validate the distribution and variability of the data. These include:

  • The total number of events and number of patients with an event for each event type; the number of events and number of patients with an event for each attribute
  • The number of patients with an event for each attribute having a specific value
  • The distribution for all numeric attributes (mean, stddev, min, max)
  • Time trends for all events over the span of data
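The descriptive checks above reduce to simple aggregates per event type and per attribute. The following plain-Python sketch computes a few of them over hypothetical event records; RVF computes the same kinds of aggregates with Spark over the full dataset.

```python
import statistics
from collections import Counter

# Hypothetical event records; in RVF these aggregates run on Spark.
events = [
    {"patient_id": "p1", "type": "rx_fill", "cost": 12.5},
    {"patient_id": "p1", "type": "office_visit", "cost": 80.0},
    {"patient_id": "p2", "type": "rx_fill", "cost": 30.0},
    {"patient_id": "p3", "type": "rx_fill", "cost": 17.5},
]

# Total events and distinct patients per event type.
events_per_type = Counter(e["type"] for e in events)
patients_per_type = {
    t: len({e["patient_id"] for e in events if e["type"] == t})
    for t in events_per_type
}

# Distribution of a numeric attribute: mean, stddev, min, max.
costs = [e["cost"] for e in events]
dist = {
    "mean": statistics.mean(costs),
    "stddev": statistics.stdev(costs),
    "min": min(costs),
    "max": max(costs),
}

print(events_per_type["rx_fill"], patients_per_type["rx_fill"])  # 3 3
print(dist["min"], dist["max"])  # 12.5 80.0
```

Time trends follow the same pattern, with an additional group-by on a date bucket such as month or year.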

Comparative checks: Compare aggregated statistics from the Data Engineer's pipeline against the Scientist's reference results to validate the data ingestion and integration process. The RVF report provides a detailed overview of the checks implemented and how the data performed on each of them.
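A comparative check can be sketched as a tolerance-based comparison of two independently computed aggregate sets. The function name, metric names, and numbers below are all illustrative, not part of RVF's actual interface.

```python
# Sketch of a comparative check: aggregates computed by the Data
# Engineer's pipeline vs. the Scientist's reference queries.
# All names and numbers are hypothetical.

def compare_aggregates(engineer, scientist, rel_tol=1e-6):
    """Return a per-metric pass/fail report for the shared metrics."""
    report = {}
    for metric in engineer.keys() & scientist.keys():
        a, b = engineer[metric], scientist[metric]
        ok = abs(a - b) <= rel_tol * max(abs(a), abs(b), 1)
        report[metric] = {"engineer": a, "scientist": b, "pass": ok}
    return report

eng = {"n_patients": 104_220, "n_events": 9_876_543}
sci = {"n_patients": 104_220, "n_events": 9_876_000}

report = compare_aggregates(eng, sci)
print(report["n_patients"]["pass"])  # True
print(report["n_events"]["pass"])    # False
```

A failing metric flags a discrepancy between the two pipelines for investigation; the full report is what gives end users transparency into each check and its outcome.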

Conclusion

Our search for a solution to the questions of transparency and reproducibility in our data ingestion process was a success. At Aetion, RVF is now a high-functioning piece of technology that has become an integral part of our infrastructure. Over the last few months, it has become a core system fueling many of our most critical workflows.

We learned many things from this journey. One of our most important lessons is that the definition of data quality is neither static nor objective. Context and subjective requirements must be applied on a case-by-case basis. There is no single silver bullet to measure the quality of data.

Our key themes for building an RVF were to enable flexibility, modularity, openness, transparency, and reproducibility in the data ingestion and validation process at Aetion. It has been a fascinating, educational and rewarding journey to see how this project has taken shape and how the promise of RVF has actualized into significant gains in our teams’ productivity and effectiveness.
