Synthetic Health Data Generation: My first experience with Synthea
Synthea is an open-source synthetic patient generator that models up to 10 years of medical history for the patients of a healthcare system. It creates realistic patient data, including the patients’ health records, in a variety of formats and with varying levels of complexity. Having access to realistic data is crucial for any kind of modeling, and it matters most for a complex, highly connected system. Add “sensitive” to the list of adjectives for a system like healthcare.
Real health care data, or protected health information (PHI), must be just that: protected. At Optum, we store a large pool of PHI in our data lake. We use the data lake to learn about the health care system so we can make it easier and more affordable for everyone. We pay for our health insurance, too, so it’s really a win-win. Though we have access to the data lake, it is best practice to use mock data for demonstration and testing purposes. Not using real PHI maintains the security and anonymity of our members.
The power of Synthea is in simplifying data generation without compromising on quality. Synthea has a straightforward CLI tool that interacts with Gradle tasks. The output can be customized to specific illnesses, locations, and population sizes using command-line arguments. Though I was only interested in patients with cancer, Synthea can generate data for over 90 different illnesses, from dermatitis to PTSD to appendicitis to dementia. There is also a guide for creating your own modules if the specific illness you want to model is not there.
In addition to security, synthetic data helps development progress without data-related blockers. In the past, projects have been delayed because teams lacked access to real data, or to enough of it. Getting data into multiple environments further complicates the problem. Synthea equips teams with a tool to avoid these delays. Since Synthea is Gradle-based, multiple properties files can be used to differentiate between environments. This means data can be generated on the fly, in a variety of locations, with no need to pass around a large data file. After development, Synthea can help with load and stress testing. Since Synthea’s output is highly customizable, different test cases can be generated easily, again on the fly.
My assignment was to populate part of our graph database with mock patients that have different types of cancer. The population size needed to be roughly 200 patients. The consumer of the data is an application that helps nurses manage their patient populations with cancer. My step-by-step process followed the quick start directions in the repository’s README, with the exception of changing a few lines in two files. I wanted the output to be CSV, so in the synthea.properties file I set all of the other exporter variables to false and set exporter.csv.export to true. This generates a csv directory in the output directory, with the different categories of output (patient, provider, condition, etc.) in their respective CSV files.
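The exporter change described above can also be scripted instead of edited by hand. The following is a minimal sketch, not part of Synthea itself: it assumes the exporter switches follow the exporter.<name>.export = true|false pattern, and the keys in the sample fragment are examples only, so check them against your own synthea.properties. The function name csv_only is hypothetical.

```python
import re

# Matches lines such as "exporter.fhir.export = true".
# Assumption: exporter switches follow the exporter.<name>.export pattern.
EXPORT_FLAG = re.compile(r"^(exporter\.([\w.]+)\.export)\s*=\s*(true|false)\s*$")


def csv_only(properties_text: str) -> str:
    """Return the properties text with only the CSV exporter switched on."""
    out = []
    for line in properties_text.splitlines():
        match = EXPORT_FLAG.match(line.strip())
        if match:
            key, name = match.group(1), match.group(2)
            value = "true" if name == "csv" else "false"
            line = f"{key} = {value}"
        out.append(line)  # non-exporter lines pass through unchanged
    return "\n".join(out) + "\n"


# Sample fragment for illustration, not a complete properties file.
sample = """\
# sample fragment
exporter.fhir.export = true
exporter.ccda.export = false
exporter.csv.export = false
exporter.years_of_history = 10
"""

print(csv_only(sample))
```

Working on the text rather than the file keeps the sketch easy to test. To apply it, read the properties file, pass its contents through csv_only, and write the result back.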
Our model required an additional output column, so I also changed a few lines in CSVExporter.java. Instead of manually adding the column after Synthea runs, I was able to add to the existing code for column generation. The change was twofold. First, in the method writeCSVHeaders, I added the header of the column. This method is responsible for writing the headers to all of the CSV files. You will need to write the name of the column you’re adding to its respective FileWriter object. My addition was the “Source” column. The change should resemble:
Then, I appended my column to the string that would be written to the output CSV. For example, the change I made to immunization is below. My additions are on lines 700 and 701.
My command for generating data was:
./run_synthea -p 1000 -m *cancer
The -p argument specifies the population size I wanted, and -m specifies the modules I wanted to restrict generation to. The “*” is a wildcard character. In this case, Synthea will only use modules whose names end in “cancer”; the name can start with anything. Alternatively, you could specify “-m veteran*” for all of the veteran-related modules, or “-m a*” for all modules starting with “a”.
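To make the wildcard behavior concrete, here is a small sketch of glob-style matching using Python’s standard fnmatch module. It only illustrates the pattern semantics described above and is not how Synthea itself resolves the -m argument. The module names in the list are illustrative placeholders, and the helper select is hypothetical.

```python
from fnmatch import fnmatchcase

# Illustrative module names only; the real list comes from Synthea's modules.
modules = [
    "lung_cancer",
    "breast_cancer",
    "colorectal_cancer",
    "veteran_ptsd",
    "asthma",
    "appendicitis",
]


def select(pattern: str, names: list[str]) -> list[str]:
    """Return the names that match a glob-style pattern, case-sensitively."""
    return [name for name in names if fnmatchcase(name, pattern)]


print(select("*cancer", modules))   # names ending in "cancer"
print(select("veteran*", modules))  # names starting with "veteran"
print(select("a*", modules))        # names starting with "a"
```

One practical note: most shells expand an unquoted “*” themselves when a file in the current directory matches the pattern, so quoting the pattern on the command line is a common precaution.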
Note that I used a population size of 1000 to get the ~200 cancer patients I wanted. What is not immediately obvious is that the population argument does not generate 1000 patients with the specified illness; it generates a population of 1000 people. In my experience, specifying the cancer modules with a population size of 1000 returns around 200 patients with cancer. This will differ depending on which modules you specify, as condition generation is based on probabilities.
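The sizing logic above can be written as a quick back-of-the-envelope calculation. This sketch assumes the yield from a trial run (about 200 cancer patients per 1000 generated people, in my case) stays roughly stable as the population grows, which is only an approximation because generation is probabilistic. The function name population_for is hypothetical.

```python
import math


def population_for(target_patients: int, observed_hits: int, observed_population: int) -> int:
    """Estimate the -p value needed to get roughly target_patients matches.

    observed_hits / observed_population is the yield seen in a trial run.
    """
    if observed_hits <= 0 or observed_population <= 0:
        raise ValueError("a trial run with at least one matching patient is required")
    # Scale the trial population by target / hits and round up.
    return math.ceil(target_patients * observed_population / observed_hits)


# Trial run from this post: roughly 200 cancer patients out of 1000 people.
print(population_for(200, 200, 1000))
print(population_for(500, 200, 1000))
```

Treat the result as a starting point and generate a little extra, since the count from any single run will vary.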
Additionally, I was having trouble creating patients that only had cancer. Even when I restricted the modules to anything ending in “cancer”, cardiovascular illnesses persisted. I reached out to the Synthea team by raising an issue on GitHub. The team responded in less than 20 minutes. To me, that’s incredibly prompt. Big kudos to Jason Walonoski for that! Jason told me that it’s not currently possible to disable cardiovascular disease from the command line. It must be disabled manually by commenting out a single line of code. Directions for that can be found in the GitHub issue. Once I commented out the line, it was smooth sailing.
Overall, Synthea is an exciting tool to have. It can easily generate a vast amount of quality data, in multiple locations. This pushes development past blockers related to data availability. Since the output is highly customizable, Synthea can be used for a plethora of populations and projects. Just because it’s health data doesn’t mean it has to be used as such. It also relieves security-related concerns, as the data is not real PHI and does not have to be passed around; it is generated on the fly. This makes it easier to get Synthea approved for integration into your project.