Processing Healthcare Data Files with GCP: Easy as 1–2–3!
(Using Serverless Cloud Functions)
Getting data into the cloud is important, but taking the right steps to properly prepare that data is even more critical. Healthcare data today emanates from a wide variety of sources, including exports of patient clinical data, wearables and devices, and pharmacy systems. It can therefore arrive in many different formats, so it’s vital to make sure files are “clean” before loading them into the cloud. Since data rarely arrives without problems such as duplicate records, incorrect formatting, or corruption, we need a quick and easy way to validate files as we transfer them, ensuring high-quality analytics downstream.
Serverless Google Cloud services eliminate the hassle of managing hardware and software, along with the updates and patching associated with each. Let’s look at an approach that cleans files as soon as they land on Google Cloud Storage (GCS), using Cloud Functions to transform the data so that it can be loaded into BigQuery, for example, for analytics and processing.
- File is transferred from an on-prem system into Google Cloud Storage
- An event trigger starts a Google Cloud Function (CF), which reads the file and applies the necessary transformations
- Data is loaded into a BigQuery table
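As a rough sketch, the handler at the heart of this flow just needs to work out where the new file lives and hand it to the transformation step. The function and field names below are illustrative, not the exact Cloud Functions API; the payload of a GCS upload event does, however, include the bucket and object name:

```python
def gcs_uri_from_event(event: dict) -> str:
    """Build the gs:// URI for the file that triggered the function.

    `event` mirrors the shape of a GCS object-finalized payload,
    which carries the bucket and object name.
    """
    return f"gs://{event['bucket']}/{event['name']}"


def handle_upload(event: dict) -> str:
    """Entry-point sketch: locate the file, then clean and load it."""
    uri = gcs_uri_from_event(event)
    # In the real function: read `uri` with pandas, clean the frame,
    # then load the result into BigQuery (Steps 2 and 3 below).
    return uri


# Example payload shaped like a GCS upload event
print(handle_upload({"bucket": "my-bucket", "name": "admissions.csv"}))
# → gs://my-bucket/admissions.csv
```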
Step 1: Data Landing: Google Cloud Storage is a great choice for landing your data. It’s easy to transfer files in, it supports a wide variety of file formats with options to transfer them out as needed, and it’s very cost effective. You can also move the data into other storage systems, like the one used in the workflow sequence above (BigQuery), for further processing. Files can be loaded from on-prem to Cloud Storage using the Storage Transfer Service or the gsutil copy command. You can also set up an incremental sync that runs on a schedule or as a daily batch. Here’s the gsutil command:
gsutil cp [INSERT SOURCE FILE] gs://my-bucket
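For the scheduled incremental sync mentioned above, gsutil’s rsync command is one option; the directory and bucket names here are placeholders:

```shell
# Mirror a local directory into the bucket, copying only new or changed files;
# -m parallelizes the transfer, -r recurses into subdirectories
gsutil -m rsync -r /path/to/local/dir gs://my-bucket/landing
```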
Step 2: Serverless ETL: Cloud Functions lets you do serverless extract, transform, and load using Python scripts or other supported languages. Functions are also stateless, so they are well suited to event-driven work like responding to new files added to GCS and performing simple validations. A function can run Python code and transform data using the pandas library, a powerful tool for loading, cleaning, and transforming tabular data, restructuring it if needed. When you deploy a Cloud Function, you can specify dependencies to install any libraries required for the transformations. An Eventarc trigger can be created so that the function fires on each new file upload to the GCS bucket.
# Here’s an example of a requirements file entry; add your dependencies as appropriate:
pandas==0.20.3

# Read the CSV file directly from the GCS bucket into a data frame
import pandas as pd
df = pd.read_csv('gs://file_location')

# Data Manipulation
# Clean the data by removing rows with null values in the patient_id column
df = df.dropna(axis=0, subset=['patient_id'])

# De-duplicate on patient_id, keeping only the latest admission for each patient
df = df.sort_values('admission_date', ascending=False).drop_duplicates('patient_id').sort_index()

# Convert age from string to integer
df['age'] = df['age'].astype(int)
Step 3: Data Loading: Now that the data is ready, you can load the updated data frame into BigQuery. You can use the pandas-gbq library, a wrapper around the BigQuery client library, to load pandas data frames into BigQuery tables. You can also write the cleaned file back to the GCS bucket as a recovery option.
pip install pandas-gbq -U

import pandas_gbq
pandas_gbq.to_gbq(df, table_id, project_id=project_id)
Load Optimization: Consider using the Parquet file format to speed up serialization of your data frames and optimize the load into BigQuery. Leveraging native libraries such as pandas-gbq makes loading the data much faster. If you’re writing to an existing table, you can pass if_exists="append"; if the table doesn’t exist yet, it will be created automatically.
Workflow Automation: You can automate the steps above by adding an event-driven mechanism. An Eventarc trigger can listen for file uploads to GCS and start the Cloud Function as soon as a file lands.
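As one way to wire this up, a 2nd-gen function can be deployed with a GCS trigger directly from gcloud; the function name, region, and bucket here are placeholders:

```shell
gcloud functions deploy clean-file \
  --gen2 \
  --runtime=python311 \
  --region=us-central1 \
  --entry-point=handle_upload \
  --trigger-bucket=my-bucket
```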
Latency Reduction: If you need the data loaded as soon as possible, configure a minimum number of instances for the function. Keeping instances warm avoids the cold start of setting up a new execution environment and reduces latency.
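For example, an existing function (the name is a placeholder) can be updated to keep one instance warm:

```shell
# Keep at least one instance warm to avoid cold starts
gcloud functions deploy clean-file --gen2 --min-instances=1
```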
Larger Workloads: The approach outlined here is best suited to batch files that are relatively small and require simple transformations. For large volumes of data, consider using Spark DataFrames to leverage parallel processing. For streaming jobs, you can move some of the transformations into a Dataflow pipeline.
Orchestration: Cloud Composer is a fully managed workflow orchestration service built on Apache Airflow. It’s very efficient at orchestrating jobs because each directed acyclic graph (DAG) can handle a different workflow, including transfer, transform, and load, and if a data load fails it can re-run the load job.
SQL Operations: With Cloud Functions and pandas, you can also perform SQL-style transformations. Remember, though, that BigQuery is optimized for SQL operations, so you might want to run those manipulations as DML statements once the data is loaded.
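To illustrate, the same cleanup can be done either in pandas inside the function or as a DML statement once the rows are in BigQuery; the table and column names below are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"patient_id": [1, 2, None], "age": ["34", "51", "47"]})

# In-function pandas version: drop rows with no patient_id
cleaned = df.dropna(subset=["patient_id"])

# The equivalent post-load DML in BigQuery would look something like:
#   DELETE FROM `my_project.my_dataset.admissions` WHERE patient_id IS NULL
print(len(cleaned))
# → 2
```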
Here is the GitHub code to get you started: https://github.com/jayjayakumar/gcs2bq-data-load
Wrap-up: Use this method and these tips for quick and easy transformations, uploading data into Google BigQuery as soon as it lands. I welcome your feedback; let me know if you have any questions.