Automating Data Science Workflows in the Cloud

Dan Gabriel
Building Panorama Education
5 min read · Jul 13, 2023

Across all the states we serve, Panorama Education receives a large volume of survey data. This data helps us understand national trends, establish benchmarks, and quantify the validity of the surveys we partner with our clients to administer. This work is important because it shows the quantifiable, demonstrable value that our surveys provide. Schools and districts can trust Panorama's surveys, not just because of our rigorous testing to ensure valid and reliable results, but also because of our commitment to student data privacy.

Over the years, we have developed a process to collate and normalize the survey results we receive so that we can analyze and better understand the efficacy of our surveys and how they are being used. Some of our clients need to adjust the specific wording of survey questions to suit their needs, so we use a thorough research process to determine which question variants are worded similarly enough to be treated as equivalent. Other clients use many different demographic categorizations in their student databases, so we follow industry-recognized standards to classify demographics consistently across schools.
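To make that idea concrete, here is a minimal sketch of the kind of label-to-standard mapping this normalization implies. The raw labels, target categories, and function name below are purely illustrative assumptions, not Panorama's actual classification scheme or research process.

```python
# Hypothetical example of standardizing client-specific labels onto a common
# set of categories. The labels and categories here are illustrative only.
GRADE_LEVEL_MAP = {
    "9": "Grade 9",
    "9th": "Grade 9",
    "ninth": "Grade 9",
    "gr 9": "Grade 9",
    "freshman": "Grade 9",
}

def normalize_grade_level(raw_value: str) -> str:
    """Map a client-provided grade label to a standardized category."""
    return GRADE_LEVEL_MAP.get(raw_value.strip().lower(), "Unknown")
```

The real mappings are far larger and are maintained through the research process described above, but the shape of the problem is the same: many client-specific spellings, one standardized category.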

By normalizing survey responses, we are able to generate reports like the Reliability and Validity of The Panorama Well-Being Survey and Reliability and Validity of The Panorama Equity and Inclusion Survey. This data is also used to determine national benchmarks for our surveys and to discover trends across the country.

In this post, I'd like to briefly describe how we transformed the generation of this normalized data from a manual process that could take days into an automated process that takes just a few hours.

The Problem

As our number of clients grew, our data scaled massively! The processes that had previously run on a researcher's workstation grew slower and slower as the volume and complexity of the data increased. The time needed to generate the survey response analytics became prohibitively expensive, requiring multiple days of close attention and manual steps from our researchers.

We first created our manual process to normalize our response analytics data years ago, and by 2021 we were already pushing the limits of what a researcher's workstation could handle. The volume of data has only grown since. At the time of writing, we're normalizing over 800 gigabytes of response data!

While the original implementation we built had served the immediate needs of our clients and product teams, it became apparent that we would need to scale it differently. We needed a way to process the data in a repeatable fashion, and one that would not block our researchers while it was being generated.

[Photo: Some of Panorama's data scientists and researchers at a recent in-person event]

The Solution

After considering a number of options, we decided to implement our response analytics workflow in the cloud on AWS (Amazon Web Services). This would give us the opportunity to build the workflow to scale horizontally, allowing it to run much faster than it could on a single workstation. By codifying and automating the process, we would also be able to run our analyses on a regular cadence, or on demand if needed. And by running these workflows in the cloud, our researchers' workstations would remain available for day-to-day work even while the response analytics workflow was running.

Because many portions of this workflow were already built in Python, we decided to build an application using Apache Spark's PySpark interface. Before we began implementation, we carefully collected and standardized the steps and queries needed to create the resulting analytics view of the data. Each discrete step in the response analytics workflow was then converted to an AWS Glue job that ran its associated PySpark script. The scripts ranged from thin wrappers around queries used to generate intermediate results to more complex applications of business logic at runtime. To orchestrate these jobs, we used AWS Glue workflows. Using these tools, we were able to compose an equivalent response analytics pipeline with three key additions: it was automated, it was repeatable, and it scaled horizontally for the parts of the workflow that needed it!
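As a rough illustration, a single step in such a pipeline might look something like the Glue job skeleton below. The catalog database, table name, column names, and S3 path are hypothetical placeholders, and the transformation shown is deliberately trivial; our real jobs encode the standardization queries and business logic described above.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job setup: resolve job arguments and create the Glue/Spark contexts.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Hypothetical catalog names standing in for a de-identified source table
# registered in the Glue Data Catalog.
responses = glue_context.create_dynamic_frame.from_catalog(
    database="survey_source", table_name="deidentified_responses"
).toDF()

# A deliberately simple "normalization" step as a stand-in for the real
# standardization queries and business logic.
normalized = (
    responses
    .withColumnRenamed("resp_value", "response_value")
    .dropDuplicates(["response_id"])
)

# Write an intermediate result for downstream jobs in the workflow to pick up.
normalized.write.mode("overwrite").parquet("s3://example-bucket/normalized_responses/")

job.commit()
```

Each step in the workflow follows roughly this pattern: read the output of the previous step, apply its piece of the transformation, and write a result for the next job, with AWS Glue workflows handling the ordering and triggering between jobs.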

While our previous solution ran entirely on a single researcher's workstation, the diagram below shows a simplified view of the new workflow. Starting from de-identified data in our source database, we apply our standardization and normalization processes and create the resulting response analytics data view. That data is then available to our securely authenticated researchers from a Jupyter Notebook or whatever other tools they need. The process to generate this data view runs on a regular basis (and on demand if needed), and the results are available to all of our researchers for analysis.
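As one example of what that notebook access might look like, the snippet below queries a response analytics view through Athena using the AWS SDK for pandas (awswrangler). The database, view, and column names are hypothetical, and the post doesn't specify which query tooling our researchers use; this is just one common pattern for reading cloud-resident data into a notebook.

```python
import awswrangler as wr

# Hypothetical database and view names for the response analytics data.
# awswrangler runs the query through Athena and returns a pandas DataFrame,
# using the AWS credentials of the authenticated notebook session.
df = wr.athena.read_sql_query(
    sql="""
        SELECT question_id, grade_level, AVG(response_value) AS avg_response
        FROM response_analytics_view
        GROUP BY question_id, grade_level
    """,
    database="survey_analytics",
)

print(df.head())
```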

Conclusion

After building out our automated response analytics pipeline, we have reduced the time to generate the analytics data from several days to just a few hours, all without tying up a researcher and their workstation for days at a time. The workflow is repeatable, which gives us access to up-to-date analytics data on demand instead of having to manually rebuild it each time. Moreover, now that we have automated and orchestrated our response analytics pipeline, we can expand on the current framework to build even more features on top of it in the future.

We’ve learned a lot in the process of automating our response analytics pipeline in the cloud, and we’re looking forward to expanding our capabilities as we continue to develop more innovative features!
