Migrating Large Scale Data Pipelines

Socure
The Socure Technology Blog

--

Migrating data pipelines is something that nearly every data engineer will encounter as their company matures and new technologies become available. Our team has been working on the migration of one such pipeline, Feedback, for the past several months. It has been a great experience so far, with plenty of opportunities to learn and improve along the way.

The goal of the Feedback pipeline is to ingest data received from our clients. We use an ELTL process to maintain the richness of raw data received while providing some standardization and conformity to a common schema for downstream use by data scientists and products. This article will focus less on the ELTL logic and more on the overall architectural changes made during the migration.

Specifically, we will cover the following topics:

  • Motivation
  • Architecture & Design
  • Migration Plan and Execution
  • Future Items
  • Lessons Learned

Let’s dive deeper into each:

Motivation

We had been running our data pipelines on a stack including Airflow, RDS, Redshift, Workbench (an internal tool), and some custom tooling. Much of this infrastructure was set up manually several years ago, lacked proper documentation, and did not follow best practices in terms of separation of concerns between infrastructure and application code. While the Feedback pipeline still worked, it had accumulated tech debt over the years, which made it difficult to add incremental features and upgrade individual components of the architecture. It also imposed a significant operational burden rather than taking an automation-first approach. It was time for a complete overhaul to a more scalable, fault-tolerant, and secure solution.

The goal of the migration was to move to a centralized, more robust Data Platform ecosystem built around the following core capabilities and tools:

  • Process massive data (Spark)
  • Tooling for proper governance (Immuta)
  • Modern data warehouse (Snowflake)
  • Ownership and control via IaC (Terraform)
  • Observability (Datadog)
  • Access controls, encryption/decryption, data lifecycle management, etc.

Moving to the Data Platform was a company-wide goal.

Architecture & Design

Here is a high-level design of the legacy system, which requires manual ingestion before automation takes over:

We receive a file from a client via SFTP, which copies it to S3. Workbench (an internal tool that listens for files, sends alerts, triggers Airflow, and provides a UI on top of S3 and Airflow) then triggers Airflow, which runs Python processes and writes to RDS and Redshift.

This is the new architecture that we recently migrated to, fully automated from day one.

At a high level, a Lambda listens to both the new S3 bucket and the legacy S3 bucket and triggers Airflow in the Data Platform via SQS (each call can carry its own configuration). We use PySpark to process the data, which syncs to S3 and Snowflake. The Data Platform in the above design is not a complete depiction; it provides a lot of other functionality, as mentioned earlier, abstracted behind an API.

The legacy S3 source is a temporary solution until we migrate all clients to the new SFTP, discussed more later.
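
To make the trigger flow concrete, below is a minimal sketch of what such a Lambda handler could look like in Python. The queue URL environment variable and the client-from-key convention are assumptions for illustration, not our exact implementation.

import json
import os

import boto3

# Hypothetical queue URL for the Data Platform Airflow trigger, supplied via environment.
QUEUE_URL = os.environ["FEEDBACK_TRIGGER_QUEUE_URL"]

sqs = boto3.client("sqs")


def handler(event, context):
    """Forward S3 object-created events to SQS so the Airflow pipeline can pick them up."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        message = {
            "bucket": bucket,
            "key": key,
            # Per-call configuration, e.g. which client config to apply (assumed key layout).
            "client": key.split("/")[0],
        }
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(message))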

While overhauling the pipeline architecture, we also used the opportunity to improve the application logic. Significant changes include:

  • Replacing ad hoc data checks with a robust validation framework built around Great Expectations to ensure data quality
  • An automation-first approach with common configuration across clients to transform data in common ways
  • Much richer metadata tracking, including automatic generation of metrics that save data scientists time when doing exploratory data analysis (EDA) on the data
  • Improved logging, helpful especially for troubleshooting failed runs
  • Improved deduplication logic
  • Improved capture of the raw data (enabled in part by Snowflake’s variant data type, for which Redshift does not have an equivalent)
  • Fault tolerance in the data itself, with an implementation of a “failures” table that allows a batch file to be processed once it meets a certain quality threshold, without failing the entire run (see the sketch after this list)
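
To illustrate the failures-table idea mentioned in the last bullet, here is a minimal PySpark sketch. The column name, table name, and threshold are hypothetical; the real implementation runs the full validation framework rather than a single rule.

from pyspark.sql import DataFrame
from pyspark.sql import functions as F

FAILURE_THRESHOLD = 0.05  # hypothetical: tolerate up to 5% failed rows per batch


def process_batch(batch: DataFrame) -> DataFrame:
    """Split a batch into valid and failed rows instead of failing the whole run."""
    # A single illustrative rule: a feedback record must have a record_id.
    valid = batch.filter(F.col("record_id").isNotNull())
    failed = batch.filter(F.col("record_id").isNull())

    failure_rate = failed.count() / max(batch.count(), 1)
    if failure_rate > FAILURE_THRESHOLD:
        raise ValueError(f"Batch rejected: {failure_rate:.1%} of rows failed validation")

    # Persist the rejected rows to a hypothetical "failures" table for later review.
    failed.write.mode("append").saveAsTable("feedback_failures")
    return valid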

We will not dive deep into the specific technologies used but will focus on how we handled the migration.

Migration Plan and Execution

In order to make this a successful migration we had to come up with a solid plan. Below is a step by step plan on how we approached the migration and improved along the way.

Read more: Large Scale Migration Best Practices

Backfilling

Once we had the new pipeline ready, we needed to backfill years of feedback files. We considered a couple options:

  • Migrate data directly from Redshift to Snowflake (using S3 as an intermediate staging layer), which is quick and straightforward
  • Re-process all files on the new system. This was a much more time-consuming option, but it applies all of the application code improvements (metrics, data quality checks, etc.), ensuring our data is consistent and that historical data is just as reliable as new data.

We took the hard road and went with the second approach given the importance of this data, which leads us to backfilling.

Backfilling itself is a very tedious process. We started off with a spreadsheet of the files that had been ingested in the old pipeline and divided the work among team members. The data size per file was not that big, but the number of files was large.

The backfill ingestion works like this:

  • We download the file from the source AWS account bucket
  • Optional: we manually test it (the next section discusses how we test)
  • We upload it to the destination bucket with a file arrival timestamp in the file name, which triggers the pipeline via Lambda (more on the file arrival timestamp later; see the sketch below)
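
As a rough illustration of the upload step, here is a sketch using boto3. The destination bucket name and the key naming convention are hypothetical, not our exact scheme.

from datetime import datetime
from pathlib import Path

import boto3

s3 = boto3.client("s3")

# Hypothetical destination bucket that the pipeline Lambda listens to.
DEST_BUCKET = "feedback-ingest-new"


def upload_with_arrival_timestamp(local_path: str, client: str, arrival: datetime) -> str:
    """Upload a backfill file, embedding its original arrival timestamp in the object key."""
    key = f"{client}/{arrival.strftime('%Y%m%dT%H%M%SZ')}_{Path(local_path).name}"
    s3.upload_file(local_path, DEST_BUCKET, key)  # the S3 notification triggers the pipeline Lambda
    return key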

We started with easier clients that had fewer files and fewer data issues, and we processed those very quickly. The challenge came when we started to ingest older clients that have been sending files daily for the past several years, meaning hundreds of thousands of files. With the startup time of the Airflow pipeline, even runs for small files could take 15 minutes end to end. This isn’t an issue for new files coming in, but given concurrency limitations, it meant the backfill would have taken many months. In the next section, we will discuss how we solved this problem and the importance of investing in automated tooling.

Scripting and Tooling to Achieve Automation

We decided to come up with some temporary automation solutions to speed up our process. The basic idea was to deal with two types of challenges:

Pipeline failures in production:

Clients have been providing data for a long time, so schemas, column values, and types have been inconsistent over the years. Older files often do not conform to the newer standards set by the Feedback Product Owners and would frequently fail validation.

We had been running the backfill in two ways:

  • Directly, by moving files to the new S3 bucket, manually kicking off a pipeline run, then reading failure logs on the prod pipeline, which would then require configuration updates.
  • Downloading and manually testing files with custom scripts for basic data quality checks on sampled data.

After a lot of pain with these approaches, we built a tool that serves as a local mini ETL. Since we were already moving files across accounts using local storage as a temporary staging area, it was ideal to verify on a few files per client that they pass our validations. This tool utilized the same validation code as our production pipeline, providing high confidence. It took more time upfront to build, but it was used widely and frequently within the team, saving hundreds of hours. No more testing on prod!
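
A rough sketch of what such a local mini ETL could look like is below. The imports stand in for the shared production modules; load_client_config and run_validations are hypothetical names, not our actual API.

from pyspark.sql import SparkSession

# Hypothetical imports standing in for the shared production validation code.
from feedback_pipeline.config import load_client_config
from feedback_pipeline.validation import run_validations


def validate_locally(file_path: str, client: str, sample_rows: int = 10000) -> None:
    """Load a sample of a client file locally and run the same validations as production."""
    spark = SparkSession.builder.master("local[*]").appName("mini-etl").getOrCreate()
    config = load_client_config(client)
    df = spark.read.option("header", True).csv(file_path).limit(sample_rows)
    for check, passed in run_validations(df, config).items():
        print(f"{check}: {'PASS' if passed else 'FAIL'}")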

Combining files per client:

The second scenario where we needed a tool was ingesting too many files. Some clients had thousands of files with an average size of 5 MB. Since we were using Spark, the size of the combined data was not a concern, so we again built a custom tool to handle the concatenation and automate things locally.

Our pipelines take roughly 15 minutes per run regardless of data size, since we can easily scale Spark executors horizontally.

This may look easy, but it was not, for a couple of reasons. First, clients have been sending data with different schemas over time, which we needed to reconcile, especially with a CSV file format. Second, we needed to handle duplicates with proper upsert logic when combining files, so we had to update our application code to handle the combined-file case.

We used the file arrival timestamp extracted from the file name in our upsert logic.
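
Below is a minimal sketch of that deduplication step in PySpark, assuming a hypothetical record_id key column and an arrival_ts column parsed from the file name.

from pyspark.sql import DataFrame, Window
from pyspark.sql import functions as F


def dedupe_combined(df: DataFrame) -> DataFrame:
    """Keep only the latest version of each record, based on file arrival timestamp."""
    latest_first = Window.partitionBy("record_id").orderBy(F.col("arrival_ts").desc())
    return (
        df.withColumn("_rank", F.row_number().over(latest_first))
        .filter(F.col("_rank") == 1)
        .drop("_rank")
    )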

With the combined approach, we did lose some granularity in metrics and metadata on a per file basis (e.g. row counts per file), but we decided that this was the right trade off. Speed was more important than keeping metrics for historical data in our scenario where we had strict deadlines and goals, and we did not make any concessions when it came to the more crucial data quality checks.

Lesson: automation is always important to speed up workflows. Even though the migration was a temporary effort, investing in tooling can really make you 10x (and save developers from a lot of tedium).

Config Driven Pipeline

Our pipelines are config driven, with logic controlled via YAML files; types of configuration include column mappings, value mappings, data types, and validations. During the backfill we ran the local mini ETL to make sure our validations pass. The beauty of our process was that everything was driven via config, with the ability to override per client via a client config. This makes it easy to switch off validations, update mappings, etc.

A typical validation is defined like this in the YAML config:

<column_name>:
  validations:
    - not_null
    - is_unique

Our validations (and metrics) are abstractions on top of the Great Expectations PySpark library. This custom framework deserves a separate article. For now, we are in the process of contributing this abstraction back to the core Data Platform API so that other teams can benefit from the work we have done.
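
To give a flavor of how a config entry could map onto Great Expectations, here is a sketch using the legacy SparkDFDataset API. The mapping dictionary and config shape are assumptions for illustration; our actual framework is richer than this.

import yaml
from great_expectations.dataset import SparkDFDataset
from pyspark.sql import DataFrame

# Illustrative mapping from config validation names to Great Expectations methods.
VALIDATION_METHODS = {
    "not_null": "expect_column_values_to_not_be_null",
    "is_unique": "expect_column_values_to_be_unique",
}


def run_config_validations(df: DataFrame, config_yaml: str) -> dict:
    """Run the validations declared in a YAML config against a Spark DataFrame."""
    config = yaml.safe_load(config_yaml)
    dataset = SparkDFDataset(df)
    results = {}
    for column, spec in config.items():
        for validation in spec.get("validations", []):
            method = getattr(dataset, VALIDATION_METHODS[validation])
            results[f"{column}.{validation}"] = method(column).success
    return results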

Treating Migrations as a Product

The migration in our case was more involved than simply moving data from Redshift to Snowflake, with tradeoffs and discussions needed along the way, and thus it made sense to involve product management. Our product manager was involved from the start of our migration process, helping in making decisions, communicating status to external teams, unblocking the team by paving the way to deal with challenging data, and bridging the gap between the client facing data team and our team. In short, the value of having a Product Manager is tremendous during migrations.

Pipeline Trigger Automation

Every day we were receiving new files from clients, which kept adding to our backfill list. This made it difficult to track everything in our initial inventory spreadsheet and to ingest into the new pipeline while still maintaining the old one until the backfill was finished. Eventually the plan is to migrate clients to the new SFTP, but since that is a lengthy process involving account management and the clients themselves, we decided on a short-term solution (shown in the architecture diagram earlier).

We reused one of our Lambdas to listen to the cross-account S3 source bucket and trigger the new pipeline on file drop. This allows us to run the old and new pipelines concurrently with minimal manual work.

Dealing with client data consent

Some of our clients had not given consent to store data in the new data warehouse, Snowflake. To accommodate this, we created a new DAG for such clients where we can easily avoid writing to Snowflake Sinks and switch over if and when clients give consent. End users will still be able to query the data by using internal tools and packages but not within the Snowflake environment, and future work will attempt to make these distinctions as invisible as possible to end users.
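
One simple way to express such a consent gate in a config-driven pipeline is sketched below. The has_snowflake_consent flag, config keys, and connector options are hypothetical stand-ins, not our actual configuration.

from pyspark.sql import DataFrame


def write_sinks(df: DataFrame, client_config: dict) -> None:
    """Always write to S3; write to Snowflake only for clients that have consented."""
    df.write.mode("append").parquet(client_config["s3_output_path"])

    if client_config.get("has_snowflake_consent", False):
        (
            df.write.format("snowflake")  # Spark Snowflake connector
            .options(**client_config["snowflake_options"])
            .option("dbtable", client_config["snowflake_table"])
            .mode("append")
            .save()
        )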

Parallel Systems and Redshift Cost

One of our goals was to move away completely from legacy Redshift clusters, which are more expensive to maintain and were not updated with our latest data sources. The switch from Redshift did two important things for us:

  • Cost savings
  • No more maintenance/ingestion in old pipelines

For the first few months, we kept running the pipelines in parallel. We also needed to give users and reporting tools like Looker time to switch over to Snowflake before we could fully deprecate the old pipeline. With the Lambda automation in place to copy files to the new S3 bucket and ingest into both pipelines, we could take our time, communicate expectations clearly, and deprecate Redshift at the right time.

End to End Alerting & Notification

We implemented alerting and notifications in multiple places using Datadog and Slack. We have three types: client file drop notifications, pipeline trigger notifications, and pipeline alerts. This gives us end-to-end visibility into our process.
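
As an example of the lightweight notifications involved, here is a sketch that posts a client file drop message to Slack via an incoming webhook. The webhook URL and message shape are placeholders, not our actual integration.

import json
import os
import urllib.request

# Hypothetical Slack incoming-webhook URL supplied via environment.
SLACK_WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]


def notify_file_drop(client: str, key: str) -> None:
    """Post a client file drop notification to a Slack channel."""
    payload = {"text": f"New feedback file from {client}: {key}"}
    request = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)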

Migration Status

We had a few ways to track our migration status; the ideal one was an apples-to-apples comparison per file between Redshift and Snowflake via a Looker dashboard. However, this was messy given the limited metadata tracking in the legacy pipeline and the differences in deduplication and validation logic, which made exact counts vary between the pipelines. We did build some tracking dashboards, but again relied heavily on our product manager to provide context when row counts did not match precisely.

UAT

During the backfill process, we also wanted to make sure to get feedback from the consumers, so we decided to start the user acceptance testing by involving Data Science more often. We asked a few people to help test the data and even involved a key stakeholder in our regular scrum rituals for about a month. This helped us identify issues early and provide enhancements before opening up the data to a larger group of users.

Onboarding Users

Recently, we have opened our data for consumption to everyone. To make the transition smooth, we wrote an onboarding guide on our knowledge library hosted on Confluence during the migration. This includes helpful information like getting access, schema comparisons between the old and new data, and example queries. We also worked with an internal tooling team which writes abstractions on top of SQL to make sure that our internal python libraries had nice wrappers around the new data. Finally, we reached out to product teams that use feedback data to provide a white glove service in helping their developers switch over to the new data warehouse and schema. In the world of data, “if you build it, they will come” is far from true and the work does not end when the pipeline is complete. Making this project successful has required excellent documentation, tutorials, and lots of time invested in making sure any teams dependent on this data are supported.

Future Items

SFTP Migration

The original goal is to migrate clients to the new SFTP, which resides in the new AWS account. We will start that switch slowly with a few clients in the upcoming months; this will be a cross-team effort.

Iterating on Validations

We still have a lot to improve and iterate on with validations, though we have a very good foundation, as mentioned before (a config-driven framework on top of Great Expectations). We need to iterate on some business logic and use cases and add the ability to go deeper with validations.

Automatic Sink Cleanup

We write data to Snowflake and S3, and sometimes, due to bad data, missed validations, duplicate data, or consent challenges, we have to clean up the sinks. Currently this is rare and we do it manually once in a while; we plan to automate it.

Self Serve Pipelines

The long-term goal is a fully self-serve pipeline; right now we maintain the configs and run the data ingestion ourselves. Self-serve would free our time to focus on writing abstractions and frameworks on top of the Data Platform rather than on ingestion.

Separation of Configs

As discussed, our pipelines are config driven. As of now the configs live with the code, which is not a good practice in general; we kept it that way to keep things simple and reduce external dependencies. The scalable solution is to separate them out, and with the Terraform foundation in place we can now take complete ownership.

Data Alerts

We will soon be implementing alerts on top of our Snowflake data. We already have built reports on top of our Snowflake tables using Looker. We may end up using custom alerting on Looker reports or using a tool like Soda.

Lessons Learned

We learned many lessons along the way, some important ones are:

Keep users with you from the start

We did not involve our data users consistently from the start, so we missed a few things that we had to catch up on later.

Make sure clients are on same page

Client consent was important and needed to be addressed during planning, as early as possible, to avoid delays.

Using IaC the right way

We use Terraform for most of our services that live in the new AWS account; however, the legacy account is still maintained manually by DevOps. We ended up overwriting some S3 notifications while creating the Terraform triggers for our cross-account Lambda. An important lesson we learned: never manage resources with IaC if they were created manually before; keep relying on DevOps in those cases.

Conclusion

Generally speaking, I believe it was a successful migration. There were a few minor mishaps in terms of missing some product features and addressing some client concerns, but we dealt with them easily by reprioritizing. We were also able to deliver a more robust system, fully automated, and cut off our usage of large legacy systems, especially Redshift. There are still important pieces ahead of us; hopefully we will learn from the mistakes mentioned in this article and keep moving forward. Stay tuned.

This was a team effort that required collaboration across multiple teams, including DevOps, Data Platform, and Data Science.

Authors: Junaid Effendi (Engineer) and Lucas Chapin (Manager)

Thanks to the rest of the members who contributed to this migration:

  • Engineers: Sindhusha Boyapati, Robert Smith, Girish Sukhwani
  • Product Manager: Naman Dhaliwal
  • Data Scientist: Louis Yansaud
