How Trade Me uses Prefect to orchestrate hundreds of data pipelines

Chuan Law
Published in Trade Me Blog
7 min read · Dec 20, 2023

Trade Me is New Zealand’s largest online auction and classifieds website and is visited by 650,000 Kiwis every day.

Trade Me’s new data platform journey started in 2020, when a team of BI developers was tasked with starting the migration of the company’s data warehouses to a more scalable and modern platform. You can read more about Trade Me’s cloud modernisation to Snowflake here.

Today we use a combination of SQL Server and Snowflake data warehouses to support our Analytics, Data Science, Marketing and Engineering teams across the business.

Analysts use these data warehouses to source their Power BI reports and to help answer questions from account managers handling customer requests. The data helps them communicate the value the Trade Me website provides to their customers’ products.

Marketing uses the data warehouses to analyse how we can better support our customers with helpful engagement. Figures like how many views a trending product gets over time, or how many people have watchlisted an item during a promotion, are invaluable and give our team insight into the market and a competitive edge.

These are just a few of the many uses the Trade Me Data Platform supports, but how do we keep this enormous flow of data from the website into our data warehouses fresh and consistent? Enter Prefect.

The Prefect 1 UI features a prominent run history along with upcoming runs and success/failure %

Trade Me was an early adopter of Prefect in 2019 and our first iteration of a Data Warehouse as a Service (DWAAS) was on Prefect 0.6.5!

Prefect came with heaps of tools that our team has used to efficiently manage and monitor over 100 Prefect flows for business-critical pipelines that directly affect things such as revenue reporting. Here are some of the best features we were able to utilise to make this possible and reliable.

Scheduling

The main draw of any orchestration tool is, of course, the scheduling.

Prefect 1

For our legacy data platform we did this at a very granular level. Trade Me has 4 main business units we call verticals, which we use to separate our core product offerings. Similarly, we split these up in our repo folder structure:

├── flows
│   ├── marketplace
│   │   ├── config-prod.json
│   │   ├── config-test.json
│   ├── motors
│   ├── property
│   ├── jobs
│   ├── ...
├── .gitlab-ci.yaml
└── .gitignore

The key to scheduling was our versatile config.json files. We implemented our own bespoke method that used these config files to deploy flows individually and to handle new deploys whenever there were code changes. This also included our own scripts to build and host the Prefect compute agents that were responsible for actually running the jobs.

prefect-schedule takes a cron expression, so we can easily set run times on each individual flow (the 30 10 1 * * schedule in the config below runs at 10:30 am on the first day of each month). Special shoutout to crontab guru for their super useful and intuitive website, which makes this a breeze.

Prefect is configured in UTC by default, so we have found that setting your timezone is essential to prevent issues like daylight saving changes shifting your flow runs by an hour. We have had issues in the past where data arrived an hour late because of this and our subsequent downstream pipelines were missing data! Definitely a nice feature to have.

Below is a config file we have for one of our flows:

{
  "name": "search-boost-seller-qualification",
  "parameters": "'{\"stage\": \"prod\", \"slack_channel\": \"data-alerts\"}'",
  "cloud-deploy": "true",
  "cloud-execution-environment": "fargate",
  "cpu": 512,
  "memory": 2048,
  "project-name": "DWAAS Prod Environment",
  "schedule-timezone": "Pacific/Auckland",
  "prefect-schedule": "30 10 1 * *"
}
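
To give a sense of how one of these configs drove a deploy, below is a minimal sketch of reading the file and registering a Prefect 1 flow with a timezone-aware cron schedule. It is illustrative only; the qualify_sellers task and register_from_config helper are hypothetical stand-ins, not our actual tooling:

# Illustrative sketch only: a simplified stand-in for our bespoke Prefect 1 deploy script.
import json

import pendulum
from prefect import Flow, task
from prefect.schedules import Schedule
from prefect.schedules.clocks import CronClock


@task
def qualify_sellers(stage: str, slack_channel: str):
    # Placeholder task body; the real flow does the seller qualification work
    print(f"Running seller qualification for {stage}, alerting #{slack_channel}")


def register_from_config(path: str) -> None:
    with open(path) as f:
        cfg = json.load(f)  # e.g. the config shown above

    # A timezone-aware start_date makes the cron schedule follow Pacific/Auckland
    # (including daylight saving) instead of defaulting to UTC.
    schedule = Schedule(
        clocks=[
            CronClock(
                cfg["prefect-schedule"],
                start_date=pendulum.now(cfg["schedule-timezone"]),
            )
        ]
    )

    with Flow(cfg["name"], schedule=schedule) as flow:
        qualify_sellers(stage="prod", slack_channel="data-alerts")

    # Register the flow against the Prefect project named in the config
    flow.register(project_name=cfg["project-name"])


if __name__ == "__main__":
    register_from_config("flows/marketplace/config-prod.json")

The real scripts handled much more than this (Fargate compute settings, change detection, agent hosting and so on), but the scheduling and registration pieces looked roughly like the above.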

Deployment Management

Prefect provides many ways to handle the deployment of these flows. As shown above, our legacy data platform made use of multiple config files per flow to handle deploys. We typically used two per flow, one for production and another for test. But as you can imagine, the more flows you have, the more boilerplate files you need.

Although it is advantageous to have many options, it was difficult for us to find the best way to build out our new platform without another highly specialised build like before. We needed to revisit how we managed our release process. This section covers our approach and what we found worked best for us.

Prefect 2

Prefect 2 supports config-based deployments natively. Instead of the custom solution described earlier for Prefect 1, we now use a single file called prefect.yaml to handle multiple deployments of flows across the entire repository.

Below is a snippet of the new config for one of our flows. Notice the simple schedule setup with a new schedule block in the deployments and the use of YAML anchors for customisable configuration:

# prefect.yaml
# File for configuring project / deployment build, push and pull steps

name: multi-deployment
prefect-version: 2.11.0

definitions:
  work_pools:
    default_job_variables: &default_job_variables
      image: # location of image to use
      env:
        STAGE: "{{ $ENVIRONMENT }}"

    intense_job_variables: &intense_job_variables
      <<: *default_job_variables
      cpu: 6000m # for higher compute requirements
      memory: 16G

    default_work_pool: &default_work_pool
      name: cloud-run-push-pool
      job_variables:
        <<: *default_job_variables

  actions:
    docker_build: &docker_build
      - # steps for docker build push commands

deployments:
  - name: dealer-image-check
    description: Motors Dealer Image Check
    entrypoint: legacy_jobs.motors.dealer_image_check.flow:entrypoint
    parameters: { 'param_slack_secret': "slack-token" }
    schedule:
      cron: "0 9 * * MON"
      timezone: 'Pacific/Auckland'
    tags: [ 'motors' ]
    work_pool:
      <<: *default_work_pool
      job_variables:
        <<: *intense_job_variables

prefect.yaml works seamlessly with your traditional .gitlab-ci.yml file - all you need to do is run this command while targeting your desired environment in Prefect, and it will deploy all flows using your provided Docker image containing your source code.

prefect --no-prompt deploy --all

This has allowed us to avoid multiple config files and to drop the custom-built mechanism that checked for code changes before deploying. prefect.yaml handles this complexity for us and drastically reduces our technical debt going forward.

We have also moved away from self-hosted Prefect agents to push work pools, which Prefect Cloud uses to submit flow runs for execution. This reduces our maintenance burden and removes a single point of failure from an infrastructure point of view, as we no longer need to provision servers to run the agents.

This new approach makes it a lot easier for our team to manage and release multiple flows. It also makes it far more user friendly for our end users wanting to set up a new flow for the first time. Less infra for us to manage and no more boilerplate configs for each new flow!

Cloud Agnostic

Prefect, although it looks like your typical SaaS offering in the cloud, actually runs on a ‘bring your own compute’ model. This does have its advantages, namely in cost management. But for this section I particularly want to point out how this model helped a lot with our data migration journey.

When the data platform team first adopted Prefect, Trade Me was mostly an AWS company. However, since Project Kapua, all new products we build have seen us move our stack to Google Cloud Platform, or GCP for short.

Where AWS ECR and Fargate instances once powered our Prefect agents and backend, we have now built our new Prefect 2 infrastructure on GCP’s Artifact Registry and Cloud Run services with seamless compatibility.

Because of this, our new data platform is now more aligned with the rest of the company’s long-term vision, and we get to reuse a lot of well-built infrastructure design patterns from across the company to manage our new Prefect 2 Data Platform efficiently and scalably.

Being cloud agnostic, Prefect has not only given us more freedom and ownership over our data platform, but has also helped us optimise its cost and align it with the rest of our company’s infrastructure. If you are using a public cloud, you can bring your best practices and use them with Prefect!

Flow and Task Debugger

Prefect 2 also has a fantastic debugger in its UI. It shows the full run history of each of your flows and all the tasks that are run, in order of their dependencies. A task in Prefect is a Python function decorated with @task, used to split up the processing of a flow.

Prefect 2 Flow run debugger

This provides an easy way to debug tasks that fail on specific processes and shows us all runnable components in a flow for a full picture.
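
To make the flow/task structure concrete, here is a toy Prefect 2 flow (not one of our production pipelines). Each decorated function appears as its own node in the run view with its state, logs and dependencies:

from prefect import flow, task


@task
def extract() -> list[int]:
    # Pretend this pulls rows from a source system
    return [1, 2, 3]


@task
def transform(rows: list[int]) -> list[int]:
    # Each task run gets its own state and log entry in the Prefect UI
    return [row * 10 for row in rows]


@task
def load(rows: list[int]) -> None:
    print(f"Loaded {len(rows)} rows")


@flow(name="toy-etl", log_prints=True)
def toy_etl():
    rows = extract()
    load(transform(rows))


if __name__ == "__main__":
    toy_etl()

If transform fails, the flow run page highlights that task's state and logs, which is usually enough to pinpoint the problem without rerunning the whole pipeline.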

Summary

Trade Me’s Data Platform houses data from across the business for all Kiwis in New Zealand. Being able to manage all the data pipelines that unlock business value is paramount to our success.

With Prefect we are able to utilise its highly customisable deployments, scheduling and verbose logging history to run and maintain over 100 Prefect flows (Python-based data pipelines).

Prefect is cloud agnostic and has also given us the opportunity to easily migrate our existing data pipeline infrastructure from AWS to GCP and align our data platform team with the rest of the company’s vision for Trade Me.
