Using Argo for Data Pipelines at Scale inside Tradeshift

Mike Williamson
Tradeshift Engineering
4 min read · Dec 16, 2021

The technology world has moved to containerization to simplify application and service deployments. Robust services and applications rely on container orchestration tools to keep services available when individual containers fail. At Tradeshift we use the container orchestration tool Kubernetes. Data pipelines, however, involve complex dependencies in addition to robust orchestration: the launch of one pipeline task depends upon the completion and output of a previous task. Argo lets us define workflows that capture and manage those dependencies.
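To make this concrete, here is a minimal sketch of how an Argo Workflow expresses such a dependency. The pipeline, task names, and image are illustrative rather than anything we run in production; the key piece is the dependencies field, which tells Argo that the transform task may only start once ingest has completed successfully.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: ingest-then-transform-    # hypothetical pipeline name
spec:
  entrypoint: main
  templates:
    - name: main
      dag:
        tasks:
          - name: ingest
            template: echo
            arguments:
              parameters: [{name: msg, value: "ingest raw data"}]
          - name: transform
            dependencies: [ingest]         # transform waits for ingest to succeed
            template: echo
            arguments:
              parameters: [{name: msg, value: "transform the ingested data"}]
    - name: echo
      inputs:
        parameters:
          - name: msg
      container:
        image: alpine:3.16
        command: [echo, "{{inputs.parameters.msg}}"]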

Why Argo?

Throughout all of our services, Tradeshift relies heavily on containers and Kubernetes. In the data and machine learning team, we wanted to rely upon the excellent orchestration infrastructure that was already available. There are many open source tools that can manage dependencies and which also integrate with Kubernetes. Several of these options could have worked in our situation. Argo rose to the top because:

  • Argo workflows are implemented as a custom resource definition within Kubernetes. This keeps it quite close to the core Kubernetes infrastructure we depend upon.
  • We need to be able to schedule batch jobs, which Argo handles natively with its CronWorkflow resource (see the sketch after this list).
  • A Python-based tool is not a plus in our case: much of our work is done in Scala when working with Spark, so not every team member prefers Python over a simple configuration language like YAML.
  • Argo has already been around for a few years (since 2017) and we prefer tools with successful usage across a range of platforms and industries.
  • Argo has a beautifully simple and effective dashboard that is trivial to understand and use.
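To illustrate the batch-scheduling point above: a scheduled run is declared as a CronWorkflow, which wraps an ordinary workflow spec in a cron schedule. The name, schedule, timezone, and container here are hypothetical, a sketch of the shape rather than one of our real schedules.

apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: nightly-batch              # hypothetical name
spec:
  schedule: "0 2 * * *"            # every day at 02:00
  timezone: "Europe/Copenhagen"
  concurrencyPolicy: Forbid        # never let two runs of the same batch overlap
  workflowSpec:
    entrypoint: main
    templates:
      - name: main
        container:
          image: alpine:3.16
          command: [echo, "kick off the nightly batch here"]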

Data within Tradeshift

Tradeshift is a fully digital network of over 1 million commercial buyers and sellers spanning the entire globe. A Tradeshift customer’s purchasing team needn’t manage dozens or hundreds of separate invoicing platforms nor understand the tax rates across all their suppliers’ countries. This headache has been eliminated. Moreover, Tradeshift provides a platform where buyers and sellers can more easily find each other. We focus upon eliminating the friction in the B2B marketplace. Tradeshift digitizes trade, simply.

With the emphasis upon reducing the friction of digitizing B2B transactions, both Tradeshift’s software and its data are first-class citizens throughout the organization. To provide a highly responsive user experience, we first ingest data into an infrastructure closely matching the classic OLTP framework: maximize the number of transactions per second and the responsiveness of each transaction. But this data also needs to be made available for reporting, both internal and for external customers, and to power artificial intelligence solutions. For instance, automated matching of purchase order lines to invoice lines is performed to further ease the friction of B2B trade. We therefore maintain a data lake with numerous transformations and joins of data from across the organization. Due to customer needs, varied data sources, and the unfortunate peculiarities of invoicing systems across the globe, the ELT (extract, load, and transform) pipelines inside Tradeshift are complex. To manage this complexity in a fully CI/CD-enabled environment with robust job orchestration and restarts, we rely upon Argo, as discussed above.

Using Argo at Tradeshift

Although Argo workflows are just custom resources within Kubernetes, the YAML files for an entire workflow are too involved to walk through in full here. Tradeshift primarily uses AWS for cloud computing and storage. A typical batch Argo workflow starts from a scheduling trigger (an Argo CronWorkflow), which passes arguments to an Argo WorkflowTemplate that is standardized so it can serve several similar pipelines. The template typically contains the pipeline’s dependency graph. Each dependent task is either a Spark job submitted to AWS EMR or a less intensive job that can run in a single container, such as sending out an email report. The dependency graphs can become fairly complex; the workflow pictured above has some of its nodes hidden to reduce clutter and is still quite involved.
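The sketch below gives a feel for the shape of such a WorkflowTemplate. The template name, parameter, EMR cluster id, step-definition path, and images are illustrative rather than our actual configuration; the pattern is what matters: parameters supplied by the triggering CronWorkflow, a DAG of dependent tasks, a heavier step that submits work to an EMR cluster, and a lightweight single-container step at the end.

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: batch-pipeline                      # hypothetical shared template
spec:
  entrypoint: main
  arguments:
    parameters:
      - name: dataset                       # supplied by the triggering CronWorkflow
  templates:
    - name: main
      dag:
        tasks:
          - name: spark-transform
            template: spark-on-emr
          - name: email-report
            dependencies: [spark-transform]   # only runs after the Spark job succeeds
            template: send-report
    # Heavier step: submit a step to an existing EMR cluster via the AWS CLI
    # (placeholder cluster id and step-definition path; assumes AWS credentials
    # are available to the pod, e.g. via an IAM role)
    - name: spark-on-emr
      container:
        image: amazon/aws-cli
        command: [sh, -c]
        args:
          - >
            aws emr add-steps
            --cluster-id j-XXXXXXXXXXXXX
            --steps file:///steps/{{workflow.parameters.dataset}}.json
    # Lightweight step that runs in a single container
    - name: send-report
      container:
        image: alpine:3.16
        command: [sh, -c]
        args: ["echo 'report for {{workflow.parameters.dataset}} would be emailed here'"]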

Nonetheless, Argo provides a simple interface both for constructing our dependencies and for viewing and understanding how they perform. Argo also lets developers easily peer into the container logs to debug problems or see how a task is doing. Clicking on any completed task, whether it succeeded or failed, brings up a log similar to the one shown below. (This is a simple report email triggered as the last step, with the email addresses, S3 bucket, and customer removed for privacy.)

Opportunities at Tradeshift

Tradeshift is a global company with offices across the world. The team behind this project, which focuses on data engineering and machine learning, is based in Denmark and is actively looking for very experienced data engineers and machine learning engineers. We hope to meet you soon!
