FinML — Data Versioning Pipelines

Hendrik
Tide Engineering Team
1 min read · Jan 19, 2021

About FinML

Tide started FinML in order to foster knowledge sharing on ML within financial applications. We meet monthly to discuss problems that come up when applying ML to financial problems — this was the second iteration. If you want to join, subscribe to our substack and you’ll be invited automatically!

Data Versioning Pipelines with Jimmy (Pachyderm)

In the second instalment of FinML, we talked with Jimmy Whitaker about the importance of data versioning and the possibility of building CI/CD pipelines not only for code changes, but also for data changes.

Key take-aways for me were:

  • Besides the code lifecycle, there is a second, data lifecycle for ML applications
  • Given that ML applications are products of both code and data, we need to apply pipelines both to data and code
  • Building pipelines on the data allows you to not only apply tests to the code but also to the data and the code/data interactions
  • The artefacts produced in an ML lifecycle need to be versioned together with the data
  • Ideally this interaction can be chained in a DAG-like fashion
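To make the last two points concrete, a minimal sketch of a Pachyderm pipeline spec is shown below. A pipeline subscribes to a versioned data repo, so every new data commit re-triggers the transform and the outputs are versioned alongside their inputs; chaining such specs (one pipeline's output repo as the next one's input) yields the DAG structure mentioned above. The repo name, image, and script here are hypothetical placeholders, not from the talk:

```json
{
  "pipeline": { "name": "train-model" },
  "input": {
    "pfs": { "repo": "training-data", "glob": "/*" }
  },
  "transform": {
    "image": "python:3.9",
    "cmd": ["python", "/train.py", "/pfs/training-data", "/pfs/out"]
  }
}
```

Anything written to `/pfs/out` is committed to the pipeline's own output repo, so model artefacts stay tied to the exact data commit that produced them.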

Watch the full session here:

FinML 2 — Data Versioning Pipelines

Presentation
Original Blog
GitHub Repo for the demo
Pachyderm Hub (if you want to try the demo)
