FinML — Data Versioning Pipelines

Hendrik
Tide Engineering Team
1 min read · Jan 19, 2021

About FinML

Tide started FinML in order to foster knowledge sharing on ML within financial applications. We meet monthly to discuss problems that come up when applying ML to financial problems — this was the second iteration. If you want to join, subscribe to our substack and you’ll be invited automatically!

Data Versioning Pipelines with Jimmy (Pachyderm)

In the second instalment of FinML, we talked with Jimmy Whitaker about the importance of data versioning and the possibility of building CI/CD pipelines not only for code changes, but also for data changes.

Key take-aways for me were:

  • Besides the code lifecycle, there is a second, data lifecycle for ML applications
  • Given that ML applications are products of both code and data, we need to apply pipelines both to data and code
  • Building pipelines on the data allows you to not only apply tests to the code but also to the data and the code/data interactions
  • The artefacts produced in an ML lifecycle need to be versioned together with the data
  • Ideally this interaction can be chained in a DAG-like fashion
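To make the last two points concrete, a minimal sketch of a Pachyderm pipeline spec is shown below. A pipeline subscribes to a versioned data repo, so every new data commit re-triggers the transform and the outputs are versioned alongside their inputs; chaining such specs (one pipeline's output repo as the next one's input) yields the DAG structure mentioned above. The repo name, image, and script here are hypothetical placeholders, not from the talk:

```json
{
  "pipeline": { "name": "train-model" },
  "input": {
    "pfs": { "repo": "training-data", "glob": "/*" }
  },
  "transform": {
    "image": "python:3.9",
    "cmd": ["python", "/train.py", "/pfs/training-data", "/pfs/out"]
  }
}
```

Anything written to `/pfs/out` is committed to the pipeline's own output repo, so model artefacts stay tied to the exact data commit that produced them.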

Watch the full session here:

FinML 2 — Data Versioning Pipelines

Presentation
Original Blog
GitHub Repo for the demo
Pachyderm Hub (if you want to try the demo)
