Versioning Data Science Solutions

#1 in the Evolving Data Science Series

John Aven
Hashmap, an NTT DATA Company
3 min read · Jan 7, 2020

I come from an academic background, so when I talk about how versioning is done within data science, I speak from experience. Over my career, however, I have also honed my skills as a software engineer, architect, and tech innovation leader. That combination gives me a perspective on how things are and have been done, and on how they can be done much better.

Origins of Versioning in Data Science

Data science inherited its versioning practices from academia, and most computational sciences have worked this way for many years. Don’t get me wrong: computer science did the same in the early digital days (and in many cases still does in academia), but it has since moved on to more advanced practices.

Avoid Faux Versioning

What is this practice? It is the creation of ‘versioning’ schemes through ‘smart’ file naming. These schemes generally produce files with suffixes like the following:

  • _new, _old, _old_new, _new_1, …
  • _1, _2, _1_1

This approach to versioning, which I refer to as faux versioning, is a disease, and it is endemic. But breaking from it can and should be done, because modern practices such as model management require it. The way to do so is to use a version control system (VCS), and the industry preference is for a git-based solution.
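To gauge how far faux versioning has spread in a project, a small script can flag file names that end in suffixes like those listed above. The following is only a rough sketch: the pattern and the function name find_faux_versioned are illustrative assumptions, not part of any tool.

```python
import re
from pathlib import Path

# Assumed pattern: a file stem ending in one or more suffixes such as
# _new, _old, or _<number> (for example model_old_new or notes_1_1).
FAUX_SUFFIX = re.compile(r"(?:_(?:new|old|\d+))+$", re.IGNORECASE)


def find_faux_versioned(root):
    """Return sorted paths under root whose stem ends in a faux-version suffix."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and FAUX_SUFFIX.search(p.stem)
    )
```

This is a heuristic, so it will also flag legitimate names that happen to end in a number (a date-stamped export, say). Treat the result as a list of candidates to review and move under version control, not as a verdict.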

Go with a Git-Based VCS

With a git-based VCS and an appropriate branching strategy, you can immutably track the versions of your ML models and other artifacts in your data science development life cycle. Unlike in other software engineering disciplines, however, versioning code is necessary but not sufficient.

Additional issues must be considered:

  • versioning of data
  • versioning of models
  • modeling metadata storage
  • etc.
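One minimal way to picture how these pieces relate is a manifest that ties a code commit to content hashes of the data and model files, alongside the modeling metadata. The sketch below assumes the commit id is obtained separately (for example from `git rev-parse HEAD`); the function names and manifest fields are hypothetical and do not come from any specific tool.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file's contents so that any change yields a new identifier."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(code_commit, data_path, model_path, metadata):
    """Tie one code commit to the exact data and model files used with it."""
    return {
        "code_commit": code_commit,  # assumed to come from the VCS
        "data_sha256": sha256_of(data_path),
        "model_sha256": sha256_of(model_path),
        "metadata": metadata,  # e.g. hyperparameters and metrics
    }


def write_manifest(manifest, out_path):
    """Store the manifest as JSON so it can be committed next to the code."""
    Path(out_path).write_text(json.dumps(manifest, indent=2, sort_keys=True))
```

Committing such a manifest alongside the code gives each commit a record of which data and model it corresponds to, without placing large binary files in the repository itself.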

While these additional concerns can be managed, a vanilla implementation will not suffice over the longer term and may do more harm than good. Hashmap can help you down this path and put in place a solution that best fits your unique business needs.

This is part of the Evolving Data Science series.

Feel free to share on other channels, and be sure to keep up with all new content from Hashmap. To listen in on a casual conversation about all things data engineering and the cloud, check out Hashmap’s podcast, Hashmap on Tap, on Spotify, Apple, Google, and other popular streaming apps.

John Aven, Ph.D., is the Director of Engineering at Hashmap, which provides Data, Cloud, IoT, and AI/ML solutions and consulting expertise across industries, with a group of innovative technologists and domain experts accelerating high-value business outcomes for its customers. Be sure to connect with John on LinkedIn and reach out for more perspectives and insight into accelerating your data-driven business outcomes.

Published in Hashmap, an NTT DATA Company

NTT DATA acquired Hashmap in 2021 and will no longer be posting content here after Feb. 2023. You can view more content from innovative technologists and domain experts on data, cloud, IIoT/IoT, and AI/ML on NTT DATA’s blog: us.nttdata.com/en/blog

Written by John Aven

“I’d like to join your posse, boys, but first I’m gonna sing a little song.”