Versioning Data Science Solutions

#1 in the Evolving Data Science Series

Published in Hashmap, an NTT DATA Company

3 min read · Jan 7, 2020

I come from an academic background, so when I talk about how versioning is done within data science, I speak from experience. Over my career, though, I have also honed my skills as a software engineer, architect, and tech innovation leader. That combination gives me a unique perspective on how things are and have been done, and on how they can be done much better.

Origins of Versioning in Data Science

Data science inherited its practice of versioning from academia, and this approach is what nearly every computational science field has followed for many years. Don’t get me wrong: computer science went this way too in the early digital days (and in many cases the practice is still prevalent in academia), but it has since moved on to more advanced practices.

Avoid Faux Versioning

What is this practice? It is the practice of creating ‘versioning’ schemes through ‘smart’ file naming. These schemes generally produce files with suffixes like the following:

_new, _old, _old_new, _new_1, …
_1, _2, _1_1

This approach to versioning, which I refer to as faux versioning, is a disease, and it is endemic. But you can break from these practices, and you should, because modern practices such as model management require it. This can be accomplished using a version control system (VCS), with an industry preference for a git-based solution.

Go with a Git-Based VCS

With a git-based VCS and an appropriate branching strategy, you can immutably track the versions of your ML models and the other artifacts in your data science development life cycle. Unlike in other software engineering disciplines, however, versioning the code is necessary but not sufficient.
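
As a minimal sketch of what tracking against a commit can look like, the snippet below asks git which commit is checked out and whether the working tree has uncommitted changes, then builds a label that could be attached to a trained model. The function names (git_state, version_label) and the "-dirty" suffix are hypothetical choices made for illustration, and the sketch assumes the git command-line tool is installed; it is not a prescribed workflow.

```python
import subprocess
from typing import Optional, Tuple


def git_state(repo_dir: str = ".") -> Optional[Tuple[str, bool]]:
    """Return (commit_sha, is_dirty) for repo_dir.

    Returns None if git is not installed, repo_dir is not inside a git
    repository, or the repository has no commits yet.
    """
    try:
        sha = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir, capture_output=True, text=True, check=True,
        ).stdout.strip()
        status = subprocess.run(
            ["git", "status", "--porcelain"],
            cwd=repo_dir, capture_output=True, text=True, check=True,
        ).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    # Any output from `git status --porcelain` means uncommitted changes.
    return sha, bool(status.strip())


def version_label(sha: str, is_dirty: bool) -> str:
    """Build a label for a model artifact from the commit that produced it."""
    label = sha[:12]
    # Flag runs made from uncommitted code: they cannot be reproduced
    # from the commit alone.
    return f"{label}-dirty" if is_dirty else label
```

Calling git_state() at the start of a training run and storing version_label(...) next to the model ties the artifact to the exact code that produced it, which a suffix like _new_1 cannot do.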

Additional issues must be considered:

versioning of data
versioning of models
modeling metadata storage
etc.
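
One way to picture how these concerns fit together is a small manifest that records, for a single training run, the code commit, a content hash of the data, a content hash of the model file, and free-form modeling metadata. This is only a sketch under stated assumptions: the function names and manifest fields below are hypothetical, and a production setup would more likely rely on purpose-built tooling than on a hand-rolled JSON file.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file's contents in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(code_commit: str, data_path: Path, model_path: Path,
                   metadata: dict) -> dict:
    """Tie code, data, model, and metadata versions together in one record.

    The field names are hypothetical; the point is that each artifact is
    identified by its content, not by its file name.
    """
    return {
        "code_commit": code_commit,
        "data_sha256": sha256_of(data_path),
        "model_sha256": sha256_of(model_path),
        "metadata": metadata,
    }


def write_manifest(manifest: dict, out_path: Path) -> None:
    """Persist the manifest as JSON so it can be committed alongside the code."""
    out_path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
```

Because the hashes depend only on file contents, renaming a dataset leaves its identifier unchanged, while editing a single row produces a new one.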

While these additional concerns can be managed, a vanilla implementation will not suffice over the longer term and may do more harm than good. Hashmap can help you down this path and put in place a solution that best fits your unique business needs.

This is part of the Evolving Data Science series.

Feel free to share on other channels, and be sure to keep up with all new content from Hashmap. For a casual conversation about all things data engineering and the cloud, check out Hashmap’s podcast, Hashmap on Tap, on Spotify, Apple, Google, and other popular streaming apps.

Hashmap on Tap | Hashmap Podcast

A rotating cast of Hashmap hosts and special guests explore different technologies from diverse perspectives while enjoying a drink of choice.

www.hashmapinc.com

If you enjoyed reading this, some of John’s other recent stories are below:

Securely Using Snowflake’s Python Connector within an Azure Function

Why and How to use Key Vault

Using DBT to Execute ELT Pipelines in Snowflake

DevOps and DataOps for Snowflake with DBT and Azure DevOps

Orchestration and DAG Design in Apache Airflow — Two Approaches

Orchestration of ETL processes — aka data pipelines — is a conceptually simple exercise, it’s the implementation that…

John Aven, Ph.D., is the Director of Engineering at Hashmap, providing Data, Cloud, IoT, and AI/ML solutions and consulting expertise across industries with a group of innovative technologists and domain experts accelerating high-value business outcomes for our customers. Be sure to connect with John on LinkedIn and reach out for more perspectives and insight into accelerating your data-driven business outcomes.

Written by John Aven