Versioning Data Science Solutions
#1 in the Evolving Data Science Series
I come from an academic background, so when I talk about how versioning is done within data science, I speak from experience. Over my career, however, I have also honed my skills as a software engineer, architect, and tech innovation leader. This gives me a unique perspective on how things are done, how they have been done, and how they can be done better, much better.
Origins of Versioning in Data Science
Data science inherited its practice of versioning from academia, and the same approach has been used in most computational scientific fields for many years. Don't get me wrong: computer science went this way in its early digital days (and in many cases, this is still prevalent in academia), but it has since moved on to more advanced practices.
Avoid Faux Versioning
What is this practice? It is the creation of ‘versioning’ schemes through ‘smart’ file naming. These schemes generally produce files like the following:
- _new, _old, _old_new, _new_1, …
- _1, _2, _1_1
This approach to versioning, which I refer to as faux versioning, is a disease, and it is endemic. But breaking from these habits can, and should, be done; modern practices such as model management require it. The remedy is a version control system (VCS), with an industry preference for a git-based solution.
Go with a Git-Based VCS
Using a git-based VCS with an appropriate branching strategy, you can immutably track the versions of your ML models and the other artifacts in your data science development life cycle. But in data science, unlike other software engineering disciplines, versioning the code is necessary but not sufficient.
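To make this concrete, here is a minimal sketch of replacing file-name versioning with git. All repository, file, and tag names below are illustrative assumptions, not a prescribed convention; the point is that an annotated tag gives you an immutable, human-readable version marker instead of a `train_new_1.py`-style rename.

```shell
# Work in a scratch directory so the sketch is self-contained.
cd "$(mktemp -d)"
git init -q model-repo && cd model-repo
git config user.email "you@example.com"   # local identity for the demo repo
git config user.name "Data Scientist"

# Track the artifacts that define a model run, instead of renaming them.
echo "n_estimators: 100" > params.yaml
git add params.yaml
git commit -q -m "Baseline model parameters"

# An immutable, annotated tag replaces ad hoc names like params_new_2.yaml.
git tag -a model-v1.0 -m "First candidate model"
git tag --list
```

From here, a branching strategy (for example, one branch per experiment, tags on candidates promoted from it) keeps every historical state recoverable by name rather than by guessing which `_old_new` file was current.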
Additional issues must be considered:
- versioning of data
- versioning of models
- modeling metadata storage
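As a rough illustration of the first and third concerns, the sketch below derives a data "version" from a content hash and appends it, with run metadata, to a simple log. The file names and log format are assumptions made up for this example; purpose-built tools such as DVC and MLflow carry these ideas much further and are what a longer-term solution would build on.

```shell
cd "$(mktemp -d)"

# A content hash identifies the data itself: the same bytes always
# produce the same version id, unlike train_data_new_2.csv.
printf 'feature,label\n1,0\n2,1\n' > train.csv
DATA_VERSION=$(sha256sum train.csv | cut -c1-12)

# Record which data version and parameters produced a model run.
echo "model=churn-rf data_version=${DATA_VERSION} n_estimators=100" >> runs.log
cat runs.log
```

Because the id is derived from content rather than chosen by hand, two colleagues hashing the same file will agree on its version, and any edit to the data is immediately visible as a new id in the run log.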
While these additional concerns can be managed, a vanilla implementation will not suffice in the longer term and may cause more harm than good. Hashmap can help you down this path and put in place a solution that best fits your unique business needs.
This is part of the Evolving Data Science series.
Feel free to share on other channels, and be sure to keep up with all new content from Hashmap here.
If you enjoyed reading this, some of John’s other recent stories are below:
Securely Using Snowflake’s Python Connector within an Azure Function: Why and How to use Key Vault
Using DBT to Execute ELT Pipelines in Snowflake
DevOps and DataOps for Snowflake with DBT and Azure DevOps
Orchestration and DAG Design in Apache Airflow — Two Approaches
John Aven, Ph.D., is Lead Regional Technical Expert at Hashmap, providing Data, Cloud, IoT, and AI/ML solutions and consulting expertise across industries with a group of innovative technologists and domain experts accelerating high-value business outcomes for our customers. Be sure to connect with John on LinkedIn and reach out for more perspectives and insight into accelerating your data-driven business outcomes.