Data Science Repo Scaffolding for the Next Generation

#3 in the Evolving Data Science Series

John Aven
Hashmap, an NTT DATA Company
3 min read · Jan 31, 2020


Welcome back to the Evolving Data Science series. If you haven't already read the previous entries, you can find them below and read them at your leisure; there is no particular ordering to this series. Today we want to talk about how you, as a company and as agents of change, can scale up your data science processes.

Code Repo Variance

Some of the inherent complexity in data science comes from the number of different steps that any given solution can involve. This generally results in code repositories that vary widely, even across a single team on a solution-by-solution basis. Each scientist has their own preference for what constitutes a good structure, and that preference changes over time. This can lead to chaos!

While this approach may be fine for smaller teams with low velocity, it is the essence of technical debt. A simple solution is not always the best one; the effort required to move away from an ad hoc approach later often outweighs the upfront effort saved by skipping a properly scoped and designed solution. Fortunately, that trade-off is easier to escape in data science: it is straightforward to move from an ad hoc approach to a consistent repository layout, and from there to a properly defined scaffolding.
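To make the idea concrete, here is a minimal sketch of what a standardized scaffold might look like. The directory names below are illustrative assumptions, not a prescription of any particular template:

    my-ds-project/
        data/              raw and processed datasets (kept out of version control)
        notebooks/         exploratory analysis
        src/               reusable pipeline and model code
        tests/             unit tests for the code in src/
        configs/           environment and experiment configuration
        models/            serialized model artifacts
        README.md          what the project does and how to run it
        requirements.txt   pinned dependencies

The value is not in these particular names but in every repository across the team sharing the same shape, so anyone can open a project and immediately know where things live.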

Next-Gen Approach

You may be saying to yourself, or you will after you do some reading, that there is already a solution for this, and it's called Cookiecutter. You aren't wrong. It is a solution, but it is one geared toward the data science of today and yesterday. It is not designed to help you move to the next generation of data science solutions or the next generation of ML engineering.
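For reference, Cookiecutter works by stamping out a new project from a template. A typical invocation against the widely used community data science template looks something like this (the template URL refers to the open-source community project, not a Hashmap asset):

    pip install cookiecutter
    cookiecutter https://github.com/drivendata/cookiecutter-data-science

It answers a few prompts and generates a fixed directory layout. That is useful, but it is a one-time stamp of today's structure rather than a scaffold that evolves with your ML engineering practice.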

The data science team here at Hashmap has worked on a fresh approach that will help you evolve your data science practice and operate at a scale you could previously only wish for.

We are ready to speak with you. Are you ready to evolve? If so, please contact us to arrange a time to discuss how we can help your organization meet the needs of tomorrow, today.

This is part of the Evolving Data Science series.

Feel free to share on other channels, and be sure to keep up with all new content from Hashmap here.

John Aven, Ph.D., is the Director of Engineering at Hashmap, providing Data, Cloud, IoT, and AI/ML solutions and consulting expertise across industries with a group of innovative technologists and domain experts accelerating high-value business outcomes for our customers. Be sure to connect with John on LinkedIn and reach out for more perspectives and insight into accelerating your data-driven business outcomes.
