Data Scientists…what do you mean that you’re not “programmers”?


My Introduction to Data Science

A few years ago, I was introduced to some real-world implementations of Machine Learning. The concepts of “Machine Learning” and “Artificial Intelligence” have been around for as long as I can remember (granted, I have only been programming seriously for ~10 years now). The explanations I was given were simple, but they attempted to provide solutions to real-world problems that were occurring in our space. As I write this, I no longer remember the problems or solutions that were presented, but I do remember one thought blaring in the back of my mind:

More Flashbacks To Come

In August 2019, I made the choice to join an amazing group of people at Quickpath as a Platform Architect. It was an incredible opportunity because I was given the chance to lead the design and development of a platform that exists nowhere else in the industry, the Quickpath Platform (if you want to learn more, reach out to me on LinkedIn). Over the last year, I have learned a tremendous amount about what a day in the life of a data scientist looks like, as well as how they approach many different problems. While I still don’t fully understand all of the implementations of Machine Learning, I have learned far more than I ever thought I would just a few years ago. Through all of these interactions, I have realized that a single thing has remained true since I worked as an Undergraduate Research Assistant for the Data-Intensive Scalable Computing Laboratory under Dr. Yong Chen at Texas Tech University:

Scientists Are Not Software Engineers

No, and for good reason. I attempted to get into the world of linear regression, logistic regression, Naive Bayes classification, KNN, and more, and was quickly overwhelmed. Now let us throw in that they also have to understand the frameworks and standards available in their programming language of choice (e.g. Python): TensorFlow, SciKit-Learn, Keras, Predictive Model Markup Language (PMML), and many more. Can you expect any person to know all of that, as well as all of the things that we have to know as software engineers (e.g. RESTful APIs, database modeling, multiple programming languages, implementing authentication & authorization, etc.)? I wouldn’t, and neither should you.

Who Cares If The Code Works?

If you are or were a software engineer, you have likely lost count of the number of times you have heard someone ask this question. Regardless of your feelings towards it, there is one thing you should be aware of when you hear it: technical debt. If you are reading this as a scientist, you may never have heard of this concept before now, and you may even be wondering why it matters to you. I might not know how your mind works when you’re designing and tweaking your data models based on your findings, but I do know how versioning code and data tends to go over time if you don’t pay attention. If you‘re still writing data models and not using Git, Subversion, or even Data Version Control (DVC), then you are likely doing something akin to this meme:

https://www.reddit.com/r/ProgrammerHumor/comments/72rki5/the_real_version_control/
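If versioning your work is new to you, here is a minimal sketch of what tracking a model with Git plus Data Version Control (DVC) can look like. The file name is just a placeholder for whatever artifact you produce:

    # Initialize Git and DVC in your project directory (one-time setup).
    git init
    dvc init

    # Track the large model artifact with DVC instead of Git; this writes
    # a small model.pkl.dvc pointer file and adds model.pkl to .gitignore.
    dvc add model.pkl

    # Commit the pointer file so the model version travels with your code.
    git add model.pkl.dvc .gitignore
    git commit -m "Track baseline model with DVC"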

What Do You Suggest I Do?

Before I get started, I am going to reference the “master” branch of Git quite a few times in the following section. Please know that I am not using the term in any derogatory manner; I am fully aware of, and stand behind, the effort to rename “master” branches to something else. As an example, GitHub has announced that they are making a massive effort to abandon “master” as the default protected branch in their repositories. Even the most recent release of Git (v2.28.0 as of this writing) introduced a configuration option that lets you specify a default branch name of your choosing when creating a new repository.
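For reference, that option looks like this in Git 2.28 and later (“main” is just one popular choice):

    # Use "main" as the default branch for every new repository on this machine.
    git config --global init.defaultBranch main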

Git Flow

The first option (and my personal favorite) follows the Git Flow process. In this process, while you are doing the initial design of your models, you create branches from the “develop” branch (one of the two “protected” branches) of your Git repository. When you decide that you want to make a change to the model or data, you simply create a new branch from “develop”, commit your changes, and then merge those changes back into “develop”. You keep following this pattern until you have a trained model that you believe is ready for final validation and testing. At that point, you create a “release” branch where you do final validation, testing, and bug fixes to your model. When you feel it is ready for “production”, you merge the release branch into your “master” branch (the other “protected” branch), create a tag that follows Semantic Versioning guidelines, and merge the release back into “develop” so your fixes are not lost. Here is a quick visualization that might help if you are a visual person like me; a command-line sketch follows the image:

Image depicting the process for managing source code using the Git Flow process model.
https://jeffkreeftmeijer.com/images/gitflow.gif
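Here is that same loop as a minimal sketch in plain Git commands; the branch and tag names are only examples:

    # Start a change from the "develop" branch.
    git checkout develop
    git checkout -b feature/tune-hyperparameters

    # ...edit your model or data, then commit and merge back into "develop".
    git commit -am "Tune model hyperparameters"
    git checkout develop
    git merge --no-ff feature/tune-hyperparameters

    # When the model is ready for final validation, cut a release branch.
    git checkout -b release/1.0.0

    # After testing and bug fixes, promote it to "master" and tag it
    # following Semantic Versioning guidelines.
    git checkout master
    git merge --no-ff release/1.0.0
    git tag -a v1.0.0 -m "First production-ready model"

    # Merge the release back into "develop" so the fixes are not lost.
    git checkout develop
    git merge --no-ff release/1.0.0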

GitHub Flow

The second option (my least favorite) is the GitHub Flow process. In this process, you start from the “master” branch and create all of your branches (fixes, features, releases, etc.) from there. Once you have made all of the changes you wish to make to your model and/or data, you test and validate that model in much the same way you would test and validate the “release” branches created in the Git Flow process. Once you are sure the changes are ready for “production”, you merge them into the “master” / “main” branch of your repository.

Visualization of the GitHub Flow process.
https://arccwiki.uwyo.edu/images/1/19/GitHub_Flow_steps.png
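Again, a minimal command-line sketch, with hypothetical branch names:

    # Branch directly off "master" (or "main").
    git checkout master
    git checkout -b fix/normalize-inputs

    # Commit your changes, then push and open a pull request for review.
    git commit -am "Normalize model inputs before scoring"
    git push -u origin fix/normalize-inputs

    # Once the changes are validated, merge them into "master" and deploy.
    git checkout master
    git merge --no-ff fix/normalize-inputs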

To Production…and beyond!

Deploying a Machine Learning model to “production” is such a hot topic that VentureBeat recently reported that 87% of Machine Learning models never make it into production. Those that do are generally not repeatable or scalable across a large array of teams. As an example, here are just a few of the questions that have been discussed online:

  • What am I supposed to do with the output predictions?
  • What actions am I supposed to take based on the business rules around the predictions?
  • Where do I store those predictions, along with the inputs that produced them, so I can analyze the run-time executions of my model?
  • How do I deal with releasing a new “version” of my ML model without breaking production?

But wait, there’s more?

That is all, for now. I will write other posts in the future to share some of what I have learned while helping data scientists move models from source control to something usable and executable in a production-grade environment. Here are a few ideas for future topics that I have bouncing around in my head. Respond below and let me know which ones you would like to know more about (or even suggest your own), and I will prioritize based on your responses!

  1. My IT organization wants me to do CI/CD for my project. What does that mean?
  2. A developer told me that Git is a terrible place to store my <insert 1GB+ file size> model because it is terrible with storing large files. What do I do now?

My thoughts are my own.