Engineering best practices for Data Science projects
Making your data science project more reliable, testable, and deployable
Introduction
In this post, we will look at some best practices for improving the quality and reliability of production Data Science code.
Note: Most of the things mentioned here are not new to the software engineering world, but they often get ignored or missed in the experimental world of Data Science.
Here I will briefly go through the topics and the things we can do to make our project more reliable. I will then write a few follow-up posts that describe each of these steps in more detail using an example project. I will be assuming a Python (PySpark) Data Science project for this post, but the ideas can be applied to any other programming language or project.
I hope you find them useful.
Code Refactoring
Refactoring is the first step toward better code. It is the process of simplifying the design of existing code without changing its behavior.
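To make "simplifying the design without changing behavior" concrete, here is a minimal sketch in plain Python rather than PySpark, so that it runs anywhere. The sample records and every function name below are hypothetical and only serve as illustration. The first function mimics a typical notebook cell where cleaning, grouping, and aggregation are tangled together; the refactored version splits those concerns into small functions and then checks that the result is unchanged.

```python
from statistics import mean

# Hypothetical sample records, standing in for rows loaded in a notebook cell.
RECORDS = [
    {"city": "Paris", "temp_c": 21.0},
    {"city": "Paris", "temp_c": None},
    {"city": "Oslo", "temp_c": 12.0},
    {"city": "Oslo", "temp_c": 14.0},
]


def notebook_style(records):
    """Before: cleaning, grouping, and aggregation mixed in one block."""
    out = {}
    for r in records:
        if r["temp_c"] is not None:
            if r["city"] not in out:
                out[r["city"]] = []
            out[r["city"]].append(r["temp_c"])
    for k in out:
        out[k] = sum(out[k]) / len(out[k])
    return out


def drop_missing(records, field):
    """Keep only the records where `field` has a value."""
    return [r for r in records if r[field] is not None]


def group_values(records, key, field):
    """Collect the values of `field` for each distinct value of `key`."""
    groups = {}
    for r in records:
        groups.setdefault(r[key], []).append(r[field])
    return groups


def mean_temp_by_city(records):
    """After: the same computation, composed from small, testable steps."""
    clean = drop_missing(records, "temp_c")
    groups = group_values(clean, "city", "temp_c")
    return {city: mean(temps) for city, temps in groups.items()}


# A refactoring must not change behavior, so both versions have to agree.
assert notebook_style(RECORDS) == mean_temp_by_city(RECORDS)
```

The final assertion is the important part: keeping the old version around long enough to compare outputs is one simple way to confirm that a refactoring changed only the design and not the result.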
Data science projects are usually written in Jupyter notebooks and can get out of control pretty easily. A code refactoring step is highly recommended before moving the code to…