Good Software Engineering Practices for Data Scientists

Published in

Analytics Vidhya

10 min read · Aug 4, 2020

Photo by Christina @ wocintechchat.com on Unsplash

Everybody codes differently. There are no hard and fast rules for how you must approach a problem or how you should implement a solution, but there are certain standards. Often you will be working on a team, or on an open source project where many others work on the same program with you. Your code might even be used in production. So there need to be certain standards to follow.

Data scientists come from different backgrounds. I, for example, came from a background that had nothing to do with programming. So when working with software engineers, you may need to adopt some common coding practices that help you become a better developer and work well with others. In this article, I will discuss some of these common practices.

Clean Code

Clean code means readable, simple and concise code. It is crucial for collaboration and maintainability in software development. Simple, concise code makes it easier for others to understand the logic you have implemented. If you make it unnecessarily complex, then even you won’t be able to understand what’s going on when you search through your code for a possible bug or improvement. Take this for example:

“You and I no longer engage in verbal confabulation”

The sentence makes much more sense if I just say,

“We don’t talk anymore”

Simple code makes everyone’s life easier. Here are some tips for writing clean code:

Use descriptive names: Don’t use single letters such as a, b, c for names. Use something like student, subject, grade. Add a prefix like is_ or has_ to indicate a condition, as in is_graduating or has_passed. Use a verb to name a function, as in change_grade(). Note that single-letter names are acceptable in special cases, like x and y for independent and dependent variables (widely used as variable names in mathematics) or n for a number of things.
Don’t make names too long: Be descriptive, but don’t pack in too much description, as in is_the_student_graduating_in_this_term. This will make your code cumbersome.
Be consistent but differentiate clearly: student and students are easy to confuse, which leads to mistakes. Use student_list instead of students; it is easier to tell apart and to read.
Avoid obscure abbreviations: If the term that your variable or function refers to has a widely recognized abbreviation, you may use it. For example, CGPA is OK because most readers know what it means. But BRB might not be recognizable to everyone, so avoid abbreviations of that kind. Some terms are well known among data scientists, so you may be tempted to abbreviate them, but other engineers on your team might not be familiar with them. In that case, use a descriptive name so that it’s easier for them to work with.
Use whitespace properly: If you are a Python programmer, you already know that Python emphasizes readability. Unlike most other languages, it uses whitespace instead of braces to mark a block of code, so you can see how whitespace makes your code easier to read. In addition to using four spaces for indentation, use blank lines to separate sections of your code. That makes scrolling through your code much easier. See the example code below. Also limit each line of code to 79 characters (the guideline in the PEP 8 style guide). Many code editors can show a vertical line at the 79-character mark. If you don’t see one, you might be able to turn it on in the settings.
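As a minimal sketch of the naming and whitespace tips above (the student data, the passing threshold and the function names are made up for illustration), a short script might look like this:

```python
# Hypothetical data: (student name, grade) pairs invented for this example.
student_grade_list = [("Ana", 3.7), ("Ben", 2.4), ("Chen", 3.1)]
passing_grade = 2.5


def has_passed(grade, passing_grade):
    """Return True if the grade meets the passing threshold."""
    return grade >= passing_grade


def find_passing_students(student_grade_list, passing_grade):
    """Return the names of the students who meet the passing threshold."""
    passing_student_list = []

    # A blank line separates setup, the main loop and the return value.
    for student_name, grade in student_grade_list:
        if has_passed(grade, passing_grade):
            passing_student_list.append(student_name)

    return passing_student_list


passing_student_list = find_passing_students(student_grade_list, passing_grade)
print(passing_student_list)  # ['Ana', 'Chen']
```

The names say what each value holds, the has_ prefix marks a condition, and every line stays under 79 characters.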

Modular Code

Modular code means the code is logically broken up into functions and modules. This lets you quickly find the relevant pieces of code, which helps you understand the program and reuse its parts. Some tips for writing modular code:

Don’t Repeat Yourself: If you find yourself doing the same kind of task several times, create a function or use a loop. It will make your code not only less repetitive but also more readable.

Minimize the number of entities (functions, classes): It is possible to over-modularize your code. Creating a function for everything doesn’t always make it better. If you expect to use a piece of logic only once, there is no need to define a function for it instead of writing the logic inline. An unnecessary number of functions or modules will make you jump around everywhere while reading your code to understand its logic. So create functions or classes only when necessary.
A function should do one thing: Your function should aim to accomplish one particular task. If your function name contains ‘and’, that is a sign you need to refactor your code. For example, change_grade_and_find_position() should be refactored into two different functions, change_grade() and find_position(). This makes your functions easier to generalize and reuse.
Use modules: If you have several functions or classes that you use in several files, try writing them in a separate file and importing them in the files where you need them. This keeps your code short, maintainable and reusable.
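Here is a rough sketch of the "one function, one task" tip, using the change_grade() and find_position() names from above. The article does not show their implementations, so the bodies, the grade_by_student mapping and the sample data are assumptions made for illustration:

```python
def change_grade(grade_by_student, student, new_grade):
    """Return a copy of the grade mapping with one grade updated."""
    updated_grades = dict(grade_by_student)
    updated_grades[student] = new_grade
    return updated_grades


def find_position(grade_by_student, student):
    """Return the student's 1-based rank (highest grade is rank 1)."""
    ranked_students = sorted(
        grade_by_student, key=grade_by_student.get, reverse=True
    )
    return ranked_students.index(student) + 1


# Hypothetical data invented for this example.
grade_by_student = {"Ana": 3.7, "Ben": 2.4, "Chen": 3.1}

updated_grades = change_grade(grade_by_student, "Ben", 3.9)
print(find_position(updated_grades, "Ben"))  # 1
```

Because each function does one job, find_position() can be reused anywhere a rank is needed, without changing any grades.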

Refactoring

When you start solving a problem, you might tend to focus on just getting things to work rather than organizing them nicely. Yes, that’s okay. But once you get it to work, before you jump to deploying your code, take a little time to organize it and make it nice, clean and modular. That is called refactoring. Refactoring means restructuring your code to improve its internal structure without changing its external functionality. It might seem like a waste of time to rearrange code that already works rather than moving on to the next feature, but it helps you in the long run. It will help you maintain your code in the future and reuse it. By practicing refactoring, you will soon develop an intuition for writing organized code, and later you will find that you are writing organized code even on your first try! So it will help you become a better developer.

Efficient Code

Code efficiency can be increased in two ways: by reducing the time the code needs to execute and by reducing the memory it needs. However, the importance of each factor depends on the application. For example, in a self-driving car, the analysis of the road while driving should happen as fast as possible, whereas a system update that happens once a month may tolerate a longer execution time.

To make your code more efficient, you need to experiment with different approaches. Try reducing the number of loops in your code by replacing them with vectorized implementations. Sometimes a built-in data structure can even do a task faster than the vectorized implementation.

How do you find out the best way? Google and Stack Overflow!

I am giving you an example below. Let’s say I have two arrays of items and want to find the items common to both. Here are three approaches that I tried.

The first, intuitive approach uses two for loops. One for loop runs through the items from the first list and, for each item, a second for loop runs through the items from the second list and compares the two items to find the common ones (code is given below).
The second solution uses NumPy, my go-to Python library for vectorized implementations (code is given below).
The third solution is something that I might never have thought of if someone hadn’t told me that it is way faster! It uses Python’s built-in ‘set’ type.
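The original code for these three approaches is not reproduced here, so the following is only a sketch of what they could look like. The function names and input lists are made up, and np.intersect1d is my assumption for the NumPy version. NumPy is a third-party package, so the sketch skips that approach if it is not installed.

```python
import time

try:
    import numpy as np
except ImportError:  # NumPy is third-party and may not be installed.
    np = None


def find_common_with_loops(first_items, second_items):
    """Compare every pair of items with two nested for loops."""
    common_items = []
    for first_item in first_items:
        for second_item in second_items:
            if first_item == second_item:
                common_items.append(first_item)
    return common_items


def find_common_with_numpy(first_items, second_items):
    """Use NumPy's intersect1d, which returns sorted unique values."""
    return np.intersect1d(first_items, second_items)


def find_common_with_sets(first_items, second_items):
    """Use the built-in set type and its intersection method."""
    return set(first_items).intersection(second_items)


# Made-up inputs: the multiples of 2 and of 3 below 3000.
first_items = list(range(0, 3000, 2))
second_items = list(range(0, 3000, 3))

approaches = [find_common_with_loops, find_common_with_sets]
if np is not None:
    approaches.append(find_common_with_numpy)

for find_common in approaches:
    start_time = time.perf_counter()
    common_items = find_common(first_items, second_items)
    elapsed_seconds = time.perf_counter() - start_time
    print(f"{find_common.__name__}: {len(common_items)} common items, "
          f"{elapsed_seconds:.4f} s")
```

The timings depend on your machine and on the size of the inputs, so treat them as a rough comparison. Note also that the loop version keeps duplicates if the inputs contain any, while the NumPy and set versions return each common item once.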

I suggest you run the code yourself and see the difference.

Documentation

Have you ever written a program with great care and then, months later, gone back to check something in it and found that nothing in your code made any sense to you? It’s almost as if somebody else wrote the code. I bet you have. That’s why documentation is important. Documentation is additional text that accompanies your code or is embedded in it. It helps clarify complex parts of your code and describe the use or purpose of specific components to others (and possibly future you!).

You can add three types of documentation to your code: line level documentation (a.k.a. inline comments), function or module level documentation, and project level documentation.

Line level documentation: Inline comments, or line level documentation, describe the steps of your code. They help others understand what’s going on in your code without needing to understand all the functions or logic in depth. Be careful though: if your code depends too much on comments to be understandable, that might be a sign of bad coding practice. Consider refactoring. Different programming languages use different symbols for comments. Python uses the hash sign (#) and JavaScript uses a double slash (//). For example, in my previous code gist, you can see I’ve used several inline comments.
Function or Module level documentation: In Python, a docstring is used for function or module level documentation. It provides valuable information about a function or module and is placed at its beginning. Usually it describes the function’s purpose, the arguments the function takes, what it returns and so on. This is all optional though. A docstring is written inside triple quotation marks (single or double quotes both work). Here is an example.

Project level documentation: Project level documentation describes your project to others. It explains what your program does, how to use it and everything else someone needs to make your project work. A README file is a great place for project documentation. If you go to GitHub and check any open source project, the first thing you will probably look for is the README. Without proper project documentation, people will struggle to use your application or package because they won’t know how. So always include a README file in your project that at least describes what your project does, its dependencies and sufficient instructions for using it. Check out any popular open source project on GitHub; for starters, here is the link to the Pandas repository on GitHub.
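To make the function level docstring described above concrete, here is a small sketch. The function itself is hypothetical, and the Args/Returns/Raises layout is one common convention (Google style) rather than a requirement:

```python
def calculate_average_grade(grade_list):
    """Calculate the average of a list of grades.

    Args:
        grade_list (list of float): Grades to average. Must not be empty.

    Returns:
        float: The arithmetic mean of the grades.

    Raises:
        ValueError: If grade_list is empty.
    """
    if not grade_list:
        raise ValueError("grade_list must contain at least one grade")

    # Sum the grades and divide by how many there are.
    return sum(grade_list) / len(grade_list)


print(calculate_average_grade([3.0, 3.5, 4.0]))  # 3.5

# The docstring is stored on the function and is what help() displays.
print(calculate_average_grade.__doc__.splitlines()[0])
```

The single inline comment explains a step, while the docstring explains how to use the function as a whole.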

Version Control

Version control deserves a separate article of its own to explain its importance. What is version control? Think of video games: when you reach a certain stage, you save your game. Before making a risky decision or trying something stupid, you save your progress so that you can come back after you have ruined your career and pretend nothing happened. A version control system lets you do that for your coding projects. You can ‘save’ your progress at several ‘checkpoints’, and if an experiment fails, you can reset your project back to a saved checkpoint. Without it, your whole project might get ruined and you might need to start from the beginning, which is not very practical.

This is very important in data science. For example, in Machine Learning or Deep Learning, we try out lots of hyperparameters to find the best ones and hope we hit the jackpot! With a version control system, you can experiment with different configurations as many times as you want and save each one with a commit message mentioning the performance scores. When you are done experimenting, you can check your list of commits to see which experiment had the best score. Then you go back to that checkpoint and use that configuration for production.

There are many other use cases for version control. I cannot fit them all here. If you want to get started but are not familiar with any version control system, Version Control with Git is a great free online course from Udacity that helped me a lot.

Testing

Writing tests is a standard practice in software engineering, but many data scientists are unfamiliar with it. Testing is particularly important in data science because its problems are not easily detectable at run time. In other software, if there is a problem, the program often crashes or shows unintended behavior, so you know right away that there is a problem and can fix it. But errors in data science aren’t easily detectable. The code can run smoothly and give you results that seem satisfactory. Behind the scenes, though, some features or values might be used inappropriately, interpreted wrongly or processed differently than they should be. That’s why it is very important to test your implementation, so that you don’t make business decisions based on results that came out of a wrong interpretation of the data.

Test Driven Development (TDD) is a development process where you write tests for tasks before you even write the code to do those tasks. Two common types of tests that you can implement are Unit Tests and Integration Tests.

Suppose you have implemented a function that separates the city name and state name in a user-entered string. A unit test checks whether the function you have implemented gives the expected results. You test the function against various scenarios and edge cases. Unit Tests are independent, isolated tests that don’t depend on any other functions, databases, APIs or other resources. You should write them in pure Python. Integration Tests, on the other hand, check whether all the parts of the program work properly with each other.

In test driven development, you first write test functions for different scenarios and edge cases. Then you start implementing your code and check it against the tests. Once your implementation passes all the tests, you have good evidence that it behaves as intended and you can move on to the next section. TDD is a broad topic and I encourage you to do some research on it. You can also check out this post for a more detailed discussion.
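As a minimal sketch of unit tests for the city and state example above, the following uses plain assert statements. The parse_location() function, its "City, State" input format and the test cases are assumptions made for illustration; in TDD you would write the three test functions first and the implementation afterwards.

```python
def parse_location(location_text):
    """Split text such as 'Austin, TX' into a (city, state) tuple."""
    city, separator, state = location_text.partition(",")
    if not separator or not city.strip() or not state.strip():
        raise ValueError(f"Expected 'City, State', got {location_text!r}")
    return city.strip(), state.strip()


def test_splits_typical_input():
    assert parse_location("Austin, TX") == ("Austin", "TX")


def test_ignores_extra_whitespace():
    assert parse_location("  New York ,NY ") == ("New York", "NY")


def test_rejects_input_without_comma():
    try:
        parse_location("Austin TX")
    except ValueError:
        return
    raise AssertionError("Expected ValueError for input without a comma")


# Run every test by hand; each one raises AssertionError if it fails.
test_functions = (
    test_splits_typical_input,
    test_ignores_extra_whitespace,
    test_rejects_input_without_comma,
)
for test_function in test_functions:
    test_function()
    print(f"{test_function.__name__}: passed")
```

Each test is isolated: it calls only the function under test and needs no database, API or file. A test runner such as pytest can discover functions named test_* in a test file, so the manual loop at the end is only there to keep the sketch self-contained.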

Logging

Logging is a valuable tool that records the events and errors that occur while your program runs. Suppose your program crashed or produced unexpected results in your absence and now you want to debug the problem. You need to know what your program was trying to do when the error occurred and what type of error it hit. You can find these answers if you used logging. Python has a built-in library named ‘logging’ for this purpose.

Here are some tips for writing good log messages.

Be professional. Avoid log messages like “hmm, something failed” or “something happened!”.
Use normal capitalization and concise messages. Don’t make messages too long or too short.
There are several logging levels that you can use to indicate what type of message you are logging, for example CRITICAL, ERROR, WARNING and INFO. Use them properly.
Provide useful information. “Failed to load a file” doesn’t tell you enough; use “Failed to load file123”.
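The tips above can be sketched with Python’s built-in logging module. The logger name, the load_grade_file() function and the file name are hypothetical; the logging calls themselves are standard library functions:

```python
import logging

# Show INFO and above, with a timestamp and the level on every line.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("grade_report")


def load_grade_file(file_path):
    """Return the file's text, or None if the file cannot be read."""
    logger.info("Loading grade file %s", file_path)
    try:
        with open(file_path, encoding="utf-8") as grade_file:
            return grade_file.read()
    except OSError as error:
        # Name the file and the reason so the message is useful later.
        logger.error("Failed to load file %s: %s", file_path, error)
        return None


# Hypothetical file name; if it is missing, an ERROR line is logged.
file_contents = load_grade_file("file123.csv")
```

The INFO message records what the program was trying to do, and the ERROR message names the file and the reason, which is the information you need when debugging after the fact.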

Conclusion

The techniques I have discussed in this article are not mandatory rules but guidelines. If you don’t follow them, your program won’t break. But if you do follow them, your life will be easier. Also, different teams and projects might follow different standards. To be a good team player, it is important that you follow the standard your team uses. The standards discussed in this article were suggested by the data scientists from AWS and Udacity in the curriculum for the AWS Machine Learning Foundational course.