Published in Analytics Vidhya

Good Software Engineering Practices for Data Scientists

Photo by Christina @ wocintechchat.com on Unsplash

Clean Code

“You and I no longer engage in verbal confabulation”

“We don’t talk anymore”

  • Use descriptive names: Avoid single-letter names such as a, b, c. Prefer names like student, subject, grade. Prefix boolean names with is_ or has_, as in is_graduating or has_passed, and start function names with a verb, as in change_grade(). Single-letter names are acceptable in conventional cases, such as x and y for independent and dependent variables (as is common in mathematics) or n for a count.
  • Don’t make names too long: Be descriptive, but stop short of names like is_the_student_graduating_in_this_term. Overly long names make your code cumbersome.
  • Be consistent but differentiate clearly: student and students are easy to confuse, which invites mistakes. Use student_list instead of students so that the two names are easier to tell apart and read.
  • Avoid obscure abbreviations: Use an abbreviation only if it is widely recognized. For example, CGPA is fine because most readers know what it means, but BRB might not be recognizable to everyone, so avoid abbreviations of that kind. Some terms are well known among data scientists, which makes their abbreviations tempting, but other engineers on your team might not be familiar with them. In that case, use a descriptive name so that the code is easier for them to work with.
  • Use whitespace properly: If you are a Python programmer, you already know that Python emphasizes readability. Unlike most other languages, it uses indentation rather than braces to mark blocks of code, so you have already seen how whitespace makes code easier to read. In addition to using four spaces for indentation, use blank lines to separate sections of your code. That makes scrolling through your code much easier. Also limit each line of code to 79 characters, which is the guideline in the PEP 8 style guide. Many code editors show a vertical line at the 79-character mark. If you don’t see one, you might be able to turn it on in the settings.
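The naming and whitespace guidelines above can be sketched in a few lines of Python. This is only an illustration: PASSING_GRADE, count_passed_subjects and the sample grades are hypothetical names and values, not taken from any real project.

```python
PASSING_GRADE = 50  # hypothetical threshold, chosen for illustration


def has_passed(grade):
    """Return True if the grade meets the passing threshold."""
    return grade >= PASSING_GRADE


def count_passed_subjects(grade_by_subject):
    """Count the subjects whose grade is a passing grade."""
    passed_count = 0
    for grade in grade_by_subject.values():
        if has_passed(grade):
            passed_count += 1
    return passed_count


# Blank lines separate the definitions from the code that uses them.
grade_by_subject = {"math": 72, "physics": 48, "history": 90}

# The is_ prefix signals that this name holds a boolean.
is_graduating = count_passed_subjects(grade_by_subject) >= 2
```

Every name says what it holds, the boolean names read like yes/no questions, and no line exceeds 79 characters.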

Modular Code

  • Don’t Repeat Yourself: If you find yourself performing the same kind of task several times, create a function or use a loop. This makes your code not only less repetitive but also more readable.
  • Minimize the number of entities (functions, classes): It is possible to over-modularize your code. Creating a function for everything is not always better. If you expect to use a piece of logic only once, there is no need to define a function for it; write the logic inline. An unnecessary number of functions or modules forces you to jump around the code to understand its logic. So create functions or classes only when they are needed.
  • A function should do one thing: Each function should aim to accomplish one particular task. If a function name contains ‘and’, that is a sign the code needs refactoring. For example, change_grade_and_find_position() should be split into two functions, change_grade() and find_position(). This makes your functions easier to generalize and reuse.
  • Use modules: If you have several functions or classes that you use in several files, write them in a separate file and import them wherever you need them. This keeps your code short, maintainable and reusable.
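As a minimal sketch of the single-task guideline, here is one way change_grade_and_find_position() could be split. The function bodies, the dictionary layout and the ranking rule (highest grade first) are assumptions made for illustration.

```python
def change_grade(grade_by_student, student, new_grade):
    """Return a copy of the grades with one student's grade replaced."""
    updated_grades = dict(grade_by_student)
    updated_grades[student] = new_grade
    return updated_grades


def find_position(grade_by_student, student):
    """Return the student's 1-based rank, highest grade first."""
    ranked_students = sorted(
        grade_by_student, key=grade_by_student.get, reverse=True
    )
    return ranked_students.index(student) + 1


grade_by_student = {"Ada": 45, "Grace": 82, "Linus": 70}

updated_grades = change_grade(grade_by_student, "Ada", 90)
new_position = find_position(updated_grades, "Ada")
```

Because the two tasks are separate, find_position() can be reused on any grade dictionary, whether or not a grade was changed first. If both functions were needed in several files, they could live in their own module and be imported from there.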

Refactoring

Efficient Code

  • The first, intuitive approach uses two for loops. The outer loop runs through the items of the first list, and for each item an inner loop runs through the items of the second list and compares the two items to find the common ones.
  • The second solution uses NumPy, my go-to Python library for vectorized implementations.
  • The third solution is one I might never have thought of if someone hadn’t told me that it is much faster. It uses Python’s built-in ‘set’ type.
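The three approaches to finding the items common to two lists might look roughly like the sketch below. The function names and sample lists are hypothetical, and np.intersect1d is just one possible vectorized call, not necessarily the one used in the original code. The sketch assumes each list contains no duplicate items.

```python
def find_common_with_loops(first_list, second_list):
    """Compare every pair of items with two nested for loops."""
    common_items = []
    for first_item in first_list:
        for second_item in second_list:
            if first_item == second_item:
                common_items.append(first_item)
                break  # this item is matched, move on to the next one
    return common_items


def find_common_with_numpy(first_list, second_list):
    """Vectorized variant; requires the third-party NumPy package."""
    # Imported here so the other two functions run without NumPy.
    import numpy as np

    return np.intersect1d(first_list, second_list).tolist()


def find_common_with_set(first_list, second_list):
    """Use the built-in set type; the order of the result is not fixed."""
    return list(set(first_list).intersection(second_list))


first_list = [3, 8, 15, 21, 42]
second_list = [42, 7, 8, 99]

print(find_common_with_loops(first_list, second_list))
print(sorted(find_common_with_set(first_list, second_list)))
```

The nested loops compare every pair of items, so their work grows with the product of the two list lengths. The set version relies on hash lookups instead. To compare the approaches on your own data, the standard library's timeit module is a reasonable starting point.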

Documentation

  • Line-level documentation: Inline comments, or line-level documentation, describe the steps of your code. They help others understand what is going on without having to understand every function or piece of logic in depth. Be careful, though: if your code depends heavily on comments to be understandable, that may be a sign of poor coding practice, so consider refactoring. Different programming languages use different symbols for comments: Python uses the hash sign (#), and JavaScript uses a double slash (//).
  • Function- or module-level documentation: In Python, a docstring serves as function- or module-level documentation. It provides useful information about a function or module and is placed at the very beginning of it. It usually describes the function’s purpose, the arguments the function takes, and what it returns. All of this is optional, though. A docstring is written inside triple quotes (either single or double quotes work).
  • Project-level documentation: Project-level documentation describes your project to others. It explains what your program does, how to use it, and everything else someone needs in order to make your project work. A README file is a great place for project documentation. If you go to GitHub and look at any open-source project, the first thing you will probably look for is the README. Without proper project documentation, people can’t use your application or package because they won’t know how. So always include a README file that at least describes what your project does, its dependencies, and sufficient instructions for using it. For inspiration, check out any popular open-source project on GitHub, for starters the pandas repository.
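A docstring combined with a couple of inline comments could look like the sketch below. The function calculate_cgpa, its arguments and the Args/Returns/Raises layout are illustrative assumptions; other docstring layouts are equally valid.

```python
def calculate_cgpa(grade_points, credits):
    """Return the credit-weighted average of the grade points.

    Args:
        grade_points: Grade point earned in each course.
        credits: Credit value of each course, in the same order.

    Returns:
        The cumulative grade point average as a float.

    Raises:
        ValueError: If the lists differ in length or are empty.
    """
    if len(grade_points) != len(credits) or not credits:
        raise ValueError("grade_points and credits must match in length")

    # Weight each grade point by the credits of its course.
    weighted_total = sum(
        point * credit for point, credit in zip(grade_points, credits)
    )

    return weighted_total / sum(credits)
```

Calling help(calculate_cgpa) in an interpreter prints this docstring, which is one reason to keep it accurate.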

Version Control

Testing

Logging

  • Be professional. Avoid log messages like “hmm, something failed” or “something happened!”.
  • Use normal capitalization and keep messages concise. Don’t make messages too long or too short.
  • Use logging levels to indicate what type of message each one is, for example CRITICAL, ERROR, WARNING, INFO and DEBUG. Use them properly.
  • Provide useful information. “Failed to load a file” doesn’t give you enough to act on; use “Failed to load file123” instead.
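These guidelines can be sketched with Python’s standard logging module. The logger name grade_report, the function load_grades and the message wording are hypothetical; only the logging calls themselves come from the standard library.

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("grade_report")  # hypothetical logger name


def load_grades(path):
    """Return the lines of a grades file, or None if it cannot be read."""
    try:
        with open(path, encoding="utf-8") as grades_file:
            lines = grades_file.read().splitlines()
    except OSError as error:
        # ERROR level, and the message names the file that failed.
        logger.error("Failed to load file %s: %s", path, error)
        return None

    # INFO level for a routine event that is still worth recording.
    logger.info("Loaded %d lines from %s", len(lines), path)
    return lines
```

The error message names the file and the underlying reason, so whoever reads the log does not have to guess which load failed.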

Conclusion

Nazmul Ahsan

Software engineer at Optimizely. Find me on twitter @AhsanShihab_