This Code is Clean!

Chantelle Whelan
Dec 6, 2023

As I mentioned in my previous post, I have been learning a lot about how to write ‘clean’ code over the last year or so. I have run training sessions for the other analysts in my team at work and thought it would make a good second blog post. Hopefully this can help others who are new to coding or seasoned coders who would like to improve the structure of their code. So without further ado, here are the key concepts I’ve come across.

What is ‘Clean Code’ and why is it important?

Clean code is essentially code that is easy to read, understand and maintain. The idea applies to all programming languages, although certain concepts may be specific to a particular language. Writing code in this way benefits both the writer and anyone who makes use of the code once it’s written.

Clearly written code makes it a lot easier to figure out what the code is doing. If others read it (or you come back to it after some time has passed), less time is spent trying to decipher what is going on. Many of us have inherited incomprehensible code and subsequently spent a good chunk of time figuring it out. The problem is heightened when it comes to debugging. Imagine you have two pieces of code: one clean and easy to understand, the other a bit of a jungle that you’ll need to pick apart. If both run into an error, the clean code will probably take a lot less time to debug than the messy code.

Clarity also helps ensure the code is robust and does what is intended, because it is easier to test. Another benefit is that it future-proofs the code. Clean code is less likely to hard-code references to specific objects or data, so if it is automated or run again in future it requires less maintenance or, better still, none. In summary, it makes everyone’s life easier!

Key definitions and elements

To start off with, here are some terms that are frequently used in relation to clean code:

Refactoring — A technique for restructuring code without changing its functionality

DRY — Don’t Repeat Yourself!

Magic Numbers — Numbers that are used in code without a clear explanation of what they are

Function — a piece of code that makes a piece of logic reusable in a consistent and readable way

Modules — Code broken up into parts (usually groups of functions that accomplish similar tasks, e.g. data cleanse module)

Dead code — Code that isn’t executed, or if it is executed has no effect on the overall code’s functionality

Lint/Linter — Tool that highlights syntactical and stylistic problems in code i.e. assesses the quality of code (e.g. lintr package in R)

Test Driven Development (TDD) — An approach to development in which tests are written before the code they check, and the code is then written to make those tests pass
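To make a few of these terms concrete, here is a small sketch in Python of refactoring repeated code so that it is DRY and free of magic numbers. The data, the `rescale` function and the constant names are all made up for illustration.

```python
# Before refactoring: the same logic is written out twice, and 100 and 1000
# are magic numbers (nothing says what they represent).
heightsCm = [152.0, 180.5]
weightsG = [61000.0, 82500.0]

heightsM = [round(h / 100, 2) for h in heightsCm]
weightsKg = [round(w / 1000, 2) for w in weightsG]

# After refactoring: the logic lives in one function (DRY) and the
# numbers have names that explain them. All names here are hypothetical.
CM_PER_M = 100
G_PER_KG = 1000


def rescale(values, divisor, decimals=2):
    """Divide every value by the divisor and round the result."""
    return [round(v / divisor, decimals) for v in values]


heightsMRefactored = rescale(heightsCm, CM_PER_M)
weightsKgRefactored = rescale(weightsG, G_PER_KG)

# Refactoring restructures the code without changing what it does.
print(heightsM == heightsMRefactored and weightsKg == weightsKgRefactored)
```

The final comparison is the point of refactoring: the structure changes, the results do not.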

In the remainder of this post, I will touch on the key elements I have come across to writing clean code. As I am a data analyst, the examples given are geared more towards the type of tasks undertaken for analysis. Hopefully they will be useful to those using code for other purposes too. The areas I will cover are:

  • General code structure
  • Documentation
  • Planning for production (including testing)

General Code Structure

Folder and File Naming Conventions

So before we even begin to write code, it is good practice to consider where your code is saved. Ideally it lives in a central repository managed by a version control system such as Git.

It is also good practice to give folders and files names in a consistent format. Ideally names should be short and readable. To keep them machine readable, avoid spaces and use underscores (_), dashes (-) or CamelCase instead. If a file has multiple versions, it can help to begin the file name with a date in a year-first format (e.g. YYYY-MM-DD) so that the most up-to-date version is easy to identify. If the files are numbered in sequence, pad the numbers with leading zeros (e.g. 001) so that they still sort and read correctly, particularly once you go beyond double digits.
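As a rough illustration of these conventions, the Python sketch below builds file names from a date, a padded sequence number and a description. The `buildFileName` function and the exact name layout are my own assumptions, not a standard.

```python
from datetime import date


def buildFileName(description, fileDate, sequence, extension):
    """Build a name such as 2023-12-06_003_survey_results.csv (hypothetical layout)."""
    # Swap spaces for underscores so the name is machine readable
    safeDescription = "_".join(description.lower().split())
    # isoformat() gives YYYY-MM-DD; :03d pads the sequence number to three digits
    return f"{fileDate.isoformat()}_{sequence:03d}_{safeDescription}.{extension}"


fileNames = [
    buildFileName("Survey Results", date(2023, 12, 6), 3, "csv"),
    buildFileName("Survey Results", date(2023, 1, 15), 12, "csv"),
    buildFileName("Survey Results", date(2022, 11, 30), 1, "csv"),
]

# Because the year comes first, a plain alphabetical sort is also chronological
for name in sorted(fileNames):
    print(name)
```

Putting the year first is what makes the ordinary alphabetical sort in a file browser line up with date order.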

Variable usage

When I began coding, I wasn’t aware of the concept of creating standalone variables. To me, variables were synonymous with columns in a dataset (for any others who come from a scientific research background, there are independent and dependent variables, right?). But once I learned you could create a variable as a type of object, completely separate from any data you might be using, it changed my code completely. Instead of writing code littered with ‘magic numbers’ (numbers with no explanation) and random values, I replaced these with variables.

The following example is some R code that filters a dataframe by the ‘age’ and ‘employmentStatus’ columns. The new dataframe should only include records where the respondent is over 25 and employed full-time. The first snippet does not make use of variables and the second does. Someone unfamiliar with the code would find it much clearer from the second snippet what is being filtered for. In addition, if at a later date I decide to raise the age filter or include another employment status, I would not need to go through the code changing these values everywhere they are used (in a long piece of code these filters may have been used hundreds of times!). I would just need to change the values once, where they are assigned to variables.

library(dplyr)

# No variables assigned: 25 and 'Full-time' appear with no explanation
dfFiltered <- df %>%
  filter(age > 25) %>%
  filter(employmentStatus == 'Full-time')

# Variables assigned: each value is named once and reused
EligibleAgeLimit <- 25
EligibleEmploymentStatus <- 'Full-time'

dfFiltered <- df %>%
  filter(age > EligibleAgeLimit) %>%
  filter(employmentStatus == EligibleEmploymentStatus)

The main point to keep in mind when naming a variable is that the name should always reveal the intent behind creating it. For example, if I’d named the variables ‘var1’ and ‘var2’, the reader would be given no information about what they mean or how they should be used in the remainder of the code.

Modular Code

Code can be very complex and difficult to understand. It is important that code is made as easy to read as possible to ensure it is reproducible and can be interpreted by others. Modular code is a technique used in software engineering that breaks up code into small independent sections based on functionality. There are two stages to modular code:

  1. Create functions with specific inputs and outputs
  2. Group functions into modules based on usability

So let’s dive a little deeper into functions and modules…

Functions

A function is a piece of code that makes a piece of logic reusable in a consistent and readable way. Functions are a great way to keep code DRY (no repetition) and to test logic. Some things to consider when creating a function include:

  • Giving it a name that makes it clear what the function does
  • It should do one thing (it can do more, but it is more understandable to the reader if it is kept to one)
  • In line with the previous point, try to keep it to no more than two levels of indentation
  • For something complex, create a high-level function made up of smaller functions
  • It should not depend on or affect variables that have not been passed in as arguments
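The points above might look something like this in Python. It is only a sketch: the function names and the scores are invented for illustration.

```python
def removeMissing(values):
    """Return a new list without the missing (None) entries."""
    return [v for v in values if v is not None]


def calculateMean(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def centre(values):
    """Subtract the mean from every value."""
    average = calculateMean(values)
    return [v - average for v in values]


def prepareScores(rawScores):
    """High-level function made up of the smaller functions above."""
    return centre(removeMissing(rawScores))


rawScores = [4, None, 6, 8]
preparedScores = prepareScores(rawScores)
print(preparedScores)
print(rawScores)  # the input list is left untouched
```

Each function has a name that says what it does, does one thing, and works only with what is passed in as arguments. The high-level function then reads almost like a list of steps.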

Modules

A module is a script containing a set of related functions to be used in other scripts. Modules are commonly called from a master/main script. Below is an example of a script in Python that cleanses data in preparation for modelling, fits a model to the data, gets predictions from the fitted model and evaluates the results. Four modules are created: 1. DataCleanse (functions to clean the data), 2. Preprocessing (functions to prepare the data for modelling), 3. Modelling (functions to fit a model to data) and 4. Evaluation (functions to evaluate the model). The master/main script imports the functions it needs from each of these modules and then calls them in turn.

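import pandas as pd
from DataCleanse import clean
from Preprocessing import scaler
from Modelling import fit, predict
from Evaluation import accuracy

# Load the raw data, then clean and scale it
df = pd.read_csv('filepath/data.csv')
df = scaler(clean(df))

# Fit the model and get predictions from it
model = fit(df)
results = predict(model, df)

# testValues (the known outcomes to compare against) is assumed to be defined elsewhere.
# The score gets its own name so that it does not overwrite the imported accuracy function.
accuracyScore = accuracy(testValues, results)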

This style of coding took me a long time to get to grips with. It required getting into the habit of creating a function for everything I wanted to do in my code. I had only created a few functions prior to learning about modular code and I found it difficult to envision what code made up of just functions would look like. A technique that helped was to take each bit of code I would have written without functions and write a function for it. So if I wanted to add 1 to all the values in a dataframe column, instead of writing that out as:

df['newCol'] = df['oldCol'] + 1

I could create a function that does the calculation, like this:

def addValue(column, value):
    x = column + value
    return x

# The original line then becomes a call to the function
df['newCol'] = addValue(df['oldCol'], 1)

Once that had been done for all the code, I’d go through and group related functions together to create modules. It was a bit more time consuming doing it this way, but I soon got the hang of it and it is almost second nature to me now.

Style / Format

Each coding language has its own style guidelines. Adhering to these ensures your code is formatted consistently and is generally nice to look at. It is definitely worth looking these up and following them as much as possible. One area in particular where they come in handy is documentation, which brings us on to the next section of this post…

Documentation

Aaah, good old documentation. Every coder’s favourite topic! I know, it can be quite laborious to go through your code writing comments, putting together Read Me documents, ensuring instructions are clear, etc. But at the risk of sounding like a nagging parent, it’s important. And can save both you and others so much time and energy down the road.

Comments

These are the most common form of documentation within code. They are short pieces of text within the code that are not executed when the programme runs. They generally provide context for what the code is doing and should be used sparingly, either to signpost the start or end of a section of code or to flag something, e.g. highlighting that a table sometimes includes duplicate records and so needs to be deduped.
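As a small sketch of that deduping example, the Python below uses one comment to signpost a section and one to flag the known issue. The records, field names and `dedupeRecords` function are hypothetical.

```python
def dedupeRecords(records):
    """Keep the first record seen for each respondent and survey date."""
    seenKeys = set()
    uniqueRecords = []
    for record in records:
        key = (record["respondentId"], record["surveyDate"])
        if key not in seenKeys:
            seenKeys.add(key)
            uniqueRecords.append(record)
    return uniqueRecords


# --- Data cleansing ---

responses = [
    {"respondentId": 1, "surveyDate": "2023-11-01", "score": 7},
    {"respondentId": 1, "surveyDate": "2023-11-01", "score": 7},
    {"respondentId": 2, "surveyDate": "2023-11-02", "score": 9},
]

# NOTE: the source table sometimes includes duplicate records,
# so it needs to be deduped before any analysis.
cleanResponses = dedupeRecords(responses)
print(len(cleanResponses))
```

Notice there is no comment on each line saying what it does. The function and variable names carry that, and the comments are saved for things the code cannot say for itself.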

Docstrings

Docstrings are multi-line descriptions that are generally written at the beginning of a script and within functions. They provide an overview of the code purpose, action and outcomes. For example, if I were to add a docstring to the ‘addValue’ function in the earlier code snippet, it would look something like this:

def addValue(column, value):
    """
    Adds the given value to the given column and returns the result.

    Parameters:
        column (numeric): The column that the value is added to.
        value (numeric): The value to be added to the column.

    Returns:
        numeric: The result of adding the value to the column.
    """
    x = column + value
    return x
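One practical benefit of a docstring over an ordinary comment is that Python keeps it attached to the function, so it can be read back without opening the source file. Here is a minimal sketch using the standard library’s `inspect` module and a shortened version of the docstring above:

```python
import inspect


def addValue(column, value):
    """
    Adds the given value to the given column and returns the result.

    Returns:
        numeric: The result of adding the value to the column.
    """
    return column + value


# inspect.getdoc returns the docstring with its indentation tidied up.
# Calling help(addValue) in an interactive session shows the same text.
documentation = inspect.getdoc(addValue)
print(documentation)
```

This is also what documentation tools and many code editors rely on when they show a description of a function as you type its name.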

If I haven’t sold you on documentation yet, it has become a lot easier to get documentation generated automatically with tools such as PyCharm or ChatGPT.

Plan for production

The majority of my learnings on writing clean code stemmed from some work I undertook, which involved writing code that would go into production. When your code is going to be triggered automatically on a regular basis and errors could mean key deliverables aren’t met, it is extremely important to write code that isn’t going to break! When writing code for production the following is worth bearing in mind:

  • It’s important to apply test driven development (TDD) where possible. Regular testing throughout your code helps to ensure the code is performing as you would expect.
  • Remove ‘dead code’. Code that is not required shouldn’t be left in. It clutters the code and could cause an error when the code is triggered from a different environment.
  • Scripts are preferable to notebooks. It is more difficult to productionise notebooks and it is more difficult to modularise code in this format, so it’s not ideal for production.
  • Make small and frequent commits to ensure the code repository is as up-to-date as possible, thus reducing the risk of losing unsaved changes.
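To give a flavour of the testing point, here is a rough sketch using Python’s built-in `unittest` module and the ‘addValue’ function from earlier. In a test-first workflow the tests would be written before the function body. The test names and the chosen cases are my own assumptions about what is worth checking.

```python
import unittest


def addValue(column, value):
    """Add the given value to the given column and return the result."""
    return column + value


class TestAddValue(unittest.TestCase):
    def testAddsPositiveValue(self):
        self.assertEqual(addValue(2, 1), 3)

    def testAddsNegativeValue(self):
        self.assertEqual(addValue(2, -5), -3)

    def testRejectsTextValue(self):
        # Adding text to a number should fail loudly rather than pass silently
        with self.assertRaises(TypeError):
            addValue(2, "1")


# Load and run the tests explicitly so the outcome can be inspected afterwards
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestAddValue)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())
```

In a real project the tests would usually sit in their own file and be run automatically (for example on every commit), so that a change which breaks the function is caught before it reaches production.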

Final thoughts

So I think I have now written down everything I have learnt about how to write clean code. I hope you have found it helpful/interesting. I’m sure there’s plenty I haven’t covered, so please do let me know in the comments, as I’m always eager to learn more on this topic. I’d like to give a shout out to my colleague, Peter, who ran the clean code training session with me at work. Thank you Pete!

Here are some of the sources I’ve come across that helped me on my clean code journey.

1 Structuring your project — Quality Assurance of Code for Analysis and Research (best-practice-and-impact.github.io)

2 https://best-practice-and-impact.github.io/qa-of-code-guidance/intro.html

3 https://www.thoughtworks.com/insights/blog/coding-habits-data-scientists

4 https://github.com/davified/clean-code-ml/blob/master/docs/functions.md

5 https://towardsdatascience.com/how-to-write-a-production-level-code-in-data-science-5d87bd75ced

6 https://towardsdatascience.com/cleaning-refactoring-and-modular-the-must-foundations-to-improve-your-python-code-and-carrer-65ef71cdb264

7 https://towardsdatascience.com/5-reasons-why-you-should-switch-from-jupyter-notebook-to-scripts-cb3535ba9c95

8 https://dev.to/levivm/coding-best-practices-chapter-one-functions-4n15

9 Modular code — Quality Assurance of Code for Analysis and Research (best-practice-and-impact.github.io)

10 https://best-practice-and-impact.github.io/qa-of-code-guidance/code_documentation.html

11 https://www.programiz.com/python-programming/docstrings

12 styleguide | Style guides for Google-originated open-source projects

13 Five Tips for Automatic Python Documentation | by Louis de Bruijn | Towards Data Science

14 Pylint Usage in Python — Read and Learn with Examples (techbeamers.com)

15 Version control (researchcodingclub.github.io)

16 What is code testing and why is it important? (zeolearn.com)

17 https://towardsdatascience.com/unit-testing-for-data-scientists-dc5e0cd397fb

18 https://www.peterbaumgartner.com/blog/testing-for-data-science/?utm_campaign=Data_Elixir&utm_source=Data_Elixir_368/

Chantelle Whelan

Academic turned data analyst with a passion for DataOps and ethical machine learning