Dall-E generated image

Tidying Up (Python) Scripts

Arieda Muço
5 min read · Nov 19, 2023

Last week, a bundle of Python scripts landed on my desk(top). Each script was a year-specific file for data cleaning and analysis:

data_cleaning_2010.py
analysis_2010.py
data_cleaning_2011.py
analysis_2011.py
data_cleaning_2012.py
analysis_2012.py
data_cleaning_2013.py
analysis_2013.py
... (and so on)

Neatly labeled, sure, but it was a house of cards waiting to collapse. After running data_cleaning_2013.py, I knew frustration wasn't just an emotion—it was my reality.

The Annoyance of Inefficiency

Copy-paste is a siren song for the busy programmer. Yet suppose I make a critical tweak in data_cleaning_2012.py that never makes it into the other, similarly named files. The years are then cleaned differently, and the resulting analysis will be a bit of a mess.

Anticipating the analysis mess may have been the reason for my frustration.

A Cleaner Approach

Here’s a radical thought 😉: What if your project folder contained just one data_cleaning.py and one analysis.py script? Many steps are common across all years, and you could simply add year-specific exceptions as needed.
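To make the idea concrete, here is a minimal sketch of what a single data_cleaning.py could look like. The field names (text, city) and the 2012 fix are made up for illustration; they are not from the scripts I received.

```python
# Sketch of a single data_cleaning.py.
# Field names and the year-specific fix are hypothetical.


def drop_incomplete(rows):
    """Common step: drop rows that have no article text."""
    return [row for row in rows if row.get("text")]


def normalize_text(rows):
    """Common step: collapse repeated whitespace in the text."""
    return [{**row, "text": " ".join(row["text"].split())} for row in rows]


def fix_2012(rows):
    """Year-specific exception: assume 2012 stored city names in upper case."""
    return [{**row, "city": row["city"].title()} for row in rows]


# Only years that need special handling appear here.
YEAR_SPECIFIC_FIXES = {2012: fix_2012}


def clean(rows, year):
    """Run the shared steps, then the year's exception if one is registered."""
    rows = drop_incomplete(rows)
    rows = normalize_text(rows)
    fix = YEAR_SPECIFIC_FIXES.get(year)
    if fix is not None:
        rows = fix(rows)
    return rows


if __name__ == "__main__":
    sample = [
        {"text": "Some   article  text", "city": "SAO PAULO"},
        {"text": "", "city": "RECIFE"},
    ]
    for year in (2011, 2012):
        print(year, clean(sample, year))
```

A tweak to a shared step now reaches every year at once, and anything that applies to one year only sits in one clearly labeled place.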

Many do this, but then end up writing very long files. So my second suggestion is a rule of thumb that circulates among software engineers.

Code should be no longer than 100 lines.

I know what you’re thinking now: “Impossible, there are so many functions and lines of code I have to put into those files, I can’t do the whole data cleaning in 100 lines of code.”

Let me explain what people mean and don’t mean when they give the “100 lines of code” as a rule of thumb.

  • Your code can get a bit longer. The “100 lines of code” rule is arbitrary and may not apply to all situations. The number of lines is less important than the organization and clarity of the code.
  • The principle behind the rule is to keep code readable and maintainable. If you don’t trust me, try to debug a file that has 1000 lines instead. I promise you that it’s going to be your own debugging nightmare.
  • The rule doesn’t stop you from splitting the code across other files, each taking on one aspect of the task. The number of such files is actually unlimited.

Modularization to the Rescue

Let’s assume that I am cleaning scraped newspaper data and extracting entities from each article.

For the project, I have a file called data_cleaning.py. If that file uses several functions, I move them into a separate file named nerFolhaArticles.py and import the ones I need into data_cleaning.py.

So far, the structure of my folder will look like:

data_cleaning.py
nerFolhaArticles.py
analysis.py

So in my data_cleaning.py I will have an import that gives me access to the functions in nerFolhaArticles.py. Here’s a snippet of what my modularized code looks like.

import nerFolhaArticles

# `text` holds the raw text of one article.
entities = nerFolhaArticles.extract_entities(text)

Now, you can imagine what the function extract_entities does. It does exactly what its name says: grabs the text, extracts the entities in it (and returns them in a list).

Here’s the original code that’s included in nerFolhaArticles.py:

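import spacy

# Load the small Portuguese pipeline once, when the module is imported.
nlp = spacy.load('pt_core_news_sm')

def extract_entities(text):
    """Return the text of every entity labeled LOC, MISC, or GPE."""
    doc = nlp(text)
    cities = [ent.text for ent in doc.ents if ent.label_ in {'LOC', 'MISC', 'GPE'}]
    return cities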

This is a toy example and I just “converted” what would have been several lines of code into a couple, but you can see how this approach can have benefits if your functions are more complicated and include nested functions.

Note that you can import functions from another file with a plain import statement only if that file is plain Python, that is, it has a .py extension. For example, you can’t directly import functions from files that have an .ipynb extension. (As if I needed another reason to dislike Jupyter Notebooks.)

My nerFolhaArticles.py has several other functions, and I can import them in the same way as I did with extract_entities. Each function has a single reason to exist.
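To illustrate what “a single reason to exist” can look like, here is a small sketch of text-cleaning helpers. The function names are hypothetical and are not the actual contents of nerFolhaArticles.py; the point is only that each function does one job and a short wrapper composes them.

```python
import re
from html import unescape


def strip_tags(raw_html):
    """Replace HTML tags with spaces, leaving only the visible text."""
    return re.sub(r"<[^>]+>", " ", raw_html)


def decode_entities(text):
    """Turn HTML entities such as &amp; back into characters."""
    return unescape(text)


def collapse_whitespace(text):
    """Replace runs of whitespace with a single space and trim the ends."""
    return " ".join(text.split())


def clean_article(raw_html):
    """Compose the small steps; each one can be tested or swapped on its own."""
    return collapse_whitespace(decode_entities(strip_tags(raw_html)))
```

If helpers like these lived in nerFolhaArticles.py, data_cleaning.py could pull in only what it needs with from nerFolhaArticles import clean_article.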

The folder structure of the project looks like this:

├── code
│ ├── data_cleaning.py
│ ├── nerFolhaArticles.py
│ └── analysis.py
├── data
└── output

Within code, you can add directories like intermediate, figures, and tables if you need more functionality.

Naming: A Game Changer

I realized that what works for me is this: if I need to plot X, then having a plotting_X.py makes it easier for me to find the file later. (The output is named plotting_X.pdf or plotting_X.png.) Similarly, if I need to produce a table, the file should ideally have the name of the table it produces. So reduced_form.do produces a table called reduced_form.tex, ready to be imported into my paper.
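One way to keep script and output names in sync is to derive the output name from the script name instead of typing it twice. This is a minimal sketch, not part of my actual project; output_path is a hypothetical helper, and the default output folder simply mirrors the folder structure shown earlier.

```python
from pathlib import Path


def output_path(script_path, extension, output_dir="output"):
    """Build an output file path that shares the script's base name."""
    stem = Path(script_path).stem  # "plotting_X.py" -> "plotting_X"
    return Path(output_dir) / f"{stem}.{extension.lstrip('.')}"


if __name__ == "__main__":
    # Inside a real script you would pass __file__ instead of a literal name.
    print(output_path("code/plotting_X.py", "pdf"))
```

If the script is renamed later, the figure or table it writes is renamed with it on the next run.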

Every file and function has a purposeful, self-explanatory name. Care in naming saves hours of confusion.

Embrace Change, Even Late in the Game

If a file name no longer fits, don’t hesitate to change it. Yes, this means updating multiple references, but that’s where a version control system like Git becomes invaluable: it lets us embrace changes without losing our minds.

Change is a fundamental constant in coding, data analysis, and life. Have you ever named a file 01-data-cleaning.py, only to find its sequence in the project has shifted? Suddenly, what’s in 01-data-cleaning.py is not the first step, not the second, but the third. Or perhaps you've labeled a figure Figure1.pdf, but as insights evolve and the project grows, the figure ends up in the appendix. (Seriously, just recently I placed a Figure1.pdf in the appendix, and in a previous paper, at the editor’s suggestion, we took a figure from the appendix, modified it, and made it the main figure of the paper. It’s not unusual. Not at all.)

Personally, I don’t like the numerical file naming conventions. They suggest a fixed workflow in what is invariably a dynamic process. Anyone who’s wrangled with data knows that operations will shift, new data will come to light, and analyses will take unexpected detours.

That’s why, even with journal submission guidelines, I use descriptive over numerical file names, making sure that the project reflects its current trajectory, not just its conception. This way, I don’t need to engage in a game of perpetual renaming and instead focus on the work itself.

Closing Thoughts

Good code isn’t just about working code — it’s about sustainable code. Code that you, or someone else, can pick up six months down the line and not curse the day your paths crossed. Please let’s write code that makes future us grateful, not resentful.

Thank you for taking the time to read my thoughts. If you enjoyed the article, feel free to reach out at arieda.muco@gmail.com or on Twitter, LinkedIn, or Instagram. Feel free to also share this article with others.
