Code quality in academic research

Wesley Wei Qian
Published in Thoughts and Notes
Feb 13, 2016

As a computer science undergraduate, I have been trained to write maintainable, well-structured, and well-documented high-quality code, because that's the industry standard.

However, in academic research, writing code is about running experiments and proving a point. Therefore, code is sometimes written in an "ASAP" manner: no comments, poor structure, and probably lots of for loops if your formula has lots of summation symbols.

People seem to be OK with it. At the end of the day, academic research is more about the idea, and people won't use your code anyway.

But after some rather "frustrating" experiences, I have a different take on this.

Your idea will develop, and your code should be ready to adapt.

In research, we tend to write code in a streamlined fashion. If I want something, I just write the script, run it, and throw it away.

Therefore, I wrote many Python scripts with no functions, no modules, and sometimes no comments. I did get my experiments started very early in the project, but the technical debt ended up biting me.

One month into the experiment, I came up with another exciting idea and needed to extract different features from the original data. However, when I opened the scripts I had written before, I had no idea where to start, and there was no good place to code up my new features. In the end, I wrote a new script with lots of copy and paste. It was inefficient and frustrating.

In software engineering, one big reason we care about modularity and object-oriented programming is that they make it easy to add new features and refactor our code. I think the same idea should apply to academic research, except that a "new feature" in research is a "new idea".

Developing new ideas during research is inevitable (and that's why it's so fun), and having modular code that can easily accommodate a new idea will save you a lot of time.

This semester, we started a natural language processing project, and my first step was to convert Chinese sentences into Word2Vec data.

Instead of writing the code right away, I started by imagining a preprocessing pipeline with data and transformation components, and I ended up writing a rather elaborate one: a Sentence class where I can store all the features I want, functions that calculate/produce these features, and an output component that takes in a list of Sentence objects and gives me a file according to my configuration.

The script took my whole afternoon, but when I have a new feature idea in the future (as often happens in computational linguistics), I just need to add a field to my Sentence class, write a function to extract the feature, and plug it into my output component. This is definitely faster than writing a new script.
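To make this concrete, here is a minimal sketch of that kind of pipeline. The names (Sentence, add_token_count, write_output) and the toy feature are just illustrative placeholders, not the actual project code:

```python
from dataclasses import dataclass, field


@dataclass
class Sentence:
    # raw text plus a dictionary of computed features
    text: str
    features: dict = field(default_factory=dict)


def add_token_count(sentence):
    # example feature function: number of whitespace-separated tokens
    sentence.features["token_count"] = len(sentence.text.split())


def write_output(sentences, feature_names, path):
    # output component: dump the selected features, one sentence per line
    with open(path, "w", encoding="utf-8") as f:
        for s in sentences:
            values = [str(s.features.get(name, "")) for name in feature_names]
            f.write("\t".join(values) + "\n")


# adding a new feature later only requires a new function like add_token_count
sentences = [Sentence("this is a test"), Sentence("another sentence")]
for s in sentences:
    add_token_count(s)
write_output(sentences, ["token_count"], "features.tsv")
```

Adding a new feature later is then just one more field, one more function, and one more entry in the feature list.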

Plus, it makes me feel better…

While correctness often matters more than performance in research, poorly performing code can get in the way of correctness.

“Let’s make this work first.”

In research, we often care more about whether the model works than how long it takes to run, so more often than you can imagine, you will see something like this:

A snapshot of a nested-nested-nested-nested-nested for-loop in my Matlab code

And it cost me.

People familiar with Matlab know that nested for loops generally perform worse than matrix operations. But when your formula has four summation symbols and your data is laid out in a complex way, it takes a lot of engineering effort to convert such loops into matrix operations. When you are moving at a fast pace, you will probably be reluctant to make that effort.
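As a toy illustration of the gap (my actual code was in Matlab, but the same point holds in Python with NumPy; the sizes and the formula below are made up for the example):

```python
import time

import numpy as np

A = np.random.rand(500, 500)
B = np.random.rand(500, 500)

# loop version: compute sum over i, j of A[i, j] * B[j, i] with explicit nested loops
start = time.time()
total_loop = 0.0
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        total_loop += A[i, j] * B[j, i]
loop_time = time.time() - start

# vectorized version: the same double summation as one matrix expression
start = time.time()
total_vec = np.sum(A * B.T)
vec_time = time.time() - start

print(f"loop: {loop_time:.3f}s, vectorized: {vec_time:.3f}s")
print(np.isclose(total_loop, total_vec))  # same result, much faster
```

The two versions give the same number, but the vectorized one is typically orders of magnitude faster, and the difference only grows with more nesting and bigger data.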

Last summer, I started writing a spatial pattern recognition model based on a paper my advisor wrote, which we hoped to apply to protein domain matching. I finished the coding very quickly, and the model worked perfectly on relatively small samples. Since the model requires a lot of computational power with these kinds of nested for loops, we were not able to test it on larger samples. But the math made sense and the code was working, so we happily moved on to the next stage and decided to worry about performance later.

However, when we were later forced to improve the performance, we discovered problems we had never thought of: mistakes we could not have caught earlier, because the code was too slow to run on the samples that would have exposed them.

This is really a trade-off problem rather than a yes-or-no problem. In the end, all we want is to spend the shortest time getting good-enough performance and confirming our model. But how do we define "good enough"? I guess the answer can only come with experience.

Research is a community effort, so your code should be for the community

Recently, I took over a project from a graduating PhD student and got into a situation like this:

In his defense, he was under a lot of time pressure and he tried his best.

People often complain that academic research is done in the dark far too often. As a result, many brilliant and inspiring ideas fail to gain traction and end up in some dark corner of the internet, which is really, really sad.

Today, many open source projects thrive on GitHub and help many, many developers because they are maintained by many, many people. Because the documentation is good, people are happy to help.

On the other hand, if your code has no comments and no documentation, people will find it painful to work on and will probably start from scratch if they have to.

Research is a community effort, and people should not have to redo another's work just because there is no documentation. I mean, who wouldn't want to read documentation like this and avoid writing two months' worth of code:

The repo is currently private

Again, it makes me feel better …

Sometimes, we really should just go for that extra mile.

The following are some discussions I found on the web that are also worth reading if you have made it to the end of this.

Best-practice models for “research” code?

How can I write good “research code”?

Why do many talented scientists write horrible software?
