Lessons learned from contributing to open-source software

Stéphane Couvreur
Published in Feedzai Techblog
Nov 14, 2019

At Feedzai, we reserve four hours each week for learning and development. Following a suggestion in my yearly performance review, I decided to use this time to contribute to two open-source machine learning projects: scikit-learn, an ML library for Python, and Optuna, a hyperparameter tuning package.

Read on to discover what lessons I’ve learned.

Working distributed

The first project I worked on was scikit-learn. I had started to use it extensively, but I was relying on the API’s abstractions without understanding what was going on behind the scenes, so I wanted to learn more about the library by working on it.

I quickly understood that the most efficient way to contribute is not by implementing a complex new algorithm, but rather by:

  • fixing bugs,
  • deprecating/renaming parameters in favor of clearer ones,
  • finding better default values for common parameters,
  • writing documentation and making small enhancements.
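To make the parameter-renaming item in the list above more concrete, here is a minimal sketch of how an old parameter name can keep working while steering users toward a clearer one. The function and both parameter names (`resolve_n_estimators`, `n_trees`, `n_estimators`) are hypothetical and chosen only for illustration; this is not code taken from scikit-learn or Optuna, and each project has its own deprecation conventions.

```python
import warnings


def resolve_n_estimators(n_estimators=100, n_trees=None):
    """Return the effective number of trees.

    `n_trees` is a hypothetical old parameter name that is being
    renamed to the (also hypothetical) clearer `n_estimators`.
    """
    if n_trees is not None:
        # The old name still works for now, but the caller is warned.
        warnings.warn(
            "'n_trees' is deprecated; use 'n_estimators' instead.",
            FutureWarning,
            stacklevel=2,  # point the warning at the caller's line
        )
        return n_trees
    return n_estimators
```

Calling `resolve_n_estimators(n_trees=10)` still returns 10 but emits a `FutureWarning`, which gives users time to migrate before the old name is removed.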

Indeed, the value of any open-source project is determined by its ease of use: it needs to be accessible to a beginner user. Hence, and you can see this from the GitHub issue trends, contributors place a lot of emphasis on creating good docs and example code.

In addition, contributors with more than 5 pull requests are encouraged to review the code of newer contributors. Not only does this contribute to the project, but it also helps relieve some of the pressure on the core developers.

In scikit-learn, the core developers are a tight-knit group of approximately 40 people, mostly from academia, distributed all across the world. For instance, because the primary maintainer is based in Australia, I would usually try to submit changes before evening in Western Europe, so that they could be reviewed in the morning, Australian time.
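The time-difference arithmetic can be sketched in a few lines of Python. The offsets below are assumptions for a mid-November date (UTC+0 for Western Europe and UTC+11 for Sydney, a city picked only as an example of an Australian location), and the helper name `reviewer_local_time` is hypothetical. Fixed offsets ignore daylight saving changes, which the standard `zoneinfo` module (Python 3.9+) would handle.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed offsets for mid-November; they shift with daylight saving.
WESTERN_EUROPE = timezone(timedelta(hours=0), "WET")
SYDNEY = timezone(timedelta(hours=11), "AEDT")


def reviewer_local_time(submitted_at, reviewer_tz):
    """Convert a timezone-aware submission time to the reviewer's clock."""
    return submitted_at.astimezone(reviewer_tz)


# A change submitted at 18:00 in Western Europe...
submitted = datetime(2019, 11, 14, 18, 0, tzinfo=WESTERN_EUROPE)
# ...arrives early the next morning for a reviewer in Sydney.
print(reviewer_local_time(submitted, SYDNEY).isoformat())
# -> 2019-11-15T05:00:00+11:00
```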

In Optuna, although the core development team is smaller and comes from a single company, Preferred Networks, based in Japan, the same principle of being aware of time differences applies. However, because the project is smaller and has far fewer issues in the backlog, I found it much faster to communicate with reviewers and get feedback.

Some key lessons I’ve learned are to always:

  • start off on a small project with a well-contained contribution (check issues labelled as “good first issue” for instance);
  • finish your code review feedback before other people start their days — a recommendation in Google’s own engineering practices;
  • be polite, encouraging and responsive in code reviews.

Writing good code

My background was originally in mechanical engineering, so software engineering was a departure from my training. The first programming language I learned was Fortran 95, a language so outdated that my university professors had learned it in their day by writing instructions on punch cards.

The first large coding project I was assigned at university didn’t use version control — the filenames just looked something like v1, v23, v35_final, v124_the_final_one_for_sure. I didn’t write tests, and I would regularly duplicate functions thinking the code would execute faster.

I later learned C++ and Python, but I was never focused on writing easily readable, structured, well-thought-out programs. I just wanted the code to work, and I was fine with writing code that only I could understand. Elegant programming was a foreign concept to me.

Open source, however, requires contributors to hold themselves to a much higher standard. Your code is read by thousands of people. Reviewers are often short on time, so it’s best to make their lives easy, or your code risks going out the window. This can make it a little daunting for first-time contributors, especially if their background is not in computer science.

This is why it is particularly important to read the project’s contributing guidelines and code style before getting to work. Again, contributing to a project with a smaller team may help you get feedback faster and deliver a quality pull request.

Staying concise

There are many new algorithms and techniques emerging from the machine learning research space. With platforms such as arXiv publishing more than 50 machine learning articles per day, it would be impossible to implement every innovation in scikit-learn and still stick to the mission statement of being user-friendly. Many of the open enhancement issues in scikit-learn propose new functionality. Interestingly, the core team often chooses not to implement these, as they could either:

  • increase maintenance load on the core developers, who have limited bandwidth; or
  • increase complexity for API users.

I learned that design in software engineering is sometimes about choosing not to implement a feature, and there can be a wide variety of reasons for that.

Feedzai ❤️ Open source

Writing open-source code has helped me become a better data scientist. Not only has it given me a deeper understanding of the code I use on a daily basis, but it’s also provided me with the training in software engineering I missed at university.

Feedzai has many other people from the customer success, product and research teams working on open-source projects such as MMLSpark, LightGBM, and more.

So if you are thinking about joining Feedzai as a data scientist, know that you will have opportunities to contribute to widely used codebases, whether you are in customer success, product or research.
