Planning a beginner open source sprint day for data scientists
I’ve always been an open source fan-girl. Besides the increased security and trustworthiness that come along by design with an open project, crowd-sourcing code from a group of experts who are so passionate about contributing that they volunteer their time lends itself to a grander sense of community. Working with other like-minded individuals together towards a common goal for good gives me warm and fuzzy feelings.
However, even though I proudly used Ubuntu for years and always championed open source tools wherever possible, as a scientist and not a software engineer by training, the idea of being a contributor myself always seemed daunting. That feeling compounded over the last 5 years or so since I moved away from writing code on a regular basis for work and into more management roles.
I meet regularly with an informal data science study group of friends, mostly former colleagues, who are all at roughly the same point in their careers, though with varying backgrounds and skill sets. Getting together to discuss data science topics helps us keep our skills up in a supportive environment. So far, everyone who has participated in the study group happens to be female (even if we’re not intentionally trying to be a closed club!).
Sometime in late January, just when I started at Mozilla, a colleague of mine shared a link about a cool partnership between Andreas Mueller, a core contributor to scikit-learn, and the NYC chapter of Women in Machine Learning and Data Science (WiMLDS).
After an article came out in 2016 reporting that only 2% of contributors to open source Python libraries on GitHub were women, Mueller approached WiMLDS to set up an open source “sprint day”, where participants worked on a scikit-learn issue on GitHub labeled “easy” or “good first issue”. Most of the participants were brand new to open source.
I casually shared the link with the group, thinking “hey, it might be fun to do something like this” and immediately got a positive response. Within a day, we had picked a date, location, schedule, and even who was bringing which snacks and goodies. I volunteered a veggie quiche.
Maybe unsurprisingly, a few days later when the reality sank in about what we were attempting to do, there was some apprehension about whether or not the event would be successful (spoiler alert, it was). What if we couldn’t find an issue to work on? (it was fine). What if we didn’t get to a pull request by the end of the day? (totally fine). What if we did, but the person responsible for approving it thought our contribution was meaningless or dumb and ignored the PR? (everyone was completely lovely and welcoming).
Getting over that fear, jumping right in and having fun while doing it is what this post is all about. This is a great exercise for a data science team already working together, a group of friends who are interested in developing their skills, or even as a means of kicking off a study group. If you want to get into the open source community but are struggling to figure out where to dip your toe in the water, then read on.
Besides figuring out who can host (a friend’s place or a cafe with good internet will do) and a date that works for everyone, there wasn’t really too much prep work to do.
A week or so before the event, I spent a couple of hours one evening searching through various projects for suitable issues. Many projects on GitHub tag issues with labels like “easy” or “good first issue”, which makes it simple to find ideas for things to work on. The downside is that many of the labeled issues have already been claimed in the comments by other new contributors, and it isn’t always clear, sometimes months later, whether they are still working on them. We were originally going to try to claim issues ahead of time, but in the end, everyone chose what they wanted to work on the day of the sprint.
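For anyone who would rather script that search than click through labels, here is a minimal Python sketch. It assumes GitHub's public REST search endpoint (`/search/issues`) and the search qualifiers `repo:`, `is:issue`, `state:`, `no:assignee` and `label:`. The helper names `build_search_url` and `fetch_issue_titles` and the example repository are mine, purely for illustration, and label names vary from project to project. An unassigned issue may still have been claimed in the comments, so the thread needs a read either way.

```python
import json
import urllib.parse
import urllib.request

SEARCH_ENDPOINT = "https://api.github.com/search/issues"


def build_search_url(repo, label="good first issue", per_page=20):
    """Build a search URL for open, unassigned issues carrying `label`."""
    query = f'repo:{repo} is:issue state:open no:assignee label:"{label}"'
    params = urllib.parse.urlencode({"q": query, "per_page": per_page})
    return f"{SEARCH_ENDPOINT}?{params}"


def fetch_issue_titles(url):
    """Fetch the search results and return (number, title) pairs.

    Needs network access; unauthenticated requests are rate limited.
    """
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    return [(item["number"], item["title"]) for item in payload["items"]]


if __name__ == "__main__":
    # Only prints the URL; call fetch_issue_titles(url) yourself when online.
    print(build_search_url("pandas-dev/pandas"))
```

Passing the printed URL to `fetch_issue_titles` should give a short list of candidate issues to read through before the sprint.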
I also contacted some friends who are seasoned contributors to get their advice and support. Most of that advice is incorporated in this post, but it was also nice to get that early encouragement.
It’s a good idea to set clear expectations about the goals of the event ahead of time with the participants. The day is going to be about setting up your environment and cloning and compiling the repository you choose, not about writing some novel code that fundamentally changes the project. To ensure that there is a good attendance rate, it’s best to make it as informal and inviting as possible.
In our group’s Slack team, we created a special channel for the day and invited everyone to comment leading up to the event. It was also great for the day of, when people could share links and resources they found as well as document progress throughout the day.
I do have to credit my friend Alyssa Fu-Ward for pushing me to do more planning than my lazy self was initially intending via various planning docs and persistent pinging. Having a couple people put the event together rather than feeling the burden of everyone’s success on your own shoulders definitely made the whole thing go more smoothly. Plus, co-organizing is fun!
Sprint day — February 18, 2019
We all showed up around 10:30 am, caught up and chatted over bites and coffee, and sat down with our laptops around 11 am.
To warm up and give everyone an early success, we all did the first-contributions tutorial. It was simple to do, and made sure that everyone had their GitHub accounts set up. There was definitely a positive energy in the room after everyone submitted their pull requests, which were merged automatically.
“I have a fairly new machine so setting up my machine to communicate with GitHub was a great exercise to get started. It was like riding a bike after not doing it for a while. The other thing I learned was that package managers get better and better as the years go by! I remember when Macports was the only thing available. I had to set up Homebrew and pip and pipenv to even be ready to get started.”
For most of the people participating, the next couple of hours were about finding the right project and issue to work on. This was definitely the most difficult part of the day. It took a lot of time to understand enough context for each issue before being able to choose it, even for the ones I had prepared ahead of time. Additionally, a lot of the threads for beginner issues on GitHub ended in discussion about whether or not the issue was actually an issue.
“First I tried to do an R repo but there didn’t seem to be many issues. So then I looked at statsmodels but none of the issues spoke to me. So then I looked at pandas. There were more issues to pick from that were related to documentation and good for beginners. I also feel most comfortable with pandas and so thought it would be a good repo to dig into.”
I chose to work on a documentation issue related to telemetry in Firefox, available thanks to a colleague at Mozilla. Some people started off understanding the code related to an issue in one repository before deciding to switch to another. In the end, three people worked on pandas, one on matplotlib, one on scikit-learn, and I worked on Firefox at Mozilla.
“I poked around on numpy and scikit-learn before settling on pandas — their ‘how to contribute’ docs were generally helpful and friendly”
Then it was all about cloning the repository and compiling. I started on the instructions to build Firefox at around 11:30 am, and “build complete” happened at 2:25 pm, when my laptop fan finally calmed down. Changing a line in the “event” ping documentation took a couple minutes, and then it was time to commit the change.
As an added wrinkle, Firefox development runs on Mercurial and Phabricator rather than Git and GitHub, so I needed to get Mercurial set up, then set up my Phabricator account and learn how to use it before I could finally commit the change at 3:30 pm.
“Be bold! Not sure if you have a working C compiler and can’t quite understand the instructions? Just try building and see if it works! (It did!) It’s pretty hard to seriously screw up your environment — it’s why we have local copies that we can nuke and start over if things go wrong. Things are always much more fun with support from a group!”
Two of us managed to submit PRs on the day of the sprint, which were reviewed and accepted within a day or two, and made for a pretty cool moment for each of us.
“AHHH MY CHABGED WAS MEGED DGEKEMD [sic]”
— Alyssa, seconds after her PR was accepted.
Even for a small first change, the reviewers were extremely kind and welcoming.
Everyone felt that they had learned a lot, and those who were still working on their chosen issues at the end of the sprint day still said they had a really rewarding experience. We plan to get together again in a month or two to continue working on our respective projects.
“Pull-requests can be easy and I started out knowing nothing, now I know a little and can push myself forward to learn a lot. I feel much more prepared to contribute. Start simple. Give yourself some credit. Eat cookies. Follow up.” — Katie
- Find a group of people who you can support and who support you! Also, pairing up on a project with someone else is a great way to work through the setup together.
- Find a project that you’ll likely want to contribute to in the future, rather than thinking of this as a one-off task. One colleague said that as a mentor, he likes to make sure there is some place for the new contributor to go next. As an example, I think there is an exciting opportunity for data scientists to add to packages that are available in Python for statistical analysis (see for example the list of enhancements wanted in scikit-learn). There’s an incentive for the folks maintaining the repository to make it easy for a new contributor to agree to work on more stuff, so choosing a project you want to stick with beats picking a random issue in isolation.
- Before you start cloning a repository, find out if there are any “how to contribute” docs/resources for the particular project you want to contribute to. For example, Mozilla has this great doc on how to get started, and then also a step-by-step guide on how to submit a patch.
- Your first issue should be very, very simple. Various advisers told me that a good “good first bug” poses zero challenge in the bug itself. We found that the biggest challenge, and where most of our time was spent, was getting our environment set up, as there are often a lot of packages that need to be installed or compiled before even opening a text editor to address the actual bug. For a first issue, documentation edits are great, which in the end is what all of us went for.
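Since environment setup ate most of our time, a quick pre-flight check before the sprint can save some surprises on the day. The sketch below uses Python's standard `shutil.which` to see which command-line tools are on your PATH. The tool list is only an assumption based on what came up for us (Git, Mercurial, a C compiler, pip), and `missing_tools` is a hypothetical helper name. Each project's “how to contribute” docs remain the real source of truth for what you need.

```python
import shutil

# Assumed tool list; adjust it to the project's "how to contribute" docs.
TOOLS = ["git", "hg", "cc", "pip"]


def missing_tools(tools, which=shutil.which):
    """Return the tools from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(TOOLS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All listed tools were found.")
```

Finding a tool on the PATH does not guarantee the right version, so treat an all-clear as a first sanity check rather than proof that the build will work.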
Thanks to tdsmith, my colleague who also contributes to Homebrew, for sharing the link to the WiMLDS Open Source Sprint Day in our internal data science Slack team channel and giving me some of my first tickets at Mozilla ;)
Thanks to my Mozilla colleague chutten whose documentation bug I picked up, and who told me about the advice he gives as a mentor to new contributors.
And especially thanks to alyssafuward, wsoofi, ehines623, kcamrine, and aggiezx for the great day, the moral support, the snacks, coffee, donuts, etc and for throwing themselves headlong into this open source journey :)
I highly recommend these two relevant episodes of the awesome podcast Linear Digressions for data scientists who want to get into open source:
- Happy Hacktoberfest (an intro to open source)
- Open Source Software for Data Scientists (with Tim Head of scikit-optimize)
Getting into contributing:
- First timers only — it’s how I found the first-contributions repository
- Open source guide on how to contribute
- How to start contributing to Open Source from developer.com
A couple places to look for projects to work on