Sprinting towards a more equitable future: The 2019 NYC WiMLDS Scikit-learn Sprint

Kelly Carmody
Sep 4, 2019

I woke up bright and early, eager to start the day, and made my way down to the Microsoft building at 11 Times Square, on bustling, chaotic 42nd St. It was Saturday, August 24th, and I was there both to volunteer at and to participate in the third Women in Machine Learning and Data Science (WiMLDS) scikit-learn sprint in New York City. Microsoft had very generously offered us a space to work in for the day. The main goal of the sprints is to get more women and gender minorities contributing code to scikit-learn, an open source software library for machine learning.

Open source software, for anybody who doesn’t know, is software that everyone can modify and contribute to, as opposed to proprietary software, where only the original authors can contribute. As I explained it to some non-techie friends, it is kind of like Wikipedia, but for code. Reshama, the organizer of the sprint on the WiMLDS side, had a list of statistics on her GitHub to illustrate how underrepresented women are among contributors.

Over 50% of the U.S. population is made up of women. That percentage dwindles to 20% in the tech sector. Only 12% of those working in machine learning are women. Just 2% of Python open source contributions come from women, and fewer than 1% of those contributing code to the scikit-learn library specifically are women.

Our cheerful greeters, giving us our keys to the building

When I first walked into the building, I was greeted by Carissa and Eszter, the volunteers in the front lobby, and given my access tag. Heading upstairs, I began my day at the sprint by setting up a little booth off to the side, with sign-in sheets and name tags to distribute to all of the participants.

WiMLDS exists to promote the mission of supporting women and gender minorities in machine learning and data science, but welcomes all. Men are allowed to come to every event as long as they are supportive of the cause and mindful and respectful of the fact that we are creating safe spaces where women are encouraged to speak first, which so rarely happens in the world at large, and especially in tech.

The participants slowly began to trickle in, and I had them sign their names on the sheet and fill out name tags with their name and professional affiliation. Prithvi, an organizer for WiMLDS and another helper for the day, assisted in manning (or womanning, should I say, especially at an event like this? Or is it better to throw out the gendered speak and only use gender neutral terminology? Working, perhaps?) the check-in booth with me.

At some point during all of this, Reshama ran off to go meet the FreshDirect people, and brought us back a bounty of clementines, bananas, and yogurt to fuel us for the long day ahead. Now I was able to direct the steadily increasing stream of Pythonistas to all of the fruit and dairy available at the back, plus non-dairy yogurt options (almond milk).

Delicious bagels and fruit

To bolster the supplies of fruit and yogurt, we received bagels courtesy of Bloomberg, our other sponsor for the day along with Microsoft. Apparently, the bagel delivery person had gotten a bit lost and had been wandering around the Microsoft building looking for us for a while.

Finally, they arrived, with a couple dozen fresh, beautiful bagels of many varieties. There were onion, garlic, everything, poppyseed, some sort of sweet concoction, multiple cream cheeses, and even butter to boot. Along with the other helpers, I assisted Noemi and Prithvi in getting the bagels all nicely arranged, and in scouring the Microsoft kitchen for appropriate bagel platters.

By this time, the scikit-learn core contributors had arrived: Andreas Mueller, Thomas Fan, and Nicolas Hug. They are all based at the Data Science Institute at Columbia University, and develop and maintain the scikit-learn library. I learned later that day that most of the scikit-learn team is actually based in France, but the aforementioned group is the main group of contributors in the U.S., and the rest of the contributors are scattered throughout the globe.

Andy went up to the front of the room to address the sprint audience and give us a crash course on contributing to open source projects. He very helpfully gave instructions on pull requests on GitHub, open source contributions in general, contributions to scikit-learn in particular, the testing process, and the various open issues available for us to try to fix.

We learned how to set up the scikit-learn development environment, fork and clone the scikit-learn repo, and run regression tests. After the crash course, everyone broke for bagels and coffee.

Andreas teaching us a crash course in open source

Now it was time to settle down to work, and I switched from the role of helper to that of participant. We all broke up into programming pairs and began the process. I ended up running into quite a few installation issues with the scikit-learn development packages, mostly related to the virtual machine setup on my computer, which took up the remainder of the morning. Thomas Fan very graciously helped me through it, and we eventually got it figured out.

By then it was time for lunch, which was Domino’s pizza, served up with brownies for dessert (yes, Domino’s does make brownies, for those who were wondering, and surprisingly scrumptious ones). There were even dairy-free and gluten-free pizza options!

Pizza really does seem to go hand in hand with programming; it is the de facto meal at so many of the coding events that I attend. I read a joke on the internet that goes “Definition, Programmer: An organism that turns caffeine and pizza into software”. Personally, I think the centrality of pizza to developer culture is quite interesting from a sociological point of view. People helped themselves to a couple of slices, then headed back inside to get back to work.

All the pizzas and brownies

After lunch, I returned to my workspace. We had been instructed to work on the simplest issues possible, ideally labelled with “easy”, “good first issue” and/or “attention requested”. This was to get a sense of the general workflow before attempting to take on more complicated issues. The issue that I chose was inspired by what my partner, Anuja Kelkar, was working on. Anuja is an awesome and super competent data scientist who works as a manager at a company developing software for clinical trials. This was also her first time doing any sort of open source contribution.

We dipped our toes into open source by trying to make the documentation a bit more orderly, transforming it from big chunks of text, images, and code into more manageable paragraphs interspersed with individual images and smaller code blocks, with subtitle headings where applicable.

The example I chose had to do with anomaly detection algorithms for outlier detection. I knew that while I was re-ordering the documentation, I would absorb some of the other information on the page, so it would be wise to choose documentation with material I wanted to learn. I had been preparing for the sprint for the previous week.
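For readers who have not met the topic, here is a minimal sketch of what outlier detection with scikit-learn can look like. It is not the example from the documentation page I was editing: the choice of `IsolationForest`, the synthetic data, the `contamination` value, and all of the variable names are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical toy data: a tight cluster of "normal" points plus scattered outliers.
rng = np.random.RandomState(42)
inliers = 0.3 * rng.randn(200, 2)
outliers = rng.uniform(low=-4, high=4, size=(20, 2))
X = np.vstack([inliers, outliers])

# contamination is the assumed share of outliers in the data (a guess, not a known value).
detector = IsolationForest(contamination=0.1, random_state=42)

# fit_predict labels each sample: +1 for inliers, -1 for outliers.
labels = detector.fit_predict(X)
n_flagged = int((labels == -1).sum())
print(f"Flagged {n_flagged} of {len(X)} points as outliers")
```

Other scikit-learn estimators, such as `LocalOutlierFactor` or `OneClassSVM`, expose a similar `fit_predict` interface, so swapping the detector is mostly a one-line change.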

Not having had experience with the scikit-learn library before, I knew that this sprint would be an excellent opportunity to learn it, and it was; the sprint pushed me, and forced me to grow. It provided motivation to learn the basics of scikit-learn, and to practice using it before the sprint in preparation and, of course, during and after it. These are some of the technical highlights I learned about during the sprint:

● Importance of rigorously following PEP 8 formatting conventions for passing tests

● Sphinx, for creating HTML files from Python files

● Upstream git repos

● CircleCI platform for continuous integration with version control

● Vagrantbox synced folders between host and virtual machines

● How to separate images in matplotlib
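To make the last highlight concrete, here is a small sketch of separating images in matplotlib: the same plots drawn first as subplots of one combined figure, then as one independent figure each. The data, titles, and variable names are hypothetical and are not taken from the scikit-learn example I worked on.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical curves standing in for the plots of a documentation example.
x = np.linspace(0, 2 * np.pi, 100)
curves = {
    "sine": np.sin(x),
    "cosine": np.cos(x),
    "damped sine": np.exp(-x / 3) * np.sin(x),
}

# Before: a single figure that holds every plot as a subplot.
combined, axes = plt.subplots(1, len(curves), figsize=(12, 3))
for ax, (name, y) in zip(axes, curves.items()):
    ax.plot(x, y)
    ax.set_title(name)

# After: one independent figure per plot, so each image can sit next to its own text.
separate_figures = []
for name, y in curves.items():
    fig, ax = plt.subplots(figsize=(4, 3))
    ax.plot(x, y)
    ax.set_title(name)
    separate_figures.append(fig)
```

Each entry in `separate_figures` can then be shown or saved on its own, for instance with `fig.savefig(...)`.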

Us hard at work on our issues

Soon, the pull requests started coming in. Andy, Nicolas, and Thomas were in high demand, going back and forth between every corner of the room to check on the requests, and answer questions. Shout out to Bloomberg! In addition to sending us bagels, Bloomberg also sent us live human tutors to assist us if the contributors were occupied. The energy in the room soon seemed to settle down a bit. The first of the participants began to leave, after they had either gotten in a successful pull request, or felt that they had put in enough effort for the day.

Anuja got her pull request in and showed me the suite of tests she was running. I had never run into CircleCI before, and she explained to me that it is a continuous integration platform, used to integrate code into the shared repository several times a day.

I did not get my pull request in during the sprint, but continued working on it after the sprint was over. It turned out that it was not one of the examples that was meant to be changed, so I had to choose a new example to work on later. I ended up going in and refactoring a lot of code, unravelling the guts of loops to pull images apart.

I thought that it seemed pretty intensive for what was supposed to be an easy documentation fix, but it was my first time doing any work on scikit-learn or open source, so I just kind of assumed that it was all hard and intensive. I think the take-home point, so others can learn something from my mistake, is that it would have been better to open a Work in Progress [WIP] Pull Request sooner rather than later, so I could have gotten feedback earlier and avoided putting so much work in.

However, I don’t have any regrets; it was a great learning process. Even if the pull request won’t be used, I improved my refactoring abilities and learned more about manipulating figures in matplotlib and about anomaly detection algorithms, and now all of the other documentation fixes will seem easy by comparison. After this, I ended up working on a documentation restructuring that did indeed seem like a cakewalk.

Towards the end of the day, when things were starting to wind down, Reshama reminded us to support open source by donating to NumFOCUS. As a surprise, those who did received a signed copy of Andreas Mueller’s book, “Introduction to Machine Learning with Python”, published by O’Reilly.

NumFOCUS is a nonprofit organization that supports open source scientific computing and software. We began forming an orderly line to have Andy sign our copies of the book. He was very gracious to us, taking the time to write an individual message for every person. After the excitement of the book signing was over, that was pretty much a wrap for the day. We finished with our issues, cleaned up, and said our goodbyes.

Me with Andreas, looking happy and holding a book that he has freshly signed

Throughout the course of the sprint, I got the chance to meet some of our partners at Microsoft and O’Reilly, major core contributors of the scikit-learn package, and a variety of amazing, inspiring female data scientists, doing really impressive things in the world. We traded information and advice, laughed and joked around, and I learned so much from them all. It was a really rejuvenating, supportive environment to be in. There are always many side benefits to an event like a sprint, in addition to the original intended purpose. The day was definitely a success.

I would like to thank everyone who made this event happen through all of their hard work: the organizers, Reshama Shaikh on the WiMLDS side, Nitya Narasimhan with Microsoft, and Andreas Mueller as a core contributor; the scikit-learn core contributors Thomas Fan and Nicolas Hug; and the helpers, Noemi Derzsy, Carissa Shafto, Prithvi Gandhi, and Eszter Schoell. A special thank you to Thomas Fan for helping me with all of my installation and Vagrantbox issues, and to my co-worker Andrea Molina for helping me with edits and with crafting a title!

All of the Organizers and Core Contributors

Additional Links:

https://github.com/WiMLDS/nyc-2019-scikit-sprint

Kelly Carmody

Data scientist with a background in Neuroscience, Epidemiology, and Sociology.