Over three days in early February, core developers and pandas enthusiasts (of the python library, not the furry creatures) met in London for the first-ever European Pandas Summit to discuss development direction and resolve technical issues through code sprints. As one of the panellists during the corporate uses forum, I had the opportunity to share how companies such as dunnhumby use open-source tools, what frustrations data scientists face, and how companies can contribute back to the community. Over the course of preparing for and attending the summit, I learned a lot about the open-source ecosystem. Here are my five key takeaways:
1. Data science would be a lot harder without pandas
Python-flavoured data scientists rely heavily on the pandas library for our day-to-day work, yet most of us don’t know who maintains and improves not only the library but also the associated documentation and API guides. Sure, there are minor inconveniences and issues we sometimes never get to the bottom of (SettingWithCopyWarning, anyone?), but we eventually figure it out and discover it was most likely user error. And we silently salute the contributors and developers for saving us, again.
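For the curious, the classic trigger for that warning is chained indexing. Here is a minimal sketch (the frame and column names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Chained indexing: select first, then assign on the intermediate result.
# pandas may emit SettingWithCopyWarning here, because the assignment can
# silently land on a copy rather than on df itself.
subset = df[df["a"] > 1]
subset["b"] = 0  # may warn; df is not guaranteed to change

# The usual fix: a single .loc call on the original frame...
df.loc[df["a"] > 1, "b"] = 0

# ...or an explicit .copy() when you genuinely want an independent subset.
independent = df[df["a"] > 1].copy()
independent["b"] = 99  # clearly a copy, no warning
```

The warning exists precisely because the chained form is ambiguous: pandas cannot always tell whether you meant to modify the original frame or a throwaway copy.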
2. Many ways of contributing back to pandas
Taking our gratitude to the next level, we all know actions speak louder than words. Even if we are not elite programmers who can fix the inconsistencies behind SettingWithCopyWarning, we can still contribute to open-source projects by other means. For example, individuals and companies can back open-source projects financially, as dunnhumby did last year by sponsoring the PyData London conference, whose proceeds go to the non-profit organisation NumFOCUS to support the sustainable development of projects including pandas and Jupyter.
And surely, after all that googling and stackoverflowing, the least we could do is commit our knowledge and share it with the world by contributing back to the documentation and user guides, right?
3. Learn the workflow for open source contribution
Data scientists are great at hacking together prototypes, experienced in maths and science, and steeped in business context, but one of the barriers to contributing to open-source projects is, well, how do you technically do so?
In one afternoon of sprinting, I learned the process isn’t as daunting as I had feared. You fork the project on GitHub, clone your fork, check out a new branch, fix the things you want to fix, make sure you stage the files before committing, push the changes to your own fork, and go on GitHub to submit a pull request. After a few iterations, which is perfectly normal, a pull request approved by two core developers will be merged and released in the next stable version.
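The steps above boil down to a handful of git commands. This is a generic sketch, not the official pandas contributing guide: the branch and commit names are placeholders, and the GitHub-side steps (forking, pushing, opening the pull request) are shown as comments because they need an account and network access.

```shell
set -e
cd "$(mktemp -d)"

# 1. Fork the project on GitHub (in the browser), then clone your fork:
#    git clone https://github.com/<your-username>/pandas.git
# Here we stand up a tiny local repo instead, so the rest can run anywhere.
git init -q demo && cd demo
git config user.email "you@example.com"
git config user.name "Your Name"
echo "a placeholder file" > notes.txt
git add notes.txt && git commit -qm "Initial commit"

# 2. Check out a new branch for your fix:
git checkout -q -b fix-docstring

# 3. Make your change, stage the files, and commit:
echo "clarify an example" >> notes.txt
git add notes.txt
git commit -qm "DOC: clarify example"

# 4. Push the branch to your fork and open a pull request on GitHub:
#    git push origin fix-docstring
git log --oneline
```

Each review iteration is just another round of step 3 followed by another push; the open pull request picks up the new commits automatically.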
If you have not used git extensively, that may sound like a load of gibberish, but this is exactly why I would encourage all data scientists to attend a python sprint to learn from other contributors. And don’t be fooled by the succinct messages you see on pull requests; core developers are some of the friendliest people I have met, and they seem to possess saintly patience reserved for eager newbies at code sprints.
4. Core developers are humans, even if they possess rock star qualities
The pandas team can receive up to 10 pull requests a day, so if you have found a bug, need a fix, or want your pull request merged, and you haven’t heard from them in the 12 minutes (hours, or even days) since you hit submit, chill out and remember: core developers are humans. In fact, the majority of them are unpaid volunteers. Be nice to them, and they will help where they can.
Fun fact: there is one “silent” core developer who tirelessly works on the library, yet none of the others have met this individual, in person or on a teleconference. The enigma certainly adds to the mysterious aura of pandas rock stars.
5. The future of pandas in data science
The three core developers at the summit expressed the ambition of releasing pandas 1.0 sometime this year, and there is already discussion and debate on what pandas 2.0 could look like. For example, should it handle distributed computation, or should that be left to growing technologies like Spark? Should pandas collaborate with newcomers like vaex to handle lazy evaluation of very large datasets? Should pandas be responsible for incorporating new data types, or should other developers create custom dtypes through Extension Arrays, as GeoPandas has done for points and polygons?
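For a small taste of what extension dtypes enable, consider the nullable integer dtype that ships with pandas itself and is built on the same ExtensionArray machinery that third-party libraries like GeoPandas use:

```python
import pandas as pd

# The nullable "Int64" dtype is implemented as an ExtensionArray inside
# pandas: missing values can live in an integer column without forcing
# the whole column to float, as the default NumPy-backed int64 would.
s = pd.Series([1, 2, None], dtype="Int64")
print(s.dtype)             # Int64
print(s.isna().tolist())   # [False, False, True]
```

Third-party dtypes plug into the same interface, which is exactly the design question the summit debated: how much should live in pandas itself versus in the ecosystem around it.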
There are no clear answers, and the roadmaps, if they exist at all, live in the minds of the core developers, each with their own technical wish list and priorities. A sure-fire way of seeing a new feature is to build it yourself, whether as a new library on top of an existing framework like pandas, or within the pandas codebase itself.
That may sound impossible at first, but then I learned that the three core developers I met weren’t programmers by trade. They wanted pandas to do more than it was doing, and they were tenacious enough to build what they needed (benefiting many of us as a result). I hope that proves inspirational for data scientists to give back to pandas, and to the other open-source libraries we benefit so much from.